SuperDeduper: How it finds and eliminates duplicate references

SuperDeduper, the deduplication module of the Laser AI tool, is designed to help you efficiently manage your reference lists by identifying and removing duplicate entries. This article provides a technical explanation of how the deduplication module works. If you want to know how to perform the deduplication process in Laser AI, see the Deduplication process article linked below.


The core of our deduplication process is a rule-based algorithm that compares various details (metadata) of each reference from the uploaded RIS file. The compared metadata fields include the following (illustrated in the sketch after the list):

  • Title

  • DOI (Digital Object Identifier)

  • Authors

  • Journal

  • Pages

  • Volume

  • Abstract


This approach is transparent and reliable - a crucial feature in healthcare where precision is non-negotiable.
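
For illustration, the record below shows how such a reference and its compared fields might be represented in code. The field names and structure are a hypothetical sketch for this article, not SuperDeduper's actual internal data model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Reference:
    """Hypothetical reference record holding the metadata fields compared above."""
    title: Optional[str] = None
    doi: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    journal: Optional[str] = None
    pages: Optional[str] = None
    volume: Optional[str] = None
    abstract: Optional[str] = None
```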


How the Deduplication Process Works


Our deduplication module follows a step-by-step methodology to identify potential duplicate references:


1. Data Preparation - first, the module prepares the data for comparison. It normalizes the text so that equivalent values are written in a consistent form. 

    Example: the module standardizes author names and ensures all text is in the same case. This step allows the system to recognize that "John Smith" and "J. Smith" are the same person and that "The Lancet" and "Lancet" refer to the same journal. A simplified sketch of this step follows.
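
The functions below are a minimal sketch of such normalization, assuming simple case-folding, whitespace cleanup, and reduction of author names to a comparable form. The exact rules SuperDeduper applies are internal; this only illustrates the general idea.

```python
import re

def normalize_text(value: str) -> str:
    """Lower-case, trim, collapse whitespace, and drop a leading article,
    so that "The Lancet" and "Lancet" compare as the same journal."""
    value = value.strip().lower()
    value = re.sub(r"\s+", " ", value)
    value = re.sub(r"^(the|a|an)\s+", "", value)
    return value

def normalize_author(name: str) -> str:
    """Reduce an author name to 'surname initial' so that
    'John Smith' and 'J. Smith' compare as the same person."""
    parts = normalize_text(name).replace(".", "").split()
    if len(parts) < 2:
        return " ".join(parts)
    surname, first = parts[-1], parts[0]
    return f"{surname} {first[0]}"
```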


2. Grouping References - to work efficiently, the module first groups similar references together. It does this by sorting them based on key fields such as the title or author name (a simplified grouping sketch follows). 
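
The sketch below illustrates the general idea of such grouping (often called blocking): references that share a simple key derived from a field like the title end up in the same group, so only those are compared pairwise. The key used here is purely illustrative and not the module's actual logic.

```python
from collections import defaultdict

def group_references(references):
    """Group references by a simple blocking key (here, the first 20
    characters of the lower-cased title). Only references within the
    same group are later compared pairwise. Illustrative key choice only."""
    groups = defaultdict(list)
    for ref in references:
        key = (getattr(ref, "title", "") or "").strip().lower()[:20]
        groups[key].append(ref)
    return dict(groups)
```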


3. Calculating Similarity - within each group, the module compares pairs of references. It calculates a similarity score for each pair by looking at several metadata fields. 


The scores from comparing each of these fields are combined to produce a final, single score for the pair. If this final score is above a certain level (threshold), the pair is flagged as a potential duplicate. Importantly, the algorithm is sensitive not only to differences within a single field but also to the absence of data.
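
As a rough illustration of how per-field scores could be combined into one pairwise score and compared against a threshold, here is a simplified sketch. The field weights, the string-similarity measure, and the threshold value are all assumptions made for this example; they are not SuperDeduper's actual parameters. Note how a missing field contributes zero, reflecting the algorithm's sensitivity to absent data.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and threshold, for illustration only.
FIELD_WEIGHTS = {"title": 0.40, "doi": 0.20, "authors": 0.15,
                 "journal": 0.10, "pages": 0.05, "volume": 0.05, "abstract": 0.05}
THRESHOLD = 0.7

def _as_text(value) -> str:
    """Join author lists into a single string; otherwise cast to str."""
    return " ".join(value) if isinstance(value, (list, tuple)) else str(value)

def field_similarity(a, b) -> float:
    """0-1 similarity of two field values; missing data scores 0 rather
    than being skipped, so absent fields lower the final score."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, _as_text(a).lower(), _as_text(b).lower()).ratio()

def pair_score(ref_a, ref_b) -> float:
    """Weighted combination of per-field similarities into a single score."""
    return sum(w * field_similarity(getattr(ref_a, f, None), getattr(ref_b, f, None))
               for f, w in FIELD_WEIGHTS.items())

def is_potential_duplicate(ref_a, ref_b) -> bool:
    return pair_score(ref_a, ref_b) >= THRESHOLD
```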


Confidence Levels: Your Guide to Review

Once potential duplicates are identified, SuperDeduper goes a step further by assigning a confidence level to each group of duplicates. This helps you decide whether a group of duplicates needs manual review or can be accepted automatically: instead of reviewing every potential match with the same level of scrutiny, you can focus your attention where it is most needed.


We have developed a four-level scale for assessing duplicate confidence (a simple score-to-level mapping is sketched after the list):

  • Very High Confidence (Score ≥ 0.9): These are matches the system is almost certain are duplicates. You can often approve these quickly.

  • High Confidence (Score ≥ 0.7): These are very likely duplicates that can still be approved in batch. 

  • Medium Confidence (Score ≥ 0.4): This is where you'll want to pay closer attention. 

  • Low and Mixed Confidence (Score ≥ 0.01): These are the most uncertain matches. It's recommended to carefully review these to avoid accidentally removing a valid reference. 
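
The mapping from a pairwise score to a confidence level can be pictured with the small sketch below. The thresholds are the ones listed above; the function itself is only an illustration, not the module's code. The "Mixed" part of the lowest bucket arises at group level, as the case example below explains.

```python
def confidence_level(score: float) -> str:
    """Map a pairwise similarity score to the confidence scale above."""
    if score >= 0.9:
        return "Very High"
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Medium"
    if score >= 0.01:
        return "Low"
    return "No match"
```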


Case example
Consider a group of three records: A, B, and C.
  • Record A is identified as the primary reference.
  • Record B is flagged as a duplicate of A with very high confidence.
  • Record C is flagged as a duplicate of A with low confidence.

Even though B is a strong match to A, the group also includes C, which has only a low-confidence match. Because the group as a whole must be treated consistently, it is assigned to the "Low and Mixed Confidence" bucket.

Key rule: A deduplication group is always categorized based on the lowest confidence level among its pairwise matches.
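
Under this rule, assigning the group-level bucket amounts to taking the weakest pairwise level, as in this small, illustrative sketch:

```python
# Pairwise confidence levels ordered from weakest to strongest.
LEVEL_ORDER = ["Low", "Medium", "High", "Very High"]

def group_bucket(pairwise_levels: list[str]) -> str:
    """Categorize a deduplication group by the lowest confidence level
    among its pairwise matches (the key rule above)."""
    weakest = min(pairwise_levels, key=LEVEL_ORDER.index)
    return "Low and Mixed Confidence" if weakest == "Low" else f"{weakest} Confidence"

# Case example: B matches A with very high confidence, C matches A with low
# confidence, so the whole group lands in the "Low and Mixed Confidence" bucket.
print(group_bucket(["Very High", "Low"]))  # -> "Low and Mixed Confidence"
```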



By providing these confidence levels, SuperDeduper empowers you to conduct your review process with greater efficiency and precision, ensuring the integrity of your research without getting bogged down by unnecessary work. For instance, you can focus more on "Low" or "Medium" confidence clusters, while potentially streamlining the review of "High" and "Very High" confidence matches.


SuperDeduper - is it a safe and effective way to remove duplicates?


The SuperDeduper module has been rigorously tested and validated using both prospective and retrospective approaches. The retrospective validation results were presented at the ISPOR 2024 conference. However, to ensure the highest possible quality of evidence and to test the changes made to the algorithm since the abstract was released, we also conducted a prospective validation using our own representative benchmark dataset. This benchmark includes a diverse sample of three systematic reviews that vary in size, cover different topics, and include records from a wide range of databases. To create our "gold standard" for comparison, two independent experts manually performed the deduplication on these datasets. 


Our testing on the prospective sample shows that SuperDeduper correctly identified the vast majority of duplicates previously found by human reviewers and, in some cases, even uncovered a few that had been missed. At the same time, it was highly accurate in avoiding errors: only 12 records across all datasets (over 40,000 records altogether) were incorrectly marked as duplicates (false positives). 


On average, SuperDeduper achieved an accuracy of 99.4% and a sensitivity of 98.2%, while its ability to correctly identify non-duplicates (specificity) was over 99.9%. 
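
For readers who want the definitions behind these figures, the metrics can be sketched as below. These are generic textbook definitions, not the validation code; the counts passed in would come from comparing the tool's output against the manually created gold standard.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all records classified correctly (duplicates and non-duplicates)."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp: int, fn: int) -> float:
    """Share of true duplicates that were correctly identified."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Share of non-duplicates that were correctly identified as unique."""
    return tn / (tn + fp)
```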


It is important to note that none of the records incorrectly flagged as duplicates were automatically removed by the tool. Instead, they were assigned a low confidence rating, ensuring that users retain full control and the final decision is always in the hands of the human reviewer. The complete methodology and results will be presented in a publication that is currently being prepared.




RELATED ARTICLES


1. Deduplication process


