Handling Duplicate and Near-Duplicate Records

Modified on Thu, 31 Jul at 3:21 PM

Need help with anything in this article or have other questions? Contact us at support@noticiasolutions.com

Large datasets collected from multiple custodians often contain a range of exact and near-duplicate documents. Exact duplicates can be the same email collected from multiple inboxes, or the same document stored across different drives. Near duplicates might include different formats of the same content (e.g. PDF and Word versions), partial content such as an email reply that includes a prior message, or slightly modified visuals like cropped vs. uncropped images.

There are a number of strategies for managing duplicates—each with tradeoffs that depend on the type of duplication, the review stage, and your overall risk tolerance. This article outlines the most common types of duplication and provides practical guidance for how to handle them in a defensible, efficient manner.

For more information on how to use the Nuix find/remove duplicate fields, see the Nuix knowledge base.

Common Duplicate Types

Exact Duplicates (Hash Duplicates): These are documents with exactly the same content and metadata. Their hash values are identical.
Family Duplicates (Family Hash Duplicates): These refer to email families—parent emails with attachments or embedded content—where the full set of documents is identical to another set. The family hash will be the same.
Thread Duplicates: These are emails that are wholly contained within a later message in the thread. A reply, for example, may include all prior messages inline.
Near Duplicates: These are conceptually or visually similar documents that do not share an exact hash. This may include format shifts (Word vs PDF), partial edits, or similar content with small variations.
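
To make the idea of a hash duplicate concrete, here is a minimal Python sketch that groups documents whose bytes produce the same digest. It is an illustration only, not how Nuix or our platform computes hashes: the choice of SHA-256, the function names, and the sample document IDs are all assumptions.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return a hex digest of the raw bytes (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(docs: dict[str, bytes]) -> list[list[str]]:
    """Group document IDs whose bytes hash to the same value."""
    by_hash = defaultdict(list)
    for doc_id, data in docs.items():
        by_hash[content_hash(data)].append(doc_id)
    # Only groups with more than one member count as duplicates.
    return [sorted(ids) for ids in by_hash.values() if len(ids) > 1]


# Hypothetical sample data: IDs and contents are made up for this example.
docs = {
    "DOC-001": b"Quarterly report, final version.",
    "DOC-002": b"Quarterly report, final version.",  # same bytes, second custodian
    "DOC-003": b"Quarterly report, final version!",  # one character differs
}

print(group_exact_duplicates(docs))  # [['DOC-001', 'DOC-002']]
```

Note that DOC-003 differs by a single character and therefore falls outside the hash group. That is the gap near-duplicate analysis is meant to cover.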

Family Deduplication

Whenever we add new data to a case, we generally identify family-level duplicates and either code them as duplicates or delete them. These often arise when the same archive is collected from multiple custodians or systems. Because the content is 100% identical, these can be safely removed without loss of information.
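
The sketch below shows one way a family-level comparison could work: combine the hashes of the parent email and its attachments into a single family hash, then flag any later family that repeats an earlier one. The hashing scheme, function names, and sample values are assumptions made for illustration and do not describe the actual family hash used in processing.

```python
import hashlib


def family_hash(member_hashes: list[str]) -> str:
    """Combine item hashes (parent first, then attachments in order) into one digest."""
    h = hashlib.sha256()
    for item in member_hashes:
        h.update(item.encode("utf-8"))
        h.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()


def flag_family_duplicates(families: dict[str, list[str]]) -> dict[str, str]:
    """Map each duplicate family ID to the first family seen with the same hash."""
    first_seen: dict[str, str] = {}
    duplicates: dict[str, str] = {}
    for family_id, member_hashes in families.items():
        key = family_hash(member_hashes)
        if key in first_seen:
            duplicates[family_id] = first_seen[key]
        else:
            first_seen[key] = family_id
    return duplicates


# Hypothetical families: each list holds placeholder item hashes.
families = {
    "FAM-A": ["h-email-1", "h-attach-1", "h-attach-2"],
    "FAM-B": ["h-email-1", "h-attach-1", "h-attach-2"],  # same family, other custodian
    "FAM-C": ["h-email-1", "h-attach-1"],                # one attachment missing
}

print(flag_family_duplicates(families))  # {'FAM-B': 'FAM-A'}
```

Returning a mapping rather than deleting anything mirrors the coding option: the duplicate family stays in the case but points back to the copy that was kept.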

Thread and Near Duplicate Strategy

Other duplicate types—particularly email thread and near duplicates—require more nuanced decision-making, as they may contain important contextual or marginal differences. Here are some common strategies, depending on your review goals and risk appetite:

Thread Deduplication: We automatically identify thread duplicates—emails that are wholly included within other emails later in the thread. If a document was *not* coded as a thread duplicate, it likely contains content not present in the subsequent messages. Reviewing just the "pivot" or final email in the thread is often sufficient, though this depends on the scope of your review.
Using Thread View for Speed: In the review workspace, collapsing child messages within the thread view can speed up first-pass relevance review. If the thread-level document is not relevant, its embedded messages are typically not reviewed either.
Comparing Near Duplicates: Workspace C’s Compare-Related dashboard offers a similarity score between related documents. This is particularly useful for spotting minor differences in similar versions, such as contract iterations or translated content.
Risk Considerations: While it’s tempting to apply blanket decisions across similar documents, we generally advise caution—especially if privilege, redaction, or nuanced factual differences are at stake. Similar documents can lead to different coding decisions. Your team’s level of risk tolerance should guide whether you treat near duplicates as fully reviewable or apply selective sampling.
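
For readers who want a feel for the mechanics behind the strategies above, here is a small Python sketch of two checks: whether an earlier email is wholly contained in a later one, and how similar two texts are on a 0 to 1 scale. It uses the standard library's difflib and a simple normalization step. Both are stand-ins chosen for illustration, not the method behind our thread duplicate coding or the Compare-Related similarity score.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop leading '>' quote markers."""
    lines = [line.lstrip("> ").strip() for line in text.splitlines()]
    return " ".join(" ".join(lines).lower().split())


def is_thread_duplicate(earlier: str, later: str) -> bool:
    """True if the earlier message body appears in full inside the later one."""
    needle = normalize(earlier)
    # An empty body is not treated as contained in anything.
    return bool(needle) and needle in normalize(later)


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical messages made up for this example.
first = "Can you send the signed contract by Friday?"
reply = "Yes, will do.\n\n> Can you send the signed contract by Friday?"
edited = "Can you send the signed contract by Monday?"

print(is_thread_duplicate(first, reply))   # True: the reply quotes it in full
print(is_thread_duplicate(edited, reply))  # False: the date differs
print(round(similarity(first, edited), 2))
```

The last two lines illustrate the risk point: the edited message scores as highly similar to the original, yet it is not contained in the reply and changes a material detail, so it may warrant its own coding decision.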

If you’re unsure which path to take, we’re happy to help walk through the options and determine the best fit for your review strategy.