How to Handle Duplicate and Near-Duplicate Documents Throughout Discovery and Review

Duplicate Documents So Far and Yet So Near

Every good litigator or litigation support professional knows that one way to reduce a discovery collection is to use e-discovery technology to remove irrelevant data, such as executable files, system data and “junk mail,” right from the start. A new wave of software functions and integrations can now be employed to cut to the chase and get relevant data in front of reviewers faster still. Effective e-discovery software has proven instrumental in processing the growing volume of electronic data and in lowering the overall cost of discovery and review.

Exact duplicates in a review set are identified with the forensically sound MD5 and SHA-1 hashing algorithms: two files that produce the same hash value are treated as the same document. But what about a file that differs from the original by only one or two bytes? Its hash will no longer match, so would a review team need to review it simply because it is not an exact duplicate? Common sense says no. Such files are classified as near-duplicate documents. They can also be found forensically and set aside, along with the exact duplicates, so that they need not be subject to attorney review. Without proper detection of duplicate and near-duplicate data, the cost of review rises with the volume of data, and so does the risk of inadvertent production. It is highly desirable to segregate this data so that it can be managed effectively.
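To make the distinction concrete, here is a minimal sketch in Python of exact-duplicate detection by hashing. The file names and contents are invented for illustration, and the helper names `digest` and `group_exact_duplicates` are hypothetical rather than taken from any e-discovery product. It groups documents whose SHA-1 digests match and shows that a one-byte change is enough to break the match.

```python
import hashlib
from collections import defaultdict

def digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of the data ("md5" is also accepted by hashlib)."""
    return hashlib.new(algorithm, data).hexdigest()

def group_exact_duplicates(docs: dict, algorithm: str = "sha1") -> list:
    """Group document names whose contents hash to the same value."""
    groups = defaultdict(list)
    for name, data in docs.items():
        groups[digest(data, algorithm)].append(name)
    # Only groups with more than one member contain duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Invented sample data: the third memo differs from the first by one byte.
docs = {
    "memo_v1.txt": b"Quarterly results attached.",
    "memo_copy.txt": b"Quarterly results attached.",
    "memo_v2.txt": b"Quarterly results attached!",
}

duplicate_groups = group_exact_duplicates(docs)
print(duplicate_groups)  # [['memo_copy.txt', 'memo_v1.txt']]
print(digest(docs["memo_v1.txt"]) == digest(docs["memo_v2.txt"]))  # False
```

The one-byte variant falls outside the duplicate group, which is exactly the gap that near-duplicate detection is meant to fill.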

Near-Duplicate Detection

Near-duplicate detection is the process of describing the similarity of data elements within data sets. Data elements may be grouped to indicate that elements within a given group have common content. For purposes of illustration, consider the following sentences:
A: The quick brown fox jumped over the lazy dog.
B: The fox jumped over the dog.
C: Near-duplicate detection is important.

Plainly, sentence B is similar to sentence A, and sentence C is not similar to either. Sentences A and B are similar because they share content: “The fox,” “jumped,” “over the,” and “dog.” Neither shares content with sentence C. Sentences A and B would therefore be grouped together and could be considered near-duplicate documents.
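The intuition can be put into numbers with a small Python sketch. It scores each pair of sentences by the share of distinct words they have in common (the Jaccard index on word sets). This measure is chosen here only because it is easy to follow; it is an illustrative assumption, not the method any particular review tool uses, and the function names are hypothetical.

```python
import re

def words(text: str) -> set:
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def word_overlap(a: str, b: str) -> float:
    """Shared distinct words divided by all distinct words in either text."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0  # two empty texts are trivially identical
    return len(wa & wb) / len(wa | wb)

A = "The quick brown fox jumped over the lazy dog."
B = "The fox jumped over the dog."
C = "Near-duplicate detection is important."

print(word_overlap(A, B))  # 0.625 (5 shared words out of 8 distinct words)
print(word_overlap(A, C))  # 0.0
print(word_overlap(B, C))  # 0.0
```

Under this measure A and B score well above zero while C shares nothing with either, which matches the grouping described above.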

Implementation of Near-Duplicate Detection

Application of near-duplicate detection to forensically reduce the number of documents to be reviewed must rigorously and demonstrably conform to the Federal Rules of Civil Procedure, and should follow the Electronic Discovery Reference Model and the best practices of the Sedona Conference and its Working Groups. One proven algorithm that is both high-performance and conformant to forensic data processing standards is “Context Triggered Piecewise Hashing” (CTPH), which is based upon the work of computer forensics researcher Jesse Kornblum and award-winning computer scientist Andrew Tridgell.

The CTPH algorithm may be used to find near-duplicate documents. It provides a percentage match from 1 percent to 99 percent and allows the reviewer to establish the “threshold of similarity” for the collection and to either include or exclude documents and data based on percentage matching.
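The general idea behind piecewise hashing can be sketched in a few lines of Python. What follows is a deliberately simplified toy, not Kornblum's implementation: the window size, the trigger rule, the use of SHA-1 for each piece, and the use of `difflib.SequenceMatcher` for scoring are all assumptions made for illustration, and the score here runs from 0 to 100. A rolling value over the last few bytes decides where each piece ends, every piece is reduced to one signature character, and two signatures are then compared.

```python
import hashlib
from difflib import SequenceMatcher

WINDOW = 7        # bytes covered by the rolling value (assumed for this sketch)
TRIGGER_MOD = 16  # controls how often a piece boundary occurs (assumed)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def piece_char(piece: bytes) -> str:
    """Reduce one piece of the document to a single signature character."""
    return ALPHABET[hashlib.sha1(piece).digest()[0] % 64]

def signature(data: bytes) -> str:
    """Split data at content-defined boundaries and hash each piece."""
    chars, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # keep only the last WINDOW bytes
        # The boundary depends only on nearby bytes, so a local edit
        # disturbs only the pieces around it.
        if i + 1 >= WINDOW and rolling % TRIGGER_MOD == TRIGGER_MOD - 1:
            chars.append(piece_char(data[start:i + 1]))
            start = i + 1
    if start < len(data):
        chars.append(piece_char(data[start:]))  # trailing partial piece
    return "".join(chars)

def similarity(a: bytes, b: bytes) -> int:
    """Score two documents from 0 to 100 by comparing their signatures."""
    matcher = SequenceMatcher(None, signature(a), signature(b), autojunk=False)
    return round(matcher.ratio() * 100)

original = b"The quick brown fox jumped over the lazy dog. " * 20
revised = original + b"(revised)"
print(similarity(original, original))  # 100
print(similarity(original, revised))
print(similarity(original, b"Near-duplicate detection is important."))
```

Because the boundaries are triggered by content rather than by fixed offsets, text appended to the end leaves the earlier part of the signature unchanged, whereas a conventional hash of the whole file would change completely.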

Each legal team can specify the threshold value (most likely 75 percent) that controls the grouping of near-duplicate documents. Some technologies use a default threshold of 80 percent similarity for near-duplicate detection. For larger review teams, it may be beneficial to tune this number down somewhat, so that similar documents fall into the same group and are less likely to be assigned across different review resources. Any value below 50 percent is not typically useful for review but could be of benefit in reporting on the scope of a document collection.
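The effect of the threshold can be illustrated with a short Python sketch. The document names and pairwise scores below are made up for the example, and the grouping rule (any pair at or above the threshold is linked, and links are followed transitively) is an assumption; a given product may group differently.

```python
def group_near_duplicates(names, scores, threshold=80):
    """Link every pair scoring at or above threshold, then collect groups."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical similarity scores (percent) between pairs of documents.
names = ["contract_v1", "contract_v2", "contract_v3", "invoice"]
scores = {
    ("contract_v1", "contract_v2"): 92,
    ("contract_v2", "contract_v3"): 78,
    ("contract_v1", "contract_v3"): 71,
    ("contract_v1", "invoice"): 12,
    ("contract_v2", "invoice"): 10,
    ("contract_v3", "invoice"): 9,
}

print(group_near_duplicates(names, scores, threshold=80))
# [['contract_v1', 'contract_v2'], ['contract_v3'], ['invoice']]
print(group_near_duplicates(names, scores, threshold=75))
# [['contract_v1', 'contract_v2', 'contract_v3'], ['invoice']]
```

Dropping the threshold from 80 to 75 pulls the third contract draft into the same group as the first two, so all three would travel together to one reviewer.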

Once near-duplicates are identified within the assigned threshold, they can be kept nearby in folders as part of the collection, but not part of the active attorney review.

Classification and Foldering

There are SaaS and web server-based review tools, such as CT Summation CaseVantage and iCONECT, that allow you to allocate documents to specific folders and help you navigate through data quickly. Culling tools help classify documents as duplicate or near-duplicate documents; review tools help segregate this data into sub-groups in order to expedite review. Segregated data is still part of the collection, but not part of attorney review. This is important because if the court determines that an initial keyword search was inadequate or unreasonable, it is good to have the data nearby in folders when more specific disclosure orders are received.
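As a rough sketch of the foldering step in Python, the function below keeps every document in the collection but labels each one with a folder. The folder names, the file names, and the rule of keeping the alphabetically first document of each group in active review are all assumptions for illustration; they do not describe how CaseVantage, iCONECT or any other tool behaves.

```python
def assign_folders(all_docs, exact_groups, near_groups):
    """Label each document; nothing is removed from the collection."""
    folders = {doc: "active_review" for doc in all_docs}

    # Keep one copy of each exact-duplicate group in active review.
    for group in exact_groups:
        for doc in sorted(group)[1:]:
            folders[doc] = "duplicates"

    # Of the remaining members of each near-duplicate group, keep one.
    for group in near_groups:
        remaining = [d for d in sorted(group) if folders[d] != "duplicates"]
        for doc in remaining[1:]:
            folders[doc] = "near_duplicates"
    return folders

# Hypothetical collection: one exact copy and one lightly revised version.
all_docs = ["a.doc", "a_copy.doc", "a_rev.doc", "b.doc"]
exact_groups = [["a.doc", "a_copy.doc"]]
near_groups = [["a.doc", "a_copy.doc", "a_rev.doc"]]

folders = assign_folders(all_docs, exact_groups, near_groups)
for doc, folder in folders.items():
    print(doc, "->", folder)
# a.doc -> active_review
# a_copy.doc -> duplicates
# a_rev.doc -> near_duplicates
# b.doc -> active_review
```

Because the segregated documents are only relabeled, they can be moved back into active review if a later order widens the scope.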


Identifying and addressing duplicate and near-duplicate documents early in a case has helped reduce review costs downstream. Sophisticated technology is now being employed to cull, manage and reduce collections prior to attorney review. As the market becomes more tech-savvy and tech-dependent, legal professionals can be sure that new waves of technology will continue to help them meet the challenges of discovery.

Robert Childress of Wave Software

About the author: Robert Childress, Wave Software president and co-founder, has held senior positions with key technology corporations including Lexis-Nexis, Thomson & Thomson, Elsevier Science, CXA Inc., McGraw-Hill and Shepard’s.

This article originally appeared in the Legal Technology Update.

~ by CDLB on May 20, 2009.
