The Problem with Duplicate Files and How To Find Them

By John Bald

 

Step 1: The Problem with Duplicate Files

There are many problems associated with duplicate files:

  • They are unnecessary and redundant.
  • They take up valuable storage space.
  • They may contain sensitive information yet sit in an unmanaged location.
  • Important changes made to the original document are not carried over to the duplicate, leaving its contents invalid and obsolete.
  • Legal holds and retention rules may not be applied to duplicates, creating a liability.
  • Search results are muddied by multiple returns of the same document.
  • Data analytics and metrics can be skewed by an overabundance of the same information.

Finding all of your duplicates and getting them under control can be a huge undertaking. But it is possible, and well worth doing.

 

Step 2: How Do You Find Them?

A popular technique for finding duplicate files is hashing. Hashing generates a hash code for each file. A hash code is a fixed-length code that represents a binary source of any length.

Here is an example:

9466c304e74cbe6ebb7abf2eab0cf903
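To illustrate the "fixed length" property, here is a minimal sketch using Python's standard hashlib module. The article does not name the hash algorithm behind its example, so the use of SHA-256 here is an assumption, and `hash_code` is a hypothetical helper name. SHA-256 codes are 64 hexadecimal characters long rather than the 32 shown above, but the principle is the same: the length of the code does not depend on the size of the input.

```python
import hashlib


def hash_code(data: bytes) -> str:
    """Return the SHA-256 hash of data as a hexadecimal string (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


# One byte of input versus one million bytes of input.
short_code = hash_code(b"a")
long_code = hash_code(b"a" * 1_000_000)

# Both codes have the same length, but their values differ.
print(len(short_code), short_code)
print(len(long_code), long_code)
```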

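If two files have exactly the same hash code, we can be confident that they are exact duplicates. Two different files producing the same code (a collision) is possible in principle, but with a well-designed hash function it is so unlikely to happen by accident that it can be ignored for most purposes.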
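As a rough sketch of how matching hash codes can be used to find duplicates, the following Python groups every file under a folder by its hash code and keeps the groups with more than one member. The function names, the choice of SHA-256, and the extra byte-for-byte check are all assumptions made for illustration, not the method of any particular product.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents (SHA-256 is an assumed choice of algorithm)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root that share the same hash code."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_hash(path)].append(path)
    # Only hash codes seen more than once indicate duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]


def confirm_identical(paths: list[Path]) -> bool:
    """Optional safeguard: compare every file byte for byte with the first."""
    return all(filecmp.cmp(paths[0], other, shallow=False) for other in paths[1:])
```

Calling `find_duplicates(Path("some/folder"))` returns one list per set of duplicates. `file_hash` reads each file fully into memory, which keeps the sketch short but is not ideal for very large files.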

Step 3: How is a Hash Code Generated?

A hash code is generated by passing a string of any size (the contents of a file) to a hash function, which returns the hash value, or code. The hash function treats the string as a sequence of bytes when generating the code.

[Figure: hashing function structure]
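The process described above can be sketched in Python as follows. The file is opened in binary mode so that its contents reach the hash function as raw bytes, and it is read in chunks so that large files do not have to fit in memory. The name `hash_file`, the chunk size, and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hash code of a file's contents as a hexadecimal string."""
    digest = hashlib.sha256()  # assumed algorithm; the article does not name one
    with open(path, "rb") as handle:  # "rb": read the contents as bytes
        # Feed the file to the hash function one chunk at a time.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding the bytes in chunks produces the same code as hashing the whole file in one call, so the chunk size affects only memory use, not the result.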


Step 4: The Hash of a File Is Only the Contents

The hash of a file depends only on its actual contents. In the search results shown below, four of the files have the exact same hash code, even though one has a different name, one has a different extension, and one is an exact copy.

The changed document in those results had only a single period removed. Even the slightest change to a document produces an entirely different hash code.
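A small experiment makes this concrete. The sketch below writes the same text to three files with different names and extensions, and a fourth file with the final period removed, then hashes all four. The file names and sample text are invented for this illustration, and SHA-256 is an assumed choice of algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash only the bytes inside the file; the name plays no part."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    text = b"Quarterly results are attached. Please review."
    (folder / "report.txt").write_bytes(text)        # original
    (folder / "renamed.txt").write_bytes(text)       # different name
    (folder / "report.doc").write_bytes(text)        # different extension
    (folder / "changed.txt").write_bytes(text[:-1])  # final period removed
    hashes = {path.name: file_hash(path) for path in folder.iterdir()}

for name in sorted(hashes):
    print(name, hashes[name])
```

The first three files print the same code, while `changed.txt` prints a different one.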

Step 5: Visualizing Duplicates

By hashing all of your data and recording the results in an index, you can quickly visualize and begin to understand the number of duplicates present in your organization. It is not uncommon to find that over half of the files in a given dataset are exact duplicates.

[Figure: duplicate visualization]
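Once hash codes are recorded in an index, the scale of duplication can be summarized with a few counts. The sketch below assumes a very simple index, a dictionary mapping each file path to its hash code. That structure, the function name, and the sample entries are hypothetical and stand in for whatever index is actually used.

```python
from collections import Counter


def duplicate_summary(index: dict) -> dict:
    """Summarize duplication in an index of {file path: hash code}."""
    counts = Counter(index.values())
    total = len(index)
    unique = len(counts)
    redundant = total - unique  # copies beyond the first of each distinct content
    return {
        "total_files": total,
        "unique_contents": unique,
        "redundant_copies": redundant,
        "redundant_share": redundant / total if total else 0.0,
    }


# Placeholder paths and hash codes, for illustration only.
sample_index = {
    "/share/a/report.docx": "hash-a",
    "/share/b/report.docx": "hash-a",
    "/share/c/report (copy).docx": "hash-a",
    "/share/a/budget.xlsx": "hash-b",
    "/share/d/notes.txt": "hash-c",
}

summary = duplicate_summary(sample_index)
print(summary)
```

Figures like these are the kind of input a chart or dashboard of duplicates would be built from.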

Next Steps:

Learn how to find similar files in our video "Discovery Desktop - Functions and Settings".

Advanced Articles:

For current Shinydocs customers, visit our Help Desk for access to complete product manuals, how-to and troubleshooting articles, use cases and instructional videos.
