6 Data Deduplication Resources for Foundation Models

Removing duplicated data can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

  • Apricot

    apricot implements submodular optimization for summarizing massive datasets into minimally redundant subsets that are still representative of the original data. These subsets are useful both for visualizing the modalities in the data and for training accurate machine learning models with just a fraction of the examples and compute; a minimal selection sketch follows below.

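    A minimal sketch of subset selection with apricot, assuming the library is installed (pip install apricot-select) and that documents have already been embedded as feature vectors; the random data, sample counts, and metric here are placeholder choices, and the constructor signature may vary across versions.

    ```python
    # Hedged sketch: facility-location subset selection with apricot.
    # The embeddings are random stand-ins for real document features
    # such as TF-IDF vectors or model activations.
    import numpy as np
    from apricot import FacilityLocationSelection

    X = np.random.RandomState(0).rand(2_000, 128)  # placeholder embeddings

    # Select 100 examples that maximize a facility-location (coverage)
    # objective, i.e., a minimally redundant but representative subset.
    selector = FacilityLocationSelection(100, metric='euclidean')
    X_subset = selector.fit_transform(X)
    print(X_subset.shape)  # (100, 128)
    ```
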
  • Datacomp image dedup

    Tools to deduplicate vision datasets for the DataComp challenge.

  • Dolma Dedupe Tool

    Dolma’s text deduplication tool for pretraining data.

  • Google Text Deduplication

    A repository for deduplicating language model datasets. It includes the ExactSubstr deduplication implementation (written in Rust) along with Python scripts to run ExactSubstr deduplication and inspect the results, plus the document clusters produced by running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B; a sketch of the exact-substring idea follows below.
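
    The ExactSubstr tool uses a suffix array to find long byte sequences that repeat across a corpus. The following Python sketch illustrates the underlying idea only, not the Rust implementation: it flags documents that share any exact fixed-length character window, with the window length and toy corpus chosen arbitrarily.

    ```python
    # Illustrative exact-substring duplicate detection, not the repo's
    # suffix-array implementation: record every fixed-length character
    # window and report windows repeated across documents. Real tools
    # use suffix arrays to avoid holding every window in memory.
    from collections import defaultdict

    def shared_windows(docs, window=40):
        seen = defaultdict(list)  # window text -> [(doc id, offset)]
        hits = []
        for doc_id, text in enumerate(docs):
            for i in range(len(text) - window + 1):
                chunk = text[i:i + window]
                for other_id, other_off in seen[chunk]:
                    if other_id != doc_id:  # only cross-document repeats
                        hits.append((other_id, other_off, doc_id, i))
                seen[chunk].append((doc_id, i))
        return hits

    docs = [
        "the quick brown fox jumps over the lazy dog " * 3,
        "a different document about suffix arrays and dedup pipelines",
        "some prefix, then: the quick brown fox jumps over the lazy dog " * 2,
    ]
    print(shared_windows(docs)[:3])  # first few (doc_a, off_a, doc_b, off_b)
    ```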

  • RedPajama-Data

    Tools for exact deduplication with a Bloom filter, fuzzy deduplication with locality-sensitive hashing (LSH), and computing quality scores; a minimal Bloom-filter sketch follows below.
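
    A self-contained sketch of the Bloom-filter pattern for exact deduplication in one streaming pass; the filter size, hash count, and normalization below are illustrative choices, not RedPajama's actual parameters or code. Note that Bloom filters admit false positives, so a small fraction of unique documents may be dropped.

    ```python
    # Minimal Bloom-filter exact dedup in one streaming pass. Filter
    # size, hash count, and normalization are illustrative only.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 24, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive k bit positions from salted SHA-256 digests.
            for salt in range(self.num_hashes):
                digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def dedup_stream(docs):
        seen = BloomFilter()
        for doc in docs:
            key = " ".join(doc.lower().split())  # cheap normalization
            if key in seen:  # flagged as a (probable) duplicate
                continue
            seen.add(key)
            yield doc

    print(list(dedup_stream(["Hello world", "hello   world", "new text"])))
    # -> ['Hello world', 'new text']
    ```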

  • Pile

    A set of tools for deduplication with MinHashLSH; see the sketch below.
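
    A minimal fuzzy-deduplication sketch using MinHashLSH from the datasketch library (pip install datasketch); the Jaccard threshold, permutation count, and word-level shingling are illustrative and need not match the Pile's actual settings.

    ```python
    # Fuzzy dedup with MinHashLSH via the datasketch library. The
    # threshold and shingling here are illustrative choices only.
    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):
            m.update(token.encode("utf8"))
        return m

    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox leaps over the lazy dog",  # near-dup of "a"
        "c": "an entirely different sentence about bloom filters",
    }

    lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approx. Jaccard >= 0.7
    kept = []
    for doc_id, text in docs.items():
        m = minhash(text)
        if lsh.query(m):  # a likely near-duplicate is already indexed
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    print(kept)  # likely ['a', 'c']: "b" is flagged as a near-dup of "a"
    ```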