6 Data Deduplication Resources for Foundation Models

Data Deduplication

By modality: Text 5, Speech 1, Vision 2

Apricot
apricot implements submodular optimization to summarize massive data sets into minimally redundant subsets that remain representative of the original data. These subsets are useful both for visualizing the modalities in the data and for training accurate machine learning models with a fraction of the examples and compute.
Text Speech Vision
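As a rough illustration of the submodular selection idea described for apricot, the sketch below greedily maximizes a facility-location objective in plain Python. It does not use apricot's API: the function name `greedy_facility_location`, the similarity function, and the toy points are assumptions made for this example only.

```python
import math


def greedy_facility_location(points, k):
    """Greedily pick k indices that maximize sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    # Assumed similarity: 1 / (1 + Euclidean distance), always in (0, 1].
    sim = [[1.0 / (1.0 + math.dist(p, q)) for q in points] for p in points]
    best = [0.0] * n  # best similarity of each point to the selected set so far
    selected = []
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves each point's coverage.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters of three points each (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = greedy_facility_location(points, 2)
print(subset)
```

With a budget of two, the greedy step picks one representative from each cluster, because a second point from an already covered cluster adds almost no marginal gain.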
Datacomp image dedup
Resources for deduplicating vision datasets for the DataComp challenge.
Vision
Dolma Dedupe Tool
Dolma’s text deduplication tool for pretraining data.
Text
Google Text Deduplication
A repository for deduplicating language model datasets. The authors release the ExactSubstr deduplication implementation (written in Rust), along with Python scripts to run ExactSubstr deduplication and inspect the results. They also release the document clusters produced by running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.
Text
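To give a feel for exact-substring deduplication, the sketch below finds repeated substrings by sorting suffixes and comparing neighbours. It is a toy illustration, not the repository's Rust implementation: the naive suffix sort, the function name `repeated_substrings`, the character-level granularity, and the `min_len` threshold are all assumptions chosen for readability.

```python
def repeated_substrings(text, min_len):
    """Return substrings of length >= min_len that occur at least twice."""
    # Naive suffix array: fine for a demo, far too slow for a real corpus.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two lexicographically adjacent suffixes.
        lcp = 0
        while (a + lcp < len(text) and b + lcp < len(text)
               and text[a + lcp] == text[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            found.add(text[a:a + lcp])
    return found


corpus = "the cat sat on the mat. a dog ran. the cat sat on the rug."
dups = repeated_substrings(corpus, 15)
print(max(dups, key=len))
```

Shorter suffixes of a repeated span are reported as well; a real pipeline would merge overlapping spans and remove all but one occurrence.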
RedPajama-Data
Tools for exact deduplication with a Bloom filter, fuzzy deduplication with locality-sensitive hashing (LSH), and computing quality scores.
Text
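Exact deduplication with a Bloom filter can be sketched in a few lines. The example below is a minimal stand-alone version, not RedPajama-Data's code: the `BloomFilter` class, its default sizes, and the `dedup_exact` helper are hypothetical names and parameters chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def dedup_exact(docs):
    """Keep the first occurrence of each document, in order."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        if doc in seen:
            continue  # probably seen before (false positives are possible)
        seen.add(doc)
        kept.append(doc)
    return kept


docs = ["a b c", "d e f", "a b c", "g h i", "d e f"]
unique_docs = dedup_exact(docs)
print(unique_docs)
```

The appeal of this design is memory: the filter stores a fixed number of bits regardless of document length, at the cost of occasionally discarding a document that was never seen.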
Pile
A set of tools for deduplication with MinHashLSH.
Text
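For readers unfamiliar with MinHash plus LSH, the following is a minimal pure-Python sketch of the idea. It is not the Pile tooling and does not use any MinHashLSH library: the helper names (`shingles`, `minhash`, `lsh_candidate_pairs`), the word 3-gram shingling, and the 64-hash, 16-band configuration are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of word k-grams (assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_perm=64):
    """Signature: for each seeded hash function, the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set))
    return signature


def lsh_candidate_pairs(signatures, bands=16):
    """Documents sharing an identical band of their signature become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "training large language models requires carefully curated text corpora",
}
signatures = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
print(lsh_candidate_pairs(signatures))
```

Candidate pairs are only suggestions; a typical pipeline would then verify each pair with an exact or estimated Jaccard similarity before dropping one of the documents.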