12 Data Search, Analysis, & Exploration Resources for Foundation Models

Data Exploration



AI2 C4 Search Tool
A search tool that lets users run full-text queries over Google’s C4 dataset.
- Website
Text
Data Finder
A tool for searching academic datasets given a natural language description of a research idea.
Text
Data Provenance Explorer
An explorer tool for selecting, filtering, and visualizing popular finetuning, instruction, and alignment training datasets from Hugging Face, based on metadata such as source, license, languages, tasks, and topics.
Text
GAIA Search Tool
A search tool over C4, the Pile, ROOTS, and the text captions of LAION, developed with Pyserini (https://github.com/castorini/pyserini).
Text
Hugging Face Data Measurements Tool
A tool to analyze, measure, and compare properties of text finetuning data, including their distributional statistics, lengths, and vocabularies.
- Hugging Face
Text
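As a rough illustration of the kind of measurements mentioned above (lengths, vocabularies, distributional statistics), here is a minimal sketch. It does not use the Data Measurements Tool itself; the `describe` helper, the whitespace tokenization, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def describe(texts):
    """Compute simple length and vocabulary statistics for a list of strings.

    Tokenization is plain lowercased whitespace splitting, an assumption
    chosen for brevity rather than what any particular tool does.
    """
    tokens = [t.lower().split() for t in texts]
    lengths = [len(t) for t in tokens]
    vocab = Counter(tok for t in tokens for tok in t)
    return {
        "n_examples": len(texts),
        "mean_length": mean(lengths),
        "max_length": max(lengths),
        "vocab_size": len(vocab),
        "top_tokens": vocab.most_common(3),
    }

# Hypothetical finetuning prompts used only as sample input.
sample = [
    "Translate this sentence",
    "Summarize this long article please",
    "Translate this",
]
stats = describe(sample)
print(stats)
```

Comparing two datasets then amounts to calling `describe` on each and looking at the differences in the returned dictionaries.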
Know Your Data
A tool for exploring more than 70 vision datasets.
Vision
LAION search
Nearest-neighbor search based on CLIP embeddings.
Text Vision
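To make the idea of nearest-neighbor search over embeddings concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors and item names are made up for illustration; real CLIP embeddings have hundreds of dimensions, and a production service would likely use an approximate index rather than the brute-force scan shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, index, k=2):
    """Return the names of the k items most similar to query_vec."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy stand-ins for image embeddings (hypothetical values).
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.7, 0.6, 0.1],
    "city_map": [0.0, 0.2, 0.9],
}
# Toy stand-in for the embedding of a text query.
query = [1.0, 0.2, 0.0]
results = nearest(query, index, k=2)
print(results)
```

Because text and images share one embedding space in CLIP-style models, the same lookup can serve both text-to-image and image-to-image queries.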
NVIDIA Speech Data Explorer
A tool for exploring speech data.
- Website
Speech
ROOTS Search Tool
A tool, based on a BM25 index, to search over text for each language or group of languages included in the ROOTS pretraining dataset.
- Hugging Face
Text
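For readers unfamiliar with BM25, the following is a minimal sketch of Okapi BM25 scoring over a tiny in-memory corpus. It is not the ROOTS tool's code; the `bm25_scores` function, the parameter defaults `k1=1.5` and `b=0.75`, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical three-document corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "pretraining corpora contain web text".split(),
]
scores = bm25_scores("cat mat".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Matching here is exact on tokens, so "cats" does not match the query term "cat"; a real index would typically apply language-specific analysis such as stemming before scoring.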
What's In My Big Data?
A platform for analyzing large text datasets at scale.
Text
WIMBD
A dataset analysis tool to count, search, and compare attributes across several massive pretraining corpora at scale, including C4, The Pile, and RedPajama.
Text
Nomic
A proprietary service to explore data with embedding maps.
- Website
Text