Finetuning or adaptation of foundation models is a complex step in model development, and finetuned models are deployed far more often than base models. Here we link to some useful and widely used resources for finetuning.
![Finetuning Data Catalogs for Foundation Models](/foundation-model-resources/finetuning-data-catalogs/finetuning-data-catalogs_hu006eabc3a48d560cf7382cc193979586_64823_736x0_resize_q90_h2_lanczos_3.webp)
A repository of Indian language text and speech resources, including datasets.
A catalogue of hundreds of Arabic text and speech finetuning datasets, regularly updated.
A speaker diarization dataset comprising over 50 hours of conversational speech recorded at twenty real dinner parties held in real homes.
A repository and explorer tool for selecting popular finetuning, instruction, and alignment training datasets from Hugging Face based on data provenance and dataset characteristics.
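To make the idea of provenance-based selection concrete, here is a minimal Python sketch. The catalog entries, field names, and license list below are illustrative assumptions, not the explorer tool's actual schema or API:

```python
# Illustrative sketch (not the real explorer's API) of selecting
# finetuning datasets by provenance criteria such as license and
# language coverage. All names and entries here are hypothetical.

catalog = [
    {"name": "dataset_a", "license": "apache-2.0", "languages": ["en", "fr"]},
    {"name": "dataset_b", "license": "cc-by-nc-4.0", "languages": ["en"]},
    {"name": "dataset_c", "license": "mit", "languages": ["ar"]},
]

# Example set of licenses treated as permissive for this sketch.
PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0"}

def select(catalog, language):
    """Keep permissively licensed datasets covering the given language."""
    return [d["name"] for d in catalog
            if d["license"] in PERMISSIVE and language in d["languages"]]

print(select(catalog, "en"))  # → ['dataset_a']
```

In practice a selection tool would draw these fields from curated metadata rather than a hand-written list, but the filtering logic is the same.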
An online catalogue linking to African language resources (papers and datasets) in both text and speech.
A repository of African language text and speech resources, including datasets.
A speaker identification dataset comprising YouTube interviews with thousands of celebrities.
A spoken language identification dataset built from audio extracted from YouTube videos retrieved with language-specific search phrases.
An online catalogue providing African language resources (data and models) in both text and speech.
A permissively licensed multilingual instruction finetuning dataset curated through the Aya Annotation Platform from Cohere For AI. The dataset contains 204k human-annotated prompt-completion pairs spanning 65 languages, along with demographic data for the annotators.
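To illustrate what a multilingual prompt-completion dataset like this looks like in practice, here is a minimal Python sketch. The records and field names are invented for illustration and are not the dataset's actual schema:

```python
# Hypothetical sketch of prompt-completion records with language
# metadata, as found in multilingual instruction finetuning datasets.
# Field names and example rows are illustrative assumptions only.

records = [
    {"prompt": "Translate 'good morning' to French.",
     "completion": "Bonjour.", "language": "English"},
    {"prompt": "Nenne drei Primzahlen.",  # "Name three prime numbers."
     "completion": "2, 3 und 5.", "language": "German"},
    {"prompt": "Explain photosynthesis briefly.",
     "completion": "Plants convert light into chemical energy.",
     "language": "English"},
]

def filter_by_language(data, language):
    """Select only the prompt-completion pairs tagged with one language."""
    return [r for r in data if r["language"] == language]

english_subset = filter_by_language(records, "English")
print(len(english_subset))  # → 2
```

Language metadata of this kind is what makes it possible to build balanced multilingual finetuning mixes or to train on a targeted subset of languages.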