10 Pretraining Repositories for Foundation Model Training

Practitioners should consider using already-optimized codebases, especially in the pretraining phase, to make effective use of compute, capital, power, and effort. Existing open-source codebases built for foundation model pretraining can make pretraining significantly more accessible to new practitioners and help accumulate efficiency techniques for model training.

By modality: Text (7), Speech (2), Vision (3)
  • Levanter

    Levanter is a framework for training large language models (LLMs) and other foundation models that strives for legibility, scalability, and reproducibility.

  • GPT-NeoX

    A library for training large language models, built off Megatron-DeepSpeed and Megatron-LM with an easier user interface. Used at massive scale on a variety of clusters and hardware setups.

  • Kosmos-2

    For training multimodal models with CLIP backbones.

  • Lhotse

    A Python library for handling speech data in machine learning projects.

    Speech
  • Megatron-DeepSpeed

    A library for training large language models, built off of Megatron-LM but extended by Microsoft to support features of their DeepSpeed library.

  • Megatron-LM

    One of the earliest open-source pretraining codebases for large language models. It is still updated and has been used in a number of landmark distributed training and parallelism research papers by NVIDIA.

  • OpenCLIP

    Supports training and inference for over 100 CLIP models.

    Text Vision
  • OpenLM

    OpenLM is a minimal language modeling repository that aims to facilitate research on medium-sized LMs. Its authors have verified the performance of OpenLM up to 7B parameters and 256 GPUs. It depends only on PyTorch, XFormers, or Triton.

  • PyTorch Image Models (timm)

    Hub for models, scripts and pre-trained weights for image classification models.

    Vision
  • Stable Audio Tools

    A codebase for distributed training of generative audio models.

    Speech