
An open source recipe to reproduce the LLaMA training dataset RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models. It provides a reproducible data ingestion pipeline for the RedPajama dataset, with the following token counts:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
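As a quick sanity check on the table above, the per-source counts can be summed to confirm the stated total. This is a minimal sketch; the dictionary below simply transcribes the figures from the table (in billions of tokens) and is not part of the RedPajama-Data codebase.

```python
# Per-dataset token counts in billions, transcribed from the table above
# (not an API of the RedPajama-Data repo, just the published figures).
token_counts_billion = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total_billion = sum(token_counts_billion.values())
print(f"Total: {total_billion} billion tokens (~{total_billion / 1000:.1f} trillion)")
```

The sources sum to 1,210 billion tokens, which rounds to the 1.2 trillion quoted as the total.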
