【DAILY READING】Textbooks Are All You Need
Summary of the paper
This paper argues that, for a targeted capability, data quality now offers a better cost-performance trade-off than simply scaling up the model. It demonstrates this with code generation: a 1.3B-parameter model, pretrained and finetuned on textbook-quality data, achieves extraordinary performance on HumanEval and MBPP.
Abstract
Since the emergence of the Transformer architecture, the default recipe for improving a model has been to gather more data and enlarge the model. This paper shows that if the data quality is high enough, a remarkable model can be obtained at a smaller size and with less data. In this way, we get two models. One has 1.3B parameters, pretrained on 7B tokens of "textbook quality" data and finetuned on about 200M tokens of "textbook-exercise-like" data; the other is even smaller (350M parameters). Both achieve surprisingly strong performance on HumanEval and MBPP.
Training details and the importance of high-quality data
Some drawbacks of common code data:
- Not self-contained: the code depends on external files or modules.
- Typical examples contain no meaningful algorithmic logic, but rather consist of boilerplate such as constants, parameter definitions, or GUI elements.
- Good code is often buried inside complex or poorly documented functions.
- Examples are skewed toward certain topics or use cases, which results in an unbalanced dataset. In short, common code data is very noisy.
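To make the contrast concrete, here is a purely illustrative pair of snippets (invented for this note, not taken from the paper's data): the first has little educational value, while the second is self-contained, documented, and teaches a concept.

```python
# Illustrative only -- not from the paper's datasets.

# Low educational value: configuration constants with no algorithmic content.
WINDOW_WIDTH = 800
WINDOW_HEIGHT = 600
BUTTON_COLOR = "#3366cc"

# Higher educational value: self-contained, documented, teaches a concept.
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```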
Datasets we used:
- A filtered code-language dataset from The Stack and StackOverflow. Obtained by using a language model-based classifier. (About 6B tokens)
- A Python textbook dataset generated by GPT-3.5. (About 1B tokens)
- A small synthetic exercises dataset of Python exercises and solutions. (About 180M tokens)
Filtering of existing code datasets using a transformer-based classifier
The raw data is the Python subset of the deduplicated versions of The Stack and StackOverflow, which contains over 35 million files/samples totaling over 35B tokens. We then annotate the quality of a small subset of these files with GPT-4: given a code snippet, it is prompted to determine its educational value for a student whose goal is to learn basic coding concepts. This annotated set is used to train a random forest classifier that predicts the quality of the remaining code. Note that GPT-3.5 is used extensively to generate data, while GPT-4 is used only for the quality annotations.
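A minimal sketch of this filtering pipeline, under stated assumptions: `embed_code` below is a toy hashed bag-of-tokens stand-in (the paper uses embeddings from a pretrained code model), and `annotated` is a hypothetical list of (snippet, GPT-4 label) pairs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embed_code(snippet: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real embedding: a hashed bag-of-tokens vector.
    The actual pipeline uses output embeddings from a pretrained code model."""
    vec = np.zeros(dim)
    for tok in snippet.split():
        vec[hash(tok) % dim] += 1.0
    return vec

def train_quality_filter(annotated):
    """annotated: list of (code_snippet, label) pairs, where label is 1 for
    'high educational value' per the GPT-4 annotation, 0 otherwise.
    Assumes both labels occur in the annotations."""
    X = np.stack([embed_code(code) for code, _ in annotated])
    y = np.array([label for _, label in annotated])
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)
    return clf

def keep_high_quality(clf, snippets, threshold=0.5):
    """Keep only snippets the classifier scores above the threshold."""
    X = np.stack([embed_code(s) for s in snippets])
    probs = clf.predict_proba(X)[:, 1]
    return [s for s, p in zip(snippets, probs) if p >= threshold]
```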
Creation of synthetic textbook-quality datasets
It is important, but also hard, to keep the dataset diverse, because language models tend to follow the most probable or common paths given their training data and their priors, so their outputs tend to be similar to each other. Inspired by earlier work on synthetic data generation, we look for ways to inject randomness into the prompt in a way that gives rise to a diverse dataset.
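A minimal sketch of this randomness-injection idea; the topic and audience lists below are invented for illustration and are not the constraints actually used to build the synthetic textbook dataset.

```python
import random

# Hypothetical constraint pools; the paper's actual constraints are not published here.
TOPICS = ["recursion", "string formatting", "list comprehensions",
          "file I/O", "exception handling", "binary search"]
AUDIENCES = ["complete beginners", "high-school students",
             "data analysts new to Python", "experienced C programmers"]

def make_textbook_prompt(rng: random.Random) -> str:
    """Build a generation prompt with randomly sampled topic/audience constraints."""
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    return (f"Write a section of a Python textbook about {topic} "
            f"for {audience}. Include clear explanations, a short code "
            f"example, and one exercise with its solution.")

rng = random.Random(0)
prompts = [make_textbook_prompt(rng) for _ in range(3)]
```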
- The synthetic textbook dataset
- Here, diversity is obtained by providing constraints on topics and target audience of the generated textbook.
- The CodeExercises dataset
- Here, diversity is obtained mainly by constraining the function names that the generated exercises must implement.
Model architecture and training
- A decoder-only Transformer model using the FlashAttention implementation of multi-head attention (MHA).
- MHA and MLP layers are used in a parallel configuration, following recent models such as CodeGen, PaLM, and GPT-NeoX.
- The 1.3B/350M model consists of 24/20 layers, a hidden dimension of 2048/1024, an MLP inner dimension of 8192/4096, and 32/16 attention heads of dimension 64 each.
- Use a rotary position embedding with rotary dimension 32.
- Aside from FlashAttention, the models do not use other techniques like Fill-In-the-Middle (FIM), or Multi-Query-Attention (MQA) that could further boost performance and efficiency.
- The models are trained with a sequence length of 2048 and a next-token prediction loss.
- Use fp16 training with AdamW optimizer, linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1.
- Trained on 8 Nvidia A100s for about 4 days, plus about 7 hours of finetuning.
- Pretraining: on the CodeTextbook dataset with batch size 1024, maximum learning rate 1e-3, warmup over 750 steps, and weight decay 0.1, for a total of 36,000 steps (the checkpoint at 24,000 steps is the one actually used).
- Finetuning: same setup as pretraining, but with batch size 256, maximum learning rate 1e-4 with 50 warmup steps, and weight decay 0.01, for a total of 6,000 steps.
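A rough sketch of the pretraining optimizer and learning-rate schedule described above, using PyTorch's AdamW and LambdaLR; this is an approximation built from the reported hyperparameters, not the paper's actual training code.

```python
import torch

def linear_warmup_linear_decay(warmup_steps: int, total_steps: int):
    """Scale factor for the peak LR: linear warmup, then linear decay to 0."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return lr_lambda

model = torch.nn.Linear(10, 10)  # stand-in for the actual 1.3B-parameter model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=linear_warmup_linear_decay(warmup_steps=750, total_steps=36_000),
)
# In the training loop, call scheduler.step() after each optimizer.step().
# Finetuning would use lr=1e-4, weight_decay=0.01, 50 warmup steps, 6,000 total steps.
```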
Spikes of model capability after finetuning on CodeExercises
This suggests that the finetuning process might have helped the model reorganize and consolidate the knowledge acquired during pretraining, even when such knowledge is not explicitly present in the CodeExercises dataset.
Finetuning improves the model’s understanding
Finetuning improves the model’s ability to use external libraries
This even includes libraries that the exercises do not contain.
Evaluation on unconventional problems with LLM grading
The generated code is evaluated not only by executing it, but also by having GPT-4 examine the reasoning and the quality of the solution.
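A rough sketch of what such LLM-based grading could look like with the OpenAI chat API; the grading prompt and rubric here are assumptions, not the paper's.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_solution(problem: str, candidate_code: str, model: str = "gpt-4") -> str:
    """Ask an LLM judge to assess the reasoning and quality of a solution.
    The prompt wording is illustrative, not the one used in the paper."""
    prompt = (
        "You are grading a student's Python solution.\n"
        f"Problem:\n{problem}\n\nStudent solution:\n{candidate_code}\n\n"
        "Assess whether the reasoning is correct and the code solves the "
        "problem, then give a grade from 1 to 10 with a short justification."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```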
Data pruning for unbiased performance evaluation
To avoid the pretraining data containing solutions to HumanEval problems, duplicated data is removed via an N-gram overlap search, complemented by embedding-based and syntax-based similarity analysis: L2 distance between code embeddings, and matching over abstract syntax trees (ASTs).
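A simplified sketch of the N-gram overlap part of this decontamination; the default n value is illustrative, and the embedding- and AST-based similarity passes are omitted.

```python
def ngrams(text: str, n: int = 10) -> set:
    """Word-level n-grams of a code/text sample."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(train_sample: str, eval_problem: str, n: int = 10) -> bool:
    """True if the training sample and the evaluation problem share any
    word-level n-gram of length n (a signal of possible contamination)."""
    return bool(ngrams(train_sample, n) & ngrams(eval_problem, n))

def prune_training_set(train_samples, eval_problems, n: int = 10):
    """Drop training samples that overlap with any evaluation problem."""
    return [s for s in train_samples
            if not any(shares_ngram(s, p, n) for p in eval_problems)]
```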