【DAILY READING】An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining
My Own Summary
This is not a formal conference or journal paper; it is a blog post by Yao Fu. The post argues that data may influence a model more deeply than we currently understand. It first introduces grokking: the phenomenon where, after a long period of training, a model's behavior shifts from simple to complex within a short time. I think this is a little like the concept of “emergence”, though not exactly the same. “Emergence” means the model gains the ability to handle cases it has never seen, while “grokking” refers to the period during training in which the model gets better on common test data. The post then elaborates on how data influences the model from two angles:
- Data factors that influence speed of learning.
- How to measure the speed of learning.
Data influences the speed of learning in three ways: the format of the data, the curriculum of the data, and how the data is mixed during pretraining. The post then discusses how to measure the speed of model learning, from the micro level to the macro level. That is about all I could understand from this post. After listening to others share their takes on it, I realized how limited my understanding is. Let's read more and study more.
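To make the "mix" factor a bit more concrete for myself, here is a minimal sketch of drawing pretraining documents from several sources according to mixture weights. This is only my own illustration, not code from the post: the source names (`web`, `code`, `books`), the weights, and the helper `sample_mixture` are all hypothetical.

```python
import random
from collections import Counter


def sample_mixture(sources, weights, n_samples, seed=0):
    """Draw n_samples documents, choosing the source of each draw by its mixture weight.

    `sources` maps a source name to a list of documents.
    `weights` maps the same names to (unnormalized) sampling weights.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n_samples):
        # Pick a source in proportion to its weight, then a document uniformly within it.
        name = rng.choices(names, weights=probs, k=1)[0]
        doc = rng.choice(sources[name])
        batch.append((name, doc))
    return batch


# Hypothetical toy corpus and mixture weights (assumptions, not values from the post).
sources = {
    "web": ["web doc 1", "web doc 2"],
    "code": ["code doc 1", "code doc 2"],
    "books": ["book doc 1"],
}
weights = {"web": 0.6, "code": 0.3, "books": 0.1}

batch = sample_mixture(sources, weights, n_samples=1000)
counts = Counter(name for name, _ in batch)
print(counts)
```

Changing `weights` changes how often the model would see each kind of data, which is the knob the "mix" factor refers to, as far as I understand it.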