Some simple understandings of the large language model

The impact of ChatGPT is stunning enough, and GPT4 delivers even more accurate and impressive capabilities. The entire IT industry is currently being shaken up by large models, and every company is trying to develop its own LLM to keep up with the trend.

This makes me curious: what is the difference between LLMs and the deep learning we knew in the past? Here is some of my rough understanding.

From the battle of model structure to the battle of dataset and training skill

Pretrained on a HUGE dataset

Technically, what has changed in this new wave of large models? The answer is obvious: the huge size of both the models and the datasets.

Here are the model sizes of GPT and some other models. Notice that the parameter counts of all of them are outrageously large.

It is definitely a challenge to collect and preprocess such an amount of data; a good large model relies on both the quantity and the quality of its training data.
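To make the preprocessing point concrete, here is a toy sketch of the kind of cleaning and deduplication a pretraining corpus needs. Real pipelines use far more sophisticated techniques (fuzzy deduplication such as MinHash, language filtering, toxicity filtering); the function name and thresholds here are my own illustrative assumptions.

```python
import hashlib
import re

def clean_and_dedup(documents, min_words=5):
    """Toy preprocessing: normalize whitespace, drop very short
    fragments, and remove exact duplicates via content hashing."""
    seen = set()
    kept = []
    for doc in documents:
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text.split()) < min_words:
            continue  # quality filter: drop tiny fragments
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicates add no new signal
        seen.add(digest)
        kept.append(text)
    return kept
```

Even this trivial version shows the trade-off: every filter raises average quality but shrinks the corpus, and at the terabyte scale of the tables below, both sides of that trade-off matter.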

| Model | Amount of params | Training data | Release time |
| --- | --- | --- | --- |
| GPT | 117M | 5GB | Jun 2018 |
| GPT2 | Up to 1.5B | 40GB | Feb 2019 |
| GPT3 | Up to 175B | 45TB | May 2020 |
| GPT3.5 | details not released | details not released | Sep 2022 |
| GPT4 | details not released | details not released | Mar 2023 |

| Model | Amount of params | Training data | Release time |
| --- | --- | --- | --- |
| PaLM | Up to 540B | 3.6 trillion tokens | Apr 2022 |
| GLM | Up to 130B | 2.5TB | Aug 2022 |
| LLaMA | Up to 65B | 6TB | Feb 2023 |
| ERNIE3.0 | Over 100B | details not released | Mar 2023 |

How did people get good performance on a classical deep learning task? They used to focus on the model structure, and there were plenty of attempts at new architectures; throughout the development of neural networks, researchers mostly concentrated on improving the structure. But after the famous Transformer paper, Attention Is All You Need, structural changes gradually stabilised. The attention module scales well to deep networks, which made it possible to grow models larger and larger.

The achievement of GPT is no longer based on improvements to the model structure. In fact, the basic structure of GPT1, 2, 3, 3.5 and even 4 is the same: a Transformer decoder. What differs is the training method and the size of the models and datasets.
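The core block that gets stacked in that decoder-only structure is causal self-attention. Here is a minimal single-head sketch in NumPy (real models add multi-head attention, feed-forward layers, residual connections, and normalisation; the function name and the simplification to one head are mine):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention, the core block repeated in a
    GPT-style (decoder-only) Transformer. x has shape (seq_len, d_model)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i,
    # which is what makes the model a left-to-right decoder.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v
```

Because nothing in this block depends on a fixed depth or width, the same structure can simply be stacked deeper and made wider, which is exactly why the battle moved from architecture to scale.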

Apart from OpenAI, other tech companies and universities are also delving into large models, for example PaLM, GLM, LLaMA, ERNIE Bot and so on. Looking at these models in detail, their basic structures are not very different.

Model structures are no longer the main battlefield.

SFT & RLHF

With such a large-scale dataset, the pretrained model has learned the patterns of natural language, but this does not mean it can understand human commands or give appropriate responses. It can only continue text, doing sentence completion according to the statistical patterns of its pretraining corpus.
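A crude way to see what "completion by statistical laws" means is a bigram model. This is of course not how GPT works internally (GPT predicts tokens with a deep network), but it makes the limitation visible: the model only continues whatever you type, it does not answer you.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-word frequencies: the crudest 'statistical law' a
    language model can extract from its corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def complete(counts, prompt, n_words=3):
    """Greedy completion: always append the most frequent continuation.
    There is no notion of instructions, only continuation."""
    words = prompt.split()
    for _ in range(n_words):
        nxt = counts.get(words[-1])
        if not nxt:
            break
        words.append(nxt.most_common(1)[0][0])
    return " ".join(words)
```

Ask such a model "What is 2+2?" and it will happily continue with whatever followed similar words in its corpus, not with an answer; that gap is exactly what the next training stages close.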

Therefore, further training is needed. The second stage is instruction tuning. The pretrained model holds a lot of knowledge, but it does not know how to answer your questions or follow your instructions; this stage teaches it to do so. Technically, it is a kind of SFT (Supervised Fine-Tuning).
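Mechanically, an instruction-tuning example is just a prompt template concatenated with a human-written response, where the loss is computed only on the response tokens. The sketch below uses a toy word-level "tokenizer" and a made-up template; the label value -100 as an "ignore" marker is a common framework convention, but the rest is my own illustration.

```python
IGNORE = -100  # conventional "ignore this position" label id

def build_sft_example(instruction, response, vocab):
    """Build one instruction-tuning example: concatenate a prompt
    template with the response, and mask prompt positions in the labels
    so the loss only supervises the response tokens."""
    prompt = f"Instruction: {instruction} Response:"
    prompt_ids = [vocab.setdefault(w, len(vocab)) for w in prompt.split()]
    response_ids = [vocab.setdefault(w, len(vocab)) for w in response.split()]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids, labels
```

The training objective is unchanged from pretraining (next-token prediction); only the data changes, which is why SFT reuses the pretrained model and infrastructure almost as-is.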

However, the amount of SFT data available for instruction tuning is limited, and the resulting model still suffers from hallucinations and other safety issues. To address this, reinforcement learning methods have been applied to language models. We first train a reward model, which rates or ranks the answers produced by the large language model; from its training data, the reward model learns human preferences over answers. That is why the approach is called RLHF (Reinforcement Learning from Human Feedback). There are different ways to do it, e.g. BoN, DPO and PPO, and researchers keep trying new ones. The new challenge is becoming training skill rather than model structure.
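The reward model itself is typically trained on pairs of answers where humans picked a preferred one, with a pairwise ranking loss of the Bradley-Terry form (as used in the InstructGPT line of work). A minimal sketch, with the function name my own:

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    loss = -log(sigmoid(r_chosen - r_rejected)),
    which pushes the reward of the human-preferred answer above the
    reward of the rejected one. Uses log1p(exp(-m)) for stability."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))
```

When the chosen answer already scores much higher, the loss is near zero; when the two scores are equal, it is log 2. The trained reward model then supplies the scalar signal that PPO (or best-of-n selection) optimises against.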

More

Better training skill is the reason why OpenAI's models stand out. It is crucial, but there is even more challenging work to do, such as model distillation, multimodality, and inference acceleration. There is a lot worth exploring in deep learning, and I am optimistic about it.

These are my simple, naive understandings and opinions about large models. Please correct me if you notice any misinformation; I would appreciate it.

References

[Paper] Attention is All You Need
https://doi.org/10.48550/arXiv.1706.03762
[Paper] GPT-4 Technical Report
https://doi.org/10.48550/arXiv.2303.08774
[Paper] Training language models to follow instructions with human feedback
https://doi.org/10.48550/arXiv.2203.02155
[Paper] ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
https://doi.org/10.48550/arXiv.2107.02137
[Paper] LLaMA: Open and Efficient Foundation Language Models
https://doi.org/10.48550/arXiv.2302.13971
[Paper] Llama 2: Open Foundation and Fine-Tuned Chat Models
https://doi.org/10.48550/arXiv.2307.09288
[Paper] GLM: General Language Model Pretraining with Autoregressive Blank Infilling
https://doi.org/10.48550/arXiv.2103.10360
[Paper] GLM-130B: An Open Bilingual Pre-trained Model
https://doi.org/10.48550/arXiv.2210.02414
[Zhihu] 从零开始训练大模型 (Training Large Models from Scratch)
https://zhuanlan.zhihu.com/p/636270877

Author: SmallSquare
Posted on 2023-07-16, updated on 2023-11-12
https://smallsquare.github.io/Large-model/