With the recent success of generative artificial intelligence, Large Language Models (LLMs) are being continuously developed and improved, and they are driving notable economic and societal changes. The popular ChatGPT, developed by OpenAI, is a natural language processing model that lets users generate meaningful, human-like text. It can also answer questions, summarize long passages, write code and emails, and more. Other language models, such as the Pathways Language Model (PaLM) and Chinchilla, have also shown strong performance in mimicking humans.
Large language models use reinforcement learning for fine-tuning. Reinforcement Learning is a feedback-driven machine learning method based on a reward system: an agent learns how to behave in an environment by performing actions and observing the results. The agent receives positive feedback for every good action and a penalty for every bad one. LLMs like ChatGPT owe much of their exceptional performance to reinforcement learning.
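The reward-and-penalty loop described above can be sketched with a minimal toy agent, here an epsilon-greedy bandit. The action names, reward values, and hyperparameters are illustrative assumptions, not anything from ChatGPT's actual training:

```python
import random

# Minimal sketch of a reward-driven learning loop. The agent keeps a
# value estimate per action and nudges it toward the feedback it receives.
def run_bandit(num_steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    actions = ["good_answer", "bad_answer"]
    values = {a: 0.0 for a in actions}  # learned value estimates
    for _ in range(num_steps):
        # epsilon-greedy choice: mostly exploit, sometimes explore
        if rng.random() < 0.1:
            action = rng.choice(actions)
        else:
            action = max(values, key=values.get)
        # environment feedback: positive reward for the good action,
        # a penalty for the bad one
        reward = 1.0 if action == "good_answer" else -1.0
        values[action] += lr * (reward - values[action])
    return values

values = run_bandit()
```

After a couple of thousand steps the agent's value estimate for the rewarded action sits well above the penalized one, which is exactly the behavior the feedback loop is meant to produce.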
ChatGPT uses Reinforcement Learning from Human Feedback (RLHF) to fine-tune the model while minimizing bias. But why not supervised learning? The RLHF setup still relies on human-provided labels, rankings of candidate responses, to train the model, so why can't those labels be used directly in a supervised learning approach? Sebastian Raschka, an AI and ML researcher, shared some reasons in a tweet why reinforcement learning, rather than supervised learning, is used for fine-tuning.
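As a hedged sketch of how those human ranking labels feed into RLHF, the snippet below trains a toy one-parameter reward model with a pairwise, Bradley-Terry-style loss, a common formulation in RLHF pipelines. The linear model, features, and learning rate are invented for illustration:

```python
import math

# Pairwise preference loss: -log sigmoid(r_preferred - r_rejected).
# It is small when the preferred response already scores higher.
def pairwise_loss(r_preferred, r_rejected):
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

w = 0.0                              # reward-model parameter (toy, 1-D)
x_preferred, x_rejected = 1.0, -1.0  # invented features of two responses
lr = 0.5
for _ in range(50):
    diff = w * x_preferred - w * x_rejected
    sig = 1.0 / (1.0 + math.exp(-diff))
    # gradient of the pairwise loss with respect to w
    grad = -(1.0 - sig) * (x_preferred - x_rejected)
    w -= lr * grad

# After training, the human-preferred response receives the higher reward.
reward_gap = w * x_preferred - w * x_rejected
```

The point of the pairwise formulation is that the model only ever sees relative judgments ("response A is better than B"), which is precisely the kind of label human annotators can provide reliably.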
The first reason not to use supervised learning is that it only predicts ranks: it does not produce coherent answers, because the model simply learns to assign high scores to responses that resemble the training set, even when they are not coherent. RLHF, by contrast, is trained to estimate the quality of the generated response, not just a ranking score.

Sebastian Raschka shares the idea of reformulating the task as a constrained optimization problem using supervised learning, with a loss function that combines the returned-text loss and a reward score term. This would yield better quality in both the generated answers and the ranks. But this approach only works when the goal is to form question-answer pairs correctly; cumulative rewards are also necessary to enable coherent multi-turn conversations between the user and ChatGPT, which supervised learning cannot provide.

The third reason not to choose supervised learning is that it uses cross-entropy to optimize a token-level loss. At the token level, changing individual words in a response may have only a small impact on the overall loss, yet in the complex task of holding a coherent conversation, negating a single word can completely flip the context. Relying on supervised learning alone is therefore not enough, and RLHF is necessary to account for the context and coherence of the whole conversation.

Supervised learning can be used to train a model, but RLHF has been found to perform better empirically. A 2022 paper entitled "Learning to Summarize from Human Feedback" showed that RLHF outperformed supervised learning. This is because RLHF accounts for cumulative rewards over coherent conversations, which supervised learning cannot capture with its token-level loss function.

LLMs like InstructGPT and ChatGPT use both supervised learning and reinforcement learning, and the combination is crucial for achieving optimal performance. In these models, the model is first refined with SL and then further updated with RL.
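The token-level loss argument can be made concrete with a toy calculation: two responses that differ only by a single negation receive nearly identical cross-entropy-style scores, even though their meanings are opposite. The per-token probabilities below are invented for illustration:

```python
import math

# Assumed per-token probabilities under some toy language model.
token_prob = {"the": 0.9, "answer": 0.8, "is": 0.9, "not": 0.6, "correct": 0.7}

def token_level_loss(tokens):
    # mean negative log-likelihood over tokens (cross-entropy style)
    return -sum(math.log(token_prob[t]) for t in tokens) / len(tokens)

loss_affirm = token_level_loss(["the", "answer", "is", "correct"])
loss_negate = token_level_loss(["the", "answer", "is", "not", "correct"])
# The two losses differ only slightly, yet one response contradicts
# the other — the token-level objective is blind to that flip.
```

A response-level reward signal, as in RLHF, can penalize the contradictory answer as a whole instead of scoring it word by word.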
The SL stage allows the model to learn the basic structure and content of the task, while the RLHF stage refines the model’s responses for improved accuracy.
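This two-stage recipe can be illustrated with a deliberately tiny numeric toy: a one-parameter "model" is first fit to a labeled target (the SL stage), then refined by gradient ascent on a reward function (the RLHF-style stage). Every number and function here is an invented assumption, not an actual training API:

```python
def sl_stage(theta, labeled_target, lr=0.1, steps=100):
    # supervised stage: minimize (theta - target)^2 by gradient descent,
    # learning the basic structure of the task from labels
    for _ in range(steps):
        theta -= lr * 2 * (theta - labeled_target)
    return theta

def rlhf_stage(theta, reward_peak=1.2, lr=0.05, steps=100):
    # RL stage: maximize reward(theta) = -(theta - reward_peak)^2 by
    # gradient ascent, nudging the model away from the pure SL solution
    for _ in range(steps):
        theta += lr * (-2 * (theta - reward_peak))
    return theta

after_sl = sl_stage(0.0, labeled_target=1.0)  # SL lands near the label
final = rlhf_stage(after_sl)                  # RLHF shifts toward the reward peak
```

The ordering matters: the SL stage gives the RL stage a reasonable starting point, so the reward-driven updates only need to make a small refinement rather than learn the task from scratch.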
Tanya Malhotra is a senior at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills and a passionate interest in learning new skills, leading groups and managing work in an organized manner.