- Q* is reportedly a new model from OpenAI that emphasizes "planning": the model is required to show its reasoning steps and is evaluated on the accuracy of each step, rather than just the final answer.
- Media outlets have a pattern of overhyping AI advancements, but Q* might be a significant step toward AGI.
- This is a noteworthy development within OpenAI, and it could lead to another big OpenAI release.
Last week, The Information reported what seems to be the underlying story behind the recent palace intrigue at OpenAI: a breakthrough model called Q* (pronounced "Q-star") that might make GPT-4 look like child's play.
The first news cycle made it sound like a potential threat to humanity. So, is this a Terminator-level risk? Or is this just another round of AI sensationalism in the online discourse? It wouldn’t be the first time everyone freaked out over a supposedly major new AI discovery that turned out to be nothing.
In 2017, most major news outlets ran a story about Meta shutting down an AI after it developed its own language that humans couldn't understand. In reality: 1) this kind of language degeneration was not a new phenomenon; it is a common challenge in the training technique called "reinforcement learning", and 2) the AI they "shut down" is available for anyone to download on GitHub, so you can run this Skynet on your own computer.
In 2022, a Googler claimed LaMDA, the LLM behind Bard, was sentient. (It wasn't.) Even OpenAI has been accused of sensationalizing its own achievements. In February 2019, OpenAI released GPT-2 but withheld the largest version of the model until November 2019, citing concerns that it was too powerful and dangerous. However, it has been openly debated whether GPT-2 ever posed real danger, whether it was even the most dangerous model of its day, and whether withholding it was an effective way to avert any such danger.
At the time, many saw GPT-2 as less useful in real-world applications than purpose-trained models like BERT, and some jokingly called the decision to withhold the largest GPT-2, and the language around that decision, the ultimate marketing strategy.
So how much of the Q* hype is overblown? Are the news headlines "complete nonsense about Q*", as Meta's Chief AI Scientist put it? My take: Q* is being overhyped and catastrophized, but it is also a genuine and likely noteworthy development within OpenAI, and I think it could lead to another big OpenAI release.
Why Q* Is Different
We don't know exactly what Q* is, but The Q* Hypothesis, written by Nathan Lambert, who recently left Hugging Face where he was RLHF lead, gives a plausible description of what Q* could be. There are several key elements, but one stands out to me as the most compelling source of differentiated value: planning.
What is planning? When you did math in the first years of elementary school, your teacher likely gave you a binary correct/incorrect score based on the final answer for each problem. As you moved into more complicated math, your teachers probably asked you to “show your steps”, and you were graded on the accuracy of each step rather than the final answer alone. This is important for several reasons:
- If you work on a complicated math problem, and your teacher simply tells you the final output is wrong, you don’t know where to start learning to get to the correct answer next time. You don’t know what understanding you should keep, and what you should toss.
- It is possible to get the final answer right, but for the wrong reason (even a lucky guess). If you are not made aware that your underlying logic was wrong, you will still have a flawed mathematical understanding and will not reliably produce correct answers.
- If you get 90% of the work right but slip on the final step, and your teacher grades only the final output, the 0% you receive is not an accurate representation of your real understanding.
Planning is like requiring an LLM to show its steps. When we ask the model to generate not only the answer but also the steps leading up to the answer, and we then score the model based on the individual steps within the generated answer rather than the final answer alone, this becomes powerful in the same ways it is for humans learning math.
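To make the contrast concrete, here is a toy sketch of outcome-based versus process-based scoring. All names and the scoring scheme are illustrative assumptions, not OpenAI's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class GradedSolution:
    steps: list[str]          # the model's intermediate reasoning steps
    step_correct: list[bool]  # a verifier's judgment of each step
    final_correct: bool       # is the final answer right?

def outcome_reward(sol: GradedSolution) -> float:
    """Outcome supervision: one score for the entire output."""
    return 1.0 if sol.final_correct else 0.0

def process_reward(sol: GradedSolution) -> float:
    """Process supervision: credit for every correct step."""
    return sum(sol.step_correct) / len(sol.step_correct)

# A solution with 9 correct steps and a slip on the last one:
sol = GradedSolution(
    steps=[f"step {i}" for i in range(10)],
    step_correct=[True] * 9 + [False],
    final_correct=False,
)
print(outcome_reward(sol))  # 0.0 -- the "90% right" work earns nothing
print(process_reward(sol))  # 0.9 -- partial credit for the correct steps
```

The process score gives the model a training signal that distinguishes "almost right" from "wrong from the start", just as step-by-step grading does for a student.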
OpenAI released an important planning paper in May of this year titled Let's Verify Step by Step. In the paper, they asked LLMs to show their work and then graded the model on each individual step rather than on the entire output as a whole (see the figure below). They also released an open-source dataset called PRM800K, which contains 800,000 step-level feedback labels you can use to train a model this way. This is not how the popular LLMs you know and love, like GPT-4 and Llama 2, are trained: during training, those models receive a single score for the entire generated output rather than for the individual steps within it.
In the figure above from OpenAI’s Let’s Verify Step by Step paper, all the steps are right except the last one. The paper suggests we reward the model for all the steps it got right rather than simply punishing the model for getting the final answer wrong.
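A per-step-labeled training sample in this spirit might look like the following. This is a deliberately simplified, hypothetical structure for illustration; the real PRM800K schema is richer:

```python
# Hypothetical, simplified per-step-labeled sample in the spirit of
# process supervision (the actual PRM800K schema is more detailed):
sample = {
    "problem": "What is 3 * (4 + 5)?",
    "steps": [
        {"text": "4 + 5 = 9",                 "rating": 1},   # correct step
        {"text": "So the expression is 3 * 9", "rating": 1},   # correct step
        {"text": "3 * 9 = 28",                "rating": -1},  # wrong final step
    ],
}

# Process supervision rewards the two correct steps instead of
# assigning the whole solution a single failing score.
correct_steps = sum(1 for s in sample["steps"] if s["rating"] == 1)
print(correct_steps, "of", len(sample["steps"]), "steps rewarded")
```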
This is exciting. It fundamentally changes the model’s learning process. As Meta’s Chief AI Scientist points out, “Pretty much every top lab (FAIR, DeepMind, OpenAI etc) is working on [planning] and some have already published ideas and results.” However, OpenAI has a history of scaling up research to exciting outcomes. The main innovations of GPT-2 and GPT-3 were scale. When OpenAI scaled up their RLHF paper in 2022, they gave us ChatGPT.
Q* Parallels in RLHF and the Release of ChatGPT
What excites me here is that it seems reminiscent of the process that led to the release of ChatGPT, which many see as the catalyst of the recent Gen AI hype, alongside the release of Stable Diffusion.
In March 2022, OpenAI released the InstructGPT paper, whose primary innovation was a training process called RLHF. Previously, during training, inputs from a dataset were fed to the model, and the model's outputs were compared to the reference outputs paired with those inputs. If a model output was textually similar to the reference, the model was rewarded; if it was textually different, the model was punished. But it is entirely possible to write something textually very different that is equally correct or preferable, and it is entirely possible to write something textually similar that is incorrect or low-quality. Textual similarity to a static dataset is not the best metric.

The key differentiation of RLHF is that, during training, rather than scoring model outputs by textual similarity to a static dataset, outputs are scored dynamically by another model that encodes human preferences. The model is therefore trained to produce outputs humans actually prefer, to generalize what it learns more effectively, and to be more creative and open-ended, which is one of the superpowers we love in ChatGPT. The results in the paper showed that a model trained with RLHF, though 100 times smaller than GPT-3, could produce equally preferable outputs.
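The contrast between static similarity scoring and a learned reward model can be sketched in a few lines. This is a toy illustration: the stand-in preference model below is hypothetical, whereas real RLHF reward models are trained neural networks:

```python
def similarity_reward(output: str, reference: str) -> float:
    """Static-dataset scoring: token overlap (Jaccard) with one reference.
    A different-but-correct phrasing can score near zero here."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def rlhf_reward(output: str, reward_model) -> float:
    """RLHF-style scoring: a learned preference model rates the output
    itself, so no reference text is needed and paraphrases aren't penalized."""
    return reward_model(output)

# Hypothetical stand-in for a trained preference model: it likes answers
# that explain the actual mechanism, regardless of exact wording.
def toy_preference_model(text: str) -> float:
    return 0.9 if "scatter" in text.lower() else 0.2

reference  = "The sky is blue because air molecules scatter blue light."
paraphrase = "Shorter wavelengths are scattered most strongly, giving the sky its color."

print(similarity_reward(paraphrase, reference))       # low: few shared tokens
print(rlhf_reward(paraphrase, toy_preference_model))  # high: a preferred answer
```

The paraphrase is a perfectly good answer, yet similarity scoring punishes it; a preference model judges the text on its own merits, which is the shift RLHF introduced.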
What are the parallels here? In March 2022, OpenAI released a paper whose key innovation was to reward models differently during training. It gave the model superpowers, and they scaled up the research to release ChatGPT later that same year, which catalyzed wild hype around Gen AI. In May 2023, OpenAI released a paper whose key innovation was to reward models differently during training. In November 2023, rumors surfaced that this innovation had been scaled up internally and had created something powerful. Could this lead to another highly impactful release like ChatGPT? We'll have to wait and see.
Alexander Whedon is an AI architect in the A.Team network, specializing in NLP and ML.