The ability of machines to learn and improve over time is a major selling point of modern artificial intelligence. However, new research suggests that ChatGPT may actually be performing worse at certain tasks as time goes on.
A study by researchers from Stanford University and UC Berkeley found significant drift in the performance of OpenAI’s GPT-3.5 and GPT-4 models, which power ChatGPT. The researchers tested the March 2023 and June 2023 versions of these models on a variety of tasks: solving math problems, answering sensitive questions, completing opinion surveys, answering multi-hop reasoning questions, generating code, taking a medical licensing exam, and solving visual reasoning challenges.
The results showed considerable variability in the models’ answers. In particular, GPT-4’s performance on math problems declined between March and June. For instance, when asked to identify prime numbers using chain-of-thought (CoT) prompting, GPT-4’s accuracy dropped from 84% in March to just 51.1% in June, while GPT-3.5’s accuracy improved from 49.6% to 76.2%. The researchers noted that GPT-4’s behavior changed: in March it followed the reasoning steps laid out by the CoT prompt, but in June it skipped those steps and simply produced incorrect answers.
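One appeal of the primality task is that every answer can be checked exactly. For readers unfamiliar with it, a minimal reference checker might look like the sketch below; the sample numbers and the scoring snippet are illustrative and not drawn from the study’s dataset.

```python
def is_prime(n: int) -> bool:
    """Check primality by trial division up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

# Score hypothetical yes/no model answers against ground truth.
answers = {19997: "yes", 20003: "no"}  # illustrative model outputs, not study data
accuracy = sum(
    (answer == "yes") == is_prime(n) for n, answer in answers.items()
) / len(answers)
print(accuracy)
```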
A similar decline was observed on another math task, identifying “happy” numbers: GPT-4’s accuracy dropped from 83.6% to 35.2%, while GPT-3.5’s improved from 30.6% to 48.2%. Again, GPT-4 did not follow the requested CoT behavior.
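A happy number is one where repeatedly replacing the number with the sum of the squares of its digits eventually reaches 1 (for example, 19 → 82 → 68 → 100 → 1); numbers that instead fall into a cycle are unhappy. Like primality, the property is easy to verify exactly. A minimal sketch, not taken from the study’s materials:

```python
def is_happy(n: int) -> bool:
    """n is 'happy' if repeatedly summing the squares of its digits reaches 1,
    and unhappy if the process enters a cycle instead."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = sum(int(d) ** 2 for d in str(n))
    return n == 1

# The first few happy numbers: 1, 7, 10, 13, 19, 23, 28, 31, ...
print([n for n in range(1, 32) if is_happy(n)])
```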
The researchers also found changes in the models’ willingness to answer sensitive or potentially dangerous questions. GPT-4’s response rate dropped from 21% in March to 5% in June, while GPT-3.5’s response rate increased slightly, from 2% to 5%. This suggests that GPT-4 incorporated a stronger safety filter, while GPT-3.5 became less cautious.
The opinion survey task also showed notable shifts. GPT-4 became much less likely to offer an opinion, with its response rate dropping from 97.6% in March to just 22.1% in June, and its responses grew more verbose. GPT-3.5, by contrast, maintained a consistent response rate and verbosity.
On tasks requiring multi-hop reasoning, the two models moved in opposite directions: GPT-4’s accuracy improved significantly, from 1.2% in March to 37.8% in June, while GPT-3.5’s dropped from 22.8% to 14%.
When it came to generating code, both models’ outputs declined in quality. Over 50% of GPT-4’s code was executable in March, but this dropped to just 10% in June, and GPT-3.5 experienced a similar decline. The researchers noted that both models began adding extra, non-code text around their Python code, which made it non-executable as generated.
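Going by that description, an otherwise correct snippet wrapped in markdown-style fences would fail a check for direct executability. The sketch below illustrates the idea with a hypothetical model output and a crude compile-based check; it is not the researchers’ actual evaluation harness.

```python
import re

def strip_fences(text: str) -> str:
    """Remove leading/trailing ```...``` markers if the output is fenced."""
    return re.sub(r"^```(?:python)?\s*|\s*```$", "", text.strip())

def compiles(source: str) -> bool:
    """True if the text parses as Python (a rough proxy for 'directly executable')."""
    try:
        compile(source, "<model-output>", "exec")
        return True
    except SyntaxError:
        return False

raw = "```python\nprint(sum(range(10)))\n```"  # hypothetical fenced model output
print(compiles(raw))                # False: the fences break parsing
print(compiles(strip_fences(raw)))  # True: the code itself is fine
```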
There was a small drop in GPT-4’s performance on the US Medical Licensing Exam (USMLE), from 86.6% to 82.4%, while GPT-3.5 remained relatively stable, decreasing by less than 1%. Interestingly, which questions GPT-4 got wrong also shifted between March and June, with the model flipping from correct to incorrect answers on some questions over time.
The visual reasoning tests showed minor improvements in both models, but their accuracy rates (27.4% for GPT-4 and 12.2% for GPT-3.5) were still relatively low. The researchers also observed that the models sometimes gave incorrect answers to questions they had answered correctly before.
The study clearly shows that the performance of both GPT-3.5 and GPT-4 has fluctuated significantly over a short period, improving in some areas and declining in others. The researchers highlight the importance of continuously evaluating and monitoring these models, especially since it is unclear how updates to models like ChatGPT affect their behavior over time.
This research also underscores the challenges of improving large language models. Enhancing a model in one area, for example by fine-tuning it on additional data, can produce unexpected side effects that degrade performance elsewhere. The divergent trends observed in GPT-3.5 and GPT-4 are a testament to this complexity.