Teaching an old monkey new tricks

The moment OpenAI released ChatGPT, quickly followed by competing products from Meta/Facebook, Google and ... everybody else, the question became: who has the bigger Schwartz, aka which Large Language Model (LLM) is better? A slew of benchmark tests came out and the race began.

Soon we learned that AI could pass bar exams, and OpenAI posted how GPT-4 aced the Uniform Bar Exam, the LSAT and many others.

Here is an overview of the top 10 LLMs and their capabilities. This is how they are described: one 'excels at handling complex math problems', another handles everything 'from basic arithmetic to advanced calculus', yet another can 'handle complex, multi-step calculations while prioritizing logical accuracy and transparency'.

There is not much left for mathematicians to do. Just keep asking questions and AI will do the rest.

Except. There is a wrinkle.

On November 8th, Epoch AI released FrontierMath, a math benchmark designed to test the limits of AI. And the news is no longer that great for any of the LLMs. Actually, the news is really bad for all of them: every model solved less than 2% of the problems in the benchmark. While the famous LLMs hit near-perfect scores on other benchmarks like GSM8K or MATH, here they barely register.

Why is that?

This is the fundamental problem with training algorithms. When companies train their LLMs, they throw every possible piece of content at them. Remember my post, Oil and data. There will be blood. I noted there that researchers argued, and the media happily (and wrongly) repeated, that we will soon run out of data to train AI.

The assumption is that the more we feed AI, the smarter it will become. A second assumption was that as the models get bigger and bigger, they will be able to solve bigger and more complex problems.

It appears that is not the case. The team behind the FrontierMath benchmark created hundreds of original, never-before-seen math problems. In their own words: 'These problems span major branches of modern mathematics—from computational number theory to abstract algebraic geometry—and typically require hours or days for expert mathematicians to solve.'

Each answer can also be quickly validated: it is a single number, one that is difficult to guess or arrive at by chance. To illustrate, here are two answers: 3677073 and 1876572071974094803391179. You either know or you don't.
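To make the verification point concrete, here is a minimal sketch of how automated grading against single-number answers might work. The problem records, field names and submissions below are invented for illustration; this is not FrontierMath's actual harness or data format.

```python
# Minimal sketch of exact-match grading for integer-answer problems.
# The problem IDs and answers here are hypothetical examples,
# not real FrontierMath problems.

problems = [
    {"id": "p1", "answer": 3677073},
    {"id": "p2", "answer": 1876572071974094803391179},
]

def grade(submissions: dict[str, int]) -> float:
    """Return the fraction of problems whose submitted answer
    exactly matches the known ground-truth integer."""
    correct = sum(
        1 for p in problems
        if submissions.get(p["id"]) == p["answer"]
    )
    return correct / len(problems)

# With answers this large, guessing or partial credit cannot help:
# the submission is either exactly right or it scores nothing.
print(grade({"p1": 3677073, "p2": 123}))  # 0.5
```

The design choice matters: because scoring is a pure exact match on a hard-to-guess number, the benchmark cannot be gamed by plausible-sounding prose the way open-ended evaluations can.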

This is the gap between how AI is portrayed and its actual capabilities. Feed it all the legal documents, then ask it legal questions, and you get the impression that you don't need a lawyer. Watch it answer Grade 12 math questions and be amazed. But consider that the model has been trained on years' worth of math tests, and it is suddenly not that amazing. Give it something it has not been trained on and it becomes an expensive piece of hardware.

It is still a well-trained circus monkey and regardless of how you dress it, it is still a monkey.

Once we rediscover the old habit of evaluating technology for what it is and what can actually be done with it, we will move forward much faster. This is a lesson that keeps recurring.
