Bragging about foggy AI
How do we resolve the cognitive dissonance around AI? We know it is awesome and can solve all of humanity's problems. But we also know it doesn't work and is far from production-ready.
Here is an example. I came across two pieces this week which I hope will demonstrate my point.
The first was an article 'This Startup Is Trying to Test How Well AI Models Actually Work'. The second was a blog post 'The Art of Product Management in the Fog of AI' by Tomasz Tunguz.
The first article talks about a startup, vals.ai, which 'is working to build a third-party review system for vetting the performance of AI in areas like accounting, law and finance.'
(As an aside, I note that this company was started by two guys who dropped out of a master's program at Stanford University.
I always wonder why the story has to include this detail. Is it the dropping out that matters? Is dropping out of Stanford a prerequisite for starting a company? I am sure there is enough material here for a separate post.)
Back to the topic.
These two gentlemen identified the need for a transparent benchmark to evaluate Large Language Models (LLMs) against standard criteria within a particular context. To begin, they chose three areas - Legal, Tax and Finance. To test, they chose 15 models, ranging from famous proprietary ones to open-source and lesser-known alternatives. Then they started sending questions and ranking the answers.
Demonstrating their (unfinished) engineering background, they included in the stats things like the speed of the answer or the cost associated with it. Completely - at this stage - irrelevant metrics. The only number that matters is accuracy. For legal reasoning tasks, the highest accuracy was 77%. For legal contract-related questions it was 74%. For tax-related questions it was 55%. And for corporate finance, 65%. And to make it clear again - these were the best results.
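To make the arithmetic concrete, here is a minimal, hypothetical sketch of such a benchmark loop in Python. Nothing here is vals.ai's actual code; the grading function and the toy data are stand-ins I made up. The only real point is that accuracy is correct answers divided by total questions - and that the grading step is where a human expert has to sit.

```python
# A minimal, hypothetical sketch of a benchmark loop like the one described.
# ask() and the data below are illustrative stand-ins, not vals.ai's method.

def grade(model_answer: str, expert_answer: str) -> bool:
    # Naive exact-match grading. In reality a human expert (a lawyer)
    # has to approve each answer, which is exactly the bottleneck.
    return model_answer.strip().lower() == expert_answer.strip().lower()

def accuracy(ask, questions, expert_answers) -> float:
    # ask is any callable mapping a question string to an answer string.
    correct = sum(grade(ask(q), a) for q, a in zip(questions, expert_answers))
    return correct / len(questions)

# Toy usage: a fake "model" that gets one of two questions right.
questions = ["Is a verbal contract binding?", "What is the statute of frauds?"]
answers = ["sometimes", "a writing requirement for certain contracts"]
fake_model = lambda q: "sometimes"
print(f"accuracy: {accuracy(fake_model, questions, answers):.0%}")  # 50%
```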
From Tomasz Tunguz's blog post, I will quote this: 'How does one design a product experience in the fog of AI? The answer lies in embracing the unpredictable nature of AI'.
Does the above raise a question in your mind? Something like 'Is this AI thing just another hype cycle that will end in disaster?' Sadly, the whole AI industry seems determined to prove that it is. And statements like 'AI Intelligence Will Be Smarter Than Some Humans Within a Year' from Mr. Musk are not helping.
The startup mentioned above has taken on an impossible task. How can you measure or benchmark a domain where the number of possible questions is enormous and where evaluating the accuracy of an answer requires a human expert (we call them lawyers) to approve it? And even then, I am sure another human expert (another lawyer) would argue that the answer is not that accurate. If you think that can't happen, you have obviously never been part of a contract dispute negotiation. From my experience, I can also add that what is not in the contract is just as important as what is in it. Try asking AI what's missing ...
True, there will be arguments that the models can improve - get better, faster, more accurate. And I am sure you have already seen articles showing that AI can pass the bar exam. That is not a drinking game but the test aspiring lawyers must pass to practice law in their respective jurisdictions. Sounds amazing, until you read that the AI got just 76% of the answers correct.
And this is the problem. We expect 100% accuracy from technology. A straight answer to a straight question. The same answer to the same question. Bragging about 76% accuracy is just that - bragging. Trying to use the technology in production, where quality and accuracy are paramount, while being told that we have to embrace 'the unpredictable nature of AI', will only result in our questioning everything that comes out.
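To put a number on why 76% is not bragging material, here is a back-of-the-envelope calculation. It assumes each answer errs independently - an illustrative simplification, not a measured property of any model - but it shows how per-answer accuracy compounds across a multi-question task like reviewing a contract:

```python
# Illustrative arithmetic only, assuming independent errors per answer.
per_answer_accuracy = 0.76
for n in (1, 5, 10, 20):
    all_correct = per_answer_accuracy ** n  # chance every answer is right
    print(f"{n:2d} questions all answered correctly: {all_correct:.1%}")
# 1 -> 76.0%, 5 -> ~25.4%, 10 -> ~6.4%, 20 -> ~0.4%
```

At 76% per answer, the chance of getting a ten-question review entirely right is about 6%. That is the gap between a headline and production.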
To make AI a genuinely helpful, recurring pattern in our work, we should be more explicit about what it is and what it is not. So far, it is much less than we are being told.