Oil and data. There will be blood
‘If it bleeds, it leads’ is the famous journalistic ethos. And if it doesn't bleed, make sure there is at least a doomsday scenario attached to the topic. Otherwise, nobody is interested. Never include good news!
The same paradigm is used by anyone writing about AI. The latest?
We could run out of data to train AI language programs - MIT Technology Review
Big Tech needs to get creative as it runs out of data to train its AI models. Here are some of its wildest solutions. - Business Insider
Researchers warn we could run out of data to train AI by 2026. What then? - The Conversation
Indeed, the sky is falling, and in a few years all the Nvidia AI chips will starve from data emptiness.
Reading the above headlines brought back a memory: the (famous) article in The Economist titled ‘The world’s most valuable resource is no longer oil, but data,’ which led me to draw a parallel with the end of the oil supply. Perhaps you have heard of the 'peak oil' concept and the associated predictions of when oil production will peak and there will be no more oil. Fun fact: the first such prediction dates from 1880, followed by another in 1919. Regardless of your age, you have heard it since your childhood. And yet, there is still oil to be extracted.
Now, back to the data drought.
Interestingly, the out-of-data articles cite as the source of the prediction a research paper titled 'Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning.'
In the paper's 'Key Takeaways' we learn that:
The way things are going, we will likely run out of language data between 2030 and 2050.
We are within one order of magnitude of exhausting high-quality data, and this will likely happen between 2023 and 2027.
Based on these trends, we will likely run out of vision data between 2030 and 2070.
It ends with the 'Conclusion' that 'if our assumptions are correct, data will become the main bottleneck for scaling ML models, and we might see a slowdown in AI progress as a result.'
But the last sentence puts a completely different spin on the story: 'However, as outlined, there are multiple reasons to doubt that these trends will continue as projected, such as the possibility of algorithmic innovations in data efficiency.'
Of course, this whole thing is nonsense. I don't know about oil, but let me ponder data scarcity for a moment.
A few things the research paper mentioned:
This projection naively assumes that the past trend will be sustained indefinitely. In reality, there are constraints on the amount of data that a model can be trained on. One of the most important constraints is computing availability. This is because increasing the amount of training data for a given model requires additional computation, and the amount of computing is limited by the supply of hardware and the cost of buying or renting that hardware.
The vast majority of data is user-generated and is stored in social media platforms, blogs, forums, etc. There are three factors that determine how much content is produced in a given period: human population, internet penetration rate, and the average amount of data produced by each internet user.
Yes, ever since I bought my first computer, I have always been constrained by the speed of the processor, the size of the memory, and the capacity of the storage. Interestingly enough, the real constraint was my ability to use those resources to their full capacity. And by the time I learned that, the technology had scaled well past what I needed. And that allowed me to think bigger. If you are looking for a recurrent pattern, this is one for you. I understand that researchers in academia have a different view of the world due to their budgetary constraints. Microsoft and OpenAI see things differently.
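To make the compute constraint concrete, here is a back-of-envelope sketch. It is my own illustration, not the paper's, and it relies on a widely used rule of thumb (roughly 6 FLOPs per parameter per training token for dense transformer models); the budget and model size below are hypothetical.

# How many training tokens can a given compute budget afford?
# Assumes the common approximation: training cost ~ 6 * params * tokens FLOPs.

def affordable_tokens(flops_budget: float, n_params: float) -> float:
    """Tokens trainable under a FLOPs budget for a model with n_params parameters."""
    return flops_budget / (6 * n_params)

# Hypothetical numbers, for illustration only.
budget = 1e24   # total training FLOPs available
params = 70e9   # a 70-billion-parameter model
print(f"{affordable_tokens(budget, params):.2e} tokens trainable")  # ~2.38e12

In other words, compute decides how much data a model can actually consume; and, as noted above, compute has a habit of scaling past whatever we need.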
The second constraint researchers see is the number of people who are connected to the Internet and producing content. One data point here: I remember hearing Oracle's CEO say that Facebook/Meta purchased several petabytes of storage before every Halloween to hold all the pictures people took during the wild parties. That was about 10 years ago. Do you think that number is decreasing or increasing? True, we might debate the quality, but to appreciate caviar, you have to have a hot dog from time to time.
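The paper's framing of content production (population times internet penetration times per-user output) reduces to a one-line model. Here is a sketch with illustrative placeholder numbers, not actual measurements:

# Yearly user-generated content, following the paper's three factors.
def yearly_content_bytes(population: float, penetration: float, per_user_bytes: float) -> float:
    return population * penetration * per_user_bytes

# Placeholder inputs, for illustration only.
total = yearly_content_bytes(8e9, 0.65, 5e9)  # 8B people, 65% online, ~5 GB each per year
print(f"{total:.1e} bytes per year")  # ~2.6e19, i.e. tens of exabytes

Every factor in that product has been growing for decades. Volume is not the issue; quality is a separate question.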
Speaking of high- and low-quality data, you might remember a post by yours truly, 'The case of the snobby AI (researcher),' in which a researcher's complaint was quoted: 'Choi points out that, of course, the bot has flaws, but that we live in a world where people are constantly asking for answers from imperfect people, and tools — like Reddit threads and Google.'
That's where we get to the real topic of this data-scarcity research: what is high-quality content, and who are the perfect people generating it? I am sure the researchers will let us know.
In the meantime, let me assure you that we are not running out of content. What we don't have yet are algorithms that understand the content we feed them. Until now, all we have learned to do is predict the next word in response to a question. We will be reusing the same content over and over, every single time we come up with better and better algorithms. Data scarcity, in this context, is a non-existent problem.
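To show how bare-bones 'predict the next word' really is, here is a toy bigram model: it just counts which word follows which in a tiny corpus and predicts the most frequent successor. A deliberately crude sketch, nothing like a production LLM.

from collections import Counter, defaultdict

# Count word pairs in a tiny corpus.
corpus = "data is the new oil and data is the fuel and data is everything".split()
successors = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    successors[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    if word not in successors:
        return "<unknown>"
    return successors[word].most_common(1)[0][0]

print(predict_next("data"))  # 'is', since 'is' follows 'data' three times

Real models are vastly more sophisticated, but the training objective has been the same: predict the next token. Better algorithms can keep re-mining the exact same content.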
Yet, there is another problem with the data used for training.
Let me take you back to the year 2000. Yes, there was the Y2K excitement and the infamous dot-com bubble, but there was also the hype around Knowledge Management (KM). One of the players was IBM with its product, Raven. The proponents of this technology came to the executives of large companies with an amazing pitch. It went like this: 'Did you know that only 20% of your employees' knowledge is stored in some document or email? Imagine: when an employee leaves the organization, they leave with 80% of the knowledge. Your company's knowledge!!' As you can imagine, it was an easy sell. The ROI was clear. Install the KM system, suck as much knowledge out of their brains as possible, and you don't need these people anymore.
You can read about the fallacy of this approach in more detail, but the fundamental issue is nicely summarized by Dave Snowden's three principles:
We always know more than we can say, and we will always say more than we can write down.
We only know what we know when we need to know it.
Knowledge can only be volunteered; it cannot be conscripted.
As you can imagine, KM projects failed miserably, and the only winners were the companies selling the technology.
Assuming the 20/80 split between written content and what we carry in our heads is correct, it raises the question: Is what's written enough to train AI to know anything and everything?
What is definitely missing is the context of when and how the written content should be used.
I accept the argument that we are no longer talking only about text, but also about images, videos, and conversations. We have far more data now than we had in the year 2000. My question, then, is: 'Did that improve the contextual awareness?'
We haven't even started with data collection yet.
The recurrent pattern? Without context, learning does not lead to the desired outcomes. And, like oil, most of the data is still in the ground waiting to be discovered.