The business model of scraping

Scraping: a word that rose to prominence thanks to AI.

Yet it has been around since the first search-engine bot visited the first website.

Ever since, hundreds of bots have kept visiting every site, all the time. For what purpose? Some belong to a search engine like Google or Bing. Some scan for a particular type of content, or for changes in content.

Examples would be a bot scanning for changes in product prices or for new press releases. Others search for website vulnerabilities so they can either break those sites or recruit them into a botnet.

Everyone has a reason to scrape.

There is also an entire industry, search engine optimization (SEO), built around tuning your website to be visited, scraped and indexed so that it ranks as high as possible in Google search.

While providing all the information for free and begging Google to visit you, you hope that in exchange, people will visit your website, where they can either buy something or be served ads (conveniently provided by Google).

That business model has worked for decades, and, for Google, it was so successful that it led to several lawsuits by the US Department of Justice (DoJ). It also led to the demise of the newspaper industry. True, media organizations complained about it, but they were really the ones who failed to see the Internet as a threat.

That was then, and now we are in a new era of scraping. Welcome to the world of AI.

After the AI companies scraped the internet and started using it to train their models, everyone was up in arms that these companies had stolen the content and were using it for unauthorized purposes.

The concern is that AI will reproduce and replace all creative work, with no attribution and no remuneration.

People creating original content are the most vocal about it. They are demanding a stop to the practice of scraping their content and asking that their work not be used to train any model going forward.

As you can imagine, understanding the technology's capabilities, controlling it and enforcing any restrictions is beyond the means of most people.

Fortunately, we have technology companies that can do this on their behalf, for a fee. To humor you: one of the ways you, as a website owner, can ask that your content not be scraped by these bots is the file 'robots.txt'. It lets you explicitly state which parts of the website may be scraped and by which agent.
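As an illustration, a minimal robots.txt might look like this (the paths are hypothetical; Googlebot and GPTBot are the real crawler names used by Google and OpenAI):

```
# Let Google's crawler in everywhere except a hypothetical admin area
User-agent: Googlebot
Disallow: /admin/

# Ask OpenAI's training crawler to stay off the whole site
User-agent: GPTBot
Disallow: /

# Everyone else may crawl everything
User-agent: *
Disallow:
```

The file sits at the root of the site (e.g. example.com/robots.txt), and each crawler is expected to fetch it and match itself against the User-agent entries before requesting anything else.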

It is entirely up to the particular scraping agent to observe these rules. Compliance is purely voluntary. It is like putting a sign in front of a bank saying, 'Please don't take any money from this big pile.'
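Python's standard library even ships a parser for these rules; whether a crawler bothers to consult it is another matter. A minimal sketch, using hypothetical rules and bot names:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for some site
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved crawler checks before fetching a URL...
print(parser.can_fetch("GPTBot", "https://example.com/article"))   # False
print(parser.can_fetch("SomeBot", "https://example.com/article"))  # True

# ...but nothing stops a rude one from ignoring the answer entirely.
```

The check is entirely client-side: the server never learns whether the bot asked, and there is no enforcement beyond the bot's own good manners.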

Cute.

Another self-defense mechanism doomed to fail is poisoning your content: deliberately corrupting it so that any model trained on it learns garbage.

Content and website owners are trying, together with the developers of these AI models, to work out a new business model.

So far, these attempts are just that — early attempts with no actual realistic outcomes.

A few examples:

  • Reddit in AI content licensing deal with Google - If there is a deal that will catch the attention of the DoJ, this is the one. A search engine with a monopoly on search pays a content provider for exclusive access to its content. The irony is that all the content there is created by individual users who will definitely not get paid, nor will their content receive any attribution. Even worse, these people now can't delete their content from Reddit. Reddit owns it. And you are wondering why Google dropped its 'Don't be evil' motto.

  • Perplexity will soon start selling ads within AI search - This is just wishful thinking. It will take a long time before any meaningful traffic, and hence revenue, comes out of this. Also, attribution, contracts and payments will become very complicated very quickly. At larger scale, it will become unworkable.

  • OpenAI Strikes a Deal to License News Corp Content - The challenge with all of the above is that the technology is still in its infancy. Many people also associate the term AI with the large language models (LLMs) that provide the illusion of early AGI. Combine that with attempts to apply controls that (barely) worked in the past, and you get these nonsensical arrangements.


It won't work. The tech is changing too fast, we have no production-ready (i.e. money-making) applications, and the way we train these AI models will keep changing. Creating restrictions is either a window-dressing exercise or a short-term monetary grab.

The recurrent pattern here? Once tech matures, the products will get defined and the business model will follow. And it will be a simple one. Right now, there is just too much noise.
