We need an evolved robots.txt and regulations to enforce it

The robots.txt file is a simple text file that tells web robots (like search engine crawlers) which pages on your site they may crawl and which they may not. The convention dates back to 1994, was formalised as RFC 9309 in 2022, and is still widely used today.

Some examples of rules you can put in a robots.txt file are:

User-agent: *
Disallow: /private/

This rule tells all web robots to not crawl the /private/ directory.

User-agent: Googlebot
Disallow: /users/

This rule tells Googlebot to not crawl the /users/ directory.
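
As a rough illustration of how a well-behaved crawler might evaluate the two rules above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The bot name `SomeBot` and the `example.com` URLs are placeholders made up for this example, and the two rules are assumed to live in the same file.

```python
from urllib.robotparser import RobotFileParser

# The two example rules from above, combined into one file (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /users/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "SomeBot" is a hypothetical crawler that only matches the "*" group.
print(parser.can_fetch("SomeBot", "https://example.com/private/notes.html"))    # False
print(parser.can_fetch("SomeBot", "https://example.com/users/alice"))           # True

# Googlebot has its own group, so only that group applies to it.
print(parser.can_fetch("Googlebot", "https://example.com/users/alice"))         # False
print(parser.can_fetch("Googlebot", "https://example.com/private/notes.html"))  # True
```

Note the last line: a crawler with its own group follows only that group, so the `*` rule for /private/ no longer applies to Googlebot unless it is repeated there. Also, nothing in this check is binding: the crawler asks the question voluntarily and is free to ignore the answer.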

Then AI came

In the age of AI, the existing robots.txt specification is not enough to express the rules for web crawlers. We can only tell an agent whether it may or may not crawl a certain path; we cannot say what it is allowed to do with the content it finds there.

In my opinion we should be able to express more detailed rules, like:

Indexing: should a web crawler be able to index the content?
Caching: should a web crawler be able to cache the content?
LLM Training: should a web crawler be able to use the content to train a language model?
Summarising: should a web crawler be able to summarise the content?
etc…
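
To make the idea more concrete, here is a purely hypothetical sketch of how per-purpose rules could be modelled. Nothing here is part of robots.txt or any existing standard: the purpose names, the `PurposePolicy` class, and the bot name `ExampleSearchBot` are all invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical purposes taken from the list above; not part of any standard.
PURPOSES = {"indexing", "caching", "llm-training", "summarising"}


@dataclass
class PurposePolicy:
    """Maps a user-agent name (or "*" for everyone else) to denied purposes."""

    denied: dict[str, set[str]] = field(default_factory=dict)

    def is_allowed(self, user_agent: str, purpose: str) -> bool:
        if purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {purpose}")
        # Mirror robots.txt semantics: a specific group replaces the "*" group.
        rules = self.denied.get(user_agent, self.denied.get("*", set()))
        return purpose not in rules


# Deny LLM training to everyone, but give one search bot its own empty group.
policy = PurposePolicy(
    denied={
        "*": {"llm-training"},
        "ExampleSearchBot": set(),
    }
)

print(policy.is_allowed("SomeBot", "indexing"))               # True
print(policy.is_allowed("SomeBot", "llm-training"))           # False
print(policy.is_allowed("ExampleSearchBot", "llm-training"))  # True
```

Whether a specific group should replace or extend the `*` group is a design choice; this sketch copies the existing robots.txt behaviour only because it is already familiar.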

Some of these uses were not possible in the past, and it should be up to website owners to decide whether their content may be used in such ways.

Enforcing the rules

In addition to more detailed rules, we need new regulations to enforce them. It looks like the robots.txt file is not enough to stop certain companies from doing what they want.

As someone recently found out, Perplexity AI is using a fake user agent to crawl websites, pretending to be a regular user. This is a clear violation of the rules specified in the robots.txt file. The claim has since been confirmed by Wired and by MacStories.

Conclusion

As we have seen, having good rules is not enough if they are not enforced. In particular, we need regulators to handle complaints from content owners and to fine companies that do not respect the rules (like Perplexity AI), because small content creators cannot afford to take legal action against big companies.

As with everything, it’s never “the tool”, but rather “how you use it”. AI itself can bring innovation in certain fields, but this can’t be done at the expense of other people’s work and rights.

Disclaimer

Yes, of course the cover image has been generated with AI. It’s far from perfect, but it’s still better than my drawing skills. The content of this article, on the other hand, is 100% human generated.