How to block ChatGPT and OpenAI from your site

Published in SEO・2023-04-03・4 min read

OpenAI and ChatGPT can use your site's content to "learn" (and to provide answers). If you want to (try to) block them, read this article.

I answer the main question first, then add a few other comments.

How do you stop ChatGPT and OpenAI from crawling your site?

ChatGPT (and OpenAI's other products, and by extension Bing) uses multiple data sources (datasets) to train its learning algorithms. According to my research, there are many of them, including at least these:

Common Crawl
WebText2
Books1 and Books2
Wikipedia

The only dataset you can try to act on is Common Crawl.

So if you want to try to keep ChatGPT away from your site, you must block Common Crawl's crawler with a directive in the robots.txt file. Of course, this only affects future crawls...

For Common Crawl, the User Agent name to use in the robots.txt file is CCBot.

To block this crawler (and therefore ChatGPT) from the entire site, add these 2 lines:

User-agent: CCBot
Disallow: /

To explicitly allow it to crawl the entire site, add these 2 lines:

User-agent: CCBot
Disallow:

Of course, adapt this to your situation. Read my guide to the robots.txt file to learn how to block crawling of a directory, a subdomain, or other more specific cases.
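Before deploying rules like these, it can help to check how a parser actually interprets them. Here is a minimal sketch using Python's standard `urllib.robotparser`. The domain `www.example.com`, the `/private/` directory and the helper name `is_allowed` are assumptions made up for illustration, and Common Crawl's own parser may not behave exactly like Python's. Note that the field name is spelled `User-agent`, with a hyphen.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rules from the article: block CCBot on the entire site.
BLOCK_ALL = """\
User-agent: CCBot
Disallow: /
"""

# Rules from the article: explicitly allow CCBot everywhere.
ALLOW_ALL = """\
User-agent: CCBot
Disallow:
"""

# Hypothetical variant: block CCBot only under /private/.
BLOCK_DIR = """\
User-agent: CCBot
Disallow: /private/
"""

if __name__ == "__main__":
    site = "https://www.example.com"  # placeholder domain
    print(is_allowed(BLOCK_ALL, "CCBot", site + "/article.html"))
    print(is_allowed(ALLOW_ALL, "CCBot", site + "/article.html"))
    print(is_allowed(BLOCK_DIR, "CCBot", site + "/private/page.html"))
    print(is_allowed(BLOCK_DIR, "CCBot", site + "/article.html"))
    # A crawler that is not named in the file is not affected by these rules.
    print(is_allowed(BLOCK_ALL, "Googlebot", site + "/article.html"))
```

This only tells you what a well-behaved crawler should do with your file. It does not prove that a given crawler respects it.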

According to the Common Crawl documentation:

The Common Crawl crawler also takes nofollow into account for URL discovery. You can stop it from following any of the links on a page by adding the tag <meta name="CCBot" content="nofollow">
It also takes sitemaps into account (those listed in the robots.txt file)
Its IP address is one of those used by Amazon's cloud services (AWS)
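If you add this meta tag, you may want to confirm that it is really present in the HTML your server returns. The sketch below uses Python's standard `html.parser` to look for a `<meta name="CCBot">` tag whose content includes nofollow. The class and function names (`RobotsMetaParser`, `has_nofollow`) are hypothetical, and the sketch only inspects markup you pass to it. It says nothing about how the crawler itself reads the page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="..."> tags addressed to one bot."""

    def __init__(self, bot_name: str) -> None:
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() != self.bot_name:
            return
        # content may hold several comma-separated directives.
        for directive in attributes.get("content", "").split(","):
            self.directives.add(directive.strip().lower())


def has_nofollow(html: str, bot_name: str = "CCBot") -> bool:
    """Return True if the page carries a nofollow meta directive for bot_name."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return "nofollow" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="CCBot" content="nofollow"></head></html>'
    print(has_nofollow(page))
```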

Read on: further down I explain why all this is probably futile...

How do you stop ChatGPT plugins from crawling your site?

Since ChatGPT also supports plugins, other robots can crawl your site. This happens when a ChatGPT user asks for content located on your site to be used.

In this case, the crawler identifies itself as ChatGPT-User:

The User Agent to use in the robots.txt file is ChatGPT-User
The full agent name (visible in the logs) is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot

To block ChatGPT plugins from crawling the entire site, add these 2 lines:

User-agent: ChatGPT-User
Disallow: /

To explicitly allow ChatGPT plugins to crawl the entire site, add these 2 lines:

User-agent: ChatGPT-User
Disallow:
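To see whether these crawlers visit your site at all, you can look for their names in your server logs. Here is a minimal sketch that counts the log lines mentioning CCBot or ChatGPT-User. The log lines in the example are invented for illustration (only the ChatGPT-User agent string comes from this article), and the function name `count_ai_crawler_hits` is hypothetical. Real log formats vary, so a simple substring match is an assumption rather than a robust parser.

```python
from collections import Counter

# User Agent tokens named in the article; matching is case-insensitive.
AI_CRAWLER_TOKENS = ("CCBot", "ChatGPT-User")


def count_ai_crawler_hits(log_lines):
    """Count the log lines that mention each token in AI_CRAWLER_TOKENS."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


if __name__ == "__main__":
    # Invented sample lines, loosely shaped like a web server access log.
    sample = [
        '203.0.113.5 - - "GET /a.html HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 '
        '(KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"',
        '203.0.113.9 - - "GET /b.html HTTP/1.1" 200 "CCBot/2.0"',
        '203.0.113.7 - - "GET /c.html HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    for token, count in count_ai_crawler_hits(sample).items():
        print(token, count)
```

Keep in mind that a User Agent string can be spoofed, so these counts are only an indication.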

Is it really possible to prevent ChatGPT and OpenAI from crawling a site?

No, you cannot guarantee that your content will not be used by ChatGPT and OpenAI.

First, your content may have already been used. There is no way (currently) to remove content from a dataset.

Second, it is almost certain that your content is in datasets other than Common Crawl.

Finally, I guess there are probably other technical reasons why you can't guarantee that these AIs won't exploit your content...

Is it a good idea to want to block OpenAI and its chat?

Basically, I think it's normal to want to control whether or not a third party has the right to exploit (for free) the content published on your site.

We have grown used to a kind of tacit agreement between search engines and site publishers. By default, publishers let search engines crawl and index their content, in exchange for free visibility in the results pages, and therefore for visitors.

In the case of AI-based tools, if none of their sources are indicated in the response provided to the user, then this type of tacit agreement no longer exists.

I have the impression that with ChatGPT plugins, it is much more likely that your content will be mentioned (if it has been crawled by these plugins).

I also note that Bing's conversational search (which leverages ChatGPT) mentions its sources (with links), but I get the impression that these are mostly pages found by Bingbot. If that is indeed the case, blocking ChatGPT changes nothing here.

But is excluding your site from these tools really the best thing to do? Aren't they also the future of search? If these tools ever come to mention their sources, not being there becomes a weakness in your search marketing strategy.

Sources:

Common Crawl FAQ
Official OpenAI documentation
Article published on Search Engine Journal
