An IPinfo perspective on real-time AI crawlers

With all the buzz around AI crawlers, I thought it would be interesting to look at what our data says about them. Existing AI crawler detection usually relies on an anti-bot mechanism that triggers after detecting bot-like behavior (for example, fast, large-scale site scraping).

Antibot :handshake: IPinfo

Although this anti-bot mechanism is effective, it cannot identify suspicious activity on first contact. Algorithmic bot detection needs a behavioral history to build up before it raises a CAPTCHA. So, until an AI crawler has accumulated enough suspicious behavior on the site, it can crawl content freely.
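To make the first-contact gap concrete, here is a minimal sketch of a behavior-based detector. It is not any specific product's algorithm: the `RateDetector` class, its threshold, and its window are assumptions chosen for illustration. It simply flags an IP once it exceeds a request budget inside a sliding time window.

```python
from collections import defaultdict, deque


class RateDetector:
    """Hypothetical sliding-window detector: flags an IP after too many requests."""

    def __init__(self, max_requests: int = 3, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        hits.append(now)
        # Drop requests that have fallen out of the time window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


detector = RateDetector(max_requests=3, window_seconds=10.0)
# 203.0.113.7 is a documentation-range address used as a placeholder.
print([detector.is_suspicious("203.0.113.7", t) for t in (0, 1, 2, 3, 4)])
# [False, False, False, True, True]
```

The first three requests pass unchallenged, which is exactly the window a crawler can use before any behavioral signal exists.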

So, the question becomes: what would be an effective model for stopping bots on first contact?

IP data can be used to prevent AI crawlers from accessing your site. You do not even have to observe how a visitor behaves on your site to know whether it is an AI crawler. You can detect it on first contact, based on the IP address, and immediately raise a CAPTCHA. This can be done by identifying the ASN type (available through IPinfo Core and IPinfo Plus) or residential proxy IP address usage.
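As a rough sketch of what a first-contact decision could look like, the function below takes an already-fetched IP record and decides whether to raise a CAPTCHA. The record shape is an assumption: the field names `asn_type` and `residential_proxy` are hypothetical placeholders, not IPinfo's actual response schema, so map them to whatever your IP data source returns.

```python
# ASN types that trigger a challenge. Treating "hosting" as the only such
# type is an assumption for this sketch; adjust it to your own policy.
CHALLENGE_ASN_TYPES = {"hosting"}


def should_challenge(record: dict) -> bool:
    """Decide on first contact, from IP metadata alone, whether to raise a CAPTCHA.

    `record` is a hypothetical, pre-fetched lookup result with two
    illustrative keys: "asn_type" (str) and "residential_proxy" (bool).
    """
    asn_type = str(record.get("asn_type", "")).lower()
    uses_residential_proxy = bool(record.get("residential_proxy", False))
    return asn_type in CHALLENGE_ASN_TYPES or uses_residential_proxy


# No behavioral history is needed: the very first request can be challenged.
print(should_challenge({"asn_type": "hosting"}))                         # True
print(should_challenge({"asn_type": "isp"}))                             # False
print(should_challenge({"asn_type": "isp", "residential_proxy": True}))  # True
```

Because the decision uses only the IP record, it can run before a behavior-based detector has seen any traffic, and the two can be combined.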

So, the anti-bot mechanism and IP data are not competing with each other; rather, IPinfo’s data fits into the larger crawler and scraping prevention effort.

Please note that AI crawling does distribute your knowledge and information to a wider audience. Blocking AI crawlers from your website is a personal, impactful, and subjective decision.

Using my website to detect AI crawlers

I have a website that is hosted on a VPS and served through NGINX. It is served over port 443, and I can see which IP addresses visit it using tcpdump.

$ sudo tcpdump -i eth0 port 443
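If the capture gets noisy, a small script can tally who is connecting. The sketch below assumes tcpdump's default text output, where each line contains `IP src.port > dst.port:`, and that the server port shows up as either `443` or `https`. The function name `count_visitors` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the "IP src.port > dst.port:" part of a default tcpdump text line.
LINE_RE = re.compile(r"\bIP6? (?P<src>\S+) > (?P<dst>\S+):")


def count_visitors(lines, server_ports=("443", "https")):
    """Count inbound packets per source host, ignoring the server's replies."""
    visitors = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        # tcpdump appends the port after the last dot of each endpoint.
        src_host, _, _ = match.group("src").rpartition(".")
        _, _, dst_port = match.group("dst").rpartition(".")
        if dst_port in server_ports and src_host:
            visitors[src_host] += 1
    return visitors


# Illustrative lines only; "my-vps" is a placeholder for the server's name.
sample = [
    "12:00:00.000001 IP rate-limited-proxy-108-177-70-10.google.com.50924 > my-vps.https: Flags [S], length 0",
    "12:00:00.000002 IP my-vps.https > rate-limited-proxy-108-177-70-10.google.com.50924: Flags [S.], length 0",
    "12:00:00.000003 IP rate-limited-proxy-108-177-70-10.google.com.50924 > my-vps.https: Flags [.], length 0",
]
for host, packets in count_visitors(sample).most_common():
    print(host, packets)
```

The tally counts packets rather than requests, so treat it only as a quick way to see which hosts stand out.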

My prompt for the conversational AI:

Extract the SEO data from <website_name>

OpenAI

It is a Microsoft IP address. It is actually well known that OpenAI uses Microsoft infrastructure to access sites.

Google Gemini

The hostname was:

rate-limited-proxy-108-177-70-10.google.com.50924

The IP address extracted from it is:
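One way to pull an address out of a name like this is sketched below. The trailing `.50924` is the source port that tcpdump appends. The sketch assumes the four dash-separated numbers in the reverse-DNS name mirror the client's IPv4 address, which is a naming convention rather than a guarantee, so confirm the result against a capture taken with name resolution disabled. The helper name `embedded_ipv4` is hypothetical.

```python
import ipaddress
import re
from typing import Optional

# Four dash-separated groups of 1-3 digits, e.g. "108-177-70-10".
DASHED_IPV4_RE = re.compile(r"(?<!\d)(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})(?!\d)")


def embedded_ipv4(name: str) -> Optional[str]:
    """Return the IPv4 address embedded in a hostname, or None if there is none."""
    match = DASHED_IPV4_RE.search(name)
    if not match:
        return None
    candidate = ".".join(match.groups())
    try:
        # Rejects impossible octets such as 999.
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None


print(embedded_ipv4("rate-limited-proxy-108-177-70-10.google.com.50924"))
# 108.177.70.10
```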

Anthropic Claude

These are Google IP addresses. Anthropic has its own ASN, AS399358 (Anthropic, PBC). However, it looks like they are using GCP for their crawling operations.

Deepseek or Edge Copilot

They said they do not have browsing capabilities.


It seems that when AI crawlers browse or crawl in real time, they use well-known cloud platforms. Therefore, detecting them could be fairly easy. However, if you suspect heavy traffic from AI crawlers on your website, feel free to check your logs with our newly launched API services: