So, I created IPguessr as a meme of sorts, but it is grown to be a very fun project.
I was generating IP addresses randomly and testing them against our API service to eliminate bogon IP addresses.
function getRandomIP() {
const octet = () => Math.floor(Math.random() * 256);
return `${octet()}.${octet()}.${octet()}.${octet()}`;
}
Then another challenge was that I was getting some of the bigger ASNs way too frequently, so when I generated these random IP addresses, I would filter them by ASN:
But I still got way too many common ASNs. For a 5 IP per round game, it felt like these IP addresses came very frequently.
Using the IPinfo Lite DB to see what ASN’s I could exclude
Let’s first look at the ASNs that we can remove from the Lite data downloads service in Snowflake.
The Lite DB download is our free IP to Country ASN database. The data is currently available as data download and on Snowflake and BigQuery. Please let us know if you would like the data in Databricks, AWS, or elsewhere as well.
Let’s do some vibe-coding
(or AI-Assited coding) to generate the SQL query using Snowflake CoPilot Cortex AI:
Prompt:
The count_ip UDF takes start_ip and end_ip as input. Use start_ip LIKE ‘%.%’ to filter IPv4 addresses. From all active IPv4 addresses (type != ‘inactive’), group by asn, as_name, and as_domain (excluding NULL ASNs) from the lite table. Count how many IPs each ASN owns and calculate their percentage of the total. Return the top 100 ASNs, ordered by IP count.
/* Generated by Snowflake Copilot */
WITH ip_counts AS (
SELECT
asn,
as_name,
as_domain,
SUM(COUNT_IP (start_ip, end_ip)) AS ip_count,
SUM(SUM(COUNT_IP (start_ip, end_ip))) OVER () AS total_ips
FROM
lite
WHERE
start_ip LIKE '%.%'
AND NOT asn IS NULL
GROUP BY
asn,
as_name,
as_domain
)
SELECT
asn,
as_name,
as_domain,
ip_count,
ROUND((ip_count * 100.0 / total_ips), 4) AS percentage_of_total
FROM
ip_counts
ORDER BY
ip_count DESC
LIMIT
100;
These 100 ASNs represent 58% of all the IP addresses in the Lite table. So, that means you kind of have to make at least 2 API calls to get 1 ASN that is not included here. However, that does not account for the Bogon IPs generated by the random IP function.
FYI, 15% of the IP addresses in the Lite DB have an asn
marked as null
. This is because we base our ASN information on BGP announcements. These IP addresses were not being announced by any ASN, hence they are null
. However, if you are interested in ownership information, you can take a look at our WHOIS data.
Out of curiosity, let’s take a look at a percentile-based distribution of IP counts by ASN
Snowflake SQL query to calculate the cumulative ownership of IP addresses by ASN in percentile bands.
SELECT *,
count_ip(start_ip, end_ip) AS ip_count
FROM lite
WHERE start_ip LIKE '%.%'
AND NOT asn IS NULL
),
asn_totals AS (
SELECT asn, SUM(ip_count) AS total_ips
FROM active_ipv4
WHERE asn IS NOT NULL
GROUP BY asn
),
total_ip_space AS (
SELECT SUM(total_ips) AS total_ips
FROM asn_totals
),
ranked_asns AS (
SELECT
asn,
total_ips,
SUM(total_ips) OVER (ORDER BY total_ips DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
ROW_NUMBER() OVER (ORDER BY total_ips DESC) AS asn_rank
FROM asn_totals
),
percentile_cutoffs AS (
SELECT
ranked_asns.asn_rank,
ranked_asns.running_total,
ROUND(100.0 * ranked_asns.running_total / total_ip_space.total_ips, 1) AS cumulative_percent
FROM ranked_asns, total_ip_space
),
bucketed_percentiles AS (
SELECT MIN(asn_rank) AS min_asns_needed,
FLOOR(cumulative_percent / 10) * 10 AS percentile
FROM percentile_cutoffs
WHERE cumulative_percent <= 100
GROUP BY FLOOR(cumulative_percent / 10) * 10
)
SELECT percentile, min_asns_needed
FROM bucketed_percentiles
WHERE percentile!=0
ORDER BY percentile;
PERCENTILE | MIN_ASNS_NEEDED |
---|---|
10 | 2 |
20 | 5 |
30 | 12 |
40 | 28 |
50 | 57 |
60 | 107 |
70 | 216 |
80 | 533 |
90 | 2258 |
100 | 57107 |
An interesting pattern you can see here is that: From 10% to nearly 80%, the minimum ASN count keeps doubling.
Instead of ASN, what about we choose as_domain
?
PERCENTILE | MIN_AS_DOMAINS_NEEDED |
---|---|
10 | 2 |
20 | 4 |
30 | 8 |
40 | 15 |
50 | 32 |
60 | 65 |
70 | 134 |
80 | 329 |
90 | 1390 |
100 | 47566 |
This is rather interesting, as now we have a much smaller amount of IP metadata to evaluate if we want to filter them out.
To make the game more interesting, we will exclude the top 90% of ASNs from the game, which is represented by 1,390 as_domains
(through 2258 ASNs).
WITH active_ipv4 AS (
SELECT *,
count_ip(start_ip, end_ip) AS ip_count
FROM lite
WHERE start_ip LIKE '%.%'
AND as_domain is not null
),
as_domain_totals AS (
SELECT as_domain, SUM(ip_count) AS total_ips
FROM active_ipv4
WHERE as_domain IS NOT NULL
GROUP BY as_domain
),
ranked_as_domains AS (
SELECT
as_domain,
total_ips,
ROW_NUMBER() OVER (ORDER BY total_ips DESC) AS as_domain_rank
FROM as_domain_totals
)
SELECT as_domain
FROM ranked_as_domains
WHERE as_domain_rank <= 1390
ORDER BY as_domain_rank;
In the next community post, we will look into how we can take advantage of the Lite API’s unlimited API requests feature to filter out these ASNs without building an API preprocessor.