ClickHouse is a very fast columnar database which is well-suited to analytics use cases. For example, you might use ClickHouse to log requests made to a website. With IPinfoβs free databases you can enhance ClickHouse abilities by enabling IP to Country and IP to ASN lookups in SQL queries.
In this tutorial we will see how to integrate IPinfo Free Country + ASN into ClickHouse and lookup million of IP addresses within milliseconds.
Prerequisites
- IPinfo Free Country + ASN database in CSV format
- IPinfo CLI to convert IP ranges to CIDR notation
- Docker to run a ClickHouse server
If you already have a running ClickHouse installation, you can use it and skip the Docker section.
Step 1 β Prepare the database
IPinfoβs free database in CSV format contains IP ranges described by the start_ip
and end_ip
column. However, ClickHouse expects ranges in CIDR notation.
For example, the range 1.0.0.0-1.0.0.10
is equivalent to the following three prefixes in CIDR notation: 1.0.0.0/29
, 1.0.0.8/31
and 1.0.0.10/32
.
Fortunately, this is easy to do with the IPinfo CLI:
ipinfo range2cidr country_asn.csv > country_asn.cidr.csv
Before
# head country_asn.csv
start_ip,end_ip,country,country_name,continent,continent_name,asn,as_name,as_domain
204.25.255.0,204.25.255.255,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
136.228.60.0,136.228.60.255,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
After
# head country_asn.cidr.csv
cidr,country,country_name,continent,continent_name,asn,as_name,as_domain
204.25.255.0/24,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
136.228.60.0/24,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
We now have a cidr
column in place of the start_ip
and end_ip
columns.
Step 2 β Start a ClickHouse instance
(If you already have a running ClickHouse installation, you can skip this step.)
The easiest way to start a ClickHouse instance is to use the official Docker image:
docker run -d -it --name clickhouse clickhouse/clickhouse-server
To view the server logs run:
docker logs clickhouse
To stop and delete the container run:
docker rm -f clickhouse
Step 3 β Load the database
To enable efficient IP lookups, the database must be loaded in a radix tree. This is possible with ClickHouse by using a dictionary. A dictionary is a table which is stored in-memory with a layout optimized for certain kind of lookups. In our case, we want to use the ip_trie
layout.
Dictionaries can be created from multiple sources. To keep things simple, we will create our dictionary from a local file.
To do this, first copy the database, in CIDR notation, inside the container in the user_files
directory:
docker cp country_asn.cidr.csv clickhouse:/var/lib/clickhouse/user_files/
Then, start a SQL REPL by running:
docker exec -it clickhouse clickhouse client
And create the dictionary by running the following query:
CREATE DICTIONARY ipinfo
(
cidr String,
country String,
country_name String,
continent String,
continent_name String,
asn String,
as_name String,
as_domain String
)
PRIMARY KEY cidr
SOURCE(FILE(PATH '/var/lib/clickhouse/user_files/country_asn.cidr.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 3600)
LAYOUT(IP_TRIE)
The LIFETIME(MIN 0 MAX 3600)
parameter indicates that the dictionary will be kept in memory at-most 3600 seconds without queries. Feel free to increase or reduce these values depending on your setup.
If you are using your own installation, see the user_files_path
server settings to know the location of the user_files
directory.
Step 4 β Query the database
ClickHouse provides the dict*
functions to query dictionaries. In our case weβre interested by the dictGet
function.
Note that the first query will take around 30 seconds as the dictionary gets loaded in memory. However subsequent queries will be extremely fast.
SELECT dictGet('ipinfo', 'country_name', IPv4StringToNum('1.1.1.1')) AS country
-- ββcountryββββββββ
-- β United States β
-- βββββββββββββββββ
SELECT dictGet('ipinfo', 'as_name', IPv4StringToNum('1.1.1.1')) AS as_name
-- ββas_nameβββββββββββ
-- β Cloudflare, Inc. β
-- ββββββββββββββββββββ
Step 5 β Use with real data
To show how fast ClickHouse dictionaries are, we will load 6.7M IP addresses from the IPv6 Hitlist Service.
To do this, first create the table:
CREATE TABLE ipv6_hitlist (ip IPv6) ENGINE=MergeTree ORDER BY ip
And insert the data (note that ClickHouse can insert data directly from an URL!):
INSERT INTO ipv6_hitlist
SELECT *
FROM url('https://alcatraz.net.in.tum.de/ipv6-hitlist-service/open/responsive-addresses.txt.xz', CSV)
SETTINGS input_format_csv_skip_first_lines = 1
We can then easily lookup countries and autonomous systems:
SELECT
dictGet('ipinfo', 'country_name', ip) AS country,
COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
-- ββcountryβββββββββ¬βcount()ββ
-- β France β 2037099 β
-- β Germany β 1235990 β
-- β United States β 1216283 β
-- β China β 379006 β
-- β Netherlands β 205644 β
-- β United Kingdom β 203483 β
-- β Japan β 187566 β
-- β Russia β 128275 β
-- β Brazil β 110815 β
-- β Singapore β 107681 β
-- ββββββββββββββββββ΄ββββββββββ
-- 10 rows in set. Elapsed: 0.105 sec. Processed 6.79 million rows, 108.69 MB (64.55 million rows/s., 1.03 GB/s.)
SELECT
dictGet('ipinfo', 'as_name', ip) AS as_name,
COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
-- ββas_nameββββββββββββββββββββββββββββ¬βcount()ββ
-- β Free SAS β 1880351 β
-- β DigitalOcean, LLC β 297871 β
-- β Deutsche Telekom AG β 268174 β
-- β Host Europe GmbH β 232593 β
-- β Akamai Connected Cloud β 192009 β
-- β Akamai International B.V. β 146435 β
-- β Comcast Cable Communications, LLC β 123381 β
-- β Deutsche Glasfaser Wholesale GmbH β 117109 β
-- β Amazon.com, Inc. β 103056 β
-- β OVH SAS β 98425 β
-- βββββββββββββββββββββββββββββββββββββ΄ββββββββββ
-- 10 rows in set. Elapsed: 0.137 sec. Processed 6.79 million rows, 108.69 MB (49.72 million rows/s., 795.55 MB/s.)
SELECT
dictGet('ipinfo', 'asn', ip) AS asn,
COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
-- ββasnβββββββ¬βcount()ββ
-- β AS12322 β 1880351 β
-- β AS14061 β 297871 β
-- β AS3320 β 268173 β
-- β AS20773 β 224090 β
-- β AS63949 β 192009 β
-- β AS20940 β 145545 β
-- β AS60294 β 117109 β
-- β AS16276 β 98165 β
-- β AS16509 β 93524 β
-- β AS197540 β 85089 β
-- ββββββββββββ΄ββββββββββ
-- 10 rows in set. Elapsed: 0.113 sec. Processed 6.79 million rows, 108.69 MB (60.00 million rows/s., 959.99 MB/s.)
6.7M rows processed in 0.1s!
Step 6 β Cleanup
To remove the dictionary:
DROP DICTIONARY ipinfo
To delete the server container:
docker rm -f clickhouse