Lookup IP geolocation and ASN with ClickHouse and IPinfo's free database

ClickHouse is a very fast columnar database which is well-suited to analytics use cases. For example, you might use ClickHouse to log requests made to a website. With IPinfo’s free databases you can enhance ClickHouse abilities by enabling IP to Country and IP to ASN lookups in SQL queries.

In this tutorial we will see how to integrate IPinfo Free Country + ASN into ClickHouse and lookup million of IP addresses within milliseconds.

Prerequisites

If you already have a running ClickHouse installation, you can use it and skip the Docker section.

Step 1 – Prepare the database

IPinfo’s free database in CSV format contains IP ranges described by the start_ip and end_ip column. However, ClickHouse expects ranges in CIDR notation.

For example, the range 1.0.0.0-1.0.0.10 is equivalent to the following three prefixes in CIDR notation: 1.0.0.0/29, 1.0.0.8/31 and 1.0.0.10/32.

Fortunately, this is easy to do with the IPinfo CLI:

ipinfo range2cidr country_asn.csv > country_asn.cidr.csv

Before

# head country_asn.csv
start_ip,end_ip,country,country_name,continent,continent_name,asn,as_name,as_domain
204.25.255.0,204.25.255.255,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
136.228.60.0,136.228.60.255,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net

After

# head country_asn.cidr.csv
cidr,country,country_name,continent,continent_name,asn,as_name,as_domain
204.25.255.0/24,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net
136.228.60.0/24,US,United States,NA,North America,AS401307,Michigan Statewide Educational Network,mich.net

We now have a cidr column in place of the start_ip and end_ip columns.

Step 2 – Start a ClickHouse instance

(If you already have a running ClickHouse installation, you can skip this step.)

The easiest way to start a ClickHouse instance is to use the official Docker image:

docker run -d -it --name clickhouse clickhouse/clickhouse-server

To view the server logs run:

docker logs clickhouse

To stop and delete the container run:

docker rm -f clickhouse

Step 3 – Load the database

To enable efficient IP lookups, the database must be loaded in a radix tree. This is possible with ClickHouse by using a dictionary. A dictionary is a table which is stored in-memory with a layout optimized for certain kind of lookups. In our case, we want to use the ip_trie layout.

Dictionaries can be created from multiple sources. To keep things simple, we will create our dictionary from a local file.

To do this, first copy the database, in CIDR notation, inside the container in the user_files directory:

docker cp country_asn.cidr.csv clickhouse:/var/lib/clickhouse/user_files/

Then, start a SQL REPL by running:

docker exec -it clickhouse clickhouse client

And create the dictionary by running the following query:

CREATE DICTIONARY ipinfo
(
    cidr           String,
    country        String,
    country_name   String,
    continent      String,
    continent_name String,
    asn            String,
    as_name        String,
    as_domain      String
)
PRIMARY KEY cidr
SOURCE(FILE(PATH '/var/lib/clickhouse/user_files/country_asn.cidr.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 3600)
LAYOUT(IP_TRIE)

The LIFETIME(MIN 0 MAX 3600) parameter indicates that the dictionary will be kept in memory at-most 3600 seconds without queries. Feel free to increase or reduce these values depending on your setup.

If you are using your own installation, see the user_files_path server settings to know the location of the user_files directory.

Step 4 – Query the database

ClickHouse provides the dict* functions to query dictionaries. In our case we’re interested by the dictGet function.

Note that the first query will take around 30 seconds as the dictionary gets loaded in memory. However subsequent queries will be extremely fast.

SELECT dictGet('ipinfo', 'country_name', IPv4StringToNum('1.1.1.1')) AS country
-- β”Œβ”€country───────┐
-- β”‚ United States β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SELECT dictGet('ipinfo', 'as_name', IPv4StringToNum('1.1.1.1')) AS as_name
-- β”Œβ”€as_name──────────┐
-- β”‚ Cloudflare, Inc. β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 5 – Use with real data

To show how fast ClickHouse dictionaries are, we will load 6.7M IP addresses from the IPv6 Hitlist Service.

To do this, first create the table:

CREATE TABLE ipv6_hitlist (ip IPv6) ENGINE=MergeTree ORDER BY ip

And insert the data (note that ClickHouse can insert data directly from an URL!):

INSERT INTO ipv6_hitlist
SELECT *
FROM url('https://alcatraz.net.in.tum.de/ipv6-hitlist-service/open/responsive-addresses.txt.xz', CSV)
SETTINGS input_format_csv_skip_first_lines = 1

We can then easily lookup countries and autonomous systems:

SELECT
    dictGet('ipinfo', 'country_name', ip) AS country,
    COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

-- β”Œβ”€country────────┬─count()─┐
-- β”‚ France         β”‚ 2037099 β”‚
-- β”‚ Germany        β”‚ 1235990 β”‚
-- β”‚ United States  β”‚ 1216283 β”‚
-- β”‚ China          β”‚  379006 β”‚
-- β”‚ Netherlands    β”‚  205644 β”‚
-- β”‚ United Kingdom β”‚  203483 β”‚
-- β”‚ Japan          β”‚  187566 β”‚
-- β”‚ Russia         β”‚  128275 β”‚
-- β”‚ Brazil         β”‚  110815 β”‚
-- β”‚ Singapore      β”‚  107681 β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

-- 10 rows in set. Elapsed: 0.105 sec. Processed 6.79 million rows, 108.69 MB (64.55 million rows/s., 1.03 GB/s.)
SELECT
    dictGet('ipinfo', 'as_name', ip) AS as_name,
    COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

-- β”Œβ”€as_name───────────────────────────┬─count()─┐
-- β”‚ Free SAS                          β”‚ 1880351 β”‚
-- β”‚ DigitalOcean, LLC                 β”‚  297871 β”‚
-- β”‚ Deutsche Telekom AG               β”‚  268174 β”‚
-- β”‚ Host Europe GmbH                  β”‚  232593 β”‚
-- β”‚ Akamai Connected Cloud            β”‚  192009 β”‚
-- β”‚ Akamai International B.V.         β”‚  146435 β”‚
-- β”‚ Comcast Cable Communications, LLC β”‚  123381 β”‚
-- β”‚ Deutsche Glasfaser Wholesale GmbH β”‚  117109 β”‚
-- β”‚ Amazon.com, Inc.                  β”‚  103056 β”‚
-- β”‚ OVH SAS                           β”‚   98425 β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

-- 10 rows in set. Elapsed: 0.137 sec. Processed 6.79 million rows, 108.69 MB (49.72 million rows/s., 795.55 MB/s.)
SELECT
    dictGet('ipinfo', 'asn', ip) AS asn,
    COUNT(*)
FROM ipv6_hitlist
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

-- β”Œβ”€asn──────┬─count()─┐
-- β”‚ AS12322  β”‚ 1880351 β”‚
-- β”‚ AS14061  β”‚  297871 β”‚
-- β”‚ AS3320   β”‚  268173 β”‚
-- β”‚ AS20773  β”‚  224090 β”‚
-- β”‚ AS63949  β”‚  192009 β”‚
-- β”‚ AS20940  β”‚  145545 β”‚
-- β”‚ AS60294  β”‚  117109 β”‚
-- β”‚ AS16276  β”‚   98165 β”‚
-- β”‚ AS16509  β”‚   93524 β”‚
-- β”‚ AS197540 β”‚   85089 β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

-- 10 rows in set. Elapsed: 0.113 sec. Processed 6.79 million rows, 108.69 MB (60.00 million rows/s., 959.99 MB/s.)

6.7M rows processed in 0.1s!

Step 6 – Cleanup

To remove the dictionary:

DROP DICTIONARY ipinfo

To delete the server container:

docker rm -f clickhouse
1 Like

I recently made the mmdb-to-clickhouse tool to automate the above :slight_smile:

Example usage

First start a ClickHouse instance:

docker run --name clickhouse --rm -d -p 9000:9000 clickhouse/clickhouse-server

Then download the example MMDB file:

wget https://github.com/maxmouchet/mmdb-to-clickhouse/raw/main/example.mmdb

And run mmdb-to-clickhouse:

./mmdb-to-clickhouse -dsn clickhouse://localhost:9000 -mmdb example.mmdb -name example_mmdb -test

The output should look like the following:

2023/12/21 12:18:10 Schema: network String, country String, partition Date
2023/12/21 12:18:10 Creating example_mmdb_history
2023/12/21 12:18:10 Creating example_mmdb
2023/12/21 12:18:10 Dropping partition 2023-12-21
2023/12/21 12:18:10 Inserted 1 rows
2023/12/21 12:18:10 Running test query: SELECT dictGet('example_mmdb', 'country', IPv6StringToNum('1.1.1.1'))
2023/12/21 12:18:10 Test query result: WW

This will create two tables:

  • example_mmdb_history: a partitioned table which keeps the last 30 days of history by default (see the -ttl option)
  • example_mmdb: an in-memory IP trie dictionary which always uses the latest partition from example_mmdb_history. This dictionary enables very fast IP lookups.

Open a REPL and inspect the tables:

docker exec -it clickhouse clickhouse client
SHOW TABLES
-- β”Œβ”€name─────────────────┐
-- β”‚ example_mmdb         β”‚
-- β”‚ example_mmdb_history β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SELECT * FROM example_mmdb_history
-- β”Œβ”€network───┬─country─┬──partition─┐
-- β”‚ 0.0.0.0/0 β”‚ WW      β”‚ 2023-12-21 β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SELECT * FROM example_mmdb
-- β”Œβ”€network───┬─country─┬──partition─┐
-- β”‚ 0.0.0.0/0 β”‚ WW      β”‚ 2023-12-21 β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SELECT dictGet('example_mmdb', 'country', IPv6StringToNum('1.1.1.1')) AS country
-- β”Œβ”€country─┐
-- β”‚ WW      β”‚
-- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

To clean up just remove the ClickHouse instance:

docker rm -f clickhouse
3 Likes