Hacktogether: Setting up DuckDB to use IPinfo's IP database using Python [Bad code, looking for feedback]


So, DuckDB recently announced a new data type called UHUGEINT. It is 16 bytes (128 bits) and unsigned.

They had a proposal to support network addresses as a native data type, but that is stuck in limbo: Feature Request: Network Address datatype · Issue #881 · duckdb/duckdb (github.com)

So, I thought: 16 bytes is enough to hold any IPv4 or IPv6 address as an integer, so why not take a crack at DuckDB using our free IP to Country ASN database?

Disclaimers

There are certain constraints, though:

  • I am not very well versed in DuckDB’s performance internals. I mainly use it for ad-hoc analysis.
  • This whole thing is just an experiment. I doubt the DuckDB database (even with index) will be more efficient than the MMDB database.
  • The free IP to Country ASN database does not have a JOIN_KEY column natively, while our other IP databases have that.
  • This will be a better solution than Pandas/SQLite-based solutions and, in some instances, better than PostgreSQL and MySQL, depending on your use case.
  • Polars has a data type of DECIMAL that supports 128 bits, but I have not explored it. DuckDB vs. Polars performance is way out of the scope of one weekend and should be left to people smarter than me.
  • I am unclear about how the Python API-based create_function mechanism works with the built-in ipaddress library.

Download the DB

I will refer to the updated filename reference page on the IPinfo docs.

Convert the IP address to an integer

I am using the create_function method from DuckDB’s Python API: Python Function API - DuckDB

import ipaddress
from duckdb.typing import VARCHAR, UHUGEINT

# Convert an IPv4/IPv6 address string to its unsigned integer value
ip_to_int = lambda ip: int(ipaddress.ip_address(ip))
con.create_function("to_int", ip_to_int, [VARCHAR], UHUGEINT)
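For intuition, here is what the conversion yields on its own, using only the built-in ipaddress module: IPv4 addresses fall within 32 bits, while IPv6 addresses need the full 128 bits that UHUGEINT covers.

```python
import ipaddress

# IPv4 fits in 32 bits...
print(int(ipaddress.ip_address("8.8.8.8")))  # 134744072

# ...while IPv6 uses the full 128-bit space, hence UHUGEINT
print(int(ipaddress.ip_address("2001:4860:4860::8888")))
```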

Remember this is an OLAP database

As DuckDB is an OLAP database, rows should not be UPDATEd or ALTERed in place. Instead, you create new tables and drop the old ones. According to my limited testing, this is the faster approach.

con.execute('''
    CREATE TABLE country_asn AS
    SELECT
        *,
        TO_INT(start_ip) AS start_ip_int,
        TO_INT(end_ip) AS end_ip_int
    FROM country_asn_raw
''')
con.execute("DROP TABLE country_asn_raw")

Limited interoperability with Python’s data

Even though we have created the function, my initial vision was to enrich a stream of IPs, or a large set of IP addresses held in a Python list. From my limited exploration, however, DuckDB cannot join directly against a Python list. As far as I can tell, this is impossible unless you convert the data to a Pandas DataFrame or create a temporary table in DuckDB.

So, at this moment, we can only look up individual IP addresses.

Size issue

After creating the index on the start_ip_int and end_ip_int columns, the DuckDB database is 334 MB, while the MMDB database is only 45 MB. For comparison, the uncompressed CSV dataset is about 212 MB.


This was fun. If you can share some comments and feedback, that would be great.

  • Am I doing anything wrong?
  • Is there any opportunity to bring our data to the users of DuckDB?
  • Does the “big data is dead” rhetoric work out in the space of data-focused cybersecurity?

If there is enough demand, I can probably do regular pre-indexed DuckDB versions of the IP to Country ASN database uploads, which will include the function.

Here is my Jupyter Notebook: IPinfo DuckDB experiment (github.com)


Edit:

How fast are MMDBs, you might wonder?

Enriching 500,000 IPv4 addresses:

DuckDB: ~40 seconds

  • With an index declared on the integer columns
  • With input IPs in a temp table, held in memory
  • Returning a list of tuples

MMDB: ~2 seconds

  • With IPs being represented in a Python list
  • Enrichment being done with a simple Python loop
  • Returning a list of dictionaries
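For intuition about the gap: a sorted, non-overlapping set of ranges admits a binary-search lookup per IP, which is loosely the kind of work an MMDB reader does per call (MMDB actually walks a search tree over the bits of the address; this toy is just an illustration, not MMDB's code).

```python
import bisect
import ipaddress

# (start_int, end_int, country) tuples, sorted by start_int and
# non-overlapping, as IP range databases are
ranges = [
    (int(ipaddress.ip_address("1.1.1.0")), int(ipaddress.ip_address("1.1.1.255")), "AU"),
    (int(ipaddress.ip_address("8.8.8.0")), int(ipaddress.ip_address("8.8.8.255")), "US"),
]
starts = [r[0] for r in ranges]

def lookup(ip: str):
    n = int(ipaddress.ip_address(ip))
    # Binary search for the last range starting at or before n
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0 and ranges[i][1] >= n:
        return ranges[i][2]
    return None

print(lookup("8.8.8.8"))  # US
```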


Code: DuckDB vs MMDB performance (github.com)

I perf-tested MMDB and DuckDB. MMDB is simply unbeatable.