So, last month or so, DuckDB announced a new data type called UHUGEINT. It is 16 bytes (128 bits) and unsigned.
There has been a proposal to support network addresses as a native data type, but it is stuck in limbo: Feature Request: Network Address datatype · Issue #881 · duckdb/duckdb (github.com)
So, I thought: 16 bytes is all I need, so why not take a crack at DuckDB using our free IP to Country ASN database?
Disclaimers
There are certain constraints, though:

- I am not very well versed in DuckDB's performance internals. I mainly use it for ad-hoc work.
- This whole thing is just an experiment. I doubt the DuckDB database (even with an index) will be more efficient than the MMDB database.
- The free IP to Country ASN database does not natively have a `JOIN_KEY` column, while our other IP databases do.
- Depending on your use case, this will be a better solution than Pandas/SQLite-based approaches and, in some instances, better than PostgreSQL and MySQL.
- Polars has a `DECIMAL` data type that supports 128 bits, but I have not explored it. DuckDB vs. Polars performance is way out of scope for one weekend and should be left to people smarter than me.
- I am unclear about how the Python API-based `CREATE FUNCTION` mechanism works with the built-in `ipaddress` library.
Download the DB
I will refer to the updated filename reference page on the IPinfo docs.
Convert the IP address to an integer
I am using DuckDB's Python API-based `create_function` mechanism: Python Function API - DuckDB

```python
import ipaddress
from duckdb.typing import VARCHAR, UHUGEINT

# Convert an IPv4/IPv6 address string to its integer representation
ip_to_int = lambda ip: int(ipaddress.ip_address(ip))
con.create_function("to_int", ip_to_int, [VARCHAR], UHUGEINT)
```
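As a quick sanity check (pure Python, no DuckDB needed), the conversion behaves like this; the example addresses are mine:

```python
import ipaddress

ip_to_int = lambda ip: int(ipaddress.ip_address(ip))

print(ip_to_int("8.8.8.8"))      # 134744072 (fits easily in 128 bits)
print(ip_to_int("2001:db8::1"))  # a full 128-bit integer, hence UHUGEINT
```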
Remember this is an OLAP database
As DuckDB is an OLAP database, tables should not be modified or ALTERed in place. You create new tables and drop the old ones instead. According to my limited testing, this is the faster approach.
```python
con.execute('''
    CREATE TABLE country_asn AS
    SELECT
        *,
        to_int(start_ip) AS start_ip_int,
        to_int(end_ip) AS end_ip_int
    FROM country_asn_raw
''')
con.execute("DROP TABLE country_asn_raw")
```
Limited interoperability with Python’s data
Even though we have created the function, my initial vision was to enrich a stream of IPs, or a large set of IP addresses sitting in a Python list. However, from my limited exploration, joining directly against a Python list is not feasible in DuckDB. As far as I can tell, it is impossible unless you convert the data to a Pandas DataFrame or create a temporary table in DuckDB.
So, at this moment, we can only look up individual IP addresses.
Size issue
After creating the index on the `start_ip_int` and `end_ip_int` columns, the DuckDB database is 334 MB, while the MMDB database is 45 MB. For comparison, the uncompressed CSV dataset is about 212 MB.
This was fun. If you can share some comments and feedback, that would be great.
- Am I doing anything wrong?
- Is there any opportunity to bring our data to the users of DuckDB?
- Does the “big data is dead” rhetoric work out in the space of data-focused cybersecurity?
If there is enough demand, I can probably publish regularly updated, pre-indexed DuckDB versions of the IP to Country ASN database (IP to Country + ASN Database - IPinfo.io), with the function included.
Here is my Jupyter notebook: IPinfo DuckDB experiment (github.com): https://community.ipinfo.io/t/hacktogether-setting-up-duckdb-to-use-ipinfos-ip-database-using-python-bad-code-looking-for-feedback/5538
Edit:

You might wonder how fast MMDBs are. 500,000 IPv4 addresses enriched:

DuckDB: ~40 seconds
- With an index declared on the integer columns
- With the input IPs in a temp table, stored in memory
- Returning a list of tuples

MMDB: ~2 seconds
- With the IPs represented in a Python list
- With enrichment done in a simple Python loop
- Returning a list of dictionaries
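The MMDB side of that comparison can be sketched like this, assuming the `maxminddb` library and a downloaded copy of the free database at a hypothetical path (not runnable without the data file, so treat it as illustrative):

```python
import maxminddb

ips = ["8.8.8.8", "1.1.1.1"]  # hypothetical input list

# Simple Python loop: one reader.get() per IP, returning a list of dicts
# ("country_asn.mmdb" is an assumed local path to the downloaded database)
with maxminddb.open_database("country_asn.mmdb") as reader:
    results = [reader.get(ip) for ip in ips]
```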
Video ↓