So, last month or so, DuckDB announced a new data type called UHUGEINT. It is 16 bytes (128 bits) and unsigned.
There has been a proposal to support network addresses as a native data type, but it is stuck in limbo: Feature Request: Network Address datatype · Issue #881 · duckdb/duckdb (github.com)
So, I thought: 16 bytes is all I need, so why not take a crack at DuckDB using our free IP to Country ASN database?
Disclaimers
There are certain constraints, though:

- I am not very well versed in DuckDB's performance internals. I mainly use it for ad-hoc work.
- This whole thing is just an experiment. I doubt the DuckDB database (even with an index) will be more efficient than the MMDB database.
- The free IP to Country ASN database does not natively have a `JOIN_KEY` column, while our other IP databases do.
- Depending on your use case, this will be a better solution than Pandas/SQLite-based approaches and, in some instances, better than PostgreSQL and MySQL.
- Polars has a `DECIMAL` data type that supports 128 bits, but I have not explored it. DuckDB vs. Polars performance is way out of scope for one weekend and should be left to people smarter than me.
- I am unclear about how the Python API-based `CREATE FUNCTION` mechanism works with the built-in `ipaddress` library.
Download the DB
I will refer to the updated filename reference page on the IPinfo docs.
Convert the IP address to an integer
I am using DuckDB's Python API-based `create_function` mechanism: Python Function API - DuckDB

```python
import ipaddress
from duckdb.typing import VARCHAR, UHUGEINT

# Convert an IPv4/IPv6 address string to its integer representation
ip_to_int = lambda ip: int(ipaddress.ip_address(ip))
con.create_function("to_int", ip_to_int, [VARCHAR], UHUGEINT)
```
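As a quick sanity check (pure Python, no DuckDB needed), the conversion behaves like this; the example addresses are mine:

```python
import ipaddress

ip_to_int = lambda ip: int(ipaddress.ip_address(ip))

print(ip_to_int("8.8.8.8"))      # 134744072 (fits easily in 128 bits)
print(ip_to_int("2001:db8::1"))  # a full 128-bit integer, hence UHUGEINT
```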
Remember this is an OLAP database
As DuckDB is an OLAP database, tables should not be modified or ALTERed in place. You create new tables and drop the old ones instead. According to my limited testing, this is the faster approach.
```python
con.execute('''
    CREATE TABLE country_asn AS
    SELECT
        *,
        to_int(start_ip) AS start_ip_int,
        to_int(end_ip) AS end_ip_int
    FROM country_asn_raw
''')
con.execute("DROP TABLE country_asn_raw")
```
Limited interoperability with Python’s data
Even though we have created the function, my initial vision was to enrich a stream of IPs, or a large set of IP addresses sitting in a Python list. However, from my limited exploration, joining directly against a Python list is not feasible in DuckDB. As far as I can tell, it is impossible unless you convert the data to a Pandas DataFrame or create a temporary table in DuckDB.
So, at this moment, we can only look up individual IP addresses.
Size issue
After creating the index on the `start_ip_int` and `end_ip_int` columns, the DuckDB database is 334 MB, while the MMDB database is 45 MB. For comparison, the uncompressed CSV dataset is about 212 MB.
This was fun. If you can share some comments and feedback, that would be great.
- Am I doing anything wrong?
- Is there any opportunity to bring our data to the users of DuckDB?
- Does the “big data is dead” rhetoric work out in the space of data-focused cybersecurity?
If there is enough demand, I can probably publish regularly updated, pre-indexed DuckDB versions of the IP to Country ASN database (IP to Country + ASN Database - IPinfo.io), with the function included.
Here is my Jupyter notebook: IPinfo DuckDB experiment (github.com): https://community.ipinfo.io/t/hacktogether-setting-up-duckdb-to-use-ipinfos-ip-database-using-python-bad-code-looking-for-feedback/5538
Edit:

You might wonder how fast MMDBs are. 500,000 IPv4 addresses enriched:

DuckDB: ~40 seconds
- With an index declared on the integer columns
- With the input IPs in a temp table, stored in memory
- Returning a list of tuples

MMDB: ~2 seconds
- With the IPs represented in a Python list
- With enrichment done in a simple Python loop
- Returning a list of dictionaries
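The MMDB side of that comparison can be sketched like this, assuming the `maxminddb` library and a downloaded copy of the free database at a hypothetical path (not runnable without the data file, so treat it as illustrative):

```python
import maxminddb

ips = ["8.8.8.8", "1.1.1.1"]  # hypothetical input list

# Simple Python loop: one reader.get() per IP, returning a list of dicts
# ("country_asn.mmdb" is an assumed local path to the downloaded database)
with maxminddb.open_database("country_asn.mmdb") as reader:
    results = [reader.get(ip) for ip in ips]
```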
Video ↓