IPinfo's join_key column explained

This was taken from an internal wiki. Feel free to drop a comment if you have specific questions or feedback.

The join_key prefix length is /16 for IPv4 and /34 for IPv6.


The problem that join_key solves is making joins between large datasets of IP ranges reasonably fast without requiring special DB capabilities, like indexes, on the client side.

The idea is to first do an equality join to find ranges that may overlap, and then filter this smaller subset of rows (those sharing the same join_key) using the actual range overlap condition. The second stage (filtering) is quadratic in nature (one has to compare each row in the left subset with each in the right one), so we want join_key to be specific enough to make these subsets as small as possible.
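The two stages can be sketched in Python (a toy illustration with made-up data, not IPinfo's implementation — the dict of buckets plays the role of the equality join):

```python
import ipaddress
from collections import defaultdict

# Hypothetical tables: the left one holds single IPs, the right one holds
# ranges as (start, end, label). All ranges here fit within a /16, so a
# range's start and end share the same join_key.
ips = ["1.2.3.4", "1.2.200.1", "5.6.7.8"]
ranges = [("1.2.3.0", "1.2.3.255", "A"), ("1.2.0.0", "1.2.255.255", "B")]

def key(ip: str) -> int:
    # join_key for IPv4: the first 16 bits of the address.
    return int(ipaddress.ip_address(ip)) >> 16

# Stage 1: equality join — bucket ranges by join_key.
buckets = defaultdict(list)
for start, end, label in ranges:
    buckets[key(start)].append((start, end, label))

# Stage 2: within each (small) bucket, apply the exact containment test.
matches = [
    (ip, label)
    for ip in ips
    for start, end, label in buckets.get(key(ip), [])
    if ipaddress.ip_address(start) <= ipaddress.ip_address(ip) <= ipaddress.ip_address(end)
]
# matches: [('1.2.3.4', 'A'), ('1.2.3.4', 'B'), ('1.2.200.1', 'B')]
```

Stage 2 only compares rows within a bucket, which is why small buckets (a specific join_key) matter.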

However, generating the join_key has a price. Say we have two tables containing only IPv4 addresses (i.e. /32s) or ranges no larger than a /16. Then to obtain the join_keys for both tables we just truncate IPs to their first 16 bits, e.g. 1.2.3.4 becomes 1.2.0.0 (the last 16 bits are set to 0), and that’s it.
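The truncation is a one-liner with Python’s `ipaddress` module (an illustrative helper, not IPinfo’s code):

```python
import ipaddress

def join_key_v4(ip: str) -> str:
    # Keep the first 16 bits of the address, zero the rest.
    net = ipaddress.ip_network(f"{ip}/16", strict=False)
    return str(net.network_address)

print(join_key_v4("1.2.3.4"))  # 1.2.0.0
```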

But if we have a range bigger than /16, say 1.2.0.0/15, then not all IPs in it have the same first 16 bits, so we can’t just use 1.2.0.0 as a join_key. Instead, we must split the /15 into the two /16s that cover it — 1.2.0.0/16 and 1.3.0.0/16:

$ ipinfo splitcidr 1.2.0.0/15 16
1.2.0.0/16
1.3.0.0/16

IPinfo CLI’s splitcidr command can split CIDRs by input prefix length or subnet mask.
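The same split can be reproduced with Python’s `ipaddress` module (a sketch of the idea, not how the preprocessing is actually implemented):

```python
import ipaddress

# Split a CIDR wider than /16 into the /16 blocks that cover it.
parts = list(ipaddress.ip_network("1.2.0.0/15").subnets(new_prefix=16))
print([str(p) for p in parts])  # ['1.2.0.0/16', '1.3.0.0/16']
```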

So for each such row (/15) we generate two rows — with join_keys 1.2.0.0 and 1.3.0.0. For a /14 we would generate 4 rows, et cetera. We could instead choose /24 as the join_key size for IPv4, which would also work — the filtering would involve at most (2^8)^2 = 65536 comparisons — but then at the preprocessing stage each row with a /16 range would become 256 rows with different join_keys, and each /8 row would become 65536 rows. That would still work, even though the data size would be bigger.

An actual problem starts when you consider IPv6. Unlike IPv4, some split sizes are simply unavailable to us: BQ just can’t generate that many rows, or it takes an eternity, and the resulting preprocessed data becomes huge even if it initially contained a single /11. So we physically cannot use split sizes like /64 (roughly the IPv6 analogue of /24 in IPv4).

On the other hand, using /16 makes filtering prohibitively slow, as too many rows in the joined tables may share the same first 16 bits; the IPv6 address space is extremely inhomogeneous. The value of /34 is a compromise: it allows the preprocessing to run without extreme inflation of the data size while keeping joins more-or-less performant, at least in typical cases.
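The IPv6 join_key is produced the same way as the IPv4 one, just truncating to 34 bits instead of 16 (again an illustrative helper, not IPinfo’s code; the address is made up):

```python
import ipaddress

def join_key_v6(ip: str) -> str:
    # Keep the first 34 bits of the address, zero the remaining 94.
    net = ipaddress.ip_network(f"{ip}/34", strict=False)
    return str(net.network_address)

print(join_key_v6("2001:db8:1234:5678::1"))  # 2001:db8::
```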

Of course, this approach is not bullet-proof, and join performance may be poor in some cases. If the user’s database of choice supports IP range indexes, like Postgres (ip4r), it is advisable to use them.

Using IPinfo’s data downloads in PostgreSQL - Docs / Database Downloads - IPinfo Community