Associating Geolocation with >600M records en masse via dbt

Hey y’all -

As the title says, I’m needing to enrich more en masse. I’m using dbt to move the data around. This presents two challenges.

  1. Irrespective of the exact solution, the volume is large
  2. The recommended solution is the UDTF provided by y’all, but dbt doesn’t support UTDFs

Currently, materializing the shared ipinfo geolocation table (normal dbt practice) and materializing the source data (600M+ records), then joining the two using effectively the same join criteria in the UDTF.

However, the way the model is currently constructed doesn’t support ipv6. Additionally, this approach doesn’t scale for using other tables y’all provide, or modifying our existing one due to the cost of backfilling.

simply re-running the existing model timed out at >12 hours.

solutions
1, hard coded the UDTF provided: wall time ~3hrs 40mins
2. switch the range join to an INT for ipv6 instead of the hexadecimal varchar: wall time 1hr 59mins

I don’t know what a reasonable or expected time for this operation to execute is, would love your all’s insight.

Hey Alex,

Thank you very much for posting this. I have pinged our engineers. They will join the conversation soon.

Hey Alex,

@Max, our data engineer, said that:

IPv6 is always going to be slow with the standard join, no matter how you compute the join key.

First, I would recommend to deduplicate IP addresses. Maybe they don’t have 600M distinct IP addresses?

Second, the native app is the proper solution if it can work for them.

Uman, our head of the backend, who led the Snowflake integration, said:

In Snowflake the native integer type doesn’t support the full range of IPv6, so that’s why we use the hex string format:

Numeric Data Types | Snowflake Documentation

>>> 2**128  > 99999999999999999999999999999999999999
True

Both recommended the Snowflake app if you frequently perform large-scale joins. However, we have not tested the integration of the app with Snowflake+dbt for large-scale joins.

Given the use of the hex range start as a join key, it appears that the full hexadecimal representation of the ipv6 isn’t needed for the range comparison and that the first 8 “digits” can be removed (ip_input_numeric_tail_for_range_end).

This allows the remaining hexadecimal numeric to be converted to a valid integer: ip_input_numeric_tail_for_range_end_int

We do the same for the start_ip_int and end_ip_int: tail_for_range_start_int and tail_for_range_end_int.

I put together a query to validate this:

with compare_output as (
    select
        ip_locations.start_ip as ip_input,
        -- '240e:386:82e:8900::' as ip_input,
        parse_ip(ip_input || '/34', 'inet', 1):hex_ipv6_range_start::varchar(32) as ip_input_join_key,
        parse_ip(ip_input || '/34', 'inet', 1):hex_ipv6::varchar(32) as ip_input_numeric,

        substr(ip_input_numeric, 9) as ip_input_numeric_tail_for_range_end,

        to_number(ip_input_numeric_tail_for_range_end,
                  repeat('X', length(ip_input_numeric_tail_for_range_end)))::int ip_input_numeric_tail_for_range_end_int,
        ip_locations.start_ip,
        ip_locations.end_ip,
        ip_locations.start_ip_int,
        ip_locations.end_ip_int,

        -- comparison by integer
        substr(ip_locations.start_ip_int, 9) as tail_for_range_start,
        to_number(tail_for_range_start, repeat('X', length(tail_for_range_start)))::int as tail_for_range_start_int,
        substr(ip_locations.end_ip_int, 9) as tail_for_range_end,
        to_number(tail_for_range_end, repeat('X', length(tail_for_range_end)))::int as tail_for_range_end_int,

        -- validate match using integer
        (ip_input_numeric_tail_for_range_end_int >= tail_for_range_start_int
            and ip_input_numeric_tail_for_range_end_int <= tail_for_range_end_int) as check_range
    
    from ipinfo_ip_geolocation_share.public.ip_geolocation as ip_locations
        where ip_input_join_key = ip_locations.join_key
            and ip_input_numeric between ip_locations.start_ip_int and ip_locations.end_ip_int
            and ip_locations.start_ip ilike '%:%'
)
select *
from compare_output

we see that for all cases where the hexadecimal range was satisfied, the integer range also suffices

1 Like

I was able to confirm it generates valid integers for IPv6 bogon IP’s:

with bogon_ips as (
    SELECT $1 as bogon_ip, $2 as bogon_ip_description FROM (VALUES
   ('::', 'Node-scope unicast unspecified address'),
('::1', 'Node-scope unicast loopback address'),
('::ffff:0:0', 'IPv4-mapped addresses'),
('::', 'IPv4-compatible addresses'),
('100::', 'Remotely triggered black hole addresses'),
('2001:10::', 'Overlay routable cryptographic hash identifiers (ORCHID)'),
('2001:db8::', 'Documentation prefix'),
('fc00::', 'Unique local addresses (ULA)'),
('fe80::', 'Link-local unicast'),
('fec0::', 'Site-local unicast (deprecated)'),
('ff00::', 'Multicast (Note: ff0e:/16 is global scope and may appear on the global internet.)'),
('2002::', '6to4 bogon (0.0.0.0/8)'),
('2002:a00::', '6to4 bogon (10.0.0.0/8)'),
('2002:7f00::', '6to4 bogon (127.0.0.0/8)'),
('2002:a9fe::', '6to4 bogon (169.254.0.0/16)'),
('2002:ac10::', '6to4 bogon (172.16.0.0/12)'),
('2002:c000::', '6to4 bogon (192.0.0.0/24)'),
('2002:c000:200::', '6to4 bogon (192.0.2.0/24)'),
('2002:c0a8::', '6to4 bogon (192.168.0.0/16)'),
('2002:c612::', '6to4 bogon (198.18.0.0/15)'),
('2002:c633:6400::', '6to4 bogon (198.51.100.0/24)'),
('2002:cb00:7100::', '6to4 bogon (203.0.113.0/24)'),
('2002:e000::', '6to4 bogon (224.0.0.0/4)'),
('2002:f000::', '6to4 bogon (240.0.0.0/4)'),
('2002:ffff:ffff::', '6to4 bogon (255.255.255.255/32)'),
('2001::', 'Teredo bogon (0.0.0.0/8)'),
('2001:0:a00::', 'Teredo bogon (10.0.0.0/8)'),
('2001:0:7f00::', 'Teredo bogon (127.0.0.0/8)'),
('2001:0:a9fe::', 'Teredo bogon (169.254.0.0/16)'),
('2001:0:ac10::', 'Teredo bogon (172.16.0.0/12)'),
('2001:0:c000::', 'Teredo bogon (192.0.0.0/24)'),
('2001:0:c000:200::', 'Teredo bogon (192.0.2.0/24)'),
('2001:0:c0a8::', 'Teredo bogon (192.168.0.0/16)'),
('2001:0:c612::', 'Teredo bogon (198.18.0.0/15)'),
('2001:0:c633:6400::', 'Teredo bogon (198.51.100.0/24)'),
('2001:0:cb00:7100::', 'Teredo bogon (203.0.113.0/24)'),
('2001:0:e000::', 'Teredo bogon (224.0.0.0/4)'),
('2001:0:f000::', 'Teredo bogon (240.0.0.0/4)'),
('2001:0:ffff:ffff::', 'Teredo bogon (255.255.255.255/32)')
    )
)
select 
        bogon_ip,
        bogon_ip_description,
        bogon_ip as ip_input,
        parse_ip(bogon_ip, 'inet', 1) as ip_input_parsed,
        ip_input_parsed:family::tinyint as ip_input_family,
        -- simulating udf
        case ip_input_family
            when 4
                then parse_ip(concat(ip_input, '/16'), 'inet', 1):ipv4_range_start::varchar
            when 6
                then parse_ip(concat(ip_input, '/34'), 'inet', 1):hex_ipv6_range_start::varchar
            else null
        end as ip_input_join_key,
        -- simulating udf
        case ip_input_family
            when 4
                then parse_ip(concat(ip_input, '/16'), 'inet', 1):ipv4::int
            when 6    
                then to_number(substr(parse_ip(concat(ip_input, '/34'), 'inet', 1):hex_ipv6::varchar(32), 9), repeat('X', 24))::int
            else null
        end as ip_input_numeric
from bogon_ips

Hey Alex,

Sorry for taking way longer to resolve this issue. The engineers are looking into the matter very closely and running some tests. We are still conducting tests on our side. We will keep you posted.

Thanks for looking at it Abdullah - I’m curious whether you’re able to offer any commentary on whether the logic behind the approach is at least valid.

By taking the portion of the hex key that extends past the supported integer range and making it an equality comparison, we’re able to reduce the size of the remaining hexadecimal value to an integer value we’re able to use in a range comparison.

Also, I should offer that when looking at IPv6 in your table, I believe the end integer value was always 79228162514264337593543950335, the result of select to_number(repeat('F', 24), repeat('X', 24)). I could be incorrect though.

Hey @alex_gable,

I’m sorry for the late reply. Thanks a lot for your suggestion and tests!

This is a good idea :slight_smile: Since the join key is always 34 bits, it’s indeed possible to represent it as an integer, rather than a byte string.

It’s tricky to change this without breaking existing customers workflow, but we’ll see what we can do.

We’re also working on other methods to make IPv6 lookups much faster in Snowflake. One of them is a Snowflake Native App that embeds the MMDB reader. It scales much better than the join approach. Ping @Abdullah if you want to give it a go :slight_smile:

Best,
Max

2 Likes

Max - just to be sure, you’re saying this is a valid solution if I’m able to implement it on my side?

Yes, that’s correct :slight_smile:

1 Like