Querying IPinfo IP databases inside BigQuery

Querying IPinfo IP databases inside BigQuery

We highly recommend that users not write their own code when querying our data inside BigQuery. These queries are often unoptimized and thus expensive.

IP databases are specialized and require specific SQL queries depending on the platform. These queries vary from platform to platform as they require functions and features that can support IP address data. For BigQuery, we highly recommend following this tutorial as it leverages BigQuery’s native functions.

Step 0: Prerequisites

This script assumes you have already ingested the IPinfo IP database in Big Query. Documentation: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv.

Step 1: Exploring the log data containing the IP addresses

Start with a table, view or a CTF containing IP addresses. The IP addresses used in this example are contained in the log_data table. The table has one column called IP.

SELECT
  *
FROM
  log_data

log_data

image

We will join the IP addresses with the IPinfo metadata information in the IPinfo IP database.

Step 2: The IPinfo IP database inside BigQuery

We recommend you ingest the CSV format data into BigQuery for this. For this, follow the instructions available here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv.

The IPinfo database we will use here is the IP to Geolocation database.

SELECT
  *
FROM
  `ipinfo-XXXX.ipinfo_data.location`
LIMIT
  100

Here:

Item Note
ipinfo-XXXX GCP project name
ipinfo_data Resource name
location Table name

We also recommend setting up indexing or clustering mechanism. Here is a blog article from GCP: Optimizing BigQuery: Cluster your tables | by Felipe Hoffa | Medium. If you index the IPinfo IP database, your target columns will be the IP range columns: start_ip, end_ip and join_key.

Step 3: Running the query

Step 3.1: Identify IP metadata information from the IPinfo IP database

The IP metadata information will be any column containing no IP address range (start_ip, end_ip or join_key). The IP metadata information in the IP to Geolocation database will be the following:

  • city
  • region
  • country
  • lat
  • long
  • postal
  • timezone

Step 3.2: The lookup code

This is the main section. Please note the syntax carefully.

The code first:

SELECT
  l.ip, -- ips in the log_data 
  loc.city, -- ip metadata columns in the ip loc database
  loc.region,
  loc.country,
  loc.lat,
  loc.lng,
  loc.postal,
  loc.timezone
FROM
  log_data AS l
JOIN
  `ipinfo-XXXX.ipinfo_data.location` loc
ON
  loc.join_key = NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(l.ip),
    IF
      (STRPOS(l.ip, ':') = 0, 16, 34)))
WHERE
  NET.IP_FROM_STRING(l.ip) 
      BETWEEN
        NET.IP_FROM_STRING(loc.start_ip)
          AND
        NET.IP_FROM_STRING(loc.end_ip)

Code breakdown:

  • IP addresses from the log data are joined with IP metadata information from the IPinfo IP database.
  • IP addresses in the log_data are converted from string data types to their IP address data type using BQ’s native NET.IP_FROM_STRING.
  • The join_key (a truncated version of the IP address) facilitates one aspect of the join. This operation also considers IPv6 addresses. The NET.IP_TRUNC command generates the join_key for the IP addresses in the log_data table.
  • The second aspect of the join operation is performed using the BETWEEN command.
  • Then, the IP metadata information is joined with the input IP addresses.

Query performance:

The log_data contained eight(8) thousand random IPv4 addresses and two(2) thousand random IPv6 addresses. The cache was cleared for this query.


If you need further clarifications, please drop a comment below. Thanks.