Creating MMDB datasets with selected rows

Creating an MMDB file with selected rows using the CSV file format IP database and mmdbctl tool

In this article, we will create a filtered mmdb dataset that includes only selected columns from the CSV dataset. By selecting specific columns, you can reduce the size of the mmdb database, improve query time and optimize your data pipeline.

You can also use this guide to add new columns to a mmdb file. For example, for IP to Geolocation database, it could be country name, continent name, isEU tag etc. If you need detailed instructions, please leave a comment, and we will prepare a dedicated post on that.

Step 0: Prerequisites

We recommend starting this solution first with the sample IP database. Modification of the full IP database will take some time. It is better to go for the full IP database conversion once you have confirmed the solution works with the sample IP database.

Step 1: Downloading the CSV dataset

I will use the IP to Geolocation sample dataset for this tutorial. I downloaded the dataset from the IPinfo sample database repo. I have named the sample dataset ip_geolocation_sample.csv.

Code to download and save the sample database from the Github repo:

curl -o "ip_geolocation_sample.csv" "https://raw.githubusercontent.com/ipinfo/sample-database/main/IP%20Geolocation/ip_geolocation_sample.csv"

ip_geolocation_sample.csv

Step 2: Filtering the IP database with bash (specifically with awk)

The AWK programming language is considered the fastest* way to filter CSV datasets. The AWK programming language is usually packaged with standard Linux-based distros.

Step 2.1: Identify the columns you need to extract

Go through the columns you need to extract from the CSV dataset. In this example, we chose the IP to Geolocation database. From there, we want to select only the city column. You can select multiple columns if you like, as well.

Now, list out the columns and assign an index value for them. The operation we will run with awk which starts at index one. Also, you must select the start_ip and end_ip columns. Without the IP range columns (start_ip and end_ip) the mmdb conversion will not work. However, you do not need to select the join_key column.

To declare these indexes, we must preface them with $.

Columns Index for awk Selected Notes
start_ip $1 :white_check_mark: mandatory column
end_ip $2 :white_check_mark: mandatory column
join_key $3
city $4 :white_check_mark: selected column
region $5
country $6
latitude $7
longitude $8
postal_code $9
timezone $10

Based on this information, our filtered CSV file will contain only the following columns:

  • start_ip ($1)
  • end_ip ($2)
  • city ($4)

Step 2.2: Created a filtered CSV file with awk

If you want a detailed description of the code, visit the following StackOverflow links:

Now, you need to run the following code in your terminal:

awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1","$2","$4}' ip_geolocation_sample.csv > ip_geolocation_sample_filtered.csv

The TLDR is that:

  • awk calling the awk command.
  • BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"} is for escaping column native commas and parsing the CSV file appropriately.
  • {print $1","$2","$4} is for printing out the selected columns. Here you will add your target column index. In the section, if you want to add more or fewer columns you can add them with the $<column_index> separated by ",".
  • ip_geolocation_sample.csv is the source IP database in CSV format.
  • ip_geolocation_sample_filtered.csv is the output/filtered IP database in CSV format.

Now, running this command will yield a CSV file like this:

ip_geolocation_sample_filtered.csv

image

awk syntax like this is generally complicated but it makes up for it with its upstart and performance speed. If you find this command too complicated, you can use csvkit, Miller, cut etc.

Step 3: Converting the filtered CSV file to mmdb

Now that we are presented with output/filtered dataset (ip_geolocation_sample_filtered.csv), simply convert the CSV file with mmdbctl’s import command.

In the terminal, run the following code:

mmdbctl import --no-network --in ip_geolocation_sample_filtered.csv --out ip_geolocation_sample_filtered.mmdb

This command will produce the ip_geolocation_sample_filtered.mmdb mmdb database.

Step 4: Check the mmdb dataset

Now, you can query the mmdb dataset (ip_geolocation_sample_filtered.mmdb).

Go to the CSV file and pick out an IP address. Why you need to pick an IP address from the CSV file is described in this post.

I am picking 106.181.128.208. Using the mmdbctl tool, I can now look up the IP address from the mmdb dataset we just created.

$ mmdbctl read -f json-pretty 106.181.128.208 ip_geolocation_sample_filtered.mmdb
{
  "city": "Kawasaki"
}

You are presented with the selected IP metadata column (city).

Now, I recommend experimenting with the tutorial. Choose multiple columns and play with the sample. Once the solution is ready, run the queries on the entire IP dataset and generate a filtered mmdb dataset.


I highly recommend leaving a comment if you need further clarification. Thanks!

Filtering with multiple columns: city, country, and region

ipinfo → mmdb_selected_rows $ awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1","$2","$4","$5","$6}' ip_geolocation_sample.csv > ip_geolocation_sample_filtered.csv

ipinfo → mmdb_selected_rows $ mmdbctl import --no-network --in ip_geolocation_sample_filtered.csv --out ip_geolocation_sample_filtered.mmdb
writing to ip_geolocation_sample_filtered.mmdb (100 entries)

ipinfo → mmdb_selected_rows $ mmdbctl read -f json-pretty 106.181.128.208 ip_geolocation_sample_filtered.mmdb
{
  "city": "Kawasaki",
  "country": "JP",
  "region": "Kanagawa"
}