Creating an MMDB file with selected rows using the CSV file format IP database and mmdbctl
tool
In this article, we will create a filtered mmdb
dataset that includes only selected columns from the CSV dataset. By selecting specific columns, you can reduce the size of the mmdb
database, improve query time and optimize your data pipeline.
You can also use this guide to add new columns to a
mmdb
file. For example, for IP to Geolocation database, it could be country name, continent name,isEU
tag etc. If you need detailed instructions, please leave a comment, and we will prepare a dedicated post on that.
Step 0: Prerequisites
- IPinfo sample IP database (Download it from here: ipinfo/sample-database: IPinfo database and API sample repository))
- mmdbctl tool
- IPinfo IP database
- Bash + AWK
We recommend starting this solution first with the sample IP database. Modification of the full IP database will take some time. It is better to go for the full IP database conversion once you have confirmed the solution works with the sample IP database.
Step 1: Downloading the CSV dataset
I will use the IP to Geolocation sample dataset for this tutorial. I downloaded the dataset from the IPinfo sample database repo. I have named the sample dataset ip_geolocation_sample.csv
.
Code to download and save the sample database from the Github repo:
curl -o "ip_geolocation_sample.csv" "https://raw.githubusercontent.com/ipinfo/sample-database/main/IP%20Geolocation/ip_geolocation_sample.csv"
ip_geolocation_sample.csv
Step 2: Filtering the IP database with bash (specifically with awk
)
The AWK programming language is considered the fastest* way to filter CSV datasets. The AWK programming language is usually packaged with standard Linux-based distros.
Step 2.1: Identify the columns you need to extract
Go through the columns you need to extract from the CSV dataset. In this example, we chose the IP to Geolocation database. From there, we want to select only the city
column. You can select multiple columns if you like, as well.
Now, list out the columns and assign an index value for them. The operation we will run with awk
which starts at index one. Also, you must select the start_ip
and end_ip
columns. Without the IP range columns (start_ip
and end_ip
) the mmdb conversion will not work. However, you do not need to select the join_key
column.
To declare these indexes, we must preface them with $
.
Columns | Index for awk | Selected | Notes |
---|---|---|---|
start_ip | $1 |
![]() |
mandatory column |
end_ip | $2 |
![]() |
mandatory column |
join_key | $3 |
||
city | $4 |
![]() |
selected column |
region | $5 |
||
country | $6 |
||
latitude | $7 |
||
longitude | $8 |
||
postal_code | $9 |
||
timezone | $10 |
Based on this information, our filtered CSV file will contain only the following columns:
start_ip
($1
)end_ip
($2
)city
($4
)
Step 2.2: Created a filtered CSV file with awk
If you want a detailed description of the code, visit the following StackOverflow links:
Now, you need to run the following code in your terminal:
awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1","$2","$4}' ip_geolocation_sample.csv > ip_geolocation_sample_filtered.csv
The TLDR is that:
awk
calling the awk command.BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}
is for escaping column native commas and parsing the CSV file appropriately.{print $1","$2","$4}
is for printing out the selected columns. Here you will add your target column index. In the section, if you want to add more or fewer columns you can add them with the$<column_index>
separated by","
.ip_geolocation_sample.csv
is the source IP database in CSV format.ip_geolocation_sample_filtered.csv
is the output/filtered IP database in CSV format.
Now, running this command will yield a CSV file like this:
ip_geolocation_sample_filtered.csv
awk
syntax like this is generally complicated but it makes up for it with its upstart and performance speed. If you find this command too complicated, you can use csvkit, Miller,cut
etc.
Step 3: Converting the filtered CSV file to mmdb
Now that we are presented with output/filtered dataset (ip_geolocation_sample_filtered.csv
), simply convert the CSV file with mmdbctl’s import
command.
In the terminal, run the following code:
mmdbctl import --no-network --in ip_geolocation_sample_filtered.csv --out ip_geolocation_sample_filtered.mmdb
This command will produce the ip_geolocation_sample_filtered.mmdb
mmdb database.
Step 4: Check the mmdb dataset
Now, you can query the mmdb dataset (ip_geolocation_sample_filtered.mmdb
).
Go to the CSV file and pick out an IP address. Why you need to pick an IP address from the CSV file is described in this post.
I am picking 106.181.128.208
. Using the mmdbctl
tool, I can now look up the IP address from the mmdb
dataset we just created.
$ mmdbctl read -f json-pretty 106.181.128.208 ip_geolocation_sample_filtered.mmdb
{
"city": "Kawasaki"
}
You are presented with the selected IP metadata column (city
).
Now, I recommend experimenting with the tutorial. Choose multiple columns and play with the sample. Once the solution is ready, run the queries on the entire IP dataset and generate a filtered mmdb
dataset.
I highly recommend leaving a comment if you need further clarification. Thanks!