Why do the numbers of columns not match between the CSV/JSON and MMDB files?

Please read this blog first: How to choose the best file format for your IPinfo database?

The MMDB file format is a special type of binary database, designed for efficient IP lookup and returning its IP metadata information. You pass in an IP address and get its metadata information (geolocation, company, ASN, etc.) back.

In contrast to the binary MMDB database, the CSV and JSON database contain the IP data in plaintext format. You can easily look up the contents of CSV and JSON databases while the MMDB database contains data in binary format.

WindowsTerminal_PZ81ZSZdNU-ezgif.com-optimize

To read the binary MMDB database file, you need specialized tools such as our mmdbctl tool or MMDB reader libraries.

The columns in these databases

The CSV and JSON files contain IP data in the form of IP ranges. These IP ranges columns are:

This IP range column information is converted to binary data within the MMDB file. The MMDB file does not generally return the range information and the join_key as they are not “IP metadata information”. You get the IP metadata when you look up an IP address from the MMDB.

Consider the IP to Company database:

The CSV file contains 3 IP range columns (including the join_key) and 8 IP metadata columns containing IP to company information:

# Field Name Column Type
1 start_ip IP Range Columns (1)
2 end_ip IP Range Columns (2)
3 join_key IP Range Columns (3)
4 name IP Metadata Columns (1)
5 domain IP Metadata Columns (2)
6 type IP Metadata Columns (3)
7 asn IP Metadata Columns (4)
8 as_name IP Metadata Columns (5)
9 as_domain IP Metadata Columns (6)
10 as_type IP Metadata Columns (7)
11 country IP Metadata Columns (8)

It has the same number of columns in the JSON database as well.

How many columns does the MMDB database have?

Let’s use mmdbctl tool to answer that:

Target IP address: 54.38.15.136

Let’s look up the target IP address using the mmdbctl tool:

mmdbctl read 54.38.15.136 company_sample.mmdb

The mmdbctl tool generally only returns the IP metadata information of the IP address being looked up. Because of the structure of the MMDB, IP ranges are converted to a binary representation, and they are used to return only the IP metadata information and not IP range information.

Special point: Is it always only IP metadata?

Outside the scope of the MMDB database, let’s talk about the capabilities of the mmdbctl tool. The mmdbctl tool can do extra things to return extra information in special circumstances.

mmdbctl can return input IP addresses

When you are outputting to TSV or CSV format data, the mmdbctl tool will add an extra column natively called IP, which represents the input IP address:

mmdbctl read 54.38.15.136 company_sample.mmdb -f csv

mmdbctl can support outputting IP range column

You can return the IP range or network information when querying the MMDB database. For that, you have to import/compile the MMDB database yourself from the CSV database.

IPinfo does not add the network column to their mmdb database by default. We use the --no-network declaration when importing/compiling to MMDB databases.

Compile the CSV IP address database to an MMDB database using the following command.

mmdbctl import --in company_sample.csv --out company_sample_network.mmdb

Then query the MMDB database as usual:

mmdbctl read 54.38.15.136 company_sample_network.mmdb

Notice the network column there. The network column just represents the range in the IP database. It does not indicate the parent range or prefix of the input IP address.

The network column has no impact or significance from a parent range perspective. The network column does not indicate parent range or prefix declaration as derived from WHOIS and BGP records. The network column has no significance outside the IP database’s boundaries. Use the ASN API service to get an IP address’s parent range or prefix information.

Representing IP range values in CIDR format

We can even take this one step further. Converting the IP address range (start_ip, end_ip) to their CIDR equivalent.

If you precompile the CSV to have their start_ip, and end_ip be converted to their CIDR equivalent using the IPinfo CLI, then import/compile the CSV file to MMDB format using the CLI, you can get the network value in the MMDB file in their CIDR format:

ipinfo range2cidr company_sample.csv > company_sample_cidr.csv
mmdbctl import --in company_sample_cidr.csv --out company_sample_network.mmdb
mmdbctl read 54.38.15.136 company_sample_network.mmd


Feel free to reply to this post if you have any questions or feedback.

Hello Abdullah,

I tried using the mentioned steps to generate an mmdb file from the company csv file. But the process keeps getting killed with Out of Memory. I tried with an 8Gig machine initially, then tried with a 32Gig machine having 20Gig of storage. But still the process gets killed.

I am able to convert the sample csv to mmdb format, but the tool does not work for full company csv file. Can you please let me know if you have tried and are able to convert the full company csv file to mmdb format. (The company csv file is around 4Gig)

Logs:
Mar 1 07:09:33 ip-172-31-27-69 kernel: [ 1245.636384] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=chrony.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=mmdbctl,pid=3250,uid=1000

Mar 1 07:09:33 ip-172-31-27-69 kernel: [ 1245.636440] Out of memory: Killed process 3250 (mmdbctl) total-vm:33098176kB, anon-rss:32353404kB, file-rss:128kB, shmem-rss:0kB, UID:1000 pgtables:63472kB oom_score_adj:0

Mar 1 07:09:33 ip-172-31-27-69 systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer.

Thanks,
Suhas

Hi Suhas,

The MMDB writing process does indeed require a lot of memory, but we can help here :slight_smile:

@Abdullah mentioned that you wanted to keep the range information in the MMDB. Is that correct?

If so, note that the range in the CSV file (start_ip, end_ip) has no specific meaning. It just represents a range of consecutive IPs with the same values, but it does not necessarily match the BGP (routed prefix) or WHOIS range (registered range).

Could you clarify which kind of information you would like to obtain? We can build an MMDB that contains what you need.

Thanks,
Max

1 Like

Hi Max,

That’s right, we want to keep the range information in MMDB.
Is it possible to include the start_ip, end_ip (from CSV file), routed prefix(from BGP) and the registered range (from WHOIS).

Thanks,
Suhas

Hello @Max and @Abdullah,

Any update on this?

Thanks,
Suhas

My apologies for the delay, @Suhas. Give me a moment. I have reached out to our sales team to see if they have contacted your team.

Hey @Suhas,

I hope you are doing well. Please let me know what you think of the solution:

Using the free IP to ASN database as a secondary data source

The dataset is free, updated daily, and provides full accuracy. There should not be any compromise with data quality, but there is a small caveat: you have to use two MMDB datasets. We will discuss this issue in the later section.

Instructions

First, download the IP to ASN free database in the CSV format:

curl -L https://ipinfo.io/data/free/asn.csv.gz?token=$TOKEN -o asn.csv.gz

Your existing token will work just fine. The IP dataset is available for free everyone.

Unzip gzipped CSV file:

gunzip asn.csv.gz

Add the range/network column to the database using the IPinfo CLI:

ipinfo range2cidr asn.csv > asn_cidr.csv

Convert it to MMDB database:

mmdbctl import --in asn_cidr.csv --out asn_cidr.mmdb

And now you will have access to range information of ASN.

Usage

mmdbctl read 86.196.240.78 asn_cidr.mmdb | jq
{
  "asn": "AS3215",
  "domain": "orange.com",
  "name": "Orange S.A.",
  "network": "86.192.0.0/11"
}

Caveats

Using two MMDB database

I understand that we discussed an idea about a single dataset, but considering our challenges for the moment, it is a simple compromise. The MMDB file format is incredibly fast, so you should be able to support lookup from two MMDB datasets.

If you need any assistance, please let me know. I am always happy to help.

Discrepency between ASN names in between ASN database and Company database

This is a non-issue, as using the IP to ASN database will give you the best possible result. I recommend sticking with the ASN database for ASN data.

Here is a recommended read: Differences in data for the ASN and the IP to Company database - Docs / Knowledgebase - IPinfo Community

Database range is not a representation of WHOIS records or other information

Because we aggregate and flatten ranges, the range representation in the IP database is not the same as how they are represented in WHOIS records and other public routing records.

However, I think (I have not verified this) the range information that will be generated from the method I highlighted will represent the largest prefix possible that spans across multiple neighboring prefixes as our data pipeline aggregates the ranges, resulting in smaller ranges merging to their highest possible common range.


I really appreciate your participating in the community. Let me know if you need any help in adopting this solution. Again, thank you very much!