Using IPinfo’s databases in Cribl

Abdullah · January 25, 2024, 5:28pm

IP intelligence inside Cribl powered by IPinfo

Cribl provides solutions for data management and observability. Their pioneering product, Cribl Stream, allows for the management and optimization of data, particularly machine-generated log data within an IT system. With Cribl Stream, users can collect data from various sources, route it to different processes, and send it to multiple destinations. The data pipelines provided by Cribl Stream can help with filtering, enrichment, and transformation of data.

A significant part of Cribl’s source data is internet/server traffic log data containing IP addresses. Within Cribl, IPinfo’s data can be used to enrich log data with geolocation, privacy, company, ASN, and more information.

Using IPinfo’s data download service can be considered a fundamental step in IP data enrichment through Cribl. Moreover, as the process is simple and effective, anyone interested in Cribl should consider exploring data enrichment through IPinfo as a first step in learning and evaluating Cribl’s abilities.

In this tutorial, we will explore IPinfo’s IP data in Cribl. As Cribl’s platform has a broad range of features, we will only focus on helping users get up and running with IPinfo and Cribl. If you are interested in learning more about using IPinfo’s data in Cribl, reach out to our data experts today.

Fundamental concepts of Cribl

In the simplest terms, Cribl is a platform that can work with multiple sources, perform transformations, and route the data stream to multiple destinations.

The data source can be log data from web servers, IoT events, data streaming platforms, etc. The data generated by these sources are called “events”. Cribl can capture these events to create a central observability and management platform. Instead of having multiple platforms generating scattered data in different places, Cribl can centralize this generated data within its own environment.

After centralizing them, Cribl not only improves the observability and maintainability of these events, but it can also route these events to data pipelines that can process, enrich (this is where IPinfo comes in), transform, filter, and anonymize them.

After this transformation process, it redirects the output of the pipelines to different destinations. These destinations can be data stores, data warehouses, data streaming platforms, analytics platforms, etc.

We highly recommend that users go through Cribl’s own documentation and tutorial to learn more about Cribl.

Example: Apache Web Server → IPinfo → AWS S3 Bucket

Consider that you are running an Apache web server to host a website. This web server is generating a lot of data that provides information about the users visiting your website. This data includes the IP addresses of the visitors.

For analytics and cybersecurity purposes, you would like to enrich the visitor IP addresses with IP metadata information, such as geolocation and ASN information.

After the enrichment process, you want to store the enriched log data in an AWS S3 storage bucket so that you can further process it using AWS Redshift, Snowflake, or any other platform of your choice.

This process of capturing log data, enriching it with IPinfo’s database as a lookup/reference database, and sending the enriched data to AWS S3 can all be done inside Cribl Stream.

Cribl Stream + IPinfo Walkthrough

So, let’s get started. Cribl provides easy sign-up for a free account. Cribl also offers a sandbox environment for learning and experimenting with its ecosystem, but at this moment the sandbox does not support IPinfo’s IP database uploads. So, we recommend that users sign up for a free account with Cribl.

You probably already have an IPinfo account. If not, sign up for IPinfo’s free API and free database service.

We will be working with synthetic data for traffic generation. This synthetic traffic is generated natively through Cribl and does not require us to invest in using a real traffic source or bringing in synthetic data ourselves.

We will be using the MMDB data format of IPinfo’s IP database. If you would like to learn about this data format, we recommend giving this article a read: How to choose the best file format for your IPinfo database? - IPinfo.io

Even though the Cribl function that works with IPinfo’s MMDB database is called “GeoIP”, it can read and work with all the MMDB databases IPinfo produces. This schema-agnostic function only converts the binary MMDB data into usable plaintext data.

Basic setup and setting up of the synthetic traffic source

After you have signed up for a free account with Cribl (and with IPinfo), you will be presented with Cribl’s homepage. Go to the “Manage Stream” product to get started with Cribl Stream. Cribl has other products, such as Cribl Edge and Cribl Search, which can also support our data.

From there, select the “default” worker group.

Now, we will need to set up a synthetic traffic generation source. We will go with QuickConnect, which Cribl itself describes as the simplest method. When it comes to using IPinfo’s database, the easiest solution always works.

From here, we will add a “source”.

After opening up the source configuration, we will search for “Datagen” which stands for Data Generator. We will select the “Add New” button to start configuring it.

From here, we will configure the data generator source to generate synthetic data that replicates real traffic.

Please note the following steps:

Step	Item	Description	Action
1	Input ID	The ID of the Datagen source	`apache_log_dummy`
2	Data Generator File	Name of the Datagen file type. We will pick “Apache Common Log” from the drop-down menu.	`apache_common.log`
3	Pre-Processing	Just note the existence of it in this configuration stage. We will get back to this later.	No action is required.
4	Tags	Tag to group and filter items	`traffic`
5	Save	Save the configuration file	Click to save the configuration for the Datagen source.

What we have done here is create a stream of synthetic events generated from the Datagen source. These synthetic events replicate real-life Apache web server logs.

One recurring theme in this tutorial is verbosity. When managing multiple sources, pipelines, and destinations across different VMs, orgs, data centers, and cloud instances, it’s easy to become overwhelmed if we use generic ID names and descriptions. We recommend being diligent about ID names, tags, descriptions and commit messages.

After setting up the configuration file, you need to commit and deploy the action using the “Commit & Deploy” button in the top right corner.

The version control mechanism of Cribl is awesome.

Provide as much context as possible in the commit message, and then press the bottom right button to commit and deploy.

However, the source is not active yet and is not generating events. To activate the source to generate events, you need to connect it to a destination. Here, you can click and drag the line from the Datagen source to the default destination. After connecting the source to the destination with the passthru option, press the save button.


Again, commit and deploy with a message. In the diff section, you can see that the source is now activated.

Wait for at least one minute before moving to the next step! You will not see the results immediately after deploying. After a minute, you can check out both the source and destination.


By pressing “configure” and then “Live Data,” you can see events being generated in real-time in both the source and destination. This is the magic of Datagen. Synthetic data is being created in the format of Apache web server logs. There is no real server behind this information, but as log data, it is useful for testing the pipelines.

Note the live data information. The Apache log data contains raw log information and some metadata information provided by Cribl. There isn’t much going on in terms of data and context. As we go through the tutorial, we will not only parse the log event but also enrich it with IPinfo’s IP insights for each individual log event.

Use IPinfo’s data inside Cribl Stream

Now that we have set up the source and the destination, it is time we get started with IPinfo’s IP database.

We are going to upload IPinfo IP data downloads in the form of an MMDB database. For every IPinfo user, IPinfo generously provides the following free IP databases:

IPinfo Lite: The Most Accurate Free IP Geolocation API & Database - IPinfo.io

These free IPinfo IP databases:

Are updated every day
Provide full accuracy without compromising data quality
Do not cluster ranges to larger subnets and provide granular level (down to /32) IP metadata information
Have a flat and tabular structure
Are available in three different formats: CSV, MMDB, and JSON

The tabular structure and predictable schema are going to be a lifesaver when working with our data in Cribl.

You can download your free IPinfo IP database from your IPinfo account dashboard.

Aside from our free IP database, you can use premium databases, such as IP to Geolocation, IP to Privacy Detection, and IP to Company.

For starters, we are going to use the free IP to Country database. Make sure to download the MMDB format database.

The IP to Country database contains the following fields:

FIELD NAME	EXAMPLE	DATA TYPE	DESCRIPTION
`start_ip`	217.220.0.0	TEXT	Starting IP address of an IP address range
`end_ip`	217.223.255.255	TEXT	Ending IP address of an IP address range
`country`	IT	TEXT	ISO 3166 country code of the location
`country_name`	Italy	TEXT	Name of the country
`continent`	EU	TEXT	Continent code of the country
`continent_name`	Europe	TEXT	Name of the continent
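
To make the table concrete, here is a minimal Python sketch of how a range-based table like this one answers a lookup. It is only an illustration and not how Cribl or the MMDB format works internally (MMDB files store the data in a binary search tree). The `RANGES` list and the `lookup_country` helper are hypothetical names, the single row is copied from the example column above, and the sketch assumes IPv4 addresses only.

```python
import bisect
import ipaddress

# One row copied from the example column of the table above.
RANGES = [
    {
        "start_ip": "217.220.0.0",
        "end_ip": "217.223.255.255",
        "country": "IT",
        "country_name": "Italy",
        "continent": "EU",
        "continent_name": "Europe",
    },
]

# Sort rows by numeric start address so they can be binary-searched.
_SORTED = sorted(RANGES, key=lambda r: int(ipaddress.ip_address(r["start_ip"])))
_STARTS = [int(ipaddress.ip_address(r["start_ip"])) for r in _SORTED]


def lookup_country(ip):
    """Return the row whose range contains ip, or None if no range matches."""
    value = int(ipaddress.ip_address(ip))
    # Find the last range that starts at or before the address.
    index = bisect.bisect_right(_STARTS, value) - 1
    if index < 0:
        return None
    row = _SORTED[index]
    if value <= int(ipaddress.ip_address(row["end_ip"])):
        return row
    return None
```

Calling `lookup_country("217.221.10.5")` returns the Italy row, while an address outside every range returns `None`.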

After downloading the IPinfo IP to Country database in the MMDB format, you must upload it to Cribl. Cribl has a dedicated section for hosting variables, databases, regex definitions, and more in the “knowledge library” section of pipelines. The knowledge library section can be accessed from the pipelines menu.

In the lookup tab of the knowledge section, we will click on “Add Lookup File.” Cribl can support multiple lookup file formats, including the MMDB format we are working with. Use this option to upload the IPinfo IP to Country database.

Provide the following information for the “New Lookup File”.

Step	Field name	Input
1	Filename	`country.mmdb`
2	Description	`IPinfo IP to Country Database MMDB`
3	Tag	`ipinfo`
4	Save	Click the button on the bottom right

After the IPinfo IP to Country database has finished uploading, “Commit & Deploy” with a descriptive message of the actions we took.

Set up the processing pipeline

Now that we have uploaded our IPinfo IP database into Cribl’s knowledge storage, we are going to use it as a lookup or reference table to enrich the log data with country-level information for each IP address.

This process will be done through a pipeline element, which will take the IP address inside each log event, look it up in the IPinfo IP to Country database, and add the country location context to the event.

To create a new pipeline, open up Pipelines from the Processing menu.

And create a new Pipeline.

Add the ID, description and tags for the new pipeline. As we are creating this pipeline to add IP metadata information to the IP address in the log stream, we will add the information in the following way.

Step	Field name	Input
1	ID	`log_enrichment_with_ipinfo`
2	Description	`Adding metadata information to IP addresses from log data using IPinfo's MMDB database`
3	Tag	`ipinfo`, `log_enrichment`
4	Save	Click the button on the bottom right

Parser function

For the first step, we will add a parser. Cribl’s native parser function can parse data from raw input. It supports multiple configuration and input files. However, in this instance, we are only interested in parsing the Apache common log that is being generated by Datagen. Note that you can also use regex to parse information from raw input.

After adding the parser function, configure it to parse the Apache common log format. Here is a breakdown of the steps:

Step	Field name	Input
1	Description	`Extract fields (including IP addresses) from Apache common log`
2	Operation mode (From options)	`Extract`
3	Type (From options)	`Common Log Format`
4	Library (From options)	`Apache Common Log Format`
5	List of fields	`Auto generated`
6	Save	Click the button on the bottom right

Note the parsed clientip field. From the Apache common log, we will use the clientip field to get the IP address and then look up its metadata information in IPinfo’s IP database.

Note that if the parser function does not have a native library to parse IP addresses from raw input, you can use the regex function and regular expressions to extract IP addresses from raw payload.
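
The regex route mentioned above can be sketched in Python. This is an illustration of the idea and not Cribl’s own parser: the pattern follows the standard Common Log Format layout, the sample log line is made up, and every field name other than `clientip` is an assumption chosen for readability.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<clientip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_common_log(raw):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF_PATTERN.match(raw.strip())
    return match.groupdict() if match else None


# Made-up sample line in Common Log Format.
SAMPLE = (
    '217.220.0.1 - - [25/Jan/2024:17:28:00 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120'
)
event = parse_common_log(SAMPLE)
```

After parsing, `event["clientip"]` holds the address that the lookup step needs.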

IP database lookup or GeoIP function

Note that the name GeoIP might suggest IP geolocation, but it is, in fact, a general function for reading MMDB databases, the binary file format that holds the IP data.

This GeoIP function can support all of IPinfo’s IP databases, including IP to Geolocation, IP to Privacy Detection, IP to Company, IPinfo’s free IP database, and more, as long as they are in the MMDB database format.

To get started, add the GeoIP function to the log_enrichment_with_ipinfo pipeline after the parser function.

Now configure the GeoIP function like so:

Step	Field name	Description	Input
1	Description	Description of the GeoIP function	`MMDB lookup function. Lookup IP addresses country-level information from IPinfo's IP to Country MMDB database.`
2	GeoIP File (.mmdb)	The name of the MMDB file inside the knowledge section	`country.mmdb`
3	IP Field	The field containing public IP addresses. Information parsed in the previous step.	`clientip`
4	Result Field	Field added to the event to hold the lookup result	`ipinfo_country`
5	Save		Click the button on the bottom right

This is the core part of the entire tutorial: setting up the IP enrichment function.

The clientip field parsed from the log data in the previous step contains the IP address of the visitor who is visiting the website hosted on the Apache web server.

Then, we take those visitor or client IP addresses and look them up against the MMDB database we uploaded to the “Pipeline > Knowledge” section. The function uses the IPinfo IP to Country MMDB database as a lookup table. Because MMDB databases are binary databases built for fast lookups, they return IP metadata information almost instantaneously. So, you should not be concerned about performance or query times.

After the pipeline enriches the log data with the IP metadata information, in this case, the country-level information provided by IPinfo, it will output enriched log data that will be carried to a destination target or other pipelines.

If you would like to use the MMDB database outside of Cribl or want to use IPinfo’s data in general, we have a plethora of resources available at your disposal. The MMDB database can be used in web servers directly, in backend operations, or in any programming environment using an MMDB reader library. We offer MMDB databases for all our database products, including our free IP databases.
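
As one example of using an MMDB reader library outside of Cribl, here is a minimal Python sketch built on the third-party `maxminddb` package (`pip install maxminddb`). The helper names `lookup` and `open_and_lookup` are hypothetical, the `country.mmdb` filename is the one used in this tutorial, and the fields returned depend on which database you downloaded.

```python
def lookup(reader, ip):
    """Look up ip with any object exposing get(ip), such as a maxminddb Reader.

    Returns the record as a dict, or an empty dict when the IP is not found.
    """
    record = reader.get(ip)
    return record if record is not None else {}


def open_and_lookup(path, ip):
    """Open an MMDB file at path and look up a single IP address."""
    # Imported here so the helper above can be used without the package installed.
    import maxminddb  # third-party package

    with maxminddb.open_database(path) as reader:
        return lookup(reader, ip)


# Example usage (requires the maxminddb package and a downloaded database):
# print(open_and_lookup("country.mmdb", "217.220.0.1"))
```

Keeping `lookup` independent of the file handle makes it easy to open the database once and reuse the reader for many lookups.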

After you are done setting up the pipeline, “Commit & Deploy”.

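Here is a brief overview of what we have achieved so far: we set up a Datagen source that generates synthetic Apache logs, connected it to the default destination, uploaded the IPinfo IP to Country MMDB database as a lookup file, and created a pipeline that parses each event and looks up the client IP address with the GeoIP function.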

Setting the pipeline to pre-process source data

After you have finished setting up the pipeline, we will go back to the QuickConnect (“Routing > QuickConnect”) page.

Pre-processing with log enrichment pipeline

In the “Datagen” source section, we will open up the configure section to utilize the IP metadata log enrichment pipeline. This pipeline will enrich incoming log traffic with IPinfo IP to Country metadata. Here, we are using preprocessing as an easy step to do IP log enrichment. However, you can also do post-processing, add conditions, and add more pipelines. It is recommended to use “Data Routes” instead of “QuickConnect” for advanced operations.

Enriching log data with IP metadata at the preprocessing stage is a common practice. Since API calls are not made and the lookup operations are done against a static file, there is no marginal cost for individual IP lookups from IPinfo.

From the configuration section, open up the “Pre-Processing” settings and add the pipeline we just created. After selecting the “log_enrichment_with_ipinfo” pipeline, save the configuration.

Now, “Commit & Deploy” and wait a minute for the changes to propagate.

Seeing the result

Now, you are done. You will be able to see the enriched log data powered by IPinfo in your destination target.

Open up the configure settings of the destination target named “default”.

Head down to the “Live Data” section and you will see enriched log data flowing in. This data is coming from the Datagen source and is being enriched using IPinfo’s data.

You can see a number of fields:

clientip: Contains the client IP address parsed using the parser function.
ipinfo_country: The parent level field outputted from the GeoIP function.
Individual location fields: Showcased in step 3 and step 5, you can see the key-value pairs of individual location information of the IP address contained in each event payload.

The enrichment is happening at the event/log streaming level.

Bonus section: Multiple IPinfo database enrichment

Why stop at one IPinfo IP database? Why not add more? We will go through this section quickly as the process is more or less the same.

This time, we will use the IP to Company database for a change. First, we will add the MMDB format IP to Company database to the Lookup library, which is available in Pipeline → Knowledge → Lookups.

Then, we will add an extra step in the log_enrichment_with_ipinfo pipeline. We will add a new GeoIP function for the new MMDB database. Note that we are creating a new output field (ipinfo_company) that will contain the IP to Company metadata for each IP address.

Next, “Commit & Deploy” and wait a minute.

Going back to our Destination target’s live data, we can see a bunch of new fields. The parent field ipinfo_company contains all the IP to Company database information for each IP address, along with individual child fields that contain company name, company country, company domain, ASN, and more IP-related information.
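
Conceptually, chaining several lookups means each database writes its record into its own result field on the event. The Python sketch below illustrates only that idea and is not Cribl’s implementation: the `enrich` helper is hypothetical, and the two stub lookups return made-up records whose keys are placeholders rather than IPinfo’s actual schema.

```python
def enrich(event, lookups, ip_field="clientip"):
    """Return a copy of event with one result field added per lookup."""
    enriched = dict(event)
    ip = event.get(ip_field)
    if ip is None:
        return enriched
    for result_field, lookup in lookups.items():
        record = lookup(ip)
        if record:  # skip databases that have no record for this IP
            enriched[result_field] = record
    return enriched


# Stub lookups standing in for two MMDB databases (made-up records).
lookups = {
    "ipinfo_country": lambda ip: {"country": "IT"},
    "ipinfo_company": lambda ip: {"name": "Example Corp"},
}

event = {"clientip": "217.220.0.1", "status": "200"}
enriched = enrich(event, lookups)
```

The original event fields are preserved, and each database contributes exactly one parent field, which mirrors the `ipinfo_country` and `ipinfo_company` fields described above.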

Caveat: Keeping IP databases updated

IP addresses change location, ASNs, and privacy flags all the time. That is why IPinfo’s IP database is updated regularly. It is necessary to have the latest data available at your disposal. For that reason, you need to update your IP database regularly.

However, in this tutorial, we only cover the upload mechanism to Cribl’s lookup library for the lookup operation. We did not discuss the scheduled operation that is required to keep the data fresh and up to date. You need to regularly upload IPinfo’s database into the lookup library to keep your pipeline operation generating fresh data.

IPinfo’s database can be easily downloaded through its storage URI. The storage system simply accepts the IPinfo authentication token and the database filename to allow users to download their IP database.

So, to download IPinfo’s free IP to Country ASN database, the URI/API command is as follows:

curl -L "https://ipinfo.io/data/free/country_asn.mmdb?token=<YOUR_TOKEN>" -o country_asn.mmdb

You can also use wget or any tool you prefer. You can also stream the output to a file.
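
If you prefer to script the download, for example from a scheduled job, the same request can be sketched in Python with only the standard library. The `build_url` and `download_database` helpers are hypothetical names, the URL pattern is taken from the curl command above, and the sketch covers the download only, not the upload into Cribl’s lookup library.

```python
import os
import tempfile
import urllib.parse
import urllib.request

# Base URL taken from the curl command in this tutorial.
BASE_URL = "https://ipinfo.io/data/free/"


def build_url(filename, token):
    """Build the download URL for a database file and an IPinfo token."""
    query = urllib.parse.urlencode({"token": token})
    return f"{BASE_URL}{filename}?{query}"


def download_database(filename, token, dest_dir="."):
    """Download to a temporary file, then atomically replace the old copy."""
    url = build_url(filename, token)
    dest = os.path.join(dest_dir, filename)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        # urlopen follows redirects by default, similar to curl -L.
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            while chunk := resp.read(1 << 20):
                tmp.write(chunk)
        os.replace(tmp_path, dest)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return dest


# Example usage (needs network access and a real token):
# download_database("country_asn.mmdb", os.environ["IPINFO_TOKEN"])
```

Writing to a temporary file first means a failed or partial download never overwrites the last good copy of the database.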

Even though the pipeline and function remain the same and require a one-time configuration step, you need to keep the underlying data up to date.

Looking ahead

Using IPinfo’s data in Cribl is as simple as that. It is just a one-time setup, and users can inject IP intelligence information into logs through the central hub of Cribl. To start, you can use our free IP databases today and bring the country and ASN data to your log data inside Cribl.

Considering the limitations of keeping IP data fresh in Cribl and adding more IP-first functionality across Cribl, we hope to hear feedback from the Cribl and IPinfo community at large. Cribl offers rich functionality, and IPinfo offers the best-in-class IP data out there. So, please let us know your thoughts and feedback on how we can create a robust collaboration between the two platforms.
