Freshness of data: API vs Data Downloads, which one is more "latest"?

Summary: Our data downloads are fresher than the API.

We have to maintain two services: the API (along with our web infrastructure) and the data downloads (our data products).

Both systems are quite complicated in their own right, and their processes are kept separate. Our API and web infrastructure handle billions of requests per day, which adds up to trillions of requests per year. Maintaining a web infrastructure that demands stability and low latency is not an easy task.

To keep it that way, we launch a fresh deployment of our API and web services every 24 hours. This deployment includes caching and all the infrastructure-building processes that make the service reliable and scalable. Our API services also include all of our production data products, as the API is built on top of them.

On the data product side, we have several products. They are not produced at a fixed time; rather, each one is produced at a different time of day as it passes through our data pipeline.

For example, our free IP to Country ASN database is a merged dataset of our free IP to ASN dataset and free IP to Country dataset. Our free IP to ASN dataset is a subset of our premium ASN dataset, and the IP to Country dataset is a subset of our IP to Geolocation dataset. So, to produce the IP to Country ASN database, we have to wait until the previous steps and databases are created in the data pipeline.
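The dependency chain above can be sketched as a small graph. This is only an illustration, not our actual pipeline: the dataset labels below are shorthand made up for the example, and Python's standard-library `graphlib` is used just to show why the merged database has to be built last.

```python
from graphlib import TopologicalSorter

# Hypothetical labels for the datasets named above. Each key maps to the
# datasets that must already exist before it can be built.
DEPENDENCIES = {
    "free_ip_to_asn": {"premium_asn"},
    "free_ip_to_country": {"ip_to_geolocation"},
    "free_ip_to_country_asn": {"free_ip_to_asn", "free_ip_to_country"},
}


def build_order(dependencies):
    """Return one valid build order in which every dataset follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(DEPENDENCIES)
print(order)
```

Whatever order the sorter picks for the independent branches, the merged IP to Country ASN database always comes out last, because it cannot start until both of its inputs are finished.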

Individual data products are pushed directly to our storage buckets as soon as they are finished, so IPinfo users can access each fresh dataset as soon as it has finished building. There is no delay or fixed time for data updates, and we do not push data products in batches. Each product is pushed when it is built and is immediately accessible to our users.

When you use our data products, you have direct access to the latest data. But as our API services are “built” on a periodic basis, even if a newer data product is available, we cannot immediately incorporate it into our API service. Data from the data products is brought into the API and cached when the API is built.
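To make the lag concrete, here is a rough sketch. It assumes one API build every 24 hours starting at a fixed time, and that a build picks up whatever data product version is already published when the build starts. The function names and timestamps are invented for the example, and the time the build itself takes is ignored.

```python
from datetime import datetime, timedelta

BUILD_INTERVAL = timedelta(hours=24)  # deployment cadence described above


def next_build_after(published_at, first_build):
    """Return the first API build start at or after `published_at`."""
    if published_at <= first_build:
        return first_build
    elapsed = published_at - first_build
    # Ceiling division on timedeltas: whole intervals needed to reach the publish time.
    intervals = -(-elapsed // BUILD_INTERVAL)
    return first_build + intervals * BUILD_INTERVAL


def api_lag(published_at, first_build):
    """How long a freshly published data product waits before an API build picks it up."""
    return next_build_after(published_at, first_build) - published_at


# Hypothetical timestamps: builds start at midnight, dataset published at 06:00.
lag = api_lag(datetime(2024, 1, 1, 6, 0), datetime(2024, 1, 1, 0, 0))
print(lag)
```

Under these assumptions, a dataset that lands in the storage bucket six hours after a build starts is downloadable right away, but the API only picks it up at the next build, 18 hours later.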

This delay or lag is a compromise on the API side that we have to make to ensure reliable and low-latency services. We are continuously working on reducing this lag time while ensuring reliability and speed.

The API lags behind Data Downloads because of the architectural decisions we’ve made to prioritise API speed and reliability over freshness of data. Parity between the data available via the API and Data Downloads is something we’re working towards while maintaining our reliability and speed.

-Sam, Software Engineer