In the past we have experimented with delivering updates as deltas, and we continue to think about real-time or streaming data flow. However, the benefits do not always justify the costs. Delta and real-time updates introduce challenges that often require end users to build highly customized systems, adding complexity with limited upside in many use cases.
Delta Changes
The first consideration with delta changes is the binary data format. All our downloadable databases are delivered in binary form. There are two categories of files users can download today:
- Binary by design: MMDB and Parquet
- Binary by compression: compressed archives of CSV and JSON files
Providing a delta or diff file for MMDB and Parquet is not practical. MMDB uses a binary search tree where node offsets are computed at build time, and Parquet organizes data into row groups with column-wise compression and per-group statistics. Small logical changes can produce large differences in the binary output. We could ship a logical delta as a list of changed ranges, but the customer would then need to run a build step to apply it, which shifts the complexity to their side.
CSV and JSON are plaintext formats. Even though we can provide a delta file, applying it is not easy. The user has to:
- Decompress the original CSV/JSON file
- Decompress the delta/diff file
- Run a merge process to update the original file based on the delta
Decompression is cheap. The merge is where the real work sits. Compared to just using the original file, a delta-based process introduces additional steps that make the update time unpredictable for both us and the developer. The current full-update mechanism is predictable, since you know the dataset size upfront.
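To make the three steps above concrete, here is a minimal sketch of what a user-side merge might look like. It is not something we ship: the delta layout (an `op` column holding `add` or `remove`), the `range` and `country` column names, and the `apply_delta` function are all hypothetical, chosen only for illustration.

```python
import csv
import gzip
import io
import ipaddress


def apply_delta(base_gz: bytes, delta_gz: bytes) -> bytes:
    """Apply a hypothetical add/remove delta to a gzipped CSV of IP ranges."""
    # Step 1: decompress the original file and index its rows by range.
    base_text = gzip.decompress(base_gz).decode("utf-8")
    base_reader = csv.DictReader(io.StringIO(base_text))
    fieldnames = base_reader.fieldnames
    rows = {row["range"]: row for row in base_reader}

    # Step 2: decompress the delta file.
    delta_text = gzip.decompress(delta_gz).decode("utf-8")
    delta_reader = csv.DictReader(io.StringIO(delta_text))

    # Step 3: merge. Every change touches the in-memory copy of the full file.
    for change in delta_reader:
        op = change.pop("op")  # assumed column: "add" or "remove"
        if op == "remove":
            rows.pop(change["range"], None)
        elif op == "add":
            rows[change["range"]] = change
        else:
            raise ValueError(f"unknown delta operation: {op!r}")

    def sort_key(row):
        # Restore range order so the output is usable as a lookup file.
        net = ipaddress.ip_network(row["range"])
        return (net.version, int(net.network_address), net.prefixlen)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sorted(rows.values(), key=sort_key))
    return gzip.compress(out.getvalue().encode("utf-8"))
```

Even in this simplified form, the whole original file has to be loaded, modified, re-sorted, and recompressed, so the running time depends on both the dataset size and the delta size rather than on the download alone.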
Delta size can also be significant. An IP database uses IP ranges as the index column, so changes within a single range often cascade to many others. For example, a /23 geolocated in one country today may be split into two /24s in different countries tomorrow. That single logical change becomes one removal and two additions. During larger reclassification events, the delta size can approach the size of a full download, which reduces the bandwidth benefit users typically expect from a delta-based approach.
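The /23 example can be worked through with Python's standard `ipaddress` module. This is only an illustrative sketch: the `diff_ranges` helper, the 10.0.0.0/23 range, and the country codes are made up for the example and do not reflect real data.

```python
import ipaddress


def diff_ranges(old: dict, new: dict):
    """Return (removals, additions) between two {network: country} snapshots."""
    old_set, new_set = set(old.items()), set(new.items())
    removals = sorted(old_set - new_set)
    additions = sorted(new_set - old_set)
    return removals, additions


# Today: a single /23 geolocated in one country (illustrative values).
parent = ipaddress.ip_network("10.0.0.0/23")
today = {parent: "US"}

# Tomorrow: the same address space split into two /24s in different countries.
low, high = parent.subnets(prefixlen_diff=1)
tomorrow = {low: "US", high: "CA"}

removals, additions = diff_ranges(today, tomorrow)
print(len(removals), "removal:", removals)
print(len(additions), "additions:", additions)
```

Only half of the addresses actually changed country, yet the delta carries three records, because the range itself is the key and the old key no longer exists.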
Our IP databases update frequently, so a delta-based update mechanism does not scale well for the data download product as it stands today.
For users with bandwidth limitations, we recommend our API service. IPinfo Lite provides unlimited API queries, and for enterprise customers we can arrange millions or even trillions of requests per day or per month across any of our API products.
Real-Time / Streaming Updates
Real-time delivery is something we are open to exploring for specific data types where it makes sense. Carrier and mobile IP data is a good candidate, since those changes can be timely and high-signal.
For the full database, real-time is harder. Our pipeline ingests data from many sources, including our own internet measurement efforts, public datasets, and other third-party datasets. Each source has its own update cadence. We aggregate these sources, run a ranking operation on the “location hints,” and select the locations that demonstrate the highest potential to be accurate and reliable. Last time I checked, we process more than 70 different data sources for our IP location data alone.
Because our current model is aggregate, evaluate, and deliver, full real-time updates are not something we can offer today.
Moving more frequent updates into our delivery pipeline is on our radar. The shape it takes will likely depend on the product (API vs. database download) and the data type.