MMDB Performance, and LMDB for High-QPS Lookups in the Resproxy Database

In general, we recommend using the MMDB file format: it is a fast, efficient, and widely supported binary format designed specifically for IP databases. For the resproxy database, however, even though an MMDB build exists, we recommend an alternative solution.

What is an MMDB file format?

MMDB is a binary, file-based database format used to store IP address data.
An MMDB file uses IP addresses as the index, meaning every lookup starts by converting an IPv4 or IPv6 address into binary form and walking a radix tree based on those bits.

Each bit determines which branch to follow, and this process continues until the traversal reaches a node that represents the IP’s range. Because the tree is arranged by binary prefixes, IP networks naturally fit into it, making lookups extremely fast and efficient.

           [root]
           /    \
      192.0.2   198.51.100
       /           \
  192.0.2.0/24    198.51.100.0/24

MMDB’s challenge with non-contiguous IP addresses

The data stored in an MMDB is organized by IP ranges, not individual addresses. In a large continuous block of IP space, the database can store that entire range with a single pointer to a data record. This keeps the structure compact, reduces the number of nodes, and minimizes redundancy.

In an MMDB radix tree, the traversal works bit by bit. Each IP address is converted to its binary form. Starting from the root node:

  1. Examine the next bit of the IP.
  2. Choose a branch based on the bit: typically left for 0 and right for 1.
  3. Move to the next node.
  4. Repeat steps 1–3 until you reach a node that contains the pointer to the data for that IP range.

Each node may represent a prefix of bits, so longer prefixes represent smaller, more specific IP ranges. This structure allows fast lookups because the tree branches reduce the search space exponentially at each bit.
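The traversal steps above can be sketched with a minimal binary trie. This is an illustrative simplification (it is not the real MMDB node layout, and the stored labels are placeholders), but it shows how each bit of the address picks a branch until a node holding data is reached:

```python
import ipaddress

# Minimal binary-trie sketch of the bit-by-bit lookup described above.
# NOTE: simplified illustration, not the actual MMDB on-disk format.
class Node:
    def __init__(self):
        self.children = [None, None]  # branch 0 and branch 1
        self.data = None              # set when this node covers a range

def insert(root, network, data):
    """Walk the network's prefix bits and attach data at the final node."""
    bits = int(network.network_address)
    node = root
    for i in range(network.prefixlen):
        bit = (bits >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.data = data

def lookup(root, ip):
    """Follow the IP's bits; remember the deepest node that holds data."""
    bits = int(ipaddress.IPv4Address(ip))
    node, found = root, None
    for i in range(32):
        bit = (bits >> (31 - i)) & 1
        if node.children[bit] is None:
            break                      # no more specific branch to follow
        node = node.children[bit]
        if node.data is not None:
            found = node.data          # longest matching prefix so far
    return found

root = Node()
insert(root, ipaddress.ip_network("192.0.2.0/24"), "doc-net-1")
insert(root, ipaddress.ip_network("198.51.100.0/24"), "doc-net-2")
print(lookup(root, "192.0.2.57"))    # any IP in the /24 resolves to one record
print(lookup(root, "203.0.113.9"))   # no covering range -> None
```

Note that a single insert of a /24 serves lookups for all 256 addresses in that block, which is exactly why range-based data suits this structure so well.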

In an MMDB, the tree is optimized for ranges. Each node can represent an entire continuous block of IPs (like a /24 or /16) with one pointer to a single data record. This is highly efficient for storage and lookups.

The problem with storing individual IPs instead of ranges is:

  1. Even if the addresses are consecutive, each IP gets a separate branch in the radix tree.
  2. With millions of individual IPs, the number of nodes skyrockets.
  3. Each node has a pointer, plus metadata overhead. Storing individual IPs rather than ranges creates a massive file.
  4. Lookups become slightly slower: The tree is deeper and has more nodes to traverse, although still logarithmic.

In short, MMDB works best when contiguous IPs are stored as ranges. Storing each IP individually is inefficient and unnecessarily bloats the file.
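Python's standard ipaddress module makes this contrast easy to see: contiguous individual IPs collapse into one compact range, while scattered IPs cannot be merged (the addresses below are documentation examples, not real resproxy data):

```python
import ipaddress

# 256 consecutive /32 entries cover exactly one /24 block,
# so they collapse into a single range record
contiguous = [ipaddress.ip_network(f"203.0.113.{i}/32") for i in range(256)]
print(list(ipaddress.collapse_addresses(contiguous)))
# a single network: 203.0.113.0/24

# Non-adjacent entries cannot be merged and stay as separate /32 records,
# which is the situation the resproxy database is in
scattered = [
    ipaddress.ip_network("198.51.100.7/32"),
    ipaddress.ip_network("198.51.100.9/32"),
    ipaddress.ip_network("203.0.113.200/32"),
]
print(list(ipaddress.collapse_addresses(scattered)))
# three separate /32 networks
```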

Structure of the residential proxy database

The IP to Residential Proxy Detection database currently stores individual IP addresses in the ip column rather than aggregating them into ranges. While this structure works perfectly in formats like Parquet, it creates a major inefficiency when converting to an MMDB file.

The MMDB format is designed to be optimized for IP ranges, using a radix tree where each node can represent a continuous block of IPs with a single pointer to a data record. Because the resproxy database uses individual IP entries, every single IP creates its own node in the radix tree, even if consecutive IPs share identical metadata. This leads to:

  • Explosion of nodes in the tree
  • Redundant storage of metadata for many IPs
  • Significantly larger MMDB file size compared to Parquet

Metric    | Parquet File | MMDB File                | Explanation
----------|--------------|--------------------------|------------
Entries   | 120,064,249  | 120,064,249              | Same number of IPs, individual entries stored
Structure | Columnar     | Radix tree + binary data | Parquet is optimized for tabular storage; MMDB is optimized for ranges, not individual IPs
File size | 1.18 GB      | 24.22 GB                 | MMDB stores each IP as a separate node with its own pointer and metadata, inflating the size

How to query the Resproxy database

Depending on your environment and workload, there are multiple approaches to querying this dataset, ranging from in-memory structures like hashmaps and dictionaries to lightweight analytical databases like DuckDB, and high-performance, disk-backed solutions such as LMDB.

Each method offers different trade-offs in terms of speed, memory usage, and ease of deployment, which we will explore in detail.

Option 1: HashMap in Go

A hashmap in Go is one of the fastest ways to query the ResProxy database if your goal is real-time IP → metadata lookups. In this approach, each IP address is stored as a key in a Go map, with the associated metadata (such as service, last_seen, and percent_days_seen) stored as the value. Lookups are performed in O(1) average time, which makes this approach ideal for high-throughput API services where latency must be minimized.

Pros:

  • Ultra-fast lookups: Go’s native map provides constant-time, in-memory access, even with millions of entries.
  • Strong typing: Metadata can be stored in structs, ensuring consistent and safe access.
  • Simplicity: Easy to implement, no external dependencies required.

Cons:

  • Memory-intensive: All 120M+ entries must fit in RAM, including metadata. This can require tens of gigabytes depending on the metadata structure.
  • Persistence: Maps exist only in memory, so the dataset must be loaded from a file (e.g., Parquet or CSV) at service startup.
  • Scaling: For distributed systems, additional mechanisms (like sharding or caching) are needed to share the dataset across multiple nodes.

Option 2: Dictionary in Python

A Python dictionary (dict) provides a simple and effective way to query the ResProxy database. Each IP address is stored as a key, with its associated metadata (service, last_seen, percent_days_seen) as the value. Python dictionaries are hashmaps under the hood, offering average O(1) lookup time, making them ideal for high-speed, read-heavy queries on datasets that fit into memory.

Pros:

  • Simple to implement: Python’s dict is native and requires no additional libraries for basic functionality.
  • Fast lookups: Even for millions of IPs, dictionary access is very quick.
  • Flexible: Metadata can be stored as dictionaries or custom objects, supporting rich data structures.

Cons:

  • Memory usage: Storing 120M+ IPs with metadata can require tens of gigabytes of RAM.
  • Persistence: Dictionaries exist only in memory. You need to load the data from a file (CSV, Parquet, etc.) at startup.
import pyarrow.parquet as pq

# Load the Parquet file
table = pq.read_table('resproxy.parquet')

# Convert to a dictionary: key = IP, value = metadata dict
ip_map = {
    row['ip']: {
        'service': row['service'],
        'last_seen': row['last_seen'],
        'percent_days_seen': row['percent_days_seen']
    }
    for row in table.to_pylist()  # to_pylist() already yields plain Python values
}

print(f"Loaded {len(ip_map)} IPs into memory.")

# Example lookup
ip_to_lookup = '38.222.31.85'
metadata = ip_map.get(ip_to_lookup)
if metadata:
    print(f"IP: {ip_to_lookup}\nService: {metadata['service']}\n"
          f"Last Seen: {metadata['last_seen']}\nPercent Days Seen: {metadata['percent_days_seen']}%")
else:
    print(f"IP {ip_to_lookup} not found in the database.")
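The "tens of gigabytes" memory cost noted above can be sanity-checked with a rough per-entry estimate. This is a sketch using shallow sys.getsizeof figures for one representative entry (the metadata values below are hypothetical placeholders, and real totals vary by Python version and string contents):

```python
import sys

# One representative entry; values are hypothetical placeholders
sample_key = '38.222.31.85'
sample_value = {
    'service': 'example-proxy-service',
    'last_seen': '2024-01-01',
    'percent_days_seen': 97,
}

# Shallow size of key, value dict, and the value dict's own keys/values
per_entry = (
    sys.getsizeof(sample_key)
    + sys.getsizeof(sample_value)
    + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in sample_value.items())
)

total_entries = 120_064_249
estimated_gb = per_entry * total_entries / 1024**3
print(f"~{per_entry} bytes per entry -> roughly {estimated_gb:.0f} GB "
      "before outer-dict overhead")
```

Even this optimistic lower bound lands well above 10 GB, before counting the outer dict's hash-table slots, so the RAM caveat applies to any in-memory map of this dataset.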

Option 3: DuckDB

DuckDB is an in-process, columnar analytical database optimized for OLAP-style queries and large datasets. Unlike in-memory hashmaps, DuckDB allows you to query the ResProxy database directly from Parquet files without fully loading all data into memory. Each IP is stored as a row, and metadata such as service, last_seen, and percent_days_seen can be queried efficiently using SQL.

Pros:

  • Handles large datasets: You don’t need to load all 120M IPs into memory; DuckDB reads only the necessary columns and rows.
  • SQL interface: Easy to filter, aggregate, and join data using familiar SQL syntax.
  • Columnar storage: Only columns used in a query are read, which improves speed and reduces memory usage.
  • No server required: Embedded, single-file database — great for scripts or local analysis.

Cons:

  • Single-key lookups slower than in-memory structures: While DuckDB is fast, an individual IP lookup will be slower than a hashmap or LMDB.
  • Best for batch queries or analytics: Less ideal for microsecond-level real-time lookups.
import duckdb

# ===== CONFIGURATION =====
PARQUET_FILE = 'resproxy.parquet'
DB_FILE = 'resproxy.duckdb'

# ===== CONNECT TO DUCKDB =====
con = duckdb.connect(DB_FILE)

# ===== CREATE TABLE =====
# Use TEXT for IPs but add an index for fast lookups
con.execute("""
CREATE TABLE IF NOT EXISTS resproxy (
    ip TEXT,
    service TEXT,
    last_seen DATE,
    percent_days_seen INTEGER
)
""")

# ===== INGEST PARQUET FILE =====
con.execute(f"""
INSERT INTO resproxy
SELECT ip, service, last_seen, percent_days_seen
FROM read_parquet('{PARQUET_FILE}')
""")

# ===== CREATE INDEX ON IP =====
con.execute("""
CREATE INDEX IF NOT EXISTS idx_ip ON resproxy(ip)
""")

print("Data ingested and index created successfully.")

# ===== EXAMPLE LOOKUP =====
def lookup_ip(ip_address):
    # Use a parameterized query to avoid SQL injection
    result = con.execute(
        "SELECT * FROM resproxy WHERE ip = ?", [ip_address]
    ).fetchall()
    return result

# Example usage
ip_to_lookup = '38.222.31.85'
metadata = lookup_ip(ip_to_lookup)
if metadata:
    print(metadata)
else:
    print(f"IP {ip_to_lookup} not found.")

# ===== CLOSE CONNECTION =====
con.close()

Option 4: LMDB (Recommended)

Considering the nature of the resproxy detection database with non-contiguous standalone/individual IP addresses (unlike contiguous IP addresses in the form of CIDRs), LMDB as a KV database performs solidly. It strikes a balance between file size, RAM consumption, and lookup performance (QPS).

However, please note that MMDB is the industry-standard file format for IP databases. For that reason, we have not yet validated LMDB as a production file format.

LMDB (Lightning Memory-Mapped Database) is a high-performance, disk-backed key-value store optimized for read-heavy workloads. In the ResProxy database scenario, each IP address can be stored as the key, with its metadata (service, last_seen, percent_days_seen) as the value.

LMDB’s memory-mapped B+ tree structure enables near in-memory read speeds, even with 120M+ IPs, making it the best solution for large-scale, read-only IP lookups.

The author of the LMDB project said (https://youtu.be/tEa5sAh-kVk?si=SqIWkFHsJHVzge2w&t=238):

“This is not an exaggeration. It is the world’s fastest for all read operations.”

Pros:

  • Extremely fast read performance: Memory-mapped structure allows O(log N) access with minimal overhead.
  • Memory-efficient: Only accessed portions are loaded into memory; OS handles caching.
  • Read-only optimization: Supports multiple concurrent readers with zero write contention.
  • Persistent storage: No need to load the entire dataset into RAM.
  • Scalable: Works well with datasets larger than available RAM.

Cons:

  • Single-writer limitation: Only one writer allowed at a time (irrelevant for read-only use).
  • Initial import needed: Must preprocess Parquet into LMDB once.
  • Slightly more complex setup: Requires LMDB library and serialization of metadata.

This Python script efficiently converts a large Parquet dataset into an LMDB database optimized for high-speed point lookups. It reads the Parquet file row group by row group to avoid memory spikes, converts columns to NumPy arrays for fast iteration, and stores each IP as the LMDB key with the remaining fields concatenated into a UTF-8 string as the value. Batched LMDB transactions ensure fast, memory-efficient writes, making this approach suitable for very large datasets while keeping lookup operations extremely fast.

Key benefits:

  • Memory-efficient: Processes one row group and batch at a time.
  • Fast ingestion: Minimal Python overhead with string concatenation instead of dictionaries.
  • Optimized for lookups: IP addresses as keys allow high-QPS point queries.
  • Safe for large datasets: LMDB map size and batching prevent memory exhaustion.
import lmdb
import pyarrow.parquet as pq
import numpy as np

# -------------------------------
# Configuration
# -------------------------------
LMDB_FILE = "resproxy.lmdb"
PARQUET_FILE = "resproxy.parquet"
MAP_SIZE = 16 * 1024**3       # 16 GB map size
BATCH_SIZE = 50_000            # batch size for LMDB writes

# -------------------------------
# LMDB environment
# -------------------------------
env = lmdb.open(
    LMDB_FILE,
    map_size=MAP_SIZE,
    subdir=False,
    writemap=True,
    map_async=True,
    sync=False
)

# -------------------------------
# Generator for batch insertion
# -------------------------------
def kv_batches(table, batch_size=BATCH_SIZE):
    # Materialize PyArrow columns as NumPy object arrays for fast slicing
    ip_arr = np.array(table.column("ip").to_pylist(), dtype=object)
    svc_arr = np.array(table.column("service").to_pylist(), dtype=object)
    last_arr = np.array(table.column("last_seen").to_pylist(), dtype=object)
    pct_arr = np.array(table.column("percent_days_seen").to_pylist(), dtype=object)
    
    total = len(ip_arr)
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        keys = [ip.encode("utf-8") for ip in ip_arr[start:end]]
        # Store values as concatenated string for speed and memory efficiency
        values = [f"{svc}~{last}~{pct}".encode("utf-8") 
                  for svc, last, pct in zip(svc_arr[start:end], last_arr[start:end], pct_arr[start:end])]
        yield keys, values

# -------------------------------
# Read Parquet row groups and insert
# -------------------------------
pq_file = pq.ParquetFile(PARQUET_FILE)
print(f"Parquet file has {pq_file.num_row_groups} row groups")

for rg_index in range(pq_file.num_row_groups):
    table = pq_file.read_row_group(rg_index)
    print(f"Inserting row group {rg_index + 1}/{pq_file.num_row_groups}")
    
    for keys, values in kv_batches(table):
        txn = env.begin(write=True)
        for k, v in zip(keys, values):
            txn.put(k, v)
        txn.commit()

env.sync()
env.close()
print(f"LMDB database written to {LMDB_FILE}")

Once the LMDB database is built, querying by IP is extremely fast because each IP is stored as the key. Values are stored as concatenated UTF-8 strings (service~last_seen~percent_days_seen) to minimize memory usage and maximize read speed. During reads, the value can be split to reconstruct the original data. This approach allows high-QPS point lookups without loading the entire database into memory.

Read workflow:

  • Open the LMDB environment in read-only mode to allow multiple concurrent readers.
  • Begin a transaction for safe, consistent reads.
  • Lookup by IP key directly, which is extremely fast due to LMDB’s memory-mapped structure.
  • Decode the UTF-8 string and split fields to reconstruct a dictionary if needed.
import lmdb

env = lmdb.open("resproxy.lmdb", readonly=True, subdir=False)
with env.begin() as txn:
    value = txn.get(b"1.0.90.227")  # lookup by IP
    if value:
        service, last_seen, percent_days_seen = value.decode("utf-8").split("~")
        record = {
            "service": service,
            "last_seen": last_seen,
            "percent_days_seen": percent_days_seen
        }
        print(record)
env.close()

Filesize results:

  • NDJSON (Uncompressed): 13 GB
  • Parquet: 1 GB
  • MMDB: 22 GB
  • LMDB: 16 GB (Compressed to .gz for file transfer ~2 GB)

How LMDB stacks up to MMDB

MMDB is optimized for CIDR-based, neighboring IP ranges, so when storing millions of isolated IPs, its trie structure becomes sparse and leads to large files and high memory overhead. LMDB is a compact key–value B-tree that stores each IP as an independent key, making it far more efficient for non-adjacent records.

MMDB lookups may require more RAM because parts of the tree must be memory-mapped, while LMDB is designed for low-RAM, high-QPS point lookups with zero-copy reads. For datasets of individual IPs, LMDB provides smaller memory footprint, faster reads, and more stable performance.
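The "concatenated string" value format used throughout this design is the main serialization choice worth noting: it trades schema flexibility for speed. A minimal round-trip sketch of that encoding (field names and the `~` delimiter are taken from the ingestion script above; the sample values are hypothetical):

```python
# Encode: join the metadata fields with "~", as the ingestion script does
def encode(service, last_seen, percent_days_seen):
    return f"{service}~{last_seen}~{percent_days_seen}".encode("utf-8")

# Decode: split the stored bytes back into the three fields
def decode(value):
    service, last_seen, pct = value.decode("utf-8").split("~")
    return {
        "service": service,
        "last_seen": last_seen,
        "percent_days_seen": pct,
    }

raw = encode("example-proxy-service", "2024-01-01", 97)
print(raw)          # compact bytes, no per-field key overhead
print(decode(raw))  # fields reconstructed on read
```

One caveat of this scheme: it assumes none of the field values contain the `~` delimiter; a length-prefixed or msgpack-style encoding would be safer if that cannot be guaranteed.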

Third-party benchmark

Metric                    | SQLite  | MMDB    | BBolt    | LMDB
--------------------------|---------|---------|----------|--------
Raw DB size               | 4.6 GB  | 17.7 GB | 12.24 GB | 17 GB
Compressed DB size (.zst) | 1.12 GB | 4 GB    | 1.27 GB  | 1.06 GB
Latency (P99, µs)         | 1,051   | 608     | 8        | 3
Latency (avg, µs)         | 142.1   | 36.2    | 6.3      | 2.8

(Benchmark Criteria: SampleKeys = 60,000, TotalOps = 500,000, Concurrency = 16)

Future improvements to the LMDB solution

The current LMDB solution is a proof of concept and is pending official benchmarking and optimization by our data and engineering team. It was designed as a foundational script to be optimized incrementally.

If you have any ideas or feedback, please don’t hesitate to share them with us.

Big thanks to David Belson for the correction.