When working with very large IP datasets, the choice of storage format matters. In this article, we compare LMDB (Lightning Memory-Mapped Database) and MMDB, with a focus on a real-world scenario: hundreds of millions of non-contiguous IP addresses accessed from Python.
While both systems are fast and memory-efficient, they are optimized for very different data shapes. Those differences become critical at scale.
What is LMDB?
LMDB is a general-purpose key–value database built on top of a B+ tree and backed by memory-mapped files.
Key characteristics:
- Arbitrary byte keys and values (IP address → metadata blob)
- Memory-mapped I/O via mmap
- Lock-free reads
- Copy-on-write writes
- Excellent random read performance
Each IP is stored as an independent key.
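One way to realize this mapping, sketched here with Python's stdlib ipaddress module (the ingestion script later in this article stores the textual form instead), is to pack each address into a fixed-width big-endian key. Big-endian packing makes bytewise comparison match numeric comparison, so LMDB's sorted B+ tree keeps numerically adjacent IPs next to each other:

```python
import ipaddress

def ip_key(ip_str: str) -> bytes:
    """Pack an IP into a fixed-width big-endian key: 4 bytes for IPv4, 16 for IPv6.

    Big-endian byte order preserves numeric ordering under LMDB's default
    bytewise key comparison, so neighboring addresses share B+ tree pages.
    """
    ip = ipaddress.ip_address(ip_str)
    return int(ip).to_bytes(4 if ip.version == 4 else 16, "big")

print(ip_key("1.2.3.4").hex())  # → 01020304
```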
What is MMDB?
MMDB is a specialized, read-only database format designed specifically for IP address lookups.
Key characteristics:
- Binary radix trie structure (IP address → trie traversal → record)
- Optimized for CIDR ranges
- Highly compact representation
- Immutable once built
MMDB is extremely efficient when IPs can be represented as large contiguous CIDR ranges, for example:
1.0.0.0/8 → Record A
2.0.0.0/7 → Record B
The Challenge with Non-Contiguous IPs
When IPs are sparse and non-contiguous, MMDB must:
- Split data into many small CIDR blocks
- Create deeper and more fragmented trie paths
- Duplicate references across multiple nodes
- Perform more pointer chasing during lookups
The result:
- Larger database files
- Deeper lookup paths
- Poorer CPU cache locality
- More branch mispredictions
MMDB’s compression advantage largely disappears, and lookup cost increases.
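How quickly CIDR compression breaks down is easy to demonstrate. This small illustrative experiment (synthetic data, not the article's dataset) uses Python's stdlib ipaddress module to collapse a random, sparse set of addresses into the minimal covering CIDR blocks—almost every address ends up as its own /32:

```python
import ipaddress
import random

random.seed(42)
# A sparse, non-contiguous sample: 10,000 random IPv4 addresses
sparse = {ipaddress.IPv4Address(random.getrandbits(32)) for _ in range(10_000)}

# Collapse into the minimal set of CIDR blocks covering exactly these IPs
blocks = list(ipaddress.collapse_addresses(
    ipaddress.ip_network(f"{ip}/32") for ip in sparse
))

# With no contiguity to exploit, the trie needs roughly one block per address
print(f"{len(sparse):,} IPs -> {len(blocks):,} CIDR blocks")
```

Contrast this with contiguous data, where a whole /8 collapses into a single block; sparse data gives the trie nothing to merge.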
LMDB with Non-Contiguous IPs
LMDB treats every IP as a first-class key. For IPv4, the mapping is uint32(IP) → value, and for IPv6 it’s uint128(IP) → value.
Why LMDB Performs Better for Sparse IPs
Fewer memory jumps. MMDB relies on pointer-heavy traversal. LMDB uses page-aligned access that benefits from OS-level prefetching.
Better CPU cache behavior. B+ tree nodes align well with cache lines and avoid unpredictable branching.
No range fragmentation penalty. MMDB performance degrades as CIDR compression breaks down. LMDB performance does not change.
Shorter lookup path. MMDB may need up to 32 bit-level steps for IPv4 and 128 for IPv6 in the worst case. LMDB's B+ tree reaches any key in O(log N) page reads, typically just a few levels even for hundreds of millions of keys.
MMDB can still be smaller on disk when data compresses well into ranges. But with 111 million non-contiguous IPs, that advantage shrinks or disappears, while lookup cost becomes the dominant factor.
Which File Format to Choose: Parquet, CSV (Gzipped), or JSON (Gzipped)?
Parquet is the clear winner for this use case.
Faster reads. Parquet is columnar and binary. Python libraries like pyarrow can read it much faster than gzipped CSV.
Lower CPU cost. Gzipped CSV requires decompression and text parsing. Parquet avoids expensive string parsing.
Better memory control. You can read Parquet in batches efficiently, which matters at 111M rows.
Schema-aware. Types are preserved. CSV forces you to re-infer and cast everything.
Smaller size. In our tests, Parquet came in at 916 MB vs 987 MB for gzipped CSV—and the real win is read efficiency, not just disk size.
Understanding LMDB Map Size and Preallocation
When working with LMDB, one of the first things you notice is that the database files on disk can be surprisingly large—sometimes much bigger than the actual data you stored. For example, even if your dataset only contains a few gigabytes of IP addresses and metadata, the LMDB file might show 10–12GB on disk. Why is that?
This happens because of LMDB’s preallocation strategy:
Fixed maximum size. When you create an LMDB environment, you must specify a map_size—the maximum size the database can grow to. LMDB reserves this space in the file upfront.
Memory-mapped storage. LMDB maps this file into virtual memory. The OS handles page caching, so even though the file is “big” on disk, LMDB does not load all of it into RAM. Only the pages you actually access are used in memory.
No dynamic growth. LMDB cannot automatically increase the map size. If your dataset grows beyond the current limit, you’ll encounter a MapFullError. You can, however, manually increase the map size with env.set_mapsize(new_size) before continuing to write.
Why This Design Matters
Speed. Knowing the maximum size upfront allows LMDB to perform ACID-compliant writes extremely efficiently.
Simplicity. LMDB avoids complex resizing and locking logic, making it very robust.
Predictable memory usage. Even with large files, LMDB’s actual RAM usage stays low because only actively used pages are loaded.
Best Practices
- Estimate your dataset size and add 130–150% overhead to map_size. This ensures you don't hit MapFullError during ingestion.
- Compact after ingestion using mdb_copy -c to trim unused space.
- Monitor growth: if you add more data later, you can safely increase map_size manually.
Why Not Shard into Two LMDB Files: IPv4 and IPv6?
We don’t know the split between IPv4 and IPv6 ahead of time, so sharding into two LMDBs is hard because:
Dynamic MAP_SIZE depends on row count and average row size. If you preallocate one LMDB for IPv4 and one for IPv6 without knowing the proportions, one of them could hit MapFullError while the other has plenty of unused space.
Row size can vary between IPv4 and IPv6. IPv6 addresses are longer, so each row may use more bytes in LMDB. That makes MAP_SIZE calculation trickier.
We’re streaming the data. We’d have to either count IPv4/IPv6 first (a full scan) to estimate sizes, or over-provision both LMDBs heavily, wasting disk space.
Writing to LMDB: Efficient Bulk Ingestion from Parquet
Full Code:
```python
import lmdb
import pyarrow.parquet as pq
import os
import psutil

# -------------------------------
# Configuration
# -------------------------------
PARQUET_FILE = "resproxy.parquet"
LMDB_FILE = "resproxy.lmdb"
PARQUET_BATCH_SIZE = 100_000   # PyArrow mini-batch size
MAP_SIZE_OVERHEAD = 2.5        # 150% overhead for LMDB

# -------------------------------
# Estimate available RAM and dynamic LMDB batch size
# -------------------------------
available_ram = psutil.virtual_memory().available
# Budget 40% of free RAM, assuming ~500 bytes of Python overhead per buffered row
LMDB_BATCH_SIZE = int(available_ram * 0.4 / 500)
LMDB_BATCH_SIZE = max(50_000, min(LMDB_BATCH_SIZE, 1_000_000))
print(f"Dynamic LMDB batch size: {LMDB_BATCH_SIZE:,} rows")

# -------------------------------
# Estimate MAP_SIZE from Parquet file
# -------------------------------
pq_file = pq.ParquetFile(PARQUET_FILE)
total_rows = sum(pq_file.metadata.row_group(i).num_rows
                 for i in range(pq_file.num_row_groups))
sample_table = pq_file.read_row_group(0)
num_sample_rows = sample_table.num_rows
sample_size_bytes = sample_table.nbytes / num_sample_rows
estimated_map_size = int(total_rows * sample_size_bytes * MAP_SIZE_OVERHEAD)
print(f"Parquet total rows: {total_rows:,}")
print(f"Estimated LMDB map size: {estimated_map_size / (1024**3):.2f} GB")

# -------------------------------
# LMDB Environment
# -------------------------------
os.makedirs(os.path.dirname(LMDB_FILE) or ".", exist_ok=True)
env = lmdb.open(
    LMDB_FILE,
    map_size=estimated_map_size,
    subdir=False,
    writemap=True,
    map_async=True,
    sync=False,
    readahead=False,
)

# -------------------------------
# Helper: Encode value as bytes
# -------------------------------
def encode_value(service, last_seen, percent_days_seen):
    return f"{service}~{last_seen}~{percent_days_seen}".encode("utf-8")

# -------------------------------
# Stream Parquet using pq_file.iter_batches()
# -------------------------------
keys_batch = []
values_batch = []
lmdb_batches_written = 0
rows_written = 0

for batch_index, batch in enumerate(pq_file.iter_batches(batch_size=PARQUET_BATCH_SIZE), start=1):
    # Convert to numpy arrays safely
    ip_arr = batch.column("ip").to_numpy(zero_copy_only=False)
    svc_arr = batch.column("service").to_numpy(zero_copy_only=False)
    last_arr = batch.column("last_seen").to_numpy(zero_copy_only=False)
    pct_arr = batch.column("percent_days_seen").to_numpy(zero_copy_only=False)

    for ip, svc, last, pct in zip(ip_arr, svc_arr, last_arr, pct_arr):
        keys_batch.append(ip.encode("utf-8"))
        values_batch.append(encode_value(svc, last, pct))
        rows_written += 1

        if len(keys_batch) >= LMDB_BATCH_SIZE:
            with env.begin(write=True) as txn:
                for k, v in zip(keys_batch, values_batch):
                    txn.put(k, v)
            keys_batch.clear()
            values_batch.clear()
            lmdb_batches_written += 1
            print(
                f"Inserted ~{LMDB_BATCH_SIZE * lmdb_batches_written:,} rows "
                f"(Parquet batch {batch_index})"
            )

# Flush remaining rows
if keys_batch:
    with env.begin(write=True) as txn:
        for k, v in zip(keys_batch, values_batch):
            txn.put(k, v)
    lmdb_batches_written += 1
    print(f"Inserted remaining rows. Total LMDB batches: {lmdb_batches_written}")

env.sync(True)  # force a final flush, since the environment was opened with sync=False
env.close()
print(f"LMDB database written to {LMDB_FILE}, total rows: {rows_written:,}")
```
Ingesting large datasets into LMDB efficiently requires careful attention to batching, memory management, and map sizing. For our 100+ million IP dataset, we used PyArrow Parquet as the source format and LMDB as the target key-value store.
1. Estimate LMDB Map Size Dynamically
LMDB pre-allocates space based on the map_size parameter. Setting it too small causes MapFullError; setting it too large doesn’t waste RAM, but unnecessarily large maps can consume disk space.
We estimate the map size from the Parquet file:
```python
sample_table = pq_file.read_row_group(0)
sample_size_bytes = sample_table.nbytes / sample_table.num_rows
estimated_map_size = total_rows * sample_size_bytes * MAP_SIZE_OVERHEAD
```
MAP_SIZE_OVERHEAD (2.5 in our configuration, i.e. 150% extra space) accounts for LMDB internal structures and future growth.
This ensures we never run out of space during ingestion, even with hundreds of millions of rows.
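As a worked example with round numbers (the ~40 bytes/row figure is an assumption for illustration, not a measured value), the formula lands close to the ~11 GB file size reported later in this article:

```python
total_rows = 114_000_000      # rows in the dataset
bytes_per_row = 40            # assumed average row size from a sampled row group
MAP_SIZE_OVERHEAD = 2.5       # 150% headroom, as in the configuration above

estimated_map_size = int(total_rows * bytes_per_row * MAP_SIZE_OVERHEAD)
print(f"{estimated_map_size / 1024**3:.1f} GiB")  # → 10.6 GiB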
2. Dynamic Batching Based on Available RAM
To maximize ingestion speed while avoiding memory overcommitment, we compute a dynamic batch size:
```python
available_ram = psutil.virtual_memory().available
LMDB_BATCH_SIZE = int(available_ram * 0.4 / 500)
LMDB_BATCH_SIZE = max(50_000, min(LMDB_BATCH_SIZE, 1_000_000))
```
The formula budgets 40% of available RAM for the in-flight batch, assuming roughly 500 bytes of Python overhead per buffered row. This ensures that each transaction writes a large number of rows, minimizing transaction overhead: fewer, larger commits speed up ingestion significantly.
3. Streaming Parquet Batches
Instead of loading the entire Parquet file into memory, we stream it in mini-batches:
```python
for batch in pq_file.iter_batches(batch_size=PARQUET_BATCH_SIZE):
    ip_arr = batch.column("ip").to_numpy(zero_copy_only=False)
    svc_arr = batch.column("service").to_numpy(zero_copy_only=False)
    last_arr = batch.column("last_seen").to_numpy(zero_copy_only=False)
    pct_arr = batch.column("percent_days_seen").to_numpy(zero_copy_only=False)
```
Passing zero_copy_only=False lets PyArrow copy columns (such as strings) that cannot be viewed zero-copy, while streaming keeps only one mini-batch resident in RAM at a time. This makes it possible to process hundreds of millions of rows on a machine with just a few GB of memory.
4. Write Keys and Values in Batches
LMDB expects bytes for keys and values. We encode IPs and associated metadata once per batch:
```python
keys_batch.append(ip.encode("utf-8"))
values_batch.append(encode_value(svc, last, pct))
```
Once a batch reaches LMDB_BATCH_SIZE, we commit it:
```python
with env.begin(write=True) as txn:
    for k, v in zip(keys_batch, values_batch):
        txn.put(k, v)
keys_batch.clear()
values_batch.clear()
```
Batching drastically reduces transaction overhead. Flushing remaining rows at the end ensures all data is written.
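The value format is a simple tilde-delimited string. A matching decoder (a hypothetical helper for illustration, not part of the ingestion script) makes the roundtrip explicit, with the caveat that fields must not themselves contain the delimiter:

```python
def encode_value(service, last_seen, percent_days_seen):
    # Caveat: fields must not contain the '~' delimiter themselves
    return f"{service}~{last_seen}~{percent_days_seen}".encode("utf-8")

def decode_value(raw: bytes):
    # Inverse of encode_value; numeric fields come back as strings
    service, last_seen, pct = raw.decode("utf-8").split("~")
    return service, last_seen, pct

raw = encode_value("Floppydata", "2025-11-24", 2)
print(decode_value(raw))  # → ('Floppydata', '2025-11-24', '2')
```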
5. Optional Performance Tweaks
- writemap=True and map_async=True → use memory-mapped writes and an asynchronous flush to disk
- sync=False → prevents LMDB from fsync-ing after every transaction, improving speed (safe when ingesting large static datasets)
- readahead=False → turns off OS read-ahead, since we write sequentially and already stream in batches
6. Results
With this approach:
- We ingested over 114 million IPs in roughly 4–6 minutes on a mid-range machine (~6.5 GB RAM, LMDB filesize ~11 GB)
- The LMDB file is optimized for fast reads, with keys sorted and stored as bytes
- Memory usage remains under control thanks to dynamic batching and streamed Parquet reads
Troubleshooting: MapFullError
If you encounter this error:
```
MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached
```
Increase the MAP_SIZE_OVERHEAD value in your configuration. You can start with a generous map size and compact the database later from the terminal:
```shell
mdb_copy -n -c resproxy.lmdb resproxy_compact.lmdb
mv resproxy_compact.lmdb resproxy.lmdb
```
Reading from LMDB: High-Performance IP Lookups
Full Code:
```python
import lmdb
import time

# -------------------------------
# Configuration
# -------------------------------
LMDB_FILE = "resproxy.lmdb"
IP_LIST_FILE = "ips.csv"

# -------------------------------
# Load IPs and pre-encode
# -------------------------------
with open(IP_LIST_FILE, "r") as f:
    ip_list = {line.strip() for line in f}

# Convert all IPs to bytes once
keys_bytes = [ip.encode("utf-8") for ip in ip_list]

# -------------------------------
# Open LMDB in read-only mode
# -------------------------------
env = lmdb.open(
    LMDB_FILE,
    subdir=False,
    readonly=True,
    lock=False,      # allows multiple readers, no lock overhead
    readahead=True,
)

# -------------------------------
# Single read-only transaction for all lookups
# -------------------------------
start_time = time.time()
results = []

with env.begin() as txn:
    for key in keys_bytes:
        value_bytes = txn.get(key)
        if value_bytes:
            service, last_seen, pct_days_seen = value_bytes.decode("utf-8").split("~")
            results.append((key.decode("utf-8"), service, last_seen, pct_days_seen))
        else:
            results.append((key.decode("utf-8"), None, None, None))

elapsed = time.time() - start_time
print(f"Read {len(keys_bytes):,} IPs in {elapsed:.3f} sec "
      f"({elapsed / len(keys_bytes) * 1000:.3f} ms per lookup)")

# -------------------------------
# Example: print first 10 results
# -------------------------------
for r in results[:10]:
    print(r)

env.close()
```
Output:
```
Read 447,649 IPs in 1.566 sec (0.003 ms per lookup)
('2.39.142.208', 'Floppydata', '2025-11-24', '2')
('72.255.18.145', 'Databay', '2025-11-23', '9')
('157.32.139.118', '711Proxy', '2025-11-24', '4')
('136.169.151.73', 'Proxy-Seller', '2025-11-23', '16')
('14.169.38.102', 'Lightning Proxies', '2025-10-27', '16')
('240e:36f:465:e0b1:1ce4:77e3:c622:8add', None, None, None)
('80.83.237.68', '922 S5 Proxy', '2025-11-12', '10')
('73.56.203.70', None, None, None)
('220.173.28.64', None, None, None)
('91.204.150.69', '711Proxy', '2025-11-23', '17')
```
Once your IP data is ingested into LMDB, reading it efficiently is critical for building fast APIs or performing bulk lookups. LMDB is designed for extremely fast read operations, but there are a few key practices that maximize performance.
1. Use Read-Only Transactions
LMDB supports multiple concurrent readers without any locking. To achieve peak speed, open a single read-only transaction and reuse it for all your lookups:
```python
with env.begin() as txn:
    value_bytes = txn.get(ip_lookup.encode("utf-8"))
```
Creating a new transaction for each lookup is very expensive and can drastically slow down performance.
2. Pre-Encode Keys
LMDB stores keys and values as bytes, not Python strings. Convert your IP addresses to bytes once before the lookup loop:
```python
keys_bytes = [ip.encode("utf-8") for ip in ip_list]
```
This avoids repeated encoding inside a tight loop, which saves a lot of CPU cycles.
3. Avoid Printing Inside the Loop
Printing during lookups can easily dominate the runtime. Only print results after the lookups are complete, or sample a few entries for verification.
4. Enable Readahead and Disable Locks
When opening LMDB for read-only access:
```python
env = lmdb.open(
    LMDB_FILE,
    readonly=True,
    lock=False,      # no locks for read-only access
    readahead=True,  # let OS prefetch pages
)
```
This ensures LMDB and the operating system efficiently cache data pages in memory, reducing disk I/O.
5. Single vs Multi-Threaded Reads
LMDB allows multiple readers to operate concurrently. For extremely high-volume APIs, you can split lookups across threads. However, for most scenarios, a single read-only transaction is already extremely fast—often under a millisecond per lookup even on millions of entries.
Results
Using these strategies, you can perform hundreds of thousands of lookups per second from a 100+ million row LMDB database, with predictable, low-latency response times, perfect for a residential proxy IP database, a hosted domains database, or any high-throughput network application.
For large-scale, non-contiguous IP datasets, LMDB offers significant advantages over MMDB: predictable performance regardless of data distribution, efficient bulk ingestion from Parquet, and extremely fast read operations. Combined with proper batching strategies and memory management, you can build high-performance IP lookup systems that scale to hundreds of millions of records.