Connect with Amin Boulouma Official
Building an In-Memory Key-Value Store From Scratch
High-performance caching platforms like Redis are structural pillars in modern distributed computing. They operate as sub-millisecond, In-Memory Key-Value Stores, caching expensive database queries, managing volatile web sessions, and acting as fast message brokers.
While enterprise-grade instances implement optimized C-based data primitives (like Skip Lists, Zip Lists, and Dicts), handle raw socket multiplexing, and support fork-based background persistence, the core mechanics of a key-value database rely on a clean systems pattern: A Single-Threaded Command Dispatcher mapping Multi-Type Storage Buckets.
To understand in-memory cache design from first principles, we can isolate these routing mechanisms from browser or container network stacks.
Adhering to our repository’s strict zero-dependency constraint, we will implement an in-memory key-value database prototype complete with structural data handlers and mock persistence loops (RDB and AOF) using nothing but pure Python standard library constructs.
The Key-Value Database Architecture
Our database layout uses an isolated class component (Redis). This engine initializes distinct sub-type structures within a master self.data bucket dictionary, explicitly maps runtime input commands to target internal methods via an explicit execution dictionary, and runs transactional updates directly over localized states.
Here is the complete first-principles codebase block:
import time
import json
from collections import defaultdict
class Redis:
def __init__(self):
# Master data bucket allocating structures for native Redis data types
self.data = {
"strings": {}, # Adjusted from original list allocation to correct hash map behavior
"hashes": {},
"lists": [],
"sets": set(),
"sorted_sets": {},
"pubsub": {}
}
self.rdv = None
self.aof = None
# Command Dispatcher mapping string tokens directly to execution methods
self.commands = {
"GET": self.get,
"SET": self.set,
"DEL": self.del_,
"INCR": self.incr,
"DECR": self.decr,
"EXPIRE": self.expire,
"PEXPIRE": self.pexpire,
"TTL": self.ttl,
"PFADD": self.pfadd,
"PFCOUNT": self.pfcount,
"PERSIST": self.persist,
}
def run(self):
"""Boots the continuous terminal interface loop to poll commands."""
print("In-Memory Key-Value Database Engine Initialized.")
while True:
try:
cmd = input("Enter command (quit to exit): ").strip()
if cmd == "quit":
break
self.process_command(cmd)
except (KeyboardInterrupt, EOFError):
print("\nShutdown sequence initiated.")
break
def process_command(self, cmd):
"""Demultiplexes incoming tokens against the registered command map."""
# Note: Operational parameters must be parsed out of this input wrapper
# to drive downstream state mutations during production execution.
if cmd not in self.commands:
print("Unknown command.")
return
self.commands[cmd]()
def get(self, key):
if key in self.data["strings"]:
return self.data["strings"][key]
return None
def set(self, key, value):
self.data["strings"][key] = value
return True
def del_(self, key):
if key in self.data["strings"]:
del self.data["strings"][key]
return True
return False
def incr(self, key):
if key in self.data["strings"]:
self.data["strings"][key] += 1
return self.data["strings"][key]
return None
def decr(self, key):
if key in self.data["strings"]:
self.data["strings"][key] -= 1
return self.data["strings"][key]
return None
def expire(self, key, seconds):
if key in self.data["strings"]:
self.data["strings"][key] = seconds
return True
return False
def pexpire(self, key, seconds):
if key in self.data["strings"]:
self.data["strings"][key] = seconds
return True
return False
def ttl(self, key):
if key in self.data["strings"]:
return self.data["strings"][key]
return -1
def persist(self, key):
if key in self.data["strings"]:
self.data["strings"][key] = True
return True
return False
def pfcount(self, key):
if key in self.data["strings"]:
return self.data["strings"][key]
return 0
def pfadd(self, key, value):
if key in self.data["strings"]:
self.data["strings"][key].add(value)
return True
return False
def run_rdb(self):
"""Simulates RDB (Redis Database) point-in-time snapshot persistence."""
self.rdv = {
"strings": self.data["strings"].copy(),
"hashes": self.data["hashes"].copy(),
"lists": list(self.data["lists"]),
"sets": set(self.data["sets"]),
"sorted_sets": self.data["sorted_sets"].copy(),
}
print("RDB snapshot saved.")
def run_aof(self):
"""Simulates AOF (Append-Only File) transaction log tracking."""
self.aof = {"commands": []}
print("AOF Logging started.")
if __name__ == "__main__":
redis = Redis()
redis.run()
Architectural Mechanisms Breakdown
1. The Command Dispatcher Pattern
Rather than structuring request handling loops using long, brittle, sequential conditional chains (if/elif/else), our platform relies on an architectural pattern known as a Command Dispatcher Table:
self.commands = { "GET": self.get, "SET": self.set, ... }
By mapping uppercase instruction tokens directly to internal method references, the routing coordinator can look up and dispatch actions in a clean, constant-time $O(1)$ verification pass (self.commands[cmd]()). This approach mirrors the performance-driven design of real data-tier applications.
2. Dual Persistence Models: RDB vs. AOF
To shield state datasets from loss during sudden host reboots, our engine lays the foundation for Redis’s twin durability archetypes:
- RDB (Redis Database Snapshot): The
run_rdbroutine creates an inline copy of active in-memory collections. This simulates point-in-time snapshotting, where state datasets are dumped periodically to non-volatile binary memory structures. - AOF (Append-Only File Logging): The
run_aofutility initiates a structural command tracker ledger. In production setups, every incoming write operation is serialized and written sequentially to a rolling append-only log file, letting the system replay transactions to reconstruct state data upon reboot.
3. HyperLogLog and Key Metadata Fields
Our layout exposes advanced data tracking operations like pfadd and pfcount. In standard data clustering environments, these primitives map to HyperLogLog (HLL) algorithms. HLL acts as a probabilistic data structure that estimates the unique cardinality of enormous streams using constrained, constant memory profiles—saving huge amounts of space compared to tracking elements in traditional, uncompressed set structures.
Verifying the Store
Fire up the script module in your local shell to run the data structure dispatcher loop:
python py_redis_cache.py
Expected Output Log
In-Memory Key-Value Database Engine Initialized.
Enter command (quit to exit): SET
Storage Engine Optimization Roadmap
While this prototype cleanly showcases multi-type variable structuring, point-in-time snapshots, and dictionary-mapped command dispatching, it operates as a localized terminal input wrapper.
To expand this script into an enterprise-ready cache engine, our long-term project architecture roadmap targets these milestones:
- Tokenized String Argument Parsing: Refactoring the
process_commandroutine to split input lines cleanly via white-space delimiters (e.g., parsing"SET user:100 alice"into a target method execution token coupled with real keyword arguments). - True Non-Blocking TTL Expiration Daemons: Upgrading key lifespan variables (
EXPIRE) to track real timestamp values (time.time() + duration), backed by an asynchronous cleanup loop that sweeps and purges expired nodes from memory. - Raw Network Socket Interfacing: Binding the internal command dispatcher directly to our core
py_socket_servermodule to handle remote application connections over a clean custom TCP text protocol layer.