Day 21: Config Management: Implementing YAML-Based System Settings

Mar 19, 2026

1. The “Global Constants File” Trap

Every junior quant engineer writes one. You’ve seen it. You’ve written it.

# config.py
MAX_POSITION_SIZE = 100
RISK_LIMIT = 0.02
ALPACA_API_KEY = "PK3X7V..."
PAPER_TRADING = True

It works on your laptop. It works in your backtest. It fails in production in ways that don’t produce errors — they produce losses.

The failure isn’t a crash. It’s a silent drift. The risk limit that was 0.02 in testing becomes 0.2 because someone did a find-replace. The API key rotates and now lives in two places. You deploy to a new server, forget to update the constants file, and spend 20 minutes wondering why positions are capped at 10 shares instead of 1000. Worst of all: you’re running paper and live off the same config, separated by a single boolean you forgot to flip.

This is not a style problem. This is a system reliability problem that directly maps to operational risk.

2. The Failure Mode: Three Paths to Account Ruin

2.1 Path 1: Type Coercion Corruption

YAML is not Python. When you write:

risk_per_trade: .02
max_drawdown: 5%
slippage_bps: 3

yaml.safe_load() gives you {'risk_per_trade': 0.02, 'max_drawdown': '5%', 'slippage_bps': 3}. Notice max_drawdown is a string. Your downstream if portfolio_dd > config['max_drawdown'] comparison never does what you meant. In Python 2 it silently evaluated to False on every tick. In Python 3 it raises TypeError, and an unhandled exception mid-session is just as bad: if it kills the risk-check loop or gets swallowed by a broad except, the guard never triggers and you blow through your drawdown limit.
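One way to defuse this is to normalize every risk value to a float at load time and fail loudly on anything else. The sketch below is illustrative only: to_fraction is a hypothetical helper, not part of the AutoQuant-Alpha code, and it assumes percentages are always written with a trailing %.

```python
def to_fraction(value: object) -> float:
    """Coerce a YAML scalar such as 0.02 or '5%' to a float fraction.

    Hypothetical helper: raises ValueError instead of letting a string
    reach a numeric comparison.
    """
    # bool is a subclass of int in Python, so reject it before the numeric check
    if isinstance(value, bool):
        raise ValueError(f"expected a number, got bool: {value!r}")
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str) and value.strip().endswith("%"):
        # '5%' -> 0.05; float() raises ValueError on garbage like 'abc%'
        return float(value.strip()[:-1]) / 100
    raise ValueError(f"cannot interpret {value!r} as a fraction")


# This dict mirrors what yaml.safe_load() returns for the snippet above.
raw = {"risk_per_trade": 0.02, "max_drawdown": "5%", "slippage_bps": 3}
max_drawdown = to_fraction(raw["max_drawdown"])
portfolio_dd = 0.07
breached = portfolio_dd > max_drawdown  # float vs float: the guard can fire
```

Coercing once at the boundary means every later comparison is float against float, and a malformed value stops the load instead of surfacing in the middle of a session.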

2.2 Path 2: No Environment Isolation

A single-file config that’s shared between paper and live is a loaded gun. The moment you want to test a new strategy, you copy-paste the live config and edit it. Now you have two files that look the same but diverge over weeks. A merge conflict resolves the wrong way. Your paper-trading position_multiplier: 0.1 gets written into your live config.

2.3 Path 3: No Hot-Reload = Missed Risk Windows

Markets move. Risk parameters need to move with them. If your system requires a restart to update a position limit during a volatility spike, you’re dead. The 30-second restart window during a flash crash is where drawdown happens. Your config system needs to respond to file changes while the event loop is running.

3. The AutoQuant-Alpha Architecture: Layered Config with Validation

We implement a three-layer config system based on the 12-factor app methodology, adapted for trading systems:

Layer 1: base.yaml        —  System-wide defaults. Never changes per deployment.
Layer 2: {ENV}.yaml        —  Environment overrides (paper, live, backtest).
Layer 3: Environment Vars  —  Secrets and runtime overrides. NEVER in YAML files.

The merge order is explicit: each layer overrides the previous. The final config is a single validated dataclass, not a raw dict. Consumers never touch raw YAML.
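The merge order can be made explicit in code by resolving the layer files up front and rejecting unknown environments. This is a sketch under assumptions: the function name resolve_layers and the config/ directory are hypothetical; only the base.yaml and {ENV}.yaml names and the three environments come from the layer list above.

```python
from pathlib import Path

VALID_ENVS = ("paper", "live", "backtest")  # the environments named in Layer 2


def resolve_layers(env: str, config_dir: Path = Path("config")) -> list[Path]:
    """Return the YAML layers in merge order: base first, then the env override."""
    if env not in VALID_ENVS:
        # Refuse to guess: a typo like 'lve' must not fall back to some default
        raise ValueError(f"unknown environment {env!r}, expected one of {VALID_ENVS}")
    return [config_dir / "base.yaml", config_dir / f"{env}.yaml"]


layers = resolve_layers("paper")
```

Rejecting an unknown environment name outright is the point of the design: a silent fallback to a default file is exactly how paper settings end up in a live session.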

ConfigLoader → deep_merge(base, env) → ConfigValidator → ConfigRegistry (singleton)
                                                ↕
                                         ConfigWatcher (threading.Thread)
                                         polls mtime every N seconds
                                         → triggers re-load on change

Secrets (API keys) are injected exclusively through environment variables. They appear in the dataclass but never in any YAML file. A misconfigured .gitignore cannot leak your trading credentials.
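A minimal sketch of that injection step, assuming the key is exported as ALPACA_API_KEY (the variable name is borrowed from the config.py example in section 1; the Secrets class and load_secrets function are hypothetical). Failing at startup when the variable is missing is deliberate: a trading process without credentials should not get as far as the event loop.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Secrets:
    # repr=False keeps the key out of logs and tracebacks that print the dataclass
    alpaca_api_key: str = field(repr=False)


def load_secrets(environ: Mapping[str, str] = os.environ) -> Secrets:
    """Read secrets from the process environment, never from YAML."""
    key = environ.get("ALPACA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ALPACA_API_KEY is not set; refusing to start")
    return Secrets(alpaca_api_key=key)


# Passing a plain dict makes the function easy to exercise in tests.
secrets = load_secrets({"ALPACA_API_KEY": "dummy-key-for-illustration"})
```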

4. Implementation Deep Dive

4.1 The Config Dataclass: Zero Magic, Full Validation

We use Python 3.11 dataclasses with __post_init__ validation rather than Pydantic. This is intentional: you need to understand what validation does, not just that it runs.

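from dataclasses import dataclass

class ConfigValidationError(ValueError):   # the ValueError base class is an assumption
    """Raised when a config value violates a domain rule."""

@dataclass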
class RiskConfig:
    max_position_usd: float
    risk_per_trade: float
    max_daily_drawdown: float
    max_open_positions: int

    def __post_init__(self):
        if not 0 < self.risk_per_trade < 0.1:
            raise ConfigValidationError(
                f"risk_per_trade={self.risk_per_trade} outside safe bounds (0, 0.1). "
                f"This is a trading system, not a calculator."
            )
        if self.max_daily_drawdown <= 0 or self.max_daily_drawdown >= 1.0:
            raise ConfigValidationError(
                f"max_daily_drawdown must be in (0.0, 1.0), got {self.max_daily_drawdown}"
            )

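__post_init__ fires after __init__ populates all fields. The dataclass fixes the struct layout and your domain logic enforces the business rules. Be aware that Python does not check the annotations at runtime: a string in a float field is accepted by __init__ and only surfaces as a TypeError when __post_init__ compares it, so the reload path has to treat TypeError as a rejection too. If validation fails during a hot-reload, the current valid config stays active. The system logs the error and continues. It does not restart. It does not crash. This is the difference between a warning and an outage.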
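Type checks can sit in __post_init__ next to the range checks, so that a value like '2%' is rejected with ConfigValidationError rather than a raw TypeError. The sketch below uses a hypothetical, trimmed-down RiskLimits class and assumes ConfigValidationError subclasses ValueError; neither detail is taken from the AutoQuant-Alpha code.

```python
from dataclasses import dataclass, fields


class ConfigValidationError(ValueError):
    """Stand-in for the article's exception; the base class is an assumption."""


@dataclass(frozen=True)
class RiskLimits:  # hypothetical, smaller cousin of RiskConfig
    risk_per_trade: float
    max_open_positions: int

    def __post_init__(self):
        # Assumes annotations are real classes (no `from __future__ import annotations`)
        for f in fields(self):
            value = getattr(self, f.name)
            # An int is acceptable where a float is declared, but bool never is
            allowed = (int, float) if f.type is float else (f.type,)
            if isinstance(value, bool) or not isinstance(value, allowed):
                raise ConfigValidationError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}: {value!r}"
                )
        # Range checks are safe only after the type checks above have passed
        if not 0 < self.risk_per_trade < 0.1:
            raise ConfigValidationError(
                f"risk_per_trade={self.risk_per_trade} outside (0, 0.1)"
            )


ok = RiskLimits(risk_per_trade=0.02, max_open_positions=5)

try:
    RiskLimits(risk_per_trade="2%", max_open_positions=5)
except ConfigValidationError as exc:
    rejected = str(exc)  # names the offending field and the type it received
```

With every failure funneled into one exception type, the hot-reload path needs only a single except clause to keep the active config in place.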

4.2 The Deep Merge: Why `dict.update()` Kills You

Standard dict.update() does a shallow merge. This means:

base = {'risk': {'max_pos': 100, 'drawdown': 0.05}}
env  = {'risk': {'max_pos': 500}}
base.update(env)
# Result: {'risk': {'max_pos': 500}}  ← drawdown GONE

Your base config’s drawdown guard vanishes silently. We implement a recursive deep_merge:

def deep_merge(base: dict, override: dict) -> dict:
    result = base.copy()
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

Every nested key is preserved unless explicitly overridden. The merge is O(n) in the number of keys, and a config file is small enough that the cost is not a concern.
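Running the two dicts from the dict.update() example through deep_merge shows the difference. The function is repeated verbatim so the block runs on its own, and the extra fees key is added purely for illustration. One caveat: base.copy() is shallow, so nested dicts that the override does not touch are shared between the result and base. Treat the merged config as read-only, or deep-copy it before mutating.

```python
def deep_merge(base: dict, override: dict) -> dict:
    result = base.copy()
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


base = {"risk": {"max_pos": 100, "drawdown": 0.05}, "fees": {"bps": 1}}
env = {"risk": {"max_pos": 500}}

merged = deep_merge(base, env)
# merged["risk"] keeps drawdown and picks up the overridden max_pos.
# base["risk"] is unchanged because the recursion copies before writing.

# "fees" was not overridden, so the very same dict object is shared with base.
shared = merged["fees"] is base["fees"]
```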

4.3 The ConfigWatcher: Polling vs. inotify

We use polling on the file's mtime (Path.stat().st_mtime, equivalent to os.path.getmtime()) rather than inotify or watchdog. Why? inotify is Linux-only and subject to kernel-imposed watch and instance limits, and watchdog, while cross-platform, is an external dependency. A trading system’s config watcher needs to work identically across development (macOS), staging (Linux), and potentially containerized environments where filesystem event propagation is unreliable.

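import logging
import threading
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

class ConfigWatcher(threading.Thread):
    def __init__(self, paths: list[Path], callback: Callable[[Path], None], interval_s: float = 2.0):
        super().__init__(daemon=True)   # dies with the main process
        self._paths = paths
        self._callback = callback
        self._interval = interval_s
        self._mtimes: dict[Path, float] = {}
        self._stop_event = threading.Event()

    @staticmethod
    def _mtime(p: Path) -> float:
        # One stat() call: exists() followed by stat() can race with a file being replaced
        try:
            return p.stat().st_mtime
        except OSError:
            return 0.0

    def run(self):
        # Initialize baseline mtimes
        for p in self._paths:
            self._mtimes[p] = self._mtime(p)

        while not self._stop_event.wait(self._interval):
            for p in self._paths:
                current = self._mtime(p)
                if current != self._mtimes[p]:
                    self._mtimes[p] = current
                    try:
                        self._callback(p)   # re-load on change
                    except Exception:
                        # Without this, one bad reload would kill the watcher thread
                        logger.exception("Config reload callback failed for %s", p)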

Why daemon=True matters: the interpreter exits as soon as only daemon threads remain, so the watcher dies with the main thread. Without it, your watcher thread keeps the process alive after a Ctrl+C, and you’re back to kill -9 debugging.
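The change-detection step itself is small enough to isolate as a pure function, which makes it testable without starting a thread. This is a sketch, not the watcher's actual code: detect_changes is a hypothetical name, and the demo pins mtimes with os.utime instead of sleeping, because two writes within the filesystem's timestamp granularity can leave mtime unchanged.

```python
import os
import tempfile
from pathlib import Path


def detect_changes(paths: list[Path], seen: dict[Path, float]) -> list[Path]:
    """Return paths whose mtime differs from `seen`, updating `seen` in place."""
    changed = []
    for p in paths:
        try:
            current = p.stat().st_mtime
        except OSError:
            current = 0.0  # a missing file is treated as mtime 0, as in the watcher
        if seen.get(p) != current:
            seen[p] = current
            changed.append(p)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "paper.yaml"
    cfg.write_text("risk:\n  max_pos: 100\n")
    os.utime(cfg, (1_000_000, 1_000_000))   # pin a known (atime, mtime)

    seen: dict[Path, float] = {}
    first = detect_changes([cfg], seen)      # empty baseline: file is reported once
    second = detect_changes([cfg], seen)     # nothing changed since the last poll

    cfg.write_text("risk:\n  max_pos: 500\n")
    os.utime(cfg, (2_000_000, 2_000_000))   # simulate a later edit
    third = detect_changes([cfg], seen)

    n_first, n_second, n_third = len(first), len(second), len(third)
```

Unlike ConfigWatcher.run(), which records a baseline silently, this function reports every path on its first pass. Pre-populate seen if the initial pass should not trigger a reload.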

4.4 The ConfigRegistry: Thread-Safe Singleton

The registry is the single source of truth for config access. It uses threading.Lock to protect against TOCTOU (time-of-check-to-time-of-use) races during hot-reload:

class ConfigRegistry:
    _instance: "ConfigRegistry | None" = None
    _lock: threading.Lock = threading.Lock()
    _config: "TradingConfig | None" = None   # class-level default so get() can test it before load()

    def __new__(cls) -> "ConfigRegistry":
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    def get(self) -> TradingConfig:
        with self._lock:
            if self._config is None:
                raise RuntimeError("ConfigRegistry not initialized. Call load() first.")
            return self._config

    def reload(self, changed_path: Path) -> None:
        try:
            new_config = self._loader.load()
            with self._lock:
                self._config = new_config
            logger.info(f"Config hot-reloaded from {changed_path}")
        except Exception as e:   # validation failure, half-written YAML, or I/O error
            logger.error(f"Hot-reload REJECTED: {e}. Keeping active config.")

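Creating the instance inside the lock in __new__ ensures only one registry instance exists across all threads. Strictly speaking this is plain locked initialization rather than double-checked locking: the lock is taken on every call, which is cheap enough here. The reload path holds the lock only for the assignment, never for the I/O. The disk read happens outside the lock, so the event loop calling get() is never blocked waiting for a filesystem operation.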
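The swap-under-lock idea can be seen in isolation in the sketch below. ConfigHolder and its loader argument are hypothetical stand-ins for ConfigRegistry and ConfigLoader, and the config is a plain dict for brevity. Callers should take one snapshot per decision (cfg = holder.get()) rather than calling get() repeatedly, so that a reload landing mid-calculation cannot mix old and new values.

```python
import threading
from typing import Callable


class ConfigHolder:
    """Holds the active config; swaps it atomically, keeps it on failure."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._config = initial

    def get(self) -> dict:
        with self._lock:
            return self._config

    def reload(self, loader: Callable[[], dict]) -> bool:
        try:
            new_config = loader()        # slow or failing work happens outside the lock
        except Exception:
            return False                 # rejected: the active config is untouched
        with self._lock:
            self._config = new_config    # the lock covers only the reference swap
        return True


def bad_loader() -> dict:
    raise ValueError("risk_per_trade out of bounds")


holder = ConfigHolder({"max_pos": 100})
snapshot = holder.get()                               # one snapshot per decision
accepted = holder.reload(lambda: {"max_pos": 500})    # valid config replaces the old one
rejected = holder.reload(bad_loader)                  # failure leaves the new one active
```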

5. System Diagrams

The three diagrams below show: (1) how the components are wired together, (2) the step-by-step path from file read to hot-reload, and (3) every state the ConfigRegistry can be in during its lifetime.