
How to reproduce a bug that only fires once a week

Intermittent bugs aren't actually random. They have triggers — you just haven't found them yet. Here's the systematic process for turning a flaky failure into a deterministic repro.

Intermittent doesn't mean random

The first mental shift in debugging intermittent failures is accepting that they aren't random. The system isn't rolling dice. The bug is firing every time a specific combination of conditions lines up; you just don't know what those conditions are yet. The whole job is figuring out the combination.

This matters because the natural response to an intermittent failure ("let's add more logging and wait for it to happen again") is the slowest possible approach. You'll eventually get there, but it can take weeks or months. A systematic process can often get you there in hours or days.

Step 1: Capture the full state at failure

When the bug fires, you usually have the error and a stack trace and not much else. That's not enough. The next time it fires, you want to capture every piece of state that could possibly be relevant: input data, configuration, version of every deployed service, time of day, queue depths, the specific code path that was hit, the state of any caches.

Most teams skip this because the instrumentation feels heavy. It's the single highest-leverage thing you can do. Without the captured state, every subsequent step is guessing.
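As a rough illustration of what "capture the full state" can look like in code, here is a minimal Python sketch. The names capture_failure_state and run_with_capture, and the contents of the context dict, are assumptions made for this example; a real system would fill context with its own inputs, configuration, service versions, queue depths, and cache state.

```python
import json
import platform
import sys
import time
import traceback


def capture_failure_state(exc, context):
    """Build a JSON-serializable snapshot of the state around a failure.

    `context` is a hypothetical dict of application-specific values
    (inputs, config, service versions, queue depths, cache stats).
    """
    return {
        "captured_at": time.time(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "python_version": sys.version,
        "host": platform.node(),
        "context": context,
    }


def run_with_capture(fn, context, sink):
    """Run fn(); on failure, append a JSON snapshot to `sink` and re-raise."""
    try:
        return fn()
    except Exception as exc:
        snapshot = capture_failure_state(exc, context)
        # default=str keeps the dump from failing on non-JSON values.
        sink.append(json.dumps(snapshot, default=str))
        raise
```

Here sink is just a list; in practice it would more likely be a log pipeline or object store. The point is that the snapshot is taken at the moment of failure, not reconstructed afterwards.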

Step 2: Enumerate hypotheses before testing any

With one failure captured in detail, list every plausible cause. Don't filter yet — get them on paper. Race condition between services. Stale cache. Specific input shape. Timing-sensitive code path. Memory pressure. External dependency latency spike. Configuration drift between environments. Database lock contention.

The enumeration step matters because the bug you're chasing is usually not the one you'd guess first. Listing alternatives lets you test the cheap ones quickly and rule out big chunks of hypothesis space.

Step 3: Design a falsifiable test for each

Each hypothesis needs a verification step that would prove it wrong, not just consistent with it. "Stale cache?" → flush cache, run the workload that previously failed, see if it still fails. "Race condition?" → run the same workload with a deliberate ordering constraint that would eliminate the race; see if the bug still occurs.

Falsifiable tests are faster than confirmation tests because they let you eliminate hypotheses without having to reproduce the failure. You're shrinking the search space.
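The "stale cache" test above can be sketched in Python as follows. Everything here is hypothetical: ToyService stands in for a real system, and check_hypothesis is an illustrative helper, not an established API. The idea is to remove the suspected cause and rerun the workload; if the failure still occurs, the hypothesis is ruled out.

```python
def check_hypothesis(run_workload, remove_suspect, trials=100):
    """Remove the suspected cause, rerun the workload, and report the verdict.

    run_workload() returns True on success and False on failure.
    A failure with the suspect removed falsifies the hypothesis.
    """
    for _ in range(trials):
        remove_suspect()
        if not run_workload():
            return "falsified"
    # No failure observed: the hypothesis survives, but is not proven.
    return "not falsified"


class ToyService:
    """Hypothetical service used only to exercise check_hypothesis."""

    def __init__(self, bug_is_in_cache):
        self.bug_is_in_cache = bug_is_in_cache
        self.cache = {"user:1": "stale"}

    def flush_cache(self):
        self.cache.clear()

    def handle(self, payload):
        if self.bug_is_in_cache:
            # Fails only while the stale entry is present.
            return "user:1" not in self.cache
        # Fails on empty input, regardless of the cache.
        return len(payload) > 0


service = ToyService(bug_is_in_cache=False)
verdict = check_hypothesis(lambda: service.handle(""), service.flush_cache)
```

In this toy setup the bug is in input handling, so flushing the cache does not help and verdict is "falsified". Note the asymmetry: "not falsified" is only as strong as the number of trials behind it.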

Step 4: Compress the time between trials

If the bug fires once a week under production load, you can't iterate on it in production. You need a harness that fires the suspected workload as fast as possible, with deliberate noise on the variables you think matter. Parallelize. Run thousands of trials. Bias toward conditions that maximize the chance of reproduction — high concurrency, edge-case inputs, simulated latency on dependencies.

When you can run a thousand trials in an hour instead of one trial per week, you can pin down the trigger in days instead of months.
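A harness along these lines might look like the following Python sketch. The function names, the choice of a thread pool, and the variables being randomized (concurrency, latency_ms) are all assumptions for illustration; example_trial is a stand-in for the real suspected workload. Each trial derives its noise from a seed, so any failing trial can be replayed exactly.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_trials(trial, n_trials, workers=8):
    """Run trial(seed) for seeds 0..n_trials-1 in parallel; return failing seeds."""

    def attempt(seed):
        try:
            trial(seed)
            return None
        except Exception:
            return seed

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(attempt, range(n_trials))
        return [seed for seed in results if seed is not None]


def example_trial(seed):
    """Hypothetical workload: fails only when two noisy variables line up."""
    rng = random.Random(seed)  # per-trial RNG so a failing seed can be replayed
    concurrency = rng.choice([1, 4, 16, 64])
    latency_ms = rng.choice([0, 50, 500])
    if concurrency == 64 and latency_ms == 500:
        raise RuntimeError(
            f"failure with concurrency={concurrency}, latency_ms={latency_ms}"
        )


failing_seeds = run_trials(example_trial, n_trials=1000)
```

Rerunning example_trial with any seed from failing_seeds reproduces that failure on demand, which is what turns a once-a-week event into something you can iterate on. For CPU-bound workloads a process pool may be a better fit than threads.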

Step 5: Verify the reproduction is causal, not coincidental

Once you've found conditions under which the bug reproduces, the work isn't done. You have to verify that the conditions you identified are actually causing the failure, not just correlated with it. Vary each condition independently. If removing condition X makes the bug stop, and adding it back makes it return, you've found a real cause. If the bug reproduces equally with or without X, X was a red herring.
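The one-condition-at-a-time check can be sketched in Python like this. The helper classify_conditions and the condition names are hypothetical; reproduces is assumed to be a callable that takes the set of active conditions and returns True if the bug fires (for a flaky bug it would itself run many trials).

```python
def classify_conditions(reproduces, conditions):
    """Toggle each condition off on its own and report whether the bug depends on it."""
    baseline = set(conditions)
    if not reproduces(baseline):
        raise ValueError("bug does not reproduce with all conditions active")

    verdicts = {}
    for name in conditions:
        stops_without = not reproduces(baseline - {name})
        returns_with = reproduces(baseline)  # add the condition back
        if stops_without and returns_with:
            verdicts[name] = "causal"
        else:
            verdicts[name] = "red herring"
    return verdicts


def toy_reproduces(active):
    """Hypothetical bug that needs both high concurrency and a slow dependency."""
    return {"high_concurrency", "slow_dependency"} <= active


verdicts = classify_conditions(
    toy_reproduces,
    ["high_concurrency", "slow_dependency", "cold_cache"],
)
```

In this toy case, high_concurrency and slow_dependency come out as "causal" and cold_cache as a "red herring", because the bug reproduces equally with or without it.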

Where AI debugging assistance fits

The hypothesis enumeration step is the one most prone to anchoring — you generate the hypotheses your team has thought of before, miss the ones outside that pattern, and the investigation gets stuck. Running the symptoms through Reloadium Edge Case Debugger surfaces hypothesis categories the team might not have considered, particularly cross-cutting issues like timing, ordering, resource contention, and environmental drift.

The AI doesn't replace the human investigation. It expands the hypothesis space at the front of the process, which is where most intermittent bug investigations go wrong.
