Blameless Post-Mortem: Laravel SessionManager Clone Removal

TL;DR: I removed a clone in Laravel's SessionManager thinking it was unnecessary. It broke session routing and cache isolation. After investigating, the performance problem it was meant to fix (duplicate Redis connections) doesn't appear to be reproducible. The clone costs one PHP object per request, not a TCP connection. Lessons learned, process improved, tests written.


Not every PR lands cleanly. This one got reverted, and it was worth understanding why.

The change

PR #59323 targeted a reported performance problem (#58377): SessionManager::createCacheHandler() was cloning the entire cache Repository on every request. For Redis, that clone triggered Repository::__clone(), which deep-cloned the underlying Store. The original issue reported that this was creating duplicate Redis TCP connections.

The fix removed the clone and added conditional Store cloning only when session.connection was explicitly configured. The tests passed. The reasoning looked sound.

It was in 13.x for eight days before it was reverted.

What broke

Two bugs, both caused by the same fundamental mistake.

Bug 1: Sessions silently switched Redis databases. When SESSION_CONNECTION is null (the default), the old clone inherited the default Redis connection (DB 0). By removing the clone and sharing the cache Repository directly, sessions started using the cache Redis connection (DB 1) instead. Session data moved to the wrong database. Any user with an existing session in DB 0 would appear logged out after upgrading.

Bug 2: Cache contamination. When SESSION_CONNECTION is explicitly configured, setStore() mutated the shared cache Repository singleton. After session middleware ran, every Cache::get() call in the application was reading from the session database instead of the cache database. Rate limiting, job deduplication, cache locks: everything that depends on the cache would come back empty.
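The mechanics of Bug 2 can be sketched with a few toy stand-in classes (hypothetical names, not Laravel's real implementations): rebinding the store on a shared repository leaks into every other consumer of that singleton.

```php
<?php
// Toy stand-ins (hypothetical names, not Laravel's real classes) showing
// how rebinding the store on a SHARED repository leaks into every other
// consumer of the singleton.

class Store
{
    public function __construct(public string $connection) {}
}

class Repository
{
    public function __construct(private Store $store) {}

    public function setStore(Store $store): void { $this->store = $store; }
    public function connection(): string { return $this->store->connection; }
}

// The container holds one shared Repository backed by the "cache" connection.
$shared = new Repository(new Store('cache'));

// Session bootstrapping grabs the same object (no clone)...
$session = $shared;
$session->setStore(new Store('sessions')); // ...and mutates the singleton.

// Every later cache read in the request now targets the session database.
echo $shared->connection(); // prints "sessions"
```

The assignment `$session = $shared` copies an object handle, not the object, which is exactly why removing the clone turned a local rebinding into a request-wide change.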

Both bugs were reported by the community within days. stancl filed issue #59515 identifying the connection switch; in the same thread, webpatser pinpointed the cache contamination and provided a workaround. The PR was reverted shortly after.

Worst case and real-world impact

For apps using Redis sessions with separate default and cache connections, Bug 1 would make existing sessions in DB 0 invisible once the app started reading from DB 1, taking all session data with them: CSRF tokens, flash messages, cart contents. For apps with an explicit session.connection, Bug 2 would break any feature that depends on cache: rate limiting would stop working, queue job deduplication would fail, cache-based locks would break, and cached config or feature flags would return null.

The bug shipped in a tagged release:

  • v13.2.0 (Mar 24) - last clean release
  • PR #59323 merged (Mar 27) - the broken code enters the 13.x branch
  • v13.3.0 (Apr 1) - released with the bug. Anyone who ran composer update got it.
  • Issue #59515 filed (Apr 3) - stancl reports it
  • PR #59542 reverts (Apr 5) - fix merged to 13.x
  • v13.4.0 (Apr 7) - clean release with the revert

In practice, only two people reported the regression during the six days v13.3.0 was the latest release. No other issues were filed describing symptoms consistent with it.

The root cause

The concept that best describes this mistake is Chesterton's Fence: don't remove something until you understand why it was put there.

I looked at the clone and asked the wrong question. I asked: "Does CacheBasedSessionHandler mutate the Repository?" The answer was no; it only calls get, put, and forget. So I concluded the clone was unnecessary.

The right question was: "What invariant does this clone maintain?" The answer: isolation between subsystems. The session needs its own Repository instance so that operations on it (like setStore() for connection rebinding) don't contaminate the cache singleton that the rest of the application depends on.

The clone wasn't protecting against mutation of the Repository's data. It was protecting against mutation of the Repository as a shared object in the service container.
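The invariant can be demonstrated with the same kind of toy stand-ins (simplified, hypothetical classes, not Laravel's real implementations): cloning gives the session its own wrapper, so rebinding its store leaves the cache singleton clean.

```php
<?php
// Sketch of the invariant with simplified stand-in classes (not
// Laravel's real implementations): the clone gives the session its own
// wrapper, so rebinding its store leaves the cache singleton untouched.

class Store
{
    public function __construct(public string $connection) {}
}

class Repository
{
    public function __construct(private Store $store) {}

    // Deep-clone the store so the copy is fully independent.
    public function __clone(): void
    {
        $this->store = clone $this->store;
    }

    public function setStore(Store $store): void { $this->store = $store; }
    public function connection(): string { return $this->store->connection; }
}

$cache   = new Repository(new Store('cache')); // the container singleton
$session = clone $cache;                       // the fence: isolation

$session->setStore(new Store('sessions'));     // rebind only the copy

echo $session->connection(); // "sessions"
echo $cache->connection();   // "cache" - the singleton stays clean
```

One `clone` statement is the entire fence: everything downstream of it can mutate the session's wrapper freely without touching what the rest of the application resolves from the container.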

What the PR description got wrong

My PR stated "No behavior change" for the common case. That was wrong. I reasoned about it instead of proving it. The tests I wrote validated that sessions could read and write (the happy path), but never asserted which Redis connection was being used or whether the cache singleton remained clean after session initialization.

If I'd spun up Redis with separate default (DB 0) and cache (DB 1) connections and written a test that checked which database the session was actually writing to, both bugs would have been caught before the PR was ever opened.
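The shape of that missing assertion can be sketched even without a real Redis server. Below, two plain arrays stand in for DB 0 (the default connection) and DB 1 (the cache connection); every name here is illustrative. The point is to assert where the write landed, not just that a read-after-write round-trips.

```php
<?php
// Arrays standing in for the two Redis databases. Illustrative only.
$db0 = []; // default connection (where sessions belong when
           // SESSION_CONNECTION is null)
$db1 = []; // cache connection

// A stand-in session write that routes via the default connection,
// as the pre-PR behavior did.
$writeSession = function (string $id, string $payload) use (&$db0): void {
    $db0[$id] = $payload;
};

$writeSession('laravel_session:abc', 'payload');

// The assertions my original tests never made: check WHICH database
// received the data, not just that the session round-trips.
assert(isset($db0['laravel_session:abc']));  // landed in DB 0
assert(!isset($db1['laravel_session:abc'])); // nothing leaked into DB 1
```

Against a real Redis instance, the same two assertions become "SELECT 0 and the key exists; SELECT 1 and it doesn't", and either bug would have tripped them immediately.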

What I changed in my process

I use AI to help find issues with my fixes, and I verify its analysis myself before submitting: asking questions, probing edge cases, and looking for ways the change could break. The goal is to fix the problem without introducing regressions, making changes that are minimal but meaningful, and taking no shortcuts.

I maintain a PR Preflight checklist that I run through before marking any Laravel PR ready for review. After this revert, I added six new items. These are the kinds of checks AI can help with, as long as you verify the answers:

  1. Reproduce the problem first. Before writing any fix, reproduce the reported issue with a test and confirm it is an actual problem. If you can't reproduce it, the fix may not be needed. This PR was based on a reported performance problem that could not be reproduced with real connection counts.

  2. Chesterton's Fence check. For every removal (clone, copy, guard, wrapper), answer: "What invariant does this maintain? What breaks if two subsystems share state without it?" If I can't answer confidently, the removal isn't safe.

  3. "Prove, don't claim." If a PR description says "no behavior change", there must be a test that would fail if behavior did change. Reasoning is not proof.

  4. Test with real multi-service configs. Single-connection test setups hide routing bugs. For anything touching sessions, cache, or Redis, test with configs that use separate connections on different databases.

  5. Singleton contamination audit. When removing isolation around a shared singleton, trace every code path that could mutate the shared object after the change. The blast radius of singleton mutation is the entire request.

  6. Verify reasoning empirically. Confident analysis can be wrong. Any claim about behavior needs to be tested, not reasoned about.

Does the original problem actually exist?

The original issue (#58377) claimed that the clone creates duplicate Redis TCP connections. After the revert, I looked more carefully at what the clone actually does.

Repository::__clone() deep-clones the Store, creating a new RedisStore object. But RedisStore doesn't hold a connection directly. It holds a reference to the RedisManager factory, and since PHP's clone is shallow by default, the original and cloned Store share the same manager instance. When the cloned Store calls $this->redis->connection($this->connection), it hits the same RedisManager singleton:

// RedisManager::connection() - caches by name
public function connection($name = null)
{
    $name = enum_value($name) ?: 'default';

    if (isset($this->connections[$name])) {
        return $this->connections[$name];
    }

    return $this->connections[$name] = $this->configure(
        $this->resolve($name), $name
    );
}

The manager caches connections by name. Both the original and cloned Store resolve to the same cached connection. No duplicate TCP connection is created.
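A minimal stand-in for that memoization (class names are illustrative, not Laravel's) makes the conclusion concrete: connections are cached by name, so any number of stores sharing the manager resolve to the same object.

```php
<?php
// Minimal stand-in for the memoization above (illustrative class names,
// not Laravel's): connections are cached by name, so any number of
// stores sharing the manager resolve to the same connection object.

class Connection
{
    public function __construct(public string $name) {}
}

class Manager
{
    /** @var array<string, Connection> */
    private array $connections = [];

    public function connection(?string $name = null): Connection
    {
        $name = $name ?: 'default';

        // Same cache-by-name shape as RedisManager::connection().
        return $this->connections[$name] ??= new Connection($name);
    }
}

$manager = new Manager();

// An original store and its clone both hold the SAME manager, so both
// resolve to one shared connection per name: no duplicate TCP connection.
$fromOriginal = $manager->connection('default');
$fromClone    = $manager->connection('default');

var_dump($fromOriginal === $fromClone); // bool(true): one shared instance
var_dump($manager->connection('cache') === $fromOriginal); // bool(false): one per name
```

Cloning the Store duplicates the wrapper object, but every path back to the network still funnels through this one name-keyed cache, which is why the connection count stays at one per named connection.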

To confirm this, I tested connection counts against a real Redis instance. Session alone opens 1 connection (default). Cache adds 1 more (cache). Total: 2 connections, one per named connection, both necessary. The clone doesn't add a third. I could not reproduce the duplicate connections that the original issue reported. The actual cost of the clone is one extra PHP object per request, not a TCP connection.

I also used the same CLIENT INFO approach that stancl used in his issue report to independently verify the connection routing. All states produced identical results: correct routing, no contamination, no duplicate connections. I could only work that out by trying to reproduce the original bug and testing it properly. The code analysis suggested the connections were cached. The real-world tests proved it.

This changes the risk-reward calculus. The clone provides valuable isolation between the session and cache subsystems, as this revert demonstrated. The performance cost is one object allocation per request. Removing it risks the kind of regression we saw here, for a gain that doesn't appear to exist.

I've drafted an alternative approach that preserves the clone for isolation while sharing the Store. Benchmarking shows it offers no measurable performance benefit over stock Laravel for Redis sessions, which reinforces the conclusion that the original issue may not be a real problem. The value of the draft PR is the regression tests and documentation it adds, which may be useful as a reference for anyone revisiting this area in the future.

Between stancl's findings and the testing here, the duplicate connection claim in the original issue does not appear to be reproducible. The issue may no longer need to remain open.

Not every attempt lands on the first try. What matters is understanding what happened, improving the process, and applying those lessons to the next contribution. Sometimes solving a hard problem takes multiple attempts - each one reveals something the previous approach missed. That's not failure; that's how difficult problems get solved. The best response is better work next time.