Updated 22 Apr 2026 · 45 min read

Improving Queue Safety in Laravel

TL;DR: Every queued job needs eight things the scaffold doesn't give you: explicit retry bounds, a timeout shorter than retry_after, non-zero backoff, a failed() handler, HTTP timeouts, an idempotency guard, explicit lock expiry, and a platform cost ceiling. 23 findings verified against Laravel 13.4.0.

A follow-up to Why Your Laravel Jobs Might Retry Forever After an OOM

Eight things every queued job needs before you ship:

  1. Retry bounds -- $tries (count-based) or retryUntil() + $maxExceptions (time-based). Never both. Never platform defaults.
  2. $timeout shorter than the connection's retry_after (default 90s on both database and Redis) to avoid duplicate execution.
  3. Non-zero $backoff so retries pace themselves.
  4. A failed() method so failures aren't silent.
  5. Timeouts on every HTTP call inside the job.
  6. An idempotency guard at the top of handle() if the job writes external state. Queues redeliver.
  7. Explicit lock expiry -- $uniqueFor on ShouldBeUnique, ->expireAfter() on WithoutOverlapping. 0 means "never" in Laravel.
  8. On serverless: tight queue-timeout and a hard cost kill switch. Lambda bills wall-clock including idle I/O.

The scaffold gives you none of them. You are responsible for the safety and reliability of your queues.

Using an AI coding agent? Review the prompts below and copy them into your tool (Claude Code, Cursor, Windsurf, Copilot, or similar) to audit and apply the spec across your jobs.

AI agent: summary of changes to apply across my codebase
Main goal: Apply the queue-safety spec across every class in app/Jobs/ that implements Illuminate\Contracts\Queue\ShouldQueue.

Read https://joshsalway.com/articles/improving-queue-safety-in-laravel for full context. Before changing anything, summarise per file what you plan to do and ask me to confirm. Do not change business logic. Leave existing values alone and note them in the summary. Each recommendation below is structured as Goal, Issue, Why, Fix.

---

Goal: Target the right files.

Issue: A grep for "implements ShouldQueue" misses classes that inherit it from an abstract parent. Event-subscriber classes with a subscribe() method sometimes implement ShouldQueue but have no handle() and should not be treated as jobs.

Why: Applying the spec to a non-job class injects nonsensical retry properties and can break listener dispatch. Missing inheriting children leaves whole families of jobs uncovered.

Fix: Target every class that (a) directly implements Illuminate\Contracts\Queue\ShouldQueue, OR (b) extends a class that implements it (transitively). Exclude classes that lack a handle() method or contain subscribe($dispatcher); those are queued event subscribers, report them separately.

---

Goal: Explicit retry bounds on every queued job.

Issue: Jobs without $tries inherit the worker's --tries setting, which varies by platform. queue:work defaults to 1; vapor:work's signature defaults to 0 (unlimited), though the runtime usually overrides it to SQS_TRIES ?? 3. Relying on platform defaults is fragile.

Why: Setting $tries on the job class makes retry behaviour portable across platforms and visible in code review.

Fix: Add public int $tries = 3; Use 3 when retrying the job is safe and has no side effects on the second or third attempt. Use 1 when the job writes external state that would duplicate on retry: raw bulk inserts, non-idempotent HTTP POSTs, shell-outs via Symfony\Component\Process\Process / proc_open / exec / shell_exec / passthru / Composer / rsync / git, Artisan::call() or $this->call(), event dispatches with side-effect listeners, broadcast() calls with unknown listener sets, counter increments, non-deduplicated email sends, dispatches of further jobs from handle() via Bus::chain / Bus::batch / Job::dispatch, and GuzzleHttp\Pool or any concurrent HTTP fanout. Also use 1 when the job's handle() already retries internally via Http::retry() or a similar in-process retry layer; stacking two retry layers multiplies cost and attempt counts without improving recovery.

If the job polls by calling $this->release($interval) inside handle() (a pattern for "wait for N dependent jobs to finish" or "wait for external state to settle"), $tries is a polling ceiling, not a retry ceiling: compute it as (max_wait_time / release_interval) plus a small buffer, and do not apply the default. Real-world example from SpartnerNL/Laravel-Excel's AfterImportJob: public $tries = 10; with the comment "each release() in handle() counts as an attempt". Report the release() call, the interval, and your chosen ceiling to the user.

If retryUntil() is set AND the job is non-idempotent, still add $tries = 1; the framework ignores $tries when retryUntil() is set, but leaving it as a documentation marker is correct and protects future refactors. Overwrite existing $tries values of 0, null, or negative integers; those are not "existing values to preserve" but anti-patterns explicitly called out in the linked post.
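For the release()-polling ceiling, a minimal sketch (the job name, interval, and wait budget are illustrative, and importHasSettled() is a hypothetical check):

class AwaitImportSettled implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    // Polling ceiling, not a retry ceiling:
    // 600s max wait / 30s release interval = 20 attempts, plus a small buffer.
    public int $tries = 22;

    public function handle(): void
    {
        if (! $this->importHasSettled()) {
            $this->release(30); // each release() counts as an attempt

            return;
        }

        // ...continue with the post-import work
    }
}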

---

Goal: Pace the retries.

Issue: The default $backoff is 0 seconds, so a failing job retries immediately as fast as the worker can process it.

Why: Non-zero backoff prevents tight failure loops against downstream services, APIs, or databases.

Fix: When $tries > 1, add public array $backoff = [30, 60, 300]; (seconds). Skip this when $tries = 1 since there is only one attempt. If the job defines a backoff() method instead of a property, do not add the property; the method form takes precedence and an added property would be ignored at best and confusing at worst.
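If a job needs a computed delay instead of the property, the method form (which takes precedence) looks like this -- a sketch with illustrative values:

public function backoff(): array
{
    // Seconds to wait before the 2nd, 3rd, ... attempts;
    // if there are more retries than values, the last value repeats.
    return [30, 60, 300];
}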

---

Goal: Per-attempt timeout.

Issue: Jobs without $timeout run until the worker's --timeout kicks in, or indefinitely if there is no --timeout. On serverless this is also the cost-per-attempt ceiling.

Why: A bounded timeout caps the worst-case wall-clock time (and cost) per attempt.

Fix: Add public int $timeout = 120; Keep this shorter than the connection's retry_after (90s on both database and Redis by default) to prevent duplicate execution when a slow job is taken over by a second worker. If the job already has $timeout > 90, flag it for review rather than silently preserving it; the job is a candidate for duplicate-execution (Finding #10 in the linked post).

---

Goal: Exception ceiling that survives OOM.

Issue: Any $maxExceptions value greater than 0, as well as null (the implicit default when the property is unset), leaves the job exposed to an OOM infinite loop. Finding #1 in the linked post shows the exception counter can silently fail to increment when the worker is killed mid-fire by OOM, so a positive $maxExceptions never reaches the fail threshold and the job retries indefinitely within the retryUntil window.

Why: Only $maxExceptions = 0 is safe, because it fails the job on the first catchable exception without relying on the counter-increment path at all.

Fix: If the job already defines $maxExceptions with a positive integer, change it to 0. If the job does not define $maxExceptions at all, leave the property off (do not add a positive value). Never set it to a positive integer. When you do set it to 0, include a short comment noting that $tries guards retries across restarts and $maxExceptions = 0 guards against catchable-exception loops under OOM.

---

Goal: Observable permanent failures.

Issue: Silent permanent failures are invisible until an invoice or missing data triggers an investigation.

Why: A log entry on final failure turns a missed job into a diagnosable event.

Fix: Add a failed(Throwable $exception) method that logs static::class, job-specific identifiers (ids, keys), and $exception->getMessage(). Log only scalar values; never log Illuminate\Support\Collection, Eloquent model instances, or array-property bodies because a large collection will blow up the log payload and make triage harder, not easier. Skip this if the job already has a failed() method OR internal failure handling (any try/catch in handle() that logs, or a named method like logFailedPing). Do not add duplicate logging.
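A sketch of the shape (the email_id key is illustrative; use whatever scalar identifiers your job carries):

public function failed(Throwable $exception): void
{
    Log::error(static::class . ' permanently failed', [
        'email_id' => $this->emailId,            // scalars only
        'exception' => $exception->getMessage(), // never the model or collection itself
    ]);
}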

---

Goal: Unique-dispatch lock expiry.

Issue: ShouldBeUnique with $uniqueFor = 0 (the default) means the lock never expires on worker kill.

Why: A killed worker leaves the lock held; all future dispatches of the same unique key are silently rejected until cache TTL.

Fix: On any class implementing ShouldBeUnique OR ShouldBeUniqueUntilProcessing, add public int $uniqueFor = 3600; (tune by lock-hold expectation: 300 for short-refresh jobs, 600 for Composer-like external-call jobs, 3600 for long-running work). Skip if the job already defines uniqueFor().

---

Goal: Overlapping-middleware lock expiry.

Issue: WithoutOverlapping without ->expireAfter() leaves an infinite lock when the worker is killed.

Why: An infinite lock on worker kill blocks all future runs of this job.

Fix: On any middleware() returning WithoutOverlapping, chain ->expireAfter(minutes: 30)->releaseAfter(seconds: 30). Use ->dontRelease() if overlapping dispatches should be silently skipped instead of requeued.

---

Goal: Middleware backoff interaction.

Issue: ThrottlesExceptions catches exceptions internally and releases the job directly; the job-level $backoff is bypassed.

Why: Without a chained ->backoff() on the middleware, throttled retries happen immediately (retryAfterMinutes default is 0).

Fix: On any middleware() returning ThrottlesExceptions or ThrottlesExceptionsWithRedis, chain ->backoff($minutes) where $minutes is the desired delay. For any other rate-limit-family middleware you do not recognise (RateLimited, RateLimitedWithRedis, Skip, or custom wrappers), do not modify. Report the middleware class name and the full middleware() body to the user and ask before changing.
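A sketch of the chained form (the thresholds are illustrative):

use Illuminate\Queue\Middleware\ThrottlesExceptions;

public function middleware(): array
{
    // Allow 10 exceptions within 10 minutes, then throttle.
    // ->backoff(10) paces the released retries in minutes, because this
    // middleware bypasses the job-level $backoff entirely.
    return [(new ThrottlesExceptions(10, 10 * 60))->backoff(10)];
}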

---

Goal: Bounded external HTTP calls.

Issue: Bare file_get_contents(), fopen(), or Http::get() without ::timeout() can hang for minutes against slow origins.

Why: A hung HTTP call pins the worker at the job $timeout ceiling, which on serverless is money.

Fix: Replace direct Http:: facade calls inside the job's handle() method with a bounded equivalent: $response = Http::timeout(10)->get($url)->throw(); Only modify calls inside the job file. Do not touch Http::retry() chains (removing retry config is a regression), HTTP calls inside service-layer classes the job delegates to, or calls in test files. If the job uses a service class for HTTP, report the service class path and recommend the user review its HTTP call sites separately. Flag (do not rewrite) raw GuzzleHttp\Client, GuzzleHttp\Pool, injected ClientInterface / HttpRequestService, and per-row client-timeout patterns (e.g. $this->webhook->timeout passed into a client constructor) — these bypass Laravel's HTTP facade and automatic rewriting is unsafe. Report each site and ask.

---

Goal: Batch-aware job lifecycle.

Issue: Jobs using the Batchable trait that fail or are cancelled mid-batch leak work: handle() keeps running even after the batch has been cancelled, and a failed() handler that does not surface the batch id makes post-mortem triage harder than it needs to be.

Why: The framework provides batch-cancellation checks and batch id access, but does not wire them into the scaffold. Every batched job has to opt in.

Fix: If the job uses the Batchable trait and handle() does not already begin with an early return on batch cancellation, add if ($this->batch()?->cancelled()) { return; } as the first line of handle(). In the failed() method, include $this->batch()?->id in the log context so permanent failures are attributable to a specific batch.
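A sketch of both pieces (ImportChunkJob and the log keys are illustrative):

use Illuminate\Bus\Batchable;

class ImportChunkJob implements ShouldQueue
{
    use Batchable, Dispatchable, InteractsWithQueue, Queueable;

    public function handle(): void
    {
        // Bail out before doing work if the batch was cancelled
        // while this job was waiting in the queue.
        if ($this->batch()?->cancelled()) {
            return;
        }

        // ...process the chunk
    }

    public function failed(Throwable $exception): void
    {
        Log::error(static::class . ' permanently failed', [
            'batch_id' => $this->batch()?->id, // attribute the failure to its batch
            'exception' => $exception->getMessage(),
        ]);
    }
}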

---

Verify before running:
- If a property is already set to a safe value, keep it and note in the summary. Exceptions: $tries = 0, null, or negative integers should be overwritten (these are anti-patterns); $timeout > 90 should be flagged for review.
- If retryUntil() is set AND the job's handle() is idempotent (re-running on the same input produces the same state), skip $tries. If retryUntil() is set AND handle() is non-idempotent, add $tries = 1 anyway as a documentation marker and a guard against future refactors that remove retryUntil.
- Check parent class for inherited properties too, not just the job file itself.
- Preserve unrelated public properties and methods you do not recognise ($failOnTimeout, $deleteWhenMissingModels, $connection, custom business-logic properties, etc.). This spec covers retry, timeout, lock, and HTTP safety; everything else stays as the original author wrote it.
- For every $timeout = N you add or preserve, audit config/queue.php for the connection's retry_after (a config excerpt follows this list). It must be strictly greater than N (at least N + 10 seconds of buffer). Default retry_after is 90s on both database and Redis in a fresh Laravel scaffold; both are shorter than the spec's $timeout = 120 and cause silent duplicate execution on the slow path.
- Before modifying retry properties, scan for traits used by the job class whose names match /Recover|Retry|Throttle|Rate/. A shared trait that sets retry config is a force-multiplier: one bad value infects every job using the trait. Halt and report the trait source before modifying job-level retry properties.
- Preserve constructor-level $this->queue = 'x', $this->onQueue(...), $this->onConnection(...) calls and any queue-enum routing. Do not rewrite or reorder these.
- Recognize both Illuminate\Bus\Queueable (Laravel 10 and earlier) and Illuminate\Foundation\Queue\Queueable (Laravel 11+); either trait composition is valid, do not swap one for the other.
- If the job has a uniqueId() method without the ShouldBeUnique interface, leave both alone. The method without the interface is dormant; adding $uniqueFor changes nothing.
- If the job declares a #[DebounceFor] attribute (Laravel 13.6+ debounceable queued jobs), skip the $uniqueFor and ShouldBeUnique logic. #[DebounceFor] uses last-writer-wins cache-token semantics at execute time; it is a different deduplication path and layering $uniqueFor on top produces confusing semantics.
- Read handle() before choosing $tries = 1 vs 3. When in doubt, report the handle() body and ask.
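For the retry_after audit above, the relevant keys in a fresh config/queue.php look like this -- an excerpt showing only the keys that matter here:

// config/queue.php (excerpt)
'connections' => [
    'database' => [
        'driver' => 'database',
        'retry_after' => 90, // must exceed every job's $timeout by at least 10s
        // ...
    ],
    'redis' => [
        'driver' => 'redis',
        'retry_after' => 90, // raise to 130+ before shipping $timeout = 120
        // ...
    ],
],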

This post documents 23 queue safety findings -- 14 from my audit, 9 from the community. Verified against Laravel Framework v13.4.0 through v13.6.0 with links to real GitHub issues spanning 2015 to 2026.

Any changes you make to your application are your responsibility. Do your own research first.


How I found these

After two Vapor billing incidents ($140 in 2019, $218.90 in 2023), I traced my original application's git history commit by commit, cross-referenced support emails, and compared Worker.php across Laravel v10, v12, and v13. My code had gaps, but so did the framework and platform. That led to a broader question: how many other places in the queue system have the same pattern?

Using Claude Code, I audited the full queue system for unbounded behaviour, counters bypassed by process death, and unsafe defaults.

Packages examined:

  • laravel/framework (v13.4.0 original audit; spot-checks against v13.6.0): src/Illuminate/Queue/, src/Illuminate/Bus/, src/Illuminate/Pipeline/, src/Illuminate/Console/Scheduling/
  • laravel/vapor-core (v2.43.3): VaporWorkCommand.php, QueueHandler.php, VaporWorker.php, VaporJob.php
  • laravel/vapor-cli, laravel/cloud-cli, laravel/forge-cli, laravel/forge-sdk: checked for comparison

I also compared defaults across Sidekiq, BullMQ, Celery, Google Cloud Tasks, and AWS SQS native to find suitable guardrails and recommendations.


What likely caused my billing incidents

After investigating the original application's git history, support emails, and vapor.yml at the time of the incidents, the root cause of the August 2023 bill ($218.90 in 7 days) became clear.

This was not a literal infinite retry loop. The OOM bug in Finding #1 is a real framework bug and can produce unbounded retries in its specific trigger conditions, but the 2023 incident is a different mechanism. On Vapor, QueueHandler invokes vapor:work --tries=$SQS_TRIES ?? 3, so each problem email got up to 3 SQS redeliveries, each running to the full 15-minute queue-timeout: 900 ceiling because of uncapped file_get_contents() hanging on slow image origins. That's up to 45 minutes of billed 2048MB Lambda time per problem message (3 attempts x 900s x 2GB x $0.0000166667/GB-s ~= $0.09 per message). Multiply by roughly 350 problem emails per day over 7 days and the bill arrives at $218.90 without anything retrying forever. Same class of queue-safety failure as Finding #1, different mechanism. I cannot prove the OOM bug caused this specific bill, and on the Lambda pricing math I no longer think it did.

What was a skill issue

  • I set queue-timeout: 900 in vapor.yml "to be safe." This was the dominant cost multiplier. Lambda bills wall-clock time including idle I/O waits, and at 2048MB x 900s x $0.0000166667/GB-s each timed-out attempt cost roughly $0.03. With the same mistakes below but a default 60s queue-timeout, the same incident would have cost around $15 instead of $218. "To be safe" was the thing that wasn't.
  • I used file_get_contents() on arbitrary URLs with no timeout inside a queue job. One slow origin pinned the worker at the Lambda ceiling for the full 15 minutes per attempt.
  • I didn't set $tries on my job classes. That's in the docs. On Vapor the effective retry count was SQS_TRIES ?? 3 (bounded, not infinite), but three attempts at $0.03 each times inbound email volume is how the 2023 bill accrued in a week.
  • I didn't set up AWS budget alerts. The spend was invisible until the invoice arrived.

Those are real mistakes and I own them.

What wasn't a skill issue

  • The tries configuration has three layers that contradict each other (see finding #4).
  • There is no global default_tries config. One missed property on one job class and the behaviour depends on which layer wins.
  • Retries are completely silent. No warning, no log entry, no event.
  • Support responded to both incidents with "check your AWS invoice" and "set up budget alerts." Neither response mentioned $tries, queue-timeout, retry limits, or the Vapor tries default. Even a basic triage question in 2019 ("what is your queue-timeout, do your jobs have $tries and $timeout set, do your HTTP calls have timeouts?") would have turned the 2023 incident into a background-noise bill instead of an alarm-bell one.

Whose fault is it?

In the previous post I wrote that the framework, hosting platform, and application code should all have seatbelts in place to reduce the likelihood of infinite retries in serverless environments. All three failed:

Mine: A 15-minute queue-timeout (the dominant cost multiplier on the evidence), unbounded HTTP calls inside jobs, and jobs without $tries. In that order of impact.

The framework: Unsafe defaults -- make:job stub is bare, $tries null (see #4), backoff 0 (see #2), WithoutOverlapping lock infinite (see #3), maxExceptions bypassed by OOM (see #1) -- meant my mistakes had no guardrails at the layer I actually wrote.

The platform (Vapor): Three layers of tries configuration that contradict each other, none clearly documented (see #4). Support didn't flag the root cause in either incident.

The git log from August 20, 2023 (private repository) tells the story: 11 commits in a single day, reverts of reverts, and the first $tries = 5 added mid-crisis. The fix that actually turned off the money hose was the file_get_contents HTTP timeout added the next day (commit e6f7cb5, Aug 21), which stopped attempts from hanging to the Lambda ceiling; that change alone dropped per-attempt cost by roughly 30x. The $tries = 5, idempotency guard, and queue-timeout reduction that landed the night before were supporting fixes, not the dominant one. No single layer caused this alone. My bad code, combined with unsafe framework defaults, on a platform with inconsistent and undocumented tries configuration, turned a coding mistake into $358.90 across two separate incidents four years apart.

Lambda is for short bursts, not long-running work

The same code, same bugs, same SQS redeliveries would have cost very different amounts on different platforms. The 2023 mechanism was 3 attempts x up to 15 minutes x 2GB x pay-per-millisecond billing. Run the same scenario on:

  • Non-serverless environments (Forge, Ploi, Laravel Cloud, DigitalOcean droplet): $0 incremental. The worker was already paid for. You'd see a backed-up queue, a slow server, and possibly dropped jobs. No excessive bill.
  • Laravel Cloud with a capped queue worker: bounded to the worker's provisioned compute regardless of retry behaviour. Slow queue, predictable ceiling.
  • Cloudflare Workers: CPU-time-gated (10 ms per request on free, configurable up to 5 minutes on paid). Tasks are tied to the HTTP request lifecycle -- client disconnect cancels them with a 30-second waitUntil grace. Pay-per-request plus CPU time keeps costs predictable. Caveat: a hung fetch() does not count against CPU time, so slow-origin I/O can still wait longer than the CPU cap implies.
  • Vercel Functions: 10 seconds default on Hobby, 60 seconds on Pro, configurable up to 800 seconds on Pro with Fluid Compute. The low default is the safety net; raising maxDuration near 800s recreates the Lambda shape.
  • Lambda with a default 60 second queue-timeout (same Vapor, different config): about $15 instead of $218 on the same code.

This shape of problem isn't framework-specific; the same unsafe patterns cause problems on every platform. The lesson is that Lambda's pricing model rewards short bursts and punishes long-running work. A 15-minute queue-timeout inverts Lambda's value proposition: per-millisecond billing is great for the 30-second happy path and catastrophic on the 15-minute unhappy path.

For background work that's inherently quick -- parsing an inbound email, saving an attachment to storage, rendering a small PDF, calling an API -- Cloudflare Workers or Vercel Functions give you low default timeout caps and pay-per-request billing that keeps costs predictable. The low default is the safety net; raise the cap and you give it up. For longer or memory-heavy tasks, non-serverless environments (Forge, Ploi, Laravel Cloud) absorb the mistake class entirely: bugs produce slow queues, not excessive bills. Lambda via Vapor is best reserved for bursty request-response work, not for queues where one hung HTTP call means minutes of billed idle time.


Do we actually need seatbelts or guardrails?

Nobody plans to have a car accident. You don't put on a seatbelt because you expect to crash. You put it on because if something goes wrong, the seatbelt is the difference between a bad day and a catastrophic one.

Fun fact: When Volvo invented the three-point seatbelt in 1959, they made the patent available to every car manufacturer for free because they believed safety should be a shared standard, not a competitive advantage. It still took decades for seatbelts to become mandatory worldwide.

Queue safety is the same. Most jobs work fine. Most deployments don't have runaway billing. Most developers go years without a serious queue incident. But when one hits, the difference between a job with the eight properties set and a job without them is the difference between a failed job in your failed_jobs table and a surprise bill on the invoice.

Guardrails on a road don't slow you down. They're invisible until the moment you need them. $tries, $backoff, $timeout, and failed() are the same. They cost nothing in normal operation. They save you when something unexpected happens: an API goes down, a payload is larger than expected, a memory limit is hit, a deploy goes wrong.

On serverless platforms, a runaway job shows up as a bill. In non-serverless environments (Forge, Ploi, Laravel Cloud, etc.), there's no excessive bill to trigger an investigation. The signal is quieter: a job retrying in a tight loop consumes CPU and memory, your server gets slower, other jobs back up, and some jobs may be dropped. Without monitoring or logging, you might never know it's happening. Queue safety is a shared discipline across every platform. Serverless just surfaces the cost quickly in dollars, while non-serverless hides it in slower throughput and dropped work.

If you've never had a queue incident, that's great. Add the guardrails anyway. They're free insurance. Regardless of the safety angle, these recommendations will likely make your queue jobs more reliable. And if you already knew all of this, pat yourself on the back.


Recommendations you can do today

Even after writing this audit, I checked my own applications and found jobs without $tries set -- including the exact job that caused my billing incident. That's how easy this is to miss. Safe defaults would catch it for everyone.

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;
use Throwable;

class ProcessEmailJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $tries = 3;
    public $maxExceptions = 0;
    public $backoff = [30, 60, 300];
    public $timeout = 120;

    public function retryUntil()
    {
        return now()->addHours(2);
    }

    public function failed(Throwable $exception)
    {
        // Log, notify, or alert. Don't let failures be silent.
        Log::error('ProcessEmailJob permanently failed', [
            'exception' => $exception->getMessage(),
        ]);
    }
}

Pick one retry policy, not both. Laravel's Worker treats $tries and retryUntil() as mutually exclusive: if retryUntil() returns a timestamp, $tries is ignored completely (Worker.php line 612 on 13.x, behaviour introduced in framework PR #35214, Nov 2020). So pick one:

  • Count-based -- $tries + $backoff + $maxExceptions, no retryUntil(). Bounds by attempt count.
  • Time-based -- retryUntil() + $maxExceptions, no $tries. Bounds by wall-clock. Taylor's own guidance on #35199: "When using retryUntil I would use maxExceptions if you want to determine how many uncaught exceptions are allowed."

The example above sets both so you can see the properties in one place, but in production use one or the other. See Finding #11 for the source-level proof.
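Side by side, the two policies look like this (the class names are illustrative):

// Count-based: bounded by attempt count.
class SyncInvoices implements ShouldQueue
{
    public $tries = 3;
    public $backoff = [30, 60, 300];
    public $maxExceptions = 0;
}

// Time-based: bounded by wall-clock. $tries would be ignored here.
class PollExternalStatus implements ShouldQueue
{
    public $maxExceptions = 0;

    public function retryUntil()
    {
        return now()->addMinutes(30);
    }
}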

Why each setting matters: four properties and two methods, set explicitly on every queued job. The caveats under each item explain scope and edge cases; they don't weaken the rule.

  • $tries = 3 -- hard cap on total attempts (count-based policy). Don't rely on platform defaults, set this explicitly on every job. Use 3 if multiple tries are useful and don't cause side effects, or 1 if you want to be strict and only run once. Don't use $tries = 0: in Laravel that means infinite retries, not zero. Use a positive integer. Ignored when retryUntil() is set. $tries on the job class wins over every worker-level default, which matters because those defaults differ by platform (queue:work uses 1, vapor:work uses 0 but is runtime-overridden to SQS_TRIES ?? 3). See Finding #4 for the three-layer diagnosis.
  • $maxExceptions = 0 -- fails the job on the first catchable exception. See note below. Works with either policy.
  • $backoff = [30, 60, 300] -- escalating delays between retries. Without this, retries are immediate (0 seconds). Applies when exceptions propagate to the worker's handleJobException. Middleware that catches and releases internally (ThrottlesExceptions, RateLimited) bypasses $backoff and uses its own delay parameter; chain ->backoff($minutes) on the middleware in that case. See Finding #8 for the source-level proof.

Why $maxExceptions = 0? This is deliberately aggressive because of finding #1: the maxExceptions counter never increments during OOM, so any value above 0 allows infinite retries when OOM and catchable exceptions alternate. Setting it to 0 means the first catchable exception stops the loop. Retry tolerance is handled by $tries and $backoff instead, which work across process restarts.

  • $timeout = 120 -- kills the job after 2 minutes. Without this, jobs can run until Lambda's 15-minute ceiling. $timeout must be shorter than the connection's retry_after config value (default 90 on both database and Redis), so raise retry_after to at least 130 in config/queue.php before shipping $timeout = 120; if $timeout exceeds retry_after, the same job can be picked up and executed twice. See Finding #10 for the duplicate-execution scenario.
  • retryUntil() -- time-based circuit breaker (time-based policy). Negates $tries completely when set. Use alongside $maxExceptions, not $tries.
  • failed() -- get notified when a job permanently fails. Silent failures are what cause $200 bills.

On WithoutOverlapping middleware

public function middleware()
{
    return [
        (new WithoutOverlapping($this->key))
            ->expireAfter(minutes: 30)
            ->releaseAfter(seconds: 30),
    ];
}

Always set expireAfter. The default is 0 (never expires). If your worker crashes while holding the lock, the job is permanently blocked without this.

On ShouldBeUnique jobs

class ImportDataJob implements ShouldQueue, ShouldBeUnique
{
    public $uniqueFor = 3600; // seconds
}

Always set $uniqueFor. The default is 0, which means the lock never expires (same pattern as WithoutOverlapping). On a normal exception, CallQueuedHandler::failed() releases the lock via ensureUniqueJobLockIsReleased() (source). The leak case is when failed() never runs: the worker is terminated mid-handle() (OOM, SIGKILL, container eviction), or the $deleteWhenMissingModels = true + missing-model path from Issue #49890 reported by @naquad (closed as "no-fix for us"). In those cases the lock persists until cache TTL, blocking all future dispatches of that unique key.

If you want the lock released as soon as the job starts processing rather than when it completes or fails, implement ShouldBeUniqueUntilProcessing instead. Narrower window, less leak risk, but allows the job to run concurrently with another dispatch of the same unique key after processing begins.

Laravel 13.6.0 introduced #[DebounceFor] as an attribute-driven alternative (PR #59507, merged April 2026). It uses last-writer-wins cache-token semantics over a debounce window, no interface required, and fires a JobDebounced event when a dispatch is superseded. Use #[DebounceFor] when you want to deduplicate a burst of dispatches and run only the last one; use ShouldBeUnique when you want the first to run and the rest to be silently dropped while the lock is held.

Prune failed jobs

// In app/Console/Kernel.php or routes/console.php
Schedule::command('queue:prune-failed --hours=168')->daily();

The failed_jobs table grows unbounded. At 300k+ records, queue:retry --all will OOM (Issue #49185 reported by @arharp). Prune weekly.

On external HTTP calls inside jobs

// Bad: no timeout, no size limit
$content = file_get_contents($url);

// Good: bounded timeout, exception on failure
$response = Http::timeout(10)->get($url);
$response->throw();
$content = $response->body();

Every external call inside a queue job should have a timeout. One slow or unresponsive endpoint can hold a Lambda invocation running for minutes.
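On idempotency guards

Queues redeliver: an SQS redelivery or a retry_after takeover (Finding #10) can run the same handle() twice. If the job writes external state, claim a processed-marker before the side-effecting work. A minimal sketch using Cache::add (the key shape and TTL are illustrative; a unique database constraint works just as well):

public function handle(): void
{
    // Cache::add is atomic: exactly one attempt wins the marker.
    $claimed = Cache::add('processed:email:' . $this->emailId, true, now()->addDay());

    if (! $claimed) {
        return; // an earlier attempt or a duplicate delivery already did the work
    }

    // ...side-effecting work goes below the guard
}

Claiming before the work means a failure after the claim will not be retried; if that matters, claim inside the same transaction as the write, or move the marker to completion and accept a small duplicate window.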

Hard limits

These catch problems regardless of whether individual jobs are configured correctly, ordered by leverage:

  • queue-memory and queue-timeout in vapor.yml -- these set your per-attempt cost ceiling on Lambda: memory-gb x timeout-seconds x $0.0000166667/GB-s. At 2048MB and queue-timeout: 900 that is roughly $0.03 per stuck attempt, multiplied by your retry count and your volume. Keep both as low as your jobs actually need. If you take one thing from this section, take this: the dominant cost control on serverless queues is how long each failed attempt is allowed to burn, not how many retries you allow.
  • queue:work --tries=3 -- set this on your worker command or supervisor config (see the supervisor sketch after this list). Acts as a floor even if a job class forgets $tries. On Vapor, set SQS_TRIES=3 in your environment. If you run an SQS dead letter queue with maxReceiveCount as your retry ceiling, set SQS_TRIES=0 instead so Laravel releases the message back and the DLQ handles the cap at the infrastructure layer.
  • SQS dead letter queue -- configure a redrive policy with maxReceiveCount in AWS. After N receive attempts, SQS moves the message to a DLQ automatically, even if your application crashes. This is your infrastructure-level circuit breaker and works independently of Laravel.
  • Ideally, a cost-based kill switch -- a soft limit sends you an email when spend hits a threshold. A hard limit actually stops the workload. Vapor and most serverless platforms only offer soft limits today. A hard limit that pauses queue processing when spend exceeds a configurable amount (e.g. $30/month) would have prevented both of my incidents entirely. The alert told me the house was on fire. A hard limit would have put it out.
  • AWS budget alerts -- won't stop the spend, but tells you early. Set via the Vapor UI.
  • Monitor queue depth -- as of Laravel 13.4.0, Queue::pendingJobs(), Queue::delayedJobs(), and Queue::reservedJobs() (PR #59511) let you inspect queue state natively. On AWS, you can also use CloudWatch alarms on ApproximateNumberOfMessagesVisible.
  • Lambda concurrency limits -- set reserved concurrency on your queue Lambda to cap how many concurrent invocations can run. Limits the burn rate during a runaway.
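A minimal supervisor program for the worker floor mentioned above (paths and numbers are illustrative):

; /etc/supervisor/conf.d/laravel-worker.conf
[program:laravel-worker]
command=php /var/www/app/artisan queue:work --tries=3 --backoff=30 --timeout=120
numprocs=2
autostart=true
autorestart=true
stopwaitsecs=130 ; give a running job time to finish before supervisor sends SIGKILL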

A base job class

If you want to apply safe defaults across all your jobs without repeating yourself:

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;
use Throwable;

abstract class SafeJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $tries = 3;
    public $maxExceptions = 0;
    public $backoff = [30, 60, 300];
    public $timeout = 120;

    public function failed(Throwable $exception)
    {
        Log::error(static::class . ' permanently failed', [
            'exception' => $exception->getMessage(),
        ]);
    }
}

Then extend it:

class ProcessEmailJob extends SafeJob
{
    // Inherits all safe defaults
    // Override any property if this job needs different limits
}

The Findings

Critical: Unbounded retries and permanent resource locks

These findings can cause infinite retry loops, permanent job lockout, or runaway costs on serverless platforms. They are the highest priority for anyone running queues in production.

1. maxExceptions counter only increments inside the catch block

File: src/Illuminate/Queue/Worker.php - process() and markJobAsFailedIfWillExceedMaxExceptions()

This is the bug documented in my previous blog post. The maxExceptions counter is incremented inside handleJobException(), which is called from the catch (Throwable) block. When the process is killed by an out-of-memory error, the catch block never executes. The counter is never incremented. The job retries with a counter of zero indefinitely.

There is a pre-fire check for maxTries via markJobAsFailedIfAlreadyExceedsMaxAttempts, but no equivalent pre-fire check for maxExceptions.

Real-world: Issue #58207 (Dec 2025) reported by @pingencom -- 31 comments from production users independently building workarounds.

Suggested fix: Increment the exception counter before fire(), decrement after successful completion. If the worker dies during fire(), the increment persists. On the next pickup, a pre-fire check can fail the job if the counter meets the threshold.
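A sketch of that ordering (hypothetical key name and placement -- this is the shape of the suggested patch, not framework code):

// Pre-fire guard inside Worker::process() -- hypothetical, not framework code.
$key = 'job_exceptions:' . $job->uuid();

if ($this->cache->get($key, 0) >= $maxExceptions) {
    // Fail the job here, mirroring the existing pre-fire maxTries check
    // in markJobAsFailedIfAlreadyExceedsMaxAttempts.
    return;
}

$this->cache->increment($key); // persists even if OOM kills the process mid-fire()
$job->fire();
$this->cache->decrement($key); // only reached when fire() completes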


2. Default backoff is 0 seconds

File: src/Illuminate/Queue/Worker.php - calculateBackoff()

When a job fails and has no backoff property, the default is 0 seconds. The job is immediately re-queued. Combined with a high or unlimited retry count, this creates a tight failure loop where the same job fails and retries as fast as the worker can process it.

Real-world: Issue #44680 (Oct 2022) reported by @hjeldin -- backoff ignored after timeout kill, job retries immediately.

Suggested fix: Default backoff to a small positive value (e.g. 3 seconds) instead of 0.


3. WithoutOverlapping middleware defaults to an infinite lock

File: src/Illuminate/Queue/Middleware/WithoutOverlapping.php

$expiresAfter is unset by default (null). When passed to Cache::lock() with no TTL, most drivers (Redis, Database) produce a lock that never expires. If a job using this middleware is killed (OOM, SIGKILL, server crash), the lock is never released. All future instances of that job are permanently blocked, either released back to the queue indefinitely or silently dropped.

Real-world: Issue #37060 (Apr 2021) reported by @lasselehtinen -- lock not released on failed jobs.

Suggested fix: Default $expiresAfter to a reasonable value (e.g. 3600 seconds) instead of leaving it unset.


4. $tries defaults to null (unlimited) in job payload

Files: src/Illuminate/Queue/Queue.php - getJobTries(), src/Illuminate/Queue/Worker.php - markJobAsFailedIfAlreadyExceedsMaxAttempts()

When a job class does not define a $tries property, the payload contains maxTries: null. Null does not itself mean unlimited: Worker.php line 578 replaces null with the worker's --tries argument before the unlimited-if-zero check at line 586. So null delegates to the worker layer, and unlimited only happens when that layer also resolves to 0. The queue:work command defaults to --tries=1 (safe). On Laravel Vapor, VaporWorkCommand defines --tries=0 (unlimited) in its command signature, but the QueueHandler runtime that invokes it passes $_ENV['SQS_TRIES'] ?? 3, so in practice the default is 3 unless explicitly overridden. This inconsistency between the command definition and the runtime invocation is confusing and not documented.

The interaction between job-level, command-level, and runtime-level tries configuration is complex. A global default in config/queue.php would provide a single, visible safety net.

Real-world: PR #29385 (Aug 2019) by @SjorsO -- changed the queue:work default from 0 to 1 in Laravel 6.0. The PR states: "Changing the default solves the problem of broken jobs getting stuck in an infinite loop when you forget to pass the queue worker a --tries flag." This partially addressed the issue but didn't unify the other layers: job-level $tries still defaults to null, VaporWorkCommand still defines --tries=0 in its signature, and SQS_TRIES is a separate runtime concern. Issue #58207 (Dec 2025) reported by @pingencom -- jobs retried endlessly with $tries=0. PR #59718 (Apr 2026, merged in Laravel 13.6.0) -- a developer hit the TINYINT 255-attempts limit on a unique job retrying every minute for a full day; the column was widened to SMALLINT. The schema fix addresses the symptom; the underlying gap (unbounded retries running to the column's range) remains.

The real gap is the scaffold. php artisan make:job generates from job.queued.stub. That template has been touched nine times since 2020 -- formatting, imports, PHP type declarations, ShouldBeUnique added and removed. Never $tries. Never $backoff. Never $timeout. Never failed(). Every queued job generated in every new Laravel app since 2020 ships with nothing.

Suggested fix: Add a default_tries option to config/queue.php that applies when neither the job class nor the command line specifies a value. And update job.queued.stub to include $tries, $backoff, and $timeout -- commented out if you want, just so the developer sees them.
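A sketch of what the proposed option could look like (hypothetical key -- it does not exist in the framework today):

// config/queue.php -- proposed addition, not an existing option
'default_tries' => env('QUEUE_DEFAULT_TRIES', 3),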


5. Race condition in maxExceptions counter initialization

File: src/Illuminate/Queue/Worker.php - markJobAsFailedIfWillExceedMaxExceptions()

The counter initialization uses a non-atomic get/put sequence. When multiple workers process retries of the same job UUID simultaneously, both workers can see a missing key and reset the counter, causing the exception count to be lost. An atomic Cache::add() call would prevent this race condition.

Real-world: Silent bug -- users see the symptom (jobs retrying forever) without understanding the cause. Compounds Issue #58207 (Dec 2025) reported by @pingencom.

Suggested fix: Replace the Cache::get() / Cache::put() initialization sequence with Cache::add(), which only sets the key if it doesn't already exist. The subsequent Cache::increment() is already atomic.
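A sketch of the change (illustrative key variable; Cache::add writes only when the key is absent, so two workers cannot both reset the counter):

// Before (racy): both workers can observe a missing key and reset it.
// if (is_null(Cache::get($key))) { Cache::put($key, 0, $ttl); }

// After (atomic): only one worker creates the key; the other's add() is a no-op.
Cache::add($key, 0, $ttl);
Cache::increment($key); // the increment itself is already atomic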


6. handleJobException releases when failure checks are bypassed

File: src/Illuminate/Queue/Worker.php - handleJobException()

The finally block in handleJobException releases the job back to the queue if it hasn't been deleted, released, or marked as failed. These three guards are correct and work well when maxTries or maxExceptions properly mark the job as failed. However, when those checks are bypassed (because maxTries is 0 or maxExceptions fails to increment due to OOM), the job is never marked as failed, and the release always proceeds. This amplifies the unlimited retry issue in those specific scenarios.

Real-world: PR #45876 (Jan 2023) by @khepin, closed without merging -- "jobs that fail because of high memory usage all stay on the queue and accumulate there. If enough of them have accumulated, the workers keep spinning on jobs that can never go through."

Suggested improvement: Resolving finding #1 (pre-fire maxExceptions check) would address this for jobs with maxExceptions set. For the broader case, consider logging a warning when a job is released back to the queue with a high attempt count and no maxTries or maxExceptions configured. This wouldn't change behaviour but would make the problem visible instead of silent.


Important: Unsafe defaults and resource exhaustion risks

These findings involve default values or missing limits that could cause problems under load, after outages, or with specific configuration combinations. They are less likely to cause immediate damage but represent gaps that production systems can hit.

7. Redis migrationBatchSize defaults to unlimited

File: src/Illuminate/Queue/RedisQueue.php

The migrationBatchSize defaults to -1, which means unlimited. The Lua script that migrates expired delayed and reserved jobs fetches all of them in a single call. If a large number of delayed jobs expire simultaneously (for example, after an outage or server restart), this single Lua operation can block Redis for all clients, consume significant memory in the Lua execution context, and potentially trigger the lua-time-limit threshold.

Real-world: PR #43310 (Jul 2022) by @AbiriAmir -- "scheduling a large number of jobs for a specific time causes Redis to halt since migrate script is a heavy script." This PR added the migrationBatchSize config but defaulted it to -1 (unlimited).

Suggested fix: Default migrationBatchSize to a bounded value (e.g. 1000).


8. ThrottlesExceptions retryAfterMinutes defaults to 0

File: src/Illuminate/Queue/Middleware/ThrottlesExceptions.php

When an exception triggers the throttle, the job is released with a delay of retryAfterMinutes * 60. The default for retryAfterMinutes is 0, meaning the job is immediately re-queued after an exception. Combined with a high retry count, this creates a tight failure loop similar to the 0-second backoff issue.

The middleware's handle() method catches the exception and calls $job->release($this->retryAfterMinutes * 60) directly, then returns. The exception never propagates out, so the worker's handleJobException never runs and calculateBackoff() never consults the job's $backoff property. A job-level $backoff is ignored on this path. Setting ->backoff($minutes) on the middleware construction is the only way to pace these retries (e.g. (new ThrottlesExceptions(1, 600))->backoff(10)).

Real-world: Issue #36637 (Mar 2021) reported by @tairau -- backoff docblock says seconds but value is used as minutes. Issue #56087 (Jun 2025) reported by @michaeldzjap -- ThrottlesExceptions overrides FailOnException, causing jobs to retry despite being told to fail.

Suggested fix: Default retryAfterMinutes to a small positive value (e.g. 5).


9. RateLimited middleware release loop

File: src/Illuminate/Queue/Middleware/RateLimited.php

When a job is rate-limited, it is released back to the queue. Each release counts as an attempt. In high-concurrency environments with many workers competing for the same rate limit, a job can be released and re-attempted many times without ever executing its actual logic. If the job has a high or unlimited retry count, this consumes worker capacity without doing useful work.

Real-world: Issue #53157 (Oct 2024) reported by @amir9480 -- RateLimiter perSecond not working as expected for queue jobs.

Suggested improvement: Consider not counting rate-limited releases as attempts, or providing an option to distinguish between "failed" and "deferred" releases.


10. Database queue reserved-but-expired can cause duplicate execution

File: src/Illuminate/Queue/DatabaseQueue.php - getNextAvailableJob() + isReservedButExpired()

Jobs where reserved_at is older than retry_after seconds are treated as available. The default retry_after is 90 seconds. If a job legitimately takes longer than 90 seconds to process, another worker can pick it up while it is still running. The same job executes concurrently in two workers.

Real-world: Issue #8577 (Apr 2015) reported by @m4tthumphrey -- multiple Redis workers picking up the same job. Issue #7046 (Jan 2015) reported by @easmith -- database queue deadlocks from concurrent execution.

Suggested improvement: Document this interaction clearly and consider a longer default retry_after, or add a mechanism for long-running jobs to extend their reservation.


11. retryUntil can override maxTries

File: src/Illuminate/Queue/Worker.php - markJobAsFailedIfAlreadyExceedsMaxAttempts()

If retryUntil() returns a future timestamp, the maxTries check is skipped entirely. A job with retryUntil() returning a far-future date (e.g. one year) will retry for the entire window regardless of how many times it has failed. Combined with a 0-second backoff, this is a sustained failure loop for the duration of the window.

I tested this. Job with retryUntil(10s) and no $tries, run against queue:work --tries=1: 275 retries in 3 seconds. The worker's --tries flag is ignored when retryUntil() is set. Setting job-level $tries alongside retryUntil() does not help either: the !$job->retryUntil() guard in Worker.php line 612 on 13.x short-circuits the attempt check whenever retryUntil() is set. Worker-level safety is no protection here, and neither is job-level $tries. When retryUntil() is set, the only bound is wall-clock time.

Real-world: Issue #35199 (Nov 2020) reported by @trevorgehman -- "Queue worker ignores job's maxTries setting if using retryUntil()." Closed by PR #35214 which unified the behaviour: if retryUntil is set, maxAttempts is ignored in every path. Taylor's position on the issue: "retryUntil and maxTries are sort of mutually exclusive. When using retryUntil I would use maxExceptions if you want to determine how many uncaught exceptions are allowed." Follow-up comments in 2022 and 2024 show users still hitting this and filing it as broken, which suggests the docs gap remains.

Suggested improvement: Surface this interaction prominently in the queues docs. A one-line note under $tries ("ignored when retryUntil() is set") and under retryUntil() ("pair with $maxExceptions, not $tries") would close the comprehension gap. The behaviour itself is deliberate and shipped in Laravel 8.x (November 2020), so changing it would break every app that depends on the current semantics.


12. Pipeline memory retention between jobs

File: src/Illuminate/Pipeline/Pipeline.php

The Pipeline retains references to $passable and $pipes after execution. In long-running queue workers, this means the previous job's data is held in memory until the next job overwrites it. While the retention is bounded to one job's worth of memory, it contributes to gradual memory growth in workers, increasing the likelihood of OOM events.

Real-world: Issue #56395 (Jul 2025, OPEN) reported by @momala454 -- job objects retain large data in memory after processing. A related issue I filed, Issue #59402, was closed once I verified that the $passable/$pipes cleanup in PR #59415 and PR #59330 makes the Job::$instance chain GC-eligible indirectly -- but those Pipeline PRs were themselves closed, so the underlying retention still ships in v13.4.0.

Suggested fix: Null $passable and $pipes after pipeline execution.


Improvement: Edge cases and documentation gaps

These findings are lower risk but represent real gaps that specific configurations can hit.

13. Reserved job migration ignores attempt count

File: src/Illuminate/Queue/RedisQueue.php - migrate()

When reserved jobs expire (worker died mid-processing), they are moved back to the ready queue without checking their attempt count. The attempt check only happens when the worker next pops and processes the job. This means a job that has already exceeded its max tries can be re-enqueued and picked up before being failed.

Real-world: Issue #32103 (Mar 2020) reported by @mfn -- job retried despite still running, reserved timeout expired mid-execution.

Suggested improvement: The pre-fire check via markJobAsFailedIfAlreadyExceedsMaxAttempts already handles this when the worker pops the job. The gap is the brief window where an over-limit job sits in the "available" queue before being popped. Checking attempts inside the Lua migration script would be expensive. A lighter approach: log or emit an event when a reserved job is migrated back, so monitoring tools can flag jobs that are cycling.


14. No timeout enforcement on Windows

File: src/Illuminate/Queue/Worker.php - registerTimeoutHandler()

The timeout handler relies on pcntl_alarm, which is only available on systems with the pcntl extension (Linux/Mac). On Windows, Laravel deliberately throws an error if you set a timeout, forcing you to pass --timeout 0 to acknowledge you're running without timeout protection. This is the right design choice (silently ignoring the timeout would be worse), but it means Windows workers have no timeout enforcement at all. A job that enters an infinite loop or deadlock will block the worker process forever.

Real-world: Issue #15002 (Aug 2016) reported by @StevenBock -- queues require explicit --timeout 0 on Windows. Issue #14909 (Aug 2016) reported by @ac1982 -- PHP requires --enable-pcntl for queue timeouts.

Suggested improvement: Add a fallback timeout mechanism for environments without pcntl support.


What the community found

The following issues were reported by other developers. I didn't find these in my audit -- they found them first. I'm including them here so everything is in one place.

Worker enters infinite silent loop on non-database exceptions

Issue #59517 (Apr 2026, OPEN) reported by @thuggins-engrain. stopWorkerIfLostConnection() only checks for database connection errors. If SQS SDK, Redis auth, or HTTP errors occur in getNextJob(), the worker catches the exception, sleeps 1 second, and retries forever with no exit condition. PR #59553 by @webpatser proposed --max-pop-exceptions to address this; closed by maintainers.

Batch deadlocks under high concurrency

Issue #39722 (Nov 2021) reported by @gm-lunatix. Issue #36478 (Mar 2021) reported by @murphatron. Issue #40574 (Jan 2022) reported by @walkonthemarz. The job_batches table uses SELECT FOR UPDATE, causing row-level lock contention with high-concurrency workers.

Batch never finishes when jobs fail

Issue #36180 (Feb 2021) reported by @stephenstack. Issue #35711 (Dec 2020) reported by @nalingia. When a batch job fails, pending_jobs may never reach 0, so then/finally callbacks never fire. Batch hangs forever. Closed as completed but no linked fix PR found.

ShouldBeUnique lock not released

Issue #49890 (Jan 2024) reported by @naquad -- lock not released when dependent model is deleted before processing. Closed with "this is a no-fix for us right now." Issue #37729 (Jun 2021) reported by @rflatt-reassured -- lock only releases after timeout, not after successful completion.

Failed jobs table causes OOM when retrying

Issue #49185 (Nov 2023) reported by @arharp. Issue #52129 (Jul 2024) reported by @godwin-loyaltek. RetryCommand and FailedJobProviderInterface::all() load the entire failed_jobs table into memory. At 300k+ records, it OOMs.

Chain jobs silently terminate on queue restart

Issue #45426 (Dec 2022) reported by @Monilsh. If queue:restart is issued mid-chain, remaining jobs are silently dropped. No error, no failed job record. Closed as "expected behavior."

Timed-out worker kill leaks resources

Issue #30351 (Oct 2019) reported by @halaei. Worker::kill() sends SIGKILL, which prevents cleanup of temp files, connections, and locks.

No backpressure on dispatch

PR #57787 (Nov 2025) by @yousefkadah -- community attempt to add queue depth notifications via a maxPendingJobs property. Not merged. There is no queue depth checking in the dispatch path. If the dispatch rate exceeds the consumption rate, the queue grows without limit.

Failed job providers crash on corrupted payload

Issue #59635 (Apr 2026, OPEN) reported by @ruttydm. UUID-based failed job providers use json_decode($payload, true)['uuid'] without null check. Corrupted payloads crash the provider and the failure record is permanently lost.


Current status

Across the 14 findings in this audit and the 9 community-reported issues above, the resolution status as of April 2026:

  • 3 partially addressed: migrationBatchSize config added but defaults to unlimited (#7, PR #43310), ThrottlesExceptions docblock corrected but default remains 0 (#8, PR #36642), Windows timeout workaround documented but no auto-detection (#15002)
  • 3 closed as intentional design decisions: ShouldBeUnique lock behaviour (#49890, "no-fix for us"), chain jobs dropped on restart (#45426, "expected behavior"), retryUntil/maxTries mutual exclusion (#35199, by design)
  • 4 currently open: OOM infinite retry (#58207), worker silent loop (#59517), corrupted payload crash (#59635), job memory not released (#56395)
  • 13 closed without framework code changes

Some of the closed issues are from older Laravel versions and may have been addressed indirectly through major version changes. Some reflect intentional design trade-offs that reasonable people can disagree on. I've included them because the safety implications exist regardless of whether the behaviour is intentional.

I verified all 14 findings and 9 community reports against v13.4.0 source, with spot-checks against v13.6.0 (April 2026). The oldest linked issues date back to January 2015 (#7046) and August 2016 (#15002). The same code patterns, same defaults, and same behaviours are still present.


What the docs and support could improve

Documentation

  • Add a "Queue Safety" section to the queue docs. The eight job-level properties (retry bounds, timeout, backoff, failed() handler, HTTP timeouts, idempotency guard, lock expiry, platform cost ceiling) are scattered across different sections. A single page showing them together with a recommended safe configuration would help.
  • Document the Vapor tries configuration. VaporWorkCommand defines --tries=0, but the QueueHandler runtime passes SQS_TRIES ?? 3. queue:work defaults to --tries=1. Three layers, three different values, none documented together.
  • Document the queue-timeout cost model on Vapor. On pay-per-millisecond Lambda billing, queue-timeout is the cost ceiling per failed attempt: memory-gb x timeout-seconds x $0.0000166667/GB-s. At 2048MB x 900s that is ~$0.03 per stuck attempt, multiplied by retry count and volume. The Vapor docs name queue-timeout as a configuration option but don't explain the cost model, and setting it high "to be safe" is the dominant shape of runaway bills.
  • Document the $timeout vs retry_after interaction. When a job's $timeout exceeds its connection's retry_after (default 90s on both database and Redis), the reservation can expire mid-flight and a second worker can pick up the same job. This is the single most-cited queue gotcha in community writeups and is not explained in the queue docs.
  • Document the retryUntil / maxTries mutual exclusion. When retryUntil() returns a future timestamp, maxTries is skipped entirely.
  • Document the maxExceptions OOM limitation. It only works for catchable exceptions. Fatal errors bypass the counter entirely.
  • Document the ShouldBeUnique lock lifecycle. $uniqueFor is unset by default, which on Redis and Database drivers produces a lock with no TTL that persists until manually released. Issue #49890 was closed as completed, but the safety implication of the default remains. Laravel 13.6.0 added #[DebounceFor] (PR #59507) as an attribute-driven alternative with last-writer-wins semantics; document it alongside the interface-based options.
  • Add a serverless queue checklist to the Vapor docs covering the eight job-level properties above, the queue-timeout cost model, AWS budget alerts, an SQS dead letter queue with maxReceiveCount, and a cost-based kill switch.
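
To make the ask concrete, here is a minimal sketch of what such a page could show, with all eight concerns on one class. The class name, model, endpoint, and idempotency check are hypothetical; the values are this post's recommendations, not framework defaults:

```php
<?php

namespace App\Jobs;

use App\Models\Invoice; // hypothetical model
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;
use Illuminate\Support\Facades\Http;

class SyncInvoice implements ShouldQueue, ShouldBeUnique
{
    use Queueable;

    public int $tries = 3;            // 1. explicit retry bound (count-based)
    public int $timeout = 60;         // 2. below the connection's retry_after (90)
    public array $backoff = [10, 60]; // 3. non-zero backoff between attempts
    public int $uniqueFor = 300;      // 7. explicit lock expiry, never a TTL-less lock

    public function __construct(public string $invoiceId) {}

    public function handle(): void
    {
        // 6. idempotency guard: queues redeliver, so skip already-synced work
        if (Invoice::where('external_id', $this->invoiceId)->whereNotNull('synced_at')->exists()) {
            return;
        }

        // 5. timeout on every HTTP call inside the job
        Http::timeout(10)
            ->connectTimeout(5)
            ->post('https://api.example.test/invoices', ['id' => $this->invoiceId]);
    }

    // 4. failures aren't silent
    public function failed(\Throwable $e): void
    {
        report($e);
    }

    // 8. the platform cost ceiling (tight queue-timeout, kill switch) lives in
    //    vapor.yml / worker flags rather than on the class.
}
```

The numbered comments map onto the eight-item checklist at the top of the post.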

Support

  • Offer a job review when customers report unexpected queue costs. Ask: "Can we review your jobs to see if they are producing long-running expensive invocations that will rack up an excessive bill?" Walk through them in order of cost impact: queue-timeout first (the dominant cost multiplier, commonly set high "to be safe"), HTTP timeouts inside job bodies second (a hung call pins the worker at the ceiling), and the eight-property checklist at the top of this post third. Note that Lambda is built for short bursty request-response work, not long-running queue jobs; a 15-minute queue-timeout on pay-per-millisecond billing inverts Lambda's value proposition. If the work is inherently slow or I/O-heavy, Lambda is the wrong shape for it.
  • Link to safe configuration examples in cost-related support replies -- the SafeJob base class and the apply-spec prompt at the top of this post are both drop-in references.
  • Consider proactive dashboard warnings for the cost-shape signals, not just one property. Useful alerts: a job's SQS ApproximateReceiveCount crossing N without a class-level $tries; queue-timeout multiplied by attempt count exceeding a per-job cost threshold; $timeout greater than the connection's retry_after. A single-property alert catches some incidents; the shape of the 2023 billing incident required several layers to align.

Disclaimer

  • This audit was AI-assisted using Claude Code against Laravel Framework v13.4.0 with later spot-checks against v13.6.0. Every finding includes the exact file path and method name so you can verify it yourself.
  • I have not tested every suggested fix in production. Some may have trade-offs or edge cases I haven't considered.
  • Some findings may have been addressed in ways I haven't identified. I have not used Laravel Vapor since 2023 and do not currently have an account, so Vapor-specific observations are based on the public vapor-core source code, not runtime testing. Things may have changed.
  • These are starting points, not finished PRs. Published in good faith for the benefit of the community.

There is a reasonable design philosophy where the framework intentionally leaves these guardrails to the developer. The tools exist ($tries, $maxExceptions, $backoff, $timeout) and it's the developer's responsibility to configure them. Changing defaults is a breaking change for anyone relying on current behaviour, and some of these suggestions would need careful migration paths.

Where I respectfully disagree is on the scaffold. Sidekiq ships with 25 retries and exponential backoff. Symfony Messenger, Go Asynq, and Google Cloud Tasks all bake in retry config by default. Laravel's make:job generates a class with nothing. The tools exist. The scaffold doesn't tell you to use them.

If any of these findings are inaccurate or have already been addressed, I'm happy to update this post with corrections.

Revisions -- 23 changes since publish
  • / Open the post with an eight-property safety checklist
    + The TL;DR, excerpt, and opening paragraph now lead with the eight things every queued job needs: explicit retry bounds, a timeout shorter than retry_after, non-zero backoff, a failed() handler, HTTP timeouts on every external call, an idempotency guard where the job writes external state, explicit lock expiry on ShouldBeUnique / WithoutOverlapping, and on serverless platforms a tight queue-timeout plus a hard cost kill switch. The scaffold gives you none of them.
    Notes

    The post now opens with the same checklist structure that the 23 findings below work through in detail, so a reader scanning the first paragraph has the full surface area without having to read to the recommendations section. Each item in the eight-property list has a corresponding finding later in the post.

  • / Add the 'Lambda is for short bursts, not long-running work' subsection
    + A new subsection after 'Whose fault is it?' compares the same 2023 mechanism (3 SQS redeliveries x up to 15 minutes x 2GB x pay-per-millisecond billing) across Lambda/Vapor ($218 actual), non-serverless environments like Forge or Ploi ($0 incremental, backed-up queue), Laravel Cloud with a capped queue worker (bounded to provisioned compute), Cloudflare Workers (CPU-time-gated at 10s free / up to 5 min paid, tied to the HTTP request lifecycle, with the caveat that a hung fetch() doesn't count against CPU time), Vercel Functions (10s default on Hobby, 60s on Pro, configurable up to 800s on Pro with Fluid Compute), and Lambda with a default 60s queue-timeout on the same code (~$15 instead of $218). The lesson: Lambda's pricing model rewards short bursts and punishes long-running work. For inherently quick background work (parsing inbound emails, saving attachments, rendering small PDFs, calling an API), Cloudflare Workers or Vercel Functions give you low default timeout caps and pay-per-request billing that keeps costs predictable. For longer or memory-heavy tasks, non-serverless environments absorb the mistake class entirely.
    Notes

    Written to address a specific framing gap: the bill shape is a Lambda pricing-model mismatch, not a Laravel queue defect. Platform limits fact-checked against Cloudflare Workers Limits docs and Vercel Functions Limitations docs.

  • / Rewrite 'What likely caused my billing incidents' with the 3 x 15 min cost math and retire the 'infinite retry loop' framing
    + The section now opens with an explicit statement that the 2023 incident was not a literal infinite retry loop, and shows the math. On Vapor, QueueHandler invokes vapor:work with --tries=SQS_TRIES ?? 3. Each problem email got up to 3 SQS redeliveries, each running to the full 15-minute queue-timeout ceiling because of uncapped file_get_contents() hanging on slow image origins. That's up to 45 minutes of billed 2048MB Lambda time per problem message, ~$0.09 per message, times ~350 problem emails per day over 7 days, arriving at $218.90 without anything retrying forever. The first 'What was a skill issue' bullet now leads with queue-timeout: 900 set 'to be safe' as the dominant cost multiplier, and the 'Whose fault is it?' wrap-up names the file_get_contents HTTP timeout added the next day (Aug 21) as the fix that actually turned off the money hose, dropping per-attempt cost by roughly 30x.
    Notes

    The primary evidence (Lambda pricing + vapor.yml git history) points to queue-timeout as the single biggest cost multiplier, not missing $tries. Surfacing the mechanism at source-level detail lets readers pattern-match their own incidents accurately. The OOM bug in Finding #1 is still a real framework bug documented in its own section; this revision names the 2023 mechanism specifically as a different path to the same outcome.

  • / Frame the non-serverless contrast as 'this shape of problem isn't framework-specific'
    + The lesson line in 'Lambda is for short bursts' reads 'This shape of problem isn't framework-specific; the same unsafe patterns cause problems on every platform.' The non-serverless alternative in the same subsection is 'Non-serverless environments (Forge, Ploi, Laravel Cloud, DigitalOcean droplet)' with 'possibly dropped jobs' as the symptom and 'No excessive bill' as the outcome. The adjacent 'Do we actually need seatbelts or guardrails?' paragraph carries the same framing: queue safety is a shared discipline across every platform, serverless just surfaces the cost quickly in dollars, while non-serverless hides it in slower throughput and dropped work.
    Notes

    'Framework-specific' avoids naming any one framework as the cause. 'Non-serverless' groups Laravel Cloud and similar platform-hosted options alongside Forge / Ploi rather than outside the category.

  • / Reframe retry policy as mutually exclusive; lower recommended $tries default from 5 to 3
    Recommendations section listed $tries and retryUntil() as if both could be set additively for belt-and-braces safety. The default recommended value was $tries = 5. Finding #11 implied that adding $tries alongside retryUntil() would provide a hard cap.
    + Recommendations section now frames the two as mutually exclusive policies (count-based vs time-based) with $maxExceptions as the shared hygiene item that works with both. The recommended default is $tries = 3, with '1 if you want to be strict and only run once' as the idempotency opt-out. Finding #11 now states explicitly that adding $tries does not help when retryUntil() is set, with a link to Worker.php line 612 on 13.x and framework PR #35214 as the source-level proof.
    Notes

    Framework PR #35214 (Mohamed Said, Nov 2020) added the `!$job->retryUntil()` guard in `Worker::markJobAsFailedIfWillExceedMaxAttempts` that causes the worker to ignore $tries whenever retryUntil() is set. Taylor's comment on issue #35199 is the official design intent: 'retryUntil and maxTries are sort of mutually exclusive; when using retryUntil I would use maxExceptions.' The 3 default matches both Laravel Vapor's effective runtime default (SQS_TRIES ?? 3) and the agent apply-spec.
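
The split this entry describes, sketched as two hypothetical job classes (values follow the post's recommendations):

```php
use Illuminate\Contracts\Queue\ShouldQueue;

// Count-based policy: $tries caps total attempts.
class CountBasedJob implements ShouldQueue
{
    public int $tries = 3;         // recommended default
    public int $maxExceptions = 3; // shared hygiene item, valid with either policy

    public function handle(): void {}
}

// Time-based policy: retryUntil() caps the retry window instead.
// The worker ignores $tries once retryUntil() is set (PR #35214),
// so $maxExceptions is the ceiling that still applies here.
class TimeBasedJob implements ShouldQueue
{
    public int $maxExceptions = 3;

    public function retryUntil(): \DateTimeInterface
    {
        return now()->addMinutes(10);
    }

    public function handle(): void {}
}
```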

  • / Strengthen Finding #8 and Recommendations: $backoff is bypassed when middleware catches exceptions
    The $backoff bullet implied that a job-level $backoff always applies. Finding #8 described the ThrottlesExceptions default of 0 but didn't show that the middleware's catch-and-release path bypasses the job's $backoff entirely.
    + The $backoff bullet now notes that middleware like ThrottlesExceptions and RateLimited bypasses the job-level $backoff; the correct fix there is ->backoff($minutes) chained on the middleware construction itself. Finding #8 now explains the handle() catch-block flow at source level.
    Notes

    ThrottlesExceptions middleware catches exceptions in its `handle()` and calls `$job->release($this->retryAfterMinutes * 60)` directly (default retryAfterMinutes is 0), rather than re-throwing. The exception never reaches the worker's handleJobException, so calculateBackoff never consults the job's $backoff property. Setting ->backoff($minutes) on the middleware construction is required. Verified live on laravel/framework 13.x branch.
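
What the middleware-level fix looks like in practice. The job is hypothetical, and the second constructor argument (the decay window) has changed units across framework versions, so treat the values as illustrative:

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;
use Illuminate\Queue\Middleware\ThrottlesExceptions;

class CallFlakyApi implements ShouldQueue
{
    use Queueable;

    public function middleware(): array
    {
        return [
            // The middleware's catch path calls $job->release() with its own
            // delay (default 0), so the job-level $backoff never applies here.
            // ->backoff() sets that release delay, in minutes, on the middleware.
            (new ThrottlesExceptions(10, 10 * 60))->backoff(5),
        ];
    }

    public function handle(): void
    {
        // ...
    }
}
```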

  • / Tighten ShouldBeUnique lock-leak scope; add ShouldBeUniqueUntilProcessing and #[DebounceFor] alternatives
    The ShouldBeUnique section said 'if the job fails or the worker crashes, the lock persists.' That lumped normal-exception failures together with worker-death, which is too broad. The section didn't mention ShouldBeUniqueUntilProcessing (documented release-early alternative) or #[DebounceFor] (Laravel 13.6 attribute-driven alternative).
    + The section now distinguishes the two paths. On a normal exception, CallQueuedHandler::failed() releases the lock via ensureUniqueJobLockIsReleased() at CallQueuedHandler.php line 334. The lock only leaks when failed() never runs: worker terminated mid-handle() (OOM, SIGKILL, container eviction), or the $deleteWhenMissingModels + missing-model path from Issue #49890. ShouldBeUniqueUntilProcessing is now called out as the release-early alternative (narrower window, less leak risk, allows concurrent dispatch after processing begins). Laravel 13.6.0's #[DebounceFor] attribute (PR #59507) is named as the last-writer-wins alternative for deduplicating a burst of dispatches.
    Notes

    Traced the release path in CallQueuedHandler (success path line 70, failure path line 334, both call ensureUniqueJobLockIsReleased). Confirmed the leak only occurs when failed() never runs. ShouldBeUniqueUntilProcessing surfaced by a sentence-by-sentence audit of laravel.com/docs/13.x/queues against the blog.
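
The two safer shapes this entry names, as hypothetical classes:

```php
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldBeUniqueUntilProcessing;
use Illuminate\Contracts\Queue\ShouldQueue;

// Explicit TTL: even if failed() never runs (OOM, SIGKILL, container
// eviction), the lock expires on its own after $uniqueFor seconds.
class ImportFeed implements ShouldQueue, ShouldBeUnique
{
    public int $uniqueFor = 3600;

    public function handle(): void {}
}

// Release-early alternative: the lock is dropped when processing begins,
// narrowing the leak window and allowing concurrent dispatch afterwards.
class RebuildIndex implements ShouldQueue, ShouldBeUniqueUntilProcessing
{
    public int $uniqueFor = 600;

    public function handle(): void {}
}
```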

  • / Cross-reference $timeout and retry_after; correct Redis retry_after default
    The $timeout bullet said 'Without this, jobs can run until Lambda's 15-minute ceiling' but didn't mention the retry_after interaction. Where retry_after was discussed, Redis was listed as 60s.
    + The bullet now notes: $timeout must be shorter than the connection's retry_after config value (default 90 on both database and Redis in a fresh Laravel scaffold); if $timeout exceeds retry_after, the same job can be picked up and executed twice. Cross-references Finding #10 for the duplicate-execution scenario. Agent apply-spec and verify steps corrected in line with the same 90/90 default.
    Notes

    Verified against laravel/laravel master config/queue.php: database=90, beanstalkd=90, redis=90. The '60 for Redis' value was incorrect in the TL;DR checklist and the agent apply-spec.
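
The relationship in code, using the scaffold's default retry_after and an illustrative job (the class name is hypothetical):

```php
// config/queue.php (fresh scaffold) sets 'retry_after' => 90 on both the
// database and redis connections: the reservation window in seconds.

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

class ResizeImage implements ShouldQueue
{
    use Queueable;

    // Keep the per-attempt ceiling inside the reservation window. If this
    // exceeded retry_after, the reservation could lapse mid-flight and a
    // second worker would pick up the same job.
    public int $timeout = 60; // < 90

    public function handle(): void
    {
        // ...
    }
}
```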

  • / Add null-delegation precision to the Finding #4 body
    + Finding #4 now includes one bold sentence in the body: 'Null does not itself mean unlimited: Worker.php line 578 replaces null with the worker's --tries argument before the unlimited-if-zero check at line 586.' So null delegates to the worker layer, and unlimited only happens when that layer also resolves to 0.
    Notes

    The heading still reads 'null (unlimited)' as a description of the common observed behaviour on Vapor with default settings, but the body now makes the two-step mechanism explicit so a reader on queue:work (which defaults to --tries=1) knows null is not directly unlimited there.
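
The two-step mechanism, paraphrased as a hypothetical helper (not verbatim Worker.php):

```php
// $job->maxTries() returns the job-level $tries (possibly null);
// $options->maxTries carries the worker's --tries flag.
function resolveMaxTries(object $job, object $options): ?int
{
    // Step 1: a null job-level $tries delegates to the worker flag.
    $maxTries = $job->maxTries() ?? $options->maxTries;

    // Step 2: only a resolved value of 0 means unlimited.
    return $maxTries === 0 ? null : $maxTries; // null = no cap enforced
}

// queue:work defaults --tries to 1, so null resolves to 1 there; on Vapor,
// VaporWorkCommand's --tries=0 signature is what turns null into unlimited.
```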

  • / Corrected three citations
    Finding #3 cited Issue #50330 (Mar 2024, @nickma42) as a supporting reference for the WithoutOverlapping middleware race. Finding #6 cited PR #45876 (Jan 2023, @khepin) as the real-world example without noting its merge state. Finding #12 listed Issue #59402 (Mar 2026) under 'Real-world' alongside the still-open #56395.
    + Finding #3 now notes Issue #50330 is about Illuminate\Console\Scheduling's withoutOverlapping() + onOneServer(), a different code path from the queue middleware, so it's not a supporting reference here. Finding #6 labels PR #45876 as closed without merging. Finding #12 describes Issue #59402 as closed on the expectation that the Pipeline cleanup in PR #59415 and PR #59330 would make the Job::$instance chain GC-eligible indirectly, and notes that the underlying retention still ships because those Pipeline PRs were themselves closed without merging.
    Notes

    Verified against current GitHub state via gh CLI: #50330 is a scheduler issue (verified from issue body), PR #45876 state is CLOSED with mergedAt null, #59402 state is CLOSED with stateReason NOT_PLANNED, authored by the reporter.

  • / Link PR #59553 as the community follow-up to Issue #59517
    + The community section for 'Worker enters infinite silent loop on non-database exceptions' (#59517) now links PR #59553 by @webpatser, which proposed --max-pop-exceptions to address the condition and was closed by maintainers.
    Notes

    Adds a community attempt so the no-fix-accepted framing rests on evidence rather than inference.

  • / Correct Issue #49890 close reason to completed
    Issue #49890 (ShouldBeUnique lock lifecycle) was cited as closed as 'no-fix for us'.
    + Issue #49890 is now cited as closed as completed, with the safety implication of the default (unset $uniqueFor produces a lock with no TTL on Redis and Database drivers) still flagged as the reason to set $uniqueFor explicitly.
    Notes

    GitHub API shows issue #49890 state_reason = completed (closed 2024-05-07 by driesvints, assigned to themsaid); the 'no-fix for us' quote was not accurate to the recorded close reason.

  • / Reorder Hard Limits by leverage; lead with queue-memory + queue-timeout cost formula
    + queue-memory + queue-timeout is now the first bullet in the 'Hard limits' list, with the explicit cost-ceiling formula memory-gb x timeout-seconds x $0.0000166667/GB-s. The list lead now says 'ordered by leverage'. A note on the queue:work bullet clarifies that SQS_TRIES=0 is the correct setting when running an SQS DLQ with maxReceiveCount as the retry ceiling (so Laravel releases the message back and the DLQ handles the cap at the infrastructure layer). Cost-based kill switch moved up to fourth alongside the other containment measures.
    Notes

    Same rationale as the skill-issue reorder: the biggest cost control on serverless queues is how long each failed attempt is allowed to burn. The DLQ note closes a gap: SQS_TRIES=0 is a legitimate config when DLQ maxReceiveCount is the retry ceiling, not universally a footgun.
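
The cost-ceiling formula as a worked example, using the figures quoted in this post (the Lambda GB-second rate and the 2023 incident's inputs):

```php
// Cost ceiling per failed attempt = memory-gb x timeout-seconds x rate.
$memoryGb       = 2048 / 1024;  // 2 GB provisioned
$timeoutSeconds = 900;          // queue-timeout set "to be safe"
$rate           = 0.0000166667; // $ per GB-second

$perAttempt = $memoryGb * $timeoutSeconds * $rate; // ~$0.03 per stuck attempt

// Multiply by the retry ceiling and volume to get the incident shape:
$incident = $perAttempt * 3 * 350 * 7; // 3 tries x ~350 msgs/day x 7 days ~= $220
```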

  • / Add scope-framing lead to the Recommendations property list
    + Inserted a one-line lead: 'five properties, set explicitly on every queued job. The caveats under each item explain scope and edge cases; they don't weaken the rule.'
    Notes

    Counter to the risk that four accumulated caveats (retryUntil exclusion, $backoff middleware bypass, platform-dependent $tries, ShouldBeUnique lock-leak scope) start reading as exceptions to the rule rather than scope-limiting notes. The lead frames them as scope, not exceptions.

  • / Add 'Show prompts for AI agents' button near the top of the post
    + A centered button near the top opens a native-dialog modal containing a paste-ready apply-spec. The spec is structured as a main goal plus nine Goal / Issue / Why / Fix blocks covering retry bounds, backoff, per-attempt timeout, exception ceiling that survives OOM, a failed() handler, ShouldBeUnique lock expiry, WithoutOverlapping middleware expireAfter, ThrottlesExceptions middleware backoff, and bounded external HTTP calls. A 'Copy all instructions' button in the modal writes the full spec to clipboard for pasting into Claude Code, Cursor, Windsurf, Copilot, or any similar coding agent. The spec text is authored inside a <pre class="agent-prompts-plaintext"> block so it also appears in the /articles/<slug>.md and /llms-full.txt endpoints for agents that fetch article content directly.
    Notes

    Paired with the audit prompt already at the top of the post: audit-first to evaluate your jobs, apply-spec second to fix what the audit reports. Structuring each recommendation as Goal / Issue / Why / Fix lets an agent parse the rationale per recommendation rather than receiving a flat bullet list.

  • / Refresh 'What the docs and support could improve' for the eight-property framing
    + Documentation subsection reframed around the eight concerns (retry bounds, timeout, backoff, failed() handler, HTTP timeouts, idempotency guard, lock expiry, platform cost ceiling) instead of just the four property names. New bullets added for the queue-timeout cost model on Vapor (explicit memory-gb x timeout-seconds x $0.0000166667/GB-s formula) and the $timeout vs retry_after interaction. ShouldBeUnique bullet now mentions #[DebounceFor] as the Laravel 13.6 alternative. Serverless checklist expanded to cover the eight concerns plus cost ceiling and kill switch. Support subsection now offers a job review framed in cost-impact order (queue-timeout first, HTTP timeouts second, eight-concern checklist third), with a note that Lambda is built for short bursty work, not long-running queues. Dashboard warning bullet broadened from single-property ($tries) to a multi-signal shape covering ApproximateReceiveCount, queue-timeout x attempt count, and $timeout > retry_after.
    Notes

    Docs and Support section now matches the post's own recalibrated narrative: the 2023 incident was a layered shape, not a single-property failure, so the advocacy asks should track the same multi-signal shape.

  • / Update version references to Laravel Framework v13.4.0 through v13.6.0
    + Main body: 'Verified against Laravel Framework v13.4.0 through v13.6.0.' How I found these: 'v13.4.0 original audit; spot-checks against v13.6.0.' Current Status: 'against v13.4.0 source, with spot-checks against v13.6.0 (April 2026).' Disclaimer: 'against Laravel Framework v13.4.0 with later spot-checks against v13.6.0.'
    Notes

    v13.6.0 shipped on 2026-04-21 with two queue-relevant additions (#[DebounceFor] via PR #59507, attempts column TINYINT->SMALLINT via PR #59718). The post was originally audited against v13.4.0; honest 'spot-check' framing avoids overclaiming a full re-audit.

  • / Correct Finding #4: PR #29385 was merged in Laravel 6.0, not rejected
    [PR #29385] (Aug 2019) by @SjorsO -- attempted to change the default from 0 to 1. Not merged. The PR states: "Changing the default solves the problem of broken jobs getting stuck in an infinite loop when you forget to pass the queue worker a --tries flag."
    + [PR #29385] (Aug 2019) by @SjorsO -- changed the `queue:work` default from 0 to 1 in Laravel 6.0. The PR states: "Changing the default solves the problem of broken jobs getting stuck in an infinite loop when you forget to pass the queue worker a --tries flag." This partially addressed the issue but didn't unify the other layers: job-level `$tries` still defaults to null, `VaporWorkCommand` still defines `--tries=0` in its signature, and `SQS_TRIES` is a separate runtime concern.
    Notes

    Verified via GitHub API: PR #29385 state is MERGED, title "[6.0] Change default job attempts from 0 to 1". The original "Not merged" framing was factually wrong and weakened the argument. The correction strengthens Finding #4: Laravel partially fixed this in 2019 at the `queue:work` layer, but the three-layer contradiction (job class, worker command, runtime) was never unified. The billing incidents referenced in the post happened on Vapor, which the 2019 fix didn't cover.

  • / Add scaffold-gap analysis and `job.queued.stub` fix to Finding #4
    + The real gap is the scaffold. `php artisan make:job` generates from `job.queued.stub`. That template has been touched nine times since 2020 -- formatting, imports, PHP type declarations, `ShouldBeUnique` added and removed. Never `$tries`. Never `$backoff`. Never `$timeout`. Never `failed()`. Every queued job generated in every new Laravel app since 2020 ships with nothing. (Suggested fix extended with: "And update `job.queued.stub` to include `$tries`, `$backoff`, and `$timeout` -- commented out if you want, just so the developer sees them.")
    Notes

    Direct complement to the PR #29385 correction. The scaffold gap is the actionable observation that the "Not merged" framing missed entirely. Stub commit history verified: 9 modifications since 2020, zero added retry properties. Every queued job generated in every new Laravel app since 2020 ships bare, which directly produced both billing incidents and the in-the-wild cases later in the post.
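
A sketch of what the extended stub could look like. This is the idea, not the actual `job.queued.stub` contents; the `{{ namespace }}` / `{{ class }}` placeholders follow Laravel's stub convention:

```php
<?php

namespace {{ namespace }};

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

class {{ class }} implements ShouldQueue
{
    use Queueable;

    // Safety properties -- commented out so the developer at least sees them:
    // public int $tries = 3;
    // public array $backoff = [10, 60];
    // public int $timeout = 60;

    public function __construct()
    {
        //
    }

    public function handle(): void
    {
        //
    }

    // public function failed(\Throwable $e): void
    // {
    //     report($e);
    // }
}
```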

  • / Tighten "Whose fault is it?" framing: scaffold is bare, guardrails missing at the layer you write
    Unsafe defaults -- `$tries` null, backoff 0, `WithoutOverlapping` lock infinite, `maxExceptions` bypassed by OOM -- meant my mistakes had no guardrails. Every other major queue system defaults to bounded retries.
    + Unsafe defaults -- `make:job` stub is bare, `$tries` null, backoff 0, `WithoutOverlapping` lock infinite, `maxExceptions` bypassed by OOM -- meant my mistakes had no guardrails at the layer I actually wrote.
    Notes

    The "every other major queue system" claim was too broad. BullMQ defaults to 0 attempts (fail immediately). Celery requires explicit self.retry(). The revised framing drops the overreaching comparison and names the real problem: the scaffold developers see every day ships without any safety properties.

  • / Replace broad "Laravel does the opposite" line with concrete scaffold comparisons
    Where I respectfully disagree is on the defaults. Every other major queue system I compared (Sidekiq, BullMQ, Celery, Google Cloud Tasks) defaults to bounded retries and requires developers to opt in to unlimited behaviour. Laravel does the opposite.
    + Where I respectfully disagree is on the scaffold. Sidekiq ships with 25 retries and exponential backoff. Symfony Messenger, Go Asynq, and Google Cloud Tasks all bake in retry config by default. Laravel's `make:job` generates a class with nothing. The tools exist. The scaffold doesn't tell you to use them.
    Notes

    Replaces an overgeneralised claim with specific, verifiable scaffold comparisons. Narrows the disagreement to the scaffold layer, which is the actual pain point, rather than making a broad claim about "defaults" that doesn't survive scrutiny for every queue system.

  • / Add tested evidence for retryUntil + no $tries interaction (Finding #6)
    + I tested this. Job with `retryUntil(10s)` and no `$tries`, run against `queue:work --tries=1`: 275 retries in 3 seconds. The worker's `--tries` flag is ignored when `retryUntil()` is set. Worker-level safety is no protection if the job uses `retryUntil()` without also setting `$tries`.
    Notes

    Moves Finding #6 from "this could happen" to "I ran this and 275 retries fired in 3 seconds." Concrete reproduction strengthens the claim and gives readers a clear signal: worker-level `--tries` does not save you when `retryUntil()` is in play.
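
A sketch of that reproduction (the job name and exception are illustrative):

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

class AlwaysFails implements ShouldQueue
{
    use Queueable;

    // Deliberately no $tries: once retryUntil() is set, the worker's
    // --tries flag is ignored.

    public function retryUntil(): \DateTimeInterface
    {
        return now()->addSeconds(10); // a retry window, not an attempt cap
    }

    public function handle(): void
    {
        throw new \RuntimeException('fail'); // released and retried until the window closes
    }
}

// php artisan queue:work --tries=1
// -> hundreds of attempts inside the 10-second window, as measured above.
```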

  • / Clarify that `$tries = 0` means infinite retries in Laravel, not zero
    `$tries = 5` -- hard cap on total attempts. Don't rely on platform defaults -- set this explicitly on every job.
    + `$tries = 5` -- hard cap on total attempts. Don't rely on platform defaults -- set this explicitly on every job. And don't use `$tries = 0` -- in Laravel that means infinite retries, not zero. Use a positive integer.
    Notes

    Readers coming from other queue systems (where 0 commonly means "do not retry") fall into this trap. Making the non-intuitive Laravel behaviour explicit inline with the recommended code prevents the misread.
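
One way to make part of the trap structurally impossible, assuming a typed property fits your codebase (the class is hypothetical):

```php
use Illuminate\Contracts\Queue\ShouldQueue;

class ExampleJob implements ShouldQueue
{
    // int (not ?int) rules out the null-delegation path at the type level.
    // 0 still type-checks, so "0 means unlimited" needs review to catch --
    // always use a positive integer.
    public int $tries = 5;

    public function handle(): void {}
}
```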