Timeout for Sandboxed processors

A pattern for applying time-to-live to sandboxed processors.

When you are working with sandboxed processors, every job is run in a separate process. This opens up the possibility of implementing a time-to-live (TTL) mechanism that kills the process if it has not completed within a reasonable time.
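
For reference, a processor becomes sandboxed by passing a file path instead of a function to the Worker constructor. A minimal sketch, assuming the processor shown further below is saved as processor.js and that Redis runs locally (the queue name and connection details are illustrative):

const { Worker } = require('bullmq');
const path = require('path');

// Passing a file path (instead of a function) makes BullMQ run every job
// in a separate child process, which is what allows the process to be killed.
const worker = new Worker('my-queue', path.join(__dirname, 'processor.js'), {
  connection: { host: 'localhost', port: 6379 }, // illustrative connection
});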

It is important to understand that killing a process can have unintended consequences; for instance, it could be killed in the middle of writing to a file, which would most likely leave that file corrupted. However, this is about the best that can be achieved in a runtime like NodeJS, which is based on asynchronous calls within an event loop. There is currently no known way to achieve this functionality more robustly.

This pattern tries to be as safe as possible, but please keep in mind the trade-offs mentioned above. It uses two timeouts, so that a cleanup operation can run before the hard kill and minimize its effects. Obviously, if the cleanup itself hangs, or if it is not correctly implemented, we can still end up killing database connections or interrupting write operations midway, with the potential negative outcomes.

// This processor will time out after 30 seconds.
const MAX_TTL = 30_000;

// Before the hard kill, the processor gets a 5 second window to run a cleanup operation.
const CLEANUP_TTL = 5_000;

// We use a custom exit code to mark the TTL kill, but any value would do in
// practice as long as it is < 256 (due to the Unix limitation of 8 bits per exit code).
const TTL_EXIT_CODE = 10;

module.exports = async function (job) {
  let hasCompleted = false;
  const hardKillTimeout = setTimeout(() => {
    if (!hasCompleted) {
      process.exit(TTL_EXIT_CODE);
    }
  }, MAX_TTL);

  // Schedule the cleanup so that it fires CLEANUP_TTL milliseconds before
  // the hard kill, giving it a chance to finish in time.
  const softKillTimeout = setTimeout(async () => {
    if (!hasCompleted) {
      await doCleanup(job);
    }
  }, MAX_TTL - CLEANUP_TTL);

  try {
    // If doAsyncWork is CPU intensive and blocks the NodeJS event loop forever,
    // the timeouts will never be triggered either.
    await doAsyncWork(job);
    hasCompleted = true;
  } finally {
    // Important to clear the timeouts before returning as this process will be reused.
    clearTimeout(hardKillTimeout);
    clearTimeout(softKillTimeout);
  }
};

const doAsyncWork = async job => {
  // Simulate a long running operation.
  await new Promise(resolve => setTimeout(resolve, 10000));
};

const doCleanup = async job => {
  // Simulate a cleanup operation.
  await job.updateProgress(50);
};
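
If the hard kill fires, the child process exits and BullMQ fails the job. A hedged sketch of observing this from the worker created above, using the standard failed event (the exact error message reported for an abnormal exit may vary between BullMQ versions):

worker.on('failed', (job, err) => {
  // A sandboxed process that exits before completing its job fails it with
  // an error describing the unexpected exit; inspect it to spot TTL kills.
  console.log(`Job ${job?.id} failed: ${err.message}`);
});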

There are some very important points to consider with this pattern.

  • If the processor hangs because of an infinite loop that never yields to the NodeJS event loop, the TTL timeouts will never fire.

  • We keep a hasCompleted flag to cover the edge case where the async work completes at the same moment the timeout is triggered.

  • When using this pattern it is very useful to put debug logs in strategic places, so that you can see where the job actually got stuck when it is killed for exceeding its TTL. See the sketch after this list.
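
For example, job.log writes log entries to Redis, so they survive the process being killed and can be inspected afterwards. An instrumented variant of doAsyncWork might look like this (step1 and step2 are hypothetical placeholders):

const doAsyncWork = async job => {
  // step1 and step2 are hypothetical placeholders for real units of work.
  await job.log('starting step 1');
  await step1();
  await job.log('starting step 2');
  await step2();
};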
