Timeout for Sandboxed processors
A pattern for applying time-to-live to sandboxed processors.
When working with sandboxed processors, every job runs in a separate process. This enables the implementation of a time-to-live (TTL) mechanism that terminates the process if it fails to complete within a reasonable time.
It is important to understand that killing a process can have unintended consequences. For instance, the process could be terminated in the middle of a write to a file, likely leaving that file corrupted. However, this is often the best approach available in a runtime like Node.js, which relies on asynchronous calls within an event loop. There is currently no known method to achieve this functionality more robustly.
This pattern strives to be as safe as possible, but please keep in mind the trade-offs mentioned above. The pattern uses two timeouts to allow for a cleanup operation to minimize the effects of a hard process termination. However, if the cleanup itself hangs or is incorrectly implemented, it may still result in terminated database connections or interrupted write operations, with potentially negative outcomes.
// The processor is hard-killed if it has not finished within 30 seconds.
const MAX_TTL = 30_000;

// The cleanup operation gets the last 5 seconds before the hard kill.
const CLEANUP_TTL = 5_000;

// We use a custom exit code to mark the TTL, but any would do in practice
// as long as it is < 256 (exit codes are limited to 8 bits on Unix).
const TTL_EXIT_CODE = 10;

module.exports = async function (job) {
  let hasCompleted = false;

  // Hard kill: terminate the process once MAX_TTL has elapsed.
  const hardKillTimeout = setTimeout(() => {
    if (!hasCompleted) {
      process.exit(TTL_EXIT_CODE);
    }
  }, MAX_TTL);

  // Soft kill: start the cleanup CLEANUP_TTL before the hard kill fires.
  const softKillTimeout = setTimeout(async () => {
    if (!hasCompleted) {
      await doCleanup(job);
    }
  }, MAX_TTL - CLEANUP_TTL);

  try {
    // If doAsyncWork is CPU intensive and blocks the Node.js event loop
    // forever, the timeouts will never be triggered either.
    await doAsyncWork(job);
    hasCompleted = true;
  } finally {
    // Important to clear the timeouts before returning as this process will be reused.
    clearTimeout(hardKillTimeout);
    clearTimeout(softKillTimeout);
  }
};

const doAsyncWork = async job => {
  // Simulate a long running operation.
  await new Promise(resolve => setTimeout(resolve, 10000));
};

const doCleanup = async job => {
  // Simulate a cleanup operation.
  await job.updateProgress(50);
};
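Because the processor ends itself with process.exit(TTL_EXIT_CODE), whatever supervises that process can tell a TTL kill apart from other failures by inspecting the exit code. The following is a minimal sketch of that idea using only Node's built-in child_process module. The child script and the runChild helper are hypothetical stand-ins for illustration; this is not a description of how the queue library itself reports the failure.

```javascript
const { spawnSync } = require('node:child_process');

// Same value as TTL_EXIT_CODE in the processor (assumed for this sketch).
const TTL_EXIT_CODE = 10;

// Stand-in child process: it starts "work" that would take 10 seconds,
// but a hard-kill timer ends the process after 100 ms.
const childScript = `
  setTimeout(() => process.exit(${TTL_EXIT_CODE}), 100);
  setTimeout(() => {}, 10000);
`;

// Hypothetical helper: run the stand-in child with the current Node.js binary
// and wait for it to exit.
function runChild() {
  return spawnSync(process.execPath, ['-e', childScript]);
}

const result = runChild();

// `status` holds the exit code of the child process.
const killedByTtl = result.status === TTL_EXIT_CODE;
console.log(`exit code: ${result.status}, killed by TTL: ${killedByTtl}`);
```

The parent process is unaffected by the child's termination, which is what makes a hard kill feasible for sandboxed processors in the first place.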
There are some very important points to consider with this pattern.
If the processor hangs in an infinite loop that never lets the Node.js event loop run, the TTL timeouts will never fire.
We keep a hasCompleted flag to cover the edge case where the async work completes at the same moment the timeout is triggered.
When using this pattern, it is very useful to put debug logs in strategic places to see where the job actually gets stuck when it is killed because the TTL was exceeded.
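The first point is easy to verify in isolation. The sketch below schedules a 10 ms timer and then blocks the event loop with a synchronous busy-wait; the timer callback can only run once the blocking code has returned. The helper names blockFor and measureTimerDelay are made up for this illustration. With a loop that never ends, the callback (and therefore the TTL timeouts in the processor) would never run at all.

```javascript
// Busy-wait for `ms` milliseconds without yielding to the event loop.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Synchronous spin: no timer or I/O callback can run in the meantime.
  }
}

// Schedule a timer, block the event loop, and resolve with the time that
// actually elapsed before the timer callback got a chance to run.
function measureTimerDelay(timerMs, blockMs) {
  return new Promise(resolve => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start), timerMs);
    blockFor(blockMs);
  });
}

measureTimerDelay(10, 300).then(elapsed => {
  console.log(`10 ms timer fired after ${elapsed} ms`);
});
```

The reported delay is at least as long as the blocking period rather than the requested 10 ms, which is why CPU-bound work inside the processor should be split up so that it yields to the event loop regularly.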