Long Running Celery Tasks with Zero Downtime Updates
Lazy_Economy_6851 discusses challenges in redeploying an app with Celery-backed long-running jobs, aiming for zero downtime. The post explores risks to running tasks and seeks peer advice on deployment strategies.
Summary
Lazy_Economy_6851 describes a DevOps challenge involving an application that processes user-submitted “validation tasks.” The backend leverages Celery (for distributed task management), Redis (as a broker), and MySQL (to track state), supporting jobs that can last up to an hour. This architecture is containerized using Docker and orchestrated through Coolify, which offers built-in blue-green deployment features.
The core concern is deploying updates or upgrades to the Celery worker service without any downtime or interruption to in-flight tasks. The author specifically does not want any currently running jobs to be lost or restarted during a deployment or environment switch.
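To ground the discussion, here is a minimal sketch of how such a stack is commonly wired together. The module name, connection URLs, and credentials are assumptions rather than details from the post; the settings shown are the ones most relevant to keeping hour-long jobs safe across worker restarts.

```python
# celery_app.py -- illustrative configuration only; names and URLs are assumed.
from celery import Celery

app = Celery(
    "validator",
    broker="redis://redis:6379/0",               # Redis as the message broker
    backend="db+mysql://user:pass@mysql/appdb",  # MySQL (via SQLAlchemy) for result/state tracking
)

app.conf.update(
    task_acks_late=True,               # acknowledge only after a task finishes,
                                       # so an interrupted task can be redelivered
    task_reject_on_worker_lost=True,   # requeue tasks whose worker process died
    worker_prefetch_multiplier=1,      # each process reserves one long job at a time
    broker_transport_options={"visibility_timeout": 2 * 3600},  # must exceed the longest task (~1 h)
)
```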
Deployment Environment
- Task runner: Celery
- Broker: Redis
- Tracking: MySQL
- Orchestration: Docker (via Coolify)
- Deployment Methods Considered: Blue-green or environment switching
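As a small illustration of the "Tracking: MySQL" piece, here is a hypothetical validation task that records its progress so an interrupted or redelivered job can be detected and retried. The task name and `mark_status` helper are placeholders, not the author's code.

```python
# tasks.py -- hypothetical task shape; mark_status() stands in for a real MySQL write.
from celery_app import app  # the app sketched above

def mark_status(submission_id: int, status: str) -> None:
    """Placeholder for an UPDATE against the MySQL tracking table."""
    print(f"submission {submission_id}: {status}")

@app.task(bind=True, acks_late=True, max_retries=3)
def run_validation(self, submission_id: int) -> None:
    mark_status(submission_id, "running")
    try:
        # ... the long-running validation work (up to an hour) goes here ...
        mark_status(submission_id, "done")
    except Exception as exc:
        mark_status(submission_id, "failed")
        raise self.retry(exc=exc, countdown=60)  # retry transient failures after a delay
```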
Key Challenges
- Long-lived tasks risk disruption: Stopping or redeploying Celery workers without proper handling can terminate or lose running tasks.
- Blue-green deployments: While these reduce downtime for API/web layers, they may not safeguard long-running background jobs.
- Desire for zero downtime: The target is seamless upgrades and no job loss.
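On the first point, Celery's default behaviour already helps: a `SIGTERM` triggers a warm shutdown, in which the worker stops accepting new tasks and exits only after in-flight ones finish. A minimal sketch, assuming the app module above, that simply makes that draining phase visible in the logs:

```python
# shutdown_visibility.py -- log when a worker starts draining. 'how' is "Warm" for
# SIGTERM (finish current tasks first) and "Cold" for SIGQUIT (stop immediately).
from celery.signals import worker_shutting_down

@worker_shutting_down.connect
def log_drain(sig=None, how=None, exitcode=None, **kwargs):
    print(f"Worker shutting down: signal={sig}, mode={how}, exitcode={exitcode}")
```

Note that `docker stop` only waits ten seconds by default before escalating to `SIGKILL`, so the container stop timeout has to be raised well beyond the longest expected task for the warm shutdown to matter.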
Sought Advice
- Strategies or real-world solutions for redeploying Celery workers that do not disrupt active jobs.
- Approaches for orchestrating rolling upgrades, graceful worker draining, or task handoff within Coolify or Docker setups.
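One way to implement the draining/handoff step outside of Coolify itself is to poll Celery's inspection API and only stop an old worker's container once it reports no active tasks. A sketch under that assumption; the worker node name is illustrative:

```python
# drain_wait.py -- block until a named worker reports no active tasks, after which
# it is safe to stop its container. Names like "celery@old-worker-1" are illustrative.
import time

from celery_app import app  # the app sketched above

def wait_until_drained(worker: str, poll_seconds: int = 30) -> None:
    while True:
        replies = app.control.inspect([worker]).active() or {}
        if not replies.get(worker):   # no reply or an empty task list: drained or already gone
            return
        print(f"{worker} still running {len(replies[worker])} task(s); waiting...")
        time.sleep(poll_seconds)
```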
Potential Next Steps
The author is aware of general deployment concepts (like blue-green deployments) but is looking for nuanced, real-world experience and best practices specific to Celery and similar stacks.
Ideas Often Considered for Similar Scenarios
- Worker Draining: Let Celery workers finish their current tasks before shutting down, e.g. by sending `SIGTERM` so the worker performs a warm shutdown and only exits once in-flight jobs complete.
- Rolling Deployments: Upgrade workers one at a time so no job is left unhandled.
- Sticky Queues: Use persistent state tracking in MySQL/Redis so failed or incomplete jobs can be resumed or retried.
- Instance Overlap: Spin up new worker containers before decommissioning old ones (sketched below).
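Putting the overlap idea together: start the new workers, confirm they respond, and only then ask the old ones to shut down. Celery's remote `shutdown` command performs a warm shutdown, so the old workers exit after finishing their current jobs. A sketch with illustrative node names:

```python
# overlap_cutover.py -- graceful cutover from old worker containers to new ones.
from celery_app import app  # the app sketched above

OLD = ["celery@old-worker-1", "celery@old-worker-2"]  # illustrative node names
NEW = ["celery@new-worker-1", "celery@new-worker-2"]

def cutover() -> None:
    # 1. Confirm the new workers are online and consuming before touching the old ones.
    replies = app.control.ping(destination=NEW, timeout=5) or []
    if len(replies) != len(NEW):
        raise RuntimeError("new workers are not all online yet")

    # 2. Ask only the old workers to shut down; they finish in-flight tasks first.
    app.control.shutdown(destination=OLD)
```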
Conclusion
This community post is a request for input on zero-downtime deployment strategies tailored to Celery-based long-running job systems, sharing pain points and looking for practical guidance from peers who have solved similar issues.
This post appeared first on “Reddit DevOps”.