Lazy_Economy_6851 discusses challenges in redeploying an app with Celery-backed long-running jobs, aiming for zero downtime. The post explores risks to running tasks and seeks peer advice on deployment strategies.

Summary

Lazy_Economy_6851 describes a DevOps challenge involving an application that processes user-submitted “validation tasks.” The backend leverages Celery (for distributed task management), Redis (as a broker), and MySQL (to track state), supporting jobs that can last up to an hour. This architecture is containerized using Docker and orchestrated through Coolify, which offers built-in blue-green deployment features.
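
For orientation, a minimal sketch of the kind of task described above; the task name, helper, and broker URL are illustrative assumptions, not details from the post:

    from celery import Celery

    app = Celery("validator", broker="redis://redis:6379/0")

    def set_status(submission_id, status):
        # Hypothetical helper: persist task state to MySQL so the web
        # layer can report progress independently of the worker.
        ...

    @app.task
    def validate_submission(submission_id):
        set_status(submission_id, "running")
        # ... validation work that may run for up to an hour ...
        set_status(submission_id, "done")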

The core concern is deploying updates or upgrades to the Celery worker service without any downtime or interruption to in-flight tasks. The author specifically does not want any currently running jobs to be lost or restarted during a deployment or environment switch.

Deployment Environment

  • Task runner: Celery
  • Broker: Redis
  • Tracking: MySQL
  • Orchestration: Docker (via Coolify)
  • Deployment Methods Considered: Blue-green or environment switching

Key Challenges

  • Long-lived tasks risk disruption: Stopping or redeploying Celery workers without proper handling can kill running tasks or lose their messages (a mitigating config sketch follows this list).
  • Blue-green deployments: While these reduce downtime for API/web layers, they may not safeguard long-running background jobs.
  • Desire for zero downtime: The target is seamless upgrades and no job loss.
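
One common mitigation for the task-loss risk above, sketched under the assumption of a Celery app named app: with late acknowledgement the broker only discards a task's message after the task finishes, so a worker killed mid-task leaves the message behind for redelivery. The Redis visibility timeout must comfortably exceed the longest task, or redelivery fires while the task is still running:

    app.conf.update(
        task_acks_late=True,              # ack only after the task finishes
        task_reject_on_worker_lost=True,  # requeue if the worker process dies
        worker_prefetch_multiplier=1,     # don't reserve tasks a dying worker can't run
        # Redis broker: unacked messages are redelivered after this many seconds.
        broker_transport_options={"visibility_timeout": 2 * 3600},
    )

Redelivery means at-least-once execution, so tasks must be idempotent or guarded by the state record in MySQL.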

Sought Advice

  • Strategies or real-world solutions for redeploying Celery workers that do not disrupt active jobs.
  • Approaches to orchestrating rolling upgrades, worker draining, or graceful handoff within Coolify or Docker setups (a drain sketch follows this list).
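
As one concrete shape for the drain-and-handoff idea, a hedged sketch using Celery's remote-control API (queue and worker names are assumptions): stop the old workers from consuming, wait for their in-flight tasks to finish, then shut them down while the new containers take over fresh work.

    import time

    def drain_and_stop(app, old_workers, queue="celery"):
        # Old workers stop picking up new tasks; the new containers,
        # already running, keep consuming the same queue.
        app.control.cancel_consumer(queue, destination=old_workers)

        # Poll until the old workers report no active tasks.
        while True:
            active = app.control.inspect(destination=old_workers).active() or {}
            if not any(active.values()):
                break
            time.sleep(30)

        # Warm shutdown, equivalent to sending SIGTERM.
        app.control.shutdown(destination=old_workers)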

Potential Next Steps

The author is aware of general deployment concepts (like blue-green deployments) but is looking for practical, hard-won experience and best practices specific to Celery-based stacks like this one.

Ideas Often Considered for Similar Scenarios

  • Worker Draining: Send SIGTERM to trigger Celery's warm shutdown, in which the worker stops accepting new tasks and exits only after in-flight ones complete. Note that Docker's default stop timeout is 10 seconds before it escalates to SIGKILL, far too short for hour-long jobs, so the stop grace period must be raised to outlast the longest task (see the signal hook after this list).
  • Rolling Deployments: Upgrade workers one at a time so no job is left unhandled.
  • Durable State and Retries: Persist job state in MySQL/Redis so failed or incomplete jobs can be detected and resumed or retried after a worker is lost.
  • Instance Overlap: Spin up new worker containers before decommissioning old ones.
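
To make the draining bullet concrete: SIGTERM triggers Celery's warm shutdown (finish in-flight tasks, then exit), while SIGQUIT forces a cold one. A small hook like the following, an assumption for deploy visibility rather than anything from the post, logs when the drain begins so tooling can watch for it:

    import logging
    from celery.signals import worker_shutting_down

    logger = logging.getLogger("deploy")

    @worker_shutting_down.connect
    def on_shutdown(sig=None, how=None, exitcode=None, **kwargs):
        # how == "Warm" means in-flight tasks will be allowed to finish.
        logger.info("worker shutting down: signal=%s mode=%s", sig, how)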

Conclusion

This community post is a request for input on zero-downtime deployment strategies tailored to Celery-based long-running job systems, sharing pain points and looking for practical guidance from peers who have solved similar issues.

This post appeared first on “Reddit DevOps”.