Building a Production-Ready Distributed Task Queue System

Tags: Go, Redis, PostgreSQL, React, System Design, Microservices

Note: I built a distributed task queue system from scratch using Go, Redis, and PostgreSQL that handles 10,000+ jobs/second with real-time monitoring. This post covers the architecture decisions, challenges faced, and lessons learned.

The Problem

During my work on a previous project, I encountered a critical bottleneck: synchronous processing of heavy tasks was blocking API responses. Users had to wait 30+ seconds for image uploads, email sends, and report generation. This led to:

  • ❌ Poor user experience (timeouts, slow responses)
  • ❌ Wasted server resources (blocked connections)
  • ❌ Difficult scaling (vertical scaling only)
  • ❌ No visibility into failed tasks

I needed a solution that could:

  • ✅ Process tasks asynchronously in the background
  • ✅ Handle millions of jobs with different priorities
  • ✅ Automatically retry failed jobs
  • ✅ Scale horizontally (add more workers)
  • ✅ Provide real-time monitoring

Why not use existing solutions? While tools like Celery (Python) and BullMQ (Node.js) exist, I wanted to:

  1. Build something in Go for superior performance
  2. Understand the internals of distributed systems
  3. Have full control over the architecture
  4. Create a portfolio piece demonstrating system design skills

System Architecture

High-Level Overview

I designed the system with separation of concerns in mind:

    ┌─────────────┐
    │  API Server │ ← HTTP requests from clients
    └──────┬──────┘
           │
           ├─→ PostgreSQL (persist jobs)
           └─→ Redis (enqueue job IDs)
                │
        ┌──────▼──────┐
        │ Worker Pool │ ← Polls Redis, processes jobs
        └──────┬──────┘
               │
               ├─→ Execute job handlers
               ├─→ Update job status in DB
               └─→ Broadcast events via Redis Pub/Sub
                    │
             ┌──────▼──────┐
             │  WebSocket  │ ← Real-time updates to dashboard
             └─────────────┘
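The write path in the diagram (persist first, then enqueue only the job ID) can be sketched as follows. This is a minimal sketch, not the project's actual code: `JobStore`, `JobQueue`, `Submit`, and the in-memory fakes are hypothetical names standing in for the PostgreSQL and Redis layers.

```go
package main

import (
    "errors"
    "fmt"
)

// JobStore and JobQueue are hypothetical interfaces standing in for the
// PostgreSQL and Redis layers; the real project's types may differ.
type JobStore interface {
    Save(jobID, payload string) error
}

type JobQueue interface {
    Enqueue(priority, jobID string) error
}

// Submit persists the job first, then enqueues only its ID.
// If enqueueing fails, the job row still exists and can be re-enqueued later.
func Submit(store JobStore, queue JobQueue, jobID, priority, payload string) error {
    if err := store.Save(jobID, payload); err != nil {
        return fmt.Errorf("persist job %s: %w", jobID, err)
    }
    if err := queue.Enqueue(priority, jobID); err != nil {
        return fmt.Errorf("enqueue job %s: %w", jobID, err)
    }
    return nil
}

// In-memory fakes so the sketch runs without external services.
type memStore map[string]string

func (m memStore) Save(jobID, payload string) error {
    if jobID == "" {
        return errors.New("empty job ID")
    }
    m[jobID] = payload
    return nil
}

type memQueue map[string][]string

func (q memQueue) Enqueue(priority, jobID string) error {
    q[priority] = append(q[priority], jobID)
    return nil
}
```

Persisting before enqueueing means a worker that pops an ID can always find the matching row, at the cost of occasionally having a persisted job that was never queued.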

Key Design Decisions

1. Go for Performance

  • Goroutines are multiplexed onto OS threads across all CPU cores, so work runs in parallel, not just concurrently
  • Low memory footprint (~50MB for API, ~100MB per worker instance)
  • Built-in concurrency primitives (channels, mutexes)

2. Redis as Queue + Pub/Sub

  • Queue: Fast FIFO with LPUSH / BRPOP (O(1) operations)
  • Pub/Sub: Real-time event broadcasting across processes
  • Locks: Distributed locking via SET NX EX for job deduplication

3. PostgreSQL for Durability

  • Persistent job history and results
  • Complex queries (filtering, analytics)
  • ACID guarantees for critical data

4. Separate API & Worker Processes

  • API servers scale independently from workers
  • Workers can run on different machines (horizontal scaling)
  • Each component restarts without affecting the other

Implementation Deep Dive

1. Priority Queue System

Jobs are categorized into 3 priority levels:

const (
    PriorityHigh   Priority = 0 // Urgent (user-facing)
    PriorityMedium Priority = 1 // Normal background tasks
    PriorityLow    Priority = 2 // Cleanup, analytics
)

Challenge: Redis doesn't have native priority queue support.

Solution: Maintain 3 separate Redis lists and poll in order:

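func (p *Pool) dispatch(ctx context.Context) {
    for {
        // Stop dispatching once the context is cancelled
        if ctx.Err() != nil {
            return
        }

        // Try high priority first
        jobID, err := p.queue.Dequeue(ctx, "high")
        if err == redis.Nil {
            // Try medium
            jobID, err = p.queue.Dequeue(ctx, "medium")
        }
        if err == redis.Nil {
            // Finally low
            jobID, err = p.queue.Dequeue(ctx, "low")
        }

        if err != nil || jobID == "" {
            // All queues empty (or a Redis error): pause briefly instead of
            // spinning. The 100ms value is illustrative.
            time.Sleep(100 * time.Millisecond)
            continue
        }
        p.jobChan <- jobID
    }
}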

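This ensures a waiting high-priority job (e.g., a password reset) is always dequeued before any lower-priority one (e.g., analytics). The trade-off of strict ordering is that a steady stream of high-priority jobs can starve the lower queues.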
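Redis `BRPOP` accepts several keys and checks them in the order given, so a single blocking call over the three lists follows the same lookup order as the loop above. The sketch below models only that ordering with in-memory slices; `popFirstNonEmpty` is a hypothetical helper, not Redis client code.

```go
package main

// popFirstNonEmpty models the lookup order of a multi-key pop: it scans the
// queues in the order given and removes the oldest ID from the first
// non-empty one. This is an in-memory model, not Redis client code.
func popFirstNonEmpty(queues map[string][]string, keys ...string) (key, jobID string, ok bool) {
    for _, k := range keys {
        if q := queues[k]; len(q) > 0 {
            queues[k] = q[1:]
            return k, q[0], true
        }
    }
    return "", "", false
}
```

Calling it repeatedly with the keys `"high", "medium", "low"` drains the high list completely before touching medium, and medium before low.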


2. Retry Mechanism with Exponential Backoff

Challenge: Transient failures (network glitches, rate limits) shouldn't permanently fail jobs.

Solution: Exponential backoff with configurable max retries:

func (r *RedisQueue) Retry(ctx context.Context, jobID string, attempt int) error {
    // attempt is zero-based: delay = 2^attempt * 5 seconds (5s, 10s, 20s, ...)
    delay := time.Duration(math.Pow(2, float64(attempt))) * 5 * time.Second

    // Schedule the retry by adding the job to Redis with a delay
    executeAt := time.Now().Add(delay)
    return r.ScheduleJob(ctx, jobID, executeAt)
}

Result:

  • 1st retry: 5s delay
  • 2nd retry: 10s delay
  • 3rd retry: 20s delay
  • After max retries → Dead Letter Queue

This increased our success rate from 85% to 98% by handling transient failures gracefully.
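The schedule above can be checked with a small standalone calculation. This is a sketch under assumptions: `maxRetries = 3` is an assumed value (the post only says the limit is configurable), and `retryDelay` and `nextStep` are hypothetical helper names.

```go
package main

import "time"

const (
    baseDelay  = 5 * time.Second
    maxRetries = 3 // assumed value; the post says max retries are configurable
)

// retryDelay returns the delay before the given zero-based retry attempt:
// 5s, 10s, 20s, ... (baseDelay * 2^attempt).
func retryDelay(attempt int) time.Duration {
    return baseDelay << uint(attempt)
}

// nextStep reports whether a failed job should be retried (and after what
// delay) or moved to the dead letter queue.
func nextStep(attempt int) (delay time.Duration, deadLetter bool) {
    if attempt >= maxRetries {
        return 0, true
    }
    return retryDelay(attempt), false
}
```

With these values, attempts 0, 1, and 2 are delayed by 5s, 10s, and 20s, and attempt 3 goes to the dead letter queue.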


3. Distributed Locking (Race Condition Prevention)

Challenge: Multiple workers might pick up the same job simultaneously.

Solution: Redis-based distributed locks using SET NX EX:

func (r *RedisQueue) AcquireJobLock(ctx context.Context, jobID string, workerID int) (bool, error) {
    lockKey := fmt.Sprintf("job:lock:%s", jobID)
    lockValue := fmt.Sprintf("worker:%d", workerID)

    // SET if Not eXists, EXpire in 30 seconds
    result := r.client.SetNX(ctx, lockKey, lockValue, 30*time.Second)
    return result.Val(), result.Err()

}

func (w *Worker) process(ctx context.Context, jobID string) {
    acquired, err := w.queue.AcquireJobLock(ctx, jobID, w.id)
    if err != nil || !acquired {
        return // Lock error, or another worker got it
    }
    defer w.queue.ReleaseJobLock(ctx, jobID)
    // Process job safely
}

Why 30 seconds? If a worker crashes mid-job, the lock auto-expires and another worker can retry.
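The lock semantics (set only if absent, expire after a TTL) can be modeled without Redis to see how expiry and release interact. The sketch below is an in-memory stand-in: `memLocks`, `fakeClock`, and `lockTTL` are hypothetical names. It also releases a lock only when the caller still owns it; whether the post's `ReleaseJobLock` does this is not shown, so treat that part as an assumption. Against real Redis, the compare-and-delete step is usually done atomically in a Lua script.

```go
package main

import (
    "sync"
    "time"
)

const lockTTL = 30 * time.Second

type lockEntry struct {
    owner     string
    expiresAt time.Time
}

// memLocks is an in-memory stand-in for Redis that mimics SET NX EX:
// a key can be set only if it is absent or its TTL has passed.
type memLocks struct {
    mu    sync.Mutex
    locks map[string]lockEntry
    now   func() time.Time // injectable clock, for demonstration
}

func newMemLocks(now func() time.Time) *memLocks {
    return &memLocks{locks: make(map[string]lockEntry), now: now}
}

// Acquire succeeds only if no live lock exists for the job.
func (m *memLocks) Acquire(jobID, owner string, ttl time.Duration) bool {
    m.mu.Lock()
    defer m.mu.Unlock()
    if e, held := m.locks[jobID]; held && m.now().Before(e.expiresAt) {
        return false
    }
    m.locks[jobID] = lockEntry{owner: owner, expiresAt: m.now().Add(ttl)}
    return true
}

// Release deletes the lock only if the caller still owns it, so a worker
// whose lock already expired cannot remove another worker's lock.
func (m *memLocks) Release(jobID, owner string) bool {
    m.mu.Lock()
    defer m.mu.Unlock()
    if e, held := m.locks[jobID]; held && e.owner == owner {
        delete(m.locks, jobID)
        return true
    }
    return false
}

// fakeClock is a manually advanced clock used to demonstrate expiry.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }
```

Advancing the fake clock by `lockTTL` lets a second worker acquire the lock, after which the first worker's late `Release` is rejected.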


4. Real-Time Monitoring via WebSocket

Challenge: Dashboard needs live updates without polling.

Solution: Redis Pub/Sub + WebSocket Hub

// Worker publishes event to Redis
func (w *Worker) broadcastStatusChange(job *models.Jobs, status string) {
    event := Event{
        Type: "job.status_changed",
        Payload: map[string]interface{}{
            "job_id":     job.ID,
            "new_status": status,
        },
        Timestamp: time.Now(),
    }

    // Publish to Redis channel
    hub.Publish(event) // → Redis Pub/Sub

}

// API server subscribes and forwards to WebSocket clients
func (h *Hub) StartRedisSubscriber(ctx context.Context) {
    pubsub := h.rdb.Subscribe(ctx, "ws:events")
    defer pubsub.Close()

    for msg := range pubsub.Channel() {
        var event Event
        if err := json.Unmarshal([]byte(msg.Payload), &event); err != nil {
            continue // Skip malformed events
        }

        // Broadcast to all connected WebSocket clients
        h.broadcast <- event
    }
}

React Dashboard Hook:

const { events, isConnected } = useWebSocket('ws://localhost:8080/ws');

useEffect(() => {
  events.forEach(event => {
    if (event.type === 'job.status_changed') {
      toast.success(`Job ${event.payload.job_id} ${event.payload.new_status}`);
    }
  });
}, [events]);

This provides sub-second latency for status updates — users see jobs complete in real-time.
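The fan-out step inside the hub (one incoming event, many connected clients) is not shown above. One possible shape is sketched here with plain channels standing in for WebSocket connections. It is an assumption, not the post's implementation: the original hub receives events on a `broadcast` channel, while this sketch uses a mutex-guarded client map and a `Broadcast` method.

```go
package main

import "sync"

// Event mirrors the shape used above, reduced to the fields needed here.
type Event struct {
    Type    string
    Payload map[string]interface{}
}

// Hub fans each event out to every registered client channel.
type Hub struct {
    mu      sync.RWMutex
    clients map[chan Event]struct{}
}

func NewHub() *Hub {
    return &Hub{clients: make(map[chan Event]struct{})}
}

// Register adds a client with a small buffer and returns its channel.
func (h *Hub) Register() chan Event {
    ch := make(chan Event, 16)
    h.mu.Lock()
    h.clients[ch] = struct{}{}
    h.mu.Unlock()
    return ch
}

// Unregister removes the client and closes its channel.
func (h *Hub) Unregister(ch chan Event) {
    h.mu.Lock()
    if _, ok := h.clients[ch]; ok {
        delete(h.clients, ch)
        close(ch)
    }
    h.mu.Unlock()
}

// Broadcast delivers the event to all clients without blocking; a client
// whose buffer is full misses the event rather than stalling the hub.
func (h *Hub) Broadcast(e Event) (delivered int) {
    h.mu.RLock()
    defer h.mu.RUnlock()
    for ch := range h.clients {
        select {
        case ch <- e:
            delivered++
        default: // slow client, drop the event
        }
    }
    return delivered
}
```

Dropping events for slow clients is a design choice suited to a dashboard feed, where a missed status update is less harmful than a blocked hub.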


5. Worker Health Monitoring

Challenge: Detect unhealthy workers and visualize system load.

Solution: Store worker metrics in Redis with TTL:

type WorkerMetrics struct {
    ID        int    `json:"id"`
    Status    string `json:"status"` // idle | busy
    Processed int64  `json:"processed"`
    Failed    int64  `json:"failed"`
}

func (w *Worker) updateHealth(ctx context.Context) {
    metrics := WorkerMetrics{
        ID:        w.id,
        Status:    w.currentStatus,
        Processed: w.stats.Processed,
        Failed:    w.stats.Failed,
    }

    key := fmt.Sprintf("health:worker:%d", w.id)
    data, err := json.Marshal(metrics)
    if err != nil {
        return
    }

    // Store with 30s TTL: if the worker dies, its metrics auto-expire
    w.redis.Set(ctx, key, data, 30*time.Second)
}

Dashboard displays:

  • Worker 1 (busy): Processed 342, Failed 12, Retried 28
  • Worker 2 (idle): Processed 289, Failed 8, Retried 15
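A TTL-based health key only stays alive if the worker refreshes it more often than the TTL. The post does not say how often `updateHealth` runs, so the loop below is a sketch under that assumption: `runHeartbeat` and `demoHeartbeat` are hypothetical names, and an interval such as 10s against the 30s TTL is an illustrative choice.

```go
package main

import (
    "context"
    "time"
)

// runHeartbeat calls report immediately and then once per interval until ctx
// is cancelled. The interval should be well below the health key's TTL so
// the key never expires while the worker is alive.
func runHeartbeat(ctx context.Context, interval time.Duration, report func()) {
    report()
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if ctx.Err() != nil {
                return // cancelled while the tick was pending
            }
            report()
        }
    }
}

// demoHeartbeat runs the loop with a short interval, cancels it after n
// reports, and returns how many reports were made.
func demoHeartbeat(n int) int {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    count := 0
    done := make(chan struct{})
    go func() {
        defer close(done)
        runHeartbeat(ctx, time.Millisecond, func() {
            count++
            if count == n {
                cancel()
            }
        })
    }()
    <-done
    return count
}
```

In a worker, `report` would wrap the health update, and the context would be the worker's shutdown context.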

Scaling Strategy

Horizontal Scaling (Recommended)

Load Balancer
│
├─→ API Server 1 ───┐
├─→ API Server 2 ───┼─→ Redis Cluster
└─→ API Server 3 ───┘
                    │
       ┌────────────┘
       │
       ├─→ Worker Instance 1 (5 workers)
       ├─→ Worker Instance 2 (5 workers)
       ├─→ Worker Instance 3 (5 workers)
       └─→ Worker Instance N (5 workers)

Why this works:

  • API servers are stateless → add more for higher throughput
  • Workers are independent → add more for faster processing
  • Redis handles coordination (locks, pub/sub)
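Inside one worker instance, the "5 workers" in the diagram are goroutines draining a shared channel. A minimal sketch of that pool follows; `runPool` is a hypothetical helper, and the channel plays the role of the `jobChan` fed by the dispatcher.

```go
package main

import "sync"

// runPool starts n workers that consume job IDs from jobs until the channel
// is closed, and returns how many jobs each worker handled.
func runPool(n int, jobs <-chan string, handle func(workerID int, jobID string)) []int {
    counts := make([]int, n)
    var wg sync.WaitGroup
    for id := 0; id < n; id++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for jobID := range jobs {
                handle(id, jobID)
                counts[id]++ // each goroutine writes only its own slot
            }
        }(id)
    }
    wg.Wait()
    return counts
}
```

Because every goroutine reads from the same channel, an idle worker picks up the next job automatically, and adding capacity is a matter of raising `n` or starting another instance.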

Kubernetes Deployment

# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-queue-api
spec:
  replicas: 3 # 3 API instances
  selector:
    matchLabels:
      app: task-queue-api
  template:
    metadata:
      labels:
        app: task-queue-api # must match spec.selector
    spec:
      containers:
        - name: api
          image: mhabib34/task-queue-api:latest
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
              cpu: "500m"
---
# worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-queue-worker
spec:
  replicas: 5 # 5 worker pods = 25 total workers
  selector:
    matchLabels:
      app: task-queue-worker
  template:
    metadata:
      labels:
        app: task-queue-worker # must match spec.selector
    spec:
      containers:
        - name: worker
          image: mhabib34/task-queue-worker:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"

Dashboard Features

I built a React dashboard with real-time updates:

[Screenshots: Dashboard 1, Dashboard 2]

Key Features:

  1. Live Statistics Cards

    • Total jobs by status (completed, failed, queued)
    • Success rate percentage
    • Average processing time
  2. Queue Depth Chart

    • Real-time bar chart showing jobs in each priority queue
    • Updates every 3 seconds
  3. Worker Grid

    • Status badges (idle/busy)
    • Per-worker metrics (processed, failed, retried)
    • Uptime tracker
  4. Job List Table

    • Filter by status, type
    • Pagination
    • One-click job cancellation
  5. Live Event Feed

    • WebSocket-powered event stream
    • Color-coded event types
    • Timestamps with relative time ("2 minutes ago")

Tech Stack:

  • React 18 + TypeScript
  • Tailwind CSS (styling)
  • Recharts (charts)
  • date-fns (date formatting)
  • Custom WebSocket hook

Challenges & Solutions

Challenge 1: Memory Leak in Image Processing

Problem: Worker memory grew from 100MB → 2GB over 6 hours.

Root Cause: Image processing library wasn't releasing decoded images.

Solution:

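func (e *ImageExecutor) Execute(ctx context.Context, job *models.Jobs) error {
    // imagePath comes from the job payload (extraction omitted here)
    img, err := imaging.Open(imagePath)
    if err != nil {
        return fmt.Errorf("open image: %w", err)
    }

    // Workaround: force a garbage collection after processing so decoded
    // image buffers are reclaimed promptly
    defer runtime.GC()

    // Process image
    resized := imaging.Resize(img, 800, 0, imaging.Lanczos)
    _ = resized // saving the resized image is omitted here

    return nil
}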

Lesson: Always profile long-running processes with pprof.
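For reference, exposing pprof in a long-running Go process takes only the standard library. The sketch below registers the handlers on a dedicated mux; `newDebugMux` and `probe` are hypothetical helpers, and the port in the usage note is an assumed, conventional choice.

```go
package main

import (
    "net/http"
    "net/http/httptest"
    "net/http/pprof"
)

// newDebugMux registers the standard pprof handlers on a dedicated mux so
// they can be served on an internal-only port, separate from the public API.
func newDebugMux() *http.ServeMux {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    return mux
}

// probe issues an in-process GET and returns the status code (smoke check).
func probe(h http.Handler, path string) int {
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
    return rec.Code
}
```

A worker could start this with `go http.ListenAndServe("localhost:6060", newDebugMux())` and then compare heap profiles taken a few hours apart using `go tool pprof http://localhost:6060/debug/pprof/heap`.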


Challenge 2: Race Condition in Job Locking

Problem: Occasionally, 2 workers processed the same job.

Root Cause: Lock acquisition and job dequeue weren't atomic.

Solution: Acquire the lock immediately after dequeue, and skip the job if another worker already holds it:

// ❌ Wrong: Dequeue first, lock later, and never check the result
jobID := queue.Dequeue()
acquired := queue.AcquireJobLock(jobID, workerID) // Another worker might grab it

// ✅ Correct: Dequeue, then immediately lock before any processing
jobID := queue.Dequeue()
acquired, _ := queue.AcquireJobLock(jobID, workerID)
if !acquired {
    return // Skip, another worker got it
}
defer queue.ReleaseJobLock(jobID)
// Safe to process now

Lesson: In distributed systems, atomicity is critical.


Challenge 3: WebSocket Connection Storms

Problem: Dashboard reconnects flooded API server when Redis restarted.

Root Cause: All clients reconnected simultaneously (thundering herd).

Solution: Exponential backoff with jitter:

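function useWebSocket(url: string) {
  // A ref keeps the delay across reconnects without re-running the effect
  const retryDelay = useRef(1000);

  useEffect(() => {
    let ws: WebSocket;
    let timer: ReturnType<typeof setTimeout>;
    let unmounted = false;

    const connect = () => {
      ws = new WebSocket(url);

      ws.onopen = () => {
        retryDelay.current = 1000; // Reset backoff after a successful connection
      };

      ws.onclose = () => {
        if (unmounted) return;

        // Add jitter: ±20% randomness
        const delay = retryDelay.current * (0.8 + Math.random() * 0.4);

        // Exponential backoff: 1s → 2s → 4s → 8s (max 30s)
        retryDelay.current = Math.min(retryDelay.current * 2, 30000);

        // Reconnect
        timer = setTimeout(connect, delay);
      };
    };

    connect();

    return () => {
      unmounted = true;
      clearTimeout(timer);
      ws.close();
    };
  }, [url]);
}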

Lesson: Always add jitter to prevent synchronized retries.
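The same calculation carries over to Go, for example if job retries or worker reconnects need jitter too. This is a sketch, not code from the project: `jitteredBackoff`, `newRand`, `reconnectBase`, and `reconnectMax` are hypothetical names, with the 1s base and 30s cap taken from the hook above.

```go
package main

import (
    "math/rand"
    "time"
)

const (
    reconnectBase = time.Second
    reconnectMax  = 30 * time.Second
)

// newRand returns a seeded random source; a fixed seed makes runs repeatable.
func newRand(seed int64) *rand.Rand {
    return rand.New(rand.NewSource(seed))
}

// jitteredBackoff doubles base for each zero-based attempt, caps the result
// at max, then scales it by a random factor in [0.8, 1.2).
func jitteredBackoff(base, max time.Duration, attempt int, rnd *rand.Rand) time.Duration {
    delay := base
    for i := 0; i < attempt && delay < max; i++ {
        delay *= 2
    }
    if delay > max {
        delay = max
    }
    factor := 0.8 + rnd.Float64()*0.4
    return time.Duration(float64(delay) * factor)
}
```

Applying the cap before the jitter keeps the spread proportional even at the maximum, so clients stuck at 30s still reconnect between 24s and 36s apart rather than in lockstep.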


Trade-offs

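  • Redis is a single point of failure (unless replicated or clustered)
  • Delivery is at-least-once, not exactly-once, so a job can run more than once
  • Memory-based queue → data loss risk unless Redis persistence (RDB/AOF) is configured
  • A job that runs longer than the 30-second lock TTL can lose its lock and be picked up by a second worker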
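With at-least-once delivery, a common mitigation is to make handlers idempotent by recording which jobs already completed. The sketch below shows the idea with an in-memory set; `completedSet` and `RunOnce` are hypothetical names, and in this system the durable record would more plausibly be the job's status in PostgreSQL.

```go
package main

import "sync"

// completedSet is an in-memory stand-in for a durable "already done" record,
// such as the job's status column in PostgreSQL.
type completedSet struct {
    mu   sync.Mutex
    done map[string]bool
}

func newCompletedSet() *completedSet {
    return &completedSet{done: make(map[string]bool)}
}

// RunOnce executes handler unless jobID was already completed, and marks the
// job completed only when the handler succeeds. It reports whether the
// handler ran. Holding the mutex during the handler serialises all jobs,
// which is acceptable for a sketch but not for production use.
func (c *completedSet) RunOnce(jobID string, handler func() error) (ran bool, err error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.done[jobID] {
        return false, nil // duplicate delivery, skip side effects
    }
    if err := handler(); err != nil {
        return true, err // not marked done, so a retry can run it again
    }
    c.done[jobID] = true
    return true, nil
}
```

A second delivery of the same job ID then becomes a no-op, while a failed run stays eligible for retry.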

Lessons Learned

1. Start Simple, Then Optimize

I initially tried to implement job priorities with Redis Sorted Sets (Z-commands). It was complex and buggy. Switching to 3 separate lists was simpler and faster.

Takeaway: Don't over-engineer. The simplest solution often performs best.


2. Observability is Non-Negotiable

Without real-time monitoring, debugging production issues was a nightmare. Adding the WebSocket dashboard saved me hours of SSH-ing into servers.

Takeaway: Build observability into your system from day one.


3. Testing Distributed Systems is Hard

Unit tests were easy, but integration tests (Redis + PostgreSQL + multiple workers) were flaky. I ended up using Docker Compose for test environments.

Takeaway: Invest in proper test infrastructure early.


4. Go's Simplicity is a Superpower

No callbacks, no promises, no async/await — just goroutines and channels. This made the codebase incredibly readable.

Takeaway: Choose languages that match your problem domain.


What's Next?

I'm planning these enhancements:

  1. Job Scheduling (Cron jobs)

    • Run reports every Monday at 9am
    • Clean up old data weekly
  2. Job Dependencies (DAG)

    • Run Job B only after Job A completes
    • Useful for ETL pipelines
  3. Rate Limiting

    • Max 10 email jobs/second to avoid provider bans
    • Per-customer quotas
  4. Prometheus + Grafana

    • Expose /metrics endpoint
    • Historical performance tracking
  5. Admin CLI Tool

    • Inspect queues
    • Manually retry failed jobs
    • Pause/resume workers

Tech Stack Summary

Backend:
  Language: Go 1.21+
  Framework: Gin (HTTP router)
  ORM: GORM
  Queue: Redis 7+
  Database: PostgreSQL 15+
  WebSocket: gorilla/websocket

Frontend:
  Language: TypeScript
  Framework: React 18
  Build Tool: Vite
  Styling: Tailwind CSS
  Charts: Recharts

Infrastructure:
  Container: Docker
  Orchestration: Kubernetes
  CI/CD: GitHub Actions (planned)

© 2026 MUHAMMAD HABIB. BUILT WITH ❤️