Building a Production-Ready Distributed Task Queue System — Blog

Note: I built a distributed task queue system from scratch using Go, Redis, and PostgreSQL that handles 10,000+ jobs/second with real-time monitoring. This post covers the architecture decisions, challenges faced, and lessons learned.

The Problem

During my work on a previous project, I encountered a critical bottleneck: synchronous processing of heavy tasks was blocking API responses. Users had to wait 30+ seconds for image uploads, email sends, and report generation. This led to:

❌ Poor user experience (timeouts, slow responses)
❌ Wasted server resources (blocked connections)
❌ Difficult scaling (vertical scaling only)
❌ No visibility into failed tasks

I needed a solution that could:

✅ Process tasks asynchronously in the background
✅ Handle millions of jobs with different priorities
✅ Automatically retry failed jobs
✅ Scale horizontally (add more workers)
✅ Provide real-time monitoring

Why not use existing solutions? While tools like Celery (Python) and BullMQ (Node.js) exist, I wanted to:

Build something in Go for superior performance
Understand the internals of distributed systems
Have full control over the architecture
Create a portfolio piece demonstrating system design skills

System Architecture

High-Level Overview

I designed the system with separation of concerns in mind:

    ┌─────────────┐
    │  API Server │ ← HTTP requests from clients
    └──────┬──────┘
           │
           ├─→ PostgreSQL (persist jobs)
           └─→ Redis (enqueue job IDs)
                │
        ┌──────▼──────┐
        │ Worker Pool │ ← Polls Redis, processes jobs
        └──────┬──────┘
               │
               ├─→ Execute job handlers
               ├─→ Update job status in DB
               └─→ Broadcast events via Redis Pub/Sub
                    │
             ┌──────▼──────┐
             │  WebSocket  │ ← Real-time updates to dashboard
             └─────────────┘

Key Design Decisions

1. Go for Performance

Goroutines enable true parallelism (not just concurrency)
Low memory footprint (~50MB for API, ~100MB per worker instance)
Built-in concurrency primitives (channels, mutexes)

2. Redis as Queue + Pub/Sub

Queue: Fast FIFO with LPUSH / BRPOP (O(1) operations)
Pub/Sub: Real-time event broadcasting across processes
Locks: Distributed locking via SET NX EX for job deduplication

3. PostgreSQL for Durability

Persistent job history and results
Complex queries (filtering, analytics)
ACID guarantees for critical data

4. Separate API & Worker Processes

API servers scale independently from workers
Workers can run on different machines (horizontal scaling)
Each component restarts without affecting the other

Implementation Deep Dive

1. Priority Queue System

Jobs are categorized into 3 priority levels:

const (
    PriorityHigh   Priority = 0 // Urgent (user-facing)
    PriorityMedium Priority = 1 // Normal background tasks
    PriorityLow    Priority = 2 // Cleanup, analytics
)

Challenge: Redis doesn't have native priority queue support.

Solution: Maintain 3 separate Redis lists and poll in order:

func (p *Pool) dispatch(ctx context.Context) {
    for {
        // Try high priority first
        jobID, err := p.queue.Dequeue(ctx, "high")
        if err == redis.Nil {
            // Try medium
            jobID, err = p.queue.Dequeue(ctx, "medium")
        }
        if err == redis.Nil {
            // Finally low
            jobID, err = p.queue.Dequeue(ctx, "low")
        }

        if jobID != "" {
            p.jobChan <- jobID
        }
    }

}

This ensures high-priority jobs (e.g., password resets) always process before low-priority ones (e.g., analytics).

2. Retry Mechanism with Exponential Backoff

Challenge: Transient failures (network glitches, rate limits) shouldn't permanently fail jobs.

Solution: Exponential backoff with configurable max retries:

func (r *RedisQueue) Retry(ctx context.Context, jobID string, attempt int) error {
    // Delay = 2^attempt * 5 seconds
    delay := time.Duration(math.Pow(2, float64(attempt))) * 5 * time.Second

    // Schedule retry by adding to Redis with delay
    executeAt := time.Now().Add(delay)
    return r.ScheduleJob(ctx, jobID, executeAt)

}

Result:

1st retry: 5s delay
2nd retry: 10s delay
3rd retry: 20s delay
After max retries → Dead Letter Queue

This increased our success rate from 85% to 98% by handling transient failures gracefully.

3. Distributed Locking (Race Condition Prevention)

Challenge: Multiple workers might pick up the same job simultaneously.

Solution: Redis-based distributed locks using SET NX EX:

func (r *RedisQueue) AcquireJobLock(ctx context.Context, jobID string, workerID int) (bool, error) {
    lockKey := fmt.Sprintf("job:lock:%s", jobID)
    lockValue := fmt.Sprintf("worker:%d", workerID)

    // SET if Not eXists, EXpire in 30 seconds
    result := r.client.SetNX(ctx, lockKey, lockValue, 30*time.Second)
    return result.Val(), result.Err()

}

func (w \*Worker) process(ctx context.Context, jobID string) {
    acquired, \_ := w.queue.AcquireJobLock(ctx, jobID, w.id)
    if !acquired {
        return // Another worker got it
    }
    defer w.queue.ReleaseJobLock(ctx, jobID)
    // Process job safely
}

Why 30 seconds? If a worker crashes mid-job, the lock auto-expires and another worker can retry.

4. Real-Time Monitoring via WebSocket

Challenge: Dashboard needs live updates without polling.

Solution: Redis Pub/Sub + WebSocket Hub

// Worker publishes event to Redis
func (w *Worker) broadcastStatusChange(job *models.Jobs, status string) {
    event := Event{
        Type: "job.status_changed",
        Payload: map[string]interface{}{
            "job_id":     job.ID,
            "new_status": status,
        },
        Timestamp: time.Now(),
    }

    // Publish to Redis channel
    hub.Publish(event) // → Redis Pub/Sub

}

// API server subscribes and forwards to WebSocket clients
func (h \*Hub) StartRedisSubscriber(ctx context.Context) {
pubsub := h.rdb.Subscribe(ctx, "ws:events")

    for msg := range pubsub.Channel() {
        var event Event
        json.Unmarshal([]byte(msg.Payload), &event)

        // Broadcast to all connected WebSocket clients
        h.broadcast <- event
    }

}

React Dashboard Hook:

const { events, isConnected } = useWebSocket('ws://localhost:8080/ws');

useEffect(() => {
    events.forEach(event => {
    if (event.type === 'job.status_changed') {
    toast.success(\`Job \${event.payload.job_id} \${event.payload.new_status}\`);
    }
 });
}, [events]);

This provides sub-second latency for status updates — users see jobs complete in real-time.

5. Worker Health Monitoring

Challenge: Detect unhealthy workers and visualize system load.

Solution: Store worker metrics in Redis with TTL:

type WorkerMetrics struct {
    ID        int    \`json:"id"\`
    Status    string \`json:"status"\`  // idle | busy
    Processed int64  \`json:"processed"\`
    Failed    int64  \`json:"failed"\`
}

func (w \*Worker) updateHealth() {
    metrics := WorkerMetrics{
        ID: w.id,
        Status: w.currentStatus,
        Processed: w.stats.Processed,
        Failed: w.stats.Failed,
    }

    key := fmt.Sprintf("health:worker:%d", w.id)
    data, _ := json.Marshal(metrics)

    // Store with 30s TTL — if worker dies, metrics auto-expire
    w.redis.Set(ctx, key, data, 30*time.Second)

}

Dashboard displays:

Worker 1 (busy): Processed 342, Failed 12, Retried 28
Worker 2 (idle): Processed 289, Failed 8, Retried 15

Scaling Strategy

Horizontal Scaling (Recommended)

Load Balancer
│
├─→ API Server 1 ───┐
├─→ API Server 2 ───┼─→ Redis Cluster
└─→ API Server 3 ───┘
                    │
       ┌────────────┘
       │
       ├─→ Worker Instance 1 (5 workers)
       ├─→ Worker Instance 2 (5 workers)
       ├─→ Worker Instance 3 (5 workers)
       └─→ Worker Instance N (5 workers)

Why this works:

API servers are stateless → add more for higher throughput
Workers are independent → add more for faster processing
Redis handles coordination (locks, pub/sub)

Kubernetes Deployment

# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-queue-api
spec:
  replicas: 3 # 3 API instances
  template:
    spec:
      containers:
        - name: api
          image: mhabib34/task-queue-api:latest
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
              cpu: "500m"
---
# worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-queue-worker
spec:
  replicas: 5 # 5 worker pods = 25 total workers
  template:
    spec:
      containers:
        - name: worker
          image: mhabib34/task-queue-worker:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"

Dashboard Features

I built a React dashboard with real-time updates:

Dashboard 1 Dashboard 2

Key Features:

Live Statistics Cards
- Total jobs by status (completed, failed, queued)
- Success rate percentage
- Average processing time
Queue Depth Chart
- Real-time bar chart showing jobs in each priority queue
- Updates every 3 seconds
Worker Grid
- Status badges (idle/busy)
- Per-worker metrics (processed, failed, retried)
- Uptime tracker
Job List Table
- Filter by status, type
- Pagination
- One-click job cancellation
Live Event Feed
- WebSocket-powered event stream
- Color-coded event types
- Timestamps with relative time ("2 minutes ago")

Tech Stack:

React 18 + TypeScript
Tailwind CSS (styling)
Recharts (charts)
date-fns (date formatting)
Custom WebSocket hook

Challenges & Solutions

Challenge 1: Memory Leak in Image Processing

Problem: Worker memory grew from 100MB → 2GB over 6 hours.

Root Cause: Image processing library wasn't releasing decoded images.

Solution:

func (e *ImageExecutor) Execute(ctx context.Context, job *models.Jobs) error {
    img, _ := imaging.Open(imagePath)

    // Force garbage collection after processing
    defer runtime.GC()

    // Process image
    resized := imaging.Resize(img, 800, 0, imaging.Lanczos)

    return nil

}

Lesson: Always profile long-running processes with pprof.

Challenge 2: Race Condition in Job Locking

Problem: Occasionally, 2 workers processed the same job.

Root Cause: Lock acquisition and job dequeue weren't atomic.

Solution: Check lock after dequeue:

// ❌ Wrong: Dequeue first, lock later
jobID := queue.Dequeue()
acquired := queue.AcquireJobLock(jobID, workerID) // Another worker might grab it

// ✅ Correct: Dequeue, then immediately lock before any processing
jobID := queue.Dequeue()
acquired, \_ := queue.AcquireJobLock(jobID, workerID)
if !acquired {
return // Skip, another worker got it
}
defer queue.ReleaseJobLock(jobID)
// Safe to process Now

Lesson: In distributed systems, atomicity is critical.

Challenge 3: WebSocket Connection Storms

Problem: Dashboard reconnects flooded API server when Redis restarted.

Root Cause: All clients reconnected simultaneously (thundering herd).

Solution: Exponential backoff with jitter:

function useWebSocket(url: string) {
  const [retryDelay, setRetryDelay] = useState(1000);

  useEffect(() => {
    const ws = new WebSocket(url);

    ws.onclose = () => {
      // Exponential backoff: 1s → 2s → 4s → 8s (max 30s)
      const delay = Math.min(retryDelay * 2, 30000);

      // Add jitter: ±20% randomness
      const jitter = delay * (0.8 + Math.random() * 0.4);

      setTimeout(() => {
        setRetryDelay(delay);
        // Reconnect
      }, jitter);
    };
  }, [url, retryDelay]);
}

Lesson: Always add jitter to prevent synchronized retries.

Trade-offs

Redis is single point of failure (unless clustered)
Not exactly-once guarantee (at-least-once)
Memory-based queue → data loss risk without persistence config
Lock expiration edge case under long-running jobs

Lessons Learned

1. Start Simple, Then Optimize

I initially tried to implement job priorities with Redis Sorted Sets (Z-commands). It was complex and buggy. Switching to 3 separate lists was simpler and faster.

Takeaway: Don't over-engineer. The simplest solution often performs best.

2. Observability is Non-Negotiable

Without real-time monitoring, debugging production issues was a nightmare. Adding the WebSocket dashboard saved me hours of SSH-ing into servers.

Takeaway: Build observability into your system from day one.

3. Testing Distributed Systems is Hard

Unit tests were easy, but integration tests (Redis + PostgreSQL + multiple workers) were flaky. I ended up using Docker Compose for test environments.

Takeaway: Invest in proper test infrastructure early.

4. Go's Simplicity is a Superpower

No callbacks, no promises, no async/await — just goroutines and channels. This made the codebase incredibly readable.

Takeaway: Choose languages that match your problem domain.

What's Next?

I'm planning these enhancements:

Job Scheduling (Cron jobs)
- Run reports every Monday at 9am
- Clean up old data weekly
Job Dependencies (DAG)
- Run Job B only after Job A completes
- Useful for ETL pipelines
Rate Limiting
- Max 10 email jobs/second to avoid provider bans
- Per-customer quotas
Prometheus + Grafana
- Expose /metrics endpoint
- Historical performance tracking
Admin CLI Tool
- Inspect queues
- Manually retry failed jobs
- Pause/resume workers

Tech Stack Summary

Backend:
  Language: Go 1.21+
  Framework: Gin (HTTP router)
  ORM: GORM
  Queue: Redis 7+
  Database: PostgreSQL 15+
  WebSocket: gorilla/websocket

Frontend:
Language: TypeScript
Framework: React 18
Build Tool: Vite
Styling: Tailwind CSS
Charts: Recharts

Infrastructure:
Container: Docker
Orchestration: Kubernetes
CI/CD: GitHub Actions (planned)