Building a Production-Ready Email Delivery Engine

Delivering email reliably at scale requires careful architecture and resilience patterns. This guide walks through building a production-ready email delivery engine, drawing on real experience operating an email server that processes thousands of messages daily.

Table of Contents

  1. Delivery Architecture Overview
  2. Worker Pool and Redis Queue
  3. MX Resolution with Caching
  4. Circuit Breaker Pattern
  5. Exponential Backoff with Jitter
  6. TLS Handling Strategy
  7. Relay Host Support
  8. Defensive Coding: Nil Pointer Fix
  9. Metrics and Monitoring
  10. Testing Approach
  11. Performance Tuning
  12. Production Deployment

Delivery Architecture Overview

The delivery engine is built around a worker pool pattern with Redis as the message queue, providing horizontal scalability and fault tolerance.

graph TD
    A[Incoming Email] --> B[Redis Queue]
    B --> C[Worker 1]
    B --> D[Worker 2]
    B --> E[Worker 3]
    B --> F[Worker 4]
    C --> G[Circuit Breaker per Domain]
    D --> G
    E --> G
    F --> G
    G --> H[MX Resolver]
    H --> I[Relay Host / Direct SMTP]
    I --> J[Recipient Server]
    K[Recovery Worker] --> B
    L[Monitoring] --> M[Prometheus Metrics]

Core Components

| Component       | Purpose                            | Implementation                     |
|-----------------|------------------------------------|------------------------------------|
| Worker Pool     | Concurrent delivery processing     | 4 workers (configurable)           |
| Redis Queue     | Message persistence and scheduling | Sorted sets with retry times       |
| Circuit Breaker | Per-domain resilience              | 5 failures → open, 5 min timeout   |
| MX Resolver     | DNS lookup for recipient domains   | 5-minute TTL cache                 |
| TLS Handler     | Secure delivery with fallback      | STARTTLS with graceful degradation |

Worker Pool and Redis Queue

The worker pool model provides parallelism while Redis ensures durability across restarts.

Redis Queue Implementation

The queue uses Redis sorted sets with timestamps as scores, enabling efficient retrieval of ready messages.

// @filename: main.go
// Message structure stored in Redis
type Message struct {
    ID          string    `json:"id"`
    Sender      string    `json:"sender"`
    Recipients  []string  `json:"recipients"`
    MessagePath string    `json:"message_path"`
    Size        int64     `json:"size"`
    Attempts    int       `json:"attempts"`
    MaxAttempts int       `json:"max_attempts"`
    LastAttempt time.Time `json:"last_attempt,omitempty"`
    NextAttempt time.Time `json:"next_attempt"`
    Domain      string    `json:"domain"` // For circuit breaker
    Status      Status    `json:"status"`
    CreatedAt   time.Time `json:"created_at"`
}

// Dequeue: Atomically move message from pending to processing
func (q *RedisQueue) Dequeue(ctx context.Context) (*Message, error) {
    now := float64(time.Now().UnixNano())

    // Get messages ready for delivery
    results, err := q.client.ZRangeByScoreWithScores(ctx, q.pendingKey(), &redis.ZRangeBy{
        Min:   "-inf",
        Max:   fmt.Sprintf("%f", now),
        Count: 1,
    }).Result()
    if err != nil {
        return nil, err
    }

    if len(results) == 0 {
        return nil, nil // No messages ready
    }

    msgID := results[0].Member.(string)

    // Atomic transaction: move from pending to processing
    pipe := q.client.TxPipeline()
    removed := pipe.ZRem(ctx, q.pendingKey(), msgID)
    pipe.SAdd(ctx, q.processingKey(), msgID)
    if _, err := pipe.Exec(ctx); err != nil {
        return nil, err
    }

    // Another worker claimed the message first: ZRem removed nothing
    if removed.Val() == 0 {
        return nil, nil
    }

    return q.GetMessage(ctx, msgID)
}

Worker Pool Architecture

Each worker is a goroutine that continuously polls the queue and delivers messages.

// @filename: main.go
// Worker: Delivery goroutine
func (e *Engine) worker(id int) {
    defer e.wg.Done()

    for {
        select {
        case <-e.ctx.Done():
            return
        default:
        }

        msg, err := e.queue.Dequeue(e.ctx)
        if err != nil {
            time.Sleep(time.Second)
            continue
        }

        if msg == nil {
            time.Sleep(500 * time.Millisecond) // Back off when empty
            continue
        }

        e.deliverMessage(msg)
    }
}

Redis Connection Pooling

Configure Redis for production reliability with proper connection pooling:

// @filename: main.go
opts, err := redis.ParseURL(cfg.RedisURL)
if err != nil {
    return nil, fmt.Errorf("invalid Redis URL: %w", err)
}
opts.MaxRetries = 3
opts.MinRetryBackoff = 100 * time.Millisecond
opts.MaxRetryBackoff = 1 * time.Second
opts.DialTimeout = 5 * time.Second
opts.ReadTimeout = 3 * time.Second
opts.WriteTimeout = 3 * time.Second
opts.PoolSize = 10
opts.MinIdleConns = 5
opts.MaxIdleConns = 10
opts.ConnMaxIdleTime = 5 * time.Minute
opts.ConnMaxLifetime = 30 * time.Minute

MX Resolution with Caching

Efficient DNS resolution reduces load and improves delivery speed.

MX Resolver Implementation

// @filename: main.go
type MXResolver struct {
    cache    sync.Map // domain -> *cachedMX
    resolver *net.Resolver
    ttl      time.Duration
}

// cachedMX pairs resolved records with their cache expiry
type cachedMX struct {
    records   []MXRecord
    expiresAt time.Time
}

type MXRecord struct {
    Host       string
    Preference uint16
    ExpiresAt  time.Time
}

// Lookup with caching
func (r *MXResolver) Lookup(ctx context.Context, domain string) ([]MXRecord, error) {
    // Check cache first
    if cached, ok := r.cache.Load(domain); ok {
        c := cached.(*cachedMX)
        if time.Now().Before(c.expiresAt) {
            return c.records, nil
        }
        r.cache.Delete(domain)
    }

    // Perform DNS lookup
    records, err := r.lookupMX(ctx, domain)
    if err != nil {
        return nil, err
    }

    // Cache results
    expiresAt := time.Now().Add(r.ttl)
    r.cache.Store(domain, &cachedMX{
        records:   records,
        expiresAt: expiresAt,
    })

    return records, nil
}

A Record Fallback

Per RFC 5321, if no MX records exist, the domain’s A record is used as a fallback:

// @filename: main.go
func (r *MXResolver) lookupAFallback(ctx context.Context, domain string) ([]MXRecord, error) {
    addrs, err := r.resolver.LookupHost(ctx, domain)
    if err != nil || len(addrs) == 0 {
        return nil, ErrNoMXRecords
    }

    return []MXRecord{
        {
            Host:       domain,
            Preference: 0,
        },
    }, nil
}

SSRF Protection

Filter out private/reserved IP addresses to prevent Server-Side Request Forgery:

// @filename: handlers.go
func isPrivateIP(ipStr string) bool {
    ip := net.ParseIP(ipStr)
    if ip == nil {
        return true
    }

    if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
        return true
    }

    reservedRanges := []string{
        "100.64.0.0/10",   // Carrier-grade NAT
        "192.0.2.0/24",    // TEST-NET-1
        "198.51.100.0/24", // TEST-NET-2
        "203.0.113.0/24",  // TEST-NET-3
    }

    for _, cidr := range reservedRanges {
        _, network, _ := net.ParseCIDR(cidr)
        if network.Contains(ip) {
            return true
        }
    }

    return false
}

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by blocking requests to problematic domains.

Circuit Breaker States

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: 5 consecutive failures
    Open --> HalfOpen: 5 min timeout
    HalfOpen --> Closed: 2 successes
    HalfOpen --> Open: 1 failure

Per-Domain Circuit Breaker

Each recipient domain gets its own circuit breaker:

// @filename: main.go
type CircuitBreaker struct {
    config         Config
    state          int32 // atomic State
    failureCount   int64
    successCount   int64
    lastFailureTime int64
}

type State int32

const (
    StateClosed    State = iota
    StateOpen
    StateHalfOpen
)

Configuration

// @filename: main.go
breakers: resilience.NewBreakerRegistry(func(key string) resilience.Config {
    return resilience.Config{
        Name:             "smtp:" + key,
        FailureThreshold: 5,               // Open after 5 failures
        SuccessThreshold: 2,               // Close after 2 successes
        Timeout:          5 * time.Minute, // 5 min to half-open
        HalfOpenMaxCalls: 2,               // Max 2 calls in half-open
        ExecutionTimeout: 2 * time.Minute, // 2 min per attempt
    }
})

Execution Through Circuit Breaker

// @filename: main.go
func (e *Engine) deliverMessage(msg *queue.Message) {
    ctx := e.ctx // engine-scoped context, cancelled on shutdown
    breaker := e.breakers.Get(msg.Domain)

    if breaker == nil {
        // Defensive: Handle nil breaker (bug prevention)
        err := fmt.Errorf("invalid domain: %q", msg.Domain)
        e.queue.Fail(ctx, msg.ID, err.Error())
        return
    }

    if breaker.State() == resilience.StateOpen {
        // Circuit is open, defer delivery
        e.queue.Retry(ctx, msg.ID, ErrCircuitOpen)
        return
    }

    // Attempt delivery through circuit breaker
    err := breaker.Execute(ctx, func(ctx context.Context) error {
        return e.attemptDelivery(ctx, msg)
    })

    if err != nil {
        // Handle failure based on error type
        if isPermanentError(err) {
            e.queue.Fail(ctx, msg.ID, err.Error())
        } else {
            e.queue.Retry(ctx, msg.ID, err)
        }
    }
}

Circuit Breaker Execution

// @filename: main.go
func (cb *CircuitBreaker) Execute(ctx context.Context, fn func(ctx context.Context) error) error {
    if err := cb.beforeRequest(); err != nil {
        return err
    }

    // Execute with a per-attempt timeout
    execCtx := ctx
    if cb.config.ExecutionTimeout > 0 {
        var cancel context.CancelFunc
        execCtx, cancel = context.WithTimeout(ctx, cb.config.ExecutionTimeout)
        defer cancel() // release the timer when Execute returns
    }

    err := fn(execCtx)
    cb.afterRequest(err)
    return err
}

Exponential Backoff with Jitter

Retry intervals prevent overwhelming recipient servers while ensuring eventual delivery.

Retry Schedule

| Attempt | Delay      | Jitter Range      |
|---------|------------|-------------------|
| 1       | 5 minutes  | 4m 30s - 5m 30s   |
| 2       | 15 minutes | 13m 30s - 16m 30s |
| 3       | 30 minutes | 27m - 33m         |
| 4       | 1 hour     | 54m - 66m         |
| 5       | 2 hours    | 1h 48m - 2h 12m   |
| 6       | 4 hours    | 3h 36m - 4h 24m   |
| 7       | 8 hours    | 7h 12m - 8h 48m   |
| 8       | 16 hours   | 14h 24m - 17h 36m |
| 9+      | 24 hours   | 21h 36m - 26h 24m |

Implementation

// @filename: handlers.go
func calculateNextRetry(attempts int) time.Time {
    intervals := []time.Duration{
        5 * time.Minute,
        15 * time.Minute,
        30 * time.Minute,
        1 * time.Hour,
        2 * time.Hour,
        4 * time.Hour,
        8 * time.Hour,
        16 * time.Hour,
        24 * time.Hour,
    }

    idx := attempts - 1
    if idx < 0 {
        idx = 0
    }
    if idx >= len(intervals) {
        idx = len(intervals) - 1
    }

    base := intervals[idx]

    // Add jitter: +/- 10% (total spread of base/5)
    jitterRange := int64(base / 5)
    if jitterRange > 0 {
        // rand is math/rand; randomizing avoids synchronized retries
        jitter := time.Duration(rand.Int63n(jitterRange)) - time.Duration(jitterRange/2)
        base += jitter
    }

    return time.Now().Add(base)
}

Retry Logic

// @filename: main.go
func (q *RedisQueue) Retry(ctx context.Context, msgID string, lastError error) error {
    msg, err := q.GetMessage(ctx, msgID)
    if err != nil {
        return err
    }

    msg.LastError = lastError.Error()

    // Check if we should give up
    if msg.Attempts >= msg.MaxAttempts {
        return q.Fail(ctx, msgID, "max attempts exceeded")
    }

    if time.Since(msg.CreatedAt) > q.config.RetryMaxAge {
        return q.Fail(ctx, msgID, "message expired")
    }

    // Calculate next retry time with exponential backoff + jitter
    msg.NextAttempt = calculateNextRetry(msg.Attempts)
    msg.Status = StatusDeferred

    pipe := q.client.TxPipeline()
    pipe.SRem(ctx, q.processingKey(), msgID)
    pipe.ZAdd(ctx, q.pendingKey(), redis.Z{
        Score:  float64(msg.NextAttempt.UnixNano()),
        Member: msgID,
    })
    if _, err := pipe.Exec(ctx); err != nil {
        return err
    }

    return nil
}

TLS Handling Strategy

Secure delivery with graceful fallback for misconfigured servers.

STARTTLS with Fallback

// @filename: main.go
func (e *Engine) deliverToHostWithTLS(ctx context.Context, addr, hostname string, msg *queue.Message, data []byte, tryTLS bool) error {
    conn, err := e.dialer.DialContext(ctx, "tcp", net.JoinHostPort(addr, "25"))
    if err != nil {
        return err
    }
    defer conn.Close()

    client, err := smtp.NewClient(conn, hostname)
    if err != nil {
        return err
    }
    defer client.Quit()

    if err := client.Hello(e.config.Hostname); err != nil {
        return err
    }

    // Try STARTTLS if enabled
    if tryTLS {
        if ok, _ := client.Extension("STARTTLS"); ok {
            tlsConfig := &tls.Config{
                ServerName:         hostname,
                InsecureSkipVerify: !e.config.VerifyTLS,
                MinVersion:         tls.VersionTLS12,
            }
            if err := client.StartTLS(tlsConfig); err != nil {
                if e.config.RequireTLS {
                    return fmt.Errorf("STARTTLS required but failed: %w", err)
                }
                // Graceful fallback for misconfigured servers
                e.logger.WarnContext(ctx, "STARTTLS failed, falling back to plaintext",
                    "host", hostname, "error", err.Error())
                // Close current connection and retry without TLS
                return e.deliverToHostWithTLS(ctx, addr, hostname, msg, data, false)
            }
        } else if e.config.RequireTLS {
            return fmt.Errorf("STARTTLS required but not supported")
        }
    }

    // Continue with delivery
    if err := client.Mail(msg.Sender); err != nil {
        return classifyError(err)
    }

    for _, rcpt := range msg.Recipients {
        if err := client.Rcpt(rcpt); err != nil {
            // Track failures per recipient
        }
    }

    w, err := client.Data()
    if err != nil {
        return classifyError(err)
    }
    if _, err := w.Write(data); err != nil {
        return err
    }
    if err := w.Close(); err != nil {
        return classifyError(err)
    }

    return nil
}

Relay Host Support

Smart routing for outbound delivery through smarthosts.

Relay Host Implementation

// @filename: main.go
func (e *Engine) attemptDelivery(ctx context.Context, msg *queue.Message) error {
    messageData, err := e.readAndSignMessage(ctx, msg)
    if err != nil {
        return err
    }

    // Use relay host if configured
    if e.config.RelayHost != "" {
        e.logger.DebugContext(ctx, "Using relay host", "relay", e.config.RelayHost)
        return e.deliverToRelay(ctx, msg, messageData)
    }

    // Resolve MX records and deliver directly
    mxHosts, err := e.mxResolver.LookupWithFallback(ctx, msg.Domain)
    if err != nil {
        return fmt.Errorf("MX lookup failed: %w", err)
    }

    // Try each MX host in preference order
    var lastErr error
    for _, mx := range mxHosts {
        for _, addr := range mx.Addresses {
            lastErr = e.deliverToHost(ctx, addr, mx.Host, msg, messageData)
            if lastErr == nil {
                return nil // Success
            }

            if isPermanentError(lastErr) {
                return lastErr
            }
        }
    }

    return fmt.Errorf("%w: %v", ErrAllMXFailed, lastErr)
}

Relay Delivery

// @filename: main.go
func (e *Engine) deliverToRelay(ctx context.Context, msg *queue.Message, data []byte) error {
    host, port, _ := net.SplitHostPort(e.config.RelayHost)
    if host == "" {
        host = e.config.RelayHost
        port = "25"
    }

    dialer := &net.Dialer{Timeout: e.config.ConnectTimeout}
    conn, err := dialer.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
    if err != nil {
        return fmt.Errorf("relay connection failed: %w", err)
    }
    defer conn.Close()

    client, err := smtp.NewClient(conn, host)
    if err != nil {
        return fmt.Errorf("relay handshake failed: %w", err)
    }
    defer client.Quit()

    if err := client.Hello(e.config.Hostname); err != nil {
        return err
    }
    if err := client.Mail(msg.Sender); err != nil {
        return classifyError(err)
    }

    for _, rcpt := range msg.Recipients {
        if err := client.Rcpt(rcpt); err != nil {
            // Track per-recipient failures
        }
    }

    w, err := client.Data()
    if err != nil {
        return err
    }
    if _, err := w.Write(data); err != nil {
        return err
    }
    if err := w.Close(); err != nil {
        return err
    }

    return nil
}

Defensive Coding: Nil Pointer Fix

Production systems must handle edge cases gracefully. A real-world bug demonstrated the importance of defensive programming.

The Fix (Commit 80e557b)

+    // Check circuit breaker for this domain
+    breaker := e.breakers.Get(msg.Domain)
+    if breaker == nil {
+        err := fmt.Errorf("invalid domain: %q", msg.Domain)
+        logger.ErrorContext(ctx, "No circuit breaker available (empty domain?), failing delivery", err)
+        e.queue.Fail(ctx, msg.ID, err.Error())
+        e.mu.Lock()
+        e.totalFailed++
+        e.mu.Unlock()
+        for _, rcpt := range msg.Recipients {
+            e.logDelivery(ctx, msg.ID, msg.Sender, rcpt, "rejected", 0, err.Error())
+        }
+        return
+    }
-    breaker := e.breakers.Get(msg.Domain)
     if breaker.State() == resilience.StateOpen {

Why This Matters

  • Empty Domain: Recipient domain extraction can fail, returning empty string
  • Registry Behavior: Get() returns nil for empty keys
  • Panic Consequence: Calling breaker.State() on nil causes panic
  • Production Impact: Worker crashes, message loss, service disruption

Defensive Programming Pattern

// @filename: main.go
// BreakerRegistry.Get returns nil for empty keys
func (r *BreakerRegistry) Get(key string) *CircuitBreaker {
    if key == "" {
        return nil // Intentional: invalid key
    }
    // ... rest of implementation
}

Metrics and Monitoring

Prometheus metrics provide visibility into delivery performance and health.

7 Key Metrics

| Metric                               | Type      | Purpose                         |
|--------------------------------------|-----------|---------------------------------|
| mailserver_messages_sent_total       | Counter   | Successfully delivered messages |
| mailserver_messages_queued_total     | Counter   | Messages queued for delivery    |
| mailserver_messages_rejected_total   | Counter   | Rejected messages with reason   |
| mailserver_messages_bounced_total    | Counter   | Bounced messages                |
| mailserver_delivery_duration_seconds | Histogram | Delivery latency distribution   |
| mailserver_delivery_retries_total    | Counter   | Total retry attempts            |
| mailserver_queue_depth               | Gauge     | Current queue size              |

Metrics Definition

// @filename: main.go
var (
    MessagesSent = promauto.NewCounter(prometheus.CounterOpts{
        Name: "mailserver_messages_sent_total",
        Help: "Total number of messages sent successfully",
    })

    MessagesQueued = promauto.NewCounter(prometheus.CounterOpts{
        Name: "mailserver_messages_queued_total",
        Help: "Total number of messages queued for delivery",
    })

    MessagesRejected = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "mailserver_messages_rejected_total",
        Help: "Total number of messages rejected",
    }, []string{"reason"})

    MessagesBounced = promauto.NewCounter(prometheus.CounterOpts{
        Name: "mailserver_messages_bounced_total",
        Help: "Total number of messages that bounced",
    })

    DeliveryDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "mailserver_delivery_duration_seconds",
        Help:    "Time taken to deliver messages",
        Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s to ~100s
    })

    DeliveryRetries = promauto.NewCounter(prometheus.CounterOpts{
        Name: "mailserver_delivery_retries_total",
        Help: "Total number of delivery retry attempts",
    })

    QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "mailserver_queue_depth",
        Help: "Current number of messages in the delivery queue",
    })
)

Recording Metrics

// @filename: main.go
func (e *Engine) deliverMessage(msg *queue.Message) {
    start := time.Now()

    // ... delivery logic from the sections above; err holds the outcome ...

    if err == nil {
        metrics.MessagesSent.Inc()
        metrics.DeliveryDuration.Observe(time.Since(start).Seconds())
    } else if isPermanentError(err) {
        metrics.MessagesRejected.WithLabelValues("permanent").Inc()
    } else {
        metrics.DeliveryRetries.Inc()
    }
}

Prometheus Endpoint

mux.Handle("/metrics", promhttp.Handler())

Access metrics at http://localhost:8080/metrics

Testing Approach

Comprehensive testing ensures reliability across various scenarios.

Configuration Tests

// @filename: handlers.go
func TestConfig_Defaults(t *testing.T) {
    cfg := DefaultConfig()

    if cfg.Workers != 4 {
        t.Errorf("Workers = %d, want 4", cfg.Workers)
    }
    if cfg.ConnectTimeout != 30*time.Second {
        t.Errorf("ConnectTimeout = %v, want 30s", cfg.ConnectTimeout)
    }
    if cfg.MaxMessageSize != 25*1024*1024 {
        t.Errorf("MaxMessageSize = %d, want 25MB", cfg.MaxMessageSize)
    }
}

Error Classification Tests

// @filename: handlers.go
func TestIsPermanentError(t *testing.T) {
    tests := []struct {
        name string
        err  error
        want bool
    }{
        {"550 user not found", errors.New("550 User not found"), true},
        {"421 try again", errors.New("421 Try again later"), false},
        {"connection timeout", errors.New("connection timeout"), false},
        {"ErrPermanentFailure", ErrPermanentFailure, true},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := isPermanentError(tt.err)
            if got != tt.want {
                t.Errorf("isPermanentError(%v) = %v, want %v", tt.err, got, tt.want)
            }
        })
    }
}

File Cleanup Tests

// @filename: handlers.go
func TestEngine_CleanupMessageFile(t *testing.T) {
    tmpDir := t.TempDir()

    e := &Engine{
        config: Config{QueuePath: tmpDir},
    }

    t.Run("cleanup existing file", func(t *testing.T) {
        path := filepath.Join(tmpDir, "test.eml")
        os.WriteFile(path, []byte("test"), 0644)

        err := e.cleanupMessageFile(path)
        if err != nil {
            t.Errorf("cleanupMessageFile() error = %v", err)
        }

        if _, err := os.Stat(path); !os.IsNotExist(err) {
            t.Error("File should have been deleted")
        }
    })

    t.Run("refuse cleanup outside queue path", func(t *testing.T) {
        otherDir := t.TempDir()
        path := filepath.Join(otherDir, "other.eml")
        os.WriteFile(path, []byte("test"), 0644)

        // Should refuse to delete (silently)
        err := e.cleanupMessageFile(path)
        if err != nil {
            t.Errorf("cleanupMessageFile() error = %v", err)
        }

        // File should still exist
        if _, err := os.Stat(path); os.IsNotExist(err) {
            t.Error("File outside queue path should NOT have been deleted")
        }
    })
}

Benchmark Tests

// @filename: handlers.go
func BenchmarkExtractDomain(b *testing.B) {
    email := "user@example.com"
    for i := 0; i < b.N; i++ {
        extractDomain(email)
    }
}

func BenchmarkIsPermanentError(b *testing.B) {
    err := errors.New("550 User not found")
    for i := 0; i < b.N; i++ {
        isPermanentError(err)
    }
}

Performance Tuning

Optimize the delivery engine for throughput and resource efficiency.

Worker Pool Sizing

| Queue Size | Recommended Workers | CPU Cores |
|------------|---------------------|-----------|
| < 100      | 2-4                 | 1-2       |
| 100-1000   | 4-8                 | 2-4       |
| 1000-10000 | 8-16                | 4-8       |
| > 10000    | 16+                 | 8+        |

// @filename: handlers.go
func DefaultConfig() Config {
    return Config{
        Workers:        4, // Adjust based on CPU cores
        ConnectTimeout: 30 * time.Second,
        CommandTimeout: 5 * time.Minute,
    }
}

Timeout Configuration

// @filename: main.go
type Config struct {
    Workers          int
    ConnectTimeout   time.Duration // 30s: TCP connection
    CommandTimeout   time.Duration // 5m: SMTP session
    ExecutionTimeout time.Duration // 2m: per-domain circuit breaker
}

Redis Optimization

// @filename: main.go
// Connection pool tuning
opts.PoolSize = 10                      // Max concurrent Redis connections
opts.MinIdleConns = 5                   // Keep at least 5 idle connections
opts.MaxIdleConns = 10                  // Max idle connections
opts.ConnMaxIdleTime = 5 * time.Minute  // Close idle connections after 5m
opts.ConnMaxLifetime = 30 * time.Minute // Max connection age

MX Cache Tuning

// @filename: handlers.go
func DefaultMXResolverConfig() MXResolverConfig {
    return MXResolverConfig{
        CacheTTL: 5 * time.Minute, // Balance between freshness and load
        Timeout:  10 * time.Second,  // DNS lookup timeout
    }
}

Production Deployment

Docker Configuration

# @filename: Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o email-server ./cmd/server

FROM alpine:latest
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=builder /app/email-server .
COPY --from=builder /app/config.yaml .
EXPOSE 25 8080 993 4190
CMD ["./email-server"]
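Assuming the image is built from the Dockerfile above, a local smoke test might look like the following; the image tag, container name, and `REDIS_URL` value are placeholders:

```shell
# Build the image from the multi-stage Dockerfile above
docker build -t email-server .

# Run it locally against a reachable Redis instance (placeholder URL)
docker run -d --name email-delivery \
  -e REDIS_URL=redis://redis:6379/0 \
  -p 25:25 -p 8080:8080 \
  email-server

# Delivery metrics are then available on the metrics port
curl -s http://localhost:8080/metrics | head
```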

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: email-delivery
spec:
  replicas: 3
  selector:
    matchLabels:
      app: email-delivery
  template:
    metadata:
      labels:
        app: email-delivery
    spec:
      containers:
        - name: delivery
          image: email-server:latest
          ports:
            - containerPort: 8080
          env:
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: redis-url
            - name: WORKER_COUNT
              value: '8'
            - name: RELAY_HOST
              value: 'smtp.relay.example.com:587'
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Monitoring Stack

# Prometheus scraping configuration
scrape_configs:
  - job_name: 'email-delivery'
    static_configs:
      - targets: ['email-delivery:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Alerting Rules

groups:
  - name: email_delivery
    rules:
      - alert: HighQueueDepth
        expr: mailserver_queue_depth > 1000
        for: 5m
        annotations:
          summary: 'Queue depth is too high'
      - alert: HighDeliveryLatency
        expr: histogram_quantile(0.95, rate(mailserver_delivery_duration_seconds_bucket[5m])) > 60
        for: 10m
        annotations:
          summary: '95th percentile delivery latency exceeds 60s'
      - alert: HighBounceRate
        expr: rate(mailserver_messages_bounced_total[5m]) / rate(mailserver_messages_sent_total[5m]) > 0.1
        for: 15m
        annotations:
          summary: 'Bounce rate exceeds 10%'

Key Takeaways

  1. Circuit breakers prevent cascading failures by isolating problematic domains
  2. Exponential backoff with jitter prevents thundering herds on retry
  3. Defensive programming handles edge cases like nil circuit breakers
  4. Comprehensive metrics provide visibility into system health
  5. Worker pools enable horizontal scaling for increased throughput
  6. TLS fallback ensures delivery to misconfigured servers while maintaining security
  7. MX caching reduces DNS load and improves delivery speed

This architecture has been battle-tested in production, handling thousands of daily email deliveries with high reliability and performance.
