Building a Production-Ready Email Delivery Engine
Delivering email reliably at scale requires careful architecture and resilience patterns. This comprehensive guide covers building a production-ready email delivery engine, drawing from real production experience with an email server that processes thousands of messages daily.
Table of Contents
- Delivery Architecture Overview
- Worker Pool and Redis Queue
- MX Resolution with Caching
- Circuit Breaker Pattern
- Exponential Backoff with Jitter
- TLS Handling Strategy
- Relay Host Support
- Defensive Coding: Nil Pointer Fix
- Metrics and Monitoring
- Testing Approach
- Performance Tuning
- Production Deployment
Delivery Architecture Overview
The delivery engine is built around a worker pool pattern with Redis as the message queue, providing horizontal scalability and fault tolerance.
graph TD
A[Incoming Email] --> B[Redis Queue]
B --> C[Worker 1]
B --> D[Worker 2]
B --> E[Worker 3]
B --> F[Worker 4]
C --> G[Circuit Breaker per Domain]
D --> G
E --> G
F --> G
G --> H[MX Resolver]
H --> I[Relay Host / Direct SMTP]
I --> J[Recipient Server]
K[Recovery Worker] --> B
L[Monitoring] --> M[Prometheus Metrics]
Core Components
| Component | Purpose | Implementation |
|---|---|---|
| Worker Pool | Concurrent delivery processing | 4 workers (configurable) |
| Redis Queue | Message persistence and scheduling | Sorted sets with retry times |
| Circuit Breaker | Per-domain resilience | 5 failures → open, 5min timeout |
| MX Resolver | DNS lookup for recipient domains | 5-minute TTL cache |
| TLS Handler | Secure delivery with fallback | STARTTLS with graceful degradation |
Worker Pool and Redis Queue
The worker pool model provides parallelism while Redis ensures durability across restarts.
Redis Queue Implementation
The queue uses Redis sorted sets with timestamps as scores, enabling efficient retrieval of ready messages.
// @filename: main.go
// Message structure stored in Redis
type Message struct {
ID string `json:"id"`
Sender string `json:"sender"`
Recipients []string `json:"recipients"`
MessagePath string `json:"message_path"`
Size int64 `json:"size"`
Attempts int `json:"attempts"`
MaxAttempts int `json:"max_attempts"`
LastAttempt time.Time `json:"last_attempt,omitempty"`
NextAttempt time.Time `json:"next_attempt"`
Domain string `json:"domain"` // For circuit breaker
Status Status `json:"status"`
LastError string `json:"last_error,omitempty"`
CreatedAt time.Time `json:"created_at"`
}
// Dequeue: Atomically move message from pending to processing
func (q *RedisQueue) Dequeue(ctx context.Context) (*Message, error) {
now := float64(time.Now().UnixNano())
// Get messages ready for delivery
results, err := q.client.ZRangeByScoreWithScores(ctx, q.pendingKey(), &redis.ZRangeBy{
Min: "-inf",
Max: fmt.Sprintf("%f", now),
Count: 1,
}).Result()
if err != nil {
return nil, err
}
if len(results) == 0 {
return nil, nil // No messages ready
}
msgID := results[0].Member.(string)
// Atomic transaction: move from pending to processing
pipe := q.client.TxPipeline()
pipe.ZRem(ctx, q.pendingKey(), msgID)
pipe.SAdd(ctx, q.processingKey(), msgID)
if _, err := pipe.Exec(ctx); err != nil {
return nil, err
}
return q.GetMessage(ctx, msgID)
}
Worker Pool Architecture
Each worker is a goroutine that continuously polls the queue and delivers messages.
// @filename: main.go
// Worker: Delivery goroutine
func (e *Engine) worker(id int) {
defer e.wg.Done()
for {
select {
case <-e.ctx.Done():
return
default:
}
msg, err := e.queue.Dequeue(e.ctx)
if err != nil {
e.logger.Error("dequeue failed", "error", err)
time.Sleep(time.Second) // Back off on queue errors
continue
}
if msg == nil {
time.Sleep(500 * time.Millisecond) // Back off when empty
continue
}
e.deliverMessage(msg)
}
}
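The pool itself is just a context, a WaitGroup, and N goroutines. A minimal, self-contained sketch of the pattern (the in-memory channel stands in for the Redis queue, and the Pool type and its fields are illustrative, not from the source):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// Pool is a minimal worker-pool skeleton; the real Engine polls
// Redis instead of reading from an in-memory channel.
type Pool struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
	jobs   chan string
	done   atomic.Int64
}

func NewPool(workers int) *Pool {
	ctx, cancel := context.WithCancel(context.Background())
	p := &Pool{ctx: ctx, cancel: cancel, jobs: make(chan string, 64)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go p.worker(i)
	}
	return p
}

func (p *Pool) worker(id int) {
	defer p.wg.Done()
	for {
		select {
		case <-p.ctx.Done():
			return
		case msg, ok := <-p.jobs:
			if !ok {
				return // queue closed and drained
			}
			_ = msg // deliver the message here
			p.done.Add(1)
		}
	}
}

// Stop drains remaining jobs, waits for workers, then cancels the context.
func (p *Pool) Stop() {
	close(p.jobs)
	p.wg.Wait()
	p.cancel()
}

func main() {
	p := NewPool(4)
	for i := 0; i < 100; i++ {
		p.jobs <- fmt.Sprintf("msg-%d", i)
	}
	p.Stop()
	fmt.Println(p.done.Load()) // 100
}
```

Because shutdown closes the channel before cancelling the context, in-flight messages are drained rather than dropped.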
Redis Connection Pooling
Configure Redis for production reliability with proper connection pooling:
// @filename: main.go
opts, err := redis.ParseURL(cfg.RedisURL)
if err != nil {
log.Fatalf("invalid redis URL: %v", err)
}
opts.MaxRetries = 3
opts.MinRetryBackoff = 100 * time.Millisecond
opts.MaxRetryBackoff = 1 * time.Second
opts.DialTimeout = 5 * time.Second
opts.ReadTimeout = 3 * time.Second
opts.WriteTimeout = 3 * time.Second
opts.PoolSize = 10
opts.MinIdleConns = 5
opts.MaxIdleConns = 10
opts.ConnMaxIdleTime = 5 * time.Minute
opts.ConnMaxLifetime = 30 * time.Minute
MX Resolution with Caching
Efficient DNS resolution reduces load and improves delivery speed.
MX Resolver Implementation
// @filename: main.go
type MXResolver struct {
cache sync.Map // domain -> *cachedMX
resolver *net.Resolver
ttl time.Duration
}
type cachedMX struct {
records []MXRecord
expiresAt time.Time
}
type MXRecord struct {
Host string
Preference uint16
}
// Lookup with caching
func (r *MXResolver) Lookup(ctx context.Context, domain string) ([]MXRecord, error) {
// Check cache first
if cached, ok := r.cache.Load(domain); ok {
c := cached.(*cachedMX)
if time.Now().Before(c.expiresAt) {
return c.records, nil
}
r.cache.Delete(domain)
}
// Perform DNS lookup
records, err := r.lookupMX(ctx, domain)
if err != nil {
return nil, err
}
// Cache results
expiresAt := time.Now().Add(r.ttl)
r.cache.Store(domain, &cachedMX{
records: records,
expiresAt: expiresAt,
})
return records, nil
}
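Before results are cached, they should be ordered by preference so delivery tries the lowest-preference (most preferred) host first, per RFC 5321. A minimal sketch of that ordering step (sortMX is an illustrative helper name, not from the source):

```go
package main

import (
	"fmt"
	"sort"
)

type MXRecord struct {
	Host       string
	Preference uint16
}

// sortMX orders records lowest-preference first; stable sort
// preserves DNS ordering among equal-preference hosts.
func sortMX(records []MXRecord) {
	sort.SliceStable(records, func(i, j int) bool {
		return records[i].Preference < records[j].Preference
	})
}

func main() {
	recs := []MXRecord{
		{Host: "backup.example.com", Preference: 20},
		{Host: "mx1.example.com", Preference: 10},
		{Host: "mx2.example.com", Preference: 10},
	}
	sortMX(recs)
	fmt.Println(recs[0].Host) // mx1.example.com
}
```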
A Record Fallback
Per RFC 5321, if no MX records exist, the domain’s A record is used as a fallback:
// @filename: main.go
func (r *MXResolver) lookupAFallback(ctx context.Context, domain string) ([]MXRecord, error) {
addrs, err := r.resolver.LookupHost(ctx, domain)
if err != nil || len(addrs) == 0 {
return nil, ErrNoMXRecords
}
return []MXRecord{
{
Host: domain,
Preference: 0,
},
}, nil
}
SSRF Protection
Filter out private/reserved IP addresses to prevent Server-Side Request Forgery:
// @filename: handlers.go
func isPrivateIP(ipStr string) bool {
ip := net.ParseIP(ipStr)
if ip == nil {
return true
}
if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
return true
}
reservedRanges := []string{
"100.64.0.0/10", // Carrier-grade NAT
"192.0.2.0/24", // TEST-NET-1
"198.51.100.0/24", // TEST-NET-2
"203.0.113.0/24", // TEST-NET-3
}
for _, cidr := range reservedRanges {
_, network, _ := net.ParseCIDR(cidr)
if network.Contains(ip) {
return true
}
}
return false
}
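The check is applied to every resolved address before dialing, so a hostile MX record cannot steer connections into the internal network. A runnable sketch of that filtering step (filterPublic is an illustrative helper; the reserved-CIDR list is trimmed to the core checks for brevity):

```go
package main

import (
	"fmt"
	"net"
)

// isPrivateIP mirrors the check above: reject unparseable, loopback,
// RFC 1918 private, and link-local addresses.
func isPrivateIP(ipStr string) bool {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return true // fail closed on garbage input
	}
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast()
}

// filterPublic drops private/reserved addresses so only routable
// destinations are ever dialed.
func filterPublic(addrs []string) []string {
	out := make([]string, 0, len(addrs))
	for _, a := range addrs {
		if !isPrivateIP(a) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	fmt.Println(filterPublic([]string{"127.0.0.1", "10.1.2.3", "93.184.216.34"}))
}
```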
Circuit Breaker Pattern
Circuit breakers prevent cascading failures by blocking requests to problematic domains.
Circuit Breaker States
stateDiagram-v2
[*] --> Closed
Closed --> Open: 5 consecutive failures
Open --> HalfOpen: 5 min timeout
HalfOpen --> Closed: 2 successes
HalfOpen --> Open: 1 failure
Per-Domain Circuit Breaker
Each recipient domain gets its own circuit breaker:
// @filename: main.go
type CircuitBreaker struct {
config Config
state int32 // atomic State
failureCount int64
successCount int64
lastFailureTime int64
}
type State int32
const (
StateClosed State = iota
StateOpen
StateHalfOpen
)
Configuration
// @filename: main.go
breakers: resilience.NewBreakerRegistry(func(key string) resilience.Config {
return resilience.Config{
Name: "smtp:" + key,
FailureThreshold: 5, // Open after 5 failures
SuccessThreshold: 2, // Close after 2 successes
Timeout: 5 * time.Minute, // 5 min to half-open
HalfOpenMaxCalls: 2, // Max 2 calls in half-open
ExecutionTimeout: 2 * time.Minute, // 2 min per attempt
}
})
Execution Through Circuit Breaker
// @filename: main.go
func (e *Engine) deliverMessage(msg *queue.Message) {
ctx := e.ctx
breaker := e.breakers.Get(msg.Domain)
if breaker == nil {
// Defensive: Handle nil breaker (bug prevention)
err := fmt.Errorf("invalid domain: %q", msg.Domain)
e.queue.Fail(ctx, msg.ID, err.Error())
return
}
if breaker.State() == resilience.StateOpen {
// Circuit is open, defer delivery
e.queue.Retry(ctx, msg.ID, ErrCircuitOpen)
return
}
// Attempt delivery through circuit breaker
err := breaker.Execute(ctx, func(ctx context.Context) error {
return e.attemptDelivery(ctx, msg)
})
if err != nil {
// Handle failure based on error type
if isPermanentError(err) {
e.queue.Fail(ctx, msg.ID, err.Error())
} else {
e.queue.Retry(ctx, msg.ID, err)
}
}
}
Circuit Breaker Execution
// @filename: main.go
func (cb *CircuitBreaker) Execute(ctx context.Context, fn func(ctx context.Context) error) error {
if err := cb.beforeRequest(); err != nil {
return err
}
// Execute with timeout and panic recovery
execCtx := ctx
if cb.config.ExecutionTimeout > 0 {
var cancel context.CancelFunc
execCtx, cancel = context.WithTimeout(ctx, cb.config.ExecutionTimeout)
defer cancel() // Release the timer once the call returns
}
err := fn(execCtx)
cb.afterRequest(err)
return err
}
Exponential Backoff with Jitter
Retry intervals prevent overwhelming recipient servers while ensuring eventual delivery.
Retry Schedule
| Attempt | Delay | Jitter Range |
|---|---|---|
| 1 | 5 minutes | 4m 30s - 5m 30s |
| 2 | 15 minutes | 13m 30s - 16m 30s |
| 3 | 30 minutes | 27m - 33m |
| 4 | 1 hour | 54m - 66m |
| 5 | 2 hours | 1h 48m - 2h 12m |
| 6 | 4 hours | 3h 36m - 4h 24m |
| 7 | 8 hours | 7h 12m - 8h 48m |
| 8 | 16 hours | 14h 24m - 17h 36m |
| 9+ | 24 hours | 21h 36m - 26h 24m |
Implementation
// @filename: handlers.go
func calculateNextRetry(attempts int) time.Time {
intervals := []time.Duration{
5 * time.Minute,
15 * time.Minute,
30 * time.Minute,
1 * time.Hour,
2 * time.Hour,
4 * time.Hour,
8 * time.Hour,
16 * time.Hour,
24 * time.Hour,
}
idx := attempts - 1
if idx < 0 {
idx = 0
}
if idx >= len(intervals) {
idx = len(intervals) - 1
}
base := intervals[idx]
// Add jitter: +/- 10% of the base interval (matches the table above)
jitterRange := int64(base / 5) // full jitter window is 20% of base
if jitterRange > 0 {
jitter := time.Duration(rand.Int63n(jitterRange) - jitterRange/2)
base += jitter
}
return time.Now().Add(base)
}
Retry Logic
// @filename: main.go
func (q *RedisQueue) Retry(ctx context.Context, msgID string, lastError error) error {
msg, err := q.GetMessage(ctx, msgID)
if err != nil {
return err
}
msg.Attempts++
msg.LastAttempt = time.Now()
msg.LastError = lastError.Error()
// Check if we should give up
if msg.Attempts >= msg.MaxAttempts {
return q.Fail(ctx, msgID, "max attempts exceeded")
}
if time.Since(msg.CreatedAt) > q.config.RetryMaxAge {
return q.Fail(ctx, msgID, "message expired")
}
// Calculate next retry time with exponential backoff + jitter
msg.NextAttempt = calculateNextRetry(msg.Attempts)
msg.Status = StatusDeferred
// Persist the updated message and reschedule atomically
data, _ := json.Marshal(msg)
pipe := q.client.TxPipeline()
pipe.Set(ctx, q.messageKey(msgID), data, 0) // messageKey mirrors pendingKey/processingKey
pipe.SRem(ctx, q.processingKey(), msgID)
pipe.ZAdd(ctx, q.pendingKey(), redis.Z{
Score: float64(msg.NextAttempt.UnixNano()),
Member: msgID,
})
if _, err := pipe.Exec(ctx); err != nil {
return err
}
return nil
}
TLS Handling Strategy
Secure delivery with graceful fallback for misconfigured servers.
STARTTLS with Fallback
// @filename: main.go
func (e *Engine) deliverToHostWithTLS(ctx context.Context, addr, hostname string, msg *queue.Message, data []byte, tryTLS bool) error {
conn, err := e.dialer.DialContext(ctx, "tcp", net.JoinHostPort(addr, "25"))
if err != nil {
return err
}
defer conn.Close()
client, err := smtp.NewClient(conn, hostname)
if err != nil {
return err
}
defer client.Quit()
if err := client.Hello(e.config.Hostname); err != nil {
return err
}
// Try STARTTLS if enabled
if tryTLS {
if ok, _ := client.Extension("STARTTLS"); ok {
tlsConfig := &tls.Config{
ServerName: hostname,
InsecureSkipVerify: !e.config.VerifyTLS,
MinVersion: tls.VersionTLS12,
}
if err := client.StartTLS(tlsConfig); err != nil {
if e.config.RequireTLS {
return fmt.Errorf("STARTTLS required but failed: %w", err)
}
// Graceful fallback for misconfigured servers
e.logger.WarnContext(ctx, "STARTTLS failed, falling back to plaintext",
"host", hostname, "error", err.Error())
// Close current connection and retry without TLS
return e.deliverToHostWithTLS(ctx, addr, hostname, msg, data, false)
}
} else if e.config.RequireTLS {
return fmt.Errorf("STARTTLS required but not supported")
}
}
// Continue with delivery
if err := client.Mail(msg.Sender); err != nil {
return classifyError(err)
}
for _, rcpt := range msg.Recipients {
if err := client.Rcpt(rcpt); err != nil {
// Track failures per recipient
}
}
w, err := client.Data()
if err != nil {
return classifyError(err)
}
if _, err := w.Write(data); err != nil {
return err
}
return w.Close()
}
Relay Host Support
Smart routing for outbound delivery through smarthosts.
Relay Host Implementation
// @filename: main.go
func (e *Engine) attemptDelivery(ctx context.Context, msg *queue.Message) error {
messageData, err := e.readAndSignMessage(ctx, msg)
if err != nil {
return err
}
// Use relay host if configured
if e.config.RelayHost != "" {
e.logger.DebugContext(ctx, "Using relay host", "relay", e.config.RelayHost)
return e.deliverToRelay(ctx, msg, messageData)
}
// Resolve MX records and deliver directly
mxHosts, err := e.mxResolver.LookupWithFallback(ctx, msg.Domain)
if err != nil {
return fmt.Errorf("MX lookup failed: %w", err)
}
// Try each MX host in preference order
var lastErr error
for _, mx := range mxHosts {
for _, addr := range mx.Addresses {
lastErr = e.deliverToHost(ctx, addr, mx.Host, msg, messageData)
if lastErr == nil {
return nil // Success
}
if isPermanentError(lastErr) {
return lastErr
}
}
}
return fmt.Errorf("%w: %v", ErrAllMXFailed, lastErr)
}
Relay Delivery
// @filename: main.go
func (e *Engine) deliverToRelay(ctx context.Context, msg *queue.Message, data []byte) error {
host, port, err := net.SplitHostPort(e.config.RelayHost)
if err != nil {
// No port specified; default to 25
host = e.config.RelayHost
port = "25"
}
dialer := &net.Dialer{Timeout: e.config.ConnectTimeout}
conn, err := dialer.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
if err != nil {
return fmt.Errorf("relay connection failed: %w", err)
}
defer conn.Close()
client, err := smtp.NewClient(conn, host)
if err != nil {
return fmt.Errorf("relay handshake failed: %w", err)
}
defer client.Quit()
if err := client.Hello(e.config.Hostname); err != nil {
return err
}
if err := client.Mail(msg.Sender); err != nil {
return classifyError(err)
}
for _, rcpt := range msg.Recipients {
if err := client.Rcpt(rcpt); err != nil {
// Track per-recipient failures
}
}
w, err := client.Data()
if err != nil {
return classifyError(err)
}
if _, err := w.Write(data); err != nil {
return err
}
return w.Close()
}
Defensive Coding: Nil Pointer Fix
Production systems must handle edge cases gracefully. A real-world bug demonstrated the importance of defensive programming.
The Bug (Commit 80e557b)
+ // Check circuit breaker for this domain
+ breaker := e.breakers.Get(msg.Domain)
+ if breaker == nil {
+ err := fmt.Errorf("invalid domain: %q", msg.Domain)
+ logger.ErrorContext(ctx, "No circuit breaker available (empty domain?), failing delivery", err)
+ e.queue.Fail(ctx, msg.ID, err.Error())
+ e.mu.Lock()
+ e.totalFailed++
+ e.mu.Unlock()
+ for _, rcpt := range msg.Recipients {
+ e.logDelivery(ctx, msg.ID, msg.Sender, rcpt, "rejected", 0, err.Error())
+ }
+ return
+ }
- breaker := e.breakers.Get(msg.Domain)
if breaker.State() == resilience.StateOpen {
Why This Matters
- Empty Domain: Recipient domain extraction can fail, returning an empty string
- Registry Behavior: Get() returns nil for empty keys
- Panic Consequence: Calling breaker.State() on a nil breaker causes a panic
- Production Impact: Worker crashes, message loss, service disruption
Defensive Programming Pattern
// @filename: main.go
// BreakerRegistry.Get returns nil for empty keys
func (r *BreakerRegistry) Get(key string) *CircuitBreaker {
if key == "" {
return nil // Intentional: invalid key
}
// ... rest of implementation
}
Metrics and Monitoring
Prometheus metrics provide visibility into delivery performance and health.
7 Key Metrics
| Metric | Type | Purpose |
|---|---|---|
| mailserver_messages_sent_total | Counter | Successfully delivered messages |
| mailserver_messages_queued_total | Counter | Messages queued for delivery |
| mailserver_messages_rejected_total | Counter | Rejected messages with reason |
| mailserver_messages_bounced_total | Counter | Bounced messages |
| mailserver_delivery_duration_seconds | Histogram | Delivery latency distribution |
| mailserver_delivery_retries_total | Counter | Total retry attempts |
| mailserver_queue_depth | Gauge | Current queue size |
Metrics Definition
// @filename: main.go
var (
MessagesSent = promauto.NewCounter(prometheus.CounterOpts{
Name: "mailserver_messages_sent_total",
Help: "Total number of messages sent successfully",
})
MessagesQueued = promauto.NewCounter(prometheus.CounterOpts{
Name: "mailserver_messages_queued_total",
Help: "Total number of messages queued for delivery",
})
MessagesRejected = promauto.NewCounterVec(prometheus.CounterOpts{
Name: "mailserver_messages_rejected_total",
Help: "Total number of messages rejected",
}, []string{"reason"})
MessagesBounced = promauto.NewCounter(prometheus.CounterOpts{
Name: "mailserver_messages_bounced_total",
Help: "Total number of messages that bounced",
})
DeliveryDuration = promauto.NewHistogram(prometheus.HistogramOpts{
Name: "mailserver_delivery_duration_seconds",
Help: "Time taken to deliver messages",
Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s to ~100s
})
DeliveryRetries = promauto.NewCounter(prometheus.CounterOpts{
Name: "mailserver_delivery_retries_total",
Help: "Total number of delivery retry attempts",
})
QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
Name: "mailserver_queue_depth",
Help: "Current number of messages in the delivery queue",
})
)
Recording Metrics
// @filename: main.go
func (e *Engine) deliverMessage(msg *queue.Message) {
start := time.Now()
err := e.attemptDelivery(e.ctx, msg)
if err == nil {
metrics.MessagesSent.Inc()
metrics.DeliveryDuration.Observe(time.Since(start).Seconds())
} else if isPermanentError(err) {
metrics.MessagesRejected.WithLabelValues("permanent").Inc()
} else {
metrics.DeliveryRetries.Inc()
}
}
Prometheus Endpoint
mux.Handle("/metrics", promhttp.Handler())
Access metrics at http://localhost:8080/metrics
Testing Approach
Comprehensive testing ensures reliability across various scenarios.
Configuration Tests
// @filename: handlers.go
func TestConfig_Defaults(t *testing.T) {
cfg := DefaultConfig()
if cfg.Workers != 4 {
t.Errorf("Workers = %d, want 4", cfg.Workers)
}
if cfg.ConnectTimeout != 30*time.Second {
t.Errorf("ConnectTimeout = %v, want 30s", cfg.ConnectTimeout)
}
if cfg.MaxMessageSize != 25*1024*1024 {
t.Errorf("MaxMessageSize = %d, want 25MB", cfg.MaxMessageSize)
}
}
Error Classification Tests
// @filename: handlers.go
func TestIsPermanentError(t *testing.T) {
tests := []struct {
name string
err error
want bool
}{
{"550 user not found", errors.New("550 User not found"), true},
{"421 try again", errors.New("421 Try again later"), false},
{"connection timeout", errors.New("connection timeout"), false},
{"ErrPermanentFailure", ErrPermanentFailure, true},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := isPermanentError(tt.err)
if got != tt.want {
t.Errorf("isPermanentError(%v) = %v, want %v", tt.err, got, tt.want)
}
})
}
}
File Cleanup Tests
// @filename: handlers.go
func TestEngine_CleanupMessageFile(t *testing.T) {
tmpDir := t.TempDir()
e := &Engine{
config: Config{QueuePath: tmpDir},
}
t.Run("cleanup existing file", func(t *testing.T) {
path := filepath.Join(tmpDir, "test.eml")
os.WriteFile(path, []byte("test"), 0644)
err := e.cleanupMessageFile(path)
if err != nil {
t.Errorf("cleanupMessageFile() error = %v", err)
}
if _, err := os.Stat(path); !os.IsNotExist(err) {
t.Error("File should have been deleted")
}
})
t.Run("refuse cleanup outside queue path", func(t *testing.T) {
otherDir := t.TempDir()
path := filepath.Join(otherDir, "other.eml")
os.WriteFile(path, []byte("test"), 0644)
// Should refuse to delete (silently)
err := e.cleanupMessageFile(path)
if err != nil {
t.Errorf("cleanupMessageFile() error = %v", err)
}
// File should still exist
if _, err := os.Stat(path); os.IsNotExist(err) {
t.Error("File outside queue path should NOT have been deleted")
}
})
}
Benchmark Tests
// @filename: handlers.go
func BenchmarkExtractDomain(b *testing.B) {
email := "user@example.com"
for i := 0; i < b.N; i++ {
extractDomain(email)
}
}
func BenchmarkIsPermanentError(b *testing.B) {
err := errors.New("550 User not found")
for i := 0; i < b.N; i++ {
isPermanentError(err)
}
}
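The extractDomain helper benchmarked above is also the function whose empty-string result triggered the nil-breaker bug. A plausible sketch of it (the source's exact implementation is not shown):

```go
package main

import (
	"fmt"
	"strings"
)

// extractDomain returns the lower-cased domain part of an address,
// or "" when there is no usable @ — the empty string that made the
// nil-breaker guard necessary.
func extractDomain(addr string) string {
	i := strings.LastIndexByte(addr, '@')
	if i < 0 || i == len(addr)-1 {
		return ""
	}
	return strings.ToLower(addr[i+1:])
}

func main() {
	fmt.Println(extractDomain("User@Example.COM")) // example.com
	fmt.Println(extractDomain("no-at-sign") == "") // true
}
```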
Performance Tuning
Optimize the delivery engine for throughput and resource efficiency.
Worker Pool Sizing
| Queue Size | Recommended Workers | CPU Cores |
|---|---|---|
| < 100 | 2-4 | 1-2 |
| 100-1000 | 4-8 | 2-4 |
| 1000-10000 | 8-16 | 4-8 |
| > 10000 | 16+ | 8+ |
// @filename: handlers.go
func DefaultConfig() Config {
return Config{
Workers: 4, // Adjust based on CPU cores
ConnectTimeout: 30 * time.Second,
CommandTimeout: 5 * time.Minute,
}
}
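Since delivery is I/O-bound rather than CPU-bound, a reasonable starting point is to derive the worker count from available cores and tune from there. A small heuristic sketch (defaultWorkers is an assumption, not from the source):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultWorkers picks a starting worker count: 2x cores, floor of 2,
// since SMTP delivery spends most of its time waiting on the network.
func defaultWorkers() int {
	n := runtime.NumCPU() * 2
	if n < 2 {
		n = 2
	}
	return n
}

func main() {
	fmt.Println(defaultWorkers() >= 2) // true
}
```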
Timeout Configuration
// @filename: main.go
type Config struct {
Workers int
ConnectTimeout time.Duration // 30s: TCP connection
CommandTimeout time.Duration // 5m: SMTP session
ExecutionTimeout time.Duration // 2m: Per-domain circuit breaker
}
Redis Optimization
// @filename: main.go
// Connection pool tuning
opts.PoolSize = 10 // Max concurrent Redis connections
opts.MinIdleConns = 5 // Keep at least 5 idle connections
opts.MaxIdleConns = 10 // Max idle connections
opts.ConnMaxIdleTime = 5 * time.Minute // Close idle after 5m
opts.ConnMaxLifetime = 30 * time.Minute // Max connection age
MX Cache Tuning
// @filename: handlers.go
func DefaultMXResolverConfig() MXResolverConfig {
return MXResolverConfig{
CacheTTL: 5 * time.Minute, // Balance between freshness and load
Timeout: 10 * time.Second, // DNS lookup timeout
}
}
Production Deployment
Docker Configuration
# @filename: Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o email-server ./cmd/server
FROM alpine:latest
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=builder /app/email-server .
COPY --from=builder /app/config.yaml .
EXPOSE 25 8080 993 4190
CMD ["./email-server"]
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: email-delivery
spec:
replicas: 3
selector:
matchLabels:
app: email-delivery
template:
metadata:
labels:
app: email-delivery
spec:
containers:
- name: delivery
image: email-server:latest
ports:
- containerPort: 8080
env:
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: email-secrets
key: redis-url
- name: WORKER_COUNT
value: '8'
- name: RELAY_HOST
value: 'smtp.relay.example.com:587'
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
Monitoring Stack
# Prometheus scraping configuration
scrape_configs:
- job_name: 'email-delivery'
static_configs:
- targets: ['email-delivery:8080']
metrics_path: '/metrics'
scrape_interval: 15s
Alerting Rules
groups:
- name: email_delivery
rules:
- alert: HighQueueDepth
expr: mailserver_queue_depth > 1000
for: 5m
annotations:
summary: 'Queue depth is too high'
- alert: HighDeliveryLatency
expr: histogram_quantile(0.95, rate(mailserver_delivery_duration_seconds_bucket[5m])) > 60
for: 10m
annotations:
summary: '95th percentile delivery latency exceeds 60s'
- alert: HighBounceRate
expr: rate(mailserver_messages_bounced_total[5m]) / rate(mailserver_messages_sent_total[5m]) > 0.1
for: 15m
annotations:
summary: 'Bounce rate exceeds 10%'
Key Takeaways
- Circuit breakers prevent cascading failures by isolating problematic domains
- Exponential backoff with jitter prevents thundering herds on retry
- Defensive programming handles edge cases like nil circuit breakers
- Comprehensive metrics provide visibility into system health
- Worker pools enable horizontal scaling for increased throughput
- TLS fallback ensures delivery to misconfigured servers while maintaining security
- MX caching reduces DNS load and improves delivery speed
This architecture has been battle-tested in production, handling thousands of daily email deliveries with high reliability and performance.
