Beyond Print Statements: Engineering Observable Systems in 2026
Stop guessing why your production systems are slow. Learn how to implement OpenTelemetry and structured logging to turn chaotic microservices into a transparent, debuggable ecosystem.

The 3 AM Epiphany
It was 3:14 AM on a Tuesday when the PagerDuty alert for 'Checkout-Service-Latency' fired. In the old days—circa 2018—we would have spent hours grepping through gigabytes of raw text files across four different clusters, trying to correlate a timestamp in the gateway to a cryptic error in the payment processor. Today, I clicked a single span in our Jaeger dashboard, saw a 2.4-second database lock in a downstream inventory service I didn't even know we were calling, and had a PR for the fix ready in 15 minutes. This isn't magic; it is the result of a deliberate shift toward observability-driven development.
Distributed systems are no longer the exception; they are the baseline. In 2026, even 'small' startups are running 20+ services across multi-region Kubernetes clusters or serverless environments. If you are still relying on fmt.Println or basic log.Info("something happened"), you are essentially flying a commercial jet in a storm without a radar. Observability isn't just a 'nice to have' feature for SREs; it is a core engineering discipline that separates teams that ship with confidence from teams that live in fear of their own production environment.
Structured Logging: Machine-Readable Breadcrumbs
Logging is not for humans. If you are writing logs primarily for a human to read in a terminal, you've already lost. In a modern stack, logs are data points intended for high-speed ingestion engines like ClickHouse, Vector, or ELK. Structured logging is the practice of treating every log entry as a structured object (usually JSON) rather than a string.
In 2026, we have finally moved past the 'log level' wars. The real value lies in metadata. Every log entry must include a trace_id, a service_name, an environment, and specific domain attributes like user_id or order_id. Why? Because searching for "failed to process order" is useless. Searching for order_id="7721" across all services involved in that order's lifecycle is transformative.
Implementation in Go 1.24+
Go's slog package, introduced in 1.21 and refined in subsequent versions, is now the industry standard. It provides a high-performance, structured logger that integrates natively with the context. Here is how we implement a production-grade logger that automatically extracts OpenTelemetry trace IDs.
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// OTelHandler wraps a slog.Handler to inject trace IDs into every log record.
type OTelHandler struct {
	slog.Handler
}

func (h *OTelHandler) Handle(ctx context.Context, r slog.Record) error {
	spanContext := trace.SpanContextFromContext(ctx)
	if spanContext.HasTraceID() {
		r.AddAttrs(slog.String("trace_id", spanContext.TraceID().String()))
	}
	if spanContext.HasSpanID() {
		r.AddAttrs(slog.String("span_id", spanContext.SpanID().String()))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup must re-wrap the derived handler; otherwise
// logger.With(...) silently returns the base handler and drops the
// trace-ID injection.
func (h *OTelHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &OTelHandler{h.Handler.WithAttrs(attrs)}
}

func (h *OTelHandler) WithGroup(name string) slog.Handler {
	return &OTelHandler{h.Handler.WithGroup(name)}
}

func main() {
	// Initialize a JSON handler with our OTel wrapper.
	baseHandler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	})
	logger := slog.New(&OTelHandler{baseHandler})
	slog.SetDefault(logger)

	ctx := context.Background()
	// In a real app, this ctx would carry a span from an OTel tracer.
	slog.InfoContext(ctx, "processing shipment", "shipment_id", "ship_9982", "carrier", "fedex")
}
Distributed Tracing: The Thread of Ariadne
If logs are the breadcrumbs, distributed tracing is the map. A trace follows a single request as it traverses through various services, databases, and message queues. Each 'hop' is represented as a Span.
The industry has converged on OpenTelemetry (OTel). Do not build custom tracing logic. Do not use proprietary SDKs if you can avoid them. OTel provides a vendor-neutral way to collect and export telemetry data. The hardest part of tracing isn't the data collection; it's Context Propagation. This is the act of passing the trace_parent header across service boundaries. If your HTTP client doesn't pass the header, the trace breaks, and you're back to guessing.
Context Propagation in Rust
Rust's tracing ecosystem is incredibly powerful but requires strict adherence to the Span lifecycle. Using tracing-opentelemetry and axum, we can ensure every request is tracked.
use axum::{routing::get, Router};
use opentelemetry::global;
use opentelemetry_sdk::propagation::TraceContextPropagator;
use tracing::{info, instrument};
use tracing_subscriber::prelude::*;

#[tokio::main]
async fn main() {
    // Set up the OTel propagator to handle W3C traceparent headers.
    global::set_text_map_propagator(TraceContextPropagator::new());

    // The pipeline needs an exporter; here, OTLP over gRPC to a local collector.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry_sdk::runtime::Tokio)
        .expect("Failed to initialize tracer");

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer().json())
        .with(telemetry)
        .init();

    let app = Router::new().route("/process", get(process_handler));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

#[instrument] // This macro automatically creates a span for the function
async fn process_handler() -> &'static str {
    info!(event = "data_lookup", database = "postgres", query_id = 42);
    "Success"
}
The Marriage of Logs and Traces
The most common mistake I see is teams treating logs and traces as separate silos. You have a logging dashboard in Datadog or Grafana, and a tracing dashboard in Jaeger or Honeycomb. This is a failure of architecture.
By injecting the trace_id into your structured logs (as shown in the Go example), you create a bidirectional link. When looking at a slow trace, you can instantly pull up all logs associated with that specific execution path. When looking at an error log, you can jump to the trace to see what happened 500ms before the error occurred in an entirely different service. This is 'Exemplar' based navigation, and it's the gold standard for high-performance teams.
The Gotchas: What the Docs Don't Tell You
- Cardinality Explosion: Never, ever use a high-cardinality value (like a user ID or a raw URL with query params) as a span name. Most tracing backends index span names. If you have 10 million users, you'll have 10 million unique span names, which will crash your indexing layer and cost you a fortune. Use static names like GET /users/:id and put the user_id in the attributes.
- Sampling is your friend (and enemy): You cannot afford to trace 100% of requests at 50,000 requests per second. Use head-based sampling for routine traffic (trace 1%) and tail-based sampling for errors or slow requests (trace 100% of anything taking > 500ms).
- Clock Skew: In a distributed environment, timestamps are lies. NTP helps, but spans might still appear to end before they start due to millisecond-level offsets. Always rely on the parent-child relationship in the trace metadata rather than absolute wall-clock time when calculating service-to-service latency.
- Propagating the Context: If you use a background worker (like Sidekiq, Celery, or Temporal), you must manually serialize the trace context and pass it into the job payload. If you don't, your background jobs will appear as 'orphaned' traces, disconnected from the user action that triggered them.
Takeaway
Stop adding more logs. Instead, add context. Your action item for today: Pick one service and ensure that the trace_id is present in every single log line emitted. Once you can correlate a log to a trace, the 'dark matter' of your distributed system will finally become visible. Don't wait for the next 3 AM outage to start building your radar.