Org Implications of Contemporary Observability Tooling
Why metrics, logs, and traces remain separate and what that means for your org.

Observability advice on the Internetz often boils down to “metrics for alerts, logs for troubleshooting, traces for causality.” But why do we even have these three separate pillars? In this essay, I unpack the storage and query constraints that shaped contemporary observability tooling and show how those technical details leak into developer experience and organisational design. If you're a senior dev, EM, platform engineer, or tech leader thinking about your org’s observability strategy, this is for you.
Observability, as it has come to be understood in the industry, is often described as having three pillars - metrics, logs, and traces. Vendor after vendor churns out best-practice guides on how to set up observability for your services. All advice on the Internetz tends to converge on:
Set up alerts on your metrics—this is your first line of defence to know something is wrong.
Use logs to troubleshoot.
Use traces to debug causality across service boundaries.
But why do we need three separate entities to make one system observable? The core reason is storage and query constraints. Tools for metrics, logs, and traces are designed to answer different questions about your system, and that drives differences in how the data is structured, stored, and queried.
Here’s a high-level overview of the storage requirements and their consequences -
Metrics
Storage is optimised to answer questions like “how many events occurred in a given time range?” over continuous time windows.
Example: “Continuously evaluate the rate of 500 status codes over a 5-minute window and raise an alert when it crosses a threshold.” Time series databases (TSDBs) excel at these write/read patterns; a minimal instrumentation sketch follows this list.
No strict requirements on ordered processing of metric samples. In a 5-minute window, it doesn’t matter if two samples arrive slightly out of order.
TSDBs face significant costs with high cardinality. Each new label value spawns a new time series, leading to expensive fan-out queries.
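To make the cardinality point concrete, here is a minimal Go sketch using the prometheus/client_golang library. The metric name, labels, and route are illustrative choices of mine, not something the tooling dictates; the point is that label values stay low-cardinality (route templates and status codes, never raw URLs or user IDs).

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequestsTotal counts requests by method, route template, and status.
// Keeping labels low-cardinality is what keeps the number of time series,
// and therefore query fan-out, under control.
//
// The 5-minute alert from the example above would then be a PromQL rule
// roughly of the form:
//   sum(rate(http_requests_total{status="500"}[5m])) > <threshold>
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests processed.",
	},
	[]string{"method", "route", "status"},
)

func ordersHandler(w http.ResponseWriter, r *http.Request) {
	// One increment per request; Prometheus scrapes the running totals.
	httpRequestsTotal.WithLabelValues(r.Method, "/orders/{id}", "200").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/orders/", ordersHandler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Note that the route label records the template (“/orders/{id}”), not the concrete path - a concrete path, like a user ID, would spawn a new time series per value.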
Logs
Storage is optimised for querying unstructured or semi-structured data.
Example: “Get me all logs with level=error, status code=500, and containing ‘nil pointer’ in the stack trace.” Document databases or text indexes are well-suited to this kind of query; a structured-logging sketch follows this list.
Ordering matters because logs are often used to reconstruct the exact sequence of events.
Storage must handle high-cardinality data by design.
Data volume is generally much higher than for metrics, regardless of architecture. For example, a single HTTP request might emit one metric sample but generate multiple logs (depending on implementation).
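As a contrast with the metrics sketch, here is a small structured-logging example using Go's standard log/slog package. The field names and values are illustrative assumptions on my part, not a prescribed schema; the point is that every field you might later filter on - level, status code, request ID - is emitted as a key/value pair rather than buried in free text.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

func main() {
	// A JSON handler keeps every field queryable as a key/value pair
	// instead of something to regex out of free text.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))

	err := errors.New("nil pointer dereference in order lookup")

	// Log lines routinely carry high-cardinality context (request ID, user
	// ID) that would explode a TSDB but is exactly what a log store indexes.
	logger.Error("failed to resolve order",
		slog.Int("status_code", 500),
		slog.String("request_id", "req-7f3a2c"),
		slog.String("user_id", "user-84126"),
		slog.Any("error", err),
	)
}
```

The level=error, status code=500, ‘nil pointer’ query from the example above then becomes a couple of field lookups plus a text match - the workload document stores and inverted indexes are built for. Note that request_id and user_id are exactly the high-cardinality values you would never put in a metric label.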
Traces
Storage is designed to make querying and visualising span trees efficient, which requires persisting parent-child relationships.
While log-trace correlation is a newer vendor feature, traces have traditionally carried dense diagnostic error information that would otherwise live in logs.
The volume of trace data can be extremely high. For example, the default OpenTelemetry GraphQL instrumentation emits a span for every field resolution - 10,000 clients asking for 50 different fields per minute generates a lot of spans.
This is why traces are typically sampled - retaining only the “interesting” traces, such as those with errors or high latency. A sketch of nested spans with head-based sampling follows this list.
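The sketch below, using the OpenTelemetry Go SDK, shows both ideas from this list: spans nested under a parent, and head-based sampling that keeps only a fraction of traces. The service and span names are made up and no exporter is configured, so treat it as the shape of the API rather than a working setup.

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Head-based sampling: keep roughly 10% of traces, but respect the
	// parent's decision so a trace is never half-kept across services.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("checkout-service")

	// Parent span for the incoming request...
	ctx, parent := tracer.Start(ctx, "POST /checkout")
	parent.SetAttributes(attribute.String("http.route", "/checkout"))

	// ...and a child span for a downstream call. This parent-child link is
	// the relationship trace storage is optimised to persist and query.
	_, child := tracer.Start(ctx, "payment-service.Charge")
	time.Sleep(10 * time.Millisecond) // stand-in for real work
	child.End()
	parent.End()
}
```

ParentBased wrapped around TraceIDRatioBased means the root span makes the sampling decision and children inherit it, so you never keep half a trace. Selecting only the “interesting” traces - errors, high latency - generally requires tail-based sampling in a collector, because that decision can only be made once the whole trace has been seen.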
This is why so much of the general advice focuses on configuring alerts on your metrics, keeping your logs structured, and sampling your traces.
Developer Cognitive Load
What does all of this mean for backend developers?
They need to understand metrics - typically via Prometheus - including counters, gauges, and histograms, how to query them, and their limitations. (Prometheus’s cardinality explosion problem has bitten even the most experienced platform engineers more times than they’d like to admit. I know it did for me. Is this really something you want every developer to grapple with, or should they be blissfully unaware?)
They need to understand logs - structured vs. unstructured logging, log levels, and how to query logs in vendor UIs like Elastic or Splunk.
They also need to understand tracing - including instrumentation and sampling strategies.
Organisational Implications
These technical design choices have organisational consequences.
How do you manage this complexity when you have 1500 teams? Is the expectation that every team masters these concepts?
To some extent, yes. Even if you concentrate observability expertise into a platform team and bake instrumentation into “golden path” templates, teams still need to understand that there are three distinct forms of telemetry - metrics, logs, and traces - and know which to use for which purpose.
That cognitive load can't be completely outsourced - something for hiring managers to keep in mind when staffing teams - just because observability is a platform capability doesn't mean your team has observability maturity.
On Unifying the Developer Experience
To me, the need for separate metrics, logs, and traces - each with its own tooling and learning curve - is a classic case of infrastructure implementation details, specifically storage constraints, leaking into the application development domain.
What I hope for is a more unified data model that reduces this developer cognitive burden.
No, OpenTelemetry doesn’t solve this problem. It standardises instrumentation and wire formats across vendors, but metrics, logs, and traces remain separate signals with separate data models.
The only example that comes close is Honeycomb’s “everything is an event” model. They've clearly invested in storage optimisations to keep query latency low. It’ll be interesting to see whether that approach truly reduces cognitive load for developers at scale.
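To make the “everything is an event” idea concrete, here is my own sketch of what a single wide, per-request event might look like: one record carrying the fields you would otherwise split across a metric sample, several log lines, and a span. The field names and IDs are invented for illustration - this is the general pattern, not Honeycomb’s actual schema or API.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// WideEvent is one record per unit of work. It carries what you would
// aggregate like a metric (status, duration), what you would search like a
// log (error detail, user ID), and what you would use to assemble a trace
// (trace and span IDs).
type WideEvent struct {
	Timestamp  time.Time `json:"timestamp"`
	Service    string    `json:"service"`
	Route      string    `json:"route"`
	StatusCode int       `json:"status_code"`
	DurationMS float64   `json:"duration_ms"`
	TraceID    string    `json:"trace_id"`
	SpanID     string    `json:"span_id"`
	UserID     string    `json:"user_id"`
	Error      string    `json:"error,omitempty"`
}

func main() {
	event := WideEvent{
		Timestamp:  time.Now().UTC(),
		Service:    "checkout-service",
		Route:      "/checkout",
		StatusCode: 500,
		DurationMS: 312.4,
		TraceID:    "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:     "00f067aa0ba902b7",
		UserID:     "user-84126",
		Error:      "nil pointer dereference in order lookup",
	}

	// One emit per request; aggregation, search, and trace assembly all
	// become queries over the same stream of events.
	_ = json.NewEncoder(os.Stdout).Encode(event)
}
```

The catch, and the reason the three pillars exist at all, is making both aggregate questions (“what’s the error rate over the last five minutes?”) and point lookups (“show me this one request and its children”) cheap over the same event stream - which is exactly the storage problem this essay started with.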
Expectation Setting
Despite having a dedicated observability platform team to build and maintain “golden path” instrumentation, organisations still need to invest in -
Ensuring product engineers understand the purpose and use of metrics, logs, and traces.
Providing training or documentation on querying and interpreting telemetry.
Bridging any observability knowledge gaps within teams so that engineers can effectively diagnose and resolve production issues.
In the end, observability isn't a tooling problem you can just buy your way out of. It's about making sure your people know what the hell they're looking at.