Do you know your unknown-unknowns?
In the world of Observability, the discussion of “unknown-unknowns” is commonplace. However, having accumulated a certain number of grey hairs, I remember this phrase popping up before in rather more infamous circumstances:
There are things we know that we know. There are known unknowns. That is to say there are things that we now know we don’t know. But there are also unknown unknowns. There are things we do not know we don’t know. — Donald Rumsfeld, 2002, US DoD news briefing
At first glance, the idea of unknown-unknowns is just a tautology: a meaningless catchphrase good for little more than viral marketing and justifying acts of war. And to a certain extent, I think that holds true. However, there is some meaning packed into the phrase in the specific context of the Observability industry.
Honeycomb, led by the inestimable Charity Majors, defines it like this:
Known-unknowns are (relatively) easy (or at least the paths are well-trodden). Unknown-unknowns are hard.
But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown.
Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. https://www.honeycomb.io/blog/observability-whats-in-a-name/
And New Relic, who are also trying to muscle in on the Observability game, put it like this:
The problem is many developers cannot predict all of their software’s failure modes in advance. Often, there are simply too many possibilities, some of which are genuine unknown unknowns. You cannot fix the problem because it doesn’t even exist yet.
Conventional monitoring can’t remedy this issue. It can only track known unknowns. Following known KPIs is only as useful as the KPIs themselves. And, sometimes you track KPIs that are completely irrelevant to the problem occurring. https://newrelic.com/blog/best-practices/what-is-observability
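To make that distinction concrete, here is a minimal sketch of what “tracking a known unknown” looks like in practice: a KPI and a threshold chosen in advance. The metric, the threshold and the check itself are hypothetical, not taken from any particular tool.

# A sketch of "known-unknown" monitoring: we decided up front which KPI
# matters (error rate) and which threshold should page someone.
def check_error_rate(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Return True if the predefined KPI breaches its predefined threshold."""
    if requests == 0:
        return False
    return (errors / requests) > threshold

# This only ever answers the question we thought to ask in advance.
# A failure mode that leaves the error rate flat (say, silently wrong
# results for one customer segment) never trips it.
if check_error_rate(errors=120, requests=2000):
    print("page the on-call engineer")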
Having worked for years with web, micro-service and service-oriented applications deployed in cloud environments, using open-source tooling such as ELK, Prometheus and Grafana and commercial tooling like Datadog and New Relic, the big lesson for me is to think beyond monitoring for outcomes I can predict.
Just as Test-Driven Development was really about evolving a good design, Observability is a concern that forces us to consider the data and tools we want in place so we can effectively listen to and understand our system as it faces an unpredictable external world.
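As a rough sketch of what that mindset pushes you towards, the snippet below (plain Python, standard library only) emits one wide, structured event per request instead of only incrementing pre-chosen counters. The handler, the field names and the charge() call are all hypothetical; the point is capturing high-cardinality context you can slice by later, when a question you never predicted finally arrives.

import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")


def charge(cart):
    """Stand-in for a real payment call (hypothetical)."""
    if not cart:
        raise ValueError("empty cart")


def handle_checkout(request, cart):
    """Handle a checkout and emit one wide, structured event describing it."""
    start = time.monotonic()
    event = {
        "event": "checkout",
        "request_id": str(uuid.uuid4()),
        # High-cardinality context we may never alert on, but can slice and
        # group by later when an unpredicted failure mode shows up.
        "user_id": request.get("user_id"),
        "build_sha": request.get("build_sha"),
        "feature_flags": request.get("feature_flags", []),
        "cart_size": len(cart),
    }
    try:
        charge(cart)
        event["outcome"] = "success"
    except Exception as exc:
        event["outcome"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        # One JSON line per request, shipped to whatever backend you use
        # (ELK, Honeycomb, etc.) where it can be queried ad hoc.
        logger.info(json.dumps(event))


handle_checkout(
    {"user_id": "u-42", "build_sha": "abc123", "feature_flags": ["new-cart"]},
    cart=["sku-1", "sku-2"],
)

Shipped as JSON lines into something like ELK or Honeycomb, events like these let you ask questions after the fact, such as “did this only affect carts with a particular feature flag enabled?”, without having had to predict the question in advance.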