Observability is critical to the success of any application. However, defining observability is tricky. Some people confuse it with monitoring or logging, and others think it’s essentially about analytics, which is only a part of observability.
Observability, when done correctly, gives you incredible insights into the deep internal parts of your system and allows you to ask complex, improvement-focused questions, such as:
- Where is your system fragile?
- What are you doing well? What are you doing poorly?
- What should come next in your product roadmap?
- Does any code need to be reworked/rewritten?
- Where are your common points of failure?
All these are important questions to ask and can be answered with data-driven information created by implementing good observability practices.
In this article, you’ll learn what observability is, why it’s important and what kinds of problems observability helps solve. You’ll also learn about some best practices for observability and how to implement it so that you can start improving your application today.
What is observability?
Observability is how well you know what’s happening inside of your software system without writing new code.
If you were asked which of your microservices are experiencing the most errors, what the worst-performing part of your system is, or what the most common frontend error your customers are experiencing is, would you be able to answer those questions? If your team has to go away and write code to answer them, it’s fair to say your system isn’t observable. It also means that answering questions about your system becomes a constant game of whack-a-mole whenever new questions get asked.
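As a rough illustration, the sketch below assumes your services already emit structured events with `service`, `level` and `message` fields. Those field names and the sample data are hypothetical, not part of any particular tool. When events like these already exist, the questions above become queries over existing data rather than new instrumentation.

```python
from collections import Counter

# Hypothetical structured events that an observable system might already emit.
events = [
    {"service": "checkout", "level": "error", "message": "card declined"},
    {"service": "checkout", "level": "error", "message": "timeout"},
    {"service": "search", "level": "info", "message": "query ok"},
    {"service": "frontend", "level": "error", "message": "TypeError: x is undefined"},
    {"service": "frontend", "level": "error", "message": "TypeError: x is undefined"},
    {"service": "frontend", "level": "error", "message": "ChunkLoadError"},
]


def errors_by_service(events):
    """Count error-level events per service."""
    return Counter(e["service"] for e in events if e["level"] == "error")


def most_common_error(events, service):
    """Return the most frequent error message for one service, or None."""
    counts = Counter(
        e["message"]
        for e in events
        if e["service"] == service and e["level"] == "error"
    )
    return counts.most_common(1)[0][0] if counts else None


# Which service has the most errors, and what is the top frontend error?
print(errors_by_service(events).most_common(1))
print(most_common_error(events, "frontend"))
```

In practice, the same questions would usually be asked through a query language or dashboard rather than a script, but the idea is the same: the data needed to answer them is captured up front.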
Why is observability important?
Good observability allows you to make data-driven decisions that lead to positive business outcomes. Knowing what to work on, what to improve and what to ignore can propel your company from success to success. It also saves you from spending time on things your customers don’t care about or that aren’t real issues, such as offering your site in a language that your customers most likely aren’t using.
Observability is also vitally important for new software practices. In the last few decades, software systems have become increasingly complex; however, monitoring best practices haven’t developed at the same speed. Traditionally, web development was done using something like the LAMP (Linux, Apache, MySQL, PHP/Perl/Python) stack, which is one big database with some middleware, a web layer and a caching layer. The LAMP stack is simple and fairly easy to debug. To scale, you load balance those components, and because the application is monolithic, any issues can be quickly triaged, fixed and released.
Today, however, cloud infrastructure, distributed microservices, multiple geographic locations, multiple languages, multiple software offerings and container orchestration technology have hugely increased the complexity of software systems.
Observability can help you ask and answer important questions about your software system and all the different states it can go through by observing it.
According to Stripe’s The Developer Coefficient report, developers spend around 42% of their time on maintenance work, including debugging and refactoring. Good observability can help win some of that time back.
What problems does observability help solve?
There are numerous benefits when you follow good observability practices and bake them directly into your software system, including the following:
Releases are faster
When you know more about your system, you can iterate quicker. You save your developers days of debugging vague, random issues.
For instance, I have experience working at a multibillion-dollar company with millions of concurrent users. One of the tasks of the whole software team was to look through the error logs in the support queue and try to resolve the underlying issues. However, this was an incredibly difficult task. All the team ever got in a ticket was a stack trace and a count of the errors logged. This left the developers essentially looking through the code for hours, trying to track down the most likely reason for the error.
There were many cases when the (suspected) reason was fixed, passed QA, and released, but the developer was wrong, and the process had to start all over again.
Good observability takes the guesswork out of this process and can offer far more context, data and assistance to resolve issues in your system.
Incidents become easier to fix
When you have clear insights and data for key parts of your code and business, you provide your developers with the context and information they need to fix things.
A company can never fix something it doesn’t measure. This applies to incidents, too.
Having key information, such as the following, allows you to significantly reduce your mean time to recover from an incident:
- How do you replicate the incident?
- When does it happen?
- Is there a workaround?
- Does a service error occur when you replicate the incident?
It helps you decide what to work on
As previously stated, with the extra information you gain from good observability practices, you’re able to decide what you need to work on.
For instance, if a certain bug affects only 0.001 percent of the customer base, occurs in a rarely used language and is easily fixed by a refresh, it makes sense to focus on more severe system bugs instead. This gives you the most bang for your buck from the time developers spend on your system, and it keeps that time directed at the customer issues that matter most to the user experience.
With good observability, you’ll know what your customers’ biggest frustrations are, and this information can help drive your product roadmap or bug backlog.
Observability best practices
There are a few best practices that you should follow when implementing observability, including the following:
Three pillars of observability
Remember the three pillars of observability: logs, metrics, and traces. These are all different types of time-series data and can help improve your system’s observability. Using a time-series database, like InfluxDB, makes it easier to work with and effectively use these types of data.
Each of these serves as a useful and important part of the observability of your system. Logs are time-stamped records of events that occurred in your system. Metrics are numeric representations of data measured over time (e.g., 100 customers used your site over a one-hour period). Traces are a representation of flow-related events through your system (e.g., a customer hitting your landing page, adding a T-shirt to their cart, and then purchasing that shirt).
Each of these offers unique and powerful insights into your system and can help you improve it.
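To make the distinction concrete, here is a minimal sketch of how the three kinds of data might be modeled. The class names, fields and sample values are illustrative assumptions, not the schema of InfluxDB or any other product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LogRecord:
    """A time-stamped record of one event."""
    timestamp: datetime
    message: str


@dataclass
class MetricPoint:
    """A numeric value measured over a window of time."""
    name: str
    value: float
    window_start: datetime
    window_seconds: int


@dataclass
class Span:
    """One step in a trace; spans sharing a trace_id form one flow."""
    trace_id: str
    name: str
    start: datetime


start = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical sample data for each pillar.
log = LogRecord(start, "payment failed")
metric = MetricPoint("site_visitors", 100, start, window_seconds=3600)
trace = [
    Span("trace-1", "purchase", start.replace(minute=5)),
    Span("trace-1", "landing_page", start),
    Span("trace-1", "add_tshirt_to_cart", start.replace(minute=2)),
]

# Ordering spans by start time reconstructs the customer's path.
path = [span.name for span in sorted(trace, key=lambda span: span.start)]
print(path)
```

Note that every record carries a timestamp, which is what makes all three usable as time-series data.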
Conduct A/B testing
A/B testing is an important tool to drive improvements in your product and your code.
By observing your system, you can make a change to it, such as a refactoring, and directly measure the customer impact.
An example would be to move the navigation of your site from the footer to the header, where most sites normally place it. From there, you could measure the time people take to navigate to where they need to go, session duration, or time-to-purchase as a direct result of moving your navigation to the header.
You can get rid of the poorly performing version of your test and use your A/B test to drive your positive key performance indicator (KPI) metrics.
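A simplified sketch of that comparison follows. The variant names and timing samples are made up for illustration, and a real test would also need enough traffic and a statistical significance check before declaring a winner.

```python
from statistics import mean

# Hypothetical seconds-to-navigate samples collected for each variant.
samples = {
    "footer_nav": [14.0, 12.5, 16.0, 13.5],
    "header_nav": [8.0, 9.5, 7.5, 9.0],
}


def better_variant(samples):
    """Return the variant with the lowest mean time (lower is better here)."""
    return min(samples, key=lambda name: mean(samples[name]))


for name, values in samples.items():
    print(name, mean(values))
print("keep:", better_variant(samples))
```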
Don’t throw away context
For your system to truly be observable, you need to maintain as much context as possible. Everything happens within the context of time, and time-series data preserves that context. Context also includes the metadata around the events you are observing. It helps you better understand the whole picture of an issue you’re facing and leads to speedier resolutions.
For instance, if your system starts to get an error at a certain time, context could be the key to truly observing and deciphering the cause. So if your system starts to get an error only on Fridays, you may realize that the errors are being caused by an automated database backup script that also takes place at that time. However, if you haven’t been capturing all the context and information around that specific log, the log in isolation is useless. A solution like InfluxDB can help with storing, managing and using this type of data.
Context includes things like the following:
- The time of your event.
- The count of your event.
- The user associated with your event.
- The day of the event.
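The Friday example can be sketched as follows, assuming each error event kept its timestamp and user. The event shape and sample data are hypothetical. Because the timestamp was preserved, grouping by weekday is a one-line query instead of a new logging change.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical error events that kept their context (time and user).
error_events = [
    {"time": datetime(2022, 9, 2, 2, 0), "user": "u1"},
    {"time": datetime(2022, 9, 9, 2, 5), "user": "u2"},
    {"time": datetime(2022, 9, 16, 2, 1), "user": "u3"},
    {"time": datetime(2022, 9, 13, 15, 0), "user": "u1"},
]


def errors_by_weekday(events):
    """Count events per day of the week using each event's timestamp."""
    return Counter(DAYS[e["time"].weekday()] for e in events)


counts = errors_by_weekday(error_events)
print(counts.most_common(1))
```

A spike on one weekday is only a hint, not a diagnosis, but it narrows the search to whatever else runs at that time, such as a scheduled backup.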
Maintain unique IDs throughout the system
When multiple parts of a system need to communicate, a single event is often aliased, meaning each part refers to it by a different ID.
For example, if your frontend page sends a customer to a payment page, the unique ID you hold for the customer may be hard to correlate with the payment they just made. This is considered an anti-pattern.
You need to ensure that all the different parts of your system are speaking one unified language. If you don’t, you’ll only ever achieve observability in a portion of your system. Once it becomes hard to correlate one error between two different systems, you’ll be back to having an unobservable system.
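One common way to do this is to generate a single correlation ID where a request enters the system and pass it to every downstream service. The sketch below is a minimal in-memory illustration; the function names, field names and values are assumptions made for the example, not a specific framework’s API.

```python
import uuid


def new_correlation_id():
    """Generate one ID at the edge of the system."""
    return str(uuid.uuid4())


def payment_service(correlation_id, amount_cents, log):
    """Record the payment under the ID it was handed, not a new one."""
    log.append({"service": "payments", "correlation_id": correlation_id,
                "amount_cents": amount_cents, "event": "payment_captured"})


def frontend_checkout(customer_id, log):
    """Start a checkout and pass the same ID downstream."""
    correlation_id = new_correlation_id()
    log.append({"service": "frontend", "correlation_id": correlation_id,
                "customer_id": customer_id, "event": "checkout_started"})
    payment_service(correlation_id, amount_cents=1999, log=log)
    return correlation_id


def events_for(correlation_id, log):
    """Find every event, across services, that belongs to one request."""
    return [e for e in log if e["correlation_id"] == correlation_id]


log = []
cid = frontend_checkout("customer-42", log)
print([e["service"] for e in events_for(cid, log)])
```

In a real distributed system, the ID would typically travel in a request header or message attribute rather than a function argument.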
Observability vs. monitoring
Monitoring and observability are often confused; however, it’s important to understand their differences so that you can implement both accurately.
Monitoring deals with known unknowns. For example, if you know you don’t have a lot of information in your API that deals with your payments backend, you can add logs into it in order to monitor that system. Monitoring is generally more reactive and is used to track a particular part of your system.
Monitoring is important but is different from observability.
Observability generally deals with unknown unknowns. For example, you may not even know you don’t have much information in your payments backend system, and this is where observability comes into play. You begin to understand your system more deeply, and when you gain a deep, intricate view of your system, you can identify your holes and where you need to improve.
This is less reactive and is normally broadly termed discovery work.
In this article, you learned what observability is, why it’s important and what problems it helps solve. You also learned some best practices for implementing it and how observability and monitoring differ.
Kealan Parr is a senior software engineer at Amber Labs.