Introduction: the importance of cloud observability

Hard truth: cloud applications are complex beings. There can be many reasons for this complexity, and sometimes it is the price of keeping each component simple on its own. In the grand scheme of things, breaking applications down into smaller parts means introducing more interactions between them, and interactions generate complexity.

What is cloud observability and why it matters in cloud computing

To understand the importance of cloud observability, we first need to differentiate it from traditional monitoring and understand what observability offers beyond it.

Monitoring

To overcome and contain this complexity, we need a framework for untangling interactions and the applications' behavior during those interactions. Cloud monitoring tools are the bread and butter here, but monitoring, in its basic definition, is limited to signals about known behaviors and deviations from them. Monitoring essentially answers two important questions:

  1. Is everything OK?
  2. What’s wrong?

Observability

However, in this day and age, where applications change rapidly and errors, increased latencies and anomalies are part of the app's normal working state, those questions no longer set an adequate bar. They will not help us predict future regressions, which may be only a matter of time, or of a few more application changes, away.

To mitigate those problems, and to maintain a broader understanding of our cloud applications and their interactions with each other, we need a richer framework that helps us answer harder, broader questions such as:

  1. What is happening?
  2. How is it happening?

The framework for these questions is observability, and it is a crucial part of cloud infrastructure management today.

Cloud observability components: the pillars of observability

Although there are several approaches to defining the core elements of observability, I'll stick to the three universally adopted types of observability data that form the basis of any Application Performance Monitoring (APM) stack.

Logs

Logs are observability's oldest, simplest pillar. Cloud native logging frameworks are among the first things we developers look for when adopting a new language or spinning up a new microservice. Logs are faithful and descriptive for both humans and machines (although not optimized for the latter), which makes them the easiest pillar to adopt when we set out on our observability journey.

This human-readable goodness, however, doesn’t come without a price:

  • They're messy. In the beginning, logs are nice and tidy: they appear when needed and stay out of the way when they don't. But soon enough they infest the codebase, the standard output and, eventually, our investigation efforts, leaving us chasing a needle in a haystack.
  • They lack context. Although now mostly structured and labeled, logs were not built from the ground up for querying. In most cases, when logs pile up in masses from multiple sources, we're left with a single, inadequate anchor - time. If you are a developer who has never gone through the tedious job of hunting a problem by searching logs within an optimistic timeframe, slowly widening it until your golden log turns up after reviewing hundreds - you're one lucky person! (A sketch of structured logging, which eases this pain, follows below.)
  • Security scares. Logs are frequently used to propagate errors, without much thought about what those errors might contain - in many cases, PII (Personally Identifiable Information). This means logs can become a security nightmare just as quickly as they become a developer's best friend, especially when they are managed by and sent to a 3rd party.
  • They're expensive. Logs are the most permissive data type: structure is recommended, not enforced, and definitely not relied upon for storage or runtime optimizations. With long retention and high volume, logs can take up considerable storage and, to some extent, degrade your app's performance.
groundcover's Logs explorer: organizing and contextualizing logs is crucial for the long term
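
To make the context problem less painful from day one, structured logging is the lowest-hanging fruit. Here is a minimal sketch using Go's standard log/slog package; the service and field names are invented for illustration:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// A JSON handler emits machine-parseable log lines, giving queries
	// more anchors than the single, inadequate one: time.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("payment processed",
		slog.String("service", "checkout"),  // who emitted it
		slog.String("order_id", "ord-1234"), // what it relates to
		slog.Duration("latency", 87*time.Millisecond),
	)
}
```

Each line now carries labels you can filter on, instead of forcing a timeframe-widening hunt through raw text.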

Metrics

Metrics are a more modern pillar (definitely in terms of adoption rate) that harnesses structure and math to better answer questions involving quantities, for example:

What is the error rate of a specific problem? What’s the latency increase within a specific timeframe?

Metrics are time-series data points, stored in databases optimized for this data type: time-series databases, or TSDBs for short. The most prominent TSDB is Prometheus, which also provides the ecosystem for exposing and querying metrics, although there are other players, such as VictoriaMetrics, InfluxDB and TimescaleDB, that are compatible with Prometheus but bring a lot of innovation to the field.

As a more focused observability data type, metrics make an important set of questions much easier to answer. They come, however, with a higher implementation cost: setting up a metrics ecosystem, actively exposing metrics from each workload, and a small yet real learning curve for producing and querying them.

Some example metrics from groundcover: errors/sec, requests/sec, resource utilization
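
To make "actively exposing metrics from each workload" concrete, here is a minimal sketch using the official Go Prometheus client (github.com/prometheus/client_golang); the metric name, label and port are invented for the example:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is a counter labeled by HTTP status code; Prometheus
// scrapes it from /metrics and turns it into time-series data points.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests, labeled by status code.",
	},
	[]string{"code"},
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues("200").Inc()
		w.Write([]byte("ok"))
	})
	// Expose the metrics endpoint for the Prometheus scraper.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

With that in place, the error-rate question from above becomes a one-line PromQL query, e.g. `rate(myapp_http_requests_total{code=~"5.."}[5m])`.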

Traces

While metrics and logs allow us to propagate details about a certain event in our app, traces are actual samples of it - they expose the actual data encapsulated within the event. This X-ray-like superpower gives engineers the opportunity to dive into an interaction between services, which in turn allows us to:

  1. Isolate the variation that causes an anomaly
  2. Reproduce a specific scenario in a controlled environment

Traces are definitely the most modern observability pillar of the bunch, and they significantly improve Mean-Time-To-Detect (MTTD) by allowing engineers to reproduce several classes of incidents faster and verify that re-triggering them won't result in recurring problems.

It's important to mention that with the adoption of microservices, the number of services involved in a procedure has increased significantly, and a problem within this kind of procedure can originate in multiple services rather than a single one - or, in other words, "several needles in the haystack".

This challenge led to a new sub-class of tracing called distributed tracing, where each sample is contextualized with the samples that led to it and those invoked after it.
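
To illustrate how that context is stitched together, here is a hedged sketch of manual span creation with the OpenTelemetry Go API; the service name, span name and attribute are invented, and a real setup would also register a TracerProvider with an exporter at startup:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// The tracer comes from the globally registered TracerProvider.
var tracer = otel.Tracer("checkout-service")

func chargeCard(ctx context.Context, orderID string) error {
	// Start a span as a child of whatever span is already in ctx -
	// this parent/child chain is what links samples across services.
	ctx, span := tracer.Start(ctx, "chargeCard")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	_ = ctx // would be passed to downstream calls so the trace context propagates
	return nil
}

func main() {
	_ = chargeCard(context.Background(), "ord-1234")
}
```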

However, tracing is also incapable of supporting an observability strategy by itself, and it presents several challenges:

  • It's expensive. Most Application Performance Monitoring (APM) products today monetize tracing (and especially distributed tracing) in a way that makes it difficult to implement freely.
  • It's instrumentation-based. Most APM solutions (unlike groundcover :) ) rely on either changing the container entry point or importing an SDK that automatically hooks core libraries; this can lead to drift in performance, behavior and architecture.
  • Maintainability is tough. Tracing coverage is hard to retain in a modern environment where services are mutated and replaced frequently.
An HTTP network trace example; with it, we can reproduce an incident in a controlled environment

Alerts

Alerts are not a data pillar like logs, metrics and traces; rather, they are the day-2 operation once observability data is being generated.

Observability is only as good as it is actionable and proactive, and there are three stages to achieving that:

  1. Generating all the data
  2. Aggregating the data in a consumable way
  3. Having the ability to define thresholds for important behaviors, and fire events when those are breached (a toy sketch follows this list)
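
To make stage 3 concrete, here is a toy sketch of the idea in Go; in real stacks this is delegated to a rule engine such as Prometheus with Alertmanager, and the metric and threshold here are made up:

```go
package main

import (
	"fmt"
	"time"
)

// Alert binds a threshold to a metric source. Purely illustrative;
// production systems evaluate rules server-side against the TSDB.
type Alert struct {
	Name      string
	Threshold float64
	Value     func() float64 // returns the current metric value
}

func evaluate(alerts []Alert) {
	for _, a := range alerts {
		if v := a.Value(); v > a.Threshold {
			// A real system would page someone or post to a channel here.
			fmt.Printf("FIRING %s: %.2f exceeds threshold %.2f\n", a.Name, v, a.Threshold)
		}
	}
}

func main() {
	errorRate := func() float64 { return 0.07 } // stubbed metric source
	alerts := []Alert{{Name: "HighErrorRate", Threshold: 0.05, Value: errorRate}}

	// Re-evaluate the rules periodically.
	for range time.Tick(30 * time.Second) {
		evaluate(alerts)
	}
}
```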

An observability system that fails to provide means for alerting is doomed to fail, as it requires engineers to constantly watch and look out for problems in an ever-changing environment.

Implementing cloud observability in cloud native environments

So now that we've covered the foundations of cloud observability, let's define what a cloud observability suite should look like.

When building your cloud observability suite, it's important to think about it like a pyramid:

  • The base of our pyramid is dependency-free, application-agnostic visibility. Such a base makes sure you're always covered to some extent, regardless of how your applications evolve
  • As we layer on more observability tools, each tool brings more clarity at a higher price of integration and maintenance
  • Our goal is to shift as much visibility as we can towards the base, making observability less reliant on integration and maintenance - without losing resolution, and while keeping the cost of integrating and maintaining the upper layers sustainable

So how do we implement this strategy? By sticking to these values:

Choosing the right observability solution

When choosing observability tools, it's important to use those that conform to standardized, widely adopted protocols. One example is Prometheus - the de facto standard for metrics ingestion - with which many metrics ingestion solutions are compatible, meaning you can choose the solution that best suits your needs.

eBPF-based metrics and traces alongside application instrumentation for observability

Logs ingestion solutions can be deployed separately from your application, which makes them application-agnostic out of the box.

Metrics and traces are harder to achieve dependency-free and come with performance and integration penalties. This is where eBPF comes into the picture: solutions like groundcover make sure you're covered regardless of application changes, so you can implement a more focused, incremental instrumentation strategy.

Configuring metrics, logs and traces

When you configure your observability platform, it will either start generating data by itself (granularity varies significantly between vendors) or wait for step-two integrations. When configuring the behavior of MLT (metrics, logs, traces) ingestion, ensure your platform offers the following:

  • Custom integrations - Observability is a fast-changing field, and as new tools are introduced, it's critical not to burden engineers with yet another ecosystem. Make sure your observability platform makes it easy to incorporate new tools, and makes transitions towards them as painless as possible.
  • Aggregate and contextualize everything - Observability in cloud environments can get noisy. Really noisy. All those logs, metrics and traces will not necessarily make sense separately; an observability platform should give you the big picture and glue the different pieces together as you drill down.

Defining meaningful metrics and alerts 

Observability platforms are more than a data warehouse of native, infra-level metrics, logs and traces; they should allow two core expansions on top of those:

  1. Ingestion of custom metrics - as your product grows, more application-level metrics should be put in place to provide higher-order data points, since not all problems and insights are easily deducible from infrastructure- or network-level metrics. Your observability platform must be able to ingest custom-defined metrics, with an emphasis on low integration overhead and ease of consumption (a sketch follows this list).
  2. Alertability - ingesting the data and presenting it passively is the first step, but observability is lacking if no proactive measures are in place, particularly alerts. Alerts should be set up from day one to allow quick response to regressions and trends.
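
For point 1, a custom application-level metric can be as small as this hedged sketch, again with the Go Prometheus client; the metric name, operation and buckets are invented:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// checkoutDuration is an application-level histogram: it captures a
// business operation's latency distribution, something infrastructure-
// or network-level metrics cannot deduce on their own.
var checkoutDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_checkout_duration_seconds",
	Help:    "Time to complete a checkout.",
	Buckets: prometheus.DefBuckets,
})

func checkout() {
	start := time.Now()
	defer func() {
		checkoutDuration.Observe(time.Since(start).Seconds())
	}()
	// ... business logic ...
}

func main() {
	checkout()
}
```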

Cloud Observability Challenges

By now it might be clear that cloud observability is a complex challenge, but it is one we must face, as it is foundational to long-term growth and velocity.

When implementing cloud native observability tools, the following battles are probably going to be the most prominent:

  • So much data - Managing the substantial volume of data generated by complex cloud applications, including microservices, databases, and high request rates. Observability platforms must provide context and clarity as users navigate through the data.
  • Privacy is a concern - Ensuring data privacy and compliance, given the sensitive information collected by observability platforms. Managed platforms exporting data externally require robust security guarantees to avoid data leaks.
  • Infrastructure visibility - Recognizing the impact of cloud infrastructure components (e.g., K8s, AKS/GKE/EKS) on application behavior. Cloud observability platforms should treat infrastructure as a crucial aspect.
  • Performance issues - Observability data plays a role in real-time analytics: detecting severe downtime, SLA breaches, and issues demanding immediate response. Proactive measures require highly responsive, fast observability platforms.

So much data: handling the vast amount of data generated

Sometimes it seems like every piece of software today is complicated: every app is built to withstand 1M req/s before even launching, databases are replicated from day one, and there are 10 microservices, each with a sidecar, as early as v0.0.1. We won't discuss whether this is wrong in the first place - everything has a reason - but one thing is for sure: something needs to make sense of all this mess, because otherwise no one can get a grip on what is going on.

Observability platforms are first and foremost responsible for making things make sense: buffering the noise, bringing the required context, clarifying the big picture, and presenting more data as we drill down.

Privacy is a concern: ensuring data privacy and compliance

Security awareness is one of the great achievements of modern software engineering; it is something most developers deal with on a daily basis, and that's a great thing.

Observability, as a data ingestor at its core, inevitably touches sensitive information about users and services. As such, it must be audited and provide security guarantees that allow security engineers and users to sleep well at night. One of the major considerations is where the data is actually stored: managed observability platforms that actively export data outside the company's boundaries to a 3rd party will obviously require special attention, compliance guarantees, and additional overall effort to prevent data leaks.
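
One cheap, in-process mitigation is scrubbing sensitive fields before they ever leave the service. Here is a hedged sketch using Go's standard log/slog; which keys count as PII depends entirely on your domain, and these are just examples:

```go
package main

import (
	"log/slog"
	"os"
)

// redactPII masks attributes whose keys look sensitive.
// The key list here is purely illustrative.
func redactPII(groups []string, a slog.Attr) slog.Attr {
	switch a.Key {
	case "email", "credit_card", "ssn":
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		ReplaceAttr: redactPII,
	}))
	// The email value never reaches stdout - or a 3rd-party backend.
	logger.Error("charge failed", slog.String("email", "jane@example.com"))
}
```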

Infrastructure Visibility

In a cloud environment, application-level visibility is sometimes just not enough, as cloud infrastructure components (e.g. K8s, AKS/GKE/EKS) have a significant impact on the app's behavior (memory overcommit, RBAC problems, and the list goes on). This means that our cloud observability platform must treat the infrastructure as a first-class citizen - in other words, as important as the apps themselves.

Performance issues

Observability data is not only crucial for trend detection and long-term considerations; it is the basis for real-time analytics and call-to-action events, ranging from severe downtime to SLA breaches and problems that require immediate response. Thus, observability platforms must be fast - preferably really fast.

Best Practices for Cloud Observability in Cloud Native Applications 

Now that we've reviewed what to look for in an observability solution, and where the focus should be, here are some best practices worth taking into consideration as the platform is built and as it evolves.

Defining Observability goals and objectives

As an observability stack provides broader context than a traditional monitoring stack, it's important to define goals and objectives, to make sure we don't find ourselves investing a lot of time in setup only to invest significantly more time struggling with usability.

As has been hinted by now, two crucial goals of any observability solution are:

  1. Low overhead - an observability platform must have low overhead in both integration and experience. If the platform makes you work for it, it takes your eyes away from what matters. Other aspects of low overhead are the platform's footprint and its hit on throughput - another good place to celebrate the supremacy of eBPF.
  2. Low MTTD - at their core, observability platforms should allow understanding of what's wrong, and as such there is a very simple KPI they should be tested against: how fast did we find the problem? Observability platforms should allow all R&D stakeholders, from DevOps engineers to developers, to find what they are looking for fast - and even better, proactively alert them with all the relevant details.

Establishing a holistic monitoring strategy

Today, release cycles are multi-stage pipelines: from the local development cluster, to CI runners, to staging and finally production. Unfortunately, due to cost considerations and integration complexity, organizations limit their observability stack, and the features within it, to only a small portion of that environment.

The ability to bring observability to every step in the cycle can mitigate regressions entirely before they hit production, and identify the trends that might lead to them much earlier.

Collaborating between development and operations teams 

As mentioned earlier, observability can hold significant value for many different stakeholders in the organization - product, R&D, ops, sales - all of them can benefit from observability platforms.

A good observability platform provides actionable insights alongside the ability to drill down into the specifics, and also allows customization that targets specific stakeholders in the organization.

Upcoming Trends in Cloud Observability

So, what's next? Where is cloud observability going? Given that eBPF is not the future but a battle-tested, proven technology that has already transformed and continues to disrupt the observability landscape, we look past it to see what the next revolution might be.

AI-driven observability and anomaly detection 

It's kind of stating the obvious by now, but after changing the way we eat, sleep, do homework and cure diseases, it's inevitable that AI will transform observability too - especially since we were such good kids and made sure all the relevant data is so organized and meticulously labeled.

But on a more serious note, AI is definitely going to transform how, and how fast, we detect anomalies - even those that are hardest to spot, hiding behind millions of requests and interactions - not to mention the benefit of completely eliminating the learning curve of incorporating new tools (e.g. new query languages, instrumentation SDKs, and so on).

Integration with DevOps and CI/CD pipelines

Observability platforms are not, and should not be, a devops-first tool nor a production-first one; they present value across the entire engineering lifecycle and for every piece of software running. Whether it's a developer testing how they improved latency in an upcoming version against a production sample, or engineers in the CI/CD pipeline trying to find out why fetching packages during a build takes so long - observability platforms can unfold the story of it all.

Advancements in distributed tracing and service mesh

Distributed tracing is currently in its early stages of adoption, and it is not quite a "plug and play" solution yet - one that can be easily (and cost-efficiently!) implemented everywhere.

With that in mind, most observability players (including groundcover) are innovating in this field using different approaches - mainly by expanding the OpenTelemetry ecosystem and, you guessed it, by using eBPF!

Auto-instrumentation is making great progress in terms of language support and ease of use, and we at groundcover also support zero-change integration with pre-implemented tracing.

Service mesh is also worth mentioning: it has become much easier to adopt and generates valuable observability data (among other benefits), but it is also a crucial component to monitor, as it can have a grave impact on integrity and performance.

Whether you’ve already adopted those, or plan to in the near future, make sure you’re able to incorporate them into your observability stack with minimal effort.

Conclusion

This was a short journey through the past, present and future of observability foundations and approaches - one that, like a good observability platform, lays the foundations we believe should serve as the compass for your observability decisions.

At groundcover, we do our best to do the heavy lifting for you (and as you can see - it's heavy). Our focus is on providing unbiased, agnostic observability while letting you incorporate your existing instrumentation and metrics with minimal effort, achieving both by relying on state-of-the-art technologies and industry-standard protocols - so you know it will scale with your company and your containerized applications as they grow.
