Brown Bag Session: Service Monitoring

The Brown Bag Session is our regular virtual lunch meeting, where we explore technical topics. The goal of these sessions is to make our work more effective and to develop the capabilities of our systems and tools. At the last session, Paweł Przybyła discussed how we monitor our services. Read on for his conclusions.
Monitoring Metapack Services
The topic of this Brown Bag session was ensuring full observability of our systems. To monitor our services we use New Relic, an APM (Application Performance Monitoring) tool. It lets us collect a large volume of telemetry data in one place: events, metrics, and information about distributed transactions in the form of traces and spans.
This data is emitted both by our services and by every element of our virtual infrastructure. New Relic has native AWS integration, which allows us to correlate the behavior of our services with information about the operation of AWS services. With complex, distributed solutions and a large volume of recorded telemetry data, the challenge is to find the root cause of an incident quickly.
Monitoring versus observability
Here we come to the difference between monitoring and full observability. In most cases, monitoring tells us whether something is not working. This is usually done by defining dashboards or alerts based on the recorded data. With New Relic, you can use NRQL for this, a query language very similar to SQL. However, this requires formulating a specific question before the problem occurs, which is not easy with complex systems.
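To make the idea of a question asked in advance concrete, here is a minimal Python sketch that composes the kind of NRQL query a dashboard or alert might be built on. The function name and the application name "shipping-api" are hypothetical; TransactionError and appName are used on the assumption that the service reports through a standard New Relic APM agent.

```python
def build_error_count_query(app_name: str, minutes: int = 30) -> str:
    """Compose a NRQL query counting transaction errors for one application.

    Hypothetical helper for illustration only; it builds the query text
    and does not call New Relic.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    # Reject quotes and backslashes instead of guessing at escaping rules.
    if "'" in app_name or "\\" in app_name:
        raise ValueError("app_name must not contain quotes or backslashes")
    return (
        "SELECT count(*) FROM TransactionError "
        f"WHERE appName = '{app_name}' "
        f"SINCE {minutes} minutes ago TIMESERIES"
    )


if __name__ == "__main__":
    # "shipping-api" is a made-up application name.
    print(build_error_count_query("shipping-api", minutes=15))
```

The query answers exactly one question (how many errors did this application report recently?), which shows the limitation described above: someone has to think of the question before the incident happens.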
By gaining observability, we want more than to be told that something is not working. Above all, we need to know why it is not working, and we want that answer within minutes. Our goal is to understand how our system was operating at the time of the incident. We want to obtain this information from a thorough analysis of reliable telemetry data, not from hunches and guesses.
New standards bring new opportunities
When it comes to tracing distributed systems, it is always a big challenge to understand the path of a request and to correlate the events and logs generated by individual services. Until recently, each monitoring vendor implemented its own standard, using vendor-specific HTTP headers to pass identifiers and the context of individual spans. In February 2020, the W3C Trace Context recommendation (https://www.w3.org/TR/trace-context/) was published, developed by engineers from New Relic, Google, and Microsoft, among others. This standard has already been implemented by New Relic and by Microsoft in .NET Core. This gives hope for easier tracing of customer requests, even when the services that handle them are monitored by different systems.
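To illustrate what the recommendation standardizes: it defines a traceparent HTTP header with four dash-separated, lowercase hexadecimal fields (version, trace-id, parent-id, trace-flags). The Python sketch below parses such a header and produces the value a service would send downstream. It is a simplified illustration that handles only version 00 and ignores the companion tracestate header; the class and function names are our own, not part of any New Relic or .NET API.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


@dataclass(frozen=True)
class TraceContext:
    trace_id: str   # shared by every span in the same request
    parent_id: str  # id of the span that sent the request
    sampled: bool   # lowest bit of trace-flags


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None or match["version"] != "00":
        return None  # simplification: only version 00 is accepted here
    # All-zero identifiers are defined as invalid by the recommendation.
    if match["trace_id"] == "0" * 32 or match["parent_id"] == "0" * 16:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceContext(match["trace_id"], match["parent_id"], sampled)


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace, new span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    while span_id == "0" * 16:
        span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    context = parse_traceparent(incoming)
    if context is not None:
        print(child_traceparent(context))
```

Because the trace-id is passed on unchanged while each service generates its own span id, tools from different vendors can stitch the spans of one request together as long as they all read and write this header.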
This topic is both important and complex. Although our solutions in this area are already very advanced, we are constantly learning new things and improving our systems. It is very valuable that we regularly discuss our solutions and challenges with engineers from New Relic and Elastic. This lets us apply best practices and, at the same time, influence the development of these tools.