Cognitive and Self-Adaptive System for Effective Distributed Tracing (using Jaeger, OpenTracing)

With the advent of distributed systems and microservices-based architectures, end-to-end dynamic API tracing has become an important tool for diagnosing API failures and performance issues. However, current tracing implementations capture only a small subset of traces (1-5%) to manage storage and scale constraints. This randomly sampled set is heavily skewed towards normal, consistent execution traces and misses the unusual or interesting traces needed to diagnose API failures and performance issues in the application, which affects both developers and SRE teams. We propose a solution based on machine learning and a cognitive approach that removes this bias in trace collection and storage, using a self-adaptive method that adjusts dynamically to the actual data. The system learns to capture the traces of higher interest (more variation in errors, warnings, response codes, etc.) that add value in finding the actual root cause of an issue, while keeping the distribution ratio intact.
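To make the idea concrete, here is a minimal sketch of one possible adaptive sampling rule. It is not the system presented in the talk: the `AdaptiveSampler` class, the trace "signature" (for example a response code plus an error flag), and the equal-share budget rule are all assumptions chosen for illustration. The sampler splits the overall sampling budget evenly across the signatures seen so far, so rare signatures are kept with high probability while common ones are thinned out.

```python
import random
from collections import Counter


class AdaptiveSampler:
    """Hypothetical inverse-frequency trace sampler (illustrative only)."""

    def __init__(self, target_rate=0.05, seed=None):
        self.target_rate = target_rate  # assumed overall sampling budget, e.g. 5%
        self.counts = Counter()         # traces observed per signature
        self.total = 0                  # traces observed overall
        self.rng = random.Random(seed)

    def keep_probability(self, signature):
        """Probability of keeping the next trace with this signature."""
        count = self.counts[signature]
        if count == 0:
            return 1.0  # never seen before: always worth keeping
        budget = self.target_rate * self.total  # traces we can afford to keep
        share = budget / len(self.counts)       # equal share per signature
        return min(1.0, share / count)

    def should_keep(self, signature):
        """Record the trace, then decide whether to store it."""
        self.counts[signature] += 1
        self.total += 1
        return self.rng.random() < self.keep_probability(signature)


def simulate(n_common=980, n_rare=20, seed=0):
    """Feed a skewed stream of signatures and count what gets kept."""
    sampler = AdaptiveSampler(target_rate=0.05, seed=seed)
    stream = [("200", False)] * n_common + [("500", True)] * n_rare
    random.Random(seed).shuffle(stream)
    kept = Counter(sig for sig in stream if sampler.should_keep(sig))
    return sampler, kept
```

With a stream of 980 normal and 20 failing traces, the keep probability for the failing signature ends up at 1.0, while the normal signature is sampled at roughly 2.5%, below the 5% overall budget. A production sampler would also need decay of old counts and a richer notion of "interesting" than this two-field signature.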
The solution has proven to be a game changer for the SRE teams within the org when triaging issues in complex applications using logs, metrics, and intelligent traces, with improved mean time to resolve (MTTR). It is a forward-looking way to address the common issues we face as SREs in the observability and infrastructure space, and it offers insights and paths that the next generation of SRE solutions can adopt. In other words, we have used analytical methods to make reliability work more efficient and to lighten both workload and budget.
Using the adaptive sampling approach with standard distributed tracing was a data-driven decision, and it proved effective because it:

  • Standardises the distribution of collected traces.
  • Reduces storage requirements substantially, which significantly lowers COGS.

Date

Nov 20 2025

Time

11:15 - 12:00

Location

Jacobi