Introduction: Why Observability Matters
In today’s world of distributed systems, cloud-native applications and microservices, maintaining visibility into system behavior is more critical than ever. At Vodafone, as we continue to expand and modernize our digital infrastructure, the need for reliable performance, rapid issue resolution and proactive monitoring has become a top priority.
The Problem: Traditional Monitoring Falls Short
Traditional monitoring tools focus primarily on metrics and logs, often leading to blind spots in distributed architectures. Key challenges include:
- Limited Visibility – When a request moves through multiple microservices, it’s hard to track and debug.
- Scattered Data – Different teams use different tools, making it difficult to connect the dots.
- Slower Issue Resolution – Without a unified observability approach, identifying the root cause of an issue takes time.
OpenTelemetry: A Unified Observability Framework
To solve these challenges, Vodafone adopted OpenTelemetry, a flexible, open-source observability framework that provides end-to-end visibility across distributed systems. By integrating OpenTelemetry into our infrastructure, we have transformed how we monitor, debug and optimize our services.
In this article, we’ll explore:
- What OpenTelemetry is and why we chose it.
- How we integrated it into Vodafone’s infrastructure.
- The benefits we gained and challenges we faced.
🚀 Let’s dive in!
What is OpenTelemetry?
OpenTelemetry (OTel) is an open-source observability framework for collecting, processing and exporting telemetry data—such as traces, metrics and logs. It helps developers monitor system performance and diagnose issues by providing a standardized way to instrument applications and services, regardless of their programming language, infrastructure, or runtime environment.
How OTel works
OTel is made up of several key components:
- Instrumentation (APIs & SDKs): Used to collect traces, metrics and logs from applications through manual or auto instrumentation.
- Collectors: Standalone services that receive, process and export telemetry data.
- Exporters: Send telemetry data to monitoring tools like Prometheus, Jaeger, Grafana, or Datadog.
- Context Propagation: Ensures that a request’s journey across multiple services is tracked as a single flow.
Why we chose OTel at Vodafone
We evaluated several observability solutions before selecting OTel. Here’s why it stood out:
- Vendor-Neutral & Open-Source: No lock-in to a specific monitoring provider.
- Unified Data Collection: A single standard for traces, metrics and logs across all services.
- Multi-Language Support: Works with Java, Python, Node.js, Go and more.
- Scalability: Designed for large, complex infrastructures like ours.
By adopting OTel, we established a unified observability framework across our systems, enabling better monitoring, debugging and performance optimization.
How OTel traces requests across systems
In a distributed system, a single request can pass through multiple microservices, databases and external APIs. To track this journey, OTel uses traces and spans.
What are Traces and Spans?
- Trace → Represents the entire journey of an operation across multiple services.
- Span → Represents a unit of operation or step within the trace (e.g., an HTTP request, database query).
Example:
If a user logs into a Vodafone app, their request may go through:
- API Gateway (Receives request) → Span 1
- Authentication Service (Validates user) → Span 2
- User Database (Fetches user data) → Span 3
Each span carries the same Trace ID, which links the spans together into a single trace.
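To make the relationship concrete, here is a minimal sketch of the login example as plain JavaScript objects. It is not the OTel SDK's span class: `createSpan` and the field names are hypothetical, and the assumption that the database span is a child of the authentication span is ours. The ID lengths follow the W3C Trace Context convention (16-byte trace IDs, 8-byte span IDs).

```javascript
const { randomBytes } = require('node:crypto');

// Hypothetical helper: creates a plain span record (not the OTel SDK's Span class).
function createSpan(name, traceId, parentSpanId = null) {
  return {
    traceId,                                 // shared by every span in the trace
    spanId: randomBytes(8).toString('hex'),  // 8 random bytes -> 16 hex characters
    parentSpanId,                            // null marks the first (root) span
    name,
  };
}

// One trace ID for the whole login request: 16 random bytes -> 32 hex characters.
const traceId = randomBytes(16).toString('hex');

const gatewaySpan = createSpan('api-gateway', traceId);
const authSpan = createSpan('validate-user', traceId, gatewaySpan.spanId);
const dbSpan = createSpan('fetch-user', traceId, authSpan.spanId);

const loginTrace = [gatewaySpan, authSpan, dbSpan];
for (const span of loginTrace) {
  console.log(`${span.name}: trace=${span.traceId} span=${span.spanId} parent=${span.parentSpanId}`);
}
```

Every line printed shares the same trace value, while each span has its own ID and points at the span that triggered it.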
Inside a Span: Understanding Its Structure
Every span contains important details that help track requests, debug failures and optimize performance. Below is an example of a span in JSON format, representing the Authentication Service step (Span 2) from our login example:
{
  "trace_id": "1234abcd",
  "span_id": "abcd5678",
  "parent_span_id": "xyz123",
  "name": "validate-user",
  "start_time": "2025-03-19T12:00:00Z",
  "end_time": "2025-03-19T12:00:02Z",
  "attributes": {
    "http.method": "POST",
    "http.status_code": 200,
    "auth.method": "OAuth"
  }
}
Here, the trace_id ensures that this span is linked to the entire request flow, while the span_id uniquely identifies this step. The parent_span_id connects it to the API Gateway, which triggered the authentication. The span also captures timing details and attributes, such as the HTTP method and authentication type.
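The timing fields are what make a span useful for performance work. As a small sketch, the span from the JSON example can be loaded as a plain object and its duration derived from the two timestamps. The helper names `spanDurationMs` and `isServerError` are hypothetical, and treating HTTP 5xx as a failure is an assumption for illustration.

```javascript
// The span from the JSON example, as a plain object.
const authSpanRecord = {
  trace_id: '1234abcd',
  span_id: 'abcd5678',
  parent_span_id: 'xyz123',
  name: 'validate-user',
  start_time: '2025-03-19T12:00:00Z',
  end_time: '2025-03-19T12:00:02Z',
  attributes: { 'http.method': 'POST', 'http.status_code': 200, 'auth.method': 'OAuth' },
};

// Hypothetical helper: duration in milliseconds from the ISO 8601 timestamps.
function spanDurationMs(span) {
  return Date.parse(span.end_time) - Date.parse(span.start_time);
}

// Hypothetical helper: treat HTTP 5xx responses as failed spans.
function isServerError(span) {
  return span.attributes['http.status_code'] >= 500;
}

console.log(`${authSpanRecord.name} took ${spanDurationMs(authSpanRecord)} ms`); // validate-user took 2000 ms
console.log(`server error: ${isServerError(authSpanRecord)}`); // server error: false
```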
For more details, refer to the OpenTelemetry documentation on Traces.
Context Propagation: Connecting Spans Across Services
In distributed systems, each service typically runs independently. So, when Service A calls Service B, how does Service B know it’s part of the same trace?
Answer: Context Propagation.
When a service starts a new span, OTel automatically:
- Generates a Trace ID (if it's the first service in the request; the first span in a trace is called the root span).
- Passes the Trace ID & Span ID to the next service (via HTTP headers, Kafka messages, gRPC metadata, etc.).
- Extracts the Trace ID in the next service, ensuring it continues the same trace.
Example: Passing Context in an HTTP Request
- Service A (Caller) - Injects Trace Context
const { context, propagation } = require('@opentelemetry/api');
const axios = require('axios');
// Create an empty headers object
const headers = {};
// Inject the trace context into the headers
propagation.inject(context.active(), headers);
// Make the HTTP request with injected trace headers
// (Service B exposes POST /process, so we send a POST with an empty body)
axios.post('http://service-b:5000/process', {}, { headers })
  .then(response => console.log('Request successful:', response.status))
  .catch(error => console.error('Request failed:', error));
- Service B (Receiver) - Extracts Trace Context
const { context, propagation, trace } = require('@opentelemetry/api');
const express = require('express');
const app = express();
const tracer = trace.getTracer('example-tracer');
app.use(express.json());
app.post('/process', (req, res) => {
  // Extract trace context from incoming request headers
  const extractedContext = propagation.extract(context.active(), req.headers);
  // Start a new span within the extracted context
  context.with(extractedContext, () => {
    const span = tracer.startSpan('process-request');
    try {
      console.log('Continuing the trace from Service A!');
      res.send('Processed request successfully.');
    } finally {
      span.end();
    }
  });
});
app.listen(5000, () => console.log('Service B running on port 5000'));
Now, both services share the same Trace ID and are linked together.
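It can help to see what actually travels between the two services. Assuming the default W3C Trace Context propagator is in use, the injected context is a `traceparent` header of the form `version-traceId-spanId-flags`. The sketch below formats and parses that value in plain JavaScript. The function names are hypothetical and the IDs are example values; in a real service the OTel propagator does this work for you.

```javascript
// Build a W3C `traceparent` header value: version-traceId-spanId-flags.
function formatTraceparent(traceId, spanId, sampled = true) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse a `traceparent` value; returns null when the header is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // All-zero IDs are invalid according to the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Service A side: the kind of value that ends up in the outgoing headers.
const outgoingHeader = formatTraceparent('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7');
console.log(outgoingHeader);

// Service B side: recover the trace ID and the caller's span ID.
const incoming = parseTraceparent(outgoingHeader);
console.log(incoming.traceId, incoming.spanId, incoming.sampled);
```

The span ID in the header belongs to the caller, so the span that Service B starts records it as its parent while reusing the trace ID.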
For an in-depth understanding, see the OpenTelemetry documentation on Context Propagation.
Exporters & Processors
Once traces, metrics and logs are collected, OTel needs a way to process and export this data to monitoring and observability platforms. This is where Exporters and Processors come into play.
Exporters
An Exporter is responsible for sending telemetry data to external systems for storage, visualization and analysis. It acts as the mechanism that transmits the collected data to the appropriate destination.
How Exporters Work:
After the data collection and processing phases, the exporter sends the telemetry data to the configured destination (such as Jaeger, Datadog, or others).
You can use a variety of exporters based on your preferred monitoring tools, including:
- Jaeger Exporter: Sends trace data to Jaeger for visualization.
- Prometheus Exporter: Exports metric data to Prometheus.
- OTLP Exporter: Sends data using the OpenTelemetry Protocol (OTLP) to backends like the OTel Collector.
- Elasticsearch Exporter: Exports log data to Elasticsearch for storage, search and analysis.
Here’s a simple example of configuring a Jaeger Exporter to send trace data:
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { trace } = require('@opentelemetry/api');
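// Configure the Jaeger exporter (sends spans to the Jaeger agent over UDP)
const jaegerExporter = new JaegerExporter({
  host: 'localhost',
  port: 6831,
});
// Set up the TracerProvider and add the exporter
// (addSpanProcessor() is the SDK 1.x API; SDK 2.x takes a spanProcessors array in the constructor)
const tracerProvider = new NodeTracerProvider();
tracerProvider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));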
// Register the TracerProvider globally
tracerProvider.register();
// Retrieve a tracer instance (use this in your application)
const tracer = trace.getTracer('vodafone-tracer');
console.log('OpenTelemetry Tracer with Jaeger Exporter initialized');
module.exports = { tracer };
In this example, we're exporting trace data to a Jaeger agent running locally. The JaegerExporter sends traces to the agent and the BatchSpanProcessor ensures that spans are batched and exported efficiently.
Processors
A Processor in OTel is used to manipulate telemetry data before it is exported. It allows you to customize how data is handled or modify its format.
There are two primary types of processors:
- Span Processors: Act on trace spans as they start and end.
- Metric Processors: Act on metric data before it is exported.
How Processors Work:
- Span Processors: These handle operations on spans (trace data). For example, you can batch spans together for efficient export, add attributes to spans, or filter spans based on specific criteria.
- Metric Processors: These can aggregate or modify metric data before sending it to exporters.
Common Use Cases for Processors:
- Batching: Instead of exporting each span immediately, you can batch multiple spans together for more efficient export.
- Filtering: You might want to filter out unnecessary data (e.g., traces that aren’t important for analysis).
- Adding Custom Attributes: Add custom attributes to the spans or metrics, such as additional metadata or tags.
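As a rough illustration of the filtering and attribute use cases, the sketch below applies both to a list of finished spans held as plain objects. It does not use the OTel SDK: the span shape, the function names and the `deployment.environment` value are assumptions made for the example.

```javascript
// Hypothetical finished spans, reduced to the fields this sketch needs.
const finishedSpans = [
  { name: 'GET /health', attributes: { 'http.status_code': 200 } },
  { name: 'POST /login', attributes: { 'http.status_code': 200 } },
  { name: 'GET /plans', attributes: { 'http.status_code': 500 } },
];

// Filtering: drop health-check spans, which rarely help with analysis.
function dropHealthChecks(spans) {
  return spans.filter((span) => !span.name.endsWith('/health'));
}

// Adding custom attributes: return copies with extra metadata merged in.
function addAttributes(spans, extraAttributes) {
  return spans.map((span) => ({
    ...span,
    attributes: { ...span.attributes, ...extraAttributes },
  }));
}

const processedSpans = addAttributes(dropHealthChecks(finishedSpans), {
  'deployment.environment': 'staging', // assumed value for illustration
});
console.log(processedSpans.map((span) => span.name)); // [ 'POST /login', 'GET /plans' ]
```

Both helpers return new arrays rather than mutating their input, so the original spans stay available to any other step in the pipeline.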
Example: Using a Span Processor
In the earlier Jaeger exporter example, we used a BatchSpanProcessor. It collects finished spans and exports them to Jaeger in batches, which reduces the number of individual export operations and the overhead they cause.
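The batching idea itself is simple enough to sketch in a few lines. The class below is a simplified stand-in, not the SDK's BatchSpanProcessor: it only flushes when the batch is full or when `flush()` is called, whereas the real processor also exports on a timer and caps the queue size. `SimpleBatcher` and its parameters are hypothetical names.

```javascript
// Minimal batching sketch; the real BatchSpanProcessor also flushes on a timer
// and enforces a maximum queue size.
class SimpleBatcher {
  constructor(exportBatch, maxBatchSize = 3) {
    this.exportBatch = exportBatch; // callback that receives an array of spans
    this.maxBatchSize = maxBatchSize;
    this.queue = [];
  }

  // Called when a span ends: queue it and export once the batch is full.
  onEnd(span) {
    this.queue.push(span);
    if (this.queue.length >= this.maxBatchSize) this.flush();
  }

  // Export whatever is queued (for example on shutdown).
  flush() {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.exportBatch(batch);
  }
}

const exportedBatches = [];
const batcher = new SimpleBatcher((batch) => exportedBatches.push(batch), 3);

for (let i = 1; i <= 7; i++) batcher.onEnd({ name: `span-${i}` });
batcher.flush(); // export the final partial batch

console.log(exportedBatches.map((batch) => batch.length)); // [ 3, 3, 1 ]
```

Seven spans result in three export calls instead of seven, which is where the reduction in overhead comes from.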
Why Exporters & Processors Matter for Observability
In large, distributed systems, fine-grained control over how telemetry data is processed and where it’s sent is essential for performance optimization and reliable monitoring. By configuring exporters and processors:
- You ensure that telemetry data is sent to the right places for analysis.
- You can manipulate data for more efficient export (e.g., batching).
- You gain customized control over which data is collected and exported.
Integrating OTel into Vodafone’s Infrastructure
At Vodafone, the shift towards OTel was driven by the need for comprehensive observability across our diverse technology stack. From web applications to a broad array of backend systems, OTel allows us to gather critical telemetry data, ensuring we can effectively monitor and troubleshoot our systems.
OTel Across Vodafone’s Systems
The integration of OTel spans frontend, middleware and backend layers within Vodafone’s infrastructure. This ensures full visibility into the entire request lifecycle, allowing us to monitor performance, detect bottlenecks and optimize the user experience across different systems.
Frontend Channels
- Next.js & React Applications: OTel is integrated into our web applications to capture key performance metrics, including user interactions, page load times and HTTP request performance. This helps us identify frontend bottlenecks and optimize rendering speed and responsiveness.
Middleware & API Layer
- Backend for Frontend (BFF): Vodafone’s BFF services (built in Node.js) act as a dedicated API layer between frontend applications and backend systems. OTel traces requests as they pass through the BFF, giving visibility into how frontend applications communicate with middleware and backend services.
- DXL Middleware & Microservices: The DXL (Digital Experience Layer) acts as Vodafone’s core middleware, handling business logic and interactions with backend services. OTel instruments our microservices (built with Spring Boot, Quarkus) to track API calls, database queries (MongoDB) and inter-service communication.
Backend Systems (Ongoing & Future Plans)
- Databases & Storage: Vodafone Greece is exploring OTel instrumentation for several of our backend systems. The goal is to improve query performance insights, latency analysis and error tracking as we continue enhancing observability.
- Business-Critical Backend Services: As part of our broader observability strategy, we are assessing OTel’s role in monitoring key backend services to help drive greater end-to-end visibility as Vodafone evolves its observability approach.
Integration with Existing Observability Stack
Before adopting OTel, Vodafone relied on custom observability and monitoring solutions that were tightly coupled to specific frameworks and systems. Each system had its own isolated monitoring approach, leading to inconsistencies and limited visibility across the entire infrastructure.
With the adoption of OTel, Vodafone is actively transitioning towards a framework-agnostic, standardized observability approach that aims to provide end-to-end visibility across layers.
- Unified Observability with OTel: By standardizing on OTel, we can now collect, process and export telemetry data consistently, regardless of the underlying technology stack.
- Datadog as the Future of Visualization: While Vodafone previously used custom solutions, we are now shifting towards Datadog as our primary visualization and monitoring tool. OTel’s flexible exporters allow us to seamlessly route observability data into Datadog, enabling better correlation of logs, metrics and traces in a unified dashboard.
- Future-Proofing with Open Standards: OTel provides Vodafone with vendor-neutral observability, allowing us to integrate with multiple backends without reworking our instrumentation. This approach enhances our ability to scale and adapt as our technology stack evolves.
By adopting OTel and Datadog, Vodafone is building a modern, scalable observability stack that unifies monitoring across all services, improves troubleshooting and enhances performance optimization efforts.
How OTel Integrates with Vodafone’s Observability Stack
Vodafone’s integration of OTel was designed to complement and expand our existing observability tools. The key aspects of how OTel fits into our stack are as follows:
- Instrumentation: We apply OTel through both automatic and manual instrumentation:
  - Auto-instrumentation: We use OTel’s auto-instrumentation capabilities to capture data from common services and libraries, such as HTTP requests, database queries and service-to-service communication.
  - Manual instrumentation: Custom instrumentation lets us capture data points that OTel does not track automatically, such as business-specific logic and logging events relevant to our operations.
- Trace Context Propagation: OTel’s context propagation ensures that user requests and interactions are tracked as they move across different services. By injecting trace context into HTTP headers and messages, we can trace each user’s journey from the frontend (React and Next.js apps) all the way to the backend (BFFs, microservices, MongoDB) and vice versa.
- Exporters: OTel provides flexible exporters that send telemetry data to the configured destination. At Vodafone, we primarily use:
  - OTLP Exporter: The OpenTelemetry Protocol (OTLP) exporter sends traces, metrics and logs from our application stack to Datadog for analysis and visualization.
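OTLP exporters are usually pointed at their destination through the standard `OTEL_EXPORTER_OTLP_*` environment variables. The sketch below shows how the OTLP/HTTP traces URL is typically resolved from them, following the rules in the OpenTelemetry specification. `resolveTracesEndpoint` is a hypothetical helper and the collector hostname is made up; the SDK exporters perform this resolution internally, and nothing here reflects Vodafone's actual configuration.

```javascript
// Resolve the OTLP/HTTP traces URL from the standard environment variables.
function resolveTracesEndpoint(env) {
  // A signal-specific endpoint is used exactly as given.
  if (env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) {
    return env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT;
  }
  // Otherwise append the traces path to the base endpoint (default: local collector).
  const base = env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';
  return `${base.replace(/\/+$/, '')}/v1/traces`;
}

console.log(resolveTracesEndpoint({}));
// http://localhost:4318/v1/traces
console.log(resolveTracesEndpoint({ OTEL_EXPORTER_OTLP_ENDPOINT: 'http://otel-collector.example:4318/' }));
// http://otel-collector.example:4318/v1/traces
```

Because the destination lives in configuration rather than in code, switching backends does not require touching the instrumentation.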
Challenges Faced During Integration
Implementing OTel across Vodafone’s complex and large-scale infrastructure presented several challenges:
- Adapting OTel to Vodafone Greece’s Scale: Handling high volumes of telemetry data required fine-tuning sampling rates, batching and data aggregation to ensure optimal performance. We had to balance comprehensive monitoring with system efficiency, avoiding unnecessary overhead while maintaining the depth of insights needed for troubleshooting and optimization.
- Data Consistency Across Microservices: As Vodafone’s backend is composed of microservices, we had to ensure that telemetry data was consistently captured and propagated across services. This included implementing trace context propagation between services running in different languages and frameworks, ensuring a seamless flow of telemetry data from request initiation to backend processing.
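Tuning sampling rates is easier to reason about with a concrete model. The sketch below shows one way a ratio-based sampling decision can be derived from the trace ID, so that every service reaches the same decision for the same trace and traces are kept or dropped as a whole. It is an illustration only: `shouldSample` is a hypothetical function and this is not the exact algorithm used by the OTel SDK samplers.

```javascript
// Illustrative ratio sampler: NOT the SDK's exact algorithm.
// Uses the first 8 hex characters of the trace ID as a number in [0, 2^32).
function shouldSample(traceId, ratio) {
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(exampleTraceId, 1));    // true: keep every trace
console.log(shouldSample(exampleTraceId, 0));    // false: drop every trace
console.log(shouldSample(exampleTraceId, 0.25)); // false: 0x4bf92f35 is above the 25% threshold
```

Because the decision depends only on the trace ID and the ratio, lowering the ratio reduces telemetry volume without leaving traces with missing spans.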
Benefits of OTel at Vodafone
The integration of OTel has provided Vodafone with numerous benefits:
- Unified Observability: OTel has allowed us to standardize our observability framework across layers, giving us full visibility into user interactions, system health and performance.
- End-to-End Tracing: We can now trace user requests from React and Next.js frontend apps, through our Node.js BFF and microservices, all the way to the MongoDB and PostgreSQL databases, providing comprehensive insights into system performance and behavior.
- Improved Troubleshooting: With OTel, we’ve been able to detect issues faster, from slow frontend page loads to backend database query performance. By visualizing traces, logs and metrics together in Datadog, we can more easily identify and fix performance bottlenecks or system failures.
- Scalability and Flexibility: OTel’s vendor-neutral architecture gives us the flexibility to switch between monitoring tools (such as Datadog and Grafana) without the need to re-instrument our code. This scalability ensures that we can adapt as our observability needs evolve.
Conclusion
OpenTelemetry (OTel) has become a key component of Vodafone’s observability strategy, providing us with the tools necessary to monitor and optimize the performance of our systems. Through OTel’s implementation, we've gained deeper visibility into our infrastructure, enabling quicker troubleshooting and ongoing improvements.
As Vodafone continues to scale and evolve, OTel will play an essential role in ensuring we maintain a high-quality user experience, optimize backend operations and swiftly address any operational challenges that arise.