Implementation Documentation for the Observability Solution
Overview
This documentation explains the implementation of our observability stack using OpenTelemetry, Grafana Alloy, Loki, Tempo, and Prometheus. This stack provides comprehensive monitoring and observability for our applications hosted on Azure Kubernetes Service (AKS).
How Grafana Alloy works
Grafana Alloy acts as the primary OpenTelemetry collector, aggregating metrics, logs, and traces from various sources. It processes the collected data and forwards it to Loki, Tempo, and Prometheus for storage and querying.
Configuration of Grafana Alloy
The detailed configuration of Grafana Alloy is available in alloy-values.yaml.
Explanation of the configuration
- otelcol.receiver.otlp: Sets up OTLP receivers for the gRPC and HTTP protocols, defining how metrics, logs, and traces are received from applications.
- otelcol.processor.memory_limiter: Applies a memory limit to the pipeline to control memory usage and keep Alloy stable.
- otelcol.processor.batch: Batches telemetry data before forwarding it, improving performance and efficiency.
- otelcol.exporter.otlp: Configures the OTLP exporter that sends traces to Tempo.
- otelcol.exporter.loki: Configures the exporter that converts OTLP logs and forwards them to the Loki writer.
- loki.write: Defines the endpoint for pushing logs to Loki.
- otelcol.exporter.prometheus: Configures the exporter that converts OTLP metrics and forwards them to the Prometheus remote-write component.
- prometheus.remote_write: Defines the remote-write endpoint for sending metrics to Prometheus.
- logging: Configures the logging level and format for troubleshooting and monitoring Alloy's own operation.
Exposing Alloy services
The following service ports expose Alloy, making it accessible within the Kubernetes cluster:
- name: "grpc"
port: 4317
targetPort: 4317
protocol: "TCP"
- name: "http"
port: 4318
targetPort: 4318
protocol: "TCP"
Within the cluster, the service is reachable at alloy.monitoring.svc.cluster.local (port 4317 for gRPC, port 4318 for HTTP).
Instrumentation of Metrics, Logs, and Traces
To send metrics, logs, and traces to Grafana Alloy, configure your applications with the OpenTelemetry SDKs and point the OTLP exporter at alloy.monitoring.svc.cluster.local:4317 (gRPC) or alloy.monitoring.svc.cluster.local:4318 (HTTP). The Python examples below use the gRPC exporters.
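Before configuring the individual signals, it can help to attach a resource that identifies your service, so that its data is easy to filter in Grafana. The sketch below is illustrative; the service name "my-service" is a placeholder, and each provider in the examples that follow accepts the resource through its resource argument.

```python
from opentelemetry.sdk.resources import Resource

# "my-service" is a placeholder; use your application's name
resource = Resource.create({"service.name": "my-service"})
# Pass it to the providers below, e.g. MeterProvider(resource=resource),
# LoggerProvider(resource=resource), TracerProvider(resource=resource)
```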
Example for Metrics
```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# insecure=True assumes plaintext gRPC inside the cluster
exporter = OTLPMetricExporter(endpoint="alloy.monitoring.svc.cluster.local:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter(__name__)
```
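Once the meter provider is configured, instruments can be created from the meter. The counter name and attributes below are illustrative, not part of our configuration.

```python
# Create a counter and record a data point; name and attributes are illustrative
request_counter = meter.create_counter("http_requests_total", description="Count of HTTP requests")
request_counter.add(1, attributes={"route": "/health", "status_code": 200})
```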
Example for Logs
```python
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# The logs signal lives under the _logs modules in the Python SDK
logger_provider = LoggerProvider()
set_logger_provider(logger_provider)
# insecure=True assumes plaintext gRPC inside the cluster
exporter = OTLPLogExporter(endpoint="alloy.monitoring.svc.cluster.local:4317", insecure=True)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
```
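To route standard-library logging through this pipeline, a LoggingHandler from the SDK can be attached to the root logger; the log message below is illustrative.

```python
import logging
from opentelemetry.sdk._logs import LoggingHandler

# Forward stdlib log records to the OTLP log pipeline configured above
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))
logging.getLogger(__name__).info("application started")
```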
Example for Traces
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# insecure=True assumes plaintext gRPC inside the cluster
exporter = OTLPSpanExporter(endpoint="alloy.monitoring.svc.cluster.local:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
```
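With the tracer registered, spans can be created around units of work; the span and attribute names below are illustrative.

```python
# Start a span around a unit of work; names and attributes are illustrative
with tracer.start_as_current_span("process-request") as span:
    span.set_attribute("http.route", "/health")
    # ... application logic ...
```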
Forwarding Data to Loki, Tempo, and Prometheus
Grafana Alloy processes and forwards the received data to the respective backends.
- Logs are forwarded to Loki at http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
- Traces are forwarded to Tempo at http://tempo.monitoring.svc.cluster.local:4317
- Metrics are forwarded to Prometheus at http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
Using Grafana for Visualization
Grafana is available at https://grafana.inspection.alpha.canada.ca. Access is granted through the agency's GitHub organization.
Setting Up Grafana
- Log in to Grafana using your GitHub credentials.
- Navigate to Dashboards to explore the different folders, such as admin, devsecops, finesse, and nachet.
- Use one of the three data sources:
- Prometheus for metrics
- Loki for logs
- Tempo for traces
Creating Dashboards
- Go to Dashboards > New Dashboard.
- Select the data source (Prometheus, Loki, or Tempo).
- Create visualizations and configure panels as needed.
Current Dashboards
Most dashboards are under the devsecops folder, including:
- Node Health Monitoring
- ArgoCD Activity
- Ingress-NGINX Traffic
- Vulnerability Scans by Trivy and Falco
Future Updates
This documentation will be updated as we incorporate dashboards for client applications such as Fertiscan, Finesse, and Nachet.