Christophe Vila 76fa88e9c3 added README.md

2026-05-29 17:28:22 +02:00

Full OpenTelemetry Observability Stack on Kubernetes

Architecture

┌─────────────────┐      OTLP (gRPC :4317)
│  Quarkus App    │──────────────────────────┐
│  (metrics,      │                          │
│   traces, logs) │                          ▼
│                 │              ┌──────────────────────┐
│ + ServiceMonitor│              │  OpenTelemetry        │
│ + PrometheusRule│              │  Collector            │
└─────────────────┘              │  (gateway mode)       │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                          metrics    │       │       │  logs
                          (remote    │       │       │  (otlphttp
                           write)    │       │       │   → Loki)
                                     │       │       │
                                     ▼       │       ▼
                              ┌──────────┐   │   ┌──────────┐
                              │Prometheus│   │   │  Loki    │
                              │  + Thanos│   │   └──────────┘
                              │  Sidecar │   │
                              └────┬─────┘   │  traces (otlp)
                                   │         │
                          uploads  │         │
                          blocks   │         ▼
                                   ▼      ┌──────────┐
                            ┌──────────┐  │  Tempo   │
                            │  MinIO   │  └──────────┘
                            │  (S3)    │
                            └────┬─────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
             ┌──────────┐ ┌──────────┐ ┌──────────┐
             │  Thanos  │ │  Thanos  │ │  Thanos  │
             │  Store   │ │Compactor │ │  Query   │
             │ Gateway  │ │          │ │          │
             └──────────┘ └──────────┘ └─────┬────┘
                                             │
                                      ┌──────────┐
                                      │ Grafana  │
                                      │(included)│
                                      └──────────┘

Components

| Component | Role | Helm Chart |
|---|---|---|
| MinIO | S3-compatible object storage (Thanos) | minio/minio |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana | prometheus-community/kube-prometheus-stack |
| Thanos | Query + Store Gateway + Compactor | bitnami/thanos |
| Loki | Log aggregation | grafana/loki |
| Tempo | Distributed tracing | grafana/tempo |
| OTel Collector | Unified telemetry pipeline | open-telemetry/opentelemetry-collector |
| Quarkus App | Demo microservice + ServiceMonitor | Custom manifests |

Prerequisites

# A running Kubernetes cluster (minikube, kind, k3s, etc.)
# Helm 3.x installed
# kubectl configured
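
A quick preflight check can confirm that the tools listed above are on the PATH before starting. The following is a minimal sketch; the `missing_tools` helper is a hypothetical name and is not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print each listed command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools kubectl helm)"
if [ -n "$missing" ]; then
  echo "Missing prerequisites:" $missing
else
  echo "kubectl and helm found"
fi
```

This only checks that the binaries exist; whether kubectl can actually reach a cluster still needs to be verified separately, for example with `kubectl cluster-info`.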

Deployment Steps

1. Add Helm Repositories

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update

2. Create Namespace

kubectl create namespace observability

3. Deploy MinIO (S3-compatible object storage)

helm install minio minio/minio \
  -n observability \
  -f helm-values/minio-values.yaml \
  --version 5.3.0 \
  --wait

4. Create the Thanos Object Storage Secret

# Pre-configured to point to the in-cluster MinIO — no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
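
For orientation, a Thanos object storage configuration for an S3-compatible backend generally has the shape printed below. This is a sketch, not the contents of `k8s/thanos-objstore-secret.yaml`: the `render_objstore_config` helper, the bucket name `thanos`, the endpoint `minio.observability.svc.cluster.local:9000`, and the credentials are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: print an S3-type Thanos objstore config to stdout.
# Bucket, endpoint and credentials are assumed defaults, not values
# read from this repository.
render_objstore_config() {
  cat <<EOF
type: S3
config:
  bucket: ${1:-thanos}
  endpoint: ${2:-minio.observability.svc.cluster.local:9000}
  access_key: ${3:-minio}
  secret_key: ${4:-minio123}
  insecure: true   # plain HTTP, typical for an in-cluster MinIO
EOF
}

render_objstore_config
```

The Secret name and key that Prometheus and Thanos expect are defined in the values files, so check those before pointing the stack at a different backend.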

5. Deploy kube-prometheus-stack (with Thanos Sidecar)

helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n observability \
  -f helm-values/kube-prometheus-stack-values.yaml \
  --version 65.1.1 \
  --wait

6. Deploy Thanos (Query + Store Gateway + Compactor)

helm install thanos bitnami/thanos \
  -n observability \
  -f helm-values/thanos-values.yaml \
  --version 15.7.25 \
  --wait

7. Deploy Loki

helm install loki grafana/loki \
  -n observability \
  -f helm-values/loki-values.yaml \
  --version 6.16.0 \
  --wait

8. Deploy Tempo

helm install tempo grafana/tempo \
  -n observability \
  -f helm-values/tempo-values.yaml \
  --version 1.10.3 \
  --wait

9. Deploy OpenTelemetry Collector

helm install otel-collector open-telemetry/opentelemetry-collector \
  -n observability \
  -f helm-values/otel-collector-values.yaml \
  --version 0.108.0 \
  --wait

10. Configure Grafana Datasources

kubectl apply -f k8s/grafana-datasources.yaml -n observability
# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability

11. Build and Deploy the Quarkus App

# Option A: Build locally and push to your registry
cd quarkus-app
docker build -t your-registry/otel-quarkus-demo:latest .
docker push your-registry/otel-quarkus-demo:latest

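# Option B: For local clusters (minikube/kind), load directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
# kind:     docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest
# Note: for a :latest tag Kubernetes defaults imagePullPolicy to Always, so a locally
# loaded image is only used if the manifest sets imagePullPolicy to IfNotPresent or Never.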

# Deploy (includes ServiceMonitor + PrometheusRule; monitoring config ships with the app).
# If you ran `cd quarkus-app` above, return to the directory that contains k8s/ first.
kubectl apply -f k8s/quarkus-app.yaml -n observability

12. Access Grafana

# Port-forward Grafana (blocks the terminal; leave it running)
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability

# Default credentials: admin / prom-operator
# Open http://localhost:3000
# Two Prometheus datasources available:
#   - "Prometheus" → local (7-day retention)
#   - "Thanos"     → long-term via Thanos Query

# Port-forward MinIO Console in a second terminal (optional: inspect Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001  (minio / minio123)

13. Generate Traffic

# Port-forward the Quarkus app (blocks the terminal; leave it running)
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability

# In a second terminal, run the traffic generator
bash scripts/generate-traffic.sh

Troubleshooting Session Guide

Scenario 1: High Latency Investigation (Traces)

1. Open Grafana → Explore → Tempo.
2. Run the query: { resource.service.name = "otel-quarkus-demo" }
3. Look for long-duration traces on the /api/orders endpoint.
4. Drill into the spans: the processOrder span has a simulated delay.
5. Check the span attributes: order.item_count, order.total_price, order.processing_type
6. Notice that orders with processing_type=complex take longer.
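
To narrow the search to slow requests only, the TraceQL query from the steps above can be extended with a duration filter. The sketch below builds such a query; the `slow_traces_query` helper and the 500ms default threshold are illustrative assumptions, not values taken from the demo app.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a TraceQL query that matches traces of a
# service containing at least one span longer than the given threshold.
slow_traces_query() {
  service="$1"
  threshold="${2:-500ms}"   # assumed default; tune it to the simulated delay
  printf '{ resource.service.name = "%s" && duration > %s }\n' "$service" "$threshold"
}

slow_traces_query otel-quarkus-demo
# prints: { resource.service.name = "otel-quarkus-demo" && duration > 500ms }
```

The printed query can be pasted into the TraceQL editor in Grafana → Explore → Tempo.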

Scenario 2: Error Rate Spike (Metrics + Logs)

1. Grafana → Explore → Prometheus
2. Query: rate(http_server_requests_seconds_count{status=~"5.."}[5m])
3. Compare with: rate(http_server_requests_seconds_count{status="200"}[5m])
4. Notice that the /api/orders endpoint returns occasional 500s.
5. Switch to Loki and correlate:

   {service_name="otel-quarkus-demo"} |= "ERROR"

6. Find the error logs, then click the TraceID derived field to jump directly to the trace in Tempo.

Scenario 3: Custom Business Metrics

1. Grafana → Explore → Prometheus
2. Query custom metrics:
   - orders_total: total orders processed (counter)
   - orders_amount_total: total revenue (counter)
   - order_processing_duration_seconds: order processing time (histogram)
   - inventory_level: current inventory per product (gauge)
3. Build a dashboard:
   - Orders/sec: rate(orders_total[5m])
   - Revenue/min: rate(orders_amount_total[1m]) * 60
   - P99 latency: histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))
   - Inventory levels: inventory_level

Scenario 4: Correlating Logs ↔ Traces ↔ Metrics

1. Start from a metric alert: high error rate on orders.
2. In Loki, filter by time range and find the error logs:
```
{service_name="otel-quarkus-demo"} |= "ERROR"
```
3. Click the TraceID link on any log line (Loki extracts it from OTLP structured metadata).
4. In Tempo, search by trace ID to see the full request flow.
5. In the trace, find span events with error details.
6. Check related metrics for that time window to see the broader impact.

Useful PromQL Queries

# Request rate by endpoint
rate(http_server_requests_seconds_count[5m])

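# Error rate percentage
# (sum both sides first: without aggregation each 5xx series is matched
#  against itself on identical labels and the ratio is always 100)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))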

# P95 latency by endpoint
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))

# Custom: orders per second by status
rate(orders_total[5m])

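# Custom: average order value
# (summed on both sides so the division still matches if the two counters carry different labels)
sum(rate(orders_amount_total[5m])) / sum(rate(orders_total[5m]))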
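
As a sanity check on the ratio queries above, the arithmetic can be worked by hand. The numbers below are made up for illustration (0.4 failed requests/s out of 20 requests/s, and 150 currency units/s of revenue over 3 orders/s); they are not measurements from the demo app, and the `percent` and `ratio` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative numbers only, not taken from the demo app.
# percent NUM DEN  -> 100 * NUM / DEN, one decimal place
percent() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'; }
# ratio NUM DEN    -> NUM / DEN, two decimal places
ratio()   { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'; }

percent 0.4 20   # error rate percentage: prints 2.0
ratio 150 3      # average order value:   prints 50.00
```

If a panel built on these queries shows values far from what such a back-of-the-envelope calculation suggests, label matching between the numerator and the denominator is a likely place to look.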

Useful LogQL Queries

# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}

# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"

# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"

# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"

# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
