Full OpenTelemetry Observability Stack on Kubernetes
Architecture
┌─────────────────┐        OTLP (gRPC :4317)
│   Quarkus App   │──────────────────────────┐
│   (metrics,     │                          │
│   traces, logs) │                          ▼
│                 │              ┌───────────────────────┐
│ + ServiceMonitor│              │     OpenTelemetry     │
│ + PrometheusRule│              │       Collector       │
└─────────────────┘              │    (gateway mode)     │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                           metrics   │       │       │  logs
                           (remote   │       │       │  (otlphttp
                           write)    │       │       │  → Loki)
                                     │       │       │
                                     ▼       │       ▼
                                ┌──────────┐ │  ┌──────────┐
                                │Prometheus│ │  │   Loki   │
                                │ + Thanos │ │  └──────────┘
                                │ Sidecar  │ │
                                └────┬─────┘ │  traces (otlp)
                                     │       │
                           uploads   │       │
                           blocks    │       ▼
                                     ▼       ┌──────────┐
                                ┌──────────┐ │  Tempo   │
                                │  MinIO   │ └──────────┘
                                │   (S3)   │
                                └────┬─────┘
                                     │
                        ┌────────────┼────────────┐
                        ▼            ▼            ▼
                   ┌──────────┐ ┌──────────┐ ┌──────────┐
                   │  Thanos  │ │  Thanos  │ │  Thanos  │
                   │  Store   │ │Compactor │ │  Query   │
                   │ Gateway  │ │          │ │          │
                   └──────────┘ └──────────┘ └─────┬────┘
                                                   │
                                             ┌──────────┐
                                             │ Grafana  │
                                             │(included)│
                                             └──────────┘
Components
| Component | Role | Helm Chart |
|---|---|---|
| MinIO | S3-compatible object storage (Thanos) | minio/minio |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana | prometheus-community/kube-prometheus-stack |
| Thanos | Query + Store Gateway + Compactor | bitnami/thanos |
| Loki | Log aggregation | grafana/loki |
| Tempo | Distributed tracing | grafana/tempo |
| OTel Collector | Unified telemetry pipeline | open-telemetry/opentelemetry-collector |
| Quarkus App | Demo microservice + ServiceMonitor | Custom manifests |
Prerequisites
# A running Kubernetes cluster (minikube, kind, k3s, etc.)
# Helm 3.x installed
# kubectl configured
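The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and it only confirms that the commands are on `PATH`, not that the cluster is reachable or that Helm is version 3.

```shell
# Print each command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not executed here):
#   missing_tools kubectl helm || echo "install the tools listed above first"
```

If the check passes, `kubectl cluster-info` and `helm version` are reasonable next steps to confirm connectivity and the Helm major version.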
Deployment Steps
1. Add Helm Repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update
2. Create Namespace
kubectl create namespace observability
3. Deploy MinIO (S3-compatible object storage)
helm install minio minio/minio \
-n observability \
-f helm-values/minio-values.yaml \
--version 5.3.0 \
--wait
4. Create the Thanos Object Storage Secret
# Pre-configured to point to the in-cluster MinIO — no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
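For readers who want to see roughly what such a secret carries, here is a sketch that renders a Thanos S3 object storage config. The values in the usage comment are assumptions, not taken from `k8s/thanos-objstore-secret.yaml`: the bucket name `thanos`, the endpoint `minio.observability.svc:9000`, the secret name `thanos-objstore`, and the reuse of the MinIO console credentials for S3 access are all illustrative. `render_objstore_config` is a hypothetical helper.

```shell
# Render a Thanos object storage config for an S3-compatible endpoint.
# Arguments: bucket, endpoint (host:port), access key, secret key.
# "insecure: true" selects plain HTTP, which suits an in-cluster MinIO without TLS.
render_objstore_config() {
  cat <<EOF
type: S3
config:
  bucket: $1
  endpoint: $2
  access_key: $3
  secret_key: $4
  insecure: true
EOF
}

# Example with assumed values (not executed here):
#   render_objstore_config thanos minio.observability.svc:9000 minio minio123 > objstore.yml
#   kubectl create secret generic thanos-objstore \
#     --from-file=objstore.yml -n observability
```

The manifest shipped in the repository remains the source of truth; this only illustrates the shape of the config that the Thanos Sidecar, Store Gateway, and Compactor all read.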
5. Deploy kube-prometheus-stack (with Thanos Sidecar)
helm install kube-prom prometheus-community/kube-prometheus-stack \
-n observability \
-f helm-values/kube-prometheus-stack-values.yaml \
--version 65.1.1 \
--wait
6. Deploy Thanos (Query + Store Gateway + Compactor)
helm install thanos bitnami/thanos \
-n observability \
-f helm-values/thanos-values.yaml \
--version 15.7.25 \
--wait
7. Deploy Loki
helm install loki grafana/loki \
-n observability \
-f helm-values/loki-values.yaml \
--version 6.16.0 \
--wait
8. Deploy Tempo
helm install tempo grafana/tempo \
-n observability \
-f helm-values/tempo-values.yaml \
--version 1.10.3 \
--wait
9. Deploy OpenTelemetry Collector
helm install otel-collector open-telemetry/opentelemetry-collector \
-n observability \
-f helm-values/otel-collector-values.yaml \
--version 0.108.0 \
--wait
10. Configure Grafana Datasources
kubectl apply -f k8s/grafana-datasources.yaml -n observability
# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability
11. Build and Deploy the Quarkus App
# Option A: Build locally and push to your registry
cd quarkus-app
docker build -t your-registry/otel-quarkus-demo:latest .
docker push your-registry/otel-quarkus-demo:latest
# Option B: For local clusters (minikube/kind), load directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
# kind: docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest
# Deploy (includes ServiceMonitor + PrometheusRule — monitoring config ships with the app)
kubectl apply -f k8s/quarkus-app.yaml -n observability
12. Access Grafana
# Port-forward Grafana
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability
# Default credentials: admin / prom-operator
# Open http://localhost:3000
# Two Prometheus datasources available:
# - "Prometheus" → local (7-day retention)
# - "Thanos" → long-term via Thanos Query
# Port-forward MinIO Console (optional — inspect Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001 (minio / minio123)
13. Generate Traffic
# Port-forward the Quarkus app
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability
# Run the traffic generator
bash scripts/generate-traffic.sh
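If you want to drive load without the bundled script, a loop like the following may be enough. This is a sketch under assumptions: it does not reproduce `scripts/generate-traffic.sh`, it assumes `/api/orders` answers plain GET requests, and `generate_traffic` and `fetch_status` are hypothetical names.

```shell
# Print the HTTP status code returned for a URL.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Send COUNT requests to URL and print a tally of the status codes seen.
# The optional third argument names the command used to fetch one status code.
generate_traffic() {
  count=$1
  url=$2
  fetch=${3:-fetch_status}
  i=0
  while [ "$i" -lt "$count" ]; do
    "$fetch" "$url" || true
    echo
    i=$((i + 1))
  done | sort | uniq -c
}

# Example (not executed here; requires the port-forward above):
#   generate_traffic 100 http://localhost:8080/api/orders
```

The tally gives a quick view of how many requests returned 200 versus 500 before you switch to Grafana.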
Troubleshooting Session Guide
Scenario 1: High Latency Investigation (Traces)
- Open Grafana → Explore → Tempo
- Run query:
  { resource.service.name = "otel-quarkus-demo" }
- Look for traces with high duration on the /api/orders endpoint
- Drill into spans: the processOrder span has a simulated delay
- Check span attributes: order.item_count, order.total_price, order.processing_type
- Notice: orders with processing_type=complex take longer
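To narrow the search to slow traces only, TraceQL lets you add a `duration` condition to the same selector. The helper below is a sketch; `slow_traces_query` is a hypothetical name and the 500ms threshold in the example is an arbitrary choice, not a value from the demo app.

```shell
# Build a TraceQL query for one service's traces slower than a threshold.
# Arguments: service name, minimum duration (for example 500ms or 2s).
slow_traces_query() {
  printf '{ resource.service.name = "%s" && duration > %s }\n' "$1" "$2"
}

# Example (not executed here): print the query, then paste it into Explore → Tempo.
#   slow_traces_query otel-quarkus-demo 500ms
```

The same string can be sent to Tempo's HTTP search API with `curl -G "$TEMPO_URL/api/search" --data-urlencode "q=$(slow_traces_query otel-quarkus-demo 500ms)"`, where `TEMPO_URL` is a placeholder for wherever you have exposed Tempo's HTTP port.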
Scenario 2: Error Rate Spike (Metrics + Logs)
- Grafana → Explore → Prometheus
- Query:
  rate(http_server_requests_seconds_count{status=~"5.."}[5m])
- Compare with:
  rate(http_server_requests_seconds_count{status="200"}[5m])
- Notice the /api/orders endpoint has occasional 500s
- Switch to Loki and correlate:
  {service_name="otel-quarkus-demo"} |= "ERROR"
- Find the error logs, then click the TraceID derived field to jump directly to the trace in Tempo
Scenario 3: Custom Business Metrics
- Grafana → Explore → Prometheus
- Query custom metrics:
  - orders_total: total orders processed (counter)
  - orders_amount_total: total revenue (counter)
  - order_processing_duration_seconds: order processing time (histogram)
  - inventory_level: current inventory per product (gauge)
- Build a dashboard:
  - Orders/sec:
    rate(orders_total[5m])
  - Revenue/min:
    rate(orders_amount_total[1m]) * 60
  - P99 latency:
    histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))
  - Inventory levels:
    inventory_level
Scenario 4: Correlating Logs ↔ Traces ↔ Metrics
- Start from a metric alert: high error rate on orders
- In Loki, filter by time range and find error logs:
  {service_name="otel-quarkus-demo"} |= "ERROR"
- Click the TraceID link on any log line (Loki extracts it from OTLP structured metadata)
- In Tempo, search by trace ID to see the full request flow
- In the trace, find span events with error details
- Check related metrics for that time window to see broader impact
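When working from exported log text rather than the Grafana UI, the log-to-trace hop can be sketched in shell. This assumes trace IDs appear in the text as 32 lowercase hex characters (the W3C Trace Context format); the sample log line in the usage comment and both function names are hypothetical.

```shell
# Read log text on stdin and print each distinct 32-hex-character run once.
# Note: a longer hex string would also yield a match on its first 32 characters.
extract_trace_ids() {
  grep -oE '[0-9a-f]{32}' | sort -u
}

# Build a LogQL query selecting one service's log lines for one trace ID.
# Arguments: service name, trace ID.
logql_for_trace() {
  printf '{service_name="%s"} | trace_id = "%s"\n' "$1" "$2"
}

# Example (not executed here):
#   echo 'ERROR traceId=4bf92f3577b34da6a3ce929d0e0e4736 order failed' | extract_trace_ids
#   logql_for_trace otel-quarkus-demo 4bf92f3577b34da6a3ce929d0e0e4736
```

The extracted ID can be pasted into Tempo's trace ID search, and the generated LogQL query returns every log line recorded for that request.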
Useful PromQL Queries
# Request rate by endpoint
rate(http_server_requests_seconds_count[5m])
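# Error rate percentage (aggregate both sides; dividing the raw series
# would match each 5xx series against itself and always yield 100)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count[5m]))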
# P95 latency by endpoint
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))
# Custom: orders per second by status
rate(orders_total[5m])
# Custom: average order value
rate(orders_amount_total[5m]) / rate(orders_total[5m])
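These queries can also be run outside Grafana through the Prometheus HTTP API. The sketch below builds an error-percentage expression for a chosen rate window, aggregating both sides with `sum()` so the ratio covers all series. `error_rate_query` is a hypothetical helper and `PROM_URL` is a placeholder for a Prometheus or Thanos Query endpoint you have exposed yourself.

```shell
# Build a PromQL expression for the percentage of requests answered with 5xx.
# Argument: rate window (defaults to 5m).
error_rate_query() {
  window=${1:-5m}
  metric=http_server_requests_seconds_count
  printf '100 * sum(rate(%s{status=~"5.."}[%s])) / sum(rate(%s[%s]))\n' \
    "$metric" "$window" "$metric" "$window"
}

# Example (not executed here):
#   curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(error_rate_query 1m)"
```

Using `curl -G` with `--data-urlencode` avoids hand-escaping the braces and quotes in the expression.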
Useful LogQL Queries
# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}
# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"
# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"
# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"
# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"