# Full OpenTelemetry Observability Stack on Kubernetes

## Architecture

```
┌─────────────────┐     OTLP (gRPC :4317)
│  Quarkus App    │──────────────────────────┐
│  (metrics,      │                          │
│   traces, logs) │                          ▼
│                 │              ┌───────────────────────┐
│ + ServiceMonitor│              │   OpenTelemetry       │
│ + PrometheusRule│              │   Collector           │
└─────────────────┘              │   (gateway mode)      │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                          metrics    │       │       │  logs
                          (remote    │       │       │  (otlphttp
                           write)    │       │       │   → Loki)
                                     │       │       │
                                     ▼       │       ▼
                             ┌──────────┐    │  ┌──────────┐
                             │Prometheus│    │  │   Loki   │
                             │  + Thanos│    │  └──────────┘
                             │  Sidecar │    │
                             └────┬─────┘    │  traces (otlp)
                                  │          │
                         uploads  │          │
                         blocks   │          ▼
                                  ▼       ┌──────────┐
                             ┌──────────┐ │  Tempo   │
                             │  MinIO   │ └──────────┘
                             │  (S3)    │
                             └────┬─────┘
                                  │
                     ┌────────────┼────────────┐
                     ▼            ▼            ▼
                ┌──────────┐ ┌──────────┐ ┌──────────┐
                │  Thanos  │ │  Thanos  │ │  Thanos  │
                │  Store   │ │Compactor │ │  Query   │
                │ Gateway  │ │          │ │          │
                └──────────┘ └──────────┘ └─────┬────┘
                                                │
                                          ┌──────────┐
                                          │ Grafana  │
                                          │(included)│
                                          └──────────┘
```

## Components

| Component             | Role                                  | Helm Chart                                 |
|-----------------------|---------------------------------------|--------------------------------------------|
| MinIO                 | S3-compatible object storage (Thanos) | minio/minio                                |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana | prometheus-community/kube-prometheus-stack |
| Thanos                | Query + Store Gateway + Compactor     | bitnami/thanos                             |
| Loki                  | Log aggregation                       | grafana/loki                               |
| Tempo                 | Distributed tracing                   | grafana/tempo                              |
| OTel Collector        | Unified telemetry pipeline            | open-telemetry/opentelemetry-collector     |
| Quarkus App           | Demo microservice + ServiceMonitor    | Custom manifests                           |

## Prerequisites

```bash
# A running Kubernetes cluster (minikube, kind, k3s, etc.)
# Helm 3.x installed
# kubectl configured
```

## Deployment Steps

### 1. Add Helm Repositories

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace observability
```

### 3. Deploy MinIO (S3-compatible object storage)

```bash
helm install minio minio/minio \
  -n observability \
  -f helm-values/minio-values.yaml \
  --version 5.3.0 \
  --wait
```

### 4. Create the Thanos Object Storage Secret

```bash
# Pre-configured to point to the in-cluster MinIO; no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
```

### 5. Deploy kube-prometheus-stack (with Thanos Sidecar)

```bash
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n observability \
  -f helm-values/kube-prometheus-stack-values.yaml \
  --version 65.1.1 \
  --wait
```

### 6. Deploy Thanos (Query + Store Gateway + Compactor)

```bash
helm install thanos bitnami/thanos \
  -n observability \
  -f helm-values/thanos-values.yaml \
  --version 15.7.25 \
  --wait
```

### 7. Deploy Loki

```bash
helm install loki grafana/loki \
  -n observability \
  -f helm-values/loki-values.yaml \
  --version 6.16.0 \
  --wait
```

### 8. Deploy Tempo

```bash
helm install tempo grafana/tempo \
  -n observability \
  -f helm-values/tempo-values.yaml \
  --version 1.10.3 \
  --wait
```

### 9. Deploy OpenTelemetry Collector

```bash
helm install otel-collector open-telemetry/opentelemetry-collector \
  -n observability \
  -f helm-values/otel-collector-values.yaml \
  --version 0.108.0 \
  --wait
```

### 10. Configure Grafana Datasources

```bash
kubectl apply -f k8s/grafana-datasources.yaml -n observability

# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability
```

### 11. Build and Deploy the Quarkus App

```bash
# Option A: Build locally and push to your registry
cd quarkus-app
docker build -t your-registry/otel-quarkus-demo:latest .
docker push your-registry/otel-quarkus-demo:latest

# Option B: For local clusters (minikube/kind), load directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
# kind: docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest

# Return to the repository root, where the k8s/ manifests used in earlier steps live
cd ..

# Deploy (includes ServiceMonitor + PrometheusRule; monitoring config ships with the app)
kubectl apply -f k8s/quarkus-app.yaml -n observability
```

### 12. Access Grafana

```bash
# Port-forward Grafana (blocks; run in its own terminal)
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability
# Default credentials: admin / prom-operator
# Open http://localhost:3000

# Two Prometheus datasources available:
# - "Prometheus" → local (7-day retention)
# - "Thanos" → long-term via Thanos Query

# Port-forward MinIO Console (optional: inspect Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001 (minio / minio123)
```

### 13. Generate Traffic

```bash
# Port-forward the Quarkus app (blocks; run in its own terminal)
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability

# In a second terminal, run the traffic generator
bash scripts/generate-traffic.sh
```

---

## Troubleshooting Session Guide

### Scenario 1: High Latency Investigation (Traces)

1. **Open Grafana → Explore → Tempo**
2. Run query: `{ resource.service.name = "otel-quarkus-demo" }`
3. Look for traces with high duration on the `/api/orders` endpoint
4. Drill into spans: the `processOrder` span has a simulated delay
5. Check span attributes: `order.item_count`, `order.total_price`, `order.processing_type`
6. Notice: orders with `processing_type=complex` take longer

### Scenario 2: Error Rate Spike (Metrics + Logs)

1. **Grafana → Explore → Prometheus**
2. Query: `rate(http_server_requests_seconds_count{status=~"5.."}[5m])`
3. Compare with: `rate(http_server_requests_seconds_count{status="200"}[5m])`
4. Notice the `/api/orders` endpoint has occasional 500s
5. **Switch to Loki** and correlate:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
6. Find the error logs, then click the `TraceID` derived field to jump directly to the trace in Tempo

### Scenario 3: Custom Business Metrics

1. **Grafana → Explore → Prometheus**
2. Query custom metrics:
   - `orders_total`: total orders processed (counter)
   - `orders_amount_total`: total revenue (counter)
   - `order_processing_duration_seconds`: order processing time (histogram)
   - `inventory_level`: current inventory per product (gauge)
3. Build a dashboard:
   - Orders/sec: `rate(orders_total[5m])`
   - Revenue/min: `rate(orders_amount_total[1m]) * 60`
   - P99 latency: `histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))`
   - Inventory levels: `inventory_level`

### Scenario 4: Correlating Logs ↔ Traces ↔ Metrics

1. Start from a **metric alert**: high error rate on orders
2. In **Loki**, filter by time range and find error logs:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
3. Click the `TraceID` link on any log line (Loki extracts it from OTLP structured metadata)
4. In **Tempo**, search by trace ID to see the full request flow
5. In the trace, find span events with error details
6. Check related **metrics** for that time window to see broader impact

### Useful PromQL Queries

```promql
# Request rate by endpoint
rate(http_server_requests_seconds_count[5m])

# Error rate percentage
# (both sides are aggregated with sum(); without it, each 5xx series is
#  matched against itself and the ratio is always 100)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# P95 latency by endpoint
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))

# Custom: orders per second by status
rate(orders_total[5m])

# Custom: average order value
rate(orders_amount_total[5m]) / rate(orders_total[5m])
```

### Useful LogQL Queries

```logql
# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}

# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"

# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"

# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"

# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
```
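
The LogQL queries above all share one stream selector and differ only in the filter appended to it, so they are easy to template when pasting trace IDs repeatedly. The sketch below is a minimal illustration and not part of this repository: the helper names `logql_selector`, `logql_for_trace`, and `logql_errors` are hypothetical, and the default service name is taken from the queries above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with this repo) that print the LogQL
# queries from the section above. They only build strings; nothing is sent.

# Stream selector for a service; defaults to the demo app's service name.
logql_selector() {
  local service="${1:-otel-quarkus-demo}"
  printf '{service_name="%s"}' "$service"
}

# Filter by trace ID stored in OTLP structured metadata.
# Usage: logql_for_trace <trace-id> [service]
logql_for_trace() {
  local trace_id="$1" service="${2:-otel-quarkus-demo}"
  printf '%s | trace_id = "%s"' "$(logql_selector "$service")" "$trace_id"
}

# Errors only, using the same line filter as the "Errors only" query.
# Usage: logql_errors [service]
logql_errors() {
  printf '%s |= "ERROR"' "$(logql_selector "${1:-otel-quarkus-demo}")"
}

# Example (assumption: Loki's HTTP API is reachable on localhost:3100, which
# this guide does not set up; a tenant header may also be required depending
# on helm-values/loki-values.yaml):
#   curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
#     --data-urlencode "query=$(logql_for_trace your-trace-id-here)"
```

The printed string can also be pasted directly into Grafana → Explore → Loki, which avoids any assumption about how Loki is exposed.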