diff --git a/README.md b/README.md
new file mode 100644
index 0000000..8fcfd81
--- /dev/null
+++ b/README.md
@@ -0,0 +1,294 @@
+# Full OpenTelemetry Observability Stack on Kubernetes
+
+## Architecture
+
+```
+┌─────────────────┐     OTLP (gRPC :4317)
+│  Quarkus App    │──────────────────────────┐
+│  (metrics,      │                          │
+│   traces, logs) │                          ▼
+│                 │              ┌───────────────────────┐
+│ + ServiceMonitor│              │     OpenTelemetry     │
+│ + PrometheusRule│              │       Collector       │
+└─────────────────┘              │    (gateway mode)     │
+                                 └───┬───────┬───────┬───┘
+                                     │       │       │
+                           metrics   │       │       │   logs
+                           (remote   │       │       │   (otlphttp
+                            write)   │       │       │    → Loki)
+                                     │       │       │
+                                     ▼       │       ▼
+                                ┌──────────┐ │ ┌──────────┐
+                                │Prometheus│ │ │   Loki   │
+                                │  + Thanos│ │ └──────────┘
+                                │  Sidecar │ │
+                                └────┬─────┘ │  traces (otlp)
+                                     │       │
+                           uploads   │       │
+                           blocks    │       ▼
+                                     ▼    ┌──────────┐
+                             ┌──────────┐ │  Tempo   │
+                             │  MinIO   │ └──────────┘
+                             │  (S3)    │
+                             └────┬─────┘
+                                  │
+                     ┌────────────┼────────────┐
+                     ▼            ▼            ▼
+               ┌──────────┐ ┌──────────┐ ┌──────────┐
+               │  Thanos  │ │  Thanos  │ │  Thanos  │
+               │  Store   │ │Compactor │ │  Query   │
+               │ Gateway  │ │          │ │          │
+               └──────────┘ └──────────┘ └─────┬────┘
+                                               │
+                                         ┌──────────┐
+                                         │ Grafana  │
+                                         │(included)│
+                                         └──────────┘
+```
+
+## Components
+
+| Component             | Role                                    | Helm Chart                                 |
+|-----------------------|-----------------------------------------|--------------------------------------------|
+| MinIO                 | S3-compatible object storage (Thanos)   | minio/minio                                |
+| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana   | prometheus-community/kube-prometheus-stack |
+| Thanos                | Query + Store Gateway + Compactor       | bitnami/thanos                             |
+| Loki                  | Log aggregation                         | grafana/loki                               |
+| Tempo                 | Distributed tracing                     | grafana/tempo                              |
+| OTel Collector        | Unified telemetry pipeline              | open-telemetry/opentelemetry-collector     |
+| Quarkus App           | Demo microservice + ServiceMonitor      | Custom manifests                           |
+
+## Prerequisites
+
+```bash
+# A running Kubernetes cluster (minikube, kind, k3s, etc.)
+# Helm 3.x installed
+# kubectl configured
+```
+
+## Deployment Steps
+
+### 1. Add Helm Repositories
+
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo add grafana https://grafana.github.io/helm-charts
+helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
+helm repo add bitnami https://charts.bitnami.com/bitnami
+helm repo add minio https://charts.min.io/
+helm repo update
+```
+
+### 2. Create Namespace
+
+```bash
+kubectl create namespace observability
+```
+
+### 3. Deploy MinIO (S3-compatible object storage)
+
+```bash
+helm install minio minio/minio \
+  -n observability \
+  -f helm-values/minio-values.yaml \
+  --version 5.3.0 \
+  --wait
+```
+
+### 4. Create the Thanos Object Storage Secret
+
+```bash
+# Pre-configured to point to the in-cluster MinIO; no edits needed.
+kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
+```
+
+### 5. Deploy kube-prometheus-stack (with Thanos Sidecar)
+
+```bash
+helm install kube-prom prometheus-community/kube-prometheus-stack \
+  -n observability \
+  -f helm-values/kube-prometheus-stack-values.yaml \
+  --version 65.1.1 \
+  --wait
+```
+
+### 6. Deploy Thanos (Query + Store Gateway + Compactor)
+
+```bash
+helm install thanos bitnami/thanos \
+  -n observability \
+  -f helm-values/thanos-values.yaml \
+  --version 15.7.25 \
+  --wait
+```
+
+### 7. Deploy Loki
+
+```bash
+helm install loki grafana/loki \
+  -n observability \
+  -f helm-values/loki-values.yaml \
+  --version 6.16.0 \
+  --wait
+```
+
+### 8. Deploy Tempo
+
+```bash
+helm install tempo grafana/tempo \
+  -n observability \
+  -f helm-values/tempo-values.yaml \
+  --version 1.10.3 \
+  --wait
+```
+
+### 9. Deploy OpenTelemetry Collector
+
+```bash
+helm install otel-collector open-telemetry/opentelemetry-collector \
+  -n observability \
+  -f helm-values/otel-collector-values.yaml \
+  --version 0.108.0 \
+  --wait
+```
+
+### 10. Configure Grafana Datasources
+
+```bash
+kubectl apply -f k8s/grafana-datasources.yaml -n observability
+# Restart Grafana to pick up the new datasources
+kubectl rollout restart deployment kube-prom-grafana -n observability
+```
+
+### 11. Build and Deploy the Quarkus App
+
+```bash
+# Option A: Build locally and push to your registry
+cd quarkus-app
+docker build -t your-registry/otel-quarkus-demo:latest .
+docker push your-registry/otel-quarkus-demo:latest
+
+# Option B: For local clusters (minikube/kind), load directly
+# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
+# kind: docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest
+
+# Deploy (includes ServiceMonitor + PrometheusRule; monitoring config ships with the app)
+kubectl apply -f k8s/quarkus-app.yaml -n observability
+```
+
+### 12. Access Grafana
+
+```bash
+# Port-forward Grafana (runs in the foreground; use a separate terminal for each port-forward)
+kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability
+
+# Default credentials: admin / prom-operator
+# Open http://localhost:3000
+# Two Prometheus datasources available:
+# - "Prometheus" → local (7-day retention)
+# - "Thanos" → long-term via Thanos Query
+
+# Port-forward MinIO Console (optional, to inspect Thanos blocks)
+kubectl port-forward svc/minio-console 9001:9001 -n observability
+# Open http://localhost:9001 (minio / minio123)
+```
+
+### 13. Generate Traffic
+
+```bash
+# Port-forward the Quarkus app (runs in the foreground; keep it open)
+kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability
+
+# Run the traffic generator from a second terminal
+bash scripts/generate-traffic.sh
+```
+
+---
+
+## Troubleshooting Session Guide
+
+### Scenario 1: High Latency Investigation (Traces)
+
+1. **Open Grafana → Explore → Tempo**
+2. Run query: `{ resource.service.name = "otel-quarkus-demo" }`
+3. Look for traces with high duration on the `/api/orders` endpoint
+4. Drill into spans: the `processOrder` span has a simulated delay
+5. Check span attributes: `order.item_count`, `order.total_price`, `order.processing_type`
+6. Notice: orders with `processing_type=complex` take longer
+
+### Scenario 2: Error Rate Spike (Metrics + Logs)
+
+1. **Grafana → Explore → Prometheus**
+2. Query: `rate(http_server_requests_seconds_count{status=~"5.."}[5m])`
+3. Compare with: `rate(http_server_requests_seconds_count{status="200"}[5m])`
+4. Notice the `/api/orders` endpoint has occasional 500s
+5. **Switch to Loki** and correlate:
+   ```
+   {service_name="otel-quarkus-demo"} |= "ERROR"
+   ```
+6. Find the error logs, then click the `TraceID` derived field to jump directly to the trace in Tempo
+
+### Scenario 3: Custom Business Metrics
+
+1. **Grafana → Explore → Prometheus**
+2. Query custom metrics:
+   - `orders_total`: total orders processed (counter)
+   - `orders_amount_total`: total revenue (counter)
+   - `order_processing_duration_seconds`: order processing time (histogram)
+   - `inventory_level`: current inventory per product (gauge)
+3. Build a dashboard:
+   - Orders/sec: `rate(orders_total[5m])`
+   - Revenue/min: `rate(orders_amount_total[1m]) * 60`
+   - P99 latency: `histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))`
+   - Inventory levels: `inventory_level`
+
+### Scenario 4: Correlating Logs ↔ Traces ↔ Metrics
+
+1. Start from a **metric alert**: high error rate on orders
+2. In **Loki**, filter by time range and find error logs:
+   ```
+   {service_name="otel-quarkus-demo"} |= "ERROR"
+   ```
+3. Click the `TraceID` link on any log line (Loki extracts it from OTLP structured metadata)
+4. In **Tempo**, search by trace ID to see the full request flow
+5. In the trace, find span events with error details
+6. Check related **metrics** for that time window to see broader impact
+
+### Useful PromQL Queries
+
+```promql
+# Request rate by endpoint
+rate(http_server_requests_seconds_count[5m])
+
+# Error rate percentage (sum() on both sides so 5xx series are divided by all requests, not by themselves)
+100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
+  / sum(rate(http_server_requests_seconds_count[5m]))
+
+# P95 latency by endpoint
+histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))
+
+# Custom: orders per second by status
+rate(orders_total[5m])
+
+# Custom: average order value
+rate(orders_amount_total[5m]) / rate(orders_total[5m])
+```
+
+### Useful LogQL Queries
+
+```logql
+# All app logs (service.name becomes service_name as a Loki stream label)
+{service_name="otel-quarkus-demo"}
+
+# Errors only
+{service_name="otel-quarkus-demo"} |= "ERROR"
+
+# Filter by trace ID (stored in OTLP structured metadata)
+{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"
+
+# Filter by structured metadata attributes
+{service_name="otel-quarkus-demo"} | severity_text = "ERROR"
+
+# Slow request logs
+{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
+```
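
Review note on the error-rate percentage under "Useful PromQL Queries": the query is a ratio of two per-second rates scaled by 100. The sketch below reproduces that arithmetic in plain shell so the expected value can be checked by hand. The helper names `per_second_rate` and `error_rate_pct` and all counter samples are hypothetical; they are not part of the README or output from the demo app, and the real `rate()` also handles counter resets and extrapolation, which this ignores.

```shell
#!/usr/bin/env bash
# Sketch only: the arithmetic behind rate() and an error-rate percentage.
# Helper names and sample values are made up for illustration.

# per_second_rate FIRST LAST WINDOW_SECONDS
# Approximates rate(): counter increase divided by the window length.
per_second_rate() {
  awk -v first="$1" -v last="$2" -v window="$3" \
    'BEGIN { printf "%.4f\n", (last - first) / window }'
}

# error_rate_pct ERROR_RATE TOTAL_RATE
# Computes 100 * errors / total, printing NaN when there is no traffic.
error_rate_pct() {
  awk -v err="$1" -v total="$2" \
    'BEGIN { if (total == 0) { print "NaN"; exit } printf "%.2f\n", 100 * err / total }'
}

# Hypothetical counter samples taken 300 seconds (5m) apart.
total_rate=$(per_second_rate 12000 12600 300)   # 600 requests in the window
error_rate=$(per_second_rate 40 55 300)         # 15 of them returned 5xx

echo "requests/sec: ${total_rate}"
echo "errors/sec:   ${error_rate}"
echo "error rate %: $(error_rate_pct "${error_rate}" "${total_rate}")"
```

With these made-up samples the script reports 2.0000 requests/sec, 0.0500 errors/sec, and an error rate of 2.50 percent. The division only means what it says when both sides cover the same set of requests, which is why the PromQL version needs the numerator and denominator aggregated to matching label sets.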