# Full OpenTelemetry Observability Stack on Kubernetes

## Architecture

```
┌─────────────────┐      OTLP (gRPC :4317)
│   Quarkus App   │──────────────────────────┐
│   (metrics,     │                          │
│   traces, logs) │                          ▼
│                 │              ┌───────────────────────┐
│ + ServiceMonitor│              │     OpenTelemetry     │
│ + PrometheusRule│              │       Collector       │
└─────────────────┘              │    (gateway mode)     │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                          metrics    │       │       │  logs
                          (remote    │       │       │  (otlphttp
                           write)    │       │       │   → Loki)
                                     │       │       │
                                     ▼       │       ▼
                                ┌──────────┐ │  ┌──────────┐
                                │Prometheus│ │  │   Loki   │
                                │ + Thanos │ │  └──────────┘
                                │ Sidecar  │ │
                                └────┬─────┘ │  traces (otlp)
                                     │       │
                          uploads    │       │
                          blocks     │       ▼
                                     ▼       ┌──────────┐
                                ┌──────────┐ │  Tempo   │
                                │  MinIO   │ └──────────┘
                                │   (S3)   │
                                └────┬─────┘
                                     │
                        ┌────────────┼────────────┐
                        ▼            ▼            ▼
                   ┌──────────┐ ┌──────────┐ ┌──────────┐
                   │  Thanos  │ │  Thanos  │ │  Thanos  │
                   │  Store   │ │Compactor │ │  Query   │
                   │ Gateway  │ │          │ │          │
                   └──────────┘ └──────────┘ └─────┬────┘
                                                   │
                                             ┌──────────┐
                                             │ Grafana  │
                                             │(included)│
                                             └──────────┘
```

## Components

| Component             | Role                                  | Helm Chart                                 |
|-----------------------|---------------------------------------|--------------------------------------------|
| MinIO                 | S3-compatible object storage (Thanos) | minio/minio                                |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana | prometheus-community/kube-prometheus-stack |
| Thanos                | Query + Store Gateway + Compactor     | bitnami/thanos                             |
| Loki                  | Log aggregation                       | grafana/loki                               |
| Tempo                 | Distributed tracing                   | grafana/tempo                              |
| OTel Collector        | Unified telemetry pipeline            | open-telemetry/opentelemetry-collector     |
| Quarkus App           | Demo microservice + ServiceMonitor    | Custom manifests                           |

## Prerequisites

```bash
# A running Kubernetes cluster (minikube, kind, k3s, etc.)
# Helm 3.x installed
# kubectl configured
```

## Deployment Steps

### 1. Add Helm Repositories

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace observability
```

### 3. Deploy MinIO (S3-compatible object storage)

```bash
helm install minio minio/minio \
  -n observability \
  -f helm-values/minio-values.yaml \
  --version 5.3.0 \
  --wait
```

### 4. Create the Thanos Object Storage Secret

```bash
# Pre-configured to point to the in-cluster MinIO; no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
```

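For orientation, a Thanos object storage secret for an S3-compatible backend usually wraps an `objstore.yml` document like the one this helper prints. This is a sketch only, not the contents of `k8s/thanos-objstore-secret.yaml`: the function name `render_objstore_config` is hypothetical, and the bucket, endpoint, and credentials you pass in are assumptions about your MinIO setup.

```bash
# Hypothetical helper: print a Thanos objstore.yml for an S3-compatible store.
# Arguments: bucket, endpoint (host:port), access key, secret key.
render_objstore_config() {
  local bucket="$1" endpoint="$2" access_key="$3" secret_key="$4"
  cat <<EOF
type: S3
config:
  bucket: ${bucket}
  endpoint: ${endpoint}
  access_key: ${access_key}
  secret_key: ${secret_key}
  insecure: true  # plain HTTP, assumed for an in-cluster MinIO without TLS
EOF
}
```

For example, `render_objstore_config thanos minio.observability.svc.cluster.local:9000 minio minio123` prints a config for an assumed bucket named `thanos`. Comparing that output with the secret in the repository is a quick way to spot a wrong endpoint or bucket name.
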
### 5. Deploy kube-prometheus-stack (with Thanos Sidecar)

```bash
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n observability \
  -f helm-values/kube-prometheus-stack-values.yaml \
  --version 65.1.1 \
  --wait
```

### 6. Deploy Thanos (Query + Store Gateway + Compactor)

```bash
helm install thanos bitnami/thanos \
  -n observability \
  -f helm-values/thanos-values.yaml \
  --version 15.7.25 \
  --wait
```

### 7. Deploy Loki

```bash
helm install loki grafana/loki \
  -n observability \
  -f helm-values/loki-values.yaml \
  --version 6.16.0 \
  --wait
```

### 8. Deploy Tempo

```bash
helm install tempo grafana/tempo \
  -n observability \
  -f helm-values/tempo-values.yaml \
  --version 1.10.3 \
  --wait
```

### 9. Deploy OpenTelemetry Collector

```bash
helm install otel-collector open-telemetry/opentelemetry-collector \
  -n observability \
  -f helm-values/otel-collector-values.yaml \
  --version 0.108.0 \
  --wait
```

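With every chart installed, a quick health check can confirm the backends are up before wiring in Grafana and the app. The helpers below are a minimal sketch; the function names `stack_status` and `unhealthy_pods` are hypothetical and not part of this repository.

```bash
# Hypothetical helper: summarize Helm releases and pods in one namespace.
stack_status() {
  local ns="${1:-observability}"
  helm list -n "$ns"
  kubectl get pods -n "$ns"
}

# Hypothetical helper: list pods whose phase is neither Running nor Succeeded.
# A Running pod may still be failing readiness probes, so an empty result
# here is necessary but not sufficient.
unhealthy_pods() {
  local ns="${1:-observability}"
  kubectl get pods -n "$ns" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

Running `stack_status` followed by `unhealthy_pods` should show six releases (MinIO, kube-prom, Thanos, Loki, Tempo, otel-collector) and, ideally, no pods stuck in `Pending` or `Failed`.
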
### 10. Configure Grafana Datasources

```bash
kubectl apply -f k8s/grafana-datasources.yaml -n observability
# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability
```

### 11. Build and Deploy the Quarkus App

```bash
# Option A: Build locally and push to your registry
cd quarkus-app
docker build -t your-registry/otel-quarkus-demo:latest .
docker push your-registry/otel-quarkus-demo:latest

# Option B: For local clusters (minikube/kind), load directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
# kind: docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest

# Deploy (includes ServiceMonitor + PrometheusRule; monitoring config ships with the app)
kubectl apply -f k8s/quarkus-app.yaml -n observability
```

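After applying the manifest, it helps to confirm that the rollout finished and that the monitoring objects were created alongside the app. The sketch below assumes the Deployment is named `otel-quarkus-demo`, matching the Service used later in this guide; the function name `verify_demo_app` is hypothetical.

```bash
# Hypothetical helper: wait for the demo rollout, then list its monitoring objects.
verify_demo_app() {
  local ns="${1:-observability}"
  # Deployment name is an assumption based on the otel-quarkus-demo Service.
  kubectl rollout status deployment/otel-quarkus-demo -n "$ns" --timeout=120s || return 1
  # ServiceMonitor and PrometheusRule are CRDs installed by kube-prometheus-stack.
  kubectl get servicemonitors,prometheusrules -n "$ns"
}
```

If the rollout times out, `kubectl describe pod` on the app pod usually points at an image pull problem, which is the most common failure when the image was built locally.
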
### 12. Access Grafana

```bash
# Port-forward Grafana (the command blocks; run each port-forward in its own terminal)
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability

# Default credentials: admin / prom-operator
# Open http://localhost:3000
# Two Prometheus datasources available:
#   - "Prometheus" → local (7-day retention)
#   - "Thanos"     → long-term via Thanos Query

# Port-forward MinIO Console (optional: inspect Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001 (minio / minio123)
```

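To check from the command line that Grafana actually loaded the datasources, its HTTP API can be queried through the port-forward from this step. The helper below is a sketch: the function name `list_grafana_datasources` and the `GRAFANA_URL` / `GRAFANA_CREDS` variables are hypothetical, and the `grep`/`cut` parsing assumes Grafana returns compact JSON (no spaces around colons).

```bash
# Hypothetical helper: print the names of the datasources Grafana has loaded.
# Assumes the Grafana port-forward is active and the default credentials apply.
list_grafana_datasources() {
  local url="${GRAFANA_URL:-http://localhost:3000}"
  local creds="${GRAFANA_CREDS:-admin:prom-operator}"
  # GET /api/datasources returns a JSON array; extract each "name" value.
  curl -fsS -u "$creds" "${url}/api/datasources" \
    | grep -o '"name":"[^"]*"' \
    | cut -d'"' -f4
}
```

The output should include `Prometheus` and `Thanos`, plus whatever names `k8s/grafana-datasources.yaml` assigns to Loki and Tempo. If `jq` is installed, piping the response through `jq -r '.[].name'` is a more robust way to parse it.
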
### 13. Generate Traffic

```bash
# Port-forward the Quarkus app (the command blocks; keep it running in a separate terminal)
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability

# Run the traffic generator from another terminal
bash scripts/generate-traffic.sh
```

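If you want a quick smoke test without the bundled script, a loop like the following sends a burst of requests and tallies the responses. It is a hypothetical sketch, not the contents of `scripts/generate-traffic.sh`: it assumes a plain `GET` on `/api/orders` is a valid request, whereas the real script may use other methods, payloads, or endpoints.

```bash
# Hypothetical sketch: send N requests and print a count per HTTP status code.
# Arguments: base URL (default http://localhost:8080), request count (default 50).
generate_traffic() {
  local base="${1:-http://localhost:8080}" count="${2:-50}" i=0
  while [ "$i" -lt "$count" ]; do
    # -w prints only the status code; the response body is discarded.
    curl -s -o /dev/null -w '%{http_code}\n' "${base}/api/orders"
    i=$((i + 1))
  done | sort | uniq -c
}
```

Calling `generate_traffic http://localhost:8080 100` prints one line per status code with its count, which gives a rough error ratio to compare against the dashboards.
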
---

## Troubleshooting Session Guide

### Scenario 1: High Latency Investigation (Traces)

1. **Open Grafana → Explore → Tempo**
2. Run query: `{ resource.service.name = "otel-quarkus-demo" }`
3. Look for traces with a high duration on the `/api/orders` endpoint
4. Drill into spans: the `processOrder` span has a simulated delay
5. Check span attributes: `order.item_count`, `order.total_price`, `order.processing_type`
6. Notice: orders with `processing_type=complex` take longer

### Scenario 2: Error Rate Spike (Metrics + Logs)

1. **Grafana → Explore → Prometheus**
2. Query: `rate(http_server_requests_seconds_count{status=~"5.."}[5m])`
3. Compare with: `rate(http_server_requests_seconds_count{status="200"}[5m])`
4. Notice that the `/api/orders` endpoint returns occasional 500s
5. **Switch to Loki** and correlate:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
6. Find the error logs, then click the `TraceID` derived field to jump directly to the trace in Tempo

### Scenario 3: Custom Business Metrics

1. **Grafana → Explore → Prometheus**
2. Query custom metrics:
   - `orders_total`: total orders processed (counter)
   - `orders_amount_total`: total revenue (counter)
   - `order_processing_duration_seconds`: order processing time (histogram)
   - `inventory_level`: current inventory per product (gauge)
3. Build a dashboard:
   - Orders/sec: `rate(orders_total[5m])`
   - Revenue/min: `rate(orders_amount_total[1m]) * 60`
   - P99 latency: `histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))`
   - Inventory levels: `inventory_level`

### Scenario 4: Correlating Logs ↔ Traces ↔ Metrics

1. Start from a **metric alert**: high error rate on orders
2. In **Loki**, filter by time range and find error logs:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
3. Click the `TraceID` link on any log line (Loki extracts it from OTLP structured metadata)
4. In **Tempo**, search by trace ID to see the full request flow
5. In the trace, find span events with error details
6. Check related **metrics** for that time window to see broader impact

### Useful PromQL Queries

```promql
# Request rate by endpoint
rate(http_server_requests_seconds_count[5m])

# Error rate percentage
# (aggregate both sides with sum(); otherwise each 5xx series is divided by itself)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# P95 latency by endpoint
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))

# Custom: orders per second by status
rate(orders_total[5m])

# Custom: average order value
rate(orders_amount_total[5m]) / rate(orders_total[5m])
```

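The same queries can be run outside Grafana against the Prometheus HTTP API, which is handy for scripting checks. The wrapper below is a minimal sketch: the function name `promql` and the `PROM_URL` variable are hypothetical, and it assumes Prometheus or Thanos Query is reachable at that URL, for example through a port-forward to `localhost:9090`.

```bash
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL may point at Prometheus or at Thanos Query; both serve /api/v1/query.
promql() {
  local base="${PROM_URL:-http://localhost:9090}"
  # -G sends the URL-encoded query as a GET parameter.
  curl -fsS -G --data-urlencode "query=$1" "${base}/api/v1/query"
}
```

For example, `promql 'rate(orders_total[5m])'` returns the JSON result for the orders rate. Single quotes keep the shell from interpreting the braces and brackets in PromQL.
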
### Useful LogQL Queries

```logql
# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}

# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"

# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"

# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"

# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
```

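LogQL queries can also be sent straight to Loki's HTTP API. The wrapper below is a sketch under a few assumptions: the function name `logql` and the `LOKI_URL` variable are hypothetical, Loki is reachable at that URL (for example through a port-forward), and multi-tenancy is disabled. With multi-tenancy enabled, the request also needs an `X-Scope-OrgID` header.

```bash
# Hypothetical helper: run a LogQL range query against the Loki HTTP API.
# Arguments: LogQL query, maximum number of entries (default 20).
# Without start/end parameters, Loki searches its default recent window.
logql() {
  local base="${LOKI_URL:-http://localhost:3100}"
  curl -fsS -G \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}" \
    "${base}/loki/api/v1/query_range"
}
```

For example, `logql '{service_name="otel-quarkus-demo"} |= "ERROR"' 5` returns up to five recent error lines as JSON.
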