# Full OpenTelemetry Observability Stack on Kubernetes

## Architecture

```
┌─────────────────┐      OTLP (gRPC :4317)
│  Quarkus App    │──────────────────────────┐
│  (metrics,      │                          │
│   traces, logs) │                          ▼
│                 │              ┌──────────────────────┐
│ + ServiceMonitor│              │  OpenTelemetry        │
│ + PrometheusRule│              │  Collector            │
└─────────────────┘              │  (gateway mode)       │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                          metrics    │       │       │  logs
                          (remote    │       │       │  (otlphttp
                           write)    │       │       │   → Loki)
                                     │       │       │
                                     ▼       │       ▼
                              ┌──────────┐   │   ┌──────────┐
                              │Prometheus│   │   │  Loki    │
                              │  + Thanos│   │   └──────────┘
                              │  Sidecar │   │
                              └────┬─────┘   │  traces (otlp)
                                   │         │
                          uploads  │         │
                          blocks   │         ▼
                                   ▼      ┌──────────┐
                            ┌──────────┐  │  Tempo   │
                            │  MinIO   │  └──────────┘
                            │  (S3)    │
                            └────┬─────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
             ┌──────────┐ ┌──────────┐ ┌──────────┐
             │  Thanos  │ │  Thanos  │ │  Thanos  │
             │  Store   │ │Compactor │ │  Query   │
             │ Gateway  │ │          │ │          │
             └──────────┘ └──────────┘ └─────┬────┘
                                             │
                                      ┌──────────┐
                                      │ Grafana  │
                                      │(included)│
                                      └──────────┘
```

## Components

| Component             | Role                                    | Helm Chart                                |
|-----------------------|-----------------------------------------|-------------------------------------------|
| MinIO                 | S3-compatible object storage (Thanos)   | minio/minio                               |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana   | prometheus-community/kube-prometheus-stack |
| Thanos                | Query + Store Gateway + Compactor       | bitnami/thanos                            |
| Loki                  | Log aggregation                         | grafana/loki                              |
| Tempo                 | Distributed tracing                     | grafana/tempo                             |
| OTel Collector        | Unified telemetry pipeline              | open-telemetry/opentelemetry-collector    |
| Quarkus App           | Demo microservice + ServiceMonitor      | Custom manifests                          |

## Prerequisites

- A running Kubernetes cluster (minikube, kind, k3s, etc.)
- Helm 3.x installed
- `kubectl` configured to talk to that cluster
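
The prerequisites above can be checked before starting. This is a minimal sketch, assuming a POSIX-like shell; `missing_tools` is a hypothetical helper name and not part of this repository.

```bash
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools kubectl helm)"
if [ -n "$missing" ]; then
  echo "Missing prerequisites: $missing" >&2
else
  # Both tools exist; confirm that kubectl can actually reach a cluster.
  kubectl cluster-info --request-timeout=5s >/dev/null 2>&1 \
    || echo "kubectl is installed but cannot reach a cluster" >&2
fi
```

The check only reports problems on stderr and never exits, so it is safe to paste into an interactive shell.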

## Deployment Steps

### 1. Add Helm Repositories

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace observability
```

### 3. Deploy MinIO (S3-compatible object storage)

```bash
helm install minio minio/minio \
  -n observability \
  -f helm-values/minio-values.yaml \
  --version 5.3.0 \
  --wait
```

### 4. Create the Thanos Object Storage Secret

```bash
# Pre-configured to point to the in-cluster MinIO — no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
```
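
For orientation, Thanos reads its bucket settings from an object-store configuration document stored in that secret. The sketch below renders one for an in-cluster MinIO. It is illustrative only and is not a copy of `k8s/thanos-objstore-secret.yaml`: the helper name `render_objstore`, the bucket name `thanos`, and the endpoint `minio.observability.svc:9000` are assumptions, and the credentials are the MinIO ones listed in step 12.

```bash
# Render a Thanos S3 object-store config.
# Arguments: bucket, endpoint, access key, secret key (all assumptions here;
# match them to the values in helm-values/minio-values.yaml).
render_objstore() {
  cat <<EOF
type: S3
config:
  bucket: "$1"
  endpoint: "$2"
  access_key: "$3"
  secret_key: "$4"
  insecure: true
EOF
}

render_objstore thanos minio.observability.svc:9000 minio minio123
```

`insecure: true` tells Thanos to use plain HTTP, which is what an in-cluster MinIO without TLS would need.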

### 5. Deploy kube-prometheus-stack (with Thanos Sidecar)

```bash
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n observability \
  -f helm-values/kube-prometheus-stack-values.yaml \
  --version 65.1.1 \
  --wait
```

### 6. Deploy Thanos (Query + Store Gateway + Compactor)

```bash
helm install thanos bitnami/thanos \
  -n observability \
  -f helm-values/thanos-values.yaml \
  --version 15.7.25 \
  --wait
```

### 7. Deploy Loki

```bash
helm install loki grafana/loki \
  -n observability \
  -f helm-values/loki-values.yaml \
  --version 6.16.0 \
  --wait
```

### 8. Deploy Tempo

```bash
helm install tempo grafana/tempo \
  -n observability \
  -f helm-values/tempo-values.yaml \
  --version 1.10.3 \
  --wait
```

### 9. Deploy OpenTelemetry Collector

```bash
helm install otel-collector open-telemetry/opentelemetry-collector \
  -n observability \
  -f helm-values/otel-collector-values.yaml \
  --version 0.108.0 \
  --wait
```
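
Steps 3 and 5 through 9 all follow the same pattern, so they can be scripted. The sketch below only prints the commands so the plan can be reviewed before anything is run; `helm_cmd` is a hypothetical helper name. It uses `helm upgrade --install` instead of `helm install`, which is a substitution on my part: it lets the same command be re-run after a failed step without a "name already in use" error.

```bash
NAMESPACE="observability"

# Print (rather than run) the install command for one release.
# Arguments: release name, chart, values file, chart version.
helm_cmd() {
  printf 'helm upgrade --install %s %s -n %s -f %s --version %s --wait\n' \
    "$1" "$2" "$NAMESPACE" "$3" "$4"
}

# Same releases, charts, values files, and versions as the steps above.
helm_cmd minio minio/minio helm-values/minio-values.yaml 5.3.0
# Step 4: the object storage secret must exist before Prometheus starts.
echo "kubectl apply -f k8s/thanos-objstore-secret.yaml -n $NAMESPACE"
helm_cmd kube-prom prometheus-community/kube-prometheus-stack helm-values/kube-prometheus-stack-values.yaml 65.1.1
helm_cmd thanos bitnami/thanos helm-values/thanos-values.yaml 15.7.25
helm_cmd loki grafana/loki helm-values/loki-values.yaml 6.16.0
helm_cmd tempo grafana/tempo helm-values/tempo-values.yaml 1.10.3
helm_cmd otel-collector open-telemetry/opentelemetry-collector helm-values/otel-collector-values.yaml 0.108.0
```

Once the printed plan looks right, it can be saved to a file and executed with `sh`.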

### 10. Configure Grafana Datasources

```bash
kubectl apply -f k8s/grafana-datasources.yaml -n observability
# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability
```
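
After the Grafana restart it is worth confirming that every pod in the namespace is healthy before deploying the app. The sketch below filters `kubectl get pods` output for pods that are not fully ready; `not_ready_pods` is a hypothetical helper name.

```bash
# Read `kubectl get pods --no-headers` output on stdin and print the name of
# each pod whose READY column (for example 1/2) is incomplete.
# Pods of finished Jobs (STATUS "Completed") are skipped.
not_ready_pods() {
  awk '{ split($2, r, "/"); if ($3 != "Completed" && r[1] != r[2]) print $1 }'
}

# Usage against the live cluster:
#   kubectl get pods -n observability --no-headers | not_ready_pods
```

An empty result means all pods report ready; any name that is printed is a candidate for `kubectl describe pod` and `kubectl logs`.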

### 11. Build and Deploy the Quarkus App

```bash
# Run everything from the repository root so the k8s/ path below resolves.

# Option A: Build locally and push to your registry
docker build -t your-registry/otel-quarkus-demo:latest quarkus-app
docker push your-registry/otel-quarkus-demo:latest

# Option B: For local clusters (minikube/kind), load the image directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest quarkus-app
# kind:     docker build -t otel-quarkus-demo:latest quarkus-app && kind load docker-image otel-quarkus-demo:latest

# Deploy (includes ServiceMonitor + PrometheusRule; the monitoring config ships with the app)
kubectl apply -f k8s/quarkus-app.yaml -n observability
```

### 12. Access Grafana

```bash
# Port-forward Grafana (runs in the foreground; leave it open and use another terminal for further commands)
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability

# Default credentials: admin / prom-operator
# Open http://localhost:3000
# Two Prometheus datasources available:
#   - "Prometheus" → local (7-day retention)
#   - "Thanos"     → long-term via Thanos Query

# Port-forward the MinIO Console in a separate terminal (optional; lets you inspect the Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001  (minio / minio123)
```

### 13. Generate Traffic

```bash
# Port-forward the Quarkus app (runs in the foreground; leave it open)
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability

# In a second terminal, run the traffic generator
bash scripts/generate-traffic.sh
```
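
If you want to send a few requests by hand instead, the loop below is a minimal sketch of the idea. It is not the contents of `scripts/generate-traffic.sh`: the helper name `send_requests` is hypothetical, and using a plain GET on `/api/orders` is an assumption about the app's API.

```bash
# Send COUNT requests to ENDPOINT on BASE_URL and print one HTTP status code per line.
# Arguments: base URL, endpoint path, request count.
send_requests() {
  local base_url="$1" endpoint="$2" count="$3"
  local i=0
  while [ "$i" -lt "$count" ]; do
    # -o /dev/null discards the body; -w prints only the status code.
    curl -s -o /dev/null -w '%{http_code}\n' "$base_url$endpoint"
    i=$((i + 1))
  done
}

# Usage (with the port-forward from this step running):
#   send_requests http://localhost:8080 /api/orders 50 | sort | uniq -c
```

Piping the output through `sort | uniq -c` gives a quick count per status code, which makes the occasional 500s visible without opening Grafana.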

---

## Troubleshooting Session Guide

### Scenario 1: High Latency Investigation (Traces)

1. **Open Grafana → Explore → Tempo**
2. Run query: `{ resource.service.name = "otel-quarkus-demo" }`
3. Look for traces with a high duration on the `/api/orders` endpoint
4. Drill into the spans: the `processOrder` span has a simulated delay
5. Check the span attributes: `order.item_count`, `order.total_price`, `order.processing_type`
6. Notice that orders with `order.processing_type` set to `complex` take longer
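
The same search can be run outside Grafana through Tempo's HTTP search API, which accepts a TraceQL expression in the `q` parameter. This is a sketch under assumptions: `tempo_search` is a hypothetical helper name, and the default URL assumes Tempo's HTTP port has been port-forwarded to `localhost:3200`; check `helm-values/tempo-values.yaml` for the port your deployment actually uses.

```bash
# Search Tempo with a TraceQL expression and print the raw JSON result.
# Arguments: TraceQL expression, optional result limit (default 20).
# TEMPO_URL is an assumption; override it to match your port-forward.
tempo_search() {
  curl -sG "${TEMPO_URL:-http://localhost:3200}/api/search" \
    --data-urlencode "q=$1" \
    --data-urlencode "limit=${2:-20}"
}

# Usage: traces from the demo app that took longer than 500 ms
#   tempo_search '{ resource.service.name = "otel-quarkus-demo" && duration > 500ms }'
```

Adding `duration > 500ms` to the query narrows the result to slow traces, which is quicker than sorting by duration in the UI.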

### Scenario 2: Error Rate Spike (Metrics + Logs)

1. **Grafana → Explore → Prometheus**
2. Query: `rate(http_server_requests_seconds_count{status=~"5.."}[5m])`
3. Compare with: `rate(http_server_requests_seconds_count{status="200"}[5m])`
4. Notice the `/api/orders` endpoint has occasional 500s
5. **Switch to Loki** and correlate:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
6. Find the error logs — click the `TraceID` derived field to jump directly to the trace in Tempo

### Scenario 3: Custom Business Metrics

1. **Grafana → Explore → Prometheus**
2. Query custom metrics:
   - `orders_total` — total orders processed (counter)
   - `orders_amount_total` — total revenue (counter)
   - `order_processing_duration_seconds` — order processing time (histogram)
   - `inventory_level` — current inventory per product (gauge)
3. Build a dashboard:
   - Orders/sec: `rate(orders_total[5m])`
   - Revenue/min: `rate(orders_amount_total[1m]) * 60`
   - P99 latency: `histogram_quantile(0.99, sum by (le) (rate(order_processing_duration_seconds_bucket[5m])))`
   - Inventory levels: `inventory_level`

### Scenario 4: Correlating Logs ↔ Traces ↔ Metrics

1. Start from a **metric alert**: high error rate on orders
2. In **Loki**, filter by time range and find error logs:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
3. Click the `TraceID` link on any log line (Loki extracts it from OTLP structured metadata)
4. In **Tempo**, search by trace ID — see the full request flow
5. In the trace, find span events with error details
6. Check related **metrics** for that time window to see broader impact

### Useful PromQL Queries

```promql
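# Request rate by endpoint (assumes the Micrometer `uri` label)
sum by (uri) (rate(http_server_requests_seconds_count[5m]))

# Error rate percentage (both sides are aggregated so their label sets match;
# without sum(), each 5xx series is divided by itself and the result is always 100)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# P95 latency by endpoint (keep `le` so the buckets can be combined per endpoint)
histogram_quantile(0.95, sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m])))

# Custom: orders per second by status (assumes `orders_total` carries a `status` label)
sum by (status) (rate(orders_total[5m]))

# Custom: average order value
sum(rate(orders_amount_total[5m])) / sum(rate(orders_total[5m]))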
```
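
These queries can also be run from the command line through the Prometheus HTTP API (`/api/v1/query`), which is handy for scripting checks. This is a minimal sketch: `prom_query` is a hypothetical helper name, and the default URL assumes Prometheus has been port-forwarded to `localhost:9090`. Thanos Query serves the same endpoint, so pointing `PROM_URL` at it should work as well.

```bash
# Run an instant PromQL query and print the raw JSON response.
# PROM_URL is an assumption; it expects a port-forward such as
#   kubectl port-forward svc/prometheus-operated 9090:9090 -n observability
prom_query() {
  # -G sends the URL-encoded data as query-string parameters of a GET request.
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'sum(rate(orders_total[5m]))'
```

`--data-urlencode` takes care of escaping braces, quotes, and brackets, so the PromQL expression can be passed exactly as it would be typed in Grafana.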

### Useful LogQL Queries

```logql
# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}

# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"

# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"

# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"

# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
```
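
The LogQL queries above can be run without Grafana through Loki's HTTP API (`/loki/api/v1/query_range`). This is a sketch under assumptions: `loki_query` is a hypothetical helper name, and the default URL assumes a Loki endpoint has been port-forwarded to `localhost:3100`.

```bash
# Run a LogQL range query and print the raw JSON response.
# Arguments: LogQL expression, optional line limit (default 20).
# Without start/end parameters Loki searches the last hour.
# LOKI_URL is an assumption; override it to match your port-forward.
loki_query() {
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}

# Usage:
#   loki_query '{service_name="otel-quarkus-demo"} |= "ERROR"' 50
```

If Loki runs with multi-tenancy enabled, the request also needs an `X-Scope-OrgID` header naming the tenant; whether that applies here depends on `helm-values/loki-values.yaml`.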