# Full OpenTelemetry Observability Stack on Kubernetes

## Architecture

```
┌─────────────────┐     OTLP (gRPC :4317)
│   Quarkus App   │──────────────────────────┐
│   (metrics,     │                          │
│   traces, logs) │                          ▼
│                 │              ┌───────────────────────┐
│ + ServiceMonitor│              │     OpenTelemetry     │
│ + PrometheusRule│              │       Collector       │
└─────────────────┘              │    (gateway mode)     │
                                 └───┬───────┬───────┬───┘
                                     │       │       │
                           metrics   │       │       │  logs
                           (remote   │       │       │  (otlphttp
                            write)   │       │       │   → Loki)
                                     │       │       │
                                     ▼       │       ▼
                                ┌──────────┐ │  ┌──────────┐
                                │Prometheus│ │  │   Loki   │
                                │ + Thanos │ │  └──────────┘
                                │  Sidecar │ │
                                └────┬─────┘ │  traces (otlp)
                                     │       │
                           uploads   │       │
                           blocks    │       ▼
                                     ▼       ┌──────────┐
                                ┌──────────┐ │  Tempo   │
                                │  MinIO   │ └──────────┘
                                │   (S3)   │
                                └────┬─────┘
                                     │
                        ┌────────────┼────────────┐
                        ▼            ▼            ▼
                   ┌──────────┐ ┌──────────┐ ┌──────────┐
                   │  Thanos  │ │  Thanos  │ │  Thanos  │
                   │  Store   │ │Compactor │ │  Query   │
                   │ Gateway  │ │          │ │          │
                   └──────────┘ └──────────┘ └─────┬────┘
                                                   │
                                             ┌──────────┐
                                             │ Grafana  │
                                             │(included)│
                                             └──────────┘
```

## Components

| Component             | Role                                  | Helm Chart                                 |
|-----------------------|---------------------------------------|--------------------------------------------|
| MinIO                 | S3-compatible object storage (Thanos) | minio/minio                                |
| kube-prometheus-stack | Prometheus + Thanos Sidecar + Grafana | prometheus-community/kube-prometheus-stack |
| Thanos                | Query + Store Gateway + Compactor     | bitnami/thanos                             |
| Loki                  | Log aggregation                       | grafana/loki                               |
| Tempo                 | Distributed tracing                   | grafana/tempo                              |
| OTel Collector        | Unified telemetry pipeline            | open-telemetry/opentelemetry-collector     |
| Quarkus App           | Demo microservice + ServiceMonitor    | Custom manifests                           |

## Prerequisites

```bash
# A running Kubernetes cluster (minikube, kind, k3s, etc.)
# Helm 3.x installed
# kubectl configured
```

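The prerequisites can be checked before starting. The following is a minimal sketch, not part of this repository: the function name `missing_tools` is hypothetical, and it only verifies that the binaries are on `PATH`, not that the cluster is reachable.

```bash
# Hypothetical helper (not part of this repository): prints every tool from
# the argument list that is not found on PATH, one name per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: stop early if helm or kubectl is missing.
# missing="$(missing_tools helm kubectl)"
# [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
```

Running `kubectl cluster-info` afterwards is a quick way to confirm that kubectl actually points at a live cluster.
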
## Deployment Steps

### 1. Add Helm Repositories

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add minio https://charts.min.io/
helm repo update
```

### 2. Create Namespace

```bash
kubectl create namespace observability
```

### 3. Deploy MinIO (S3-compatible object storage)

```bash
helm install minio minio/minio \
  -n observability \
  -f helm-values/minio-values.yaml \
  --version 5.3.0 \
  --wait
```

### 4. Create the Thanos Object Storage Secret

```bash
# Pre-configured to point to the in-cluster MinIO; no edits needed.
kubectl apply -f k8s/thanos-objstore-secret.yaml -n observability
```

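For orientation, a Thanos object storage configuration for an S3-compatible endpoint generally has the shape printed by the sketch below. This is not the content of `k8s/thanos-objstore-secret.yaml`: the function name, the bucket name `thanos`, and the endpoint `minio.observability.svc.cluster.local:9000` are assumptions, and the credentials are simply the MinIO console credentials quoted in step 12.

```bash
# Hypothetical sketch: prints a Thanos objstore config for an S3 endpoint.
# Bucket and endpoint defaults are assumptions, not values read from this
# repository; insecure=true means plain HTTP inside the cluster.
thanos_objstore_config() {
  local bucket="${1:-thanos}"
  local endpoint="${2:-minio.observability.svc.cluster.local:9000}"
  cat <<EOF
type: S3
config:
  bucket: ${bucket}
  endpoint: ${endpoint}
  access_key: minio
  secret_key: minio123
  insecure: true
EOF
}
```

Comparing this output with the `objstore` data in the shipped secret can help when the Thanos Sidecar reports upload errors; the shipped manifest remains the source of truth.
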
### 5. Deploy kube-prometheus-stack (with Thanos Sidecar)

```bash
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n observability \
  -f helm-values/kube-prometheus-stack-values.yaml \
  --version 65.1.1 \
  --wait
```

### 6. Deploy Thanos (Query + Store Gateway + Compactor)

```bash
helm install thanos bitnami/thanos \
  -n observability \
  -f helm-values/thanos-values.yaml \
  --version 15.7.25 \
  --wait
```

### 7. Deploy Loki

```bash
helm install loki grafana/loki \
  -n observability \
  -f helm-values/loki-values.yaml \
  --version 6.16.0 \
  --wait
```

### 8. Deploy Tempo

```bash
helm install tempo grafana/tempo \
  -n observability \
  -f helm-values/tempo-values.yaml \
  --version 1.10.3 \
  --wait
```

### 9. Deploy OpenTelemetry Collector

```bash
helm install otel-collector open-telemetry/opentelemetry-collector \
  -n observability \
  -f helm-values/otel-collector-values.yaml \
  --version 0.108.0 \
  --wait
```

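Steps 3 and 5 to 9 all follow the same pattern, so they can be generated from a small table. The sketch below is one way to do that, assuming you prefer `helm upgrade --install` (which can be re-run safely) over `helm install`; the variable and function names are hypothetical, while release names, charts, values files, and versions are copied from the steps above.

```bash
# Release table taken from steps 3 and 5-9: name, chart, values file, version.
RELEASES='minio minio/minio minio-values.yaml 5.3.0
kube-prom prometheus-community/kube-prometheus-stack kube-prometheus-stack-values.yaml 65.1.1
thanos bitnami/thanos thanos-values.yaml 15.7.25
loki grafana/loki loki-values.yaml 6.16.0
tempo grafana/tempo tempo-values.yaml 1.10.3
otel-collector open-telemetry/opentelemetry-collector otel-collector-values.yaml 0.108.0'

# Hypothetical helper: prints one `helm upgrade --install` command per release.
# Nothing is executed here; review the output, then pipe it to `sh` to run it.
print_helm_commands() {
  local name chart values version
  while read -r name chart values version; do
    printf 'helm upgrade --install %s %s -n observability -f helm-values/%s --version %s --wait\n' \
      "$name" "$chart" "$values" "$version"
  done <<EOF
$RELEASES
EOF
}
```

If you run the generated commands in one go, apply the secret from step 4 first, since the step order suggests kube-prometheus-stack expects it to exist.
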
### 10. Configure Grafana Datasources

```bash
kubectl apply -f k8s/grafana-datasources.yaml -n observability
# Restart Grafana to pick up the new datasources
kubectl rollout restart deployment kube-prom-grafana -n observability
```

### 11. Build and Deploy the Quarkus App

```bash
# Option A: Build locally and push to your registry
cd quarkus-app
docker build -t your-registry/otel-quarkus-demo:latest .
docker push your-registry/otel-quarkus-demo:latest

# Option B: For local clusters (minikube/kind), load the image directly
# minikube: eval $(minikube docker-env) && docker build -t otel-quarkus-demo:latest .
# kind: docker build -t otel-quarkus-demo:latest . && kind load docker-image otel-quarkus-demo:latest

# Return to the directory used in the earlier steps (the k8s/ path is relative to it)
cd ..

# Deploy (includes ServiceMonitor + PrometheusRule: monitoring config ships with the app)
kubectl apply -f k8s/quarkus-app.yaml -n observability
```

### 12. Access Grafana

```bash
# Port-forward Grafana (the command blocks; run each port-forward in its own terminal)
kubectl port-forward svc/kube-prom-grafana 3000:80 -n observability

# Default credentials: admin / prom-operator
# Open http://localhost:3000
# Two Prometheus datasources are available:
#   - "Prometheus" → local (7-day retention)
#   - "Thanos"     → long-term via Thanos Query

# Port-forward the MinIO Console (optional, to inspect Thanos blocks)
kubectl port-forward svc/minio-console 9001:9001 -n observability
# Open http://localhost:9001 (minio / minio123)
```

### 13. Generate Traffic

```bash
# Port-forward the Quarkus app (keep this running)
kubectl port-forward svc/otel-quarkus-demo 8080:8080 -n observability

# In a second terminal, run the traffic generator
bash scripts/generate-traffic.sh
```

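If you want to drive load by hand, a loop along the following lines would do. This is a hypothetical sketch and not the repository's `scripts/generate-traffic.sh`: it assumes the app answers plain GET requests on `/api/orders` through the port-forward above, and all function names are made up for illustration.

```bash
# Hypothetical sketch, not the repository's scripts/generate-traffic.sh.
# Assumes GET /api/orders is reachable through the port-forward on :8080.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Prints the HTTP status code of one request (curl reports 000 when the
# connection fails).
request_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Reads status codes from stdin and prints a count per status class.
summarize_statuses() {
  awk '{ c[substr($1, 1, 1) "xx"]++ } END { for (k in c) print k, c[k] }' | sort
}

# Sends N requests (default 50) and prints the per-class summary.
generate_traffic() {
  local n="${1:-50}" i
  for i in $(seq 1 "$n"); do
    request_status "$BASE_URL/api/orders"
    echo
  done | summarize_statuses
}
```

Calling `generate_traffic 200` prints one line per status class, which gives a quick sanity check that some 5xx responses are being produced for the error-rate scenarios.
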
---

## Troubleshooting Session Guide

### Scenario 1: High Latency Investigation (Traces)

1. **Open Grafana → Explore → Tempo**
2. Run query: `{ resource.service.name = "otel-quarkus-demo" }`
3. Look for traces with high duration on the `/api/orders` endpoint
4. Drill into the spans: the `processOrder` span has a simulated delay
5. Check span attributes: `order.item_count`, `order.total_price`, `order.processing_type`
6. Notice: orders with `processing_type=complex` take longer

### Scenario 2: Error Rate Spike (Metrics + Logs)

1. **Grafana → Explore → Prometheus**
2. Query: `rate(http_server_requests_seconds_count{status=~"5.."}[5m])`
3. Compare with: `rate(http_server_requests_seconds_count{status="200"}[5m])`
4. Notice the `/api/orders` endpoint has occasional 500s
5. **Switch to Loki** and correlate:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
6. Find the error logs, then click the `TraceID` derived field to jump directly to the trace in Tempo

### Scenario 3: Custom Business Metrics

1. **Grafana → Explore → Prometheus**
2. Query custom metrics:
   - `orders_total`: total orders processed (counter)
   - `orders_amount_total`: total revenue (counter)
   - `order_processing_duration_seconds`: order processing time (histogram)
   - `inventory_level`: current inventory per product (gauge)
3. Build a dashboard:
   - Orders/sec: `rate(orders_total[5m])`
   - Revenue/min: `rate(orders_amount_total[1m]) * 60`
   - P99 latency: `histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m]))`
   - Inventory levels: `inventory_level`

### Scenario 4: Correlating Logs ↔ Traces ↔ Metrics

1. Start from a **metric alert**: high error rate on orders
2. In **Loki**, filter by time range and find error logs:
   ```
   {service_name="otel-quarkus-demo"} |= "ERROR"
   ```
3. Click the `TraceID` link on any log line (Loki extracts it from OTLP structured metadata)
4. In **Tempo**, search by trace ID to see the full request flow
5. In the trace, find span events with error details
6. Check related **metrics** for that time window to see the broader impact

### Useful PromQL Queries

```promql
# Request rate by endpoint
rate(http_server_requests_seconds_count[5m])

# Error rate percentage
# (both sides are aggregated with sum(); without it, each 5xx series is
# divided by itself and the result is always 100)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
    / sum(rate(http_server_requests_seconds_count[5m]))

# P95 latency by endpoint
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))

# Custom: orders per second by status
rate(orders_total[5m])

# Custom: average order value
# (aggregated so that differing label sets on the two counters still match)
sum(rate(orders_amount_total[5m])) / sum(rate(orders_total[5m]))
```

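These queries can also be run outside Grafana through the Prometheus HTTP API, which Thanos Query exposes as well. The sketch below assumes the API is reachable on `localhost:9090`, for example through a `kubectl port-forward` to the Prometheus or Thanos Query service; that address and the function names are assumptions, not something this repository sets up.

```bash
# Assumption: Prometheus (or Thanos Query) is reachable here, e.g. through a
# port-forward that is not part of the steps above.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Builds the aggregated 5xx error-rate percentage query for a range window.
error_rate_query() {
  local window="${1:-5m}"
  printf '100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[%s])) / sum(rate(http_server_requests_seconds_count[%s]))' \
    "$window" "$window"
}

# Runs an instant query against /api/v1/query and prints the JSON response.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query "$(error_rate_query 5m)"` returns the current error percentage as JSON; `--data-urlencode` takes care of escaping the braces and quotes in the query.
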
### Useful LogQL Queries

```logql
# All app logs (service.name becomes service_name as a Loki stream label)
{service_name="otel-quarkus-demo"}

# Errors only
{service_name="otel-quarkus-demo"} |= "ERROR"

# Filter by trace ID (stored in OTLP structured metadata)
{service_name="otel-quarkus-demo"} | trace_id = "your-trace-id-here"

# Filter by structured metadata attributes
{service_name="otel-quarkus-demo"} | severity_text = "ERROR"

# Slow request logs
{service_name="otel-quarkus-demo"} |= "processed" |= "complex"
```

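The trace ID filter is handy to script when you already have a trace ID from Tempo and want the matching log lines without going through Grafana. The following is a minimal sketch, assuming Loki's HTTP API is reachable on `localhost:3100` (for example through a port-forward that is not part of the steps above); the function names are hypothetical.

```bash
# Assumption: Loki's HTTP API is reachable here, e.g. through a port-forward.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Builds a LogQL query selecting the app's log lines for one trace ID.
trace_logs_query() {
  printf '{service_name="otel-quarkus-demo"} | trace_id = "%s"' "$1"
}

# Calls /loki/api/v1/query_range and prints the JSON response. Without
# explicit start/end parameters, Loki searches the last hour.
loki_query_range() {
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-100}"
}
```

A call such as `loki_query_range "$(trace_logs_query your-trace-id-here)"` returns the log lines for that trace as JSON.
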