
Cloud-Native Observability Stack Part 4 - Metrics and Alerting with Prometheus/Grafana

Series Overview

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing Across Microservices
  3. Part 3: Structured Logging and Correlation IDs
  4. Part 4: Metrics and Alerting with Prometheus/Grafana (this post)
  5. Part 5: Debugging Production Issues with Observability Data

Why Metrics Matter

Metrics express a system's health as numbers:

  • Throughput: requests handled per unit time
  • Latency: response time
  • Error rate
  • Resource usage (CPU, memory)

Spring Boot + Micrometer Setup

Adding the Dependencies

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}
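These two dependencies are all the application needs: the actuator starter pulls in Micrometer, and the Prometheus registry makes Spring Boot auto-configure a PrometheusMeterRegistry along with the /actuator/prometheus endpoint, which still has to be exposed explicitly (next snippet).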

Application Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: order-service
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms,500ms,1000ms

Built-in Metrics

HTTP Request Metrics

http_server_requests_seconds_count{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_sum{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_bucket{method="POST",uri="/api/orders",status="200",le="0.1"}

JVM Metrics

jvm_memory_used_bytes{area="heap",id="G1 Eden Space"}
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause"}
jvm_threads_live_threads
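These are registered automatically by Spring Boot's Micrometer auto-configuration. Since jvm_gc_pause_seconds is a timer, GC pressure falls out of a simple rate query; a small example:

# seconds spent in GC pauses per second, broken down by cause
sum(rate(jvm_gc_pause_seconds_sum[5m])) by (cause)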

Implementing Custom Metrics

Counter

@Service
class OrderMetrics(private val meterRegistry: MeterRegistry) {
 
    private val ordersCreated = Counter.builder("orders.created")
        .description("Total number of orders created")
        .tag("service", "order-service")
        .register(meterRegistry)
 
    fun recordOrderCreated() {
        ordersCreated.increment()
    }

    // Counters with dynamic tags are registered (or looked up) on each call;
    // keep "reason" low-cardinality, since every distinct value creates a new time series.
    fun recordOrderFailed(reason: String) {
        Counter.builder("orders.failed")
            .description("Total number of failed orders")
            .tag("service", "order-service")
            .tag("reason", reason)
            .register(meterRegistry)
            .increment()
    }
}

Gauge

@Component
class QueueMetrics(
    meterRegistry: MeterRegistry,
    private val orderQueue: OrderQueue
) {
    init {
        // The registry holds only a weak reference to orderQueue;
        // this component keeps the strong reference that keeps it alive.
        Gauge.builder("order.queue.size", orderQueue) { queue ->
            queue.size().toDouble()
        }
            .description("Current size of order processing queue")
            .register(meterRegistry)
    }
}
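When the monitored value is just a number rather than a collection, MeterRegistry.gauge can bind a mutable number directly. A minimal sketch (the orders.in.flight metric and class name are ours, for illustration):

import io.micrometer.core.instrument.MeterRegistry
import org.springframework.stereotype.Component
import java.util.concurrent.atomic.AtomicInteger

@Component
class InFlightOrderMetrics(meterRegistry: MeterRegistry) {
    // gauge() registers the AtomicInteger and returns it,
    // so the same instance can be mutated from application code.
    private val inFlight: AtomicInteger =
        meterRegistry.gauge("orders.in.flight", AtomicInteger(0))!!

    fun orderStarted() { inFlight.incrementAndGet() }
    fun orderFinished() { inFlight.decrementAndGet() }
}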

Timer

@Service
class PaymentService(
    private val meterRegistry: MeterRegistry,
    private val paymentGateway: PaymentGateway
) {
 
    private val paymentTimer = Timer.builder("payment.processing.time")
        .description("Time taken to process payments")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(meterRegistry)
 
    fun processPayment(order: Order): PaymentResult {
        return paymentTimer.recordCallable {
            // payment processing logic
            paymentGateway.charge(order.customerId, order.totalAmount)
        }!!
    }
}
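If you prefer annotations over wrapping calls manually, Micrometer's @Timed annotation does the same thing once a TimedAspect bean is registered (this relies on Spring AOP, so spring-boot-starter-aop must also be on the classpath). A sketch:

import io.micrometer.core.aop.TimedAspect
import io.micrometer.core.instrument.MeterRegistry
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class MetricsConfig {
    // Enables @Timed on Spring-managed beans.
    @Bean
    fun timedAspect(meterRegistry: MeterRegistry) = TimedAspect(meterRegistry)
}

// Then any public bean method can be timed declaratively:
// @Timed(value = "payment.processing.time", percentiles = [0.5, 0.95, 0.99])
// fun processPayment(order: Order): PaymentResult = ...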

Distribution Summary

@Service
class OrderAnalytics(private val meterRegistry: MeterRegistry) {
 
    private val orderAmountSummary = DistributionSummary.builder("order.amount")
        .description("Distribution of order amounts")
        .baseUnit("KRW")
        .publishPercentiles(0.5, 0.75, 0.95)
        .register(meterRegistry)
 
    fun recordOrderAmount(amount: BigDecimal) {
        orderAmountSummary.record(amount.toDouble())
    }
}
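A distribution summary exposes _count and _sum series, so the average order value over a window is a simple ratio. One caveat: Micrometer's Prometheus naming convention appends the base unit to the metric name, so order.amount with baseUnit KRW is assumed here to surface as order_amount_KRW:

# average order amount over the last 5 minutes
rate(order_amount_KRW_sum[5m]) / rate(order_amount_KRW_count[5m])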

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
 
scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
        - 'order-service:8080'
        - 'payment-service:8081'
        - 'inventory-service:8082'
 
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

# Load the alert rules defined later in this post and tell Prometheus
# where Alertmanager lives (names match the Docker Compose setup below)
rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Grafana Dashboards

RED Method Dashboard

Rate, Errors, Duration, viewed from the service side:

# Request Rate
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Error Rate
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Duration (P99)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))

USE Method Dashboard

Utilization, Saturation, Errors, viewed from the resource side:

# CPU Utilization
system_cpu_usage{application="order-service"}

# Memory Utilization
jvm_memory_used_bytes{application="order-service",area="heap"}
/
jvm_memory_max_bytes{application="order-service",area="heap"}

# Connection Pool Saturation (HikariCP)
hikaricp_connections_pending{application="order-service"}

Defining SLIs and SLOs

Service Level Indicators

# SLI definitions
slis:
  - name: availability
    query: |
      sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))
 
  - name: latency_p99
    query: |
      histogram_quantile(0.99,
        sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
      )
 
  - name: error_rate
    query: |
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))
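The slis block above is illustrative rather than a real Prometheus file format; inside Prometheus itself, the same definitions map naturally onto recording rules. A sketch (file and rule names are ours):

# recording-rules.yml
groups:
  - name: sli-rules
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count[5m]))
      - record: sli:latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
          )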

Service Level Objectives

slos:
  - name: availability
    target: 99.9%
    window: 30d
 
  - name: latency_p99
    target: 500ms
    window: 30d
 
  - name: error_rate
    target: 0.1%
    window: 30d
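A 99.9% availability target leaves a 0.1% error budget, and a useful alerting pattern is to fire on how fast that budget is being consumed rather than on a fixed error-rate threshold. A sketch of a fast-burn rule, with the 14.4x-over-1h threshold popularized by the Google SRE Workbook:

- alert: ErrorBudgetFastBurn
  expr: |
    sum(rate(http_server_requests_seconds_count{status=~"5.."}[1h]))
    /
    sum(rate(http_server_requests_seconds_count[1h]))
    > 14.4 * 0.001
  for: 2m
  labels:
    severity: critical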

Alerting Setup

Prometheus Alert Rules

These rules are evaluated by Prometheus itself; Alertmanager (configured below) only routes and delivers the resulting notifications.

# alert-rules.yml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
 
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value }}s"
 
      - alert: ServiceDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been unreachable for more than 1 minute"

Slack Notification Setup

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'  # your incoming-webhook URL

route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
 
receivers:
  - name: 'slack-notifications'   # default receiver for anything not matched above
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
 
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true
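The equivalent sanity check for this file is amtool check-config alertmanager.yml.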

Complete Docker Compose Setup

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
 
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
 
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
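Start the stack with docker compose up -d, then check each UI on the ports mapped above: Prometheus at localhost:9090 (Status > Targets shows scrape health), Grafana at localhost:3000 (admin password as set via GF_SECURITY_ADMIN_PASSWORD), and Alertmanager at localhost:9093.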

Summary

Key points for metrics and alerting:

| Topic | Description |
|------|------|
| Micrometer | Metrics facade for Spring Boot |
| RED Method | Rate, Errors, Duration: the service perspective |
| USE Method | Utilization, Saturation, Errors: the resource perspective |
| SLI/SLO | Defining service quality targets |
| Alerting | Automated, threshold-based notifications |

The next post covers debugging production issues with observability data.