Collecting Kubernetes Pod Logs with Alloy and OpenTelemetry

Overview

This article describes how to collect Kubernetes Pod logs using Grafana Alloy's discovery.kubernetes and otelcol components. The pipeline is built entirely on the OpenTelemetry protocol, with no dependency on Loki components, which gives it better standardization and interoperability.

Architecture

Core Components

  1. discovery.kubernetes: automatically discovers Kubernetes Pods
  2. otelcol.receiver.filelog: reads container log files directly
  3. otelcol.processor.k8sattributes: automatically attaches Kubernetes metadata
  4. otelcol.processor.attributes: enriches and transforms log attributes
  5. otelcol.processor.resource: processes resource-level attributes
  6. otelcol.processor.batch: batches data for transfer efficiency
  7. otelcol.exporter.otlphttp: exports over OTLP HTTP

Data Flow

graph TD
    A[Kubernetes Pods] --> B[Container Log Files]
    B --> C[otelcol.receiver.filelog]
    C --> D[otelcol.processor.k8sattributes]
    D --> E[otelcol.processor.attributes]
    E --> F[otelcol.processor.resource]
    F --> G[otelcol.processor.batch]
    G --> H[otelcol.exporter.otlphttp]
    H --> I[OTLP Compatible Backend]
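
In Alloy, the stages above are wired together by pointing each component's output block at the next component's exported input. The following is a minimal end-to-end skeleton of that wiring; the component labels and the backend endpoint (http://otel-collector:4318) are illustrative placeholders, and the full arguments for each stage are covered in the sections below.

// Minimal pipeline skeleton: receiver -> k8sattributes -> batch -> exporter.
otelcol.receiver.filelog "kubernetes_pods" {
  include = ["/var/log/pods/*/*/*.log"]

  output {
    logs = [otelcol.processor.k8sattributes.default.input]
  }
}

otelcol.processor.k8sattributes "default" {
  output {
    logs = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    logs = [otelcol.exporter.otlphttp.backend.input]
  }
}

otelcol.exporter.otlphttp "backend" {
  client {
    endpoint = "http://otel-collector:4318" // placeholder backend address
  }
}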

Configuration Details

1. Kubernetes Pod Discovery

discovery.kubernetes "pods" {
  role = "pod"

  // Optional: restrict discovery to specific namespaces
  // namespaces {
  //   names = ["default", "monitoring", "kube-system"]
  // }

  // Optional: filter with a label selector
  selectors {
    role = "pod"
    // label = "logging.enabled=true"
  }
}

Configuration notes:

  • role = "pod": discover Pod resources
  • namespaces: limits discovery to the listed namespaces
  • selectors: filters specific Pods via label or field selectors (a node-local filtering sketch follows this list)
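
In a DaemonSet deployment it is common to restrict discovery to Pods running on the local node, so each agent only tracks its own node. A minimal sketch, assuming a NODE_NAME environment variable injected into the Alloy Pod via the downward API (spec.nodeName):

// Sketch: limit discovery to Pods scheduled on this node.
// NODE_NAME is an assumed env var populated from spec.nodeName.
discovery.kubernetes "local_pods" {
  role = "pod"

  selectors {
    role  = "pod"
    field = "spec.nodeName=" + sys.env("NODE_NAME")
  }
}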

2. Filelog Receiver

otelcol.receiver.filelog "kubernetes_pods" {
  include = [
    "/var/log/pods/*/*/*.log",
    "/var/log/containers/*.log"
  ]

  exclude = [
    "/var/log/pods/kube-system_*/*/*.log",
    "/var/log/containers/*_kube-system_*.log"
  ]

  // Read from the end of each file to avoid re-ingesting historical logs
  start_at = "end"

  operators = [
    {
      // Parse the CRI container-runtime log line format
      type  = "regex_parser",
      regex = "^(?P<timestamp>[^\\s]+)\\s+(?P<stream>stdout|stderr)\\s+(?P<logtag>[^\\s]+)\\s+(?P<message>.*)",
      timestamp = {
        parse_from  = "attributes.timestamp",
        layout_type = "gotime",
        layout      = "2006-01-02T15:04:05.999999999Z07:00",
      },
    },
  ]

  output {
    logs = [otelcol.processor.k8sattributes.default.input]
  }
}

Configuration notes:

  • include: glob patterns for the log files to watch
  • exclude: log files to skip
  • operators: log-parsing operators; regular-expression parsing is supported (a simpler alternative is sketched below)
  • start_at = "end": start reading at the end of each file (avoids re-processing historical logs)
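
Newer versions of the filelog receiver also ship a dedicated container operator that auto-detects Docker, containerd, and CRI-O log formats, replacing the manual regex above. A sketch, assuming your Alloy version includes this operator:

otelcol.receiver.filelog "kubernetes_pods_simple" {
  include  = ["/var/log/pods/*/*/*.log"]
  start_at = "end"

  operators = [
    {
      // Auto-detects the container runtime format and parses
      // timestamp, stream, and body in one step
      type = "container",
    },
  ]

  output {
    logs = [otelcol.processor.k8sattributes.default.input]
  }
}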

3. Kubernetes Attributes Processor

otelcol.processor.k8sattributes "default" {
  auth_type = "serviceAccount"
  wait_for_metadata = true
  wait_for_metadata_timeout = "30s"

  extract {
    label {
      tag_name = "app"
      key = "app"
      from = "pod"
    }

    annotation {
      tag_name = "deployment_revision"
      key = "deployment.kubernetes.io/revision"
      from = "pod"
    }
  }

  output {
    logs = [otelcol.processor.attributes.enrich_logs.input]
  }
}

Configuration notes:

  • auth_type = "serviceAccount": authenticate to the Kubernetes API with the Pod's ServiceAccount
  • extract: which Pod labels and annotations to pull into attributes
  • pod_association: rules for matching telemetry to Pods (optional; sketched below)
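
A hedged sketch of the pod_association block referenced above. For file-based logs there is no client connection to associate on, so association typically keys off a resource attribute such as k8s.pod.uid parsed from the log file path:

otelcol.processor.k8sattributes "with_association" {
  auth_type = "serviceAccount"

  pod_association {
    source {
      from = "resource_attribute"
      name = "k8s.pod.uid"
    }
  }

  output {
    logs = [otelcol.processor.attributes.enrich_logs.input]
  }
}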

4. Attributes Enrichment Processor

otelcol.processor.attributes "enrich_logs" {
  action {
    key = "service.name"
    action = "insert"
    // assumes k8sattributes extracted the project_name Pod label into this attribute
    from_attribute = "k8s.pod.labels.project_name"
  }

  action {
    // extract runs the pattern against the value of `key`; each named
    // capture group becomes a new attribute (here: log_level)
    key = "message"
    action = "extract"
    pattern = "\\[(?P<log_level>DEBUG|INFO|WARN|ERROR|FATAL)\\]"
  }

  output {
    logs = [otelcol.processor.resource.default.input]
  }
}

Configuration notes:

  • Supported actions include insert, update, extract, and delete
  • Values can be copied from other attributes (from_attribute) or extracted with a regular expression; extract matches against the attribute named by key and writes each named capture group as a new attribute
  • Follows OpenTelemetry semantic conventions
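
The otelcol.processor.resource component from the architecture list works the same way but acts on resource-level attributes. A minimal sketch, with an assumed environment value:

otelcol.processor.resource "default" {
  attributes {
    key = "deployment.environment"
    action = "upsert"
    value = "production" // assumed environment name
  }

  output {
    logs = [otelcol.processor.batch.default.input]
  }
}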

Deployment

1. RBAC Permissions

apiVersion: v1
kind: ServiceAccount
metadata:
  name: alloy
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces", "nodes", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alloy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alloy
subjects:
- kind: ServiceAccount
  name: alloy
  namespace: monitoring

2. DaemonSet Deployment

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alloy
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      labels:
        app: alloy
    spec:
      serviceAccountName: alloy
      containers:
      - name: alloy
        image: grafana/alloy:latest
        args:
        - "run"
        - "/etc/alloy/config.alloy"
        - "--server.http.listen-addr=0.0.0.0:12345"
        - "--storage.path=/tmp/alloy"
        volumeMounts:
        - name: config
          mountPath: /etc/alloy
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        ports:
        - containerPort: 12345
          name: http
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: config
        configMap:
          name: alloy-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      tolerations:
      - effect: NoSchedule
        operator: Exists

3. ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
    // Place the complete Alloy configuration here

Log Backend Integration

1. Jaeger Integration (traces)

Note that Jaeger ingests traces rather than logs, so this exporter only applies when the same Alloy instance also collects traces. Recent Jaeger versions accept OTLP directly over HTTP on port 4318:

otelcol.exporter.otlphttp "jaeger" {
  client {
    endpoint = "http://jaeger-collector:14268/api/traces"
    timeout = "30s"
    compression = "gzip"
  }
}

2. VictoriaLogs Integration

otelcol.exporter.otlphttp "victorialogs" {
  client {
    endpoint = "http://victorialogs:9428/opentelemetry/v1/logs"
    timeout = "30s"
    compression = "gzip"
  }
}

3. Generic OTLP Backend

otelcol.exporter.otlphttp "generic" {
  client {
    endpoint = "https://otlp-endpoint.example.com:4318/v1/logs"
    headers = {
      "Authorization" = "Bearer your-token"
    }
    tls {
      insecure = false
    }
  }
}
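
Exporters can also be combined for redundancy: any processor's output block accepts a list of inputs, so the same log stream can be fanned out to several backends at once. A sketch reusing the exporters defined above:

otelcol.processor.batch "fanout" {
  output {
    logs = [
      otelcol.exporter.otlphttp.victorialogs.input,
      otelcol.exporter.otlphttp.generic.input,
    ]
  }
}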

Performance Tuning

1. Batch Processing

otelcol.processor.batch "default" {
  send_batch_size = 1024      // target batch size
  send_batch_max_size = 2048  // hard upper bound on a batch
  timeout = "5s"              // flush even if the batch is not full

  // Note: metadata_keys partitions batches by client metadata (request
  // headers), not by resource attributes, so these keys only take effect
  // when upstream components propagate matching metadata.
  metadata_keys = [
    "k8s.namespace.name",
    "k8s.pod.name",
    "k8s.container.name"
  ]

  output {
    logs = [otelcol.exporter.otlphttp.generic.input]
  }
}

2. Queue Configuration

// sending_queue is a block nested inside an otelcol.exporter.* component
// (see the sketch below)
sending_queue {
  enabled = true
  num_consumers = 10    // number of consumers draining the queue
  queue_size = 1000     // maximum number of batches held in the queue
}
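
In context, the queue sits alongside retry settings inside the exporter. A sketch with a placeholder endpoint; the retry thresholds shown are illustrative:

otelcol.exporter.otlphttp "tuned" {
  client {
    endpoint = "http://otel-collector:4318" // placeholder backend address
  }

  sending_queue {
    enabled = true
    num_consumers = 10
    queue_size = 1000
  }

  retry_on_failure {
    enabled = true
    initial_interval = "5s"  // wait before the first retry
    max_elapsed_time = "5m"  // give up after this much total retrying
  }
}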

3. File Monitoring Tuning

max_concurrent_files = 1024  // maximum number of files read concurrently
max_log_size = "1MiB"        // maximum size of a single log entry (not the file)
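
These are arguments of otelcol.receiver.filelog, shown in context below. poll_interval controls how often files are checked for new lines; the value shown is illustrative:

otelcol.receiver.filelog "tuned" {
  include = ["/var/log/pods/*/*/*.log"]
  start_at = "end"
  max_concurrent_files = 1024
  max_log_size = "1MiB"
  poll_interval = "200ms" // how often to poll files for new data

  output {
    logs = [otelcol.processor.batch.default.input]
  }
}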

Troubleshooting

1. Enable Live Debugging

livedebugging {
  enabled = true
}
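
Beyond live debugging in the UI, a copy of the stream can be routed to otelcol.exporter.debug, which prints records to Alloy's own logs; add otelcol.exporter.debug.troubleshoot.input to any processor's output list. A minimal sketch:

otelcol.exporter.debug "troubleshoot" {
  verbosity = "detailed" // print full record contents
}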

2. Check Alloy Status

# Check Alloy Pod status
kubectl get pods -n monitoring -l app=alloy

# Tail Alloy logs
kubectl logs -n monitoring -l app=alloy -f

# Access the Alloy web UI (assumes a Service named "alloy" exists)
kubectl port-forward -n monitoring svc/alloy 12345:12345
# Then open http://localhost:12345

3. Common Issues

Permission issues

# Check the ServiceAccount's permissions
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:alloy
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:alloy

Log file access issues

# Check the log file paths
kubectl exec -it <alloy-pod> -- ls -la /var/log/pods/
kubectl exec -it <alloy-pod> -- ls -la /var/log/containers/

Network connectivity issues

# Test connectivity to the backend
kubectl exec -it <alloy-pod> -- curl -v http://otel-collector:4318/v1/logs

Monitoring Metrics

Alloy Self-Metrics

prometheus.exporter.self "alloy_metrics" {}

prometheus.scrape "alloy_metrics" {
  targets = prometheus.exporter.self.alloy_metrics.targets
  forward_to = [prometheus.remote_write.default.receiver]
}
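
The forward_to above assumes a prometheus.remote_write component labeled default. A sketch with a placeholder URL:

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write" // placeholder Prometheus address
  }
}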

Key Metrics

  • alloy_otelcol_receiver_accepted_log_records_total: log records accepted by receivers
  • alloy_otelcol_processor_batch_batch_send_size: batch sizes sent by the batch processor
  • alloy_otelcol_exporter_sent_log_records_total: log records sent by exporters
  • alloy_otelcol_exporter_send_failed_log_records_total: log records that failed to send

Best Practices

1. Resource Sizing

  • CPU: scale with log volume; a 100m request with a 500m limit is a reasonable starting point
  • Memory: a 128Mi request with a 512Mi limit is a reasonable starting point
  • Storage: reserve temporary storage for batching and caching

2. Label Strategy

  • Use a consistent label naming convention
  • Avoid high-cardinality labels (such as Pod UIDs)
  • Use label selectors to filter out Pods you do not need to collect from

3. Security Considerations

  • Configure RBAC with the principle of least privilege
  • Enable TLS for data in transit
  • Rotate credentials regularly
  • Avoid hard-coding secrets in the configuration

4. Scalability

  • Deploy as a DaemonSet so every node runs a log collector
  • Set appropriate resource limits so the agent cannot starve workload Pods
  • Consider multiple exporters for high availability

Summary

Collecting Kubernetes Pod logs with Grafana Alloy's discovery.kubernetes and otelcol components offers the following advantages:

  1. Standardization: built on OpenTelemetry standards, with strong interoperability
  2. Flexibility: supports many log backends and processing pipelines
  3. Performance: built-in batching and queuing optimize transfer efficiency
  4. Observability: rich built-in metrics and debugging features
  5. Cloud-native: native support for Kubernetes service discovery and metadata extraction

This approach is a particularly good fit when you need a standardized log collection pipeline, must support multiple backend systems, or want to avoid vendor lock-in.
