Introduction to Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit. It has joined the CNCF, becoming the second project hosted there after Kubernetes. It is commonly paired with Kubernetes for cluster monitoring, supports many exporters for collecting metrics as well as the Pushgateway for pushing data, and its performance is sufficient for clusters on the order of ten thousand machines.
Prometheus features

1. A multi-dimensional data model: time series are identified by a metric name plus key/value label pairs, data can be aggregated and sliced, and every metric can carry arbitrary labels.
2. A flexible query language (PromQL) that supports arithmetic, aggregation, joins and other operations over the collected metrics (see the example query after this list).
3. Standalone local deployment; no dependency on external distributed storage.
4. Time series are collected over HTTP using a pull model.
5. Time series can also be pushed to the Prometheus server through an intermediary gateway (Pushgateway).
6. Targets are discovered via service discovery or static configuration.
7. Multiple visualization front ends are available, such as Grafana.
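As a small illustration of PromQL, the query below computes per-node CPU usage from node-exporter counters (node-exporter is deployed later in this guide). It assumes Prometheus is reachable at 127.0.0.1:9090 — adjust the address to your own server — and can be run in the web UI or, as here, against the HTTP API:

# Per-node CPU usage derived from node-exporter counters.
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100'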
Prometheus components

1. Prometheus Server: scrapes and stores the time series data.
2. Client libraries: instrument application code; when Prometheus scrapes an instance's HTTP endpoint, the client library returns the current state of all tracked metrics to the Prometheus server.
3. Exporters: Prometheus supports many exporters, which expose metrics from hosts and third-party systems so that the Prometheus server can collect them.
4. Alertmanager: receives alerts from the Prometheus server, deduplicates and groups them, and routes them to the configured receivers, such as email, WeChat, DingTalk or Slack.
5. Grafana: monitoring dashboards.
6. Pushgateway: target hosts can push their data to the Pushgateway, and the Prometheus server then scrapes the Pushgateway.
Prometheus workflow:

1. The Prometheus server periodically pulls metrics from active (up) targets. Targets are registered through static jobs or service discovery; pull is the default collection mode. Data can also be pushed to the Pushgateway for the server to collect (see the push example after this list), and many components ship their own exporters for exposing metrics.
2. The Prometheus server stores the scraped metrics on local disk or in a database.
3. The metrics are stored as time series; alerting rules are evaluated against them and firing alerts are sent to Alertmanager.
4. Alertmanager routes the alerts to the configured receivers, such as email, WeChat or DingTalk.
5. Prometheus ships a web UI where PromQL queries can be run against the collected data.
6. Grafana can use Prometheus as a data source and display the metrics graphically.
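A hedged sketch of the Pushgateway path in step 1: assuming a Pushgateway is running and reachable at pushgateway.monitoring.svc:9091 (one is not deployed in this article), a short-lived batch job could push a sample like this, and Prometheus would then scrape it from the gateway:

# Push a single sample to the Pushgateway under the job name "demo_job".
echo "demo_job_last_success_timestamp $(date +%s)" | \
  curl --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/demo_job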
Deploy node-exporter

What is node-exporter? It collects monitoring metrics from machines (physical servers, VMs, cloud hosts and so on), including CPU, memory, disk, network and open file descriptor statistics.
cat > node-export.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
EOF
Apply the manifest:

kubectl apply -f node-export.yaml
Check the Pod status:

kubectl get pods -n monitoring
NAME                  READY   STATUS    RESTARTS   AGE
node-exporter-9qpkd   1/1     Running   0          89s
node-exporter-zqmnk   1/1     Running   0          89s
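Because the DaemonSet runs with hostNetwork, every node now serves metrics on port 9100. A quick scrape by hand confirms the exporter is reachable; replace the address with one of your node IPs (172.16.1.11 is used here as an example, matching the static scrape targets used later in this article):

curl -s http://172.16.1.11:9100/metrics | grep '^node_cpu_seconds_total' | head -n 5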
Deploy Prometheus

Create the RBAC objects and save them as prometheus-rbac.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
Apply the RBAC manifest:

kubectl apply -f prometheus-rbac.yaml
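Optionally, confirm the binding works by impersonating the new ServiceAccount; both checks should answer "yes":

kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
kubectl auth can-i get /metrics --as=system:serviceaccount:monitoring:prometheus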
Deploy the Prometheus scrape configuration:

cat > promtheus-cfg.yaml << EOF
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/prometheus-rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager.monitoring.svc:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.16.1.11:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.16.1.11:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.16.1.11:10249','172.16.1.12:10249','172.16.1.13:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /etc/kubernetes/pki/etcd/ca.crt
        cert_file: /etc/kubernetes/pki/etcd/server.crt
        key_file: /etc/kubernetes/pki/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['172.16.1.11:2379']
EOF
Note: the shell expands $1 and $2 while the heredoc above generates promtheus-cfg.yaml, so those variables end up missing from the file. Open promtheus-cfg.yaml on the k8s master1 node and put them back by hand; for example, on line 22 change replacement: ':9100' back to replacement: '${1}:9100', and restore the $1:$2 replacements in the same way.
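After fixing the file, apply it and, if you like, validate the rendered configuration with promtool. This is a sketch assuming a local promtool binary (it also ships inside the prom/prometheus image); note that rule_files points at an in-container path, so place a copy of prometheus-rules.yml beside the extracted file or temporarily comment that entry out for the local check:

kubectl apply -f promtheus-cfg.yaml
kubectl get configmap prometheus-config -n monitoring -o jsonpath='{.data.prometheus\.yml}' > /tmp/prometheus.yml
promtool check config /tmp/prometheus.yml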
Deploy the alerting rules; save the following as prometheus-rules.yaml:

kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-rules
  namespace: monitoring
data:
  prometheus-rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert: kube-proxy的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: scheduler的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert: scheduler的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: controller-manager的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert: controller-manager的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: apiserver的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert: apiserver的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: etcd的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%"
      - alert: etcd的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%"
      - alert: kube-state-metrics的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: kube-state-metrics的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: coredns的cpu使用率大于80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: coredns的cpu使用率大于90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: kube-proxy打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kube-proxy打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-schedule打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-schedule打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: kubernetes-etcd打开句柄数>600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600"
          value: "{{ $value }}"
      - alert: kubernetes-etcd打开句柄数>1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过600"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过1000"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 使用虚拟内存超过2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-control-manager|kubernetes-apiservers"}[1m])) > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): TPS超过1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "在{{$labels.namespace}}名称空间下发现{{$labels.pod}}这个pod下的容器{{$labels.container}}被重启,这个监控指标是由{{$labels.instance}}采集的"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}启动异常等待中"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}被删除"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 当前没有leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 当前leader已发生改变"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}): 服务失败"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "组件{{$labels.job}}({{$labels.instance}}):db空间超过10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.endpoint}}不可用"
          value: "{{ $value }}"
          threshold: "1"
    - name: 物理节点状态-监控告警
      rules:
      - alert: 物理节点cpu使用率
        expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}cpu使用率过高"
          description: "{{ $labels.instance }}的cpu使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
      - alert: 物理节点内存使用率
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}内存使用率过高"
          description: "{{ $labels.instance }}的内存使用率超过90%,当前使用率[{{ $value }}],需要排查处理"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: 服务器宕机"
          description: "{{ $labels.instance }}: 服务器延时超过2分钟"
      - alert: 物理节点磁盘的IO性能
        expr: 100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) < 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}}流入磁盘IO使用率过高!"
          description: "{{$labels.mountpoint}}流入磁盘IO大于60%(目前使用:{{$value}})"
      - alert: 入网流量带宽
        expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}}流入网络带宽过高!"
          description: "{{$labels.mountpoint}}流入网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}"
      - alert: 出网流量带宽
        expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}}流出网络带宽过高!"
          description: "{{$labels.mountpoint}}流出网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}"
      - alert: TCP会话
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
          description: "{{$labels.mountpoint}} TCP_ESTABLISHED大于1000%(目前使用:{{$value}}%)"
      - alert: 磁盘容量
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}}磁盘分区使用率过高!"
          description: "{{$labels.mountpoint}}磁盘分区使用大于80%(目前使用:{{$value}}%)"
Apply the rules file:

kubectl apply -f prometheus-rules.yaml
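The rule syntax can be validated the same way; a sketch assuming a local promtool binary:

kubectl get configmap prometheus-rules -n monitoring -o jsonpath='{.data.prometheus-rules\.yml}' > /tmp/prometheus-rules.yml
promtool check rules /tmp/prometheus-rules.yml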
Deploy the Prometheus server
Note: the manifest below sets nodeName, which pins the Pod to one node (node1 in this environment), so the /data host directory must be created on that node.
mkdir /data
chmod 777 /data/

cat > prometheus-deploy.yaml << EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'true'
    spec:
      nodeName: z6gizpvemac5jsc
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=720h
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: localtime
          mountPath: /etc/localtime
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules/
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
            mode: 0644
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: localtime
        hostPath:
          path: /etc/localtime
EOF
Apply the manifest:

kubectl apply -f prometheus-deploy.yaml
Check the Pod status:

kubectl get pods -n monitoring
NAME                                 READY   STATUS    RESTARTS   AGE
node-exporter-9qpkd                  1/1     Running   0          76m
node-exporter-zqmnk                  1/1     Running   0          76m
prometheus-server-85dbc6c7f7-nsg94   1/1     Running   0          6m7s
Deploy the Service:

cat > prometheus-svc.yaml << EOF
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app: prometheus
    component: server
EOF
Apply the Service:

kubectl apply -f prometheus-svc.yaml
Check the Service:

kubectl get svc -n monitoring
NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.96.45.93   <none>        9090:31043/TCP   50s
Access the Prometheus web UI
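The UI is served on every node at the NodePort shown above (31043 in this environment). The scrape targets can also be checked from the command line through the HTTP API; replace the address with one of your node IPs:

curl -s http://172.16.1.11:31043/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c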
Change the kube-proxy metrics listen address

By default kube-proxy's metrics endpoint listens on 127.0.0.1, so Prometheus cannot scrape it through the node IP.
[root@v86a5soqgn7i23h data1]
···
metricsBindAddress: "0.0.0.0"
···
[root@v86a5soqgn7i23h data1]
pod "kube-proxy-4r2v2" deleted
[root@v86a5soqgn7i23h data1]
pod "kube-proxy-qgjzz" deleted
[root@v86a5soqgn7i23h data1]
pod "kube-proxy-qr7dn" deleted
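On a kubeadm cluster this is typically done by editing the kube-proxy ConfigMap and deleting the kube-proxy Pods so the DaemonSet recreates them; afterwards port 10249 should answer on the node IP. A sketch:

kubectl edit configmap kube-proxy -n kube-system          # set metricsBindAddress: "0.0.0.0"
kubectl delete pod -n kube-system -l k8s-app=kube-proxy   # the DaemonSet recreates the Pods
curl -s http://172.16.1.11:10249/metrics | head -n 3      # replace with one of your node IPs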
Deploy Grafana:

cat >grafana.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.5.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var/lib/grafana
          name: grafana-storage
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        hostPath:
          path: /data/grafana-volume-data
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: monitoring
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: NodePort
EOF
Apply the manifest:

kubectl apply -f grafana.yaml
Check the Pod and Service status:

monitoring-grafana-7d7f6cf5c6-vrxw9   1/1   Running   0   3h51m

monitoring-grafana   NodePort   10.111.173.47   <none>   80:31044/TCP   3h54m
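Grafana is reachable on any node at the NodePort shown above (31044 here). When adding the Prometheus data source, the in-cluster Service DNS name can be used as the URL; the lines below are a small convenience sketch:

# Data source URL to enter in Grafana:
#   http://prometheus.monitoring.svc:9090
# Look up the Grafana NodePort from the CLI:
kubectl get svc monitoring-grafana -n monitoring -o jsonpath='{.spec.ports[0].nodePort}'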
If a PV is used for Grafana's storage, one extra permission step is needed. We added a securityContext here, but once /var/lib/grafana is mounted from the PVC the directory is no longer owned by the grafana user (uid 472) that the image runs as, so we use a one-off Job to change the owner of that directory, as shown below.
apiVersion: batch/v1
kind: Job
metadata:
  name: grafana-chown
  namespace: monitoring
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: grafana-chown
        command: ["chown", "-R", "472:472", "/var/lib/grafana"]
        image: busybox
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: storage
          subPath: grafana
          mountPath: /var/lib/grafana
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
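Save the Job to a file and run it once; the filename below is illustrative:

kubectl apply -f grafana-chown-job.yaml
kubectl get job grafana-chown -n monitoring   # wait until COMPLETIONS shows 1/1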
Access Grafana and add Prometheus as a data source

Import the monitoring dashboard templates
Deploy the kube-state-metrics component

Create the RBAC objects and save them as kube-state-metrics-rbac.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - list
  - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
Apply the manifest:

kubectl apply -f kube-state-metrics-rbac.yaml
Deploy kube-state-metrics; save the following Deployment as kube-state-metrics-deploy.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.2.1
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/starsl/kube-state-metrics:v2.2.1
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        beta.kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
Apply the manifest and check the Pod:

kubectl apply -f kube-state-metrics-deploy.yaml
kubectl get pods -n kube-system
kube-state-metrics-79c9686b96-4njrs   1/1   Running   0   76s
Create the Service and save it as kube-state-metrics-svc.yaml:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
Apply the manifest and check the Service:

kubectl apply -f kube-state-metrics-svc.yaml
kubectl get svc -n kube-system | grep kube-state-metrics
kube-state-metrics   ClusterIP   10.105.53.102   <none>   8080/TCP   2m38s
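Note that the kubernetes-service-endpoints job defined earlier only keeps Services annotated with prometheus.io/scrape: "true", so that annotation may need to be added to this Service before Prometheus picks it up. A sketch of annotating it and checking the exporter by hand through a port-forward:

kubectl -n kube-system annotate service kube-state-metrics prometheus.io/scrape="true" --overwrite
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://127.0.0.1:8080/metrics | grep '^kube_pod_container_status_restarts_total' | head -n 3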
Deploy Alertmanager
Deploy the email configuration:

cat >alertmanager-cm-email.yaml <<EOF
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager-config-mail
  namespace: monitoring
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: "smtp.qq.com:465"
      smtp_from: "188837747@qq.com"
      smtp_auth_username: "188837747@qq.com"
      smtp_auth_password: "uoiqnfogvuhubheh"
      smtp_require_tls: false
    templates:
    - '/etc/alertmanager-templates/email.tmpl'
    route:
      group_by: ['alertname']
      repeat_interval: 30m
      receiver: live-monitoring
    receivers:
    - name: live-monitoring
      email_configs:
      - to: 718334935@qq.com
        html: '{{ template "email.html" . }}'
        headers: { Subject: "[WARN] 阿轩智能报警系统" }
EOF
Explanation of the Alertmanager configuration:
smtp_smarthost: 'smtp.163.com:25'      # address and port of the SMTP server used to send mail
smtp_from: '15011572657@163.com'       # sender address
smtp_auth_username: '15011572657'      # SMTP login user
smtp_auth_password: 'BDBPRMLNZGKWRFJP' # SMTP authorization code for the sender mailbox
email_configs:
- to: '1980570647@qq.com'              # recipient address
A fuller example that templates the sender and recipient and adds an inhibit rule:

global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.qq.com:465"
  smtp_from: '{{ template "email.from" . }}'
  smtp_auth_username: '{{ template "email.from" . }}'
  smtp_auth_password: "zaifuwbledqubjfa"
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
- '/etc/alertmanager/templates/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
Apply the ConfigMap:

kubectl apply -f alertmanager-cm-email.yaml
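The configuration can be validated before the server is deployed; a sketch assuming a local amtool binary (amtool also ships inside the quay.io/prometheus/alertmanager image):

kubectl get configmap alertmanager-config-mail -n monitoring -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
amtool check-config /tmp/alertmanager.yml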
Alertmanager alert template:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  email.tmpl: |
    {{ define "email.html" }}
    {{ range .Alerts }}
    <pre>
    ========start==========
    告警程序: prometheus_alert
    告警级别: {{ .Labels.severity }}
    告警类型: {{ .Labels.alertname }}
    故障主机: {{ .Labels.instance }}
    告警主题: {{ .Annotations.summary }}
    告警详情: {{ .Annotations.description }}
    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    ========end==========
    </pre>
    {{ end }}
    {{ end }}
Deploy the Alertmanager server:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    k8s-app: alertmanager
    version: v0.21.0
spec:
  replicas: 1
  serviceName: alertmanager
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.21.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.21.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
      - name: alertmanager
        image: quay.io/prometheus/alertmanager:v0.21.0
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/alertmanager.yml
        - --storage.path=/data
        ports:
        - containerPort: 9093
        readinessProbe:
          httpGet:
            path: /#/status
            port: 9093
          initialDelaySeconds: 30
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 5m
            memory: 40Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: /data
        - name: templates-volume
          mountPath: /etc/alertmanager-templates
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config-mail
      - name: templates-volume
        configMap:
          name: alertmanager-templates
      - name: localtime
        hostPath:
          path: /etc/localtime
  volumeClaimTemplates:
  - metadata:
      name: storage-volume
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
      storageClassName: managed-nfs-storage
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 31192
  selector:
    k8s-app: alertmanager
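Save the manifest and apply it; the filename below is illustrative, and the volumeClaimTemplates section assumes a StorageClass named managed-nfs-storage already exists in the cluster:

kubectl apply -f alertmanager-deploy.yaml
kubectl get pods -n monitoring -l k8s-app=alertmanager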
Overall Pod status:

[root@v86a5soqgn7i23h alertmanager] kubectl get pods -n monitoring
NAME                                  READY   STATUS      RESTARTS   AGE
alertmanager-0                        1/1     Running     0          5m46s
grafana-chown-zgkmh                   0/1     Completed   0          3d
monitoring-grafana-6798dc8d7c-kdxmc   1/1     Running     0          3d
node-exporter-n7x8c                   1/1     Running     0          3d
node-exporter-ngzzn                   1/1     Running     0          3d
node-exporter-q8hlv                   1/1     Running     0          3d
prometheus-server-6796c74d5d-nqmdc    1/1     Running     0          45m
Access the Prometheus UI

The firing alerts are visible there.
Access the Alertmanager UI
The alerts are shown in the Alertmanager UI.
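The same alerts can also be listed from the command line through Alertmanager's v2 API; replace the address with one of your node IPs and use the NodePort 31192 defined above:

curl -s http://172.16.1.11:31192/api/v2/alerts | head -c 500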
Check the QQ mailbox for the alert emails.