
Monitoring Python Memory Usage with Prometheus + Pushgateway + Grafana

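Preface

  • At work I needed to understand the memory and CPU usage of some legacy Python programs. After some research I chose Prometheus, currently the mainstream monitoring solution. This post records some of the configuration and the integration issues I ran into.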

Prometheus

Introduction

image.png

  • Prometheus:
  • Alertmanager: receives the alerts fired by alerting rules and sends alert notifications
  • Web UI: Grafana is the usual choice

Installation (Docker)

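  • Pulling images directly from within China is slow; Alibaba Cloud's registry mirror can speed this up
  • Prometheus:
    • You can mount a configuration file from the host into the container, which makes it easy to edit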
docker run -p 9090:9090 -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
  • Pushgateway:
docker run -d -p 9091:9091 prom/pushgateway
  • Grafana: the default username and password are admin / admin
docker run -d -p 3000:3000 grafana/grafana
  • The /etc/prometheus/prometheus.yml configuration file:
# my global config
global:
  scrape_interval: 15s # Scrape interval. Default is every 1 minute.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"


# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ["localhost:9090"]

  # pushgateway
  - job_name: "pushgateway"
    honor_labels: true # Default is false. With true, labels in the scraped data take precedence.
    # Example: with false, the job and instance labels pushed to the gateway are rewritten to exported_job and exported_instance.
    static_configs:
      - targets: ["192.168.1.182:9091"] # Use the host's IP, not localhost (inside the Prometheus container, localhost is the container itself)
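
The effect of honor_labels on the pushgateway job can be modelled in a few lines of Python. This is only a simplified sketch of the conflict rule described in the config comment, not Prometheus's actual implementation; resolve_labels and the sample label values are hypothetical names chosen for illustration.

```python
def resolve_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Merge the labels found in scraped data with the labels Prometheus
    attaches to the scrape target (simplified model of honor_labels)."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            # No conflict: attach the target label as-is.
            merged[name] = value
        elif honor_labels:
            # Conflict, honor_labels: true -> keep the scraped value.
            continue
        else:
            # Conflict, honor_labels: false -> rename the scraped label
            # to exported_<name> and use the target label.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
    return merged


# Labels pushed by the Python program vs. labels of the scrape target.
pushed = {"job": "test_mem", "instance": "worker-1"}
scrape_target = {"job": "pushgateway", "instance": "192.168.1.182:9091"}

print(resolve_labels(pushed, scrape_target, honor_labels=True))
print(resolve_labels(pushed, scrape_target, honor_labels=False))
```

With honor_labels set to true the series keep job="test_mem", which is usually what you want when querying by the job name the program pushed.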

Python Integration

SDK: client_python

pip install prometheus-client

Pushing to Pushgateway

  • Start a thread that uploads metrics at a fixed interval, using registry.REGISTRY (on Linux it already has memory, CPU, Python info, GC and other metrics registered)
import socket
import threading
import time

from prometheus_client import (
    push_to_gateway,
    registry,
)


def monitor(gateway: str, job: str, interval: int):
    # Push the default registry to the Pushgateway every `interval` seconds.
    while True:
        push_to_gateway(
            gateway=gateway,
            job=job,
            registry=registry.REGISTRY,
            grouping_key={"instance": socket.getfqdn()},
        )
        time.sleep(interval)


def start_monitor(gateway: str, job: str, interval: int = 10):
    """Start monitoring.
    :param gateway: Pushgateway address
    :param job: job name
    :param interval: push interval in seconds
    """
    t = threading.Thread(target=monitor, args=(gateway, job, interval))
    t.daemon = True
    t.start()


if __name__ == "__main__":
    # Push metrics every 10 seconds
    start_monitor('127.0.0.1:9091', job='test_mem', interval=10)

    # Grow a list steadily so the memory increase shows up in the charts.
    # ([] * 100000 is still an empty list, so a non-empty element is used.)
    data = [0] * 100000
    while True:
        time.sleep(1)
        data.extend([0] * 100000)
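
One thing to watch in the loop above: if a push raises (for example, because the Pushgateway is briefly unreachable), the exception ends the thread and monitoring stops silently. Below is a minimal sketch of a more forgiving loop, using only the standard library. run_periodically and its parameters are hypothetical names, and the push itself is passed in as a callable so the sketch does not depend on any particular client call.

```python
import logging
import threading


def run_periodically(task, interval: float, stop_event: threading.Event) -> None:
    """Call task() every `interval` seconds until stop_event is set.

    Exceptions raised by task() are logged and the loop keeps going,
    so one failed push does not end the monitoring thread.
    """
    while not stop_event.is_set():
        try:
            task()
        except Exception:
            logging.exception("metrics push failed; will retry on the next tick")
        # wait() returns early once stop_event is set, unlike time.sleep().
        stop_event.wait(interval)
```

To use it with the code above, the task could be something like lambda: push_to_gateway(gateway=gateway, job=job, registry=registry.REGISTRY, grouping_key={"instance": socket.getfqdn()}), with the thread started as threading.Thread(target=run_periodically, args=(task, 10, stop_event), daemon=True).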


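  • Note: the default metrics such as CPU and memory only cover the single-process (threaded) model; multi-process models such as gunicorn need custom metrics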
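
As a rough illustration of what a custom per-process metric could look like, the sketch below reads the resident memory of a given PID from /proc/<pid>/status (Linux only) and renders one gauge sample per process in the Prometheus text format. The metric name worker_resident_memory_bytes and all function names are hypothetical; this is one possible approach under those assumptions, not the method used in this post.

```python
from pathlib import Path


def parse_vm_rss_bytes(status_text: str) -> int:
    """Extract VmRSS from the contents of /proc/<pid>/status, in bytes."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:      12345 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found in status text")


def read_rss_bytes(pid: int) -> int:
    """Read the resident set size of a process (Linux only)."""
    return parse_vm_rss_bytes(Path(f"/proc/{pid}/status").read_text())


def render_rss_metric(rss_by_pid: dict) -> str:
    """Render one gauge sample per PID in the Prometheus text format."""
    lines = ["# TYPE worker_resident_memory_bytes gauge"]
    for pid, rss in sorted(rss_by_pid.items()):
        lines.append(f'worker_resident_memory_bytes{{pid="{pid}"}} {rss}')
    return "\n".join(lines) + "\n"


sample_status = "Name:\tpython\nVmRSS:\t    2048 kB\nThreads:\t1\n"
print(render_rss_metric({4321: parse_vm_rss_bytes(sample_status)}))
```

In a gunicorn-style setup, each worker (or the master, given the worker PIDs) could collect these values and expose or push them with a pid label, so the processes do not overwrite one another's series.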

Grafana Dashboard

[Python Process.json](/about/Python Process.json)

  • image.png

Extensions

  1. Django-Prometheus + Dashboards
  2. FastAPI-exporter
  3. Flask_exporter

