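Preface

At work I needed to understand the memory and CPU usage of some legacy Python programs. After some research I chose Prometheus, the current mainstream monitoring solution. This post records the configuration and the problems I ran into while integrating it.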

Prometheus

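Overview

Prometheus:
- Direct collection: pulls metrics from exporters; suited to long-running jobs such as REST API services
  - API services, Nginx, PostgreSQL, EMQ, host metrics
  - More at https://prometheus.io/docs/instrumenting/exporters/
- Indirect collection: the application pushes metrics to Pushgateway, and Prometheus pulls them from Pushgateway; suited to short-lived jobs such as scheduled tasks
Alertmanager: receives the alerts raised by Prometheus alerting rules and sends the notifications
Web UI: usually Grafana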
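To make the pull model concrete: an exporter is essentially an HTTP endpoint (by default /metrics) that returns plain text, which Prometheus fetches on every scrape. The sketch below renders a gauge in that text format using only the standard library. The function name render_gauge and the metric demo_queue_length are made up for illustration, and label-value escaping is left out, so in practice a client library such as client_python should produce this output.

```python
def render_gauge(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to a number.
    Illustrative only: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = "{" + label_str + "}" if label_str else ""
        lines.append(f"{name}{selector} {float(value)}")
    # The format requires the last line to end with a newline.
    return "\n".join(lines) + "\n"
```

Calling render_gauge("demo_queue_length", "Items waiting in the queue.", {(("queue", "email"),): 3}) yields a HELP line, a TYPE line, and one sample line with the label queue="email".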

Installation (Docker)

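Pulling images directly is slow from within China; Alibaba Cloud's registry mirror can speed it up.
Prometheus:
- Mount a configuration file from the host so that it is easy to edit

docker run -p 9090:9090 -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Open http://localhost:9090 to verify that it started successfully.
Pushgateway:

docker run -d -p 9091:9091 prom/pushgateway

Grafana: the default username and password are admin / admin

docker run -d -p 3000:3000 grafana/grafana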
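Once the three containers are running, a quick scripted check can replace opening each page by hand. The sketch below assumes the default ports from the docker commands above and uses each service's health endpoint (/-/healthy for Prometheus and Pushgateway, /api/health for Grafana). The names HEALTH_URLS, is_up, and check_all are illustrative.

```python
import urllib.error
import urllib.request

# Health endpoints, assuming the port mappings used in the docker commands.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "pushgateway": "http://localhost:9091/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def check_all(urls: dict = HEALTH_URLS) -> dict:
    """Map each service name to whether its health endpoint responded."""
    return {name: is_up(url) for name, url in urls.items()}
```

check_all() returns a dict such as {"prometheus": True, ...}; a False value usually means the container is not running or the port mapping differs.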

Configuration file /etc/prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 15s # Scrape interval. The default is every 1 minute.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"


# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ["localhost:9090"]
  
  # pushgateway
  - job_name: "pushgateway"
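    honor_labels: true # Defaults to false. When true, labels in the scraped data take precedence.
    # With false, the job and instance labels pushed to the gateway are renamed to exported_job and exported_instance.
    static_configs:
      - targets: ["192.168.1.182:9091"] # Use the host's IP, not localhost: inside the Prometheus container, localhost is the container itself.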
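The effect of honor_labels is easier to see with concrete labels. The function below is a simplified model of the conflict handling, not Prometheus's actual code: merge_labels is a hypothetical name, and edge cases such as empty label values are ignored.

```python
def merge_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Model how label conflicts are resolved during a scrape.

    `scraped` holds labels found in the scraped data (here: what was
    pushed to the Pushgateway); `target` holds the labels Prometheus
    attaches to the scrape target itself.
    """
    if honor_labels:
        # Scraped labels win; target labels only fill in what is missing.
        return {**target, **scraped}
    merged = dict(target)
    for key, value in scraped.items():
        if key in target:
            # Conflict: keep the target's value and rename the scraped one.
            merged[f"exported_{key}"] = value
        else:
            merged[key] = value
    return merged
```

With pushed labels job="test_mem" and instance="worker-1", honor_labels=True keeps them as they are, while honor_labels=False reports job="pushgateway" and moves the pushed values to exported_job and exported_instance.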

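Python integration

SDK: client_python

pip install prometheus-client

Pushing to Pushgateway

Start a thread that uploads metrics periodically, using registry.REGISTRY, the default registry (on Linux it already has the memory, CPU, Python info, and GC metrics registered).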

import socket
import threading
import time

from prometheus_client import (
    push_to_gateway,
    registry,
)


def monitor(gateway: str, job: str, interval: int):
    while True:
        push_to_gateway(
            gateway=gateway,
            job=job,
            registry=registry.REGISTRY,
            grouping_key={"instance": socket.getfqdn()},
        )
        time.sleep(interval)


def start_monitor(gateway: str, job: str, interval: int = 10):
    """Start monitoring in a background thread.
    :param gateway: Pushgateway address
    :param job: job name
    :param interval: push interval in seconds
    """
    t = threading.Thread(target=monitor, args=(gateway, job, interval))
    t.daemon = True
    t.start()
    
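if __name__ == "__main__":
    # Collect and push metrics every 10 seconds
    start_monitor("127.0.0.1:9091", job="test_mem", interval=10)

    # Grow a list every second so that memory usage rises visibly.
    data = [0] * 100000
    while True:
        time.sleep(1)
        data.extend([0] * 100000)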
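For debugging it helps to know what a push looks like on the wire. As far as the Pushgateway HTTP API goes, push_to_gateway issues a PUT to /metrics/job/<job>/<label>/<value>, which replaces every metric in that group. The sketch below only builds such a request with the standard library and does not send it; build_push_request is a hypothetical helper, and values containing "/" (which need special encoding) are rejected rather than handled.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(
    gateway: str, job: str, grouping_key: dict, body: bytes
) -> urllib.request.Request:
    """Build (but do not send) the HTTP request behind a Pushgateway push."""
    parts = ["metrics", "job", job]
    for label, value in sorted(grouping_key.items()):
        parts += [label, value]
    if any("/" in part for part in parts):
        raise ValueError("values containing '/' are not supported in this sketch")
    url = f"http://{gateway}/" + "/".join(quote(part, safe="") for part in parts)
    return urllib.request.Request(
        url,
        data=body,  # metrics in the Prometheus text format
        method="PUT",  # PUT replaces the whole group; POST would merge by metric name
        headers={"Content-Type": "text/plain; version=0.0.4; charset=utf-8"},
    )
```

Passing the result to urllib.request.urlopen would perform the push. Printing req.full_url first is a cheap way to confirm which group (job plus instance) the data lands in.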

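Note: the default metrics such as CPU and memory only cover a single process (threads included). Multi-process deployments such as gunicorn need custom metrics.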
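One possible approach for the multi-process case, sketched here as an assumption rather than something this post tested: let every worker measure its own resident memory and push under a grouping key that includes its PID, so workers do not overwrite each other's group. The helpers below are hypothetical and Linux-only (they read /proc); the resulting value could be set on a custom Gauge and pushed with that grouping key.

```python
import os
import socket


def parse_rss_bytes(status_text: str) -> int:
    """Extract resident memory (VmRSS) in bytes from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     12345 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found")


def current_rss_bytes() -> int:
    """Read this process's resident memory; works on Linux only."""
    with open("/proc/self/status") as status_file:
        return parse_rss_bytes(status_file.read())


def worker_grouping_key() -> dict:
    """Grouping key that keeps each worker process in its own push group."""
    return {"instance": socket.getfqdn(), "pid": str(os.getpid())}
```

A caveat of per-PID groups is that the Pushgateway keeps a group until it is deleted, so groups of exited workers need cleaning up, for example with client_python's delete_from_gateway.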

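Grafana dashboards

Configure Prometheus as the data source, then add a dashboard.
Grafana dashboards: https://grafana.com/grafana/dashboards/
Perhaps my needs were too basic; I could not find a ready-made dashboard that fit. I did find Asyncworker Python Process and modified it: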

[Python Process.json](/about/Python Process.json)
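Before building panels it can be useful to confirm that the pushed series actually reached Prometheus. The sketch below builds URLs for Prometheus's instant-query HTTP API. It assumes the job name test_mem from the example above and the metric names process_resident_memory_bytes and process_cpu_seconds_total that client_python exports on Linux; instant_query_url is an illustrative helper.

```python
from urllib.parse import urlencode

# PromQL expressions; the job label matches the pushed job name.
RSS_QUERY = 'process_resident_memory_bytes{job="test_mem"}'
CPU_QUERY = 'rate(process_cpu_seconds_total{job="test_mem"}[1m])'


def instant_query_url(base: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query API (/api/v1/query)."""
    params = urlencode({"query": promql})
    return f"{base.rstrip('/')}/api/v1/query?{params}"
```

Opening instant_query_url("http://localhost:9090", RSS_QUERY) in a browser or with curl should return JSON whose result list contains one entry per pushed instance; the same expressions can then be pasted into a Grafana panel.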

ZhangHeng's Blog

Monitoring Python memory usage with Prometheus + Pushgateway + Grafana
