How the Linux Performance Tool perf Works: A Brief Analysis (Part 2)

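If perf-stat is not given any event through the -e option, add_default_attributes() adds 8 events by default. The event is perf's core object, and every perf subcommand is built around events. perf-stat can take several events at once; they are organized by a key global variable, struct perf_evlist *evsel_list. Only a few of its most important members are listed here: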
struct perf_evlist {
    struct list_head entries;
    bool enabled;
    struct {
        int cork_fd;
        pid_t pid;
    } workload;
    struct fdarray pollfd;
    struct thread_map *threads;
    struct cpu_map *cpus;
    struct events_stats stats;
    ...
}

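entries: the list of all events, i.e. struct perf_evsel objects;
pid: the pid of the process that runs cmd, i.e. the process running ls in the example;
pollfd: holds the fds returned by sys_perf_event_open();
threads: perf-stat accepts several threads through the -t option and counts only while those threads are running;
cpus: perf-stat accepts several CPUs through the -C option and counts only while the program runs on those CPUs;
stats: the statistics of the count results. After reading the raw counter values, perf-stat still has to convert or aggregate them.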

perf_evlist::entries is a linked list of events. Each element is one event, represented by struct perf_evsel, whose most important members are:
struct perf_evsel {
    char *name;
    struct perf_event_attr attr;
    struct perf_counts *counts;
    struct xyarray *fd;
    struct cpu_map *cpus;
    struct thread_map *threads;
}

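name: the name of the event;
attr: the attributes of the event, a very important argument passed to the perf system call;
cpus, threads, fd: perf-stat can restrict what an event counts, i.e. which tasks or which CPUs are measured. This is in effect a two-dimensional table represented by struct xyarray: the final count is split into cpus*threads small counters, sys_perf_event_open() asks the perf driver to create one sub-counter for each cell, and each call returns its own fd;
counts: perf_counts::values stores the count of each cell, and perf_counts::aggr stores the aggregate of all cells.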
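To make the cpus*threads table concrete, here is a minimal sketch of a dense two-dimensional fd table with per-cell counts and an aggregate. It is not perf's struct xyarray; the type fd_table and all function names are hypothetical, chosen only to illustrate the layout described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for perf's xyarray: a dense ncpus x nthreads table. */
struct fd_table {
    int ncpus;
    int nthreads;
    int *fds;          /* one fd per (cpu, thread) cell, -1 means "not opened" */
    uint64_t *values;  /* one partial count per cell */
};

/* Allocate the table; returns 0 on success, -1 on bad sizes or allocation failure. */
static int fd_table_init(struct fd_table *t, int ncpus, int nthreads)
{
    if (ncpus <= 0 || nthreads <= 0)
        return -1;
    size_t cells = (size_t)ncpus * (size_t)nthreads;
    t->ncpus = ncpus;
    t->nthreads = nthreads;
    t->fds = malloc(cells * sizeof(*t->fds));
    t->values = calloc(cells, sizeof(*t->values));
    if (t->fds == NULL || t->values == NULL) {
        free(t->fds);
        free(t->values);
        return -1;
    }
    for (size_t i = 0; i < cells; i++)
        t->fds[i] = -1;
    return 0;
}

/* Row-major index of the cell for (cpu, thread). */
static size_t fd_table_index(const struct fd_table *t, int cpu, int thread)
{
    assert(cpu >= 0 && cpu < t->ncpus);
    assert(thread >= 0 && thread < t->nthreads);
    return (size_t)cpu * (size_t)t->nthreads + (size_t)thread;
}

/* Sum every cell, analogous in spirit to an aggregated count. */
static uint64_t fd_table_aggregate(const struct fd_table *t)
{
    uint64_t sum = 0;
    size_t cells = (size_t)t->ncpus * (size_t)t->nthreads;
    for (size_t i = 0; i < cells; i++)
        sum += t->values[i];
    return sum;
}

static void fd_table_free(struct fd_table *t)
{
    free(t->fds);
    free(t->values);
    t->fds = NULL;
    t->values = NULL;
}
```

With 2 CPUs and 3 threads the table has 6 cells; a real tool would store in each cell the fd returned for that (cpu, thread) pair and later fill values by reading each fd.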

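perf's performance counters are essentially special hardware registers. perf abstracts this hardware capability and provides per-CPU and per-thread 64-bit virtual counters ("virtual" 64-bit counters) for events. When perf-stat specifies neither a thread nor a CPU, the two-dimensional table collapses to a single point: one event maps to one counter, which maps to one fd.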
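One way to picture a "virtual" 64-bit counter is as a software accumulator fed by a narrower raw register. The sketch below shows that general technique only; the article does not show how the kernel does it, and the struct, the function names, and the register width are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a raw counter of `width` bits extended to 64 bits in software. */
struct virt_counter {
    unsigned width;     /* assumed raw register width in bits, 1..63 */
    uint64_t prev_raw;  /* last raw value observed */
    uint64_t count;     /* accumulated 64-bit virtual count */
};

static void virt_counter_init(struct virt_counter *vc, unsigned width, uint64_t raw_now)
{
    assert(width >= 1 && width <= 63);
    vc->width = width;
    vc->prev_raw = raw_now & ((UINT64_C(1) << width) - 1);
    vc->count = 0;
}

/* Fold the raw value into the 64-bit count; tolerates one wrap between calls. */
static uint64_t virt_counter_update(struct virt_counter *vc, uint64_t raw_now)
{
    uint64_t mask = (UINT64_C(1) << vc->width) - 1;
    uint64_t delta = (raw_now - vc->prev_raw) & mask; /* modular difference */
    vc->prev_raw = raw_now & mask;
    vc->count += delta;
    return vc->count;
}
```

The modular subtraction is the design point: as long as the raw register is sampled at least once per wrap, the accumulated count keeps growing past the register's own range.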
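With the core data structures introduced, we can finally return to the perf-stat workflow. Its logic lives mainly in __run_perf_stat() and goes roughly like this:
a. fork a child process that will run cmd, i.e. the ls command in the example;
b. for each event, create a counter through the sys_perf_event_open() system call;
c. send the child a message through a pipe so that it execs the command (ls in the example), and enable the counters right away;
d. when the program finishes, disable the counters and read them.
The user-space workflow is roughly as follows: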
__run_perf_stat()
    perf_evlist__prepare_workload()
    create_perf_stat_counter()
        sys_perf_event_open()
    enable_counters()
        perf_evsel__run_ioctl(evsel, ncpus, nthreads, PERF_EVENT_IOC_ENABLE)
            ioctl(fd, ioc, arg)
    wait()
    disable_counters()
        perf_evsel__run_ioctl(evsel, ncpus, nthreads, PERF_EVENT_IOC_DISABLE)
    read_counters()
        perf_evsel__read(evsel, cpu, thread, count)
            readn(fd, count, size)

The user-space workflow is fairly clear: in the end, counters are conveniently controlled through ioctl() and read through read(). All of this convenience is set up by the perf system call sys_perf_event_open(), so it is time to see what that system call actually does.
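Steps a and c of the workflow (fork first, release the child through a pipe later) can be sketched as below. This is a simplified illustration rather than perf's actual perf_evlist__prepare_workload(): it uses a single pipe, and struct workload and the workload_* functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct workload {
    pid_t pid;    /* child that will exec the command */
    int cork_fd;  /* write end of the pipe that holds the child back */
};

/* Fork a child that blocks on the pipe until workload_start(), then execs argv. */
static int workload_prepare(struct workload *w, char *const argv[])
{
    int go_pipe[2];
    assert(argv != NULL && argv[0] != NULL);
    if (pipe(go_pipe) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(go_pipe[0]);
        close(go_pipe[1]);
        return -1;
    }
    if (pid == 0) {
        char token;
        ssize_t n;
        close(go_pipe[1]);
        /* Wait for the parent's one-byte "go" message (or EOF if it gave up). */
        do {
            n = read(go_pipe[0], &token, 1);
        } while (n < 0 && errno == EINTR);
        close(go_pipe[0]);
        if (n != 1)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }
    close(go_pipe[0]);
    w->pid = pid;
    w->cork_fd = go_pipe[1];
    return 0;
}

/* Uncork the child: this is the point where counters would be enabled. */
static int workload_start(struct workload *w)
{
    char token = 0;
    ssize_t n = write(w->cork_fd, &token, 1);
    close(w->cork_fd);
    w->cork_fd = -1;
    return n == 1 ? 0 : -1;
}

/* Wait for the child; returns its exit status, or -1 if it did not exit normally. */
static int workload_wait(const struct workload *w)
{
    int status;
    pid_t r;
    do {
        r = waitpid(w->pid, &status, 0);
    } while (r < 0 && errno == EINTR);
    if (r < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the child's pid is known before it execs, counters can be opened against that pid between workload_prepare() and workload_start(), so nothing the command does is missed.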
The perf system call
The perf system call opens an fd for a virtual counter, and perf-stat then sends requests to the perf kernel driver through this fd. The system call is defined as follows (linux/kernel/events/core.c):
/**
 * sys_perf_event_open - open a performance event, associate it to a task/cpu
 *
 * @attr_uptr: event_id type attributes for monitoring/sampling
 * @pid: target pid
 * @cpu: target cpu
 * @group_fd: group leader event fd
 */
SYSCALL_DEFINE5(perf_event_open,
        struct perf_event_attr __user *, attr_uptr,
        pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)

struct perf_event_attr deserves special mention: it carries a great deal of information, and the kernel has documentation that describes it in detail [7]. How to use the other parameters is explained thoroughly in the man page, which also ends with a user-space programming example; see man perf_event_open.
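In the spirit of the man page example, here is a minimal sketch of the open, enable, disable, read cycle for a single counter on the calling thread. The helper names (init_counting_attr, count_event, busy_loop) are hypothetical, and the choice to exclude kernel and hypervisor activity is an assumption made so the example is more likely to work without privileges. The call can still fail, for instance under a restrictive perf_event_paranoid setting or inside a container, which is why count_event() reports failure instead of aborting.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for this system call, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Fill attr for plain counting (no sampling); the counter starts disabled. */
static void init_counting_attr(struct perf_event_attr *attr, uint32_t type,
                               uint64_t config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
    attr->disabled = 1;
    attr->exclude_kernel = 1; /* assumption: user-space activity only */
    attr->exclude_hv = 1;
}

/* Example workload: spin for *arg iterations. */
static void busy_loop(void *arg)
{
    volatile uint64_t sink = 0;
    uint64_t n = *(uint64_t *)arg;
    for (uint64_t i = 0; i < n; i++)
        sink += i;
}

/* Count one event on the calling thread while fn(arg) runs.
 * Returns 0 and stores the count on success, -1 on any failure. */
static int count_event(uint32_t type, uint64_t config, void (*fn)(void *),
                       void *arg, uint64_t *count)
{
    struct perf_event_attr attr;
    assert(fn != NULL && count != NULL);
    init_counting_attr(&attr, type, config);
    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;
    int ok = ioctl(fd, PERF_EVENT_IOC_RESET, 0) == 0 &&
             ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0;
    if (ok) {
        fn(arg);
        ok = ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == 0 &&
             read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count);
    }
    close(fd);
    return ok ? 0 : -1;
}
```

A caller might pass PERF_TYPE_HARDWARE with PERF_COUNT_HW_INSTRUCTIONS, or PERF_TYPE_SOFTWARE with PERF_COUNT_SW_TASK_CLOCK on machines where no hardware PMU is exposed.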
sys_perf_event_open() mainly does the following:
a. Based on struct perf_event_attr, it creates and initializes a struct perf_event, which contains several important members:
/**
 * struct perf_event - performance event kernel representation:
 */
struct perf_event {
    struct pmu *pmu;                // hardware PMU abstraction
    local64_t count;                // 64-bit virtual counter
    u64 total_time_enabled;
    u64 total_time_running;
    struct perf_event_context *ctx; // task-related
    ...
}
b. It finds or creates a struct perf_event_context for this event. Context and event are in a 1:N relationship: one context is associated with a process's task_struct, and perf_event_context::event_list holds all the events interested in that process. It includes several important members: