[ Reference : https://yunmingzhang.wordpress.com/2017/10/24/performance-counters-for-measuring-numa-ops/ ]
Some useful performance counters from ocperf for measuring loads from remote and local DRAM, cache hits, and QPI traffic.
To get pmu-tools, clone this repository:
https://github.com/andikleen/pmu-tools
pmu-tools/ocperf.py stat -e mem_load_uops_l3_hit_retired.xsnp_hit,mem_load_uops_l3_hit_retired.xsnp_hitm,mem_load_uops_l3_hit_retired.xsnp_none,mem_load_uops_l3_miss_retired.remote_dram,mem_load_uops_l3_miss_retired.remote_fwd,mem_load_uops_l3_miss_retired.local_dram -I500 ./executable
This measures L3 cache hits and local and remote DRAM accesses (ocperf.py stat -e). The event list must be comma-separated with no spaces. The -I500 option prints the counts at 500 ms intervals.
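Since the event list is long and must not contain spaces, it can be easier to assemble the command in a small script. The following is a minimal sketch, not part of pmu-tools: the helper name build_ocperf_command and its parameters are hypothetical, and it uses only the options shown in the command above (stat, -e, -I).

```python
import shlex

# Events from the command above. perf expects one comma-separated
# string with no spaces between the event names.
EVENTS = [
    "mem_load_uops_l3_hit_retired.xsnp_hit",
    "mem_load_uops_l3_hit_retired.xsnp_hitm",
    "mem_load_uops_l3_hit_retired.xsnp_none",
    "mem_load_uops_l3_miss_retired.remote_dram",
    "mem_load_uops_l3_miss_retired.remote_fwd",
    "mem_load_uops_l3_miss_retired.local_dram",
]


def build_ocperf_command(executable, interval_ms=500,
                         ocperf="pmu-tools/ocperf.py"):
    """Return the argv list for an ocperf.py stat run (hypothetical helper)."""
    return [ocperf, "stat", "-e", ",".join(EVENTS),
            f"-I{interval_ms}", executable]


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(shlex.join(build_ocperf_command("./executable")))
```

The returned list can also be passed directly to subprocess.run, which avoids shell quoting issues.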
More documentation on the specifics
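mem_load_uops_l3_hit_retired.xsnp_hit measures L3 hits that needed a cross-core snoop, where the snoop hit in the private cache of another core on the same package (the other core holds a copy of the line but has not modified it).
mem_load_uops_l3_hit_retired.xsnp_hitm measures L3 hits where the cross-core snoop found the line modified in another core's cache (HitM): the cache line is dirty and the other core currently owns it.
mem_load_uops_l3_hit_retired.xsnp_none measures hits that come from the shared L3 directly (no snoop involved).
In general, if we are only doing reads, we should mostly see direct reads from the shared L3 (xsnp_none).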
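To compare these counters at a glance, it can help to turn the raw counts into shares. The sketch below assumes the counts have already been copied out of the ocperf output by hand. The function name load_source_breakdown is hypothetical and the numbers are made up purely for illustration. The shares are relative to the counted events only, so loads that hit in L1 or L2 are not included.

```python
def load_source_breakdown(counts):
    """Map each event name to its share (0.0 to 1.0) of all counted loads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Hypothetical counts for illustration only (not measured data).
example_counts = {
    "xsnp_none": 700,
    "xsnp_hit": 50,
    "xsnp_hitm": 50,
    "local_dram": 150,
    "remote_dram": 40,
    "remote_fwd": 10,
}

if __name__ == "__main__":
    for name, share in load_source_breakdown(example_counts).items():
        print(f"{name:12s} {share:6.1%}")
```

For a read-only workload, a large xsnp_none share together with small xsnp_hit and xsnp_hitm shares would match the expectation stated above.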
Official documentation in pmu-tools ocperf
mem_load_uops_l3_hit_retired.xsnp_hit
Retired load uops which data sources were L3 and cross-core snoop
hits in on-pkg core cache. (Supports PEBS) Errata: HSM26, HSM30
mem_load_uops_l3_hit_retired.xsnp_hitm
Retired load uops which data sources were HitM responses from
shared L3. (Supports PEBS) Errata: HSM26, HSM30
mem_load_uops_l3_hit_retired.xsnp_miss
Retired load uops which data sources were L3 hit and cross-core
snoop missed in on-pkg core cache. (Supports PEBS) Errata: HSM26,
HSM30
mem_load_uops_l3_hit_retired.xsnp_none
Retired load uops which data sources were hits in L3 without
snoops required. (Supports PEBS) Errata: HSM26, HSM30
mem_load_uops_l3_miss_retired.remote_fwd
Retired load uop whose Data Source was: forwarded from remote
cache (Supports PEBS) Errata: HSM30
mem_load_uops_l3_miss_retired.remote_hitm
Retired load uop whose Data Source was: Remote cache HITM
If we want to measure remote LLC reads, we can use offcore counters, such as
offcore_response.all_reads.llc_hit.any_response
To measure QPI traffic
pmu-tools/ucevent/ucevent.py --scale GB QPI_LL.QPI_DATA_BW -- ./executable
Sometimes this complains about "./". One way to get around it is to run -- taskset -c 0-num_cores ./executable, or to put something else before the "./".
--scale GB sets the scale to GB. QPI data bandwidth shows the traffic between NUMA nodes.
The QPI links connect the two sockets. There are two in each direction (one for each memory controller), for a total of 4 links. This counter shows the traffic on all 4 links. On the machines in Lanka (Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz), each link provides 15 GB/s, so the maximum is 30 GB/s between the two sockets. This is 25% less than the 40 GB/s of local DRAM, so the QPI links cap remote DRAM bandwidth at 30 GB/s.
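The bandwidth figures above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted for the Lanka machines. The variable names are hypothetical, and it assumes the 30 GB/s figure comes from the two links in one direction.

```python
QPI_LINK_GBPS = 15.0      # per-link bandwidth quoted above
LINKS_PER_DIRECTION = 2   # assumption: two links carry traffic one way
LOCAL_DRAM_GBPS = 40.0    # local DRAM bandwidth quoted above

# Peak bandwidth between the two sockets.
qpi_peak_gbps = QPI_LINK_GBPS * LINKS_PER_DIRECTION

# How far the QPI peak falls short of local DRAM bandwidth.
shortfall = (LOCAL_DRAM_GBPS - qpi_peak_gbps) / LOCAL_DRAM_GBPS

# Remote DRAM traffic has to cross QPI, so the slower of the two limits it.
remote_dram_cap_gbps = min(qpi_peak_gbps, LOCAL_DRAM_GBPS)

if __name__ == "__main__":
    print(f"QPI peak: {qpi_peak_gbps:.0f} GB/s")
    print(f"Shortfall vs local DRAM: {shortfall:.0%}")
    print(f"Remote DRAM cap: {remote_dram_cap_gbps:.0f} GB/s")
```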
However, many applications are not bandwidth bound. For latency-bound applications on the current Lanka setup, according to Vlad:
local DRAM latency: 80 ns
QPI link latency: 40 ns
As a result, remote DRAM latency is 80 ns + 40 ns = 120 ns, about 50% slower than local DRAM.
The 40 ns QPI link latency also imposes a non-trivial overhead on remote LLC accesses. A local LLC access takes about 20-34.8 ns, so adding the 40 ns QPI link latency can make a remote LLC access almost as expensive as a local DRAM access.
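The latency reasoning above can be written out as a simple additive model. This is a rough sketch, not a measurement: it assumes a remote access costs the local latency plus one QPI link latency, it reads the LLC figure as a range of 20 to 34.8 ns, and the variable names are hypothetical.

```python
LOCAL_DRAM_NS = 80.0          # local DRAM latency quoted above
QPI_NS = 40.0                 # QPI link latency quoted above
LOCAL_LLC_NS = (20.0, 34.8)   # assumed low/high range for a local LLC access

# Additive model: remote latency = local latency + QPI link latency.
remote_dram_ns = LOCAL_DRAM_NS + QPI_NS
slowdown = (remote_dram_ns - LOCAL_DRAM_NS) / LOCAL_DRAM_NS
remote_llc_ns = tuple(llc + QPI_NS for llc in LOCAL_LLC_NS)

if __name__ == "__main__":
    print(f"Remote DRAM: {remote_dram_ns:.0f} ns ({slowdown:.0%} slower than local)")
    print(f"Remote LLC: {remote_llc_ns[0]:.1f} to {remote_llc_ns[1]:.1f} ns")
    print(f"Local DRAM: {LOCAL_DRAM_NS:.0f} ns")
```

Under this model the upper end of the remote LLC estimate lands close to the local DRAM latency, which is the point made above.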
toplev for measuring everything (to see whether the workload is memory bound):
pmu-tools/toplev.py -l2 -m -C2 -S -- ./executable