3 posts tagged 'Performance'

[ Reference : https://yunmingzhang.wordpress.com/2017/10/24/performance-counters-for-measuring-numa-ops/ ]


Some useful performance counters from ocperf for measuring loads to remote and local DRAM, L3 cache hits, and QPI traffic.

To get the PMU tools, clone this repository:

https://github.com/andikleen/pmu-tools

pmu-tools/ocperf.py stat -e mem_load_uops_l3_hit_retired.xsnp_hit,mem_load_uops_l3_hit_retired.xsnp_hitm,mem_load_uops_l3_hit_retired.xsnp_none,mem_load_uops_l3_miss_retired.remote_dram,mem_load_uops_l3_miss_retired.remote_fwd,mem_load_uops_l3_miss_retired.local_dram -I500 ./executable

This counts L3 cache hits and local and remote DRAM accesses (ocperf.py stat -e). The counters are reported at 500 ms intervals (-I500).

 

More documentation on the specific events:

mem_load_uops_l3_hit_retired.xsnp_hit measures L3 hits where a cross-core snoop hit a clean copy of the line in another core's on-package cache (another core currently holds the line unmodified).

mem_load_uops_l3_hit_retired.xsnp_hitm measures L3 hits where the cross-core snoop found the line modified (HitM) in another core's cache, i.e., the line is dirty and currently owned by that core.

mem_load_uops_l3_hit_retired.xsnp_none measures hits served directly from the shared L3, with no snoop required.

In general, if we are only doing reads, we should mostly see direct reads from the shared L3 (xsnp_none).

Official documentation in pmu-tools ocperf

  mem_load_uops_l3_hit_retired.xsnp_hit
     Retired load uops which data sources were L3 and cross-core snoop hits in on-pkg core cache. (Supports PEBS) Errata: HSM26, HSM30

  mem_load_uops_l3_hit_retired.xsnp_hitm
     Retired load uops which data sources were HitM responses from shared L3. (Supports PEBS) Errata: HSM26, HSM30

  mem_load_uops_l3_hit_retired.xsnp_miss
     Retired load uops which data sources were L3 hit and cross-core snoop missed in on-pkg core cache. (Supports PEBS) Errata: HSM26, HSM30

  mem_load_uops_l3_hit_retired.xsnp_none
     Retired load uops which data sources were hits in L3 without snoops required. (Supports PEBS) Errata: HSM26, HSM30

  mem_load_uops_l3_miss_retired.remote_fwd
     Retired load uop whose Data Source was: forwarded from remote cache (Supports PEBS) Errata: HSM30

  mem_load_uops_l3_miss_retired.remote_hitm
     Retired load uop whose Data Source was: Remote cache HITM

 

If we want to measure remote LLC reads, we can use offcore counters, such as

offcore_response.all_reads.llc_hit.any_response
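For example (a sketch reusing the same ocperf stat invocation as above; whether this exact event name is available depends on the CPU generation):

pmu-tools/ocperf.py stat -e offcore_response.all_reads.llc_hit.any_response -I500 ./executable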

To measure QPI traffic

pmu-tools/ucevent/ucevent.py --scale GB QPI_LL.QPI_DATA_BW -- ./executable

Sometimes this complains about "./". One way to get around it is to put something before the "./", for example -- taskset -c 0-num_cores ./executable.

--scale GB sets the output units to GB; the QPI data bandwidth shows the traffic between NUMA nodes.

The QPI links run between the two sockets. There are two in each direction (one for each memory controller), for a total of 4 links, and this counter shows the traffic on all 4 of them. On the Lanka machines (Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz) each link provides 15 GB/s, so at most 30 GB/s in each direction between the two sockets. That is about 25% less than the ~40 GB/s of local DRAM bandwidth, and it caps remote DRAM bandwidth at 30 GB/s.

However, many applications are not bandwidth bound. For latency-bound applications on the current Lanka setup, according to Vlad:

local DRAM latency: 80 ns

QPI link latency: 40 ns

As a result, remote DRAM latency is 80 ns + 40 ns = 120 ns, about 50% slower than local DRAM.

The 40 ns QPI link latency also imposes a non-trivial overhead on remote LLC accesses. A local LLC access takes roughly 20-35 ns, so the extra 40 ns of QPI latency can make a remote LLC access about as expensive as a local DRAM access.

toplev for measuring everything (e.g., to see whether the workload is memory bound):

pmu-tools/toplev.py -l2 -m -C2 -S -- ./executable



1. perf: the good, the bad, the ugly

Source: http://rhaas.blogspot.co.uk/2012/06/perf-good-bad-ugly.html


2. oprofile vs perf

Source: http://comments.gmane.org/gmane.linux.oprofile/11429


OProfile was, arguably, the profiling tool of choice for Linux developers for nearly 10 years. A few years ago, various members of the Linux kernel community defined and implemented a formal kernel API to access performance monitor counters (PMCs) to address the needs of the performance tools development community. Prior to the introduction of this API, oprofile used a special oprofile-specific kernel module, while other performance tools relied on kernel patches (e.g., 'perfctr', 'perfmon' -- which were never accepted upstream) to access the PMCs. The kernel developers of this new API also developed an example tool that used the new API, which they called 'perf'. The original perf tool was capable of profiling (single process or system-wide), as well as simple event counting. Several other features have been added to it since then. The perf tool has matured a lot in the past few years, and has gained a lot of followers.

Currently, oprofile is strictly a profiling tool. So there are more features available in the perf tool, but with those added features comes added complexity in using it. Comparing the profiling capabilities of the two tools, there is a lot of overlap, but they each have their own strengths. The original oprofile ("legacy" opcontrol-based profiler) could only do system-wide profiling, which required root authority. In August 2012, oprofile 0.9.8 was released, which included the new 'operf' tool that uses the new kernel API mentioned above. Using operf allows users who know and love oprofile's post-processing tools to get the same benefits as 'perf' (i.e., single-app profiling without the need for root authority), while still leveraging the advantages of oprofile (symbolic native event names, Java profiling, a user manual, and an established community).

In the end, though, which of these tools to use for profiling is a personal choice. Try them both out and decide for yourself. You may find that you'd like to have them both in your toolbox.



Source: http://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1



Real, User and Sys process time statistics

One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.

  • Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).

  • User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.

  • Sys is the amount of CPU time spent in the kernel within the process. This means CPU time spent executing system calls within the kernel, as opposed to library code, which still runs in user space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.

User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads it could potentially exceed the wall clock time reported by Real. Note that in the output these figures include the User and Sys time of all child processes as well, although the underlying system calls return the statistics for the process and its children separately.
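As a quick illustration (a minimal sketch, not from the original answer): two CPU-bound threads make User exceed Real when at least two cores are free, because CPU time is summed across all CPUs.

  /* busy.c - build with: gcc -O2 -pthread busy.c -o busy, then run: time ./busy */
  #include <pthread.h>
  #include <stdio.h>

  static void *spin(void *arg)
  {
      (void)arg;
      volatile unsigned long x = 0;
      unsigned long i;
      for (i = 0; i < 2000000000UL; i++)
          x += i;                        /* pure user-mode work, no system calls */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, spin, NULL);
      pthread_create(&t2, NULL, spin, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      puts("done");                      /* expect user to be roughly 2x real, sys near 0 */
      return 0;
  }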

Origins of the statistics reported by time(1)

The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait(2) or times(2), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday(2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.

On a multi-processor machine a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
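A minimal sketch (assuming Linux/glibc; not from the original answer) of how a time(1)-like wrapper could gather these figures, using gettimeofday(2) for 'real' and times(2) for the 'user' and 'sys' time of the command it runs:

  /* mytime.c - usage: ./mytime <command> [args...] */
  #include <stdio.h>
  #include <sys/time.h>
  #include <sys/times.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      struct timeval t0, t1;
      struct tms cpu;
      long hz = sysconf(_SC_CLK_TCK);        /* clock ticks per second */

      if (argc < 2)
          return 1;

      gettimeofday(&t0, NULL);
      if (fork() == 0) {                     /* child runs the command */
          execvp(argv[1], argv + 1);
          _exit(127);
      }
      wait(NULL);                            /* block until the child finishes */
      gettimeofday(&t1, NULL);
      times(&cpu);                           /* CPU times of us and our waited-for children */

      double real = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
      printf("real %.3fs  user %.3fs  sys %.3fs\n", real,
             (double)cpu.tms_cutime / hz,    /* child's user-mode CPU time */
             (double)cpu.tms_cstime / hz);   /* child's kernel-mode CPU time */
      return 0;
  }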

A brief primer on Kernel vs. User mode

On unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be to manipulate the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.

The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.

The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call. A page describing the system calls provided by the Linux kernel and the conventions for setting up registers can be found here.
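For illustration (a sketch, assuming Linux; not part of the original answer), the syscall(2) wrapper in glibc lets C code issue a system call directly by number, with the register setup and trap instruction handled inside the wrapper:

  #include <sys/syscall.h>   /* SYS_write */
  #include <unistd.h>        /* syscall() */

  int main(void)
  {
      const char msg[] = "hello from a raw system call\n";
      /* Trap into the kernel directly via the write(2) system call number;
       * the time spent inside the kernel here is accounted as 'sys'. */
      syscall(SYS_write, 1, msg, sizeof msg - 1);
      return 0;
  }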

More about 'sys'

There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.) These are under the supervision of The Kernel, and he alone can do them. Some operations that you do (like malloc or fread/fwrite) will invoke these Kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way call the function in kernel (counted in 'sys' time). After returning from the kernel call there will be some more time in 'user' and then malloc will return to your code. When the switch happens and how much of it is spent in kernel mode - you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use malloc and the like in the background, which will again have some time in 'sys' then.
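A rough way to see this split for yourself (a minimal sketch assuming Linux; getrusage(2) reports the calling process's user and system CPU time):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>

  static void report(const char *label)
  {
      struct rusage ru;
      getrusage(RUSAGE_SELF, &ru);
      printf("%-12s user %ld.%06lds  sys %ld.%06lds\n", label,
             (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
             (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
  }

  int main(void)
  {
      size_t n = 64 * 1024 * 1024;
      report("start");

      /* malloc itself runs mostly in user mode, but first touching the pages
       * triggers page faults that the kernel services, which shows up as 'sys'. */
      char *buf = malloc(n);
      memset(buf, 1, n);
      report("after malloc");

      /* fread from /dev/zero spends most of its time in the kernel copying
       * zeros into the buffer, so it is accounted almost entirely as 'sys'. */
      FILE *f = fopen("/dev/zero", "r");
      int i;
      for (i = 0; i < 16; i++)
          fread(buf, 1, n, f);
      fclose(f);
      report("after fread");

      free(buf);
      return 0;
  }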

