
Questions about basic03-map-counter and assignment 3: Per CPU Stats

simonhf opened this issue 5 years ago • 4 comments

So I was working through assignment 3 in basic03-map-counter [1] and have the following questions:

Okay, so I changed BPF_MAP_TYPE_ARRAY to BPF_MAP_TYPE_PERCPU_ARRAY and added the map_get_value_percpu_array() function. Everything works as expected.

Question #1:

It says "Thus far, we have used atomic operations to increment our stats counters; however, this is expensive as it inserts memory barriers to make sure different CPUs don’t garble each other’s data. We can avoid this by using another array type that stores its data in per-CPU storage. The drawback of this is that we move the burden of summing to userspace."

How should I change the following code so that it's no longer an atomic operation?

/* LLVM maps __sync_fetch_and_add() as a built-in function to the BPF atomic add
 * instruction (that is BPF_STX | BPF_XADD | BPF_W for word sizes)
 */
#ifndef lock_xadd
#define lock_xadd(ptr, val) ((void) __sync_fetch_and_add(ptr, val))
#endif

Or is LLVM doing this somehow magically for me 'under the covers'?

Question #2:

The per CPU code executes bpf_num_possible_cpus() which returns 128 on my i9 laptop. I guess that's why it's called 'possible CPUs' :-) However, this seems a bit of a waste: looping through 100+ possible CPU arrays which will never(?) be written to. If I knew I only had, say, 16 CPUs, could I somehow loop through only the first or last 16 of those 128 possible arrays? How does that work? Or is this something which will become obvious when I finish the tutorial?

    unsigned int nr_cpus = bpf_num_possible_cpus();

[1] https://github.com/xdp-project/xdp-tutorial/tree/master/basic03-map-counter

simonhf commented Feb 21 '20

Simon Hardy-Francis [email protected] writes:

> So I was working through assignment 3 in basic03-map-counter [1] and have the following questions:
>
> Okay, so I changed BPF_MAP_TYPE_ARRAY to BPF_MAP_TYPE_PERCPU_ARRAY and added the map_get_value_percpu_array() function. Everything works as expected.
>
> Question #1:
>
> It says "Thus far, we have used atomic operations to increment our stats counters; however, this is expensive as it inserts memory barriers to make sure different CPUs don’t garble each other’s data. We can avoid this by using another array type that stores its data in per-CPU storage. The drawback of this is that we move the burden of summing to userspace."
>
> How should I change the following code so that it's no longer an atomic operation?
>
> /* LLVM maps __sync_fetch_and_add() as a built-in function to the BPF atomic add
>  * instruction (that is BPF_STX | BPF_XADD | BPF_W for word sizes)
>  */
> #ifndef lock_xadd
> #define lock_xadd(ptr, val) ((void) __sync_fetch_and_add(ptr, val))
> #endif
>
> Or is LLVM doing this somehow magically for me 'under the covers'?

A non-atomic operation is just a regular addition. I.e., instead of "lock_xadd(&rec->rx_packets, 1)", you'd just do "rec->rx_packets += 1".
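
For reference, a minimal sketch of what the per-CPU version might look like on the BPF side. It uses the legacy struct bpf_map_def declaration style the tutorial used at the time; the map name, struct and XDP_ACTION_MAX definition are modelled on the tutorial's basic03 code but are illustrative rather than verbatim:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One slot per XDP action verdict, as in the tutorial's headers */
#define XDP_ACTION_MAX (XDP_REDIRECT + 1)

struct datarec {
	__u64 rx_packets;
};

/* Per-CPU array: the kernel keeps a private copy of each entry per CPU */
struct bpf_map_def SEC("maps") xdp_stats_map = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(struct datarec),
	.max_entries = XDP_ACTION_MAX,
};

SEC("xdp")
int xdp_stats1_func(struct xdp_md *ctx)
{
	__u32 key = XDP_PASS;
	struct datarec *rec = bpf_map_lookup_elem(&xdp_stats_map, &key);

	if (!rec)
		return XDP_ABORTED;

	/* Plain, non-atomic add: safe because this slot belongs to this CPU */
	rec->rx_packets += 1;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";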

> Question #2:
>
> The per CPU code executes bpf_num_possible_cpus() which returns 128 on my i9 laptop. I guess that's why it's called 'possible CPUs' :-)

Hmm, 128 CPUs does seem a bit much for a laptop. What's the output of 'cat /sys/devices/system/cpu/possible' and 'cat /proc/cpuinfo' on your system?

> However, this seems a bit of a waste: looping through 100+ possible CPU arrays which will never(?) be written to. If I knew I only had, say, 16 CPUs, could I somehow loop through only the first or last 16 of those 128 possible arrays? How does that work? Or is this something which will become obvious when I finish the tutorial?
>
>     unsigned int nr_cpus = bpf_num_possible_cpus();

Well, a per-cpu map means that the kernel code will just use the CPU index of whichever CPU is currently running the BPF program whenever it updates the map. So if it is really the case that only the first 16 CPUs will ever be used, the remaining indexes should just always be 0, so you can just skip them in your loop...
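
To make the userspace side concrete, here is a minimal sketch of what a summing helper might look like. Names are illustrative, and it assumes libbpf's libbpf_num_possible_cpus() rather than the selftests-style bpf_num_possible_cpus() helper the tutorial uses:

#include <linux/types.h>
#include <bpf/bpf.h>      /* userspace bpf_map_lookup_elem() */
#include <bpf/libbpf.h>   /* libbpf_num_possible_cpus() */

struct datarec {
	__u64 rx_packets;
};

static __u64 sum_percpu_rx_packets(int map_fd, __u32 key)
{
	/* For per-CPU maps, a single lookup returns one value per *possible*
	 * CPU, so the buffer must have room for all of them */
	int nr_cpus = libbpf_num_possible_cpus();
	struct datarec values[nr_cpus];
	__u64 sum = 0;
	int i;

	if (bpf_map_lookup_elem(map_fd, &key, values) != 0)
		return 0; /* lookup failed; report zero in this sketch */

	/* CPUs that never ran the program just contribute 0 to the sum */
	for (i = 0; i < nr_cpus; i++)
		sum += values[i].rx_packets;

	return sum;
}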

tohojo commented Feb 21 '20

Thanks for the quick answers!

Regarding the number of possible CPUs:

The first command shows the non-intuitive 128, whereas the second command shows CPUs 0 to 15 as expected; 8 cores + 8 hyperthreads.

$ cat /sys/devices/system/cpu/possible
0-127
$ cat /proc/cpuinfo | tail -30
address sizes   : 43 bits physical, 48 bits virtual
power management:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
stepping        : 13
microcode       : 0xc6
cpu MHz         : 2400.001
cache size      : 16384 KB
physical id     : 0
siblings        : 16
core id         : 15
cpu cores       : 16
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xsaves arat md_clear flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit
bogomips        : 4800.00
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:

simonhf commented Feb 21 '20

Found this [1], which says "cpus that have been allocated resources and can be brought online if they are present".

Also, this command gives the expected number of CPUs used:

$ cat /sys/devices/system/cpu/present
0-15

Does this mean that if I wanted to create a per-CPU array which was really big for some reason -- like 1 GB -- XDP in the kernel would allocate 128 of these arrays instead of the expected 16 on my laptop? Since we are dealing with the kernel, are the arrays allocated and assigned to actual pages of RAM whether they are used or not, or are they only assigned to pages of RAM upon access?

[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu

simonhf commented Feb 21 '20

Simon Hardy-Francis [email protected] writes:

> Found this [1], which says "cpus that have been allocated resources and can be brought online if they are present".
>
> Also, this command gives the expected number of CPUs used:
>
> $ cat /sys/devices/system/cpu/present
> 0-15
>
> Does this mean that if I wanted to create a per-CPU array which was really big for some reason -- like 1 GB -- XDP in the kernel would allocate 128 of these arrays instead of the expected 16 on my laptop? Since we are dealing with the kernel, are the arrays allocated and assigned to actual pages of RAM whether they are used or not, or are they only assigned to pages of RAM upon access?

I'm not 100% positive, but yeah, I believe all the percpu allocations use num_possible_cpus() or the equivalent. No idea why your kernel thinks your laptop can have 128 CPUs, though; I suppose maybe the chipset technically supports it or something?
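
As a rough, hypothetical illustration of what that implies for the 1 GB example above (assuming the allocation really does scale with possible CPUs, and ignoring per-entry rounding and allocator overhead):

#include <stdio.h>
#include <bpf/libbpf.h>   /* libbpf_num_possible_cpus() */

int main(void)
{
	long long value_size  = 1024LL * 1024LL * 1024LL;   /* the 1 GB value from the question */
	long long max_entries = 1;
	long long nr_cpus     = libbpf_num_possible_cpus(); /* 128 on the machine above */

	/* Estimated footprint: one copy of every entry per *possible* CPU */
	long long total = value_size * max_entries * nr_cpus;

	printf("estimated per-CPU map footprint: %lld GiB\n",
	       total / (1024LL * 1024LL * 1024LL));         /* -> 128 GiB, not 16 */
	return 0;
}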

tohojo commented Feb 21 '20