
Use global scratch buffers for reading and writing map values

ryantimwilson opened this issue 1 year ago · 11 comments

This PR continues fixing https://github.com/bpftrace/bpftrace/issues/3431 by migrating map values to global scratch buffers. This enables large strings in map values, one of the primary use cases for large strings.

We introduce a write buffer of size 1 and a read buffer of size N, since it's possible for multiple map values to be read in tandem.

Note that in many cases ResourceAnalyser does not exactly replicate CodegenLLVM, meaning it is possible to over-allocate memory, with the benefit of code simplicity. However, we made two optimizations in ResourceAnalyser which, based on the codegen tests, eliminate most of the unnecessary memory:

  1. Don't recurse into map from AssignmentMap to avoid an unnecessary read buffer
  2. Check if we actually need to do a copy during AssignmentMap to avoid an unnecessary write buffer

After this, we see no additional global buffers in current codegen tests.

We also add new codegen and runtime tests for big strings in map values.

ryantimwilson · Oct 03 '24 17:10

Per an offline discussion with @ajor and @danobi, the PR in its current form fails to compile bpfsnake: the kernel verifier reports too many jumps (>= 8192), caused by `select` LLVM IR instructions.

To fix this, I did three things:

  1. Cached the bounded CPU ID at the beginning of the probe to minimize the number of jumps. Unfortunately, this caused LLVM to barf with BPF program is too large. Processed 1000001 insn errors due to the expanded instruction exploration space (e.g. the verifier assumes the value could be MAX_CPU_ID or the result of get_smp_processor_id, so it has to explore every path using scratch buffers twice)
  2. To fix 1, I just used assert(smp_processor_id <= MAX_CPU_ID). Unfortunately, this caused long jump errors.
  3. To fix 2, I moved to CPUv4 to support long jumps. But I needed to write my own probing function, similar to https://github.com/llvm/llvm-project/blob/efb583178d74b2174e8b9660b67ba7295527b09f/llvm/lib/TargetParser/Host.cpp#L464, to ensure both the kernel and LLVM support v4.

I will productionize the local changes and put them up in separate commits. But this is a pretty involved change.

ryantimwilson · Oct 03 '24 20:10

Playing with snake some more, a carefully placed loop conversion:

```diff
diff ~/scratch/snake.bt ~/scratch/snake.bt.optimized
58c58
<         unroll(50) {
---
>         while($y < 50) {
```

brings the insn count down from 400K to 20K.

So I think unroll() is the true issue here. The program is just too complex.

Let's do the cpuid caching optimization since that will be necessary sooner or later. But let's avoid the complicated changes - we are too deep in complex territory already to start tugging on more threads.

danobi · Oct 03 '24 23:10

> Let's do the cpuid caching optimization since that will be necessary sooner or later.

@danobi Thanks for finding this!

Converting unroll(50) -> while ($y < 50), combined with my local CPU ID caching patch, indeed makes this PR work with bpfsnake. So I will implement the caching as a commit in this PR.

However, I still see the instruction count at 400K, not 20K:

```
processed 402613 insns (limit 1000000) max_states_per_insn 19 total_states 23892 peak_states 2587 mark_read 1302
```

On bpftrace master, I actually get slightly more instructions:

```
processed 416021 insns (limit 1000000) max_states_per_insn 25 total_states 24735 peak_states 2792 mark_read 1405
```

Curious how you're getting 20K. I'm on kernel 6.9.

I also tried converting unroll (12) -> while ($x < 12) to further reduce the instruction count, but that got me too many jumps 8193 >= limit of 8192. I get ~70K instructions there, though.

ryantimwilson · Oct 04 '24 12:10

@ajor you can generally ignore my comment, except point 1, per @danobi's investigation

> Second question, due to my lack of knowledge of the verifier: why does caching the CPU ID trigger this expanded exploration space, but fetching it before each string lookup doesn't? (Or does it, but we used to hit the 8192 branch limit before the 1000000 instruction limit?)

I think the reason is that bpfsnake doesn't use fmt strings or tuples - which is what I tested with before. I did fairly simple load tests, not complex programs like bpfsnake.

> Sorry, I don't understand this - what do you mean exactly by adding an assert? An assert in the bpftrace source code won't appear in the generated bytecode, and we don't have support for asserts in BPF AFAIK.

I meant generating something like this in the bytecode, similar to what the previous scratch maps did:

```c
if (smp_processor_id > MAX_CPU_ID)
  return 0;
else
  // continue with rest of program
  ...
```

> In case you haven't seen it already, we have similar code in bpffeature.cpp for detecting what the kernel supports:

Sounds good, I'll consider using it. I may also upstream an LLVM patch for cpuv4 detection into "probe", since this isn't urgent now.

> Unrelated point: There's a lot of unnecessary changes going on in the test cases, mostly because they're calling print unnecessarily (i.e. when that's not the thing they're meant to be testing). I'll go through at least some of them today and try to simplify them down so that they're not impacted by your PR.

I'm confused by this comment... don't we need to print things to read the output and verify it's what we expect? Or do you propose just using exit()?

ryantimwilson · Oct 04 '24 13:10

> I'm confused by this comment... don't we need to print things to read the output and verify it's what we expect? Or do you propose just using exit()?

I've made the change in #3486. The scripts in the codegen tests don't actually get run, so they don't need to output anything at all. They're meant to be minimal unit tests that just exercise specific parts of codegen_llvm.cpp.

ajor · Oct 04 '24 15:10

@ajor @danobi I implemented the CPU ID caching in its own PR since IMHO it was non-trivial (story of this whole task lol): https://github.com/bpftrace/bpftrace/pull/3487

Let's follow up there; then, once that's merged, I can continue on this PR

ryantimwilson · Oct 04 '24 19:10

Not ready for review yet - the code is still quite sloppy. Just putting it up to make sure tests work across all environments before I start refactoring

ryantimwilson · Oct 09 '24 22:10

Okay, after enough iterations locally: the logic around whether we read/write from a scratch buffer for map values/keys in CodegenLLVM is so complex that it's not worth trying to replicate it in ResourceAnalyser; otherwise the code will be super brittle.

We need to experiment with ways to run codegen twice. The first time, we don't actually generate code (we can create a MockIRBuilderBPF); we just count resources - e.g. when CreateMapValueScratchBuffer is called, we instead run a callback that bumps the number of buffers needed and the max size needed. The second time, we actually generate the code. I will try to prototype this, but I also need to focus on other work, so it might take me a bit.

The idea would be to:

  1. Split the Visitor logic out of CodegenLLVM into CodegenVisitor and CodegenLLVM. Keep compiling/optimizing/etc. in CodegenLLVM
  2. Create a GlobalVarAnalyser that is a subclass of CodegenVisitor and overrides the callback methods for scratch buffers to count resources

EDIT: Decided to take a simpler approach to unblock this; see the PR description

ryantimwilson · Oct 10 '24 00:10

@ajor @danobi @jordalgo this is ready for review.

I managed to get no memory over-allocation according to the codegen and bpfsnake tests. The tradeoff is that there are a lot of subtle dependencies between CodegenLLVM and ResourceAnalyser. So we may need to figure out whether this tradeoff is worth it, or if we should go for a blunter approach and maybe over-allocate memory.

ryantimwilson · Oct 10 '24 18:10

Sigh, my most recent changes to make memory accounting more accurate didn't work... still failing many tests due to the complexity of CodegenLLVM. I think the only option is the blunt approach, which may over-allocate memory

I'm not implementing a 16-byte minimum threshold for using a scratch buffer vs the stack, because it's hard to catch bugs if we do that (most tests are below 16 bytes)

ryantimwilson · Oct 11 '24 04:10

After an offline discussion with @danobi, @jordalgo and @ajor, we all agree this is too complex, hard to maintain, and brittle (it even failed tests despite passing on my 6.9 kernel with BPF backports)

Here's our proposal to make this easier:

  1. We allow setting a config setting BPFTRACE_SCRATCH_BUF_THRES that defaults to 32 bytes
  2. If the allocation is < BPFTRACE_SCRATCH_BUF_THRES, use the stack. Otherwise, use a scratch buffer
  3. We remove all this complicated code and let customers know: "Hey, if you use big strings, expect over-allocation". Advanced users can tune the threshold via the config setting

ryantimwilson · Oct 11 '24 17:10

> Dumb question: why do we even need a write map buffer? Shouldn't the assignment values already be allocated on the stack or in a global buffer depending on their type and size?

@jordalgo my understanding is there are a few cases:

  1. Tuples/strings where they are smaller than the map value size
  2. Arrays/structs that may not be read into BPF memory yet
  3. Delete API where we need to zero an element

There aren't a lot, just a smattering of smaller use cases

ryantimwilson · Oct 22 '24 14:10