ebpf icon indicating copy to clipboard operation
ebpf copied to clipboard

Incomplete verifier log in 1.16

Open weli-l opened this issue 1 year ago • 7 comments

Describe the bug

When I called collection.LoadAndAssign, I found that the error log of ebpf program loading failure was intercepted, resulting in very few logs available to me. Is there any way to restore all these logs?

time="2024-08-27T02:38:37Z" level=error msg="loadKmeshSockConnObjects failed: program cluster_manager: load program: bad address: regs=10 stack=0 before 994: (67) r4 <<= 32: regs=10 stack=0 (truncated, 960 line(s) omitted)" subsys=pkg/bpf
time="2024-08-27T02:38:37Z" level=info msg="Clean kmesh_version map and bpf prog" subsys=pkg/bpf

How to reproduce

code a wrong bpf prog

Version information

v0.16.0

weli-l avatar Aug 30 '24 09:08 weli-l

Hi @weli-l

Which values did you use for LogLevel and LogSize in ProgramOptions? And can you provide a minimal reproducer?

florianl avatar Aug 30 '24 11:08 florianl

Hi @weli-l

Which values did you use for LogLevel and LogSize in ProgramOptions? And can you provide a minimal reproducer?

now I use v0.16.0 , perhaps there is no need to pass these two parameters,When I use v0.5.0, I set LogLevel to 1 and LogSize to 1280000, and I test it in a program, there is no minimal reproducer

weli-l avatar Aug 31 '24 01:08 weli-l

@weli-l You can use this way to get the whole verifier error:

        var ve *ebpf.VerifierError
	if errors.As(err, &ve) {
		// Using %+v will print the whole verifier error, not just the last
		// few lines.
		log.Fatalf("Verifier error: %+v", ve)
	}

mozillazg avatar Aug 31 '24 02:08 mozillazg

@weli-l You can use this way to get the whole verifier error:

        var ve *ebpf.VerifierError
	if errors.As(err, &ve) {
		// Using %+v will print the whole verifier error, not just the last
		// few lines.
		log.Fatalf("Verifier error: %+v", ve)
	}

Thanks! It works. There are more logs printed, but many logs are still omitted, making it impossible to support the location problem. In addition, I use this method

return fmt.Errorf("bpf Load failed: %-100v", ve)

It also cannot print the last 100 lines, and it still prints from the beginning of the log

weli-l avatar Aug 31 '24 10:08 weli-l

We shipped some changes to verifier log handling in 1.16, but there was no intended breakage. I didn't vet the changes thoroughly, and this is a very thorny piece of code by nature, so it's possible for there to be a regression.

In 1.16, LogSize is a no-op, the library now tries to probe the required buffer for the log. The LogSize field was removed on the dev branch right after the 1.16 release to avoid breaking API.

@weli-l On 1.16, please try to revert the following commits in the given order:

218b9f968308e6bfdf1cc4324ee683de66c3ea68 f2b2f6d61aaf1fa8c0eb14c2a58f7e199fbd6086

Printing with %+v should always print the full log if VerifierError.Truncated is false (this means we got the full log). Note that with these commits removed, you'll have to set LogSize to an appropriately-large value to fit your log, otherwise the end will be missing. Thanks!

ti-mo avatar Sep 05 '24 13:09 ti-mo

Supporting this issue: using 1.16, I had a similar situation where I couldn't see the verifier output for the subprogram and couldn't link the func#number to its name. And it would make sense since the subprogram verification should have happened early and be at the beginning of the logs.

mtardy avatar Sep 12 '24 09:09 mtardy

I've taken a look at the new verifier log handling and nothing sticks out as a potential cause for this. The info provided here is incredibly limited, so I don't have much to go on. I've played around with the verifier log tests on 5.10, 5.4 and 4.19 and things behave as expected.

However, it looks like the initial post has bad address: in the message, which corresponds to EFAULT, which is sprinkled generously around linux/kernel/bpf/verifier.c and looks like it's exclusively used for signaling verifier bugs. I suspect you've (both) hit a spot in the verifier that makes it return EFAULT instead of ENOSPC, without checking the size of the log buffer.

Without a reproducer, this is going to be impossible for me to troubleshoot, so please try to get me an ELF that triggers the bug. This may need to be fixed upstream, but making this waterproof in the future may be non-trivial.

I'll consider reinstating the ProgramOptions.LogSize field and treat it as the starting size of the buffer, so at least when this happens again in the future, the caller has a way of manually sizing the buffer to obtain the full log so we can confirm it's a verifier bug.

ti-mo avatar Oct 10 '24 10:10 ti-mo

@mtardy @weli-l I've just merged #1630 that re-adds a ProgramOptions.LogSizeStart field for situations like these. It sets the size of the log buffer at the start of the growth algorithm. When hitting a verifier bug, the error message now also includes a hint to that effect. The bad address: error from the report indicates you hit a verifier bug, so you're encouraged to check dmesg and report it to bpf@vger.

Please let me know if this helps, thank you.

ti-mo avatar Dec 18 '24 11:12 ti-mo