Loading of ebpf program fails with "duplicate found at address"

Open scudette opened this issue 6 months ago • 8 comments

Describe the bug

Issue #894 highlighted an issue with attaching to symbols in ambiguous locations and PR #1588 now hard rejects such symbols.

This means that we cannot load at all on kernels that have duplicate symbols, even if we are not interested in that specific symbol. The error is:

populating kallsyms caches: getting modules from kallsyms: assigning symbol modules: symbol load_elf_phdrs: duplicate found at address 0xffffffff8e0245f0 (module ""): multiple kernel symbols with the same name

This hard reject should be made configurable, or the library should keep a list of symbols that cannot be attached to while still allowing the program to load.

This change breaks upgrades from older versions for us.

How to reproduce

Load any ebpf program on older kernels (I am using Ubuntu 22.04)

Linux version 5.15.0-139-generic (buildd@lcy02-amd64-029) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025

Version information

master

scudette avatar Jun 20 '25 06:06 scudette

The following diff fixes the issue, as it does not even register the offending symbols, so they cannot be attached to anyway.

diff --git a/internal/kallsyms/kallsyms.go b/internal/kallsyms/kallsyms.go
index 9154a8a..464654d 100644
--- a/internal/kallsyms/kallsyms.go
+++ b/internal/kallsyms/kallsyms.go
@@ -119,6 +119,8 @@ func assignModules(f io.Reader, symbols map[string]string) error {
                }

                if _, ok := found[string(s.name)]; ok {
+                       delete(symbols, string(s.name))
+                       continue
+
                        // We've already seen this symbol. Return an error to avoid silently
                        // attaching to a symbol in the wrong module. libbpf also rejects
                        // referring to ambiguous symbols.

scudette avatar Jun 20 '25 06:06 scudette

The problem is that the kernel doesn't behave deterministically in this case, as the issues you linked point out. Your "fix" just papers over the issue. I'm not aware of a workaround; this needs fixing upstream.

cc @ti-mo

lmb avatar Jun 20 '25 07:06 lmb

Can you explain what you mean by "upstream"? Do you mean it needs to be fixed in the Linux kernel?

While I agree with you, we need to be able to run on any Linux system in the wild, including older ones.

scudette avatar Jun 20 '25 07:06 scudette

Yes, upstream means the kernel.

we need to be able to run on any Linux system in the wild including older ones

Not sure what the best workaround is for you. Can you get away with not attaching to these symbols dynamically?

lmb avatar Jun 20 '25 07:06 lmb

To be clear, I am not trying to attach to these symbols at all. I get a failure to load the entire ebpf program because the library fails to initialize its cache.

Here is the relevant backtrace:

github.com/cilium/ebpf/internal/kallsyms.assignModules({0x2e45500, 0xc0075dc000}, 0xc00091b6f8)
        /home/mic/projects/ebpf/internal/kallsyms/kallsyms.go:123 +0x265
github.com/cilium/ebpf/internal/kallsyms.AssignModules(0xc00091ba20)
        /home/mic/projects/ebpf/internal/kallsyms/kallsyms.go:84 +0x26c
github.com/cilium/ebpf.populateKallsyms(0xc005a8e000)
        /home/mic/projects/ebpf/collection.go:457 +0x155
github.com/cilium/ebpf.newCollectionLoader(0xc007e7aa20, 0x0?)
        /home/mic/projects/ebpf/collection.go:430 +0x11b
github.com/cilium/ebpf.NewCollectionWithOptions(0xc007e7aa20, {{{0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0, 0x0, 0x0, ...}, ...})
        /home/mic/projects/ebpf/collection.go:363 +0xbb

As you can see, it fails to create a new collection because it cannot populate its kallsyms cache.

scudette avatar Jun 20 '25 08:06 scudette

That code should only execute if you are using that symbol: https://github.com/cilium/ebpf/blob/51ab2a455de272c9e1c6cf461cc12a9d7b71a031/internal/kallsyms/kallsyms.go#L117-L119

Can you log what arguments are passed to assignModules?

lmb avatar Jun 20 '25 09:06 lmb

Thanks for clarifying this - on closer inspection I found that one of the programs was actually trying to attach to that symbol after all.

The problem is that I have about 20-30 ebpf programs that I need to load at the same time, but not all of them will be used, depending on runtime requirements. If some systems do not support attaching to certain symbols because of the conflict described above, that is OK, I think, but I need to be able to remove the offending program at runtime so that I can at least attach the others.

Right now this is clumsy to do because the returned errors are not properly structured.

I have written the following code to work around this, but it feels a bit suboptimal, as the strings are probably not going to remain stable over time:

var (
    duplicateErrorRegex = regexp.MustCompile(`symbol ([^:]+): duplicate found at address`)
)

func (self *EBPFManager) catchEbpfLoadingErrors() (*ebpf.Collection, error) {

retry_loop:
    for {
        res, err := ebpf.NewCollectionWithOptions(self.spec, ebpf.CollectionOptions{})
        if err == nil {
            return res, nil
        }

        m := duplicateErrorRegex.FindStringSubmatch(err.Error())
        if len(m) > 1 {
            // Remove the offending program and try again.
            for key, p := range self.spec.Programs {
                if p.AttachTo == m[1] {
                    self.logger.Debug("Disabling program %v due to %v", key, err.Error())
                    delete(self.spec.Programs, key)
                    continue retry_loop
                }
            }
        }

        return nil, err
    }
}

Ideally, the error would carry the name of the offending symbol, and possibly even the name of the program that causes the error, so we don't have to do the above regex gymnastics.
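To illustrate what such a structured error could look like from the caller's side, here is a minimal stdlib-only sketch. Everything in it is an assumption for illustration: AmbiguousSymbolError, its fields, the load stand-in and the program name trace_phdrs are hypothetical and are not part of cilium/ebpf.

```go
package main

import (
	"errors"
	"fmt"
)

// AmbiguousSymbolError is a hypothetical error type, NOT part of cilium/ebpf.
// It carries the data the regex workaround currently scrapes from the message.
type AmbiguousSymbolError struct {
	Symbol  string // kernel symbol that resolved to more than one address
	Program string // program whose attach target is Symbol
}

func (e *AmbiguousSymbolError) Error() string {
	return fmt.Sprintf("program %s: symbol %s: multiple kernel symbols with the same name",
		e.Program, e.Symbol)
}

// load stands in for the library's loader. It wraps the typed error with %w,
// the way a real call chain would add context on the way up.
func load() error {
	cause := &AmbiguousSymbolError{Symbol: "load_elf_phdrs", Program: "trace_phdrs"}
	return fmt.Errorf("populating kallsyms caches: %w", cause)
}

// offendingProgram reports which program to drop if err wraps an
// AmbiguousSymbolError anywhere in its chain.
func offendingProgram(err error) (string, bool) {
	var amb *AmbiguousSymbolError
	if errors.As(err, &amb) {
		return amb.Program, true
	}
	return "", false
}

func main() {
	if name, ok := offendingProgram(load()); ok {
		fmt.Println("disabling program", name)
	}
}
```

With something along these lines, the retry loop could match on the error type via errors.As instead of on the message text, so a reworded message would not break it.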

I guess this is more of a feature request, though, so I'm happy to file a new issue.

Thanks again for clarifying and supporting this library.

scudette avatar Jun 20 '25 12:06 scudette

Hi @scudette, thanks for the report! Some background to justify why this check is in place and why we don't allow for it to be disabled: attaching to ambiguous symbols almost always leads to unintended behaviour. Multiple symbols means multiple addresses, so even though there's a single function definition in the kernel's source code, the binary can end up with multiple copies of it. One copy will get invoked by some parts of the kernel (let's assume all users within the same compilation unit before linking vmlinux), while the other copy will only be used by other subsystems built in another compilation unit.

This means that, without this check, you're guaranteed to miss all events from some parts of the kernel. Worse, the lack of determinism Lorenz mentioned will cause one symbol to appear first in kallsyms depending on the weather at the time the kernel was built, causing your program to be attached to different symbols across n machines. Depending on your use case, this could mean a security or policy bypass, or a complete lack of visibility -- but only sometimes, making it notoriously difficult to troubleshoot.
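To make the ambiguity concrete, here is a rough stdlib-only sketch that scans kallsyms-formatted text and lists names that occur more than once. It is not how the library implements its check. The sample data is partly made up (only the first load_elf_phdrs address comes from the error message above), and the optional module column is ignored for brevity.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// sample mimics /proc/kallsyms lines ("address type name [module]").
// The second load_elf_phdrs address and the vfs_read address are made up.
const sample = `ffffffff8e0245f0 t load_elf_phdrs
ffffffff8e031a20 t load_elf_phdrs
ffffffff8e0400c0 T vfs_read
`

// duplicateSymbols returns every symbol name that occurs on more than one
// line, mapped to the addresses it was seen at.
func duplicateSymbols(r io.Reader) (map[string][]string, error) {
	addrs := make(map[string][]string)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 3 {
			continue // skip blank or malformed lines
		}
		addr, name := f[0], f[2]
		addrs[name] = append(addrs[name], addr)
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	// Keep only names that were seen more than once.
	for name, a := range addrs {
		if len(a) < 2 {
			delete(addrs, name)
		}
	}
	return addrs, nil
}

// duplicateSymbolsInText is a convenience wrapper for in-memory input.
func duplicateSymbolsInText(text string) (map[string][]string, error) {
	return duplicateSymbols(strings.NewReader(text))
}

func main() {
	dups, err := duplicateSymbolsInText(sample)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for name, addrs := range dups {
		fmt.Printf("%s is ambiguous: %v\n", name, addrs)
	}
}
```

The same function can be pointed at an opened /proc/kallsyms. Note that unprivileged reads usually show zeroed addresses, so run it as root to see the distinct copies.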

There's also the added papercut of multiple function definitions being emitted to BTF, with the function's arguments sometimes having subtly different qualifiers or names, which is a problem if you want to attach program types that require BTF func info to be provided. The BTF lookup may pick a different candidate than the symbol chosen by the kallsyms resolver. IIRC this is primarily why libbpf chose to reject ambiguous symbol lookups.

You're doing the right thing in disabling the program altogether under these circumstances. Part of my motivation for this change was to push people towards looking for other, more stable symbols that could provide the same information (often a call site of the ambiguous symbol), though I understand that this isn't possible in all cases and workarounds are needed.


Before we commit to exporting a sentinel for ambiguous symbol errors, let's discuss some workarounds.

Typically, you want to avoid putting required and optional programs into a single BPF object. You'd load the required CollectionSpec first, which takes care of pinning shared maps, then load one or more optional objects on a best-effort basis. If you don't want/need to pin maps, CollectionOptions.MapReplacements is another solution, but it needs to be kept in sync with your C code when new shared maps are added, which is not ideal depending on its size.
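The control flow of that required-then-optional split could be sketched roughly as follows. This is a stdlib-only illustration: loadFunc, loadAll, errDuplicateSymbol and the object names are placeholders, and in real code each loader would be a closure around something like ebpf.NewCollectionWithOptions for one object.

```go
package main

import (
	"errors"
	"fmt"
)

// loadFunc stands in for whatever loads one BPF object, for example a closure
// around ebpf.NewCollectionWithOptions. It is a placeholder, not a library type.
type loadFunc func() error

// errDuplicateSymbol is a stand-in for the ambiguous-symbol failure.
var errDuplicateSymbol = errors.New("symbol load_elf_phdrs: duplicate found at address")

// loadAll loads the required object first and fails hard if that does not
// work. Optional objects are then loaded on a best-effort basis: their
// failures are collected by name instead of aborting the whole load.
func loadAll(required loadFunc, optional map[string]loadFunc) (map[string]error, error) {
	if err := required(); err != nil {
		return nil, fmt.Errorf("loading required object: %w", err)
	}
	skipped := make(map[string]error)
	for name, load := range optional {
		if err := load(); err != nil {
			skipped[name] = err
		}
	}
	return skipped, nil
}

func main() {
	ok := func() error { return nil }
	ambiguous := func() error { return errDuplicateSymbol }

	// "elf_tracing" and "net_tracing" are made-up object names.
	skipped, err := loadAll(ok, map[string]loadFunc{
		"elf_tracing": ambiguous,
		"net_tracing": ok,
	})
	if err != nil {
		fmt.Println("fatal:", err)
		return
	}
	for name, cause := range skipped {
		fmt.Printf("optional object %s disabled: %v\n", name, cause)
	}
}
```

Map pinning or MapReplacements would happen inside the individual loaders. The sketch only shows that one failing optional object no longer takes the others down with it.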

Alternatively, instead of building multiple objects, you can use LoadAndAssign to load a subset of (required) programs, with one or more subsequent invocations for optional programs. Note that each call to LoadAndAssign technically yields a new Collection under the hood, so maps aren't shared between invocations by default.

Lastly, the concept of optional programs has come up a few times internally, but we're far from reaching consensus here and usually recommend one of the options above to our users. This would come in the form of an optional tag in LoadAndAssign object fields. The user will have to deal with nils, but they get the choice after all. It's not clear yet how we will communicate errors to the caller.

For what it's worth, Cilium takes the opposite approach: within a Collection, all programs are required, and Cilium targets the lowest common denominator of kernels that support the features we want to provide as a product (currently 5.4). Doesn't translate to your problem 1:1, but we have many people working upstream to ensure interfaces are stable before we start relying on them.

Please give these solutions some consideration and report back!

ti-mo avatar Jun 23 '25 10:06 ti-mo