packages icon indicating copy to clipboard operation
packages copied to clipboard

ebtables-tiny: Fix lockfile function and reduce wait on lock

Open SmithChart opened this issue 1 year ago • 6 comments

During investigation of an issue with ebtables we discovered that the locking mechanism in ebtables-tiny did not work as expected. These commits should fix this issue.

These commits were part of the Freifunk Braunschweig parker -releases for about an year now. We did not discover any negative impact with these changes.

SmithChart avatar Aug 04 '24 10:08 SmithChart

Please also advise: Should I open a separate PR on the main gluon repo once this is merged? Or do you pick up changes from here automagically?

SmithChart avatar Aug 04 '24 10:08 SmithChart

Waiting until the lock releases indefinitely is the intended behavior. The comment on the function is incorrect, not the code (the same incorrect comment exists in "full" ebtables). I believe the comment was correct at some point in the past, but the code was improved without updating the comment.

neocturne avatar Aug 04 '24 21:08 neocturne

Waiting until the lock releases indefinitely is the intended behavior. The comment on the function is incorrect, not the code (the same incorrect comment exists in "full" ebtables). I believe the comment was correct at some point in the past, but the code was improved without updating the comment.

The original code returned from lock_file with -1 if another process had the lock: https://git.netfilter.org/ebtables/commit/?id=8b5594d7c21f3056c8c194aea1f1519c006aeaee

When it was changed to use flock() it never returned as long another process had the lock, making the retry loop with sleep in ebt_get_kernel_table useless. No "Trying to obtain lock" is ever printed, making it hard to diagnose this case from logs.

https://github.com/freifunk-gluon/gluon/blob/main/package/gluon-radv-filterd/src/gluon-radv-filterd.c also has code to handle timeouts of ebtables-tiny. We're not sure how this is supposed to interact. :)

jluebbe avatar Aug 14 '24 18:08 jluebbe

In theory, ebtables should never run long enough for any of this to matter. If the lock is ever held longer than a few milliseconds, that seems like a bug to me, not the code that waits for the lock to release.

neocturne avatar Aug 14 '24 18:08 neocturne

In theory, ebtables should never run long enough for any of this to matter. If the lock is ever held longer than a few milliseconds, that seems like a bug to me, not the code that waits for the lock to release.

Agreed. We added this change because we saw hanging ebtables-tiny processes and wanted to investigate where they got stuck (e.g on the lock or somewhere else). Without these log messages in a (working) retry loop, there is no clear indication which of those ebtables-tiny processes is hanging where.

jluebbe avatar Aug 14 '24 21:08 jluebbe

@neocturne @jluebbe how to proceed here? :)

rotanid avatar Dec 28 '24 13:12 rotanid

@neocturne and me agreed on IRC to close this PR. it's a kernel bug and this PR isn't the right solution to it - and hopefully there will be a totally different solution in the future.

rotanid avatar Oct 09 '25 23:10 rotanid

@rotanid Do you have some details on the kernel bug?

jluebbe avatar Oct 15 '25 17:10 jluebbe

Basically, every ebtables call into the kernel should return quickly. Waiting indefinitely for other ebtables processes to exit is the correct and intended behavior (as "indefinitely" should always be <1s here) - but it is also unclear to me how the retry implemented in this PR even fixes anything, even if there is a kernel bug, unless it somehow breaks the locking altogether.

neocturne avatar Oct 15 '25 19:10 neocturne