Improve CLI usability around ringbuffer limits

Open joestringer opened this issue 5 years ago • 1 comments

Some example usage of hubble:

I want to find out if any apps are reaching out to 8.8.8.8:

# hubble observe --to-ip=8.8.8.8
requested data has been overwritten and is no longer available

I'm not sure how many flows are kept in the ringbuffer or what time span that represents, so I tried listing the last 30m:

# hubble observe --namespace default -o json --since=30m
requested data has been overwritten and is no longer available

These seem to be both derived from the error in the hubble server side around attempting to list more flows than the current ringbuffer contents. But as a user, I don't know or necessarily care about the ringbuffer size; I just want to query these flows and get whatever information is available.

Furthermore, the error itself is pretty generic, so I know I am doing something wrong but it's unclear what I should try next. I was informed there is also an --all CLI flag in the latest version (not yet available in Cilium containers), and that I can analyze the output of hubble status to figure out how many flows are likely to be present. But this will not catch all cases, and these are very complicated mitigations if I just want to find as much information as is available in Hubble.

If the response from the Hubble server was clearly "Here are the N flows out of M" or "From the last N minutes (since timestamp X), I found these relevant flows" then this would help to provide the context around whether the flows are likely to include the information I'm looking for or not.
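To make the proposal concrete, here is a minimal Go sketch of what such a summary line could look like. Everything in it is hypothetical: the QuerySummary type, its fields, and the wording of the message are assumptions for illustration, not anything Hubble currently exposes.

```go
package main

import (
	"fmt"
	"time"
)

// QuerySummary is a hypothetical type (not part of Hubble's API) that a
// server could return alongside the flows it found.
type QuerySummary struct {
	Matched   int       // flows that matched the filter
	Available int       // flows still present in the ring buffer when it was scanned
	Oldest    time.Time // timestamp of the oldest flow still available
}

// Describe renders the summary as a single human-readable line.
func (s QuerySummary) Describe(now time.Time) string {
	age := now.Sub(s.Oldest).Round(time.Minute)
	return fmt.Sprintf("matched %d of %d available flows, covering the last %s (since %s)",
		s.Matched, s.Available, age, s.Oldest.Format(time.RFC3339))
}

// exampleSummary builds a summary from fixed, made-up values so the
// output is deterministic.
func exampleSummary() string {
	now := time.Date(2020, time.December, 16, 21, 12, 0, 0, time.UTC)
	s := QuerySummary{
		Matched:   3,
		Available: 4096,
		Oldest:    now.Add(-15 * time.Minute),
	}
	return s.Describe(now)
}

func main() {
	// prints: matched 3 of 4096 available flows, covering the last 15m0s (since 2020-12-16T20:57:00Z)
	fmt.Println(exampleSummary())
}
```

With a line like this, a user who asked for --since=30m would immediately see that only the last 15 minutes were still in the buffer.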

Dec 16 '20 21:12 joestringer

> These seem to be both derived from the error in the hubble server side around attempting to list more flows than the current ringbuffer contents

You are allowed to ask for more flows than the buffer contains. The problem is that in a chatty cluster (and a lockless buffer) there is not enough time for us to rewind the buffer and read the flows before the writer writes over them (thus producing this error).
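The race described above can be modeled with a toy ring buffer in Go. This is a simplified, single-goroutine sketch and not Hubble's actual implementation: the Ring type and its methods are hypothetical, and only the error text is taken from this issue. The reader picks a start position, the writer keeps going, and by the time the reader gets to that position the slot has been reused.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverwritten reuses the message quoted in this issue. The names in
// this file are hypothetical and do not come from Hubble's code.
var ErrOverwritten = errors.New("requested data has been overwritten and is no longer available")

// Ring is a toy fixed-size ring buffer addressed by a monotonically
// increasing write counter.
type Ring struct {
	data  []int
	write uint64 // total number of items ever written
}

// NewRing allocates a ring with the given number of slots.
func NewRing(capacity int) *Ring {
	return &Ring{data: make([]int, capacity)}
}

// Write stores v, silently overwriting the oldest slot once the ring is full.
func (r *Ring) Write(v int) {
	r.data[r.write%uint64(len(r.data))] = v
	r.write++
}

// Read returns the item written at absolute position idx, or
// ErrOverwritten if the writer has already lapped that position.
func (r *Ring) Read(idx uint64) (int, error) {
	if idx >= r.write {
		return 0, errors.New("position not written yet")
	}
	if r.write-idx > uint64(len(r.data)) {
		return 0, ErrOverwritten
	}
	return r.data[idx%uint64(len(r.data))], nil
}

func main() {
	r := NewRing(4)
	for i := 0; i < 4; i++ {
		r.Write(i)
	}
	// The reader rewinds to the oldest position...
	start := uint64(0)
	// ...but two more flows arrive before it gets to read anything.
	r.Write(4)
	r.Write(5)

	_, err := r.Read(start)
	fmt.Println(err) // prints the "overwritten" error

	v, _ := r.Read(2) // position 2 has not been lapped yet
	fmt.Println(v)    // prints 2
}
```

Note that the request itself was valid when it was made; it only fails because the writer advanced in between, which is why a larger --last or --since value makes the error more likely without being wrong in itself.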

There are fundamentally no problems with asking for --last 100000 or --since 100y, apart from the timing issue described above.

--all was added to the hubble CLI, but if you read the PR (https://github.com/cilium/hubble/pull/411/files), it requests --last MAX_INT, so that doesn't solve the timing issue.

@Rolinh was the last person to work on this and is very familiar with it. Last I heard there were some attempts to create more than one read pointer, but I don't have the status of that in my head currently.

We currently reserve only one flow between the reader and the writer, so the amount of time we have to respond to the request is 1 / (#flows/s). For example, at 1,000 flows/s that is 1 ms. We may want to reserve more flows between the reader and writer pointers to give us more time to respond to queries, but that's not guaranteed to work in all cases either.
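As a rough back-of-the-envelope sketch in Go, the time budget can be generalized from one reserved flow to k reserved flows as k / (#flows/s). This assumes a steady flow rate, which a bursty cluster will not have; the responseBudget function is a hypothetical helper, not part of Hubble.

```go
package main

import (
	"fmt"
	"time"
)

// responseBudget estimates how long a reader has before the writer
// catches up, given the number of flows reserved between the reader and
// writer pointers and an assumed steady flow rate. Hypothetical helper.
func responseBudget(reservedFlows int, flowsPerSecond float64) time.Duration {
	if flowsPerSecond <= 0 {
		return 0 // no meaningful estimate without a positive rate
	}
	return time.Duration(float64(reservedFlows) * float64(time.Second) / flowsPerSecond)
}

func main() {
	fmt.Println(responseBudget(1, 1000))   // prints 1ms
	fmt.Println(responseBudget(100, 1000)) // prints 100ms
}
```

The budget grows linearly with the reserved gap, but any fixed gap can still be outrun by a sufficiently chatty writer or a sufficiently slow query.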

Other solutions are welcome, but it's difficult since there is no read/write locking.

Dec 16 '20 23:12 glibsm