online icon indicating copy to clipboard operation
online copied to clipboard

Handle limited open Connections due to keepalive connections

Open sgothel opened this issue 1 year ago • 2 comments

  • Resolves: #9833
  • Target version: master

Summary

(This commit is a WIP, pushed for discussion and to be amended!)

  • ProtocolHandlerInterface::checkTimeout(..)

    • Add bool return value: true -> shutdown connection, caller shall stop processing
    • Implemented for http::Session
      • Timeout (30s) with missing response
    • Implemented for WebSocketHandler
      • Timeout (64s = SocketPoll::DefaultPollTimeoutMicroS) after missing pong (server only)
  • StreamSocket -> Socket (properties moved)

    • bytes sent/received
    • closed state
  • Socket (added properties)

    • creation- and last-seen -time
    • socket type and port
    • checkForcedRemoval(..) WIP
      • called directly from SocketPoll::poll()
      • only for IPv4/v6 network connections
      • similar to ProtocolHandlerInterface::checkTimeout(..)
      • added further criteria (age, throughput, ..)
        • Timeout (64s = SocketPoll::DefaultPollTimeoutMicroS) if (now - lastSeen) > timeout
        • Timeout (12 hours) if (now - creationTime) > timeout
        • TODO: Add maximimal IPv4/IPv6 socket-count criteria, drop oldest.
  • SocketPoll::poll()

    • Additionally erases if !socket->isOpen() || socket->checkForcedRemoval()

TODO

  • Facility to configure timeouts, at least for testing!
  • More elaborated tests
    • WebSocket
    • ..
  • checkForcedRemoval(..) WIP
    • TODO: Add maximimal IPv4/IPv6 socket-count criteria, drop oldest.

Checklist

  • [X] I have run make prettier-write and formatted the code.
  • [X] All commits have Change-Id
  • [ ] I have run tests with make check
  • [X] Passed certain unit tests and cypress desktop annotations.js
  • [X] I have issued make run and manually verified that everything looks okay
  • [X] Documentation (manuals or wiki) has been updated or is not required

sgothel avatar Aug 28 '24 00:08 sgothel

This change set created a linkage error (android), the 'usual' type_info, vtable thing. Can't figure a missed 'virtual' or even pure virtual '=0' virtual key in the code, also reviewed the MOBILEAPP macro. Hence I need to reproduce and build the android locally here .. WIP.

sgothel avatar Aug 28 '24 09:08 sgothel

Thank you @Ashod for the review.

I understand this is a draft, but it reads like a mix of random fixes and improvements (some of it is good and useful, like adding const and = default, make_shared, etc.), re-inventing methods for doing things already supported, and debugging/tracing support for your understanding (to be removed).

The draft status was motivated by my intention to discuss the actual semantics and goals, i.e. mitigate limited connections - or better, ensure connections are closed as well as to reduce the risk of running out of sockets/file-handles.

Which brings us to the next point I guess.

It's very hard to read, as there is no clear theme or purpose.

When reading commit by commit, I believe everything is well split apart semantically. I believed the purpose was well described in #9833 and the commit https://github.com/CollaboraOnline/online/pull/9916/commits/3ef99c511d168b1a9a9bb6c86d4a857a584e91e4 actually starting to address (matching this drafts' description) was meaningful.

Indeed, I added a few unrelated single commits into this review. Understanding that we allow reasonable small changes w/o issue report, esp of intrinsic nature, I believe it should be OK. (admin vs workflow compromise as discussed earlier)

I left various comments, but I didn't dive too deep in a number of areas because it was unclear why we are either replacing an existing approach, or inventing a new one when we can improve on the existing one (where there are shortcomings).

I am not aware of really replacing anything. I merely moved and added functionality.

At a minimum, I'd separate the backtrace improvements from the socket improvements, and likely the misc. improvements from new code/features/tests.

As described above, a zero side-effect change in its own commit.

The Backtrace object in particular is useful for all, as we can produce an instance capturing the current stack, move it around and output where desired. As discussed, I did not add addr2line or libunwind dependencies, which I understand ofc.

Finally, I'm troubled by vformatString and the re-working a lot of the socket states and internals with a huge set of new members that change the existing semantics with unclear benefits.

I just like printf formatting as it is more performant and flexible (at least to me). IMHO I use it in a most reasonable save fashion, while being supported by the compiler to remove the risk of wrong format specifiers and stack values. The latter is still a weakness, which I am aware of.

Perhaps there is some context I'm missing. Perhaps this was agreed on and I'm unaware.

Allow me to continue answering your other remarks below.

sgothel avatar Aug 28 '24 17:08 sgothel

I think it'll be easier to manage this if its split up so that the mostly uncontroversial misc cleanups of existing stuff where no change in logic is intended, the make_unique, make_shared, privator ctor -> = delete, constifying, expansion of auto& to real type, emplace_back, etc goes into a separate pr.

The backtrace debugging aid looks probably useful to have to me, but can also be its own pr I think (and probably not worth fighting the "CodeQL scanning - action / Analyze" step which apparently can't handle the constexpr use which is blocking ci from passing this pr).

And generally trim this pr down to the core change to make it easier to think about it

caolanm avatar Aug 29 '24 08:08 caolanm

Thanks for the responses, @sgothel! We've discussed some offline, so didn't reply here (e.g. printf-style formatting being fast--in fact it's very slow compared to only serialization).

Indeed, I added a few unrelated single commits into this review. Understanding that we allow reasonable small changes w/o issue report, esp of intrinsic nature, I believe it should be OK. (admin vs workflow compromise as discussed earlier)

In general, agreed. But I'll echo @caolanm's comment here that breaking up into PRs, that can be decided to merge independent of other parts, is always helpful.

I should also point out that this PR is well above a 1000 lines. This is a rare size. Considering that some parts change the interface of a large hierarchy (the sockets), which is a fundamental core part of the whole project and very delicate (can easily break things catastrophically), it requires reading and studying far more code (i.e. the context and consequences) to review properly. An isolated and low-risk change of a few 100 lines is fine. A complex few 100 line change (such as in the sockets) is challenging and time consuming. Sometimes we have no choice because we can't break it up. But a 1000 lines that can be easily broken up is hardly necessary.

Finally, even though we can review individual commits, it's preferable to have a single theme to a PR. Helps the reviewer. In some cases a few misc. changes can go together (comment fixes, cosmetics, formatting, logging, etc.). But mixing functional changes with non-functional changes dilutes the process and can reduce the quality of the review. More so with high-risk changes (i.e. socket code) mixed with debugging code (i.e. backtracing).

Regardless, you have a lot of great improvements here, and I don't want to undermine the importance of your contribution! Thank you.

Ashod avatar Aug 30 '24 11:08 Ashod

Thank you both for your review and kind words.

I have split up single commits into 4 branches, i.e. this pull-request branch and

  • #9952
  • #9953
  • #9954

For this feature work, I need to complete the unit test incl. the timeout configuration. Looks OK .. shall provide it in 1-2 days.

sgothel avatar Sep 02 '24 03:09 sgothel

@sgothel, there were multiple suggestions to split this large PR up to reduce its size. That was when it had 18 commits and was "only" a 1000 lines. Now it's past 1500 lines and is 1 commit! I'm sensing this isn't going in the direction of the feedback.

What's the plan, please?

Ashod avatar Sep 06 '24 12:09 Ashod

As said needs to be split up; lets try to get the less controversial pieces in and merged as separate pull requests - to slim this down =) Can you work on that Sven ?

will do the split later then (tomorrow) + finishing the tests

sgothel avatar Sep 06 '24 12:09 sgothel

@sgothel, there were multiple suggestions to split this large PR up to reduce its size. That was when it had 18 commits and was "only" a 1000 lines. Now it's past 1500 lines and is 1 commit! I'm sensing this isn't going in the direction of the feedback.

What's the plan, please?

As I wrote, I will split out the config and TimeAverage (~400 lines or so) and put the remainder at the bottom -> 3 commits within this pull request. Perhaps more, w/ a fresh look.

But splitting this up in multiple pull request (PR) makes would probably makes no sense, as these pieces are required and each PR must be tested. IMHO.

sgothel avatar Sep 06 '24 13:09 sgothel

As I wrote, I will split out the config and TimeAverage (~400 lines or so) and put the remainder at the bottom -> 3 commits within this pull request. Perhaps more, w/ a fresh look.

But splitting this up in multiple pull request (PR) makes would probably makes no sense, as these pieces are required and each PR must be tested. IMHO.

Multiple pull requests means we can merge non-controversial stuff fast - and have it being tested in the wild - ideally of course in the red release. It's a definite anti-pattern to build up a huge, blocking change-request that is all or nothing. There is/was lots of other 'stuff' in this that doesn't belong in one CR - IIRC; cosmetic stuff - "remove virtual from lots of functions" - that lower risk stuff should be got in separately, and should not be repeatedly presented to reviewers - it just chews time to skip over. So - please split this into separate CRs that we can merge - non-controversial / pre-requisite pieces, and other bits on top.

Thanks.

mmeeks avatar Sep 06 '24 13:09 mmeeks

I have cut out pull #10012 (which should be merged with this one IMHO).

Also produced smaller commits, which hopefully enables easier review of intend/story. Since the 4 commits of this pull request (PR) have no real meaning beyond this issue, allow me to offer these within this PR.

As mentioned Friday, I like to add one more criteria and test it before wrapping things up (-> maxConnections).

sgothel avatar Sep 09 '24 07:09 sgothel

Will also revise left over dirt (tests, ..) as addressed in previous review .. notify when done.

sgothel avatar Sep 09 '24 08:09 sgothel

@Ashod and @mmeeks .. I have cleaned up the commits and followed most if not all of your ideas/recommendations.

For this .. I have rearranged the single commits, so that the first ones merely cleanup/restructure Socket etc w/o features. Indeed, they are all compile clean and should be bisectable.

Then I completed the feature and unit tests, also based on latest discussion (max-connection -> keep old-connections alive but block new ones as already implemented via MAX_CONNECTIONS). So this would be a double-safety-net.

Since I am quite happy now, I move this to 'review'.

sgothel avatar Sep 11 '24 14:09 sgothel

.. oops, scan detected a merge conflict of mine .. one sec.

sgothel avatar Sep 11 '24 14:09 sgothel

.. oops, scan detected a merge conflict of mine .. one sec.

Re-arranged sneaking in 2 missing commits in the chain.

sgothel avatar Sep 11 '24 14:09 sgothel

Since I am quite happy now, I move this to 'review'.

Thanks Sven - sounds very promising =) good stuff.

mmeeks avatar Sep 11 '24 17:09 mmeeks

I'm confused. I still see some patches eg. this one: https://github.com/CollaboraOnline/online/pull/9916/commits/e5d6dc55a210aab1aa3d22b408e4726fbda96945 that has a combination of totally safe, new debugging methods, as well as much the more risky 'isClosed' change, as well as all this renaming and accessor changes - which need splitting in isolation to be reviewed carefully, or am I mis-reading the github UI - TBH I find the UI confusing; perhaps those are old patch fragments from earlier commits or something. Did you push the split ? and/or am I just not seeing it ?

mmeeks avatar Sep 11 '24 17:09 mmeeks

I'm confused. I still see some patches eg. this one: e5d6dc5 that has a combination of totally safe, new debugging methods, as well as much the more risky 'isClosed' change, as well as all this renaming and accessor changes - which need splitting in isolation to be reviewed carefully, or am I mis-reading the github UI - TBH I find the UI confusing; perhaps those are old patch fragments from earlier commits or something. Did you push the split ? and/or am I just not seeing it ?

I didn't remove this one (Socket and WebSocketHandler) as visible by the date + hash. I reorganized on top of it to separate feature from restructuring code, thought this was the intention and I do not see anything risky here (flipped a boolean close -> open, OK).

The next parts 2, 3 address single issues and also fix the std::string 'abuse' for osstream, scoped enums, shorter accessors etc.

All of these are orthogonal to the feature - so could be seen as bisectable ex-feature.

A later Socket change for the passing of time-point, then the 2 feature commits.

Perhaps not all 100% to your preference in regards to sliced down commits .. but feature and restructuring is.

If you insist on re-splitting it (i.e. this 1st commit perhaps), advise .. would do it tomorrow then incl. clean-build tests at least (as done here).

(edit: re this web-UI, yes esp when commenting etc .. I usually use git commandline, git-cola, gitk and qgit .. also when reviewing for the overview)

sgothel avatar Sep 11 '24 17:09 sgothel

Well - CI passes which is a good sign; I'm really not happy at having cosmetic cleanups merged with drastic functional change - today you can see where the functional change is amid the big chunk of non-functional change: how about in two years ? =) We can get this in, but not again for socket code.

My only concern now is what happens during save-on shutdown; I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?) - and if that is the last ClientSession that has been hard disconnected from, and we have un-saved data - we need to keep that ClientSession alive until it is saved.

It struck me that this might well be why we have shutdown but not deleted Socket's in the SocketPoll - or did I miss something there?

@Ashod can you do a last quick review for that use-case, and double-check we have a unit test that covers it effectively ? =)

Thanks!

mmeeks avatar Sep 11 '24 20:09 mmeeks

Hi @mmeeks,

My only concern now is what happens during save-on shutdown;

It's complicated and one of the most sensitive parts of the code, with a lot of history.

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

Sort of.

SocketPoll does in fact control when we find out that the socket is disconnected (ClientSession::onDisconnect()), which lets DocBroker know to start cleaning up (DocumentBroker::removeSession()).

and if that is the last ClientSession that has been hard disconnected from, and we have un-saved data - we need to keep that ClientSession alive until it is saved.

Indeed, as you expect, that's what we do.

We don't blindly remove the ClientSession's reference from DocBroker. Instead, we intentionally keep the last reference to ClientSession in DocBroker and start the unload process, if we need to save. We keep this last reference until we successfully save and upload, or we fail and give up.

I have really tried to understand if this PR could affect this in any way, but I can't with any confidence say I found out the answer. On the face of it, nothing stood out. But the sockets are such a foundational part of everything we do, that it's hard to say with a straight face there are no side-effects to worry about.

It struck me that this might well be why we have shutdown but not deleted Socket's in the SocketPoll - or did I miss something there?

I can't tell. So much has changed, from socket lifetime management, to the socket poll, to refactorings of socket open/closed state, to the timeout management, that I cannot estimate the impact of all of this. (My worry is that we've mangled things badly enough that we either randomly disconnect sockets, or we don't properly handle edge cases that many other functions depend on.)

But I do know that we had far more benign code than this, that had harbored great ills and nasty surprises.

@Ashod can you do a last quick review for that use-case, and double-check we have a unit test that covers it effectively ? =)

We do have unit-tests for the last-editor disconnecting with modifications to save. Unfortunately, the unit-tests related to save and upload have become very unstable of late, with very high rate of random failure. So that's no consolation.

P.S. This 5th pass was anything but quick. Considering there is ~1000 new lines changed since my last review, at ~2400 lines, I'm surprised it only took me 3 hours. (Actually, no, I'm not surprised; I had to stop.)

Ashod avatar Sep 12 '24 02:09 Ashod

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

Sort of.

So I did a bit of reading myself; this is fine: I thought SocketPoll controlled lifecycle of the ClientSession - but in fact the DocumentBroker member:

/// All session of this DocBroker by ID.
SessionMap<ClientSession> _sessions;

template<class T> class SessionMap : public std::map<std::string, std::shared_ptr<T> >

holds a hard reference on it - so not a big deal =)

mmeeks avatar Sep 12 '24 08:09 mmeeks

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

...

template class SessionMap : public std::map<std::string, std::shared_ptr >

holds a hard reference on it - so not a big deal =)

I was testing (~1 week ago) the socket-close on these criteria of this patch with firefox editing a document. The connection was build-up again and no artifacts were visible.

sgothel avatar Sep 12 '24 16:09 sgothel

Revised PR, please note Changes to original PR in description on the top.

Thx to @caolanm for performing the initial split-up, discussions leading to more simplifications and reordering as well as introducing me to git rebase -i, also used to validate clean compilation of each commit at the end.

sgothel avatar Sep 26 '24 08:09 sgothel

Again I would like to separate the cosmetic re-working of things from functional changes. This should be resolved now, all bisectable.

Pushing down the bytesSent etc. statistics from StreamSocket -> Socket - interesting, not sure of the rationale - perhaps harmless, but would like to know why - where does this simplify something else ?

  • to allow dumping the socket stats (incl. throughput)
  • to have it in the same level as open state, fd, - same domain

And of course, other things - not so convinced about ;-) Hope most of thes 'claims', incl. @Ashod s, are resolved by now, please note Changes to original PR in description on the top.

sgothel avatar Sep 26 '24 13:09 sgothel

Added minor changes addressing @Ashod earlier notes.

sgothel avatar Sep 26 '24 14:09 sgothel

Note: Unit tests: +7 files, 879 insertions(+), remaining changed lines +800, -222

sgothel avatar Sep 26 '24 14:09 sgothel

Actually - I lie the c1,2,3,4 thing really needs fixing or explaining; its not clear to me that it will work =)

mmeeks avatar Oct 03 '24 16:10 mmeeks

@mmeeks @caolanm addresses Michael's notes

  • inlined the changes in rebased commits
  • rebased to (Monday's) master

sgothel avatar Oct 09 '24 11:10 sgothel

* rebased to (Monday's) master

I apologize: Should have pushed with a rebase to master 1st, then another push with the changes. This would have allowed us to see the push delta on the fixed notes only using this github UI diff. Now it shows the total change :-(

sgothel avatar Oct 09 '24 11:10 sgothel

resolved merge conflict due to changed net/Socket.*

sgothel avatar Oct 10 '24 13:10 sgothel

@caolanm

Did a full rebase to master a330f9e75717de96c72794d0ac672f75376c9711

  • ensuring to pass git rebase -i --exec "make -j 20"
    • fixing some mild compilation issues introduced with last rebase merge fixes

sgothel avatar Oct 16 '24 03:10 sgothel