beast icon indicating copy to clipboard operation
beast copied to clipboard

Assert in close.hpp line 244 version 1.75

Open gopalak opened this issue 2 years ago • 34 comments

Discussed in https://github.com/boostorg/beast/discussions/2415

Originally posted by gopalak April 25, 2022 Hi I am seeing an assert in close.hpp which is kind of strange as I have serialized and made sure only one of async read, wr and close is called at any given time.

Unable to attach stack as its happening in my production code - What I see in the stack is I see assert in -- 242 teardown: 243 // Teardown 244 BOOST_ASSERT(impl.wr_block.is_locked(this)); 245 using beast::websocket::async_teardown;

Happens fron here. Ido have read and write handlers installed as per doc and dispatched

Screen Shot 2022-04-25 at 10 05 46 AM ------------

gopalak avatar Apr 26 '22 04:04 gopalak

@vinniefalco here moved it as an issue - will repost some of my finding here

gopalak avatar Apr 26 '22 04:04 gopalak

Glad to see that issues work :)

vinniefalco avatar Apr 26 '22 04:04 vinniefalco

I have 4 threads but think I figured seeing the code in close.hpp:244 - where it asserts that a write callback in actually installed via the impl.wr_block -> My code path has a logic to do some had shake before I actually install both read and write handlers and where this assert is hit, I only had read callback installed but no async write - issue is the window when the teardown was happening and this assert hit is not like easily reproducible. But based on reading the code and assert line that seems to be the correct fix.

There is no documentation which says the async close needs a pre rfequisite of write and read handlers needs to be installed for SSK streams @vinniefalco

gopalak avatar Apr 26 '22 04:04 gopalak

You can't mix sync and async calls

vinniefalco avatar Apr 26 '22 04:04 vinniefalco

Screen Shot 2022-04-25 at 3 15 47 PM

gopalak avatar Apr 26 '22 04:04 gopalak

@vinniefalco I am not mixing sync and async calls -> All I said was I am installing calls to async_read and async_write if one is not already present in my logic if I have to call async_close during my handshake flow

So what I am doing semantically is

if (no_async_read_installed) install_async_read();

if (no_async_write_installed) install async_write_hdlr()l;

call async_close();

gopalak avatar Apr 26 '22 04:04 gopalak

You can't mix sync and async calls

SO just to clarify I am not mixing and matching sync / async calls - all are async calls here. Pls see above comment ... - and when I say install its a dispatch of async_read and async_write is all I am doing before calling async_close - as one stack trace I saw actually had like 4 bytes of data in the buffers read and due to close was trying to write the close packet and that is where it hit the assert - see my screen shot above on the the actual beast src code where it hit the assert

gopalak avatar Apr 26 '22 04:04 gopalak

Since there is no doc on this I want to make sure I am understanding the code in the assert line that it indeed is looking that an async_write call is registered which would have the lock - is that what that call checks ?

gopalak avatar Apr 26 '22 04:04 gopalak

There is a doc, see:

https://www.boost.org/doc/libs/1_79_0/libs/beast/doc/html/beast/using_websocket/notes.html#beast.using_websocket.notes.thread_safety

vinniefalco avatar Apr 26 '22 04:04 vinniefalco

If you are getting that assert it means that more than one thread is touching the websocket stream at the same time, which is not allowed.

vinniefalco avatar Apr 26 '22 04:04 vinniefalco

I have made all async calls dispatched to the executor of the websocket stream (read, write and close - all async)

gopalak avatar Apr 26 '22 04:04 gopalak

There is a doc, see:

https://www.boost.org/doc/libs/1_79_0/libs/beast/doc/html/beast/using_websocket/notes.html#beast.using_websocket.notes.thread_safety

Yes I have read it and I have safe guards that only one of async (read, wr or close) is called in my code flow as well

gopalak avatar Apr 26 '22 04:04 gopalak

Is there a way I can find the thread which locked the soft futex in gdb ?

gopalak avatar Apr 26 '22 04:04 gopalak

@vinniefalco Changing to 1 thread for the io context makes no difference. I can see alll completion handler in that thread

gopalak avatar Apr 26 '22 16:04 gopalak

You still get the assert? Are you calling async_close from a foreign thread?

vinniefalco avatar Apr 26 '22 17:04 vinniefalco

You still get the assert? Are you calling async_close from a foreign thread?

Yes i am still seeing the assert and no I am not calling it from a foreign thread. Its the thread where io_ctx is run and I have set it to 1 thread now

gopalak avatar Apr 26 '22 17:04 gopalak

I am unable to share code but this is the sequence -> Complete WS handshake with peer Do some op like send one request and wait for reply and when waiting for reply client closes connection using async_close Think the way the close frame comes when we are waiting for read may be causing some thing (this is my gut feeling) as from my logs I can clearly see I am waiting for a reply when this assert is hit and nothing else happens in between apart from the async close and I can assure u that I dispatch async_close to the strand

gopalak avatar Apr 26 '22 18:04 gopalak

Has this been resolved?

vinniefalco avatar Jun 14 '22 17:06 vinniefalco

No this is still happening @vinniefalco

gopalak avatar Sep 05 '22 16:09 gopalak

So one recent scenario I saw was like this - @vinniefalco @madmongo1

  1. Connection is started - I have aa timer for connect
  2. SSL handshake done
  3. WS handshake starts -
  4. The app layer has this holder obj (lets call it AppSession which holds a shared pointer of the WS conn) has a cleanup started and as part of it - calls async_close on the WS Session which s actually working on WS handshake
  5. This call hits an ABORT on the above call or some times one level down asserting id != T::id - So clearly the handshake starting and async_close seems to be hitting some race condition.

There is only 1 thread for the io_ctx all async calls are dispatched to the strand and guards are in place to make sure there is only 1 of each async calls active at a given instant. Those asserts never fire meaning I have no pending and active async calls in tandem

This once again happens mostly under load. I think the above scenario is kind of hard to recreate. I tried in my code where I set the appObj state to bad and started the shutdown and when there is not much load it seems to work fine. In Prod we did see a lot of load on the client side as well as the server side and this scenario is hit multiple times.

I think there is a race condition some where and the checking of state as most asserts are in the same location when either the SSL is being torndown or the WS session

gopalak avatar Sep 05 '22 21:09 gopalak

@vinniefalco @madmongo1 can u folks pls help for the above ?

gopalak avatar Sep 06 '22 18:09 gopalak

Yep. Build me a test program to replicate the issue. Without that I don’t have a chance.

madmongo1 avatar Sep 06 '22 19:09 madmongo1

@madmongo1 I am unable to like I said its happening under load. I can send u stack trace to ur email. Which email can I send to also can copy @vinniefalco as well

gopalak avatar Sep 06 '22 19:09 gopalak

Can you build a test that creates a lot of load? Perhaps fork 1000 clients to hammer the service?

madmongo1 avatar Sep 06 '22 19:09 madmongo1

No - We cant its not a simple service to build else I would have. There are complex states involved. Cant give much innards here unfortunately else trust me I would have done that before I bother you folks here

gopalak avatar Sep 06 '22 19:09 gopalak

So this is how these things normally go down: Client reports a mysterious crash in beast. I ask him to build me a simpler test program to replicate the issue. The simpler test program doesn’t not crash. The client adds in more and more complexity until the program crashes. It turns out that the complexity was hiding some UB and beast is actually fine.

madmongo1 avatar Sep 06 '22 19:09 madmongo1

Prove me wrong.

madmongo1 avatar Sep 06 '22 19:09 madmongo1

So here is the thing. The assert is telling wr lock is not taken during tear down and I have one thread and all my async_close are dispatched. How can with 1 thread there could be multiple locks taken ? And if my code had an issue in such a dispatch which has happened before I didnt need this kind of load to see it. It would happen much faster with very less load and i have fixed them in my code.

gopalak avatar Sep 06 '22 19:09 gopalak

This is almost similar to the last assert you fixed in the beast code. With 1 thread for io_ctx and all async alls being dispatched we should never hit this scenario isnt it

gopalak avatar Sep 06 '22 19:09 gopalak

Prove me wrong.

I am not trying to prove any thing wrong here just to be clear. I am trying to assess and find a solution. I have scoured my code and it has so many asserts to catch any unexpected and wrong async calls made and none ever get hit

gopalak avatar Sep 06 '22 19:09 gopalak