beast Assert in close.hpp line 244 version 1.75

Assert in close.hpp line 244 version 1.75

Open gopalak opened this issue 2 years ago • 34 comments

Discussed in https://github.com/boostorg/beast/discussions/2415

^{Originally posted by gopalak April 25, 2022} Hi I am seeing an assert in close.hpp which is kind of strange as I have serialized and made sure only one of async read, wr and close is called at any given time.

Unable to attach stack as its happening in my production code - What I see in the stack is I see assert in -- 242 teardown: 243 // Teardown 244 BOOST_ASSERT(impl.wr_block.is_locked(this)); 245 using beast::websocket::async_teardown;

Happens fron here. Ido have read and write handlers installed as per doc and dispatched

------------

Apr 26 '22 04:04 gopalak

@vinniefalco here moved it as an issue - will repost some of my finding here

Apr 26 '22 04:04 gopalak

Glad to see that issues work :)

Apr 26 '22 04:04 vinniefalco

I have 4 threads but think I figured seeing the code in close.hpp:244 - where it asserts that a write callback in actually installed via the impl.wr_block -> My code path has a logic to do some had shake before I actually install both read and write handlers and where this assert is hit, I only had read callback installed but no async write - issue is the window when the teardown was happening and this assert hit is not like easily reproducible. But based on reading the code and assert line that seems to be the correct fix.

There is no documentation which says the async close needs a pre rfequisite of write and read handlers needs to be installed for SSK streams @vinniefalco

Apr 26 '22 04:04 gopalak

You can't mix sync and async calls

Apr 26 '22 04:04 vinniefalco

Screen Shot 2022-04-25 at 3 15 47 PM

Apr 26 '22 04:04 gopalak

@vinniefalco I am not mixing sync and async calls -> All I said was I am installing calls to async_read and async_write if one is not already present in my logic if I have to call async_close during my handshake flow

So what I am doing semantically is

if (no_async_read_installed) install_async_read();

if (no_async_write_installed) install async_write_hdlr()l;

call async_close();

Apr 26 '22 04:04 gopalak

You can't mix sync and async calls

SO just to clarify I am not mixing and matching sync / async calls - all are async calls here. Pls see above comment ... - and when I say install its a dispatch of async_read and async_write is all I am doing before calling async_close - as one stack trace I saw actually had like 4 bytes of data in the buffers read and due to close was trying to write the close packet and that is where it hit the assert - see my screen shot above on the the actual beast src code where it hit the assert

Apr 26 '22 04:04 gopalak

Since there is no doc on this I want to make sure I am understanding the code in the assert line that it indeed is looking that an async_write call is registered which would have the lock - is that what that call checks ?

Apr 26 '22 04:04 gopalak

There is a doc, see:

https://www.boost.org/doc/libs/1_79_0/libs/beast/doc/html/beast/using_websocket/notes.html#beast.using_websocket.notes.thread_safety

Apr 26 '22 04:04 vinniefalco

If you are getting that assert it means that more than one thread is touching the websocket stream at the same time, which is not allowed.

Apr 26 '22 04:04 vinniefalco

I have made all async calls dispatched to the executor of the websocket stream (read, write and close - all async)

Apr 26 '22 04:04 gopalak

There is a doc, see:

https://www.boost.org/doc/libs/1_79_0/libs/beast/doc/html/beast/using_websocket/notes.html#beast.using_websocket.notes.thread_safety

Yes I have read it and I have safe guards that only one of async (read, wr or close) is called in my code flow as well

Apr 26 '22 04:04 gopalak

Is there a way I can find the thread which locked the soft futex in gdb ?

Apr 26 '22 04:04 gopalak

@vinniefalco Changing to 1 thread for the io context makes no difference. I can see alll completion handler in that thread

Apr 26 '22 16:04 gopalak

You still get the assert? Are you calling async_close from a foreign thread?

Apr 26 '22 17:04 vinniefalco

You still get the assert? Are you calling async_close from a foreign thread?

Yes i am still seeing the assert and no I am not calling it from a foreign thread. Its the thread where io_ctx is run and I have set it to 1 thread now

Apr 26 '22 17:04 gopalak

I am unable to share code but this is the sequence -> Complete WS handshake with peer Do some op like send one request and wait for reply and when waiting for reply client closes connection using async_close Think the way the close frame comes when we are waiting for read may be causing some thing (this is my gut feeling) as from my logs I can clearly see I am waiting for a reply when this assert is hit and nothing else happens in between apart from the async close and I can assure u that I dispatch async_close to the strand

Apr 26 '22 18:04 gopalak

Has this been resolved?

Jun 14 '22 17:06 vinniefalco

No this is still happening @vinniefalco

Sep 05 '22 16:09 gopalak

So one recent scenario I saw was like this - @vinniefalco @madmongo1

Connection is started - I have aa timer for connect
SSL handshake done
WS handshake starts -
The app layer has this holder obj (lets call it AppSession which holds a shared pointer of the WS conn) has a cleanup started and as part of it - calls async_close on the WS Session which s actually working on WS handshake
This call hits an ABORT on the above call or some times one level down asserting id != T::id - So clearly the handshake starting and async_close seems to be hitting some race condition.

There is only 1 thread for the io_ctx all async calls are dispatched to the strand and guards are in place to make sure there is only 1 of each async calls active at a given instant. Those asserts never fire meaning I have no pending and active async calls in tandem

This once again happens mostly under load. I think the above scenario is kind of hard to recreate. I tried in my code where I set the appObj state to bad and started the shutdown and when there is not much load it seems to work fine. In Prod we did see a lot of load on the client side as well as the server side and this scenario is hit multiple times.

I think there is a race condition some where and the checking of state as most asserts are in the same location when either the SSL is being torndown or the WS session

Sep 05 '22 21:09 gopalak

@vinniefalco @madmongo1 can u folks pls help for the above ?

Sep 06 '22 18:09 gopalak

Yep. Build me a test program to replicate the issue. Without that I don’t have a chance.

Sep 06 '22 19:09 madmongo1

@madmongo1 I am unable to like I said its happening under load. I can send u stack trace to ur email. Which email can I send to also can copy @vinniefalco as well

Sep 06 '22 19:09 gopalak

Can you build a test that creates a lot of load? Perhaps fork 1000 clients to hammer the service?

Sep 06 '22 19:09 madmongo1

No - We cant its not a simple service to build else I would have. There are complex states involved. Cant give much innards here unfortunately else trust me I would have done that before I bother you folks here

Sep 06 '22 19:09 gopalak

So this is how these things normally go down: Client reports a mysterious crash in beast. I ask him to build me a simpler test program to replicate the issue. The simpler test program doesn’t not crash. The client adds in more and more complexity until the program crashes. It turns out that the complexity was hiding some UB and beast is actually fine.

Sep 06 '22 19:09 madmongo1

Prove me wrong.

Sep 06 '22 19:09 madmongo1

So here is the thing. The assert is telling wr lock is not taken during tear down and I have one thread and all my async_close are dispatched. How can with 1 thread there could be multiple locks taken ? And if my code had an issue in such a dispatch which has happened before I didnt need this kind of load to see it. It would happen much faster with very less load and i have fixed them in my code.

Sep 06 '22 19:09 gopalak

This is almost similar to the last assert you fixed in the beast code. With 1 thread for io_ctx and all async alls being dispatched we should never hit this scenario isnt it

Sep 06 '22 19:09 gopalak

Prove me wrong.

I am not trying to prove any thing wrong here just to be clear. I am trying to assess and find a solution. I have scoured my code and it has so many asserts to catch any unexpected and wrong async calls made and none ever get hit

Sep 06 '22 19:09 gopalak

beast beast copied to clipboard

Assert in close.hpp line 244 version 1.75

Discussed in https://github.com/boostorg/beast/discussions/2415

Unable to attach stack as its happening in my production code - What I see in the stack is I see assert in -- 242 teardown: 243 // Teardown 244 BOOST_ASSERT(impl.wr_block.is_locked(this)); 245 using beast::websocket::async_teardown;

beast
beast copied to clipboard