Hang and crash when disconnecting on Windows
Hi.
I'm using libsmb2 on a Windows 11 laptop to connect to a Samba drive on a Linux embedded device. The connection happens through a USB3 cable, where the embedded device is detected as a network card on the laptop using RNDIS.
It has been working well so far, with the exception that if I unplug the USB cable during a transfer, then I have a crash when I delete the context later with smb2_destroy_context.
Now I tried to update the lib to the latest commit on master, as my prior version of libsmb2 was from June 2022.
Hang forever
But now with the latest new version, I get an infinite hang if I unplug the cable during a read operation using the sync API:
- It stays stuck forever in the
wait_for_replyfunction. - Adding a timeout does not change anything, because
fdis never invalidated to -1 and the function does not return. - In
smb2_service_fd, the polling raises flagsPOLLERRandPOLLHUP. ThePOLLHUPblock is never reached due to thegotoinPOLLERRblock. However, in both cases I don't see anything that could invalidate thefdthere.
Fix
EDIT: After checking the diff since June 2022, I see that smb2_service and smb2_service_fd now return a t_socket type instead of an int. But on Windows, this type is defined as unsigned, so returning -1 and checking for < 0 does not work, causing the function to not return and thus the infinite hang.
The faulty commit seems to be 4b1ef1f6bd09f96a9f0b78accfeac6e2c2766f34.
Crash
EDIT2: With the latest version and the hang fixed, if I unplug the USB cable while it's transferring a file, and then I close the SMB client and call smb2_destroy_context, then I get something looking like memory corruption, causing fully random crashes in the rest of the application. Commenting the line smb2_destroy_context in the destructor of my client fixes these crashes.
Any idea what could go wrong in that case ? This seems very difficult to find and fix (on Windows, without valgrind); I'll try to replace the calloc and free functions and put prints, but the nature of the code with multiple callbacks and generic void* types makes it extremely difficult to debug.
Fix
EDIT3: After two full days of analysis, the commit 0dc1cc795512c808b52ecc83f7cd3e530936655d which creates a special case for the close operation is the cause of my crash.
In my workflow, I read a file and then close it using smb2_close() (sync). However, if the USB network interface is removed during the transfer, the read operation fails, but the smb2_close() is still called to simplify the logic, expecting it to fail as well. In this smb2_close(), the wait_for_reply() call fails with code -1 due to an expected error in smb2_service(), also returning -1. However in that case, the callback remains set somewhere in a PDU that was queued. So calling the free() at the end of smb2_close() frees this memory that is still referenced in a PDU callback.
Then during the call to smb2_destroy_context when destroying the client, the PDUs in the queues are cancelled and all callback are called. But as the one for the close operation was already freed, it crashes !
What could be done to fix these issues? Reverting these 2 commits is maybe not the best fix, as one of them was added to fix a memory leak apparently. In my case I'll fix locally already, because a leak is better than a crash.
Hi.
I'm using
libsmb2on a Windows 11 laptop to connect to a Samba drive on a Linux embedded device. The connection happens through a USB3 cable, where the embedded device is detected as a network card on the laptop using RNDIS.It has been working well so far, with the exception that if I unplug the USB cable during a transfer, then I have a crash when I delete the context later with
smb2_destroy_context.Now I tried to update the lib to the latest commit on
master, as my prior version oflibsmb2was from June 2022.Hang forever
But now with the latest new version, I get an infinite hang if I unplug the cable during a read operation using the sync API:
- It stays stuck forever in the
wait_for_replyfunction.- Adding a timeout does not change anything, because
fdis never invalidated to -1 and the function does not return.- In
smb2_service_fd, the polling raises flagsPOLLERRandPOLLHUP. ThePOLLHUPblock is never reached due to thegotoinPOLLERRblock. However, in both cases I don't see anything that could invalidate thefdthere.Fix
EDIT: After checking the diff since June 2022, I see that
smb2_serviceandsmb2_service_fdnow return at_sockettype instead of anint. But on Windows, this type is defined as unsigned, so returning-1and checking for< 0does not work, causing the function to not return and thus the infinite hang. The faulty commit seems to be4b1ef1f6bd09f96a9f0b78accfeac6e2c2766f34.Crash
EDIT2: With the latest version and the hang fixed, if I unplug the USB cable while it's transferring a file, and then I close the SMB client and call
smb2_destroy_context, then I get something looking like memory corruption, causing fully random crashes in the rest of the application. Commenting the linesmb2_destroy_contextin the destructor of my client fixes these crashes. Any idea what could go wrong in that case ? This seems very difficult to find and fix (on Windows, without valgrind); I'll try to replace thecallocandfreefunctions and put prints, but the nature of the code with multiple callbacks and genericvoid*types makes it extremely difficult to debug.Fix
EDIT3: After two full days of analysis, the commit
0dc1cc795512c808b52ecc83f7cd3e530936655dwhich creates a special case for the close operation is the cause of my crash. In my workflow, I read a file and then close it usingsmb2_close()(sync). However, if the USB network interface is removed during the transfer, the read operation fails, but thesmb2_close()is still called to simplify the logic, expecting it to fail as well. In thissmb2_close(), thewait_for_reply()call fails with code -1 due to an expected error insmb2_service(), also returning -1. However in that case, the callback remains set somewhere in a PDU that was queued. So calling thefree()at the end ofsmb2_close()frees this memory that is still referenced in a PDU callback. Then during the call tosmb2_destroy_contextwhen destroying the client, the PDUs in the queues are cancelled and all callback are called. But as the one for the close operation was already freed, it crashes !What could be done to fix these issues? Reverting these 2 commits is maybe not the best fix, as one of them was added to fix a memory leak apparently. In my case I'll fix locally already, because a leak is better than a crash.
Hey there, any news? Because first the commit that i replaced the t_socket was a attenpt to clear a warning also reverting one of them maybe in fact should be the best choice.