ceph-dokan crashed when mds are overloaded

Open shenyan1 opened this issue 9 years ago • 1 comments

ceph-dokan crashes when the MDS servers are overloaded. Our test setup: 10 OSD servers, 3 MDS, and 14 ceph-dokan clients (with 2, 4, and 16 concurrent I/Os per client).

The following is the WinDbg output:

  20  Id: b2cc.116c0 Suspend: 0 Teb: 7ef6a000 Unfrozen
ChildEBP RetAddr  Args to Child
04ef14ec 75500bdd 00000002 04ef153c 00000001 ntdll_773b0000!ZwWaitForMultipleObjects+0x15
04ef1588 74fb1a2c 04ef153c 04ef15b0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100
04ef15d0 74fb4208 00000002 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0
04ef15ec 74fd80a4 00000002 04ef1620 00000000 kernel32!WaitForMultipleObjects+0x18
04ef1658 74fd7f63 04ef1738 00000001 00000001 kernel32!WerpReportFaultInternal+0x186
04ef166c 74fd7858 04ef1738 00000001 04ef1708 kernel32!WerpReportFault+0x70
04ef167c 74fd77d7 04ef1738 00000001 b921b0ed kernel32!BasepReportFault+0x20
04ef1708 774274df 00000000 774273bc 00000000 kernel32!UnhandledExceptionFilter+0x1af
04ef1710 774273bc 00000000 04efffd4 773dc530 ntdll_773b0000!__RtlUserThreadStart+0x62
04ef1724 77427261 00000000 00000000 00000000 ntdll_773b0000!_EH4_CallFilterFunc+0x12
04ef174c 7740b459 fffffffe 04efffc4 04ef1888 ntdll_773b0000!_except_handler4+0x8e
04ef1770 7740b42b 04ef1838 04efffc4 04ef1888 ntdll_773b0000!ExecuteHandler2+0x26
04ef1794 7740b3ce 04ef1838 04efffc4 04ef1888 ntdll_773b0000!ExecuteHandler+0x24
04ef1820 773c0133 01ef1838 04ef1888 04ef1838 ntdll_773b0000!RtlDispatchException+0x127
04ef182c 04ef1838 04ef1888 c0000005 00000000 ntdll_773b0000!KiUserExceptionDispatcher+0xf
WARNING: Frame IP not in any known module. Following frames may be wrong.
04ef1b98 64fecc6e 00000000 0064003b 04ef1bc8 0x4ef1838
04ef1bc8 64fcf81c 00000000 00000000 00000098 libcephfs!ZN5Inode11caps_issuedEPi+0x4a
04ef1ca8 64fded05 0064001f 04ef1dd8 00000000 libcephfs!ZN6Client9fill_statEP5InodeP9stat_cephP11frag_info_tP11nest_info_t+0x50a
04ef1d88 64e82858 00000b4b 04ef1dd8 04efff70 libcephfs!ZN6Client5fstatEiP9stat_ceph+0x161
04ef1da8 004049d1 00a0a8a8 00000b4b 04ef1dd8 libcephfs!ceph_fstat+0x38

The bug is quite tricky. The MDS is overloaded, so the ceph_fstat request does not return. ceph-dokan hangs in fstat->getattr->make_request for 300 seconds. However, ceph-dokan sets a timeout with DokanResetTimeout(CEPH_DOKAN_IO_TIMEOUT, DokanFileInfo), and CEPH_DOKAN_IO_TIMEOUT is 120 seconds. When that timeout expires, the Dokan driver unexpectedly cleans up the inode while the request is still outstanding.

Client                              MDS
getattr                             ...
make_request                        ...
wait                                mds receive request
....                                BUSY
timeout, cleanup and release_fh     BUSY
....                                send message back.
mds return
access invalid fh and crash         ...
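
The condition behind the timeline above can be written down as a tiny model. This is only a sketch for reasoning about the race, not ceph-dokan code: the struct and function names are hypothetical, and times are in seconds to match the figures quoted in the report.

```c
#include <stdbool.h>

/* Hypothetical model of the timeline above; not taken from ceph-dokan. */
typedef struct {
    unsigned dokan_timeout_s; /* timeout requested via DokanResetTimeout, in seconds here */
    unsigned mds_reply_s;     /* time until the MDS reply (or the libcephfs wait) completes */
} request_timing;

/* True when the Dokan driver times out, cleans up and releases the file
 * handle before the MDS reply arrives, i.e. the reply path would then
 * touch an invalid handle. */
static bool handle_released_before_reply(request_timing t)
{
    return t.mds_reply_s > t.dokan_timeout_s;
}
```

With the reported values (a 120 second Dokan timeout and a 300 second wait in make_request) the function returns true, which matches the crash path in the diagram.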

So, ketor, can we cancel DokanResetTimeout to fix this bug?

shenyan1, Jul 07 '15 07:07

Hi @shenyan1, Dokan's architecture imposes this limit. If we do not call DokanResetTimeout explicitly, Dokan applies a default timeout of 5s.

So I think you could set the timeout to 300s or more to make the bug less likely.
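
One way to pick such a value is to derive it from the longest expected MDS wait plus a margin, rather than hard-coding it. The sketch below is an illustration only: the helper name and the margin are hypothetical, and it assumes the first argument of DokanResetTimeout is expressed in milliseconds.

```c
/* Assumption: a stalled MDS request can block for about 300 s,
 * the figure reported in this issue. */
#define MDS_REQUEST_WAIT_S 300u

/* Hypothetical safety margin so the Dokan timeout outlasts the MDS wait. */
#define TIMEOUT_MARGIN_S 60u

/* Converts the MDS wait plus margin into milliseconds (assumed to be the
 * unit DokanResetTimeout expects for its timeout argument). */
static unsigned long dokan_io_timeout_ms(unsigned mds_wait_s, unsigned margin_s)
{
    return (unsigned long)(mds_wait_s + margin_s) * 1000ul;
}
```

For example, dokan_io_timeout_ms(MDS_REQUEST_WAIT_S, TIMEOUT_MARGIN_S) could replace the current CEPH_DOKAN_IO_TIMEOUT value. This only narrows the window: if the MDS stalls for longer than the chosen timeout, the same race can still occur.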

ketor, Jul 20 '15 09:07