ceph-dokan
ceph-dokan crashes when the MDSs are overloaded
ceph-dokan crashes when the MDSs are overloaded. Our test setup is: 10 OSD servers, 3 MDSs, and 14 ceph-dokan clients (2, 4, or 16 concurrent I/Os per client).
The following is the WinDbg output:
20 Id: b2cc.116c0 Suspend: 0 Teb: 7ef6a000 Unfrozen
ChildEBP RetAddr Args to Child
04ef14ec 75500bdd 00000002 04ef153c 00000001 ntdll_773b0000!ZwWaitForMultipleObjects+0x15
04ef1588 74fb1a2c 04ef153c 04ef15b0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100
04ef15d0 74fb4208 00000002 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0
04ef15ec 74fd80a4 00000002 04ef1620 00000000 kernel32!WaitForMultipleObjects+0x18
04ef1658 74fd7f63 04ef1738 00000001 00000001 kernel32!WerpReportFaultInternal+0x186
04ef166c 74fd7858 04ef1738 00000001 04ef1708 kernel32!WerpReportFault+0x70
04ef167c 74fd77d7 04ef1738 00000001 b921b0ed kernel32!BasepReportFault+0x20
04ef1708 774274df 00000000 774273bc 00000000 kernel32!UnhandledExceptionFilter+0x1af
04ef1710 774273bc 00000000 04efffd4 773dc530 ntdll_773b0000!__RtlUserThreadStart+0x62
04ef1724 77427261 00000000 00000000 00000000 ntdll_773b0000!_EH4_CallFilterFunc+0x12
04ef174c 7740b459 fffffffe 04efffc4 04ef1888 ntdll_773b0000!_except_handler4+0x8e
04ef1770 7740b42b 04ef1838 04efffc4 04ef1888 ntdll_773b0000!ExecuteHandler2+0x26
04ef1794 7740b3ce 04ef1838 04efffc4 04ef1888 ntdll_773b0000!ExecuteHandler+0x24
04ef1820 773c0133 01ef1838 04ef1888 04ef1838 ntdll_773b0000!RtlDispatchException+0x127
04ef182c 04ef1838 04ef1888 c0000005 00000000 ntdll_773b0000!KiUserExceptionDispatcher+0xf
WARNING: Frame IP not in any known module. Following frames may be wrong.
04ef1b98 64fecc6e 00000000 0064003b 04ef1bc8 0x4ef1838
04ef1bc8 64fcf81c 00000000 00000000 00000098 libcephfs!ZN5Inode11caps_issuedEPi+0x4a
04ef1ca8 64fded05 0064001f 04ef1dd8 00000000 libcephfs!ZN6Client9fill_statEP5InodeP9stat_cephP11frag_info_tP11nest_info_t+0x50a
04ef1d88 64e82858 00000b4b 04ef1dd8 04efff70 libcephfs!ZN6Client5fstatEiP9stat_ceph+0x161
04ef1da8 004049d1 00a0a8a8 00000b4b 04ef1dd8 libcephfs!ceph_fstat+0x38
The bug is quite tricky. The MDS is overloaded, so the ceph_fstat request does not return. ceph-dokan hangs in fstat->getattr->make_request for 300 sec. But ceph-dokan has set a timeout with DokanResetTimeout(CEPH_DOKAN_IO_TIMEOUT, DokanFileInfo), and CEPH_DOKAN_IO_TIMEOUT is 120 sec. So the Dokan driver cleans up the inode unexpectedly while the request is still pending, and the late reply then touches freed state.
| Client | MDS |
|---|---|
| getattr | ... |
| make_request | ... |
| wait | MDS receives request |
| .... | BUSY |
| timeout, cleanup and release_fh | BUSY |
| .... | sends message back |
| MDS reply arrives | |
| access invalid fh and crash | ... |
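
The last two rows of the timeline are a use-after-release: the reply path still holds a reference to state that the timeout path has already freed. As a rough illustration only (this is not ceph-dokan or libcephfs code, and `FileHandle`, `HandleSlot`, and `apply_reply` are hypothetical names), the sketch below shows one way a late reply could detect that its handle is gone instead of dereferencing it, using `std::weak_ptr`:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for per-open-file state (not the real libcephfs Fh/Inode).
struct FileHandle {
    int fd;
    long long size;
};

// Hypothetical table slot that owns the handle. release() models the cleanup
// that runs when the Dokan timeout fires (the "release_fh" row above).
struct HandleSlot {
    std::shared_ptr<FileHandle> handle;
    void release() { handle.reset(); }
};

enum class ReplyResult { Applied, HandleGone };

// Models the late MDS reply. The pending request keeps only a weak reference,
// so it can check whether the handle still exists before touching it.
ReplyResult apply_reply(const std::weak_ptr<FileHandle>& pending, long long new_size) {
    if (std::shared_ptr<FileHandle> fh = pending.lock()) {
        fh->size = new_size;          // handle still alive: safe to update
        return ReplyResult::Applied;
    }
    return ReplyResult::HandleGone;   // handle was released: drop the reply
}
```

With this shape, a reply that arrives after `release()` returns `HandleGone` rather than crashing. Whether the real client can be restructured this way is an open question; the sketch only makes the ordering problem concrete.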
So, ketor, can we remove the DokanResetTimeout call to fix this bug?
Hi @shenyan1, Dokan's architecture imposes this limit. If we do not call DokanResetTimeout explicitly, Dokan applies a default timeout of 5 s.
So I think you could set the timeout to 300 s or more to make the bug less likely.
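
Raising the constant only narrows the window. Another option, sketched below under assumptions, is to keep extending the timeout for as long as the blocking call is still pending. `call_with_keepalive` and its `extend_timeout` hook are hypothetical. In ceph-dokan the hook would presumably wrap `DokanResetTimeout(...)`, but whether Dokan tolerates repeated resets from this context has not been verified here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Runs `blocking_call` on a worker thread and invokes `extend_timeout` on the
// calling thread once per `interval` until the call finishes, then returns
// the call's result. Both callbacks are supplied by the caller.
template <typename Result>
Result call_with_keepalive(std::function<Result()> blocking_call,
                           std::function<void()> extend_timeout,
                           std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    std::future<Result> pending =
        std::async(std::launch::async, std::move(blocking_call));
    // wait_for returns `timeout` each time the interval elapses without a
    // result, which is the signal to ask for more time.
    while (pending.wait_for(interval) != std::future_status::ready) {
        extend_timeout();
    }
    return pending.get();
}
```

A caller would pass the slow operation (for example a lambda around `ceph_fstat`) as `blocking_call` and choose an `interval` comfortably shorter than the Dokan timeout, so that each reset lands before the previous one expires.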