node-lmdb

basic operations fail under WSL (windows subsystem for linux)

Open warner opened this issue 4 years ago • 11 comments

We experienced a persistent failure when running node-lmdb under WSL: the "Windows Subsystem for Linux" thing that gives you a linux-ish kernel and an Ubuntu (or other) userspace. A basic operation, just open a DB and then write out a key, would fail with a funny error like this:

Error: MDB_BAD_TXN: Transaction must abort, has a child, or is invalid

My coworker @FUDCo tracked it down to LMDB expecting certain data-synchronization behavior between mmap-based access and normal write()-based access to the same fd, which appears to behave differently under WSL's linux-ish kernel than on the working systems (it worked fine on macOS and real linux). He also found a one-line fix in the node-lmdb configuration that tells LMDB not to rely upon that behavior: add useWritemap: true in the open() options:

lmdbEnv.open({
  path: dirPath,
  mapSize: 2 * 1024 * 1024 * 1024,
  useWritemap: true, // fixes crash
});
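
For anyone who wants the option applied only where it seems to be needed, here is a minimal sketch. It assumes that WSL kernels report a release string containing "microsoft" (for example "4.4.0-19041-Microsoft"); that is a heuristic, not an official API. The helper names `looksLikeWsl` and `withWslWorkaround` and the `./db` path are hypothetical and are not part of node-lmdb.

```javascript
const os = require("os");

// Heuristic (an assumption, not an official API): WSL kernels report a
// release string containing "microsoft", e.g. "4.4.0-19041-Microsoft".
function looksLikeWsl(release = os.release()) {
  return /microsoft/i.test(release);
}

// Return a copy of the caller's open() options, adding useWritemap: true
// only when the kernel release string suggests WSL.
function withWslWorkaround(options, release = os.release()) {
  return looksLikeWsl(release)
    ? { ...options, useWritemap: true }
    : { ...options };
}

const openOptions = withWslWorkaround({
  path: "./db", // hypothetical directory
  mapSize: 2 * 1024 * 1024 * 1024,
});
// Pass the result to the existing call: lmdbEnv.open(openOptions);
```

The check matches any kernel whose release string mentions Microsoft, so it cannot tell different WSL versions apart; passing the release string as a parameter keeps the helper easy to exercise with sample values.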

I don't know if there's anything special that node-lmdb (or LMDB itself) should do to address this, but I figured I'd mention it here in case anyone else runs into this problem in the future. Chip's full writeup is at https://github.com/Agoric/agoric-sdk/issues/950#issuecomment-618197120

(edited to correct description of data-sync issue)

warner • Apr 23 '20 20:04

Thank you for taking the time for this writeup. I don't personally use either Windows or WSL these days, so I can't offer advice, other than a suggestion to report this problem to the upstream LMDB developers, such as @hyc, or to the WSL developers.

I don't believe you should need the useWritemap option just to work around an operating system defect.

Venemo • Apr 23 '20 20:04

The LMDB design document specifies that it requires a cache-coherent filesystem. Using writemap is the usual workaround though; it's also needed on OpenBSD for the same reason.

hyc • Apr 23 '20 20:04

Is there any non-horrible way to test this at startup, and maybe print a warning message if it looks like we're on a non-coherent platform? Chip said one side did a write(), and the other read from memory, and got zeros instead of the data that was written. Is there an easy way to read from memory just after doing the write (and sync) and see if the results make sense? Might help guide folks towards the workaround more quickly.

(I suppose any such diagnostic might want to go into LMDB proper, rather than the node bindings, but it'd be more actionable if the message can print the exact option you can use to work around the platform limitation, and that option might be spelled/formatted differently depending upon the binding you're using. So maybe putting it in node-lmdb wouldn't be the worst choice)
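
As a rough sketch of the message side of that idea only: the function below turns the result of some coherence probe into a warning that names the option as each binding spells it. The probe itself is not shown, since Node core has no mmap API and it would have to live in native code. The names `WRITEMAP_HINTS` and `coherenceWarning` and the wording of the message are hypothetical; the option spellings are node-lmdb's `useWritemap`, LMDB's `MDB_WRITEMAP` flag, and py-lmdb's `writemap` argument.

```javascript
// How LMDB's writemap option is spelled in a few bindings.
const WRITEMAP_HINTS = {
  "node-lmdb": "useWritemap: true in the env.open() options",
  "c": "the MDB_WRITEMAP flag to mdb_env_open()",
  "py-lmdb": "writemap=True in lmdb.open()",
};

// probePassed: result of a (hypothetical) startup check that data written
// with write() is visible through the memory map. Returns null when the
// platform looks coherent, otherwise a warning naming the workaround.
function coherenceWarning(probePassed, binding = "node-lmdb") {
  if (probePassed) return null;
  const hint = WRITEMAP_HINTS[binding] || "the binding's writemap option";
  return `LMDB: data written with write() was not visible through mmap; try ${hint}.`;
}

// Example: a failed probe under node-lmdb.
const message = coherenceWarning(false);
if (message) console.warn(message);
```

Keeping the per-binding text in one table is what lets the same check print an actionable hint regardless of which binding surfaces it.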

warner • Apr 23 '20 21:04

If you do come up with a reliable way to test this, or query it from the system, then I'd be happy to include that in node-lmdb. I still believe that this should be reported as a bug to the WSL team, though.

Venemo • Apr 24 '20 07:04

Agreed, this is a WSL bug. They've known about it for a long time already: https://twitter.com/hyc_symas/status/1013800256088215552

It was working before Windows 10 update 1803, as documented here: https://github.com/nimiq/core-js/issues/387

hyc • Apr 24 '20 12:04

It'd be super-helpful if you could please file an issue in the WSL repo where the WSL devs are most active and where issues in WSL are triaged and tracked. Please include the simplest possible repro case, and provide as much info as possible re. OS & tools versions you're running, etc.

Many thanks.

https://github.com/microsoft/wsl

bitcrazed • Apr 24 '20 18:04

I am using WSL 2 with node-lmdb & am not seeing any issues. It appears this may now be fixed.

kylebernhardy • Jan 04 '21 21:01

Currently running into this issue using WSL1 on W10, has anyone figured out a fix?

raaiqwilliams • May 05 '22 07:05

There is no fix for WSL1 other than to switch to WSL2.

hyc • May 05 '22 12:05

Yeah, WSL1 has devastating and unpredictable bugs that can't be worked around. I was just running LMDB on WSL 2 this morning, and it works great.

kriszyp • May 05 '22 13:05

Ah that's unfortunate, very much appreciate the responses though thank you!

raaiqwilliams • May 06 '22 08:05