lmdb-js
lmdb-js copied to clipboard
MDB_PROBLEM: Unexpected problem - txn should abort
Hi Kris! Every now and then a user reports this error. Or our CI or our testing itself runs into it:
MDB_PROBLEM: Unexpected problem - txn should abort
User error:
Our CI run (https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/81406/workflows/46bdcdf9-85eb-444a-aab1-b92845954af3/jobs/964466) failed with it intermittently:
MDB_PROBLEM: Unexpected problem - txn should abort
1 | query pageHomecircleciprojectstartersdefaultsrcpagesusingTypescriptTsx2907560070 {
2 | site {
> 3 | buildTime(formatString: "YYYY-MM-DD hh:mm a z")
| ^
4 | }
5 | }
6 |
File path: /home/circleci/project/starters/default/src/pages/using-typescript.tsx
Url path: /using-typescript/
Plugin: none
Error: MDB_PROBLEM: Unexpected problem - txn should abort
- open.js:156
[default]/[lmdb]/open.js:156:21
- write.js:762 LMDBStore.transactionSync
[default]/[lmdb]/write.js:762:17
- open.js:155 new LMDBStore
[default]/[lmdb]/open.js:155:11
- open.js:218 LMDBStore.openDB
[default]/[lmdb]/open.js:218:6
- cache-lmdb.ts:60 GatsbyCacheLmdb.getDb
[default]/[gatsby]/src/utils/cache-lmdb.ts:60:44
- cache-lmdb.ts:69 GatsbyCacheLmdb.get
[default]/[gatsby]/src/utils/cache-lmdb.ts:69:17
- date.ts:234 formatDate
[default]/[gatsby]/src/schema/types/date.ts:234:53
- date.ts:318 Object.resolve [as originalResolver]
[default]/[gatsby]/src/schema/types/date.ts:318:20
- task_queues.js:93 processTicksAndRejections
internal/process/task_queues.js:93:5
- task_queues.js:62 runNextTicks
internal/process/task_queues.js:62:3
- timers.js:434 processImmediate
internal/timers.js:434:9
- async Promise.all
I'm not too well versed with all the moving pieces yet so I'll ask my colleague to maybe shed some more light on this, but wanted to report it nevertheless so that a) I don't forget it and b) we can get behind this :)
Files from the stacktrace:
- https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/cache-lmdb.ts
- https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/page-data.ts#L80
Thanks for your work!
I don't think that I can tell what is occurring from this information alone, but I have added some additional details for the error messages when a MDB_PROBLEM occurs to better determine the cause (it is possible that is related #41, but not sure). I published this additional error information in the v2.3.0-beta3 release if you want to try that out.
Ok, thanks! I'll tell users to use this version until it's in a stable version then (when they can more reliably reproduce this).
Have you ever seen this error again (with the more detailed error message)? Still would be interested in knowing what caused this.
We had to pin lmdb (https://github.com/gatsbyjs/gatsby/pull/35397) so our own tests didn't run this version yet. (cc @pieh)
That issue (#156) has been fixed though, so you can use latest now, right?
We need to update https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/schema/graphql-engine/lmdb-bundling-patch.ts first for the new binary loading -- but that's on us.
Ok, no rush, was just curious.
@LekoArts actually looking at the patch module you are using... would you like me to use more statically analyzable code like fs.readFileSync(require.resolve('../dict/dict.txt'))
for reading the dictionary? Referencing binaries might be a bit better trickier though. But maybe could do something like listing them out:
try {
require.resolve('lmdb-win32-x64/node.napi.node');
require.resolve('lmdb-win32-x64/node.abi93.node');
require.resolve('lmdb-linux-x64/node.napi.node');
....
}catch(e){}
Or do you think that would just cause more errors/confusion with vercel (the require
would be coming from a createRequire
too)?
Hey @kriszyp - about the hacky patch we are using, ideally we don't need patch like that, but I did some attempts at playing with lmdb code before in hope I could submit pull request and frankly didn't exactly like what I was doing there and that's when I went with the "hacky patch" route.
dict.txt
part problem was probably the easier of the 2 (the problem there IIRC was that lmdb package setup is that entry to package could be in different "directory levels" - depending if it's commonjs or ESM):
https://github.com/DoctorEvidence/lmdb-js/blob/4d6371dc6faaa69f2a1a43d980ace0f328c89106/package.json#L22-L32
+
rollup outputting to dist
The binaries loading part also shared that, but I think there were more problems there which I just don't remember (it was a while ago).
Other than above, with [email protected]
, node-gyp-build-optional-packages
was introduced (vs node-gyp-build
) which I suspect won't work with @vercel/webpack-asset-relocator-loader
:
- It looks for specific things (like
require('node-gyp-build')
) - Given https://github.com/kriszyp/node-gyp-build/commit/57be47b88bf4aea751bdbae93f90064eab5d960a they would actually need to run this "modified"
load.path
function from your forked package and not just try regularnode-gyp-build
(at least until https://github.com/prebuild/node-gyp-build/pull/45 land) I think.
For [email protected]
the node-gyp-build-optional-packages
is what's not working anymore with our patch and this is part that we will need to adjust to be able to bump it - as @LekoArts mentioned - that's on us.
In general - non-web-frontend npm packages have varying levels of "support" for being bundled and it's hard to expect them to work out-of-the-box for that, so I just didn't want to waste your time on our niche use case for lmdb
package. It's also a bit fragile in general so we do prefer to pin versions as it is safer for our users (upside is things shouldn't randomly break, downside is we will be slower on updates than using ^
version selector or similar)
Ok, thanks for the explanation! And yeah hopefully the node-gyp-build PR merges soon. And again, no rush, was just checking.
This is the error message I get with the 2.3.0 beta
Error message
There was an error in your GraphQL query:
MDB_PROBLEM: Unexpected problem - txn should abort:
mdb_page_unspill no parent
1 | query DateQuery {
2 | dateTime: contentfulDate(contentful_id: {eq:
"38akBjGb3T1t4AjB87wQjo"}) {
3 | title
4 | date: dateTime
> 5 | formatted: dateTime(formatString: "D.M.YYYY - hh:mm")
| ^
6 | }
7 | dateTimeTimezone: contentfulDate(contentful_id: {eq:
"6dZ8pK4tFWZDZPHgSC0tNS"}) {
8 | title
9 | date: dateTimeTimezone
10 | formatted: dateTimeTimezone(formatString: "D.M.YYYY -
hh:mm (z)")
11 | }
12 | date: contentfulDate(contentful_id: {eq:
"5FuULz0jl0rKoKUKp2rshf"}) {
13 | title
14 | date
15 | formatted: date(formatString: "D.M.YYYY")
File path: /Users/ward/projects/gatsby/gatsby/e2e-tests/contentf
ul/src/pages/date.js
Url path: /date/
Plugin: none
Error: MDB_PROBLEM: Unexpected problem - txn should abort: mdb
_page_unspill no parent
- open.js:177
[contentful]/[lmdb]/open.js:177:17
- write.js:771 LMDBStore.transactionSync
[contentful]/[lmdb]/write.js:771:17
- open.js:176 new LMDBStore
[contentful]/[lmdb]/open.js:176:11
- open.js:240 LMDBStore.openDB
[contentful]/[lmdb]/open.js:240:6
- cache-lmdb.ts:60 GatsbyCacheLmdb.getDb
[contentful]/[gatsby]/src/utils/cache-lmdb.ts:60:44
- cache-lmdb.ts:69 GatsbyCacheLmdb.get
[contentful]/[gatsby]/src/utils/cache-lmdb.ts:69:17
- date.ts:234 formatDate
[contentful]/[gatsby]/src/schema/types/date.ts:234:53
- date.ts:318 resolve
[contentful]/[gatsby]/src/schema/types/date.ts:318:20
- task_queues:96 processTicksAndRejections
node:internal/process/task_queues:96:5
- task_queues:65 runNextTicks
node:internal/process/task_queues:65:3
- timers:437 processImmediate
node:internal/timers:437:9
- async Promise.all
This the source code we're hitting
https://github.com/gatsbyjs/gatsby/blob/9b25267009f318949705e2e1faf7af859b0b668a/packages/gatsby/src/schema/types/date.ts#L224-L259
const formatDate = async ({
date,
fromNow,
difference,
formatString,
locale = `en`,
}: IFormatDateArgs): Promise<string | number> => {
const normalizedDate = JSON.parse(JSON.stringify(date))
if (formatString) {
const cacheKey = `${normalizedDate}-${formatString}-${locale}`
const cachedFormat = await getFormatDateCache().get(cacheKey)
if (cachedFormat) {
return cachedFormat as string
}
const result = moment
.utc(normalizedDate, ISO_8601_FORMAT, true)
.locale(locale)
.format(formatString)
await getFormatDateCache().set(cacheKey, result)
return result
} else if (fromNow) {
return moment
.utc(normalizedDate, ISO_8601_FORMAT, true)
.locale(locale)
.fromNow()
} else if (difference) {
return moment().diff(
moment.utc(normalizedDate, ISO_8601_FORMAT, true).locale(locale),
difference
)
}
return normalizedDate
}
Let me know if this helps or not at all. We're basically checking if the element is in the cache if not we set it
That helps, that definitely is not where I had been expecting an MDB_PROBLEM
to occur, this seems to point to the unspilling operation. A little explanation of this: in LMDB, if a transaction gets too large and has more than about 500MB of "dirty" data (that hasn't been committed to disk), it starts "spilling" or writing it to disk. But if that data is also modified again in the same transaction it is "unspilled" back to memory, and this appears to be where the error is occurring.
It is not very clear how this code would trigger this, like you said, it looks just a basic get/put.
Is this part of a very large number of writes in the same event turn/transaction (writing more than 500MB in the operation)?
Are you using child transactions (childTransaction
) at all?
Of course if there is a way I can run a test locally to reproduce, that would be helpful.
I am investigating if this assertion/error really is necessary and the state is detrimental. It may be possible to simply turn off of this check or make it a warning (which basically allows unspilling even if a page can't be found in the spilled list), but will do some checking of the effects. Of course the potential problem is if this allows for a bad state where invalid data is returned or corruption can take place.
Also FYI, I have not removed these more detailed messages, you can use the latest lmdb version (2.3.5 right now), and the extra error messages should still be there.
I'll see if I can get you a repro. We're not using childTransactions and I don't event think we're using any transactions ourselves.
Well, everything in LMDB is in a transaction, it is just implicit/automatic if there is no explicit transaction, and usually one transaction per event turn, unless there is a lot of heavy writing where more gets batched into the current transaction.
Ok, v2.3.6 should have the assertion switched to just being a warning (with a little more info), so the application can proceed and we can see if it really is a bad state with detrimental outcomes, or can safely be ignored.
gatsby@next
now uses lmdb 2.3.10 so going forward we should get better error logs and will post them here 👍
We're using Gatsby cloud for 5 staging sites and seeing this issue pop up multiple times a day. Per Gatsby support team's suggestion, I updated our gatsby version to 4.13.1 yesterday and 3 of the 5 sites failed with this error on that PR build. Please let me know how we can best support you all in finding a solution!
https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/f3997d6e-da36-4555-8cba-e424b040cbe3/builds/5a3864d0-73d6-4348-9a1e-0e696ef0ed26/details#rawLogs https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/eb1574d3-8957-41cf-bd6b-c567a349ecc7/details#rawLogs https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/0acef4d7-fb7c-4e1d-bee5-649347611ebb/builds/90c8d4a2-2854-4e3d-8def-0da4f3e7ac57/details#rawLogs
@lauraturk FYI, we believe this may be related or the same issue #164, which also has plenty of discussion of our efforts to track this down. Certainly if you are ever able to come up with a reproducible test case that I could run locally, we would be most grateful (I've never been able to reproduce the issue in this ticket, so most of the discussion has been theorizing about ways that it might possibly occur).
Hi @kriszyp , unfortunately, I am not able to create a reproducible test case for you to run locally. It's very random across our 5 gatsby sites. Per Gatsby support I upgraded to gatsby@next today, and only one site failed. Hopefully this log is helpful? https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/cb4bfd92-c8a9-4d73-9ff6-b96b3d94ffe1/details#rawLogs
He can't access your build logs. Could you copy in the relevant lines?
On Fri, May 20, 2022, 3:09 PM Laura Turk @.***> wrote:
Hi @kriszyp https://github.com/kriszyp , unfortunately, I am not able to create a reproducible test case for you to run locally. It's very random across our 5 gatsby sites. Per Gatsby support I upgraded to @.*** today, and only one site failed. Hopefully this log is helpful? https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/cb4bfd92-c8a9-4d73-9ff6-b96b3d94ffe1/details#rawLogs
— Reply to this email directly, view it on GitHub https://github.com/DoctorEvidence/lmdb-js/issues/153#issuecomment-1133424246, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAARLBZIG2JN42MNASQW5B3VLAER5ANCNFSM5SVAVOBQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>
@lauraturk I would still love to see what logs you got for that error, if you can paste them in?
Thanks for your patience, here it is:
16:00:28 PM:
success Building HTML renderer - 28.962s
16:00:28 PM:
success Execute page configs - 0.126s
16:00:32 PM:
success Caching Webpack compilations - 0.001s
16:00:32 PM:
success Validating Rendering Engines - 4.071s
16:00:36 PM:
../dependencies/lmdb/libraries/liblmdb/mdb.c:7602: Assertion 'root > 1' failed in mdb_page_search()
16:00:36 PM:
Go here for troubleshooting tips: https://gatsby.dev/pqr-feedback
16:00:36 PM:
ERROR An error occurred during parallel query running.
16:00:36 PM:
16:00:36 PM:
[www]/[gatsby-worker]/dist/index.js:117:45
16:00:36 PM:
- index.js:117 ChildProcess.
16:00:36 PM:
16:00:36 PM:
Error: Worker exited before finishing task
16:00:36 PM:
- node:events:390 ChildProcess.emit
16:00:36 PM:
- child_process:290 Process.ChildProcess._handle.onexit
16:00:36 PM:
node:events:390:28
16:00:36 PM:
16:00:36 PM:
16:00:36 PM:
node:internal/child_process:290:12
16:00:36 PM:
16:00:36 PM:
16:00:36 PM:
not finished run queries in workers - 3.943s
Thanks for the logs, this is helpful. This is a db corruption error, which is basically impossible to occur in typical usage. There are only few things that I know that can lead to this:
- Physical damage or manipulation of hard drive.
- Copying a db file while it is use and using the copy.
- Use noSync flag and a computer crash (I don't think gatsby doesn't do this)
It is clear from these errors that there are multiple child processes (rather than just threads). This is fine to do with LMDB, but is a helpful insight. From this comment https://github.com/DoctorEvidence/lmdb-js/issues/164#issuecomment-1135060811 it sounds like that error might be specific cloud configuration where this occurs (GCP Kubernetes). I wonder if perhaps this configuration uses a remote/network drive/storage that isn't playing nicely memory maps (there are known problems with some remote file systems with LMDB )? I don't know enough about the gatsby infrastructure to know if this is the same platform where the error in this last message occurred?
I discovered and fixed an issue with file locking (causing locks to break), that should be fixed v2.4.5, and might possibly address this issue and #164.
The CircleCI tests in https://github.com/gatsbyjs/gatsby/pull/35724 seem to consistently fail on MDB_BAD_RSLOT: Invalid reuse of reader locktable slot: The reader lock pid 0, txn -1, doesn't match env pid 523
- at least when I re-trigger runs it fails very often: https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/83182/workflows/29ded013-2e0a-4cb8-9837-b0d87062049c/jobs/990452
I can SSH into that pod and get you any info you need :)
Sorry, this is kind of trial and error. I believe this latest error is probably because the fix to better ensure database file identification was incorrectly handling the case of initial creation of the database (was doing the stat check before it was created). Published a fix for this in v2.5.1, if you can try it out when you get a chance.
I was OOO for a week, but I'll update the PR to 2.5.2 now and let you know how it goes :)
BTW, I would also be interested in adding an integration test of gatsby using lmdb-js. If you have any pointers/suggestion of a good test that would efficiently test gatsby's usage of lmdb, would love to add that.