large data sets: 1.x has issues that 0.11.1, 0.12.1 do not
Describe the bug
I have a semi-open site generator project that squishes gigabytes of data sources down to about 6.9k pages to be processed by eleventy in two ways:
- the legacy markdown repo
- the more up-to-date json data source (pagination ftw!)
0.11.1 handles the 6.9k documents & 15.6mb json file without issue, but 1.0.0 falls over in a similar fashion to that described in #695.
To Reproduce
The site generator is semi-open in that the source is available at https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive, but it's not feasible to stash 2.7gb+ of source data into the repo, so the repro steps aren't readily reproducible by anyone who doesn't have the data set.
While the method mentioned in #695 of specifying --max-old-space-size does move the goalposts somewhat, it still falls over with 8gb assigned.
Steps to reproduce the behaviour:
npm run build or ./node_modules/.bin/eleventy --config=./.eleventy.pages.js - watch & wait
Expected behaviour
1.x to handle 6.9k markdown documents or 6.9k json data file entries as reliably as 0.11.x does
Screenshots
<--- Last few GCs --->
[16544:000001D55D359230] 168029 ms: Mark-sweep 4038.5 (4130.3) -> 4024.3 (4130.3) MB, 3347.8 / 0.0 ms (average mu = 0.138, current mu = 0.017) task scavenge might not succeed
[16544:000001D55D359230] 171431 ms: Mark-sweep 4039.7 (4131.5) -> 4025.8 (4132.0) MB, 3345.7 / 0.0 ms (average mu = 0.081, current mu = 0.016) task scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
3: 00007FF7063AE5ED node::OnFatalError+301
4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
5: 00007FF706D8B2FD v8::Isolate::Exit+653
6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
7: 00007FF706C3AC57 v8::internal::Heap::PublishPendingAllocations+1159
8: 00007FF706C37C3A v8::internal::Heap::PageFlagsAreConsistent+2874
9: 00007FF706C2B919 v8::internal::Heap::CollectGarbage+2153
10: 00007FF706BDC315 v8::internal::IndexGenerator::~IndexGenerator+22133
11: 00007FF70633F0AF X509_STORE_CTX_get_lookup_certs+4847
12: 00007FF70633DA46 v8::CFunctionInfo::HasOptions+16150
13: 00007FF70647C27B uv_async_send+331
14: 00007FF70647BA0C uv_loop_init+1292
15: 00007FF70647BBAA uv_run+202
16: 00007FF70644ABD5 node::SpinEventLoop+309
17: 00007FF706365BC3 v8::internal::UnoptimizedCompilationInfo::feedback_vector_spec+52419
18: 00007FF7063E3598 node::Start+232
19: 00007FF70620F88C CRYPTO_memcmp+342300
20: 00007FF707322AC8 v8::internal::compiler::RepresentationChanger::Uint32OverflowOperatorFor+14488
21: 00007FFB71217034 BaseThreadInitThunk+20
22: 00007FFB71402651 RtlUserThreadStart+33
Environment:
- OS and Version: Win 10, running tool via git bash
- Eleventy Version: 1.0.0
Additional context
- 6.9k markdown files (it still falls over due to the json file even if this path is excluded)
- 6.9k objects in a pretty-printed json file, totalling 15.6mb spread over 280k lines
Wow, that definitely wins for one of the larger sites/datasets I've seen in Eleventy!
You mentioned v0.11.1 and v1.0.0, but have you tried in v0.12.1 (which seems to be ~5 months newer than 0.11.x)? I'm curious if we can determine roughly where this may have changed/broke without having access to the ~2.7 GB of required data files.
npm info @11ty/eleventy time --json | grep -Ev "(canary|beta)" | tail -5
"0.11.1": "2020-10-22T18:40:22.846Z",
"0.12.0": "2021-03-19T19:24:27.860Z",
"0.12.1": "2021-03-19T19:55:13.306Z",
"1.0.0": "2022-01-08T20:27:32.789Z",
@pdehaan trying that now 👍
p.s. the data isn't exactly confidential, it's just more of a "I don't wanna have to spam up the git repo" thing :P
Oh, no worries. I totally don't want to download 2.7 GB of data unless… nope, I just really don't want to download roughly 1989 floppy disks' worth of data.
Although now I kind of want to add a "kb_to_floppy_disk" custom filter in Eleventy and represent all file sizes in relation to how many 3.5" floppy disks would be needed.
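If anyone actually wants that, here's a tongue-in-cheek sketch, assuming sizes are passed in as kilobytes and a 1440 KB 3.5" HD disk (the filter name and config wiring are hypothetical):

// .eleventy.js - hypothetical kb_to_floppy_disk filter (sketch only)
module.exports = function (eleventyConfig) {
  eleventyConfig.addFilter("kb_to_floppy_disk", (kb) => {
    // A 3.5" HD floppy holds roughly 1440 KB.
    const disks = Math.ceil(kb / 1440);
    return `${disks.toLocaleString()} x 3.5" floppy disk${disks === 1 ? "" : "s"}`;
  });
};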
It's the subtitles and video pages for about 5.9k youtube videos. (not sure how I've got 1k more transcriptions than I have clips 🤷♂️)
You mentioned v0.11.1 and v1.0.0, but have you tried in v0.12.1
that completes as expected, although I haven't diffed the output to see if there are any changes/bugs etc.
So, I think you're saying:
✔️ 0.11.1
✔️ 0.12.1
❌ 1.0.0
❌ 1.0.1-canary.3
Doubting this has already been fixed in 1.0.1-canary builds, but if you were looking to try the sharpest of cutting edge builds, you could try npm i @11ty/eleventy@canary. 🔪
npm info @11ty/eleventy dist-tags --json
{
"latest": "1.0.0",
"beta": "1.0.0-beta.10",
"canary": "1.0.1-canary.3"
}
<--- Last few GCs --->
[16756:0000024C7C0F9B80] 144598 ms: Mark-sweep (reduce) 4067.7 (4143.4) -> 4067.1 (4143.9) MB, 7589.3 / 0.0 ms (average mu = 0.141, current mu = 0.001) allocation failure scavenge might not succeed
[16756:0000024C7C0F9B80] 151443 ms: Mark-sweep (reduce) 4068.3 (4144.1) -> 4067.8 (4144.6) MB, 6831.9 / 0.1 ms (average mu = 0.080, current mu = 0.002) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
3: 00007FF7063AE5ED node::OnFatalError+301
4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
5: 00007FF706D8B2FD v8::Isolate::Exit+653
6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
7: 00007FF706C2C151 v8::internal::Heap::CollectGarbage+4257
8: 00007FF706C29AC0 v8::internal::Heap::AllocateExternalBackingStore+1904
9: 00007FF706C464E0 v8::internal::FreeListManyCached::Reset+1408
10: 00007FF706C46B95 v8::internal::Factory::AllocateRaw+37
11: 00007FF706C5AB7A v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller+90
12: 00007FF706C5AE63 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithMap+35
13: 00007FF706A689A6 v8::internal::HashTable<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::EnsureCapacity<v8::internal::Isolate>+246
14: 00007FF706A6E88E v8::internal::BaseNameDictionary<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::Add+110
15: 00007FF70697AE68 v8::internal::Runtime::GetObjectProperty+1624
16: 00007FF706E33281 v8::internal::SetupIsolateDelegate::SetupHeap+513585
17: 0000024C0028643A
$ ./node_modules/.bin/eleventy --version
1.0.1-canary.3
I totally don't want to download 2.7 GB of data unless…
@pdehaan the problematic json source is only 2.7mb gzipped (in case one wanted to produce a bare-minimum reproducible case), although I suspect one could bulk-generate random test data for an array of objects with this structure & it'd do the trick (a rough generator sketch follows the example below):
{
"id": "yt-0pKBBrBp9tM",
"url": "https:\/\/youtu.be\/0pKBBrBp9tM",
"date": "2022-02-15",
"dateTitle": "February 15th, 2022 Livestream",
"title": "State of Dave",
"description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
"topics": [
"PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"
],
"other_parts": false,
"is_replaced": false,
"is_duplicate": false,
"has_duplicates": false,
"seealsos": false,
"transcript": [
/*
this is an array of strings that could technically be structured objects, but are generally just strings ranging
from single words up to full paragraphs, with this example having about 5-7kb of strings in total
*/
],
"like_count": 7,
"video_object": {
"@context": "https:\/\/schema.org",
"@type": "VideoObject",
"name": "State of Dave",
"description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
"thumbnailUrl": "https:\/\/img.youtube.com\/vi\/BBrBp9tM\/hqdefault.jpg",
"contentUrl": "https:\/\/youtu.be\/0pKBBrBp9tM",
"url": [
"https:\/\/youtu.be\/0pKBBrBp9tM",
"https:\/\/archive.satisfactory.video\/transcriptions\/yt-0pKBBrBp9tM\/"
],
"uploadDate": "2022-02-15"
}
}
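A rough sketch of bulk-generating that shape (a hypothetical generate-test-data.js; field names are copied from the example above, string contents are just random-ish filler):

// generate-test-data.js - hypothetical bulk generator for a test.json in the shape above (sketch only)
const { writeFileSync } = require("fs");

const count = parseInt(process.argv[2] || "6900", 10);

// A handful of kb of filler transcript strings per entry, to stay in the same ballpark as the real data.
const filler = () =>
  Array.from({ length: 40 }, (_, i) => `filler transcript line ${i} `.repeat(10).trim());

const entries = Array.from({ length: count }, (_, i) => {
  const id = `yt-${i.toString(36).padStart(11, "0")}`;
  return {
    id,
    url: `https://youtu.be/${id}`,
    date: "2022-02-15",
    dateTitle: "February 15th, 2022 Livestream",
    title: `Generated entry ${i}`,
    description: "generated description",
    topics: ["PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"],
    other_parts: false,
    is_replaced: false,
    is_duplicate: false,
    has_duplicates: false,
    seealsos: false,
    transcript: filler(),
    like_count: i % 100,
    video_object: {
      "@context": "https://schema.org",
      "@type": "VideoObject",
      name: `Generated entry ${i}`,
      description: "generated description",
      thumbnailUrl: `https://img.youtube.com/vi/${id}/hqdefault.jpg`,
      contentUrl: `https://youtu.be/${id}`,
      url: [`https://youtu.be/${id}`],
      uploadDate: "2022-02-15",
    },
  };
});

// Pretty-printed, like the real 15.6mb / 280k-line file.
writeFileSync("test.json", JSON.stringify(entries, null, 2));

Usage would be along the lines of node ./generate-test-data.js 6900.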
p.s. this is the template that's in use, in case the problem is a combination of data size and the template rather than just data size: https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive/blob/d5040ac3a42f8eca9517931812892d493b81d326/11ty/pages/transcriptions.njk
@pdehaan working on an isolated test case, have managed to trigger the bug in 0.12, going to check at what point 0.12 succeeds where 1.0 fails.
@pdehaan isolated test case currently fails on 0.11, 0.12, and 1.0 at about 21980 entries: https://github.com/SignpostMarv/11ty-eleventy-issue-2226
usage: git checkout ${branch} && npm install && node ./generate ${number} && ./node_modules/.bin/eleventy
the data & templates aren't as complex as those in the media-search-archive repo; I'll give it a second pass to make it more complex if it's not useful enough to let you experiment with avoiding the heap out of memory issue.
p.s. because the generator is currently non-seeded, please find attached the gzipped test.json file (test.json.gz) that all three versions currently fail on
@pdehaan including the markdown repo as a source across all three versions definitely suggests it's templating- or data-related rather than input-related, as all three versions can handle 7k of just straight-up markdown files. Will amend further in the near future and keep you apprised.
@pdehaan bit of a delay with further investigation; I've started converting the runtime-generated data to pre-baked data, and it looks like having the 131k-line json data file in memory causes the problems.
@pdehaan have updated the test-case repo; it now fails on 1.0 with 9k entries (node ./generate.js 9000) but runs on 0.11 and 0.12 without issue.
I'm hitting this problem as well. I have a site (only about 1,600 pages) that builds fine with Eleventy 0.12.0, but when I upgraded to 1.0.0 I get out-of-memory errors.
I've got a global data file (JS) that pulls data from a database (about 660 rows of data) and uses pagination to create one page for each entry from the database. If I shut the database off so that those pages don't get built, the build runs fine with 1.0.0.
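For anyone skimming, a minimal sketch of the pattern being described, with a stub standing in for the real database client (file name, field names, and permalink are hypothetical):

// _data/datasets.js - hypothetical global data file (sketch; a stub replaces the real DB query)
async function fetchRowsFromDatabase() {
  // Stand-in for the real query; the actual setup pulls ~660 rows with 30+ fields each.
  return Array.from({ length: 660 }, (_, i) => ({
    id: i + 1,
    title: `Dataset ${i + 1}`,
    body: "<p>HTML content from the database</p>",
  }));
}

module.exports = async function () {
  return fetchRowsFromDatabase();
};

// A paginated template then builds one page per row, with front matter along the lines of:
//   pagination:
//     data: datasets
//     size: 1
//     alias: dataset
//   permalink: "/datasets/{{ dataset.id }}/"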
I can work around the issue by increasing Node's max memory thus:
NODE_OPTIONS=--max_old_space_size=8192 npm run build
Not sure what happened with 1.0.0 that increased the memory usage this much (with pagination, or global data?) but it'd be great to get it back down.
/summon @zachleat Possible performance regression between 0.12.x and 1.0.
Thanks @SignpostMarv, I'll try fetching the ZIP file from https://github.com/11ty/eleventy/issues/2226#issuecomment-1059791740 and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).
@esheehan-gsl How complex is your content from your database? (is it Liquid or Markdown? etc) I've toyed with creating an "11ty lorem ipsum" blog generator in the past which just creates X pages based on some global variable so I can poke at performance issues like this w/ bigger sites. But sometimes it comes down more to how many other filters and plugins are in play and the general complexity of the site than to just 600 pages vs 6000 pages (which can be frustrating).
How complex is your content from your database? (is it Liquid or Markdown? etc)
There are quite a few fields coming from the database, probably over 30. Some of it is HTML, some of it is just metadata (paths to video files, categories) that gets rendered on the page.
If it helps, it's used to build these pages: https://sos.noaa.gov/catalog/datasets/721/
Thanks @SignpostMarv, I'll try fetching the ZIP file from #2226 (comment) and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).
@pdehaan to clarify, the zip file isn't needed, as the problem is replicable at a lower volume of generated pages (9k + supplementary data) rather than the zip file's higher volume (21.9k w/ no supplementary data)
I created https://github.com/pdehaan/11ty-lorem which can generate 20k pages (in ~21s). If I bump it to around ~22k pages, it seems to hit memory issues (on Eleventy v1.0.0).
@pdehaan could you now grab the supplementary data file from my test repo (or generate something similar) and see how much lower you have to drop the page count?
Howdy y’all, there are a few issues to organize here so just to keep things concise I am opening #2360 to coordinate this one. Please follow along there!
@zachleat tracking updates specific to the test repo here, rather than on new/open tickets:
80000
- 0.12.1 (main branch) on https://github.com/SignpostMarv/11ty-eleventy-issue-2226/commit/ca8b5b4, fail w/ FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
- 0.11.1 (eleventy-0.11 branch) on https://github.com/SignpostMarv/11ty-eleventy-issue-2226/commit/880b320, fail w/ FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
- 1.0.0 (eleventy-1.0 branch) on https://github.com/SignpostMarv/11ty-eleventy-issue-2226/commit/ccb7d05, fail w/ FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
- 2.0.0-canary.9 (eleventy-2.0 branch) on https://github.com/SignpostMarv/11ty-eleventy-issue-2226/commit/2e8f2da, fail w/ EMFILE: too many open files
40000
as above, except for:
- 2.0.0-canary.9, succeeds
What are the success conditions here? Is 80K the goal?
@zachleat was basing the test cases on your google spreadsheet; one assumes that if it succeeds at 40k it'll succeed at the other sizes you found.
p.s. I'm not sure if the 80k "too many open files" thing should be counted as a new issue or a won't-fix?
success @ 50k + 55k + 59k + 59.9k + 59.92k + 59.925k + 59.928k + 59.929k, too many open files @ 60k + 59.99k + 59.95k + 59.93k
A couple of things I'm noticing:
- whether it succeeds or fails, eleventy seems to hang on the last file for a while before spitting out the stats / completion messages
- when it errors out with too many open files, eleventy reports 0 files written; is 11ty holding these in tmp / memory somewhere, rather than writing the output in batches?
Is there any progress here? I also have a bigger JSON source (5.6MB with 270k rows) that made circa 17k pages. On my local setup, I can build it with --max_old_space_size in ~5 minutes, but on Netlify, it breaks with the heap limit.
On another topic: do you have any tips on importing this amount without breaking? Is an external database a better idea?
Thank you!
The most terrible option would be to duplicate templates & split the data up.
Yeah, that is something that came to my mind, too, but it will kill the pagination and the collection as a whole. It would be cool if we could break these files into smaller pieces and source them under the same collection or something similar.
For some reason, I could build it on Netlify without the error (maybe it needed time for the NODE_OPTIONS to take effect, or it just had a better day, I'm not sure, unfortunately), but it's still complicated to plan around knowing about this problem. And my demo is quite plain, almost only data; the biggest extra is an image inliner (SVG) shortcode for the icons.
Thank you for the feedback. I'll update if there's anything worthwhile.
If you're referring to pagination links, one assumes that if you're taking steps to have data automatically split, you can have pagination values automatically generated "correctly"?
Breaking the source file beforehand could work for me if I could handle it as one collection at import. Still, it's much more editorial work to manage, but at least no hacking at the template level. For the pagination (to connect two sources): I think you can offset the second source's pagination, but you'd still have two unrelated data groups with more administration and hacky solutions.
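For what it's worth, a rough sketch of sourcing pre-split chunks as one data set at import (hypothetical directory and file names; whether this actually improves the memory profile is untested):

// _data/entries.js - hypothetical sketch: merge pre-split JSON chunks back into one array,
// so pagination and collections still see a single data set.
const { readdirSync, readFileSync } = require("fs");
const path = require("path");

// e.g. data-chunks/entries-000.json, data-chunks/entries-001.json, ...
const CHUNK_DIR = path.join(__dirname, "..", "data-chunks");

module.exports = function () {
  return readdirSync(CHUNK_DIR)
    .filter((name) => name.endsWith(".json"))
    .sort()
    .flatMap((name) => JSON.parse(readFileSync(path.join(CHUNK_DIR, name), "utf8")));
};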
I've yet to revisit upgrades on mine since migrating away from the mixed markdown + json sources to eliminate the markdown source 🤔