Memory issue on Fedora with the latest V8 (8.8) requirement
According to /usr/bin/time -v on my machine, compilation of array-sort-tq-csa.o takes about 810 MB of memory.
Do you know if that's increased from before? It could very well be that this version of V8 has tipped the memory requirements for compilation such that 2GiB is no longer enough, and we need to add either more dedicated memory or swap to bring the Fedora hosts on par with the others (4GiB seems to be what other similar hosts are on, https://github.com/nodejs/node/pull/36139#issuecomment-762223064). Maybe open an issue over in nodejs/build?
Originally posted by @richardlau in https://github.com/nodejs/node/issues/36139#issuecomment-766850633
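(For anyone wanting to reproduce that number: one hedged way is to wrap the build in /usr/bin/time -v and read the peak-RSS line; the -j value below is just an example, run from a node checkout.)
# "Maximum resident set size" is the peak RSS of the hungriest child
# process in the build (see getrusage(2)), not the sum over all of them.
/usr/bin/time -v make -j2 2>&1 | grep -i "maximum resident set size"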
From https://github.com/nodejs/node/pull/36139#issuecomment-762200271:
Still failing: https://ci.nodejs.org/job/node-test-commit-linux/39436/nodes=fedora-latest-x64/console
It looks like the host is running out of memory.
09:56:58 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
09:58:43 FATAL: command execution failed
09:58:43 java.nio.channels.ClosedChannelException
and from the system log on test-rackspace-fedora32-x64-1:
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 5.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.
The same for the earlier https://ci.nodejs.org/job/node-test-commit-linux/nodes=fedora-latest-x64/39376/console
08:13:30 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/third_party/v8/builtins/array-sort-tq-csa.o] Terminated
08:14:54 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
08:14:54 FATAL: command execution failed
08:14:54 java.nio.channels.ClosedChannelException
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 3.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.
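(For anyone triaging a similar failure, the OOM evidence above can be pulled straight from the journal on the host; a sketch, using the same unit name and dates as in the logs above:)
journalctl -k --since "2021-01-18" | grep -iE "out of memory|oom"   # kernel OOM-killer messages
journalctl -u jenkins.service --since "2021-01-18" | grep -i oom    # what happened to the agent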
From https://github.com/nodejs/node/pull/36139#issuecomment-762223064:
How much memory does it have compared to other similar hosts?
Appears to be 2GiB
[root@test-rackspace-fedora32-x64-1 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       290Mi       402Mi        10Mi       1.2Gi       1.5Gi
Swap:            0B          0B          0B
[root@test-rackspace-fedora32-x64-1 ~]#
For comparison, the other fedora-latest-x64 host:
$ ssh test-digitalocean-fedora32-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       317Mi       202Mi       0.0Ki       1.4Gi       1.4Gi
Swap:            0B          0B          0B
$
The two fedora-last-latest-x64 hosts:
$ ssh test-digitalocean-fedora30-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       292Mi       1.3Gi       0.0Ki       368Mi       1.5Gi
Swap:            0B          0B          0B
$ ssh test-digitalocean-fedora30-x64-2 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       284Mi       2.0Gi       0.0Ki       1.6Gi       3.3Gi
Swap:            0B          0B          0B
$
centos7-64-gcc8:
$ ssh test-rackspace-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        256M        975M        3.5M        600M        1.4G
Swap:          2.0G        295M        1.7G
$ ssh test-softlayer-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        142M        1.3G        6.0M        376M        1.5G
Swap:          2.0G        260M        1.7G
$
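(A quick way to gather the same comparison in one pass, assuming the same SSH access used above:)
for host in test-rackspace-fedora32-x64-1 test-digitalocean-fedora32-x64-1 \
            test-digitalocean-fedora30-x64-1 test-digitalocean-fedora30-x64-2 \
            test-rackspace-centos7-x64-1 test-softlayer-centos7-x64-1; do
  echo "== $host =="; ssh "$host" "free -h"
done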
I've added the build agenda label to this in case nobody gets around to looking at this. I know that the WG members from Red Hat are busy this week.
@rvagg had some suggestions in https://github.com/nodejs/node/pull/36139#issuecomment-766732513 to see if clearing things up on the existing hosts helps. Otherwise we might look at either adding 2GiB of swap to the Fedora hosts (if we have the disk space) or bumping the allocated memory.
I'm only just seeing this so I don't have anything very intelligent to add (such as why it's failing) other than:
- Failures on standard configurations are intended to be a signal that something is not right; switching to clang might "fix" this problem, but then you're just shipping software that's likely to fail on the particular configuration that's failing in CI. I see suggestions that memory is the cause; is there a known OOM here? I'm not seeing that in the log for that last CI run.
- Fedora 33 is out so fedora-latest needs to be upgraded to that when someone (probably me) has time to do that. But Fedora 32, which is failing here, will still be in the mix as fedora-last-latest. It'd be quite interesting to see whether this is still failing on 33.
- Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean (maybe there's a memory hogging program in the background?).
Originally posted by @rvagg in https://github.com/nodejs/node/issues/36139#issuecomment-766732513
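(For reference, a rough sketch of those clean-up steps, run as root on one of the hosts; the workspace path and service name come from the logs above, and the slave.jar update is left as a comment since its source URL is Jenkins-specific:)
systemctl stop jenkins               # stop the agent before touching its files
dnf upgrade --refresh -y             # bring the OS and toolchain up to date
rm -rf ~iojs/build/workspace/*       # clear out stale build workspaces
# re-download slave.jar from the Jenkins server here (exact URL omitted)
reboot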
@richardlau is the failure only on Fedora because the machines were configured with less memory or is it something specific to Fedora?
@mhdawson I haven't found anything yet to suggest a Fedora specific issue vs a simple memory issue.
@richardlau thanks, in terms of:
Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean (maybe there's a memory hogging program in the background?).
Is that something you will have time to do on one of the machines?
@mhdawson I'm not sure. I don't have much work time available in the remainder of this week outside of the scheduled Red Hat meetings. I could make time next week.
I updated those two machines, cleared workspaces and rebooted. Here's a green run for you for that problematic PR: https://ci.nodejs.org/job/node-test-commit-linux/39601/
We've historically targeted ~2GB, ~2-core machines in CI; they should be our most common configuration. If it were a universal memory problem then I'd expect to see it in more places than just one type of machine. My guess is that it's a bug in the toolchain that's since been resolved. There were a number of toolchain updates in the big batch of updates installed, including gcc and glibc. The biggest memory hog on the machine is the java process running Jenkins, sitting at ~200MB, and they're back near that level after being restarted, so it doesn't look like they were bloating, and there wasn't anything else taking up very much.
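(For the record, a quick way to see the same thing on a host, largest resident processes first, is something like:)
ps aux --sort=-rss | head -n 6   # biggest memory users, RSS in KiB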
:shrug: we'll keep an eye on these machines but for now it seems to be addressed.
Nice work ❤️ @rvagg
I'm reopening because every time there's a V8 update that requires recompiling everything, I have to run CI many times hoping it passes.
It also happens with centos7-arm64-gcc8
Refs: https://ci.nodejs.org/job/node-test-commit-arm/38530/nodes=centos7-arm64-gcc8/
Well .. centos7-arm64-gcc8 is interesting because it's got plenty of memory. I think we're dealing with too much parallelism on that machine. For all of the arm64 machines we have server_jobs: 50, which is a bit much; I think we need to pull that right back to something more reasonable, like 12.
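(For a rough sense of scale, using the ~810 MB peak measured for array-sort-tq-csa.o earlier in this thread and assuming the worst case where the heaviest translation units all compile at once: 50 parallel jobs could in theory peak somewhere around 40 GB, while 12 jobs keeps that worst case nearer 10 GB.)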
We are also migrating our arm64 machines to new hardware and it'll be a good opportunity to fix all of this. I was hoping to do some nice containerised arm64 infra like we have for our *linux-containered* builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that. Something we'll need to pay attention to.
As for fedora, I'm still at a bit of a loss. But we do need to upgrade, we're stuck on 30 and 32 but should be on 34 (probably keep 32 as our "last"). I don't know why they stand out, they're running on the same spec hardware as many of our other VMs. They have JOBS set to 2 so shouldn't be overdoing it.
arm64 machines updated to JOBS=12, systems updated and rebooted; I think they should be good to go now.
Maybe someone from Red Hat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest Fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime; there's also Alpine to do, I think.
I was hoping to do some nice containerised arm64 infra like we have for our linux-containered builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that.
@rvagg Yep, need to get on with that, but other critical stuff has come up; next week hopefully (I'm on vacation until Tuesday now). Superficially it sounds like we're pretty much on the same page in terms of what it makes sense to do, though :-)
Maybe someone from Red Hat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest Fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime; there's also Alpine to do, I think.
I've started this now (starting with test-digitalocean-fedora30-x64-1). Reimaging was fairly painless but I ran into https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf (with Fedora 34) meaning our playbooks failed until I went onto the machine and fixed the DNS settings (as per https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf?answer=66950).
Did we ever try increasing the swap space on these machines (since it looks, from the output earlier in this issue, like they had none)?
I don't think so
We have not. Is it easily done via Ansible? I'm up for trying.
I've added swap to the two Fedora 32 hosts:
dd if=/dev/zero of=/swapfile bs=1024 count=2097152
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
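(One possible follow-up, not part of the steps above: verifying the swap is active and making the swapfile survive a reboot. A minimal sketch, assuming the same /swapfile path and run as root:)
swapon --show
free -h
# persist across reboots (append only if an entry isn't already there)
grep -q '^/swapfile' /etc/fstab || echo '/swapfile none swap defaults 0 0' >> /etc/fstab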
Might also try adding swap to the Debian 10 hosts, e.g. https://ci.nodejs.org/job/node-test-commit-linux/nodes=debian10-x64/43295/console
01:15:34 cc1plus: out of memory allocating 2097152 bytes after a total of 22560768 bytes
01:15:34 make[2]: *** [tools/v8_gypfiles/v8_compiler.target.mk:266: /home/iojs/build/workspace/node-test-commit-linux/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/pipeline.o] Error 1
Have added swap to test-rackspace-debian10-x64-1.
We ended up having an informal meeting and not streaming. In retrospect we probably should have streamed but at the start we were not sure we were going to discuss too much.
I see I got the wrong issue for the last comment.
Seems like adding swap has resolved the issue, but leaving this open until we have added the swap setup to our Ansible scripts.
EDIT: we agreed to add this to the manual instructions for now and then close this issue.
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.