
Memory issue on Fedora with latest V8 (8.8) requirement

Open gengjiawen opened this issue 4 years ago • 24 comments

According to /usr/bin/time -v on my machine, compilation of array-sort-tq-csa.o takes about 810 MB of memory.

Do you know if that's increased from before? It could very well be that this version of V8 has tipped the memory requirements for compilation such that the 2GiB they have is no longer enough, and we need to add either more dedicated memory or swap to bring the Fedora hosts on par with the others (4GiB seems to be what other similar hosts are on, https://github.com/nodejs/node/pull/36139#issuecomment-762223064). Maybe open an issue over in nodejs/build?

Originally posted by @richardlau in https://github.com/nodejs/node/issues/36139#issuecomment-766850633
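
For reference, the per-file figure above comes from GNU time's verbose mode, which reports a process's peak resident set size. A minimal illustration of that measurement (the source file name and compiler flags below are placeholders, not the exact command from the Node.js build):

/usr/bin/time -v g++ -O2 -c some-torque-generated-file.cc 2>&1 | grep "Maximum resident set size"
# "Maximum resident set size (kbytes)" is the compile's peak RSS, reported in KiB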

gengjiawen avatar Jan 25 '21 14:01 gengjiawen

From https://github.com/nodejs/node/pull/36139#issuecomment-762200271:

Still failing: https://ci.nodejs.org/job/node-test-commit-linux/39436/nodes=fedora-latest-x64/console

It looks like the host is running out of memory.

09:56:58 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
09:58:43 FATAL: command execution failed
09:58:43 java.nio.channels.ClosedChannelException

and from the system log on test-rackspace-fedora32-x64-1:

Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 18 09:58:43 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 5.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 16h 38min 49.487s CPU time.
Jan 18 09:59:13 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.

The same for the earlier https://ci.nodejs.org/job/node-test-commit-linux/nodes=fedora-latest-x64/39376/console

08:13:30 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/third_party/v8/builtins/array-sort-tq-csa.o] Terminated
08:14:54 make[2]: *** [tools/v8_gypfiles/v8_initializers.target.mk:385: /home/iojs/build/workspace/node-test-commit-linux/nodes/fedora-latest-x64/out/Release/obj.target/v8_initializers/gen/torque-generated/test/torque/test-torque-tq-csa.o] Terminated
08:14:54 FATAL: command execution failed
08:14:54 java.nio.channels.ClosedChannelException
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: A process of this unit has been killed by the OOM killer.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Main process exited, code=exited, status=143/n/a
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Failed with result 'oom-kill'.
Jan 14 08:14:54 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 3.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Stopped Jenkins Slave.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: jenkins.service: Consumed 1h 12min 3.598s CPU time.
Jan 14 08:15:24 test-rackspace-fedora32-x64-1 systemd[1]: Started Jenkins Slave.

From https://github.com/nodejs/node/pull/36139#issuecomment-762223064:

How much memory does it have compared to other similar hosts?

Appears to be 2GiB

[root@test-rackspace-fedora32-x64-1 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       290Mi       402Mi        10Mi       1.2Gi       1.5Gi
Swap:            0B          0B          0B
[root@test-rackspace-fedora32-x64-1 ~]#

For comparison, the other fedora-latest-x64 host:

$ ssh test-digitalocean-fedora32-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       317Mi       202Mi       0.0Ki       1.4Gi       1.4Gi
Swap:            0B          0B          0B
$

The two fedora-last-latest-x64 hosts:

$ ssh test-digitalocean-fedora30-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          1.9Gi       292Mi       1.3Gi       0.0Ki       368Mi       1.5Gi
Swap:            0B          0B          0B
$ ssh test-digitalocean-fedora30-x64-2 "free -h"
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       284Mi       2.0Gi       0.0Ki       1.6Gi       3.3Gi
Swap:            0B          0B          0B
$

centos7-64-gcc8:

$ ssh test-rackspace-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        256M        975M        3.5M        600M        1.4G
Swap:          2.0G        295M        1.7G
$ ssh test-softlayer-centos7-x64-1 "free -h"
              total        used        free      shared  buff/cache   available
Mem:           1.8G        142M        1.3G        6.0M        376M        1.5G
Swap:          2.0G        260M        1.7G
$

richardlau avatar Jan 25 '21 16:01 richardlau

I've added the build agenda label to this in case nobody gets around to looking at it. I know that the WG members from Red Hat are busy this week.

@rvagg had some suggestions in https://github.com/nodejs/node/pull/36139#issuecomment-766732513 to see if clearing things up on the existing hosts helps. Otherwise we might look at either adding 2GiB of swap to the Fedora hosts (if we have the disk space) or bumping the allocated memory.

richardlau avatar Jan 25 '21 17:01 richardlau

I'm only just seeing this, so I don't have much intelligent to add (such as why it's failing) other than:

  • Failures on standard configurations are intended to be a signal that something is not right. Switching to clang might "fix" this problem, but then you're just shipping software that's likely to fail on the particular configuration that's failing in CI. I see suggestions that memory is the issue; is there a known OOM here? I'm not seeing one in the log for that last CI run.
  • Fedora 33 is out so fedora-latest needs to be upgraded to that when someone (probably me) has time to do that. But Fedora 32, which is failing here, will still be in the mix as fedora-last-latest. It'd be quite interesting to see whether this is still failing on 33.
  • Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean (maybe there's a memory hogging program in the background?).

Originally posted by @rvagg in https://github.com/nodejs/node/issues/36139#issuecomment-766732513
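
A rough sketch of that maintenance pass, for anyone picking it up (the agent jar path and download URL are assumptions, not taken from the build playbooks):

sudo dnf upgrade -y                                                          # refresh the toolchain (gcc, glibc, etc.) and base system
sudo curl -o /home/iojs/slave.jar https://ci.nodejs.org/jnlpJars/slave.jar   # assumed agent jar location and URL
sudo rm -rf /home/iojs/build/workspace/*                                     # clear out stale build workspaces
sudo reboot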

gengjiawen avatar Jan 26 '21 02:01 gengjiawen

@richardlau is the failure only on Fedora because the machines were configured with less memory or is it something specific to Fedora?

mhdawson avatar Jan 26 '21 22:01 mhdawson

@mhdawson I haven't found anything yet to suggest a Fedora specific issue vs a simple memory issue.

richardlau avatar Jan 26 '21 22:01 richardlau

@richardlau thanks, in terms of:

Someone with @nodejs/build test permissions could log in to the two machines (test-rackspace-fedora32-x64-1 and test-digitalocean-fedora32-x64-1), run dnf upgrade, update slave.jar, clear out ~iojs/build/workspace and reboot. I reckon it's been ages since anyone was in these machines and there might be something local that would be fixed up with a clean (maybe there's a memory hogging program in the background?).

Is that something you will have time to do on one of the machines?

mhdawson avatar Jan 26 '21 23:01 mhdawson

@mhdawson I'm not sure. I don't have much work time available in the remainder of this week outside of the scheduled Red Hat meetings. I could make time next week.

richardlau avatar Jan 26 '21 23:01 richardlau

I updated those two machines, cleared workspaces and rebooted. Here's a green run for you for that problematic PR: https://ci.nodejs.org/job/node-test-commit-linux/39601/

We've historically targeted ~2 GB, ~2-core machines in CI; they should be our most common configuration. If it were a universal memory problem then I'd expect to see it in more places than just one type of machine. My guess is that it's a bug in the toolchain that's been resolved; there were a number of toolchain updates in the big batch of updates installed, including gcc and glibc. The biggest memory hog on the machines is the java process running Jenkins, sitting at ~200 MB, and they're back near that level after being restarted, so it doesn't look like they were bloating, and there wasn't anything else taking up very much.

:shrug: we'll keep an eye on these machines but for now it seems to be addressed.

rvagg avatar Jan 27 '21 06:01 rvagg

Nice work ❤️ @rvagg

gengjiawen avatar Jan 27 '21 06:01 gengjiawen

I'm reopening because every time there's a V8 update that requires recompiling everything, I have to run CI many times hoping it passes.

It also happens with centos7-arm64-gcc8

targos avatar Aug 02 '21 11:08 targos

Refs: https://ci.nodejs.org/job/node-test-commit-arm/38530/nodes=centos7-arm64-gcc8/

targos avatar Aug 02 '21 12:08 targos

Well .. centos7-arm64-gcc8 is interesting because it's got plenty of memory. I think we're dealing with too much parallelism on that machine. For all of the arm64 machines we have server_jobs: 50 which is .. a bit much, I think we need to pull that right back. Maybe something more reasonable like 12.

We are also migrating our arm64 machines to new hardware and it'll be a good opportunity to fix all of this. I was hoping to do some nice containerised arm64 infra like we have for our *linux-containered* builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that. Something we'll need to pay attention to.

As for fedora, I'm still at a bit of a loss. But we do need to upgrade, we're stuck on 30 and 32 but should be on 34 (probably keep 32 as our "last"). I don't know why they stand out, they're running on the same spec hardware as many of our other VMs. They have JOBS set to 2 so shouldn't be overdoing it.
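
For context on why the job count matters: parallel compiles multiply the per-file peak, and the torque-generated files measured at the top of this issue peak around 0.8 GB each. A back-of-the-envelope sketch (the make invocation is only illustrative, not the exact CI wiring):

# rough peak memory ≈ parallel compile jobs × worst-case per-file compile (~0.8 GB here)
#   JOBS=2  → ~1.6 GB, already tight on a 2 GiB host with no swap
#   JOBS=50 → ~40 GB, far beyond any of these hosts
export JOBS=12       # e.g. the value the arm64 machines were set to
make -j"$JOBS"       # illustrative invocation; the actual CI job wiring may differ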

rvagg avatar Aug 04 '21 03:08 rvagg

arm64 machines updated to JOBS=12, systems updated and rebooted, I think they should be good to go now

Maybe someone from RedHat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime, there's also Alpine to do I think.

rvagg avatar Aug 04 '21 04:08 rvagg

I was hoping to do some nice containerised arm64 infra like we have for our linux-containered builds but for different arm64 distros, but I think @sxa might have other ideas and has put up his hand to jump in on that.

@rvagg Yep, need to get on with that, but other critical stuff has come up - next week hopefully (I'm on vacation until Tuesday now). Superficially sounds like we're pretty much on the same page in terms of what it makes sense to do though :-)

sxa avatar Aug 04 '21 18:08 sxa

Maybe someone from RedHat might volunteer to update our Fedora systems? I usually take the older ones, reimage them with the latest fedora, then change labels so that fedora-latest is the new ones and fedora-last-latest is the ones that were previously newest (32 in this case). If not, I might get to it sometime, there's also Alpine to do I think.

I've started this now (starting with test-digitalocean-fedora30-x64-1). Reimaging was fairly painless, but I ran into https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf (with Fedora 34), which meant our playbooks failed until I went onto the machine and fixed the DNS settings (as per https://www.digitalocean.com/community/questions/fedora-33-how-to-persist-dns-settings-via-etc-resolv-conf?answer=66950).
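
For anyone hitting the same thing: one common workaround on recent Fedora (an assumption on my part — the linked answer may use a different mechanism) is to stop NetworkManager from managing /etc/resolv.conf and write it statically:

printf '[main]\ndns=none\n' > /etc/NetworkManager/conf.d/90-dns-none.conf   # keep NetworkManager from rewriting resolv.conf
rm -f /etc/resolv.conf                                                      # drop the systemd-resolved stub symlink
printf 'nameserver 1.1.1.1\nnameserver 8.8.8.8\n' > /etc/resolv.conf        # example resolvers, not necessarily the ones used
systemctl restart NetworkManager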

richardlau avatar Sep 02 '21 15:09 richardlau

Did we ever try increasing the swap space on these machines? (Since it looks, from the output earlier in this issue, like they had none.)

sxa avatar Oct 07 '21 08:10 sxa

I don't think so

rvagg avatar Oct 07 '21 09:10 rvagg

We have not. Is it easily done via Ansible? I'm up for trying.

richardlau avatar Oct 07 '21 16:10 richardlau

I've added swap to the two Fedora 32 hosts:

dd if=/dev/zero of=/swapfile bs=1024 count=2097152
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
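
If the swap should also survive a reboot (an assumption — these hosts may simply get reimaged instead), an /etc/fstab entry is needed as well:

echo '/swapfile none swap sw 0 0' >> /etc/fstab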

richardlau avatar Oct 13 '21 10:10 richardlau

Might also try adding swap to the Debian 10 hosts, e.g. https://ci.nodejs.org/job/node-test-commit-linux/nodes=debian10-x64/43295/console

01:15:34 cc1plus: out of memory allocating 2097152 bytes after a total of 22560768 bytes
01:15:34 make[2]: *** [tools/v8_gypfiles/v8_compiler.target.mk:266: /home/iojs/build/workspace/node-test-commit-linux/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/pipeline.o] Error 1

richardlau avatar Oct 14 '21 10:10 richardlau

Have added swap to test-rackspace-debian10-x64-1.

richardlau avatar Oct 14 '21 13:10 richardlau

We ended up having an informal meeting and not streaming. In retrospect we probably should have streamed but at the start we were not sure we were going to discuss too much.

mhdawson avatar Dec 14 '21 23:12 mhdawson

I see I got the wrong issue for the last comment.

mhdawson avatar Dec 15 '21 21:12 mhdawson

Seems like adding swap has resolved the issue, but leaving this open until we have added the swap setup to our Ansible scripts.

EDIT: we agreed to add it to the manual instructions for now and then close this issue.

mhdawson avatar Jan 25 '22 23:01 mhdawson

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions[bot] avatar May 09 '23 00:05 github-actions[bot]