/home/jenkins/workspace/Grinder: No space left on device

Open sophia-guo opened this issue 4 years ago • 28 comments

/home/jenkins/workspace/Grinder: No space left on device. The error was found on the following docker machines: test-docker-fedora33-x64-1, test-docker-fedora33-x64-2

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1027/ https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1028/console

sophia-guo avatar Jul 05 '21 13:07 sophia-guo

From docker system df -v it appears that the f33l.2229, alp311.2231 and alp312.2230 containers are using around 150Gb of space each which is likely causing us some problems.

sxa avatar Jul 05 '21 14:07 sxa

Test_openjdk18_hs_extended.openjdk_x86-64_alpine-linux_testList_0 was using 43Gb on the Alpine 3.12 container. Similarly, it was the _1 variant of the same job that was chewing up a comparable amount on Alpine 3.11, so I suspect they had been aborted part way through.

Both have been cleared and the host now has about 131Gb available, which should resolve the problem. Therefore closing.

sxa avatar Jul 05 '21 14:07 sxa

I can see it has happened again. test-docker-fedora33-x64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/354/console

sophia-guo avatar Oct 18 '21 16:10 sophia-guo

test-docker-ubuntu1604-x64 similar issue: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/352/console

sophia-guo avatar Oct 18 '21 16:10 sophia-guo

docker-packet-ubuntu2004-amd-1 similar issue: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/356/console

sophia-guo avatar Oct 18 '21 18:10 sophia-guo

test-docker-ubuntu2010-x64-1 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/357/console

sophia-guo avatar Oct 18 '21 18:10 sophia-guo

test-docker-fedora33-x64-1: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/363/console

sophia-guo avatar Oct 18 '21 18:10 sophia-guo

test-docker-fedora33-x64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/WIP_Test_Job_Auto_Gen/65/console

llxia avatar Oct 19 '21 14:10 llxia

test-docker-ubuntu1804-x64-1 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/373/console

sophia-guo avatar Oct 20 '21 15:10 sophia-guo

test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2875/console

llxia avatar Dec 29 '21 19:12 llxia

Not sure why the host is using so much space, but a docker system prune -a has recovered 30GB so that should keep it going for a while.

The biggest uses of space on the Fedora box appear to have been these, so I've also cleared them out:

639908	Test_openjdk19_hs_extended.openjdk_x86-64_linux_testList_1
779588	Test_openjdk17_hs_extended.system_x86-64_linux
1339544	Test_openjdk11_hs_extended.functional_x86-64_linux
1986392	Test_openjdk11_bisheng_sanity.openjdk_x86-64_linux
2164604	Test_openjdk8_bisheng_extended.openjdk_x86-64_linux_testList_2
2728576	Test_openjdk8_j9_sanity.openjdk_x86-64_linux
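A listing like the one above can be produced with `du` and `sort`. The sketch below is only an illustration: it assumes the sizes are in kilobytes (as `du -k` reports them) and that the job workspaces are immediate subdirectories of one parent directory. The function name `largest_dirs` is hypothetical.

```shell
# Hypothetical helper: print the N largest immediate subdirectories of a
# parent directory, smallest first, with sizes in kilobytes.
largest_dirs() {
    parent="$1"
    count="${2:-6}"
    # -k: report in kilobytes, -s: one summary line per argument
    du -ks "$parent"/*/ 2>/dev/null | sort -n | tail -n "$count"
}
```

For example, `largest_dirs /home/jenkins/workspace 6` would list the six largest workspaces on a test machine.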

sxa avatar Dec 31 '21 14:12 sxa

test-docker-ubuntu1804-x64-1 https://ci.adoptopenjdk.net/view/Test_grinder/job/Test_Job_Auto_Gen/278/console

sophia-guo avatar Jan 13 '22 21:01 sophia-guo

test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:

https://ci.adoptopenjdk.net/view/Test_grinder/job/Test_Job_Auto_Gen/277/

sophia-guo avatar Jan 13 '22 21:01 sophia-guo

@Haroon-Khel As the new expert in the DockerStatic stuff, can you take a look and see what we can do with this please? We probably need some sort of automation (jenkins job or otherwise) that goes over the dockerhost machines and checks and if necessary reports any problems with:

  • total disk space on the host
  • total disk space in use by docker
  • whether any particular container is chewing up more space than it ought to be (probably in the workspace directory)

Doing something with the output of something like these commands may be a good place to start: df -k; docker system df; for CONTAINER in $(docker ps -q); do echo CONTAINER $CONTAINER = $(docker ps | awk "/^$CONTAINER/{print\$NF}"); docker exec $CONTAINER du -ks /home/jenkins/workspace / 2>/dev/null; done
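As a rough sketch of how those commands could be turned into a report, the functions below flag any file system above a usage threshold and then print the docker and per-container figures. The function names and the 90% default threshold are assumptions for illustration only, and this is not the script that was eventually written for the dockerhost machines.

```shell
# Hypothetical helper: read `df -Pk` output on stdin and print every
# mount point whose use% is at or above the given threshold.
mounts_over_threshold() {
    threshold="$1"
    awk -v limit="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit) print $6 " " pct "%"
    }'
}

# Hypothetical report: only meaningful on a host where docker is installed.
dockerhost_report() {
    # Total disk space on the host (default threshold of 90% is an assumption)
    df -Pk | mounts_over_threshold "${1:-90}"
    # Total disk space in use by docker
    docker system df
    # Space used by each running container's workspace and root file system
    docker ps --format '{{.ID}} {{.Names}}' | while read -r id name; do
        echo "CONTAINER $id = $name"
        docker exec "$id" du -ks /home/jenkins/workspace / 2>/dev/null </dev/null
    done
}
```

Running `dockerhost_report 85` on a dockerhost would list any file system at 85% or more before the docker figures. The parsing is kept in its own function so that it can be checked against saved `df` output without needing docker.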

sxa avatar Jan 14 '22 11:01 sxa

I've created https://ci.adoptopenjdk.net/view/Tooling/job/DockerhostHealthStatus/ for now. It runs https://github.com/Haroon-Khel/openjdk-infrastructure/blob/dockerhosthealth/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/scripts/dockerhosthealth.sh, which is still in its draft stage.

Haroon-Khel avatar Jan 14 '22 16:01 Haroon-Khel

@Haroon-Khel The latest JDK11 release didn't appear to cause a filling up of the file system. I think you asserted that https://github.com/adoptium/aqa-tests/pull/3326 hadn't taken effect, although that may be a result of using the v0.8.0-release branch which won't have had the change merged. Can you try and check:

  • If it still ran the tests, why we didn't see the file system filling up
  • If it didn't run the tests, whether it was due to running from the alternate branch that didn't have them disabled

sxa avatar Feb 11 '22 11:02 sxa

Fresh issue on test-docker-ubuntu2010-x64-2:

https://ci.adoptopenjdk.net/view/work-in-progress/job/WIP_Test_Job_Auto_Gen/72/console

Building remotely on [test-docker-ubuntu2010-x64-2](https://ci.adoptopenjdk.net/computer/test-docker-ubuntu2010-x64-2) (ci.role.test sw.os.linux hw.arch.x86) in workspace /home/jenkins/workspace/WIP_Test_Job_Auto_Gen
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to test-docker-ubuntu2010-x64-2
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
		at hudson.remoting.Channel.call(Channel.java:1001)

smlambert avatar Mar 05 '22 21:03 smlambert

Adding to May 2022 plan (as it looks partly worked, and it does still affect releases)

smlambert avatar May 11 '22 13:05 smlambert

No current issues so removing from the May milestone. I'll keep it open for another month or so and then we can close if no more occurrences (Can always be reopened if required)

sxa avatar May 18 '22 12:05 sxa

test-docker-fedora34-x64-1:

Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5012/console

llxia avatar Jun 20 '22 21:06 llxia

Hmmm, I've cleared up some old volumes, although I thought I'd done a clear-up on this host earlier today, so we'll see if it fills up again. If so we'll need to investigate what's using it up. I've only been able to reclaim 25% of the 400Gb volume, and it shouldn't be using anywhere near that amount.

sxa avatar Jun 20 '22 22:06 sxa

An extra 50Gb seems to have been used up overnight on the file system. That's not normal

sxa avatar Jun 21 '22 09:06 sxa

Could you list the top files/folders that use the most space? Maybe we can get some clues.

llxia avatar Jun 21 '22 12:06 llxia

> Could you list the top files/folders that use the most space? Maybe we can get some clues.

It's not quite that simple when it's a load of docker containers on the host system unfortunately.

sxa avatar Jun 21 '22 12:06 sxa

Looks like this process might have been keeping a lot of space in use, probably through deleted files which still had file handles open to them: jenkins 2266542 9825 99 Jun17 ? 11-17:17:13 /home/jenkins/workspace/Test_openjdk8_dragonwell_sanity.openjdk_x86-64_linux/openjdkbinary/j2sdk-image/bin/java -cp . -XX:+UseG1GC -XX:+MultiTenant -XX:+TenantHeapIsolation -XX:NativeMemoryTracking=detail -XX:+PrintGCDetails -Xloggc:gc.log -Xmx1g -Xmn32m TestLeak - I've killed it now and there's 320Gb free.
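Space held by deleted files does not show up in `du`, because the files no longer have a directory entry, but it still counts against `df` until the last open handle is closed. A minimal, Linux-only sketch for checking one process is shown below. It relies on the `/proc/<pid>/fd` symlinks ending in ` (deleted)`, and the function name `deleted_open_files` is hypothetical.

```shell
# Hypothetical helper: list deleted files that a process still holds open.
# Linux-only: reads the symlinks under /proc/<pid>/fd.
deleted_open_files() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # Skip descriptors that vanish or cannot be read
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)") echo "$target" ;;
        esac
    done
}
```

For example, `deleted_open_files 2266542` would have inspected the process above while it was still running. Where `lsof` is installed, `lsof +L1` gives a similar host-wide view of open files whose link count has dropped to zero.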

sxa avatar Jun 21 '22 13:06 sxa

https://ci.adoptopenjdk.net/job/SXA-processCheck/label=test-docker-fedora34-x64-1/295/console cannot complete on this machine due to the space issue.

In the test Jenkins script, it detects the leftover processes. I think we should enforce the logic to kill the leftover processes before and after the test job. The ideal place for this logic should be in TKG. If that cannot be completed soon, maybe we should do it in the Jenkins script for now. FYI @smlambert @renfeiw
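As a rough illustration of that idea (not the TKG or JenkinsfileBase implementation), leftover processes could be found by looking for the job's workspace path in each process's command line and then killed. This is Linux-only, the function names are hypothetical, and matching on the command line is a heuristic that would need care on machines shared between jobs.

```shell
# Hypothetical helper: print the PIDs of processes whose command line
# mentions the given workspace path. Linux-only: reads /proc/<pid>/cmdline.
leftover_pids() {
    ws="$1"
    for dir in /proc/[0-9]*; do
        pid="${dir#/proc/}"
        # Never report the calling shell itself
        [ "$pid" = "$$" ] && continue
        # cmdline is NUL-separated; processes may exit while we iterate
        cmd=$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline") || continue
        case "$cmd" in
            *"$ws"*) echo "$pid" ;;
        esac
    done
}

# Hypothetical helper: send SIGTERM to every leftover process found above.
kill_leftovers() {
    for pid in $(leftover_pids "$1"); do
        kill "$pid" 2>/dev/null || true
    done
}
```

A job could call `kill_leftovers "$WORKSPACE"` both before and after the test run, where `WORKSPACE` is the variable Jenkins sets for the job's workspace directory.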

llxia avatar Jun 22 '22 17:06 llxia

I think it has already been done in the Jenkins script: https://github.com/adoptium/aqa-tests/blob/57c4bc2f4907cffdecedbd5387d4e7b6f6a33f9a/buildenv/jenkins/JenkinsfileBase#L854-L859

sophia-guo avatar Jun 27 '22 16:06 sophia-guo

re https://github.com/adoptium/infrastructure/issues/2251#issuecomment-1167570875, the above code only lists the processes.

llxia avatar Jun 30 '22 14:06 llxia

test-sxa-armv7l-ubuntu2004-odroid-2 got No space left on device.

https://ci.adoptopenjdk.net/job/Grinder/6196/console

sophia-guo avatar Nov 15 '22 19:11 sophia-guo

> test-sxa-armv7l-ubuntu2004-odroid-2 got No space left on device. https://ci.adoptopenjdk.net/job/Grinder/6196/console

Will cover this under https://github.com/adoptium/infrastructure/issues/2829

sxa avatar Nov 21 '22 14:11 sxa