infrastructure
/home/jenkins/workspace/Grinder: No space left on device
The error "/home/jenkins/workspace/Grinder: No space left on device" was seen on the following docker agents: test-docker-fedora33-x64-1 and test-docker-fedora33-x64-2
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1027/ https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1028/console
From docker system df -v it appears that the f33l.2229, alp311.2231 and alp312.2230 containers are using around 150Gb of space each which is likely causing us some problems.
Test_openjdk18_hs_extended.openjdk_x86-64_alpine-linux_testList_0 was using 43Gb on the Alpine 3.12 container. Similarly, it was the _1 variant of the same job that was chewing up a comparable amount on Alpine 3.11, so I suspect they had been aborted part way through.
Both have been cleared and the host now has about 131Gb available, which should resolve the problem. Therefore closing.
I could see it happened again. test-docker-fedora33-x64-2 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/354/console
test-docker-ubuntu1604-x64 similar issue: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/352/console
docker-packet-ubuntu2004-amd-1 similar issue: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/356/console
test-docker-ubuntu2010-x64-1 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/357/console
test-docker-fedora33-x64-1: https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/363/console
test-docker-fedora33-x64-2: https://ci.adoptopenjdk.net/view/work-in-progress/job/WIP_Test_Job_Auto_Gen/65/console
test-docker-ubuntu1804-x64-1 https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/373/console
test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2875/console
Not sure why the host is using so much space, but a docker system prune -a has recovered 30GB, so that should keep it going for a while.
The biggest uses of space on the Fedora box appear to have been these, so I've also cleared them out:
639908 Test_openjdk19_hs_extended.openjdk_x86-64_linux_testList_1
779588 Test_openjdk17_hs_extended.system_x86-64_linux
1339544 Test_openjdk11_hs_extended.functional_x86-64_linux
1986392 Test_openjdk11_bisheng_sanity.openjdk_x86-64_linux
2164604 Test_openjdk8_bisheng_extended.openjdk_x86-64_linux_testList_2
2728576 Test_openjdk8_j9_sanity.openjdk_x86-64_linux
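A listing like the one above can be produced with du and sort. This is only a minimal sketch: the function name `largest_dirs` is hypothetical, and it assumes the sizes are in KB as reported by `du -k`.

```shell
# Hypothetical helper: list the N largest entries (sizes in KB) directly
# under a directory, smallest first, largest last.
largest_dirs() {
    dir="$1"
    count="${2:-10}"   # default of 10 entries is an arbitrary assumption
    du -ks "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}

# Example (not run here): largest_dirs /home/jenkins/workspace 6
```

Inside a container it would need to be run via docker exec, since each container has its own workspace directory.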
test-docker-ubuntu1804-x64-1 https://ci.adoptopenjdk.net/view/Test_grinder/job/Test_Job_Auto_Gen/278/console
test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:
https://ci.adoptopenjdk.net/view/Test_grinder/job/Test_Job_Auto_Gen/277/
@Haroon-Khel As the new expert in the DockerStatic stuff, can you take a look and see what we can do with this please? We probably need some sort of automation (jenkins job or otherwise) that goes over the dockerhost machines and checks and if necessary reports any problems with:
- total disk space on the host
- total disk space in use by docker
- whether any particular container is chewing up more space than it ought to be (probably in the workspace directory)
Doing something with the output of something like these commands may be a good place to start: df -k; docker system df; for CONTAINER in $(docker ps -q); do echo CONTAINER $CONTAINER = $(docker ps | awk "/^$CONTAINER/{print\$NF}"); docker exec $CONTAINER du -ks /home/jenkins/workspace / 2>/dev/null; done
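As a rough sketch of the first check (total disk space on the host), the output of df could be filtered against a threshold. The function name `over_threshold` and the default of 90% are assumptions for illustration and are not taken from dockerhosthealth.sh.

```shell
# Hypothetical helper: read `df -P` style output on stdin and print every
# mount point whose usage is at or above the given percentage.
# The default threshold of 90 is an assumption, not a value from this issue.
over_threshold() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                  # skip the header line
            use = $5              # "Capacity" column, e.g. 95%
            sub(/%$/, "", use)
            if (use + 0 >= limit) print $6, use "%"
        }'
}

# Example (not run here): df -kP | over_threshold 90
```

Reading from stdin keeps the check separate from the data collection, so the same filter could be fed df output gathered from each dockerhost by a Jenkins job.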
I've created https://ci.adoptopenjdk.net/view/Tooling/job/DockerhostHealthStatus/ for now, which runs https://github.com/Haroon-Khel/openjdk-infrastructure/blob/dockerhosthealth/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/scripts/dockerhosthealth.sh which is in its draft stage right now.
@Haroon-Khel The latest JDK11 release didn't appear to cause a filling up of the file system. I think you asserted that https://github.com/adoptium/aqa-tests/pull/3326 hadn't taken effect, although that may be a result of using the v0.8.0-release branch which won't have had the change merged. Can you try and check:
- If it still ran the tests, why we didn't see the filling up of the file system
- If it didn't run the tests, whether it was due to running from the alternate branch that didn't have them disabled
Fresh issue on test-docker-ubuntu2010-x64-2:
https://ci.adoptopenjdk.net/view/work-in-progress/job/WIP_Test_Job_Auto_Gen/72/console
Building remotely on [test-docker-ubuntu2010-x64-2](https://ci.adoptopenjdk.net/computer/test-docker-ubuntu2010-x64-2) (ci.role.test sw.os.linux hw.arch.x86) in workspace /home/jenkins/workspace/WIP_Test_Job_Auto_Gen
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to test-docker-ubuntu2010-x64-2
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:1001)
Adding to May 2022 plan (as it looks partly worked, and it does still affect releases)
No current issues so removing from the May milestone. I'll keep it open for another month or so and then we can close if no more occurrences (Can always be reopened if required)
test-docker-fedora34-x64-1:
Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5012/console
Hmmm, cleared up some old volumes, although I thought I'd done a clear-up on this host earlier today, so we'll see if it fills up again. If so we'll need to investigate what's using it up. I've only been able to reclaim 25% of the 400Gb volume, and it shouldn't be using anywhere near that amount.
An extra 50Gb seems to have been used up overnight on the file system. That's not normal
Could you list the top files/folders that use the most space? Maybe we can get some clues.
It's not quite that simple when it's a load of docker containers on the host system unfortunately.
Looks like this process might have been keeping a lot of space in use, probably from deleted files which it still had file handles open to: jenkins 2266542 9825 99 Jun17 ? 11-17:17:13 /home/jenkins/workspace/Test_openjdk8_dragonwell_sanity.openjdk_x86-64_linux/openjdkbinary/j2sdk-image/bin/java -cp . -XX:+UseG1GC -XX:+MultiTenant -XX:+TenantHeapIsolation -XX:NativeMemoryTracking=detail -XX:+PrintGCDetails -Xloggc:gc.log -Xmx1g -Xmn32m TestLeak - I've killed it now and there's 320Gb free.
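Space held by deleted files does not show up in du, which is why this kind of problem is hard to spot from directory sizes alone. One way to look for it on Linux is to scan /proc for file descriptors whose target is marked "(deleted)". The function name `deleted_open_files` below is hypothetical and this is only a sketch; `lsof +L1` should report similar information where lsof is installed.

```shell
# Hypothetical helper (Linux only): print "PID path (deleted)" for every open
# file descriptor that still points at an unlinked file. Only processes whose
# /proc/PID/fd is readable by the current user are covered.
deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)')
                pid=${fd#/proc/}      # strip the leading /proc/
                pid=${pid%%/*}        # keep only the PID component
                echo "$pid $target"
                ;;
        esac
    done
}

# Example (not run here): deleted_open_files | sort -n | uniq -c
```

Running it as root on the dockerhost would cover processes inside the containers as well, since they share the host's process table.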
https://ci.adoptopenjdk.net/job/SXA-processCheck/label=test-docker-fedora34-x64-1/295/console cannot complete on this machine due to the space issue.
In the test Jenkins script, it detects the leftover processes. I think we should enforce the logic to kill the leftover processes before and after the test job. The ideal place for this logic should be in TKG. If that cannot be completed soon, maybe we should do it in the Jenkins script for now. FYI @smlambert @renfeiw
I think it has been done in the Jenkins script: https://github.com/adoptium/aqa-tests/blob/57c4bc2f4907cffdecedbd5387d4e7b6f6a33f9a/buildenv/jenkins/JenkinsfileBase#L854-L859
Re https://github.com/adoptium/infrastructure/issues/2251#issuecomment-1167570875, the above code only lists the processes; it does not kill them.
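To illustrate what killing (rather than only listing) leftover processes might look like, here is a minimal sketch. It is not the JenkinsfileBase or TKG logic: the function names `procs_under` and `kill_leftovers` are hypothetical, and it assumes a leftover process can be recognised by its executable living under the workspace, as the java process from the dragonwell job did.

```shell
# Hypothetical helper (Linux only): print the PIDs of processes whose
# executable lives under the given directory.
procs_under() {
    root="$1"
    for exe in /proc/[0-9]*/exe; do
        target=$(readlink "$exe" 2>/dev/null) || continue
        case "$target" in
            "$root"/*)
                pid=${exe#/proc/}     # strip the leading /proc/
                echo "${pid%/exe}"    # strip the trailing /exe
                ;;
        esac
    done
}

# Hypothetical helper: send SIGTERM to every process found by procs_under.
kill_leftovers() {
    for pid in $(procs_under "$1"); do
        kill "$pid" 2>/dev/null || true
    done
}

# Example (not run here): kill_leftovers /home/jenkins/workspace
```

Matching on the executable path is a deliberate simplification. Tests that launch a JDK installed outside the workspace would not be caught, so a real implementation would probably need to look at the command line or working directory too.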
test-sxa-armv7l-ubuntu2004-odroid-2 got No space left on device.
https://ci.adoptopenjdk.net/job/Grinder/6196/console
Will cover this under https://github.com/adoptium/infrastructure/issues/2829