
Regarding recent CI failures on the s390x builder


Description of problem:

Hi,

We noticed that recent CI builds are failing on the s390x builder (https://build.gluster.org/job/centos8-s390-regression/6/consoleFull). It took more than 7 hours to build and run the regression test suite, but the job aborted after only 68/832 test cases had been run.

We also found the following errors in the build log:

10:19:59 losetup: /d/dev/loop*: failed to use device: No such device
17:41:09 + rm -f /var/run/glusterd.socket
17:41:09 + rm -rf /var/run/gluster/
17:41:09 + pkill /opt/qa/regression.sh
17:41:09 FATAL: command execution failed
17:41:09 Exception when executing the batch command : no workspace from node hudson.slaves.DumbSlave[builder-el8-s390x-1.ibm-l1.gluster.org] which is computer hudson.slaves.SlaveComputer@adfae62 and has channel null
17:41:09 Build step 'Post build task' marked build as failure
17:41:09 ERROR: Step ‘Archive the artifacts’ failed: no workspace for centos8-s390-regression #6
17:41:09 ERROR: builder-el8-s390x-1.ibm-l1.gluster.org is offline; cannot locate java-1.6.0-openjdk-1.6.0.0.x86_64

Can any developers or admins from the community take a look at this issue? Thank you very much!

kun-lu20 avatar Jul 19 '22 18:07 kun-lu20

@mscherer

kun-lu20 avatar Jul 19 '22 18:07 kun-lu20

Hi @amarts , could you please take a look at this issue? Thank you very much!

kun-lu20 avatar Aug 03 '22 13:08 kun-lu20

@Shwetha-Acharya @rakshithakamath94 @mscherer is it something you guys can help here? A resolution would help a lot. thanks.

amarts avatar Aug 10 '22 06:08 amarts

I do not know the code well enough for that; I was hoping to have a dev look at it, at least to give a hint on where to look.

mscherer avatar Aug 10 '22 09:08 mscherer

I thought the issue happened because of the error below?

ERROR: builder-el8-s390x-1.ibm-l1.gluster.org is offline; cannot locate java-1.6.0-openjdk-1.6.0.0.x86_64

The error in the previous snippet,

10:19:59 losetup: /d/dev/loop*: failed to use device: No such device

is not an issue to worry about, as it happens in local setups too.

amarts avatar Aug 10 '22 09:08 amarts

My understanding is that this step happens after the build failed, and it fails because Jenkins reboots the builder as part of the build (since failing tests tend to also mess with the builder, we usually reboot it).

See https://github.com/gluster/build-jobs/blob/main/build-gluster-org/scripts/regression-abort.sh#L13

So what happens is that the regression tests fail, so we clean up the server and reboot. The error message about the JDK means that Jenkins lost contact with the builder, and is expected.

mscherer avatar Aug 10 '22 09:08 mscherer

Looking at the code, it fails because it reaches a timeout:

23:41:02 Build timed out (after 450 minutes). Marking the build as aborted.

But given it only ran 62 tests out of 832, there is a massive slowdown somewhere, and it can come from either the code (e.g., something slow on s390x specifically) or from the infra (e.g., we are running that on a cluster of floppy disks, or more likely on some oversubscribed storage).

I can try to extend the timeout to see how long it takes to run, but if it takes 3 to 4 days, it needs to be investigated.

Do we have some benchmark or something to look at (e.g., a perf test for Gluster that would help surface the issue)?

I can run some CPU benchmarks and some I/O ones, but that's all.
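
For reference, a minimal sketch of quick CPU and I/O checks that could be run on the builder and compared against a local s390x VM (assuming dd and sysbench are available; /d is the filesystem the regression suite writes to):

# rough sequential write throughput with direct I/O (bypasses the page cache)
dd if=/dev/zero of=/d/ddtest.bin bs=1M count=1024 oflag=direct conv=fsync
rm -f /d/ddtest.bin

# rough CPU benchmark: prime computation for 30 seconds
sysbench cpu --cpu-max-prime=20000 --time=30 run

A large gap between the builder and the local VM on the dd numbers would point at oversubscribed storage rather than at the code.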

mscherer avatar Aug 10 '22 09:08 mscherer

Good observation. Let's increase the timeout to 3-4 days for sure. And starting with CPU and I/O observations is good.

@kun-lu20 Any suggestions on this? Is it possible to check for the reasons for the delay here? None of us have an s390x system to run tests on.

amarts avatar Aug 10 '22 10:08 amarts

I can see that volume start was failing at the commit phase in every test.

19:51:46 ok 5 [ 34/ 848] < 33> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume create patchy 127.1.1.1:/d/backends/1/dist1 127.1.1.1:/d/backends/1/dist2 127.1.1.2:/d/backends/2/dist3 127.1.1.2:/d/backends/2/dist4'
19:51:46 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.

Also, no logs or build artifacts were found in https://build.gluster.org/job/centos8-s390-regression/6/ to understand more about the failure. IMO the glusterd logs and console output should give us some clue.

My suggestion is to execute a simple gluster volume create and start command manually on the regression machine and find out what is actually happening. If we ever have other runs where logs are present, that can also help us. @mscherer @amarts @rakshithakamath94
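
For illustration, a minimal sketch of such a manual check (the volume name, brick path, and log grep are placeholders/assumptions; run it directly on the builder as root):

mkdir -p /d/backends/manualtest/brick1
gluster --mode=script volume create manualtest $(hostname -i):/d/backends/manualtest/brick1 force
gluster --mode=script volume start manualtest
gluster volume status manualtest
# if the start fails with "Commit failed on localhost", the reason is usually in the glusterd log
grep ' E \[' /var/log/glusterfs/glusterd.log | tail -n 20
# clean up afterwards
gluster --mode=script volume stop manualtest force
gluster --mode=script volume delete manualtest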

Shwetha-Acharya avatar Aug 10 '22 10:08 Shwetha-Acharya

My suggestion is to execute a simple gluster volume create and start command manually on the regression machine and find out what is actually happening. If we ever have other runs where logs are present, that can also help us.

It would be great if at least one test were run locally on those machines (if possible), so we can get the reason for this.

amarts avatar Aug 10 '22 11:08 amarts

@mscherer or @rakshithakamath94 might be able to run the test locally

Shwetha-Acharya avatar Aug 10 '22 11:08 Shwetha-Acharya

For now, I will increase the timeout to 4 days, and see how it goes.

mscherer avatar Aug 10 '22 12:08 mscherer

Hi @amarts ,

Thank you so much for looking into this issue.

We also tried to run the regression tests from the devel branch on local s390x VMs (8 vCPU, 16 GB vMEM, 250 GB storage). All 831 test cases completed within 10 hours, and only 10 of them failed.

Currently there are 3 s390x builders in the Gluster builder pool:

  • 148.100.84.186: (RHEL 8.4, 2 vCPUs/4GB vMem/50GB storage)
  • 148.100.84.19, 148.100.84.23 : (RHEL 8.4, 8 vCPUs/12GB vMem/120GB storage)

I think choosing the builders with the larger quota to run the regression CI might be helpful.

@Shwetha-Acharya gluster volume create and start commands worked on our local s390x VMs.

kun-lu20 avatar Aug 10 '22 15:08 kun-lu20

Hi @amarts @mscherer ,

I just checked the CI status and found that the whole regression test suite could complete after increasing the timeout, but it took nearly 2 days and many test failures occurred.

I also checked the job scripts centos8-s390.yml and regression-test-burn-in.sh, but it looks like the invoked scripts /opt/qa/build.sh and /opt/qa/regression.sh cannot be found in this repo.

In our local build for running the GlusterFS regression tests on s390x, we need to install certain deps via yum and build DBENCH and thin-provisioning-tools from source on RHEL (see step 2.1 in Building GlusterFS on s390x). I wonder if this is part of the reason for the test failures in CI.
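
For context, a rough sketch of what that local prep looks like (the package list is from our local notes, not from the CI playbooks, and the exact build steps are version-dependent):

# build deps typically installed via yum (assumed list, adjust for the builder image)
yum install -y git gcc make autoconf automake libtool libaio-devel libuuid-devel

# no s390x packages for these in our setup, so they are built from source
git clone https://github.com/sahlberg/dbench.git
git clone https://github.com/jthornber/thin-provisioning-tools.git
# then follow each project's README: roughly ./autogen.sh (or autoreconf), ./configure, make, make install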

I think running the same CI scripts in our local environment might help locate the root cause. Could you please show me where I can find the above-mentioned missing scripts so that I can run them at my end?

Thank you very much!

kun-lu20 avatar Aug 17 '22 18:08 kun-lu20

They come from https://github.com/gluster/glusterfs-patch-acceptance-tests

They're installed by our Ansible playbooks; just a checkout is enough.

mscherer avatar Aug 18 '22 16:08 mscherer

Cool. I'll try it. Thanks @mscherer !

kun-lu20 avatar Aug 18 '22 16:08 kun-lu20

And the rest of the Ansible playbooks are at https://github.com/gluster/gluster.org_ansible_configuration (if you need the rpms to install, that's somewhere in roles/jenkins_builder, along with a bunch of conditionals for distros, etc.).

mscherer avatar Aug 18 '22 19:08 mscherer

Hi all,

I've followed the steps in the build.sh and regression.sh scripts from https://github.com/gluster/glusterfs-patch-acceptance-tests to run the regression tests on a local s390x VM (similar quota to our s390x builder), but I haven't observed the volume start failure on our end.

Below is the result of one of the test cases:

======================================== (2 / 833) ========================================
[21:40:29] Running tests in file ./tests/000-flaky/basic_changelog_changelog-snapshot.t
./tests/000-flaky/basic_changelog_changelog-snapshot.t ..
1..15
losetup: /d/dev/loop*: failed to use device: No such device
losetup: /d/dev/loop*: failed to use device: No such device
ok   1 [   2638/     51] <   9> 'verify_lvm_version'
ok   2 [     36/   2050] <  10> 'glusterd'
ok   3 [     17/      8] <  11> 'pidof glusterd'
ok   4 [     30/   2372] <  13> 'setup_lvm 1'
ok   5 [     20/    232] <  15> 'gluster --mode=script --wignore volume create patchy 10.11.58.155:/d/backends/patchy_snap_drafting1_mnt'
ok   6 [     44/    388] <  17> 'gluster --mode=script --wignore volume start patchy'
ok   7 [     24/    229] <  20> 'gluster --mode=script --wignore volume set patchy changelog.changelog on'
ok   8 [     33/    253] <  21> 'gluster --mode=script --wignore volume set patchy changelog.rollover-time 3'
ok   9 [     40/    145] <  22> '_GFS --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy --volfile-server=10.11.58.155 /mnt/glusterfs/0'
ok  10 [  14850/   2281] <  50> 'gluster --mode=script --wignore snapshot create patchy-snap1 patchy no-timestamp'
ok  11 [     34/    136] <  51> 'snapshot_exists 0 patchy-snap1'
ok  12 [     55/      8] <  53> 'grep "Enabled changelog barrier" /var/log/glusterfs/bricks/d-backends-patchy_snap_drafting1_mnt.log'
ok  13 [     32/      3] <  54> 'grep "Disabled changelog barrier" /var/log/glusterfs/bricks/d-backends-patchy_snap_drafting1_mnt.log'
ok  14 [     35/     49] <  56> 'glusterfs -s 10.11.58.155 --volfile-id=/snaps/patchy-snap1/patchy /mnt/glusterfs/1'
ok  15 [     35/   1324] <  59> 'gluster --mode=script --wignore volume stop patchy force'
  Logical volume "brick_lvm" successfully removed.
  Logical volume "e834be6d11114e0bb915a7ffba7c10f6_0" successfully removed.
  Logical volume "thinpool" successfully removed.
  Volume group "patchy_snap_drafting1_vg_1" successfully removed
losetup: /d/dev/loop*: failed to use device: No such device
losetup: /d/dev/loop*: failed to use device: No such device
ok
All tests successful.
Files=1, Tests=15, 35 wallclock secs ( 0.02 usr  0.01 sys +  2.25 cusr 10.06 csys = 12.34 CPU)
Result: PASS
Logs preserved in tarball basic_changelog_changelog-snapshot-iteration-1.tar.gz
End of test ./tests/000-flaky/basic_changelog_changelog-snapshot.t
================================================================================

When comparing it to the same test case in the CI test log (https://build.gluster.org/job/centos8-s390-regression/59/consoleFull), I noticed some odd failure info in the CI log:

10:44:10 ok   3 [    137/    105] <  11> 'pidof glusterd'
10:44:10 fallocate: fallocate failed: No space left on device
10:44:10 losetup: /d/backends/patchy_snap_builder_el8_s390x_1_vhd: Warning: file is smaller than 512 bytes; the loop device may be useless or invisible for system tools.
10:44:10   Cannot use /d/backends/patchy_snap_builder_el8_s390x_1_loop: device is too small (pv_min_size)
10:44:10   Cannot use /d/backends/patchy_snap_builder_el8_s390x_1_loop: device is too small (pv_min_size)
10:44:10   Command requires all devices to be found.
10:44:10   Volume group "patchy_snap_builder_el8_s390x_1_vg_1" not found
10:44:10   Cannot process volume group patchy_snap_builder_el8_s390x_1_vg_1
10:44:10   Volume group "patchy_snap_builder_el8_s390x_1_vg_1" not found
10:44:10   Cannot process volume group patchy_snap_builder_el8_s390x_1_vg_1
10:44:10 Error accessing specified device /dev/patchy_snap_builder_el8_s390x_1_vg_1/brick_lvm: No such file or directory
... ...
10:44:10 mount: /d/backends/patchy_snap_builder_el8_s390x_1_mnt: special device /dev/patchy_snap_builder_el8_s390x_1_vg_1/brick_lvm does not exist.

It looks like certain device space setup isn't working properly. Also, volume start still failed:

10:44:10 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:44:10 not ok   6 [    232/   1577] <  17> 'gluster --mode=script --wignore volume start patchy' -> ''

Could you please run some simple tests on the regression machine as mentioned by @Shwetha-Acharya and @amarts to see what was happening?

Thank you very much! @mscherer @rakshithakamath94

kun-lu20 avatar Sep 15 '22 15:09 kun-lu20

Hi @mscherer @rakshithakamath94 , could you please take a look when you have some time? Thank you!

kun-lu20 avatar Sep 21 '22 19:09 kun-lu20

I do not really know where to start. And there are already logs for the build; is there a specific part that I would see that can't be seen in either the logs in the tarball or the test output?

mscherer avatar Sep 27 '22 13:09 mscherer

Hi @mscherer , Thank you very much!

A lot of volume start failures can be seen in the CI test log, as shown below; the same tests pass on our local s390x VMs.

10:44:10 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:44:10 not ok   6 [    232/   1577] <  17> 'gluster --mode=script --wignore volume start patchy' -> ''

Could you please run a single test case (such as ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t) on the s390x builder to check why this happens?
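
For reference, a minimal sketch of how a single .t file can be run outside the full harness (the checkout path is hypothetical; the .t files are bash scripts, so prove needs an explicit interpreter):

cd /path/to/glusterfs            # hypothetical location of the glusterfs checkout on the builder
prove -v --exec /bin/bash ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t
# then check the per-test glusterd logs under /var/log/glusterfs/ for the failed commit phase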

Also, I found the following info in today's building log:

09:47:19 libtool: install: /usr/bin/install -c .libs/gluster-setgfid2path /build/install/sbin/gluster-setgfid2path
09:47:19 /usr/bin/install: cannot create regular file '/build/install/sbin/gluster-setgfid2path': No space left on device

I think we might need to free some space on the s390x builders.

Thanks again!

kun-lu20 avatar Sep 27 '22 14:09 kun-lu20

So one builder was in a weird state (the 2nd one); I rebooted it.

mscherer avatar Sep 27 '22 15:09 mscherer

The 3rd was also in the same state. /d was full, but du showed only 300 MB used. And several gluster processes were lingering.
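
A df/du mismatch like that usually means deleted files are still held open by lingering processes. A minimal sketch for checking and cleaning that up (run as root on the builder; /d is the regression scratch filesystem, and the grep filter is an assumption):

df -h /d && du -sh /d
lsof +L1 | grep '/d/'        # deleted-but-open files that still hold space
pgrep -af gluster            # lingering gluster daemons from aborted runs
losetup -a                   # stale loop devices left behind by the snapshot tests
# after confirming nothing legitimate is running:
pkill -f gluster
losetup -D                   # detach all loop devices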

mscherer avatar Sep 27 '22 15:09 mscherer

Hi @mscherer ,

I checked the logs of the 3rd s390x builder after you rebooted it. The volume could be created successfully but still failed to start. There were also other failures, such as peer probe: failed, in the log.

10:17:27 ======================================== (3 / 833) ========================================
10:17:27 [14:17:27] Running tests in file ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t
10:18:43 ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t .. 
10:18:43 1..38
10:18:43 losetup: /d/dev/loop*: failed to use device: No such device
10:18:43 ok   1 [    219/   3766] <  26> 'launch_cluster 3'
10:18:43 peer probe: failed: Probe returned with Transport endpoint is not connected
10:18:43 not ok   2 [     13/   1148] <  27> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log peer probe 127.1.1.2' -> ''
10:18:43 peer probe: failed: Probe returned with Transport endpoint is not connected
10:18:43 not ok   3 [     12/   1107] <  28> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log peer probe 127.1.1.3' -> ''
10:18:43 not ok   4 [     16/  60091] <  29> '2 peer_count' -> 'Got "0" instead of "2"'
10:18:43 ok   5 [     15/    103] <  33> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume create patchy 127.1.1.1:/d/backends/1/dist1 127.1.1.1:/d/backends/1/dist2 127.1.1.2:/d/backends/2/dist3 127.1.1.2:/d/backends/2/dist4'
10:18:43 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:18:43 not ok   6 [     13/   1119] <  35> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume start patchy' -> ''

I suspect it was caused by some environment setup issue, since these failures didn't occur on our local s390x VMs. Could you please take a look? Thanks very much for your time and effort.

kun-lu20 avatar Sep 30 '22 15:09 kun-lu20

Hi @amarts @mscherer ,

Could you please check this issue again? Thank you so much!

kun-lu20 avatar Oct 18 '22 17:10 kun-lu20

I think that's outside of my competency with the Gluster code base :/

mscherer avatar Oct 25 '22 16:10 mscherer

Thanks for your time and efforts @mscherer .

Hi @amarts , could you please take a look when you have some time? Thank you.

kun-lu20 avatar Oct 27 '22 15:10 kun-lu20

On all of the s390x regression runs, I see the following error at the very beginning:

/opt/qa/regression.sh: line 47: python: command not found

I can see all the usual locations present in the PATH variable. Is python installed on the system? If yes (in case it's python3), maybe it is not symlinked as python?
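
If it is just the missing symlink, a sketch of the usual fix on EL8 (assuming python3 is already installed, as the builders are RHEL 8):

command -v python python3          # see what is currently on PATH
# RHEL/CentOS 8: point the unversioned "python" command at python3
alternatives --set python /usr/bin/python3
python --version                   # verify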

/cc: @mscherer

Regards

black-dragon74 avatar Nov 04 '22 09:11 black-dragon74

I have applied a temporary local change to see if resolving this error would make the job work - https://build.gluster.org/job/centos8-s390-regression/73/console - and the error is no longer showing in the logs.

Also, I noticed the same kind of error on centos8-regression, but that job runs fine.

rakshithakamath94 avatar Nov 04 '22 09:11 rakshithakamath94

Also, I noticed the same kind of error on centos8-regression, but that job runs fine.

Agreed.

I also see many losetup-related errors. Is the loop module loaded into the kernel? You can verify with:

grep 'loop' /lib/modules/$(uname -r)/modules.builtin

I think this module is absent or not loaded; otherwise we would have loop devices under /dev (the logs show /dev/loop*: no such device).
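
Beyond the builtin check, a quick sketch for verifying that loop devices can actually be created (assumes root on the builder; the file name is arbitrary):

modprobe loop                                  # no-op if the module is builtin or already loaded
truncate -s 100M /tmp/loop-test.img
losetup -f --show /tmp/loop-test.img           # prints e.g. /dev/loop0 on success
losetup -a                                     # list active loop devices
losetup -d "$(losetup -j /tmp/loop-test.img | cut -d: -f1)" && rm -f /tmp/loop-test.img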

P.S.: SSH access to the machine would help and would eliminate the need for me to ask such questions :P

Regards

black-dragon74 avatar Nov 04 '22 12:11 black-dragon74