heal-info: cli command failed with Client.Timeout exceeded error
Observed behavior
The glustercli heal info command failed with "Client.Timeout exceeded while awaiting headers" when there were a lot of entries to heal (almost 3k):
[root@dhcp42-161 glusterfs]# glustercli volume heal info patchy
Failed to get heal info for volume patchy
Get http://127.0.0.1:24007/v1/volumes/patchy/heal-info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Expected/desired behavior
Heal info should not fail; at the very least it should print the entries.
Details on how to reproduce (minimal and precise)
- Create a 3-way replicate volume and start it
- Mount the volume
- Kill one brick out of the 3
- Start I/O on the mount (I did a Linux kernel untar)
- After the I/O finishes, enable self-heal and launch a full heal
- Execute the heal-info command
Information about the environment:
- Glusterd2 version used (e.g. v4.1.0 or master): Master
- Operating system used: Fedora 29
- Glusterd2 compiled from sources, as a package (rpm/deb), or container: source
- Using External ETCD: (yes/no, if yes ETCD version):
- If container, which container image:
- Using kubernetes, openshift, or direct install:
- If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside:
The heal info command can take a long time to finish, and the default timeout for the heal command in the GD1 CLI is 10 minutes (see cli_cmd_submit()). We need a way to make the client timeout configurable.
@rafikc30 can you use the --timeout option in glustercli?
The default timeout is 30 seconds; the user can adjust it with the --timeout flag in glustercli.
client logs
Failed to get heal info for volume volrep100
Get http://127.0.0.1:24007/v1/volumes/volrep100/heal-info: EOF
server logs
.txt 71f255cb-c506-44a8-873b-e663cd64457c} {/linux-2.4.19/Documentation/zorro.txt b3a64e69-4a18-4ce8-a036-8c735dc95906} {/linux-2.4.19/Documentation/dnotify.txt a14db79f-9758-4914-a34d-ee1d19ea3bc7} {/linux-2.4.19/Documentation/mkdev.cciss a1f7bc16-740c-45a1-acdd-efe4149a264f} {/linux-2.4.19/Documentation/SubmittingPatches 5ee69530-90b2-4565-8d20-84d85687c94e} {/linux-2.4.19/Documentation/parisc 51dbf25c-d849-4824-b6c3-ddbfb12ae853} {/linux-2.4.19/Documentation/parisc/00-INDEX c38a6cdd-7ede-4bc7-8aa8-bd21ba86c870} {/linux-2.4.19/Documentation/parisc/IODC.txt 98ee5074-73d2-489e-bacd-152f08a9a4f2} {/linux-2.4.19/Documentation/parisc/debugging 008281dc-8bfb-4a01-af9f-0f0c916d112f} {/linux-2.4.19/Documentation/parisc/registers 7004d041-975a-4c0b-b58d-315fa6a5d6c2} {/linux-2.4.19/Documentation/cris 0b1e2622-7ff9-4ac1-9af4-a22548bf1782} {/linux-2.4.19/Documentation/cris/README 3f7871c0-1b2d-4767-ba3c-c882b33bf0f4} {/linux-2.4.19/Documentation/SAK.txt d83b83a3-1f14-42a7-9bb9-d1a4f139f58b} {/linux-2.4.19/Documentation/mips 5fe068d2-d535-40b0-aee0-78d2a760b789} {/linux-2.4.19/Documentation/mips/GT64120.README cbb0c9ae-d4d4-45c2-8816-15021f0efb3b} {/linux-2.4.19/Documentation/mips/time.README baa61a78-41ef-4585-a60c-447c3bfcd098} {/linux-2.4.19/Documentation/mips/pci c3667b38-6f9b-4d0e-bf47-acc18c00de2f} {/linux-2.4.19/Documentation/mips/pci/pci.README 787efd80-f696-4e40-85d1-84fad8a796f6} {/linux-2.4.19/Documentation/power fd0df8f6-2c49-4adc-9bc0-14951fb2964b} {/linux-2.4.19/Documentation/power/pci.txt f004ea91-5ce5-4434-8a51-c93d9b76f915} {/linux-2.4.19/Documentation/README.nsp_cs.eng e298ce62-eb95-4496-b7f5-9c413dfd6d17} {/linux-2.4.19/REPORTING-BUGS 4c7f88fd-6472-4a29-969b-7104057518d9}]}] error="write tcp 127.0.0.1:24007->127.0.0.1:53628: i/o timeout" reqid=78a7e01e-7d79-4647-a142-348f4ff7280f source="[utils.go:47:utils.SendHTTPResponse]"
INFO[2018-12-12 07:44:59.830510] 127.0.0.1 - - [12/Dec/2018:12:56:24 +0530] "GET /v1/volumes/volrep100/heal-info HTTP/1.1" 200 3769 reqid=78a7e01e-7d79-4647-a142-348f4ff7280f
Not within the scope of GCS/1.0. In any case, this does not look like a bug if we use a higher timeout option through the CLI. What exactly happens here for a REST call is what we should figure out.