Add retry to getDependencies downloadFile
We've seen intermittent problems on some nodes (e.g. Solaris) where curl fails with rc=18 (https://github.com/adoptium/infrastructure/issues/4119). This PR adds a x10 retry to downloadFile to try to mitigate this.
11:50:14 [exec] Starting download third party dependent jars
11:50:14 [exec] --------------------------------------------
11:50:14 [exec] downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib
11:50:14 [exec] downloading -L https://download.dacapobench.org/chopin/dacapo-23.11-MR2-chopin-minimal.zip
11:50:14 [exec] download attempt 1 for -L https://download.dacapobench.org/chopin/dacapo-23.11-MR2-chopin-minimal.zip
11:50:14 [exec] --> file downloaded to /export/home/vagrant/aqa-tests/TKG/../TKG/lib/dacapo.zip
11:50:14 [exec] downloaded dependent third party jars successfully
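For readers who want the shape of the change without opening the diff, here is a minimal sketch of a bounded retry around a download command. It is not the code in this PR: the function name `download_with_retry`, its parameters, and the injectable `run`/`sleep` hooks are hypothetical and exist only to make the idea concrete and testable.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def download_with_retry(
    cmd: Sequence[str],
    max_attempts: int = 10,
    delay_seconds: float = 1.0,
    run: Callable[[Sequence[str]], int] = _run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0 or max_attempts is used up.

    Returns 0 on success, otherwise the exit code of the last attempt.
    All names here are illustrative assumptions, not the PR's code.
    """
    rc = 1  # returned unchanged only if max_attempts < 1
    for attempt in range(1, max_attempts + 1):
        print(f"download attempt {attempt} for {' '.join(cmd[1:])}")
        rc = run(cmd)
        if rc == 0:
            return 0
        # Pause between attempts, but not after the final one.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return rc
```

A caller would pass the full command line, for example `download_with_retry(["curl", "-L", "-o", "dacapo.zip", url])`, and treat any non-zero return value as a failed download.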
--retry on its own does not work here. It needs --retry 5 --retry-all-errors; otherwise curl gives up in a situation like this:
Attempt o
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
38 1368M 38 525M 0 0 1918k 0 0:12:10 0:04:40 0:07:30 1944k
curl: (56) Recv failure: Connection reset by peer
Warning: Problem (retrying all errors). Will retry in 1 second. 5 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
78 1368M 78 1071M 0 0 1960k 0 0:11:54 0:09:19 0:02:35 1909k
curl: (18) end of response with 310561138 bytes missing
Warning: Problem (retrying all errors). Will retry in 1 second. 4 retries left.
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1368M 100 1368M 0 0 1915k 0 0:12:11 0:12:11 --:--:-- 1874k
Attempt p
Ref: https://everything.curl.dev/usingcurl/downloads/retry.html
--retry-all-errors
Maybe, but I've not got access to test on all platforms. Does "--retry 10 --retry-all-errors" work on zOS? @llxia
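One way to handle the platform question is to add --retry-all-errors only when the local curl is new enough. The sketch below assumes the option first appeared in curl 7.71.0, which is what the curl documentation states as far as I know, and that the first line of `curl --version` starts with `curl X.Y.Z`. The helper names are hypothetical and nothing here has been tried on zOS.

```python
import re
from typing import List, Tuple

# Assumption: --retry-all-errors was introduced in curl 7.71.0.
RETRY_ALL_ERRORS_MIN: Tuple[int, int, int] = (7, 71, 0)


def parse_curl_version(version_output: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from the first line of `curl --version`."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognised curl --version output")
    return (int(match.group(1)), int(match.group(2)), int(match.group(3)))


def curl_retry_args(version_output: str, retries: int = 5) -> List[str]:
    """Build retry flags, adding --retry-all-errors only if curl supports it."""
    args = ["--retry", str(retries)]
    if parse_curl_version(version_output) >= RETRY_ALL_ERRORS_MIN:
        args.append("--retry-all-errors")
    return args
```

On an older curl this falls back to plain --retry, so a wrapper-level retry loop would still be needed there for errors such as rc=18.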
We are modifying the getDependency workflow. See #758. In the improved version, we will only download the 3rd-party libs as needed. For example, if we do not run dacapo test, we will not download dacapo jar.
jdk8u-solaris-sparcv9-temurin-simpletest does not leverage our logic in the AQA test pipeline for pre-staged 3rd-party libs on the machine. It seems to download them to the workspace:
downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib
This results in the libs being downloaded for every run. With #758, this PR will make the download 20x instead of 10x. Even 10x is excessive; I would like to know what the root cause of this is. In the following example, json-simple-1.1.1.jar cannot be downloaded after 10x.
[exec] downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib
[exec] downloading https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 1 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 2 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 3 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 4 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 5 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 6 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 7 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 8 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 9 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] download attempt 10 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
[exec] % Total % Received % Xferd Average Speed Time Time Time Current
[exec] Dload Upload Total Spent Left Speed
[exec]
[exec] ERROR: downloading https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar failed, return code: 99
[exec] 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 23931 100 23931 0 0 290k 0 --:--:-- --:--:-- --:--:-- 295k
BUILD FAILED
/export/home/vagrant/aqa-tests/TKG/scripts/build_tools.xml:58: The following error occurred while executing this line:
/export/home/vagrant/aqa-tests/TKG/scripts/getDependencies.xml:121: exec returned: 9
https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simpletest/133/console
- Firstly, ignore the json-simple failure in that job, as that was me testing an early version with a mistake in my logic!
- I'm not sure #758 will help us in our Solaris scenario, as the problem is intermittent hardware/network issues. The point here is to make downloadFile() more robust, just as we already do in get.sh for various commands, including curl: https://github.com/adoptium/aqa-tests/blob/c5b9b68617e160f1e6373ee243cc572fdf44f36b/get.sh#L624C1-L624C20
- We can make it x5, like in get.sh?
- The underlying cause is documented here https://github.com/adoptium/infrastructure/issues/4119
The intermittent problems occur on some nodes (e.g. Solaris) and do not happen to specific jars. I think this PR addresses a different issue from https://github.com/adoptium/TKG/pull/758 @llxia
@llxia Is this acceptable as described? thanks
@annaibm could you please test it internally? Thanks