Add retry to getDependencies downloadFile

Open andrew-m-leonard opened this issue 1 month ago • 3 comments

We've seen intermittent problems on some nodes (e.g. Solaris) where curl fails with rc=18 (https://github.com/adoptium/infrastructure/issues/4119). This PR adds a 10x retry to downloadFile to try to mitigate this.

11:50:14      [exec] Starting download third party dependent jars
11:50:14      [exec] --------------------------------------------
11:50:14      [exec] downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib
11:50:14      [exec] downloading -L https://download.dacapobench.org/chopin/dacapo-23.11-MR2-chopin-minimal.zip
11:50:14      [exec] download attempt 1 for -L https://download.dacapobench.org/chopin/dacapo-23.11-MR2-chopin-minimal.zip
11:50:14      [exec] --> file downloaded to /export/home/vagrant/aqa-tests/TKG/../TKG/lib/dacapo.zip
11:50:14      [exec] downloaded dependent third party jars successfully

andrew-m-leonard avatar Oct 31 '25 09:10 andrew-m-leonard

--retry on its own does not work for this situation. It needs --retry 5 --retry-all-errors; otherwise curl gives up in a situation like this:

Attempt o
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 38 1368M   38  525M    0     0  1918k      0  0:12:10  0:04:40  0:07:30 1944k
curl: (56) Recv failure: Connection reset by peer
Warning: Problem (retrying all errors). Will retry in 1 second. 5 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
 78 1368M   78 1071M    0     0  1960k      0  0:11:54  0:09:19  0:02:35 1909k
curl: (18) end of response with 310561138 bytes missing
Warning: Problem (retrying all errors). Will retry in 1 second. 4 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1368M  100 1368M    0     0  1915k      0  0:12:11  0:12:11 --:--:-- 1874k
Attempt p

Ref: https://everything.curl.dev/usingcurl/downloads/retry.html
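For illustration, a minimal sketch of how those flags could be wrapped in a helper. The function name `download_with_curl_retry`, its argument order, and the default of 5 retries are hypothetical and not taken from TKG; `--retry-all-errors` needs curl 7.71.0 or newer, so this would not work on nodes with an older curl.

```shell
# Hypothetical helper (name and argument order are illustrative only).
# Usage: download_with_curl_retry URL DEST [RETRIES]
download_with_curl_retry() {
    dl_url=$1
    dl_dest=$2
    dl_retries=${3:-5}   # assumed default, matching the "--retry 5" above
    # --retry alone only retries errors curl treats as transient;
    # --retry-all-errors (curl 7.71.0+) also retries failures such as
    # rc=18 (partial file) and rc=56 (connection reset).
    curl -L --retry "$dl_retries" --retry-all-errors -o "$dl_dest" "$dl_url"
}
```

A call such as `download_with_curl_retry "$url" "$dest" 10` would correspond to the "--retry 10 --retry-all-errors" variant.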

sxa avatar Oct 31 '25 16:10 sxa

--retry-all-errors

Maybe, but I don't have access to test on all platforms. Does "--retry 10 --retry-all-errors" work on zOS? @llxia

andrew-m-leonard avatar Nov 03 '25 09:11 andrew-m-leonard

  • We are modifying the getDependency workflow. See #758. In the improved version, we will only download the 3rd-party libs as needed. For example, if we do not run the dacapo test, we will not download the dacapo jar.

    • jdk8u-solaris-sparcv9-temurin-simpletest does not leverage the AQA test pipeline logic that uses pre-staged 3rd-party libs on the machine. It seems to download them into the workspace (downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib). This results in the libs being downloaded on every run.

    • With #758, this PR will make the download 20x instead of 10x. Even 10x is excessive; I would like to know what the root cause of this is. In the following example, json-simple-1.1.1.jar cannot be downloaded even after 10 attempts.

[exec] downloading dependent third party jars to /export/home/vagrant/aqa-tests/TKG/../TKG/lib
     [exec] downloading https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 1 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 2 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 3 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 4 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 5 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 6 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 7 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 8 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 9 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec] download attempt 10 for https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar
     [exec]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
     [exec]                                  Dload  Upload   Total   Spent    Left  Speed
     [exec] 
     [exec] ERROR: downloading https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar failed, return code: 99
     [exec]   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 23931  100 23931    0     0   290k      0 --:--:-- --:--:-- --:--:--  295k

BUILD FAILED
/export/home/vagrant/aqa-tests/TKG/scripts/build_tools.xml:58: The following error occurred while executing this line:
/export/home/vagrant/aqa-tests/TKG/scripts/getDependencies.xml:121: exec returned: 9

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simpletest/133/console

  • Firstly, ignore the json-simple failure in that job, as that was me testing an early version with a mistake in my logic!
  • I'm not sure #758 will help us in our Solaris scenario, as the problem is intermittent hardware/network issues. The point here is to make downloadFile() more robust, just as we already do in get.sh with various commands and curl: https://github.com/adoptium/aqa-tests/blob/c5b9b68617e160f1e6373ee243cc572fdf44f36b/get.sh#L624C1-L624C20
  • We could make it 5x, as in get.sh?
  • The underlying cause is documented here https://github.com/adoptium/infrastructure/issues/4119
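
As a rough sketch of the kind of capped retry being discussed (not the actual downloadFile or get.sh code), a generic wrapper could look like the following. The name `retry_cmd`, its parameters, and the fixed delay are hypothetical; unlike `--retry-all-errors`, it does not depend on the curl version installed on the node.

```shell
# Hypothetical wrapper: run a command up to MAX times, stopping at the
# first success. Usage: retry_cmd MAX DELAY_SECONDS command [args...]
retry_cmd() {
    rt_max=$1
    rt_delay=$2
    shift 2
    rt_attempt=1
    while :; do
        echo "download attempt $rt_attempt of $rt_max: $*" >&2
        if "$@"; then
            return 0
        else
            rt_rc=$?   # exit status of the failed command
        fi
        if [ "$rt_attempt" -ge "$rt_max" ]; then
            echo "giving up after $rt_max attempts, return code: $rt_rc" >&2
            return "$rt_rc"
        fi
        rt_attempt=$((rt_attempt + 1))
        sleep "$rt_delay"
    done
}
```

With this, a 5x cap would be `retry_cmd 5 1 curl -L -o "$dest" "$url"`, where `$dest` and `$url` stand in for whatever the caller passes.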

andrew-m-leonard avatar Nov 04 '25 11:11 andrew-m-leonard

The intermittent problems occur on some nodes (e.g. Solaris) and are not specific to particular jars. I think this PR addresses a different issue from https://github.com/adoptium/TKG/pull/758 @llxia

sophia-guo avatar Dec 04 '25 14:12 sophia-guo

@llxia Is this acceptable as described? Thanks

andrew-m-leonard avatar Dec 04 '25 15:12 andrew-m-leonard

@annaibm could you please test it internally? Thanks

llxia avatar Dec 04 '25 15:12 llxia