rustup icon indicating copy to clipboard operation
rustup copied to clipboard

Checksum download order interacts poorly with Artifactory caches

Open sfackler opened this issue 4 years ago • 5 comments

I'm trying to set up an internal caching mirror of static.rust-lang.org with Artifactory, but am running into a bad interaction between assumptions Artifactory makes around checksum files and the way rustup downloads files.

Specifically, when downloading e.g. foo.tar.gz, rustup will always download the foo.tar.gz.sha256 file first. It uses this to both validate that it properly downloaded the file, but also to avoid downloading the file at all if it already has a copy and the checksums match.

Unfortunately, even when set up to just act as a transparent cache around another server, Artifactory doesn't treat .sha256 files like just another "normal" file. It instead serves its own internal hash of the file at that URL. On its own, this wouldn't be a huge deal. The format of the file Artifactory uses isn't quite the same (it doesn't include the filename), but rustup should still be able to work with it just fine. However, if you request the checksum of a file Artifactory hasn't cached yet, it will return a 404. Since rustup always downloads the checksum file first, this is a problem.

This feels like a bug in Artifactory (it should at the very least treat a request to a checksum URL as a request to download the actual file), but it's not clear to me that the behavior there is going to change any time soon and Artifactory is widely enough used that it might be worth looking into changing rustup to work around this behavior.

Specifically, I think we can change rustup to use etags to avoid redownloading files rather than comparing checksums:

  1. Rather than storing the sha256 hashes in the update-hashes directory, we store the value of the Etag header attached to the response when we download the actual file.
  2. If we have a cached etag, we send an If-None-Match header to the server with the saved etag.
  3. The server will respond with a 304 Not Modified if the etag we send matches the etag of its file and won't send the actual file contents back.
  4. If we did need to download the file, we download the checksum file from the server and validate.

I've already checked that static.rust-lang.org does attach etags and respects If-None-Match.

sfackler avatar Oct 03 '21 14:10 sfackler

Importantly this is not a bug in rustup but in Artifactory itself. I'm happy for us to look at ways to improve rustup so that it doesn't trigger this problem, but I refuse to consider it a bug in rustup itself.

We use checksum files for more than avoiding re-downloading files, though that is the primary use for them in the manifest case. I would have no problem storing the ETag into perhaps $RUSTUP_HOME/update-hashes/$CHANNEL.etag and then using that if present instead of trying to download the hash instead. This would be a fairly isolated change and likely moderately easy to test for.

kinnison avatar Oct 09 '21 10:10 kinnison

I would appreciate if @kinnison's idea would be evaluated. I do not expect a solution from Artifactory: https://jfrog.atlassian.net//si/jira.issueviews:issue-html/RTFACT-26199/RTFACT-26199.html

MichaelKorn avatar Aug 23 '23 10:08 MichaelKorn

@MichaelKorn would you be able to contribute to getting this fixed? The rustup team is currently pretty underresourced so it would be great if you could help out with this -- we're happy to help guide you if you have specific questions.

djc avatar Aug 23 '23 12:08 djc

@djc to be honest, I'll see what the demand is first. Unfortunately I have to implement a workaround anyway: the Generic Repository in Artifactory cannot refresh any files either. So I will now delete the .tomls via cronjob. As a workaround for this issue, I will use curl to force the files back into the cache immediately after deleting them. @sfackler How did you solve the refresh/update issue?

MichaelKorn avatar Aug 25 '23 07:08 MichaelKorn

@MichaelKorn In our case we wrote a small artifactory plugin, which does check if the remote file is newer and expires the file if the remote timestamp is newer.
Attaching the Plugin in the codeblock below. (genericCacheUpdate.groovy) (But, artifactory will deprecate plugins soon in favor of "workers", thus it may not be the best option)

I'll open up another ticket at jfrog, referencing this. Potentially a small plugin in artifactory, which will attempt to download the file, if a request to a non-existing sha is made could be the way to go for an automatic solution if jfrog doesnt want to fix this.

genericCacheUpdate.groovy Plugin Code
# Many imports are not necessary, but I didnt clean it up, so do whatever you want with this :D 
import groovy.transform.Field
import org.artifactory.request.Request
// HttpArtifactoryRequest is when recieving general request via /artifactory/repo/path
import org.artifactory.webapp.servlet.HttpArtifactoryRequest
// InternalArtifactoryRequest is when recieving request via /artifactory/api/apitype/repo/path
import org.artifactory.api.request.InternalArtifactoryRequest
import org.artifactory.repo.RepoPath
import org.artifactory.model.common.RepoPathImpl
import org.artifactory.repo.RepositoryConfiguration
import org.artifactory.repo.HttpRepositoryConfigurationImpl
import org.artifactory.repo.RepositoryConfigurationBase
import org.artifactory.repo.Repositories

import java.util.Map

import java.lang.Exception

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// logger class 
import ch.qos.logback.classic.Logger

// security class
import org.artifactory.addon.plugin.services.SecurityImpl

// repository class
import org.artifactory.addon.plugin.services.RepositoriesImpl
import org.artifactory.fs.ItemInfo
import org.jfrog.build.api.repository.RepositoryConfig

@Field final TOKEN = "" // token to login to the remote artifactory, reading potential passwords from repository config is impossible

private void log_error(String repo, String target, String event, String message) {
  SecurityImpl s = security
  String requestingUser = s.currentUsername()

  String errorString = "user=\"${requestingUser}\""
  if (repo != null) {
    errorString += " repo=\"${repo}\""
  }
  if (target != null) {
    errorString += " target=\"${target}\""
  }
  if (event != null) {
    errorString += " event=\"${event}\""
  }
  if (message != null) {
  errorString += " details=\"${message}\""
  }
  if (errorString.length() > 0) {
    log.error(errorString)
  }
}

private String filterTraceParamFromQueryString(String queryParam) {

  String filtered = queryParam.replace("trace", "")
      filtered = filtered.replace("?&", "?") //fix leading trace with more params after
      filtered = filtered.replace("&&", "&") // quickfix for removed trace param, if param is used in conjunction with others. 
      if (filtered.endsWith("?")) {
        filtered = "" // if ends with ? there are no params left. 
      }
      if (filtered.endsWith("&")) {
        filtered = filtered.substring(0, filtered.length() - 1);
      }
  return filtered
}

private String errorLineFromStacktrace(String stacktrace) {
  // match pattern "genericCacheUpdate\.groovy:\d+"

  Pattern pattern = Pattern.compile("genericCacheUpdate\\.groovy:\\d+")
  Matcher matcher = pattern.matcher(stacktrace)

  String errorLine = "??"
  if (matcher.find()) {
    errorLine = matcher.group()
  }
  
  return " in line ${errorLine}"
}


download {
  beforeDownloadRequest { Request request, RepoPath repoPath ->

    try {
      String repoKey = repoPath.getRepoKey()
      Repositories r = repositories

      RepositoryConfiguration rc = r.getRepositoryConfiguration(repoKey)
      HttpRepositoryConfigurationImpl rcb = rc

      // docker|debian|...
      String packageType = rc.getPackageType()
      // remote|local
      String repoType = rc.getType()

      if (packageType != "generic" || repoType != "remote") {
        return
      }

      //return if file is not cached
      if(!r.exists(repoPath)) {
        return
      }

      // if NoLastModifiedHeader is 1, return
      String noLastModifiedHeader = r.getProperty(repoPath, "NoLastModifiedHeader")
      if (noLastModifiedHeader && noLastModifiedHeader == "1") {
        // previous check did result in a hit, but no last modified header was found
        // returning to prevent unnecessary check
        return
      }

      // get cached ressource last modified date

      ItemInfo fileInfo = r.getFileInfo(repoPath)
      // also check property "lastModifyCheck" on item
      String lastModifiedlastCheckProperty = r.getProperty(repoPath, "lastModifiedLastCheck")

      long lastModifiedlastCheck = lastModifiedlastCheckProperty ? lastModifiedlastCheckProperty.toLong() : -1
      
      // if lastModifiedCheck is set, use the property instead of lastModifiedTimestamp

      long lastModifiedTimestamp = fileInfo.getLastModified()
      if (!lastModifiedlastCheck || lastModifiedlastCheck <= 0 ) {
        lastModifiedlastCheck = lastModifiedTimestamp
      }
      
      long currentTimestamp = new Date().getTime()

      // get repo "Metadata Retrival Cache Period" time
      long mrcp = rc.getRetrievalCachePeriodSecs()*1000

      if (lastModifiedlastCheck + mrcp > currentTimestamp) {
        // were not yet over "check period", ignore request, return cache.
        return
      }

      log_error(repoKey, repoPath.getPath(), "newerFileCheck", "Checking for newer file in remote repository.")

      // send head request to ressource remote

      String remoteBaseURL = rc.getUrl()

      // check if request had query params
      String queryParam = request.httpRequest.getQueryString()
      boolean isP = queryParam && queryParam.length() > 0
      
      String queryString = (isP && rcb.propagateQueryParams) ? "?${queryParam}" : ""
      queryString = filterTraceParamFromQueryString(queryString)

      if(isP && rcb.propagateQueryParams) {
        log.error("Propagating query Params: ${queryString}")
      }
      URL req = new URL("${remoteBaseURL}/${repoPath.getPath()}${queryString}")
      URLConnection conn = req.openConnection()

      if (rc.isBypassHeadRequests()) {
        conn.setRequestMethod("GET")
      }
      else {
        conn.setRequestMethod("HEAD")
      }
      // since we dont have access to the request, we cannot use the basic auth from the repository, as password is encrypted by artifactory. 
      // using hard coded general credentials for now.
      conn.addRequestProperty("Authorization", "Bearer ${TOKEN}")
      conn.getInputStream().close();

      Map resHeader = conn.getHeaderFields()

      // headers are Map<String, List<String>>
      // some Websites may return "Last-Modified" instead of "last-modified", and java is case sensitive, so we need to check both.
      String[] lastModifiedResponse = resHeader.getOrDefault("last-modified", null)
      if (!lastModifiedResponse) {
        lastModifiedResponse = resHeader.getOrDefault("Last-Modified", null)
      }

      if (!lastModifiedResponse) {
        log_error(repoPath.getRepoKey(), repoPath.getPath(), "noLastModifiedHeader", "No Last-Modified Header returned from remote server.")
        r.setProperty(repoPath, "NoLastModifiedHeader", "1" )
        log.error("property for ${repoPath.getPath()} set to 1")
        return
      }

      // last modified header is string in the format: "Wed, 21 Oct 2015 07:28:00 GMT" convert to timestamp
      long lastModifiedResponseTimestamp = new Date(lastModifiedResponse[0]).getTime()

      // set to expired, so that artifactory tries to fetch the new file.
      if (lastModifiedResponseTimestamp > lastModifiedTimestamp) {
        log_error(repoPath.getRepoKey(), "${repoPath.getPath()}${queryString}", "forceExpireArtifact", "Remote Artifact is newer, marking as expired to force download.")
        expired = true
      }

      // set lastCheck property to current time
      lastModifiedlastCheck = new Date().getTime()
      r.setProperty(repoPath, "lastModifiedLastCheck", (String) lastModifiedlastCheck)
      log.error("lastModifiedLastCheck set to ${lastModifiedlastCheck}")

    }
    catch(Exception e) {
      log_error(repoPath.getRepoKey(), repoPath.getPath(), "genericPluginException", "${e.getMessage()} → ${errorLineFromStacktrace(e.getStackTrace().toString())}")
    }

  }
}

Bloodiko avatar Jul 30 '24 14:07 Bloodiko

There is a "Retrieve SHA256 from Remove Server" option for remote generic repositories in Aritfactory. When set, instead of a 404 for .sha256 path the result is a hash, e.g. 5ffe190473b7896d1f39e9d0ddfa04bec72000f25897669bb296814e10ceba42.

Will update if I hear back that it is confirmed this fixes things on the rustup side.

itamarst avatar Apr 16 '25 14:04 itamarst

That option seems to work!

sfackler avatar Apr 22 '25 11:04 sfackler