[Update Center] generate HTML pages with absolute links

Open dduportal opened this issue 1 year ago • 5 comments

As described in https://github.com/jenkins-infra/helpdesk/issues/2649#issuecomment-2380569628, the HTML files generated by jenkins-infra/update-center2 use relative links.

This was a good technique in the past, when both updates.jenkins-ci.org and updates.jenkins.io served files.

But it is now an issue with the new Update Center system, which uses HTTP(S) mirrors to serve content to end users in order to:

  • Limit the outbound bandwidth on the Jenkins Infra main clouds as it is really expensive
  • Ensure content is served as fast as possible to provide a better user experience. It's particularly visible in China as per https://github.com/jenkins-infra/helpdesk/issues/2787 or https://github.com/jenkins-infra/helpdesk/issues/3636

Examples of pages:

  • Working on the "current/legacy" Update Center VM: https://aws.updates.jenkins.io/download/plugins/gradle/index.html
  • Failing page on the new Update Center, on one of the mirrors in use (West Europe on Cloudflare): https://westeurope.cloudflare.jenkins.io/download/plugins/gradle/index.html
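
To illustrate why the same page works on one host and fails on another, here is a minimal sketch of how a browser resolves a root-relative link against the host that served the page. It assumes the generated hrefs are root-relative paths (which is what calling `getPath()` on a URL would produce); the function names and the plugin version in the example href are hypothetical.

```shell
#!/bin/sh
# Print the scheme://host part of a URL.
origin_of() {
  printf '%s\n' "$1" | sed -E 's|^(https?://[^/]+).*|\1|'
}

# Resolve a root-relative href ("/...") the way a browser does:
# against the origin of the page that contains it.
resolve_root_relative() {
  printf '%s%s\n' "$(origin_of "$1")" "$2"
}

# Hypothetical href; the real pages may use different paths.
href="/download/plugins/gradle/1.0/gradle.hpi"
resolve_root_relative "https://aws.updates.jenkins.io/download/plugins/gradle/index.html" "$href"
resolve_root_relative "https://westeurope.cloudflare.jenkins.io/download/plugins/gradle/index.html" "$href"
```

The second call prints a URL on the mirror hostname, which is the situation this issue describes: the link now points at a host that was never meant to be the entry point.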

dduportal avatar Sep 28 '24 08:09 dduportal

Comment by @daniel-beck about the bandwidth in a discussion we had together on this topic:

By @dduportal I don't recall the exact amount of data transferred but it was huge even for these tiny HTML files. We're talking about Tbs per month (globally, it's 50 Tb per month)

Did you just group by file extension, or also path? Because some of the "JSON" files also have an HTML file extension. So if you count https://updates.jenkins.io/update-center.json.html as HTML, that'll skew this a lot.

=> Important point as it means we might have to change the routing pattern.

Cloudflare Analytics shows that HTML was far behind in amount of requests but we can't tell the different HTML files apart:

Screenshot 2024-09-28 at 10 26 45

dduportal avatar Sep 28 '24 08:09 dduportal

Proposal: Given the context of the new Update Center, let's use absolute URL links.

  • We can assume updates.jenkins.io will be the entry point.
    • It's a CNAME (and not an A) record now
    • There are no blockers to adding a redirect from updates.jenkins-ci.org to this domain.
  • It will need a bit of adaptation in the update-center2 tests to ensure we can generate pages for a "custom" hostname/scheme.
    • Requires parameterizing https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/MavenArtifact.java#L48
  • Implementing it means removing the getPath() call when retrieving the download URL from getDownloadUrl() during the HTML building (but NOT when retrieving data from Artifactory or building htaccess files!)
  • (2 occurrences: HPI and non HPI links) https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/IndexHtmlBuilder.java#L93-L95
  • "Latest" permalink: https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/DirectoryTreeBuilder.java#L61
  • Jenkins War permalink: https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/DirectoryTreeBuilder.java#L88
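
The idea of the proposal can be sketched in shell, purely as an illustration (the real change would live in the Java sources linked above). The `UC_BASE_URL` variable, the `absolute_link` function and the example paths are hypothetical names chosen for this sketch; the default value reflects the assumption that updates.jenkins.io is the entry point, and the override shows how tests could target a "custom" hostname/scheme.

```shell
#!/bin/sh
# Hypothetical setting: the public entry point used as prefix for every link.
: "${UC_BASE_URL:=https://updates.jenkins.io}"

# Turn a path (with or without a leading "/") into an absolute URL.
absolute_link() {
  case $1 in
    /*) printf '%s%s\n' "${UC_BASE_URL%/}" "$1" ;;
    *)  printf '%s/%s\n' "${UC_BASE_URL%/}" "$1" ;;
  esac
}

# Illustrative paths only.
absolute_link "/download/plugins/gradle/latest/gradle.hpi"
UC_BASE_URL="http://localhost:8080/" absolute_link "war/latest/jenkins.war"
```

With absolute links, the page content no longer depends on which hostname ends up serving it.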

What are your thoughts on this @daniel-beck @timja @MarkEWaite ?

dduportal avatar Sep 28 '24 09:09 dduportal

Absolute URL makes sense to me.

timja avatar Sep 29 '24 21:09 timja

Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type? How does that make any sense?

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

It doesn't look like we understand enough what's going on here to base any decisions on.

daniel-beck avatar Oct 01 '24 06:10 daniel-beck

We understand the mirroring mechanism, which is why I opened this issue. If we start to select which files are mirrored and which ones are not, the architectural complexity will be a pain as we will need to maintain a list of conditions. It is already nightmare-ish on get.jenkins.io tbh

Hence the question about the pros and cons of switching to absolute URLs, which is not mutually exclusive with analysing usage to understand it better.

The costs involved here are huge compared to the optimization, but a finer-grained understanding is mandatory.

dduportal avatar Oct 01 '24 14:10 dduportal

Hello @daniel-beck 👋

Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type?

My apologies, I mistakenly used the word "behind". You are correct, I meant that HTML seems to be, by far, the most popular type of file downloaded, at least as per the Cloudflare dashboard during the 24-hour experiment.

Let me check if we see the same result on the current VM (analysing the logs from a few days ago).

How does that make any sense?

I don't know. Let's compare with current behavior. That could also be "assumed" content type (including HTTP/404) as they are served as HTML as well.

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

I ... don't know. We did not even know there was an HTML version of this one. Where should we look (except our access logs)?

dduportal avatar Oct 16 '24 13:10 dduportal

Initial check for 09 October 2024 (both HTTP and HTTPS, both updates.jenkins-ci.org and updates.jenkins.io vhosts):

  • ~8,478,760 hits

  • ~444,350 visitors

  • ~5,000,000 redirections (HTTP/3XX) for around 1.2 GiB

  • ~3,200,000 files served (HTTP/2XX) for around 2.1 TiB

  • ~257,890 client errors (HTTP/4XX) for around 43 MiB

Report (generated with GoAccess from the "combined" access log):

report.html.zip
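
For anyone who wants to cross-check this kind of per-status summary without GoAccess, here is a rough sketch. It assumes the standard "combined" log layout, where the 9th whitespace-separated field is the status code and the 10th is the response size; the `status_classes` name is made up for this example.

```shell
#!/bin/sh
# Summarize hits and bytes per HTTP status class from "combined" access logs.
# Assumes the standard layout: field 9 is the status, field 10 the response size.
status_classes() {
  awk '
    {
      cls = substr($9, 1, 1) "XX"
      hits[cls]++
      if ($10 != "-") bytes[cls] += $10   # "-" means no body was sent
    }
    END {
      # one line per class: class, hits, bytes
      for (c in hits) printf "%s %d %.0f\n", c, hits[c], bytes[c]
    }' "$@" | sort
}
```

It can be called with one or more log files as arguments (or fed through stdin), and prints one `class hits bytes` line per status class.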

dduportal avatar Oct 16 '24 14:10 dduportal

@daniel-beck If we compare with the Cloudflare numbers for 24 hours, which only include HTTP/2XX and HTTP/4XX (as the redirects are NOT sent to Cloudflare), it matches:

  • Total requests on Cloudflare were ~10.8M
    • 7.67M were HTTP/4XX => that leaves ~3.2M HTTP/2XX, which is the same number as what we see on the actual production.
  • 9.88M requests were "HTML": that's ~2.21M HTTP/2XX HTML types (removing the HTTP/4XX)
    • JSON (which are HTTP/2XX) are ~1.14M, which means we have 1/3 JSON and 2/3 HTML (all files included, tool installers and metadata).

Need to check the HTML/JSON split on the current production, but the high rate of HTTP/4XX clearly explains the ratio change during the brownout.

It also adds more weight to using an absolute URL in the generated HTML files to decrease the number of HTTP/4XX.
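
The arithmetic behind these figures can be replayed as follows. This is only a sketch using the numbers quoted in this comment (in millions of requests); it assumes, as the comment does, that every HTTP/4XX response was counted as "HTML" by the dashboard. The function name is hypothetical.

```shell
#!/bin/sh
# Figures quoted above, in millions of requests over 24 hours.
cloudflare_breakdown() {
  awk 'BEGIN {
    total = 10.8; errors_4xx = 7.67; html_all = 9.88; json_2xx = 1.14
    ok_2xx   = total - errors_4xx        # successful responses
    html_2xx = html_all - errors_4xx     # assumes every 4XX was counted as HTML
    printf "2XX total: %.2fM\n", ok_2xx
    printf "2XX HTML:  %.2fM\n", html_2xx
    printf "HTML share of HTML+JSON: %.0f%%\n", 100 * html_2xx / (html_2xx + json_2xx)
    printf "JSON share of HTML+JSON: %.0f%%\n", 100 * json_2xx / (html_2xx + json_2xx)
  }'
}
cloudflare_breakdown
```

This prints 3.13M successful responses, 2.21M successful HTML responses, and a 66% / 34% split between HTML and JSON, which is where the rough "2/3 vs. 1/3" comes from.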

dduportal avatar Oct 16 '24 14:10 dduportal

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

1/3 of JSON, and 2/3 of HTML

The problem with this view is that there are different kinds of HTML files on this domain.

The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Various update-center.json.html exist and are irrelevant for this topic. Half the tool installer files (e.g. in https://updates.jenkins.io/updates/ ) are HTML files and are irrelevant for this topic.

daniel-beck avatar Oct 16 '24 20:10 daniel-beck

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

the report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz. Got 4 files (unsecured and secured, for both hostnames)

dduportal avatar Oct 17 '24 08:10 dduportal

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

the report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz. Got 4 files (unsecured and secured, for both hostnames)

Additions:

  • I concatenated the 4 access log files from production into a single one and ran the goaccess tool on it (specifying the combined log format). The "concatenated" file weighs 1.2 Gb: do you want me to send it to you (compressed) through a private channel @daniel-beck to avoid further unneeded tasks for you?

dduportal avatar Oct 17 '24 08:10 dduportal

The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?

What did I miss?

dduportal avatar Oct 17 '24 08:10 dduportal

As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.

The most popular URL that this issue is about is accessed just 24 times across the 4 logs:

  24 /download/plugins/htmlpublisher/

Compared to:

508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html

Methodology (prove me wrong):

cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted
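
A more compact variant of the same counting technique, as a sketch only. It is not byte-for-byte equivalent to the commands above: it matches the request method in the 6th field of a "combined" log line instead of grepping for `GET ` anywhere in the line, and it writes no intermediate files. The `top_paths` name is made up for this example.

```shell
#!/bin/sh
# Count GET requests per path (query string stripped), most requested first.
# Assumes "combined" log lines where field 6 is the quoted method and field 7 the path.
top_paths() {
  awk '$6 == "\"GET" { sub(/\?.*/, "", $7); print $7 }' "$@" \
    | sort | uniq -c | sort -nr | awk '{ print $1, $2 }'
}
```

Called with the access log files as arguments, it prints `count path` lines in descending order of count.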

daniel-beck avatar Oct 17 '24 08:10 daniel-beck

Yes, I had the same results before generating the goaccess report. I fail to understand the relationship with the current issue: the domain change when serving files from mirrors leads to wrong hyperlinks in the generated pages. What did I miss?

dduportal avatar Oct 17 '24 09:10 dduportal

Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.

I wonder whether this is necessary. Seems like mirrors make sense for anything that's actual "content" (the stuff being downloaded), not glorified directory indexes.

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?

What did I miss?

This came from https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2384923753 / https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2416879452

Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

daniel-beck avatar Oct 17 '24 09:10 daniel-beck

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used? What did I miss?

This came from #4311 (comment) / #4311 (comment)

Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

Oh I see, thanks for clarifying. We agree then on the result from the current production.

Let me compile my thoughts and analysis on the Cloudflare part:

  • Cloudflare still does not provide us with access logs, only the terrible dashboard I took a screenshot of. A request has been sent to them to enable access log publication (streamed to Datadog as we cannot access them directly). As with any sponsorship program, the beginning involves some back and forth.
  • My 1/3 vs. 2/3 is a ratio in number of hits, not in downloaded volume. We need to calculate this on the current access logs (I'll try to do it and publish my shell commands, because goaccess is too limited for such analysis), either by content type or by URL patterns.
  • The huge spike in HTTP/4XX means we still have some endpoints sent to mirrors which should not be. The links on the pages here (most probably due to crawler patterns) are part of this, but we don't really know how much.

@smerle33 proposed using a non-Cloudflare mirror as a safety net if things go south with CF. It would use a custom web server (or two) that we manage, hosted in DigitalOcean (we have 4-5 Tb of bandwidth for free and 15k credits valid until the end of the year), so we can check the access logs in detail. The cost is OK for another brownout (assuming 2 to 3 Tb of download for 24h), but we'll need to be careful if we add it permanently.

dduportal avatar Oct 17 '24 09:10 dduportal

I met with @dduportal to move this topic along. Outcome:

  • He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating RedirectMatch to RewriteRule in the uc2 .htaccess file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.
  • I'm looking into making URLs in --download-links-directory and --latest-links-directory absolute instead of relative, independent of the outcome of your task. This is implemented in https://github.com/jenkins-infra/update-center2/pull/810

daniel-beck avatar Oct 17 '24 13:10 daniel-beck

Following this summary, I've opened the PR https://github.com/jenkins-infra/update-center2/pull/812 to focus on the second solution.

With the use of RewriteRule for the "fallback" rule (tested with success), we can add a rewrite condition to test the absence of a file: that would allow us to serve the /downloads/**/*html files from Apache since it's only a low volume, and would solve the HTTP/404 links without requiring absolute links.

dduportal avatar Oct 18 '24 14:10 dduportal

Update:

  • https://github.com/jenkins-infra/update-center2/pull/812 has been tested and then merged with success. No more RedirectMatch on pkg VM + it keeps working as expected in the UC in Azure + mirrors.

  • This unblocks the issue here: opened https://github.com/jenkins-infra/update-center2/pull/813 to start serving the HTML files from download/*** from Apache (and the uctest.json 😉 ) instead of mirrors.

dduportal avatar Oct 21 '24 17:10 dduportal

  • PR https://github.com/jenkins-infra/update-center2/pull/813 has been merged: update-center2 job ran with success
  • We can see that https://azure.updates.jenkins.io/download/plugins/gradle/index.html does NOT redirect anymore to a mirror. As such, its links (even if still relative) are not broken
  • The ?uctest trick is also working as expected:
# Before the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 307 
date: Tue, 22 Oct 2024 09:54:37 GMT
content-type: text/html; charset=iso-8859-1
location: https://mirrors.updates.jenkins.io/uctest.json?uctest
strict-transport-security: max-age=2592000; includeSubDomains; preload

# After the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 200 
date: Tue, 22 Oct 2024 09:55:06 GMT
content-type: application/json
content-length: 3
last-modified: Tue, 22 Oct 2024 09:54:46 GMT
etag: "3-6250dc26ce6f7"
accept-ranges: bytes
strict-transport-security: max-age=2592000; includeSubDomains; preload
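
The before/after check above can be scripted. This is a small sketch, not something taken from the PR: it reads the headers printed by `curl -I` on stdin and reports whether the response was a redirect (and to where) or was served directly. The `classify_response` name is hypothetical.

```shell
#!/bin/sh
# Classify a response from the headers printed by `curl -I` (read on stdin).
classify_response() {
  awk '
    { sub(/\r$/, "") }                        # curl prints CRLF line endings
    NR == 1 { status = $2 }                   # e.g. "HTTP/2 307"
    tolower($1) == "location:" { loc = $2 }
    END {
      if (status ~ /^3/) print "redirect " loc
      else print "served " status
    }'
}

# Usage (requires network access):
#   curl -sI "https://azure.updates.jenkins.io/foo/update-center.json?uctest" | classify_response
```

Fed with the two header dumps above, it reports a redirect to the mirrors.updates.jenkins.io URL for the first and `served 200` for the second.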

dduportal avatar Oct 22 '24 09:10 dduportal