[Update Center] generate HTML pages with absolute links
As described in https://github.com/jenkins-infra/helpdesk/issues/2649#issuecomment-2380569628, the HTML files generated by jenkins-infra/update_center2 are using relative links.
This was a good technique in the past, when both updates.jenkins-ci.org and updates.jenkins.io served files.
But it is now an issue in the context of the new Update Center system which uses HTTP(S) mirrors to serve content to end users to:
- Limit the outbound bandwidth on the Jenkins Infra main clouds as it is really expensive
- Ensure content is served as fast as possible to provide better user experience. It's particularly visible in China as per https://github.com/jenkins-infra/helpdesk/issues/2787 or https://github.com/jenkins-infra/helpdesk/issues/3636
Examples of pages:
- Working on the "current/legacy" Update Center VM: https://aws.updates.jenkins.io/download/plugins/gradle/index.html
- Failing page on the new Update Center mirror, in one of the used mirrors (West Europe on Cloudflare): https://westeurope.cloudflare.jenkins.io/download/plugins/gradle/index.html
Comment by @daniel-beck about the bandwidth, from a discussion we had together on this topic:
By @dduportal: I don't recall the exact amount of data transferred but it was huge even for these tiny HTML files. We're speaking about Tbs per month (globally, it's 50 Tb per month)
Did you just group by file extension, or also path? Because some of the "JSON" files also have an HTML file extension. So if you count https://updates.jenkins.io/update-center.json.html as HTML, that'll skew this a lot.
=> Important point as it means we could have to change the routing pattern.
Cloudflare Analytics shows that HTML was far behind in amount of requests but we can't tell the different HTML files apart.
Proposal: Given the context of the new Update Center, let's use absolute URL links.
- We can assume updates.jenkins.io will be the entry point.
- It's a CNAME (and not an A) record now
- There are no blockers to add a redirect from `updates.jenkins-ci.org` to this domain.
- It will need a bit of adaptation on the testing for update_center2 to ensure we can generate pages to a "custom" hostname/scheme.
- Requires allowing to parameterize https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/MavenArtifact.java#L48
- Implementing it means removing the `getPath()` method when retrieving download URL from `getDownloadUrl()` during the HTML building (but NOT when retrieving data from Artifactory or building htaccess files!)
- (2 occurrences: HPI and non HPI links) https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/IndexHtmlBuilder.java#L93-L95
- "Latest" permalink: https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/DirectoryTreeBuilder.java#L61
- Jenkins War permalink: https://github.com/jenkins-infra/update-center2/blob/bce1fdac6c45d989e2fc91e633a2d3ce2c19d5a1/src/main/java/io/jenkins/update_center/DirectoryTreeBuilder.java#L88
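To make the difference concrete, here is a minimal Python sketch (not the update-center2 Java code) of how a browser would resolve a path-only link versus an absolute link when the index page is served from a mirror hostname. The two hostnames come from the examples above. The plugin version in the download URL is made up for illustration, and `relative_link` / `resolve` are hypothetical helper names.

```python
from urllib.parse import urljoin, urlsplit

# Page as served by one of the mirrors (hostname taken from the example above).
MIRROR_PAGE = "https://westeurope.cloudflare.jenkins.io/download/plugins/gradle/index.html"

# Hypothetical download URL; the version number is only an illustration.
DOWNLOAD_URL = "https://updates.jenkins.io/download/plugins/gradle/1.0/gradle.hpi"


def relative_link(download_url: str) -> str:
    """Keep only the path, roughly what calling getPath() on the URL yields."""
    return urlsplit(download_url).path


def resolve(page_url: str, href: str) -> str:
    """Resolve an href the way a browser would, relative to the page it appears on."""
    return urljoin(page_url, href)


# Path-only link: resolved against the mirror hostname.
print(resolve(MIRROR_PAGE, relative_link(DOWNLOAD_URL)))
# Absolute link: keeps pointing at updates.jenkins.io regardless of the serving host.
print(resolve(MIRROR_PAGE, DOWNLOAD_URL))
```

With the path-only link the target ends up on the mirror hostname, which matches the failing page reported above; with the absolute link the entry point stays updates.jenkins.io.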
What are your thoughts on this @daniel-beck @timja @MarkEWaite ?
Absolute URL makes sense to me.
Cloudflare Analytics shows that HTML was far behind in amount of requests
It's by far the most popular content type? How does that make any sense?
Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?
It doesn't look like we understand enough what's going on here to base any decisions on.
We understand the mirroring mechanism, which is why I opened this issue. If we start selecting which files are mirrored and which are not, the architectural complexity will be a pain as we will need to maintain a list of conditions. It is already nightmare-ish on get.jenkins.io tbh
Hence the question about the pros and cons of switching to absolute URLs, which is not mutually exclusive with analysing usage to understand it better.
The costs involved here are huge compared to the optimization, but it is mandatory to have a finer-grained understanding.
Hello @daniel-beck 👋
Cloudflare Analytics shows that HTML was far behind in amount of requests
It's by far the most popular content type?
My apologies, I mistakenly used the word "behind". You are correct, I meant that HTML seems to be, by far, the most popular type of file downloaded, at least as per the Cloudflare dashboard during the 24-hour experiment.
Let me check if we see the same result on the current VM (analysing the logs from a few days ago).
How does that make any sense?
I don't know. Let's compare with current behavior. That could also be "assumed" content type (including HTTP/404) as they are served as HTML as well.
Is this just the tool installers via `DownloadService` or are we still downloading the `update-center.json.html` from Jenkins?
I ... don't know. We did not even know there was an HTML version of this one. Where should we look (except our access logs)?
Initial check for 09 October 2024 (both HTTP and HTTPS, both updates.jenkins-ci.org and updates.jenkins.io vhosts):
- ~8,478,760 hits
- ~444,350 visitors
- ~5,000,000 redirections (HTTP/3XX) for around 1.2 GiB
- ~3,200,000 files served (HTTP/2XX) for around 2.1 TiB
- ~257,890 client errors (HTTP/4XX) for around 43 MiB
Report (generated with GoAccess from the "combined" access log):
@daniel-beck If we compare with the Cloudflare numbers for 24 hours, which are only HTTP/2XX and HTTP/4XX (as the redirects are NOT sent to Cloudflare), it matches:
- Total requests on Cloudflare were ~10.8M
- 7.67M were HTTP/4XX => that leaves ~3.2M HTTP/2XX, which is the same number as what we see on the actual production.
- 9.88M requests were "HTML": that's ~2.21M HTTP/2XX HTML requests (removing the HTTP/4XX)
- JSON requests (which are HTTP/2XX) are ~1.14M, which means we have 1/3 of JSON, and 2/3 of HTML (all files included, tool installers and metadata).
We still need to check the HTML/JSON distribution on the current production, but the high rate of HTTP/4XX clearly explains the ratio change during the brownout.
It also adds more weight in using an absolute URL in the HTML generated files to decrease this amount of HTTP/4XX.
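For readers who want to re-check the arithmetic behind these ratios, here is a small Python sketch using only the figures quoted above (in millions of requests). It assumes, as the comparison above does, that every HTTP/4XX response is counted as "HTML" by the Cloudflare dashboard; the variable names are mine.

```python
# All figures are in millions of requests for the 24-hour Cloudflare window quoted above.
total_requests = 10.8
client_errors_4xx = 7.67
html_requests = 9.88
json_2xx = 1.14

# Assumption: every HTTP/4XX response is counted by Cloudflare as "HTML".
successful_2xx = total_requests - client_errors_4xx
html_2xx = html_requests - client_errors_4xx

# Shares among successful HTML + JSON requests (a ratio of hits, not of volume).
json_share = json_2xx / (json_2xx + html_2xx)
html_share = html_2xx / (json_2xx + html_2xx)

print(f"HTTP/2XX: ~{successful_2xx:.2f}M")
print(f"HTML HTTP/2XX: ~{html_2xx:.2f}M")
print(f"JSON share: {json_share:.0%}, HTML share: {html_share:.0%}")
```

The subtraction gives about 3.13M successful requests, in the same range as the ~3.2M files served seen on production, and the JSON/HTML split comes out at roughly one third versus two thirds.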
Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.
1/3 of JSON, and 2/3 of HTML
The problem with this view is that there are different kinds of HTML files on this domain.
The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.
Various update-center.json.html exist and are irrelevant for this topic. Half the tool installer files (e.g. in https://updates.jenkins.io/updates/ ) are HTML files and are irrelevant for this topic.
Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.
The report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz. I got 4 files (unsecured and secured, for both hostnames).
Additions:
- I concatenated the 4 access log files from production into a single one and ran the `goaccess` tool on it (specifying the combined log format). The "concatenated" file weighs 1.2 Gb: do you want me to send it to you (compressed) through a private channel @daniel-beck to avoid further unneeded tasks for you?
The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's `wget --recursive` goes brrrr.
Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.
I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files.
Unless you want to check the usage for actions (blockers or optimizations) if the wget --recursive is used?
What did I miss?
As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.
The most popular URL that this issue is about is accessed just 24 times across the 4 logs:
24 /download/plugins/htmlpublisher/
Compared to:
508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html
Methodology (prove me wrong):
cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted
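For anyone who prefers to reproduce the count without the intermediate files, the same methodology can be sketched in Python. This is an illustrative equivalent, not the commands that were actually run; the helper names and the sample log lines are made up.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def request_path(line: str) -> Optional[str]:
    """Extract the requested path from one access log line, or None if it has no GET."""
    if "GET " not in line:                        # like: fgrep 'GET '
        return None
    after_get = line.rstrip("\n").rsplit("GET ", 1)[1]  # like: sed 's|.*GET ||g' (greedy)
    without_query = after_get.split("?", 1)[0]    # like: sed -E 's|\?.*||g'
    return without_query.split(" ", 1)[0]         # like: sed -E 's| .*||g'


def top_paths(lines: Iterable[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count hits per path and return the most requested ones (sort | uniq -c | sort -nr)."""
    counts = Counter(p for p in map(request_path, lines) if p is not None)
    return counts.most_common(limit)


# Made-up sample lines in the "combined" log format, only to exercise the functions.
SAMPLE = [
    '192.0.2.1 - - [03/Oct/2024:00:00:01 +0000] "GET /updates/hudson.tasks.Maven.MavenInstaller.json.html?id=x HTTP/1.1" 200 512 "-" "Jenkins"',
    '192.0.2.2 - - [03/Oct/2024:00:00:02 +0000] "GET /updates/hudson.tasks.Maven.MavenInstaller.json.html HTTP/1.1" 200 512 "-" "Jenkins"',
    '192.0.2.3 - - [03/Oct/2024:00:00:03 +0000] "GET /download/plugins/htmlpublisher/ HTTP/1.1" 200 2048 "-" "Mozilla"',
    '192.0.2.4 - - [03/Oct/2024:00:00:04 +0000] "POST /ignored HTTP/1.1" 405 0 "-" "curl"',
]

for path, hits in top_paths(SAMPLE):
    print(hits, path)
```

On real logs, `top_paths` would be fed the lines of the concatenated access log files instead of `SAMPLE`.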
Yes, I had the same results before generating the goaccess report. I fail to understand the relationship with the current issue: the domain change when serving files from mirrors leads to wrong hyperlinks in the generated pages. What did I miss?
Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than `updates.jenkins.io` due to redirections.
I wonder whether this is necessary. Seems like mirrors make sense for anything that's actual "content" (the stuff being downloaded), not glorified directory indexes.
I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if the `wget --recursive` is used? What did I miss?
This came from https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2384923753 / https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2416879452
Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)
Oh I see, thanks for clarifying. We agree then on the result from the current production.
Let me compile my thoughts and analysis on the Cloudflare part:
- Cloudflare still does not provide us with access logs, only the terrible dashboard I took a screenshot of. A request has been sent to them to enable access log publication (streamed to Datadog as we cannot access them directly). As with any sponsorship program, the beginning involves some back and forth.
- My 1/3 vs. 2/3 is a ratio in number of hits, not in downloaded volume. We need to calculate this on the current access logs (I'll try to do it and publish my shell commands, because goaccess is too limited for such analysis), either by content type or by URL patterns.
- The huge spike in HTTP/4XX means we still have some endpoints sent to mirrors which should not be. The links on the pages here (most probably due to crawler patterns) are part of this, but we don't really know how much.
@smerle33 proposed using a non-Cloudflare mirror as a safety net if things go south with CF. It would use a custom webserver we manage (or two), hosted in DigitalOcean (we have 4-5 Tb of bandwidth for free and 15k credits valid until the end of the year) so we can check access logs in detail. The cost is OK for another brownout (assuming 2 to 3 Tb of download for 24h), but we'll need to be careful if we add it permanently.
I met with @dduportal to move this topic along. Outcome:
- He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating `RedirectMatch` to `RewriteRule` in the uc2 `.htaccess` file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.
- I look into making URLs in `--download-links-directory` and `--latest-links-directory` absolute instead of relative, independent of the outcome of your task. This is implemented in https://github.com/jenkins-infra/update-center2/pull/810
Following this summary, I've opened the PR https://github.com/jenkins-infra/update-center2/pull/812 to focus on the second solution.
With the use of RewriteRule for the "fallback" rule (tested with success), we can add a rewrite condition to test the absence of a file: that would allow us to serve the /downloads/**/*html files from Apache, since they are only a low volume, and would solve the HTTP/404 links without requiring absolute links.
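The routing idea can be modelled in a few lines of Python. This is only a simplified sketch of the decision, not the actual `.htaccess` rules: `route` and `local_files` are hypothetical names, the real rules may rewrite the target path, and the mirror hostname and the 307 status are taken from the curl output further down this thread. In Apache, such a file-absence test is typically written with a `RewriteCond ... !-f` condition.

```python
from typing import Set, Tuple

# Mirror redirector hostname as it appears in the curl output of this thread.
MIRROR_BASE = "https://mirrors.updates.jenkins.io"


def route(path: str, local_files: Set[str]) -> Tuple[int, str]:
    """Serve the file from the web server when it exists locally, else redirect to mirrors."""
    if path in local_files:
        return 200, path                  # low-volume index pages stay on the origin
    return 307, MIRROR_BASE + path        # everything else goes to the mirror system


# Hypothetical set of files kept on the Apache side.
local = {"/download/plugins/gradle/index.html"}

print(route("/download/plugins/gradle/index.html", local))
print(route("/update-center.json", local))
```

Because the index pages are then answered by the origin hostname, their relative links keep resolving against that hostname.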
Update:
- https://github.com/jenkins-infra/update-center2/pull/812 has been tested and then merged with success. No more `RedirectMatch` on the `pkg` VM + it keeps working as expected in the UC in Azure + mirrors.
- It unblocks the issue here: opened https://github.com/jenkins-infra/update-center2/pull/813 to start serving the HTML files from `download/***` from Apache (and the uctest.json 😉) instead of mirrors.
- PR https://github.com/jenkins-infra/update-center2/pull/813 has been merged: update-center2 job ran with success
- We can see that https://azure.updates.jenkins.io/download/plugins/gradle/index.html does NOT redirect anymore to a mirror. As such, its links (even if still relative) are not broken
- The `?uctest` trick is also working as expected:
# Before the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 307
date: Tue, 22 Oct 2024 09:54:37 GMT
content-type: text/html; charset=iso-8859-1
location: https://mirrors.updates.jenkins.io/uctest.json?uctest
strict-transport-security: max-age=2592000; includeSubDomains; preload
# After the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 200
date: Tue, 22 Oct 2024 09:55:06 GMT
content-type: application/json
content-length: 3
last-modified: Tue, 22 Oct 2024 09:54:46 GMT
etag: "3-6250dc26ce6f7"
accept-ranges: bytes
strict-transport-security: max-age=2592000; includeSubDomains; preload