Migrate git-scm.com to a static site, generated via Hugo, served via GitHub Pages

Open dscho opened this issue 1 year ago • 46 comments

Changes

This Pull Request adjusts the existing files such that the site is no longer served via a Rails App, but by GitHub Pages instead. A preview can be seen here: https://dscho.github.io/git-scm.com/ (which is generated and deployed from this Pull Request's branch, and will be updated via automation whenever that branch changes).

It is the culmination of a very long and large effort that started in February 2017 with the first attempt to migrate the site to Jekyll. Several years and a substantial effort by @spraints, @vdye and myself later, here is the result: no longer a Jekyll site but a Hugo site (because of render times: 20 minutes vs 30 seconds), with search implemented using Pagefind.

The main themes of the subsequent migration from the Rails App to a Hugo-generated static site are:

  • We move the original Rails App files that contain Rails code mixed into HTML to content/, where the files defining the pages live in the Hugo world, then modify them to drop the Rails code and replace it with Hugo constructs. More often than not, we separate the commits that move the files from the commits that adjust the contents, to help Git realize that there has been a move (as opposed to a delete/add). This allows for noticing upstream changes that need to be reflected in moved & modified files when rebasing to upstream.

  • In Hugo setups, the files live in the following locations:

    • hugo.yml

      This is the central configuration file that tells Hugo how to render the site.

    • content/

      This defines the content of the pages that are served. Only a subset of Hugo's functionality is available here (the idea is to leave the complicated stuff to the layout used to render the pages).

      Most, but not all, of the files living in this directory tree are HTML files that are generated (and then committed) using external repositories, e.g. the ProGit book and its translations.

    • layouts/

      This is where the "boilerplate" is defined that ties the site together, i.e. the header, the footer and the sidebar as well as the main scaffolding in which the pages' content is to be rendered.

      This is the location where most of Hugo's functionality is available and complex stuff can happen such as looping or accessing site parameters.

    • layouts/partials/

      This directory contains recurring templates, i.e. reusable partial layouts that are used to structure the elements of the site. This includes the sidebar, how videos are rendered, etc.

    • layouts/shortcodes/

      This directory contains so-called "shortcodes", i.e. reusable elements similar to partial layouts. The major difference is that shortcodes can be used within content/ while partial layouts can only be used from within layouts/.

      See https://gohugo.io/content-management/shortcodes/ for more information on this topic.

    • static/

      These files are not processed by Hugo, but copied as-is. Good for images, for example.

    • assets/

      These files are processed in specific ways. That is where the SASS-based style sheets live, for example.

    • data/

      These files define metadata that can be used in Hugo's functions. For example, it contains the list of documentation categories that are rendered in various ways.

  • In contrast to most Hugo-managed sites, we will refrain from using a Hugo theme, and instead stick with the existing style sheets.

    Likewise, we refrain from using Markdown at all: The existing site did not use it, therefore it makes little sense to start using it now.

  • In addition to Hugo's directories, we also have these:

    • script/

      This directory contains scripts to perform recurring tasks such as rendering Git's manual pages into HTML files that are then stored inside content/docs/.

      For historical reasons, these are Ruby scripts for the most part, as it is easier to follow the development when that functionality is extracted from the current Rails App and turned into Ruby scripts that can be run stand-alone.

    • .github/workflows/ and .github/actions/

      The latter directory contains a file that defines a custom GitHub Action that compensates for the lack of Hugo support in GitHub Pages: By default, only Jekyll sites are supported out of the box, whereas Hugo sites require a custom GitHub workflow to deploy the site.

      The former directory contains files that define GitHub workflows that are typically run on a schedule, updating the various parts that are generated from external sources: the Git version, the ProGit Book, manual pages, etc. These workflows essentially keep the rendered HTML files in content/ up to date with the respective external repositories.

      These workflows can be seen in action (pun intended) here: https://github.com/dscho/git-scm.com/actions

    • _generated-asciidoc/

      This directory serves as a cache of "expanded AsciiDoc": many of Git's manual pages include content from other files, and therefore it is non-trivial to determine whether or not a manual page has changed and needs to be re-rendered (essentially, the only way is to expand them by inlining the included files). Caching this content speeds up updating the manual pages drastically.

  • Most of the core logic lives in layouts/. Hugo distinguishes between logic that is allowed in layouts/ and logic that is allowed in content/; the latter can only access so-called "shortcodes" (https://gohugo.io/content-management/shortcodes/). These shortcodes are free to use the entire set of Hugo's functionality.

    tl;dr whenever we need to do something complicated that is confined to only a few pages, we have to implement it in layouts/shortcodes/ and insert the corresponding {{< shortcode-name >}} in the page itself. Whenever we need to do something complicated that is used in more places, it is implemented elsewhere in layouts/.

  • Some of the logic that cannot be performed statically (such as telling the user how long ago the latest macOS installer was released, or adjusting the Windows downloads to reflect the CPU architecture indicated by the current user agent) is implemented using JavaScript instead.

  • The site search needs to move to the client side, as there is no longer a server that can perform that functionality. Luckily, Pagefind (https://pagefind.app/) has matured in the meantime: it is a very performant client-side search solution implemented in JavaScript that relies on a search index that is generated at build time and served incrementally, as needed, via static files. This is what we use.
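To illustrate the _generated-asciidoc/ cache mentioned above: a manual page only needs re-rendering when its expanded source (with all includes inlined) differs from the cached expansion. The following is a minimal JavaScript sketch of that idea, not the actual Ruby scripts in script/. The in-memory `files` map, the function names and the simplified handling of include paths are all assumptions made for illustration.

```javascript
// Inline AsciiDoc "include::path[]" directives recursively.
// `files` is a hypothetical Map from path to AsciiDoc source; the real
// scripts would read Git's documentation sources from disk instead.
function expandIncludes(path, files, seen = new Set()) {
  if (seen.has(path)) throw new Error(`include cycle at ${path}`);
  if (!files.has(path)) throw new Error(`missing file: ${path}`);
  const nested = new Set(seen).add(path);
  return files
    .get(path)
    .split('\n')
    .map((line) => {
      // Assumption: include paths are used verbatim (no relative resolution).
      const match = /^include::([^\[]+)\[[^\]]*\]\s*$/.exec(line);
      return match ? expandIncludes(match[1], files, nested) : line;
    })
    .join('\n');
}

// A page is stale when its expanded text differs from the cached expansion.
function needsRerender(path, files, cache) {
  return expandIncludes(path, files) !== cache.get(path);
}

const files = new Map([
  ['git-log.txt', 'git-log(1)\ninclude::pretty-options.txt[]'],
  ['pretty-options.txt', '--oneline'],
]);
const cache = new Map([['git-log.txt', expandIncludes('git-log.txt', files)]]);

console.log(needsRerender('git-log.txt', files, cache)); // false
files.set('pretty-options.txt', '--oneline\n--graph');   // only the include changes
console.log(needsRerender('git-log.txt', files, cache)); // true
```

Note that git-log.txt itself is untouched in the second check; only comparing the expanded text reveals that the page has to be rendered again.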

Context

Changes required to finalize the migration in addition to this Pull Request

  • This Pull Request is not actually meant to be merged, not to the main branch at least, but to the (not-yet-existing) gh-pages branch.

  • To successfully deploy to GitHub Pages, the Pages configuration needs to be switched from "Deploy from a branch" to "GitHub Actions".

  • Once everything is golden in this Pull Request and the decision to move to GitHub Pages is final, git-scm.com needs to be pointed to GitHub Pages (read: CNAME needs to be configured to make use of the GitHub Pages-deployed site).

  • The Pull Request branch could actually be pushed to gh-pages already way before closing this Pull Request, as https://git-scm.github.io/ would be serving a different site than https://git-scm.com/ before the CNAME entry is adjusted.

Why make these changes?

  • Heroku stopped their free tier and ever since https://git-scm.com/ has required sponsorship whose funding could be put to better use elsewhere.
  • Static sites are much easier to manage, and to develop. With this Pull Request, developing the site locally is as easy as checking out the repository and running hugo serve -w, then editing the files to your heart's content.

dscho avatar Oct 16 '23 11:10 dscho

:tada: This is great! Thank you so much for picking this up! The demo site looks great!

spraints avatar Oct 17 '23 12:10 spraints

👋 Sneaking in here with some thoughts from the search side!

On first interactions, the search has some notable issues compared to the production rails search, for a few reasons on both sides of the fence.

  1. All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...
    • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned
  2. Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.
    • CloudCannon/pagefind#437 is relevant and discussing much the same thing.
    • I don't have an immediate solution for this but I would love to find one.
  3. Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.
    • This helped narrow down a bug — filed as CloudCannon/pagefind#478
  4. The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.
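As a rough illustration of point 4, here is a minimal JavaScript sketch that splits search hits into a "Reference" and a "Book" group. The thread suggests Pagefind filters for this; the sketch swaps in plain URL prefixes as a simpler stand-in. The `{ url, title }` hit shape, the `/docs/` and `/book/` prefixes and the function name are assumptions, not the site's actual code.

```javascript
// Group search hits by section, judging only by the URL path.
// Assumption: manual pages live under /docs/, Pro Git under /book/.
function splitResults(hits) {
  const groups = { reference: [], book: [], other: [] };
  for (const hit of hits) {
    if (/\/docs\//.test(hit.url)) groups.reference.push(hit);
    else if (/\/book\//.test(hit.url)) groups.book.push(hit);
    else groups.other.push(hit);
  }
  return groups;
}

const groups = splitResults([
  { url: '/docs/git-rebase/', title: 'git-rebase' },
  { url: '/book/en/v2/Git-Branching-Rebasing/', title: 'Rebasing' },
  { url: '/about/', title: 'About' },
]);
console.log(groups.reference.length, groups.book.length, groups.other.length); // 1 1 1
```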

(Amazing work migrating this to Hugo! ❤️)

bglw avatar Oct 18 '23 09:10 bglw

Oh wow, Mr Pagefind himself! I'm honored to meet you, @bglw!

  • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned

I kind of wanted to be able to find stuff in old versions that is no longer present in current versions. That's why I added https://github.com/dscho/git-scm.com/commit/e9fa9630417b075b4a136518ea4dfbc7a1e884f4.

  • Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.

Excellent!

Heh, thank you for that!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

dscho avatar Oct 18 '23 15:10 dscho

About Heroku

That is true, but there has been an update since that 2022 mail.

https://lore.kernel.org/git/ZRHTWaPthX%[email protected]/

Heroku has a new (?) program for giving credits to open-source projects. The details are below:

https://www.heroku.com/open-source-credit-program

I applied on behalf of the Git project on 2023-09-25, and will follow-up on the list if/when we hear back from them.

It does seem like the PLC is still in favor of moving to a static solution, though.

https://lore.kernel.org/git/[email protected]/

  • Biggest expense is Heroku - Fusion has been covering the bill
    • There's on and off work on porting from a Rails app to a static site: https://github.com/git/git-scm.com/issues/942
  • Dan Moore from FusionAuth has been providing donations
  • Ideally we are able to move away from using Heroku, but in the meantime we'll have coverage either from (a) FusionAuth, or (b) Heroku's new open-source credit system

About the preview:

Search

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page as well as the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions. Maybe that could be improved by adding the version number to the page title for non-current versions? Or maybe a filter in the search results to exclude historical documentation? If we don't want to mangle the titles, pagefind would show the version number below the result if we configured it as metadata.
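Both ideas (hiding historical pages, or marking them with their version) boil down to recognizing a versioned manual page from its URL. A minimal JavaScript sketch, assuming the `/docs/<page>/<version>/` URL scheme seen in the examples above; the function names and the `{ url, title }` hit shape are hypothetical.

```javascript
// Matches e.g. /docs/git-rebase/2.41.0/ and captures the version number.
const VERSIONED_DOC = /\/docs\/[^/]+\/(\d+(?:\.\d+)+)\/?$/;

// Returns the version string for a versioned manual page, or null otherwise.
function docVersion(url) {
  const match = VERSIONED_DOC.exec(url.split(/[?#]/)[0]);
  return match ? match[1] : null;
}

// Option 1: drop historical pages from the result list.
function currentOnly(hits) {
  return hits.filter((hit) => docVersion(hit.url) === null);
}

// Option 2: keep them, but append the version to the displayed title.
function labelFor(hit) {
  const version = docVersion(hit.url);
  return version ? `${hit.title} (${version})` : hit.title;
}

const hits = [
  { url: '/docs/git-rebase/', title: 'git-rebase' },
  { url: '/docs/git-rebase/2.41.0/', title: 'git-rebase' },
];
console.log(currentOnly(hits).length); // 1
console.log(hits.map(labelFor));       // [ 'git-rebase', 'git-rebase (2.41.0)' ]
```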

Minor issues

There are some broken links in the preview on https://dscho.github.io/git-scm.com/docs/ that lead to https://dscho.github.io/docs/<topic>

There's a broken link on https://dscho.github.io/git-scm.com/about/free-and-open-source/ to https://dscho.github.io/git-scm.com/trademark. On the live site that redirects from https://git-scm.com/trademark to https://git-scm.com/about/trademark (https://github.com/dscho/git-scm.com/pull/1)

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

There's some redirect that swallows anchors. https://dscho.github.io/git-scm.com/docs/ links to https://dscho.github.io/git-scm.com/docs/git#_git_commands , which redirects to https://dscho.github.io/git-scm.com/docs/git/ instead of https://dscho.github.io/git-scm.com/docs/git/#_git_commands. It looks like the slash-free version isn't possible with the GitHub Pages/Hugo combination (https://github.com/gohugoio/hugo/issues/492). We should update these links to contain the slash from the beginning to avoid the redirect. (https://github.com/dscho/git-scm.com/pull/3)
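The link rewrite proposed here can be sketched as a small JavaScript helper that inserts the slash before the fragment. This is only an illustration of the rule, not how the linked pull request implements it; the function name is hypothetical and query strings are deliberately ignored.

```javascript
// Insert a trailing slash before the "#fragment" so that the server does not
// need to redirect (and thereby drop the anchor).
function withTrailingSlash(href) {
  const hashAt = href.indexOf('#');
  const path = hashAt === -1 ? href : href.slice(0, hashAt);
  const fragment = hashAt === -1 ? '' : href.slice(hashAt);
  // Leave alone: pure "#anchor" links, paths that already end in "/", and
  // file-like paths whose extension starts with a letter (e.g. ".png").
  if (path === '' || path.endsWith('/') || /\.[a-z][a-z0-9]*$/i.test(path)) {
    return href;
  }
  return `${path}/${fragment}`;
}

console.log(withTrailingSlash('/docs/git#_git_commands')); // /docs/git/#_git_commands
console.log(withTrailingSlash('/docs/git/#_git_commands')); // unchanged
```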

https://dscho.github.io/git-scm.com/downloads/mac/ has an odd grammar issue that https://git-scm.com/download/mac doesn't. (https://github.com/dscho/git-scm.com/pull/2) It says

which was released about 2 year, on 2021-08-30.

https://git-scm.com/download/mac correctly says

which was released about 2 years ago, on 2021-08-30.
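For reference, the expected wording can be produced client-side with a few lines of JavaScript. This is a minimal sketch under simplifying assumptions (365-day years, 30-day months, rounding down), not the code used on either site; the function name is made up.

```javascript
// Approximate "time ago" wording with correct singular/plural forms.
function releasedAgo(releaseDate, now = new Date()) {
  const days = Math.floor((now - releaseDate) / 86400000); // ms per day
  const units = [['year', 365], ['month', 30], ['day', 1]];
  for (const [name, size] of units) {
    const count = Math.floor(days / size);
    if (count >= 1) {
      return `about ${count} ${name}${count === 1 ? '' : 's'} ago`;
    }
  }
  return 'today';
}

console.log(releasedAgo(new Date('2021-08-30'), new Date('2023-10-20')));
// about 2 years ago
```

The plural suffix and the trailing "ago" are exactly the two pieces missing from the preview's output.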

Also note the slight URL change there from download to downloads. There is a redirect for that, though, so that should be fine.

rimrul avatar Oct 20 '23 10:10 rimrul

One additional note: There is a commit about porting the old 404 page, 18a3ac2, but I've only seen the generic GitHub pages 404 page on the preview in my testing.

rimrul avatar Oct 20 '23 10:10 rimrul

Switching to pagefind also changed search behaviour in another way.

The rails site always searches the English content. Pagefind defaults to what they call multilingual search, i.e. searching only pages in the same language as the one you're searching from. That's theoretically a usability improvement, but with the partial nature of our non-English content (availability of any given language can vary from man page to man page, the book exists in languages that don't have any man pages, everything else only exists in English), we might need a fallback to English here. Pagefind offers an option to force all pages to be indexed as English, but I think we can slightly abuse mergeIndex with language set to en for a better result.

rimrul avatar Oct 21 '23 06:10 rimrul

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

I managed to fix it via 2d0f6c80293192f7882914e7f6a683c60afe3159

dscho avatar Oct 24 '23 10:10 dscho

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page as well as the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions.

Hmm. The more I think about it, the more I get convinced that the older versions of the manual pages should be excluded from the search. I thought it was a feature, but it looks as if it incurs more downsides than upsides.

dscho avatar Oct 24 '23 11:10 dscho

this was a major effort @dscho , thank you very much! sorry for the silence, but i've been busy with other stuff. in the meanwhile, and to ensure this effort won't be wasted, can you summarize what do you need to make this merge-ready?

what do you still need to tackle? where do you need help from other people? :)

pedrorijo91 avatar Nov 06 '23 20:11 pedrorijo91

can you summarize what do you need to make this merge-ready?

@pedrorijo91 Yes.

  • [x] The search needs some love:
    • [x] exclude the manual pages of previous versions from the search instead of trying to demote them; it's just too confusing
    • [x] in the "live search" (i.e. when typing in the search box on any page other than the search results page), we will want to reinstate the "Reference"/"Book" separation of the search results. I'm currently unsure how we can accomplish that.
  • [x] to make the URLs nicer by having no trailing slash (just like the existing Rails App), we will need to uglify the URLs.
  • [x] general QA:
    • [x] ensure that current URLs would work after migration
      • [x] e.g. /about#branching-and-merging, /about#staging-area etc
    • [x] add test -z "$(git grep "\\(href\|src\) *= *[\"']/")" to CI
  • [x] rebase to the latest main
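The `git grep` check in the list above fails as soon as any tracked file contains an `href` or `src` attribute whose value starts with `/`, presumably because such root-relative links break when the site is served from a sub-path like /git-scm.com/. A minimal JavaScript sketch of the same check follows; the in-memory `pages` map and the function name are assumptions for illustration, whereas the real check runs over the files in the repository.

```javascript
// Same pattern as the git grep expression: href/src, optional spaces around
// "=", an opening quote, then a leading slash.
const ROOT_RELATIVE = /(?:href|src) *= *["']\//;

// `pages` is a hypothetical Map from file path to file contents.
// Returns "path:line" for every offending line, like grep -n would.
function findRootRelativeLinks(pages) {
  const offenders = [];
  for (const [path, text] of pages) {
    text.split('\n').forEach((line, index) => {
      if (ROOT_RELATIVE.test(line)) offenders.push(`${path}:${index + 1}`);
    });
  }
  return offenders;
}

const pages = new Map([
  ['content/about.html', '<a href="docs/git/">ok</a>\n<img src = \'/images/logo.png\'>'],
  ['content/docs/_index.html', '<a href="https://git-scm.com/">ok</a>'],
]);
console.log(findRootRelativeLinks(pages)); // [ 'content/about.html:2' ]
```

An empty result corresponds to `test -z` succeeding in CI.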

The big blocker is the "live search" one.

dscho avatar Nov 06 '23 21:11 dscho

Oh, and there's a ton of work still needed to address @rimrul's excellent feedback.

dscho avatar Nov 06 '23 23:11 dscho

  • general QA:

    • ensure that current URLs would work after migration

      • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

dscho avatar Nov 07 '23 16:11 dscho

  • ensure that current URLs would work after migration

    • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

I just realized that https://git-scm.com/about#branching-and-merging does not actually redirect to https://git-scm.com/about/branching-and-merging... so I guess this is a non-issue.

dscho avatar Nov 07 '23 21:11 dscho

Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.

  • This helped narrow down a bug — filed as https://github.com/CloudCannon/pagefind/issues/478

@bglw I just tested this at https://dscho.github.io/git-scm.com/ and it seems to work as expected. Thank you!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

I worked on this (7142149b5, ddbbe381c and 08183b0b0) and it seems to work now. Could you please test?

dscho avatar Nov 07 '23 22:11 dscho

@pedrorijo91 I believe that this is now ready for wider testing. Do you have any objections against me pushing this to gh-pages and enabling the Actions to deploy to https://git.github.io/git-scm.com/?

dscho avatar Nov 08 '23 08:11 dscho

i agree that's likely the best way to test the new website @dscho . kinda impossible to review this huge diff manually :D

pedrorijo91 avatar Nov 16 '23 21:11 pedrorijo91

@bglw wow, the innocuous release notes item "Fixed a bug, resulting in a (very) large improvement to the NodeJS Indexing API performance (~100x)." seems to have a profound impact. While it is definitely not a scientific experiment (read: take the numbers with a grain of salt), the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s. Well done!

dscho avatar Nov 17 '23 07:11 dscho

i agree that's likely the best way to test the new website @dscho .

Thank you @pedrorijo91. It's live! https://git.github.io/git-scm.com/

kinda impossible to review this huge diff manually :D

Right, I should have clarified that the majority of the diff is in the generated pages that do not actually need to be reviewed because they come from external sources where they are reviewed already. For example, content/book/ and content/docs/ contain only one non-generated file: content/docs/_index.html. You can see that in the tree of the commit before all the generated pages were added by automated GitHub workflow runs: https://github.com/git/git-scm.com/tree/ef17ce6ee91e30aba30e37478104b4384d9142ea/content

dscho avatar Nov 17 '23 08:11 dscho

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

bglw avatar Nov 17 '23 08:11 bglw

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

Huh. So it might actually be a fluke. I just thought that npx, being a node.js way to generate the search index, would internally use the node.js API ;-)

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

Possible. I experienced something like that recently in a different context, where subtle differences between the large macos runners relative to the non-large ones caused git/git CI to fail (because Python2 was on the PATH in the non-large runners, but hidden in the large ones). So it's quite possible. Unfortunately, I do not see any breadcrumb in the logs to confirm or deny that the job is running on a large runner...

dscho avatar Nov 17 '23 09:11 dscho

~~Bug report~~ Has been fixed

~~HTML entities are rendered verbatim in version dropdowns in documentation reference.~~ Fixed now.

Bug report contents

Steps to reproduce

  1. Go to https://git.github.io/git-scm.com/docs/git-config/2.40.0
  2. Click on dropdown "Version 2.40.0 ▾"
  3. Observe the dropdown between items for versions 2.40.0 and 2.39.0

Actual result

The dropdown has small text in italics on the right 2.39.1 &rarr; 2.39.3 no changes

Expected result

The dropdown has small text in italics on the right 2.39.1 → 2.39.3 no changes

rybak avatar Nov 17 '23 14:11 rybak

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

ossilator avatar Nov 17 '23 15:11 ossilator

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

rimrul avatar Nov 17 '23 15:11 rimrul

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1ad1b94f70821ab79fcfd0365ab2e5b3ae.
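Why the entity showed up verbatim: Hugo's templates escape values that are interpolated into HTML, so the `&` of an already-encoded `&rarr;` gets escaped a second time, and `safeHTML` opts out of that for a trusted value. The effect can be imitated in a few lines of JavaScript. This is only an illustration of double escaping with a hand-written `escapeHtml` helper (an assumption, not Hugo's implementation).

```javascript
// Minimal HTML text escaping, similar in spirit to template auto-escaping.
function escapeHtml(text) {
  const replacements = {
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;',
  };
  return text.replace(/[&<>"']/g, (ch) => replacements[ch]);
}

const name = '2.39.1 &rarr; 2.39.3'; // value that already contains an entity

// Escaped again: the browser displays the literal text "&rarr;".
console.log(escapeHtml(name)); // 2.39.1 &amp;rarr; 2.39.3

// Passed through untouched (what safeHTML amounts to): the browser shows an arrow.
console.log(name); // 2.39.1 &rarr; 2.39.3
```

Skipping the escaping is only appropriate because the version names come from trusted, generated data.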

dscho avatar Nov 17 '23 22:11 dscho

HTML entities are rendered verbatim in version dropdowns in documentation reference.

@rybak thank you for the detailed bug report!

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

@ossilator Friends don't let friends use submodules.

Speaking seriously again, it won't work with submodules because those pages need to be generated in GitHub workflows with write access to the repository (so that the changes can be pushed), which is not possible (or at least not in a way that makes it easy to contribute) in a workflow that is defined in a different repository. Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

In addition to making it harder to contribute, submodules would also make the deployment to GitHub Pages more fragile because of the need to clone multiple repositories for the price of one.

No, I fear that the submodules idea is actually the bad idea, not the one to commit generated files in well-defined places ;-)

dscho avatar Nov 17 '23 22:11 dscho

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this. https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1.

This is now fixed (in 367254d3a) and deployed. Thank you @rybak!

dscho avatar Nov 18 '23 01:11 dscho

Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

to me this sounds like a complete nightmare. not cleanly separating the sources from the generated content is a recipe for undesired "special effects" of all kinds. and that's on top of the obvious issues of working with the repo itself.

why does the generated content need to be versioned in the first place? can't github just serve the build artifacts? for all i can tell you just need a simple configuration management system.

ossilator avatar Nov 18 '23 15:11 ossilator

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

Merely looking at re-generating all of the manual pages makes re-generating them all the time distinctly a total non-starter. That would be adding over 10 minutes to every single deployment, for work that really only needs to be done once!

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

dscho avatar Nov 20 '23 14:11 dscho

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

I know you're passionate about this topic, but there's no need for hostility (re: "gave me the creeps"). The suggestion of submodules seemed more the result of some initial brainstorming w.r.t avoiding the storage of generated files in the repo. It's a starting point for a conversation, not a firm design proposal.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

As someone that maintained a repository in the past with a similar concentration of generated files, there are a number of things it can make harder as well:

  • It's difficult to enforce "don't update the generated files" (even with "DON'T UPDATE THIS FILE" banners, READMEs, etc.), so when people inevitably try it, it leads to more time spent going back-and-forth on pull requests.
  • It's not necessarily straightforward (esp. for new contributors) to figure out which file(s) need to be changed to update something they see in a given generated file (is it the main body of the file? header or footer? etc.).
  • Changes in the generation process can result in massive diffs across the repository that don't add any real value.
  • The generated files are usually not updated in the commit that prompted the change, making debugging/bisecting more difficult.
  • Subjectively, I don't see generated files as any different from other build artifacts (e.g. binaries), and I consider "storing build artifacts alongside source code" generally bad practice (muddles the concept of "source of truth").

There are probably more specific issues I'm not remembering but, overall, I can say that maintaining a ton of generated files was indeed a nightmare for myself and other developers. The only reason I didn't jettison them when I was maintainer is that I never got the time to update the tooling accordingly.

All that said, you've made some valid points as to why we should store generated files. So IMO the decision of whether or not to commit generated files is fairly nuanced, and warrants discussion & possibly further investigation before settling on an approach.

Merely looking at re-generating all of the manual pages makes re-generating them all the time distinctly a total non-starter. That would be adding over 10 minutes to every single deployment, for work that really only needs to be done once!

10 minutes doesn't seem too bad, to be honest. But one way to avoid that while still keeping generated files out of the repo could be to store them as artifacts (e.g., a tarball of the generated files) tied to a given commit hash in the artifact storage of your choice, then use that as a sort of "pre-build" of the repository.

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

As someone that has also worked on this project (albeit not as extensively), I'm not convinced that committing generated content is the right way to go. Personal experience is valuable in informing your opinions, but it is not on its own a justification for the correctness of your approach, and it's definitely not cause to dismiss @ossilator's (or anyone else's) concerns out of hand.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

It's generally the job of the person developing a change to convince reviewers to accept that change, not the other way around. Reviewers can certainly help that process by providing technical justification when they disagree with an approach, but it's nevertheless important to understand & address concerns so that we reach a consensus based on technical merit. After all, wouldn't it be better for everyone if alternatives are thoroughly explored? If they don't end up better than what you have now, at least everyone will understand why we settled on a given approach. And if it is better than what you have now, then we end up with...something better!

I know that kind of exploration takes time, and you've already put a lot of time into this, so what I'm asking is probably more frustrating than not. But with such a massive change to such a valuable resource, it's critical that concerns are thoroughly addressed before moving forward on merging/deploying.

vdye avatar Nov 20 '23 22:11 vdye

How would we make it easy to work with artifacts attached to commits, especially on PR branches?

I really like the simplicity of pushing to my fork and having a deployed site after the workflow run is done. Minimal surface for network issues because only one repository is checked out. And I can't think of any way to make it as simple without committing the generated files, I'm sorry.

dscho avatar Nov 21 '23 09:11 dscho