WIP: Convert git-scm.com to use GitHub Pages
Re @peff's recent ML message, I started playing around with converting this site to be a Jekyll site, so it can be hosted on GitHub Pages instead of Heroku. https://github.com/spraints/git-scm.com is the new code, and http://pages-test-git-scm.pickardayune.com/ is the rendered site. So far, I've converted the home page and the "about" pages. My goal is for the Pages site to be able to handle all of the same URLs that the current site knows about. Also, I'm not a designer, so I'm aiming to keep the site looking exactly the same.
The home page was a fairly mechanical conversion. In the rails app, the "about" page is a single page that rewrites its URLs; I split it up so the links are normal links.
I'll continue to poke at it in my spare time. I'd also accept pull requests 😻 to my `gh-pages` branch. It should be pretty easy to figure out what needs to be done: find a broken link on http://pages-test-git-scm.pickardayune.com/, and copy content from the Rails app in https://github.com/git/git-scm.com to the right place in the Jekyll app. Rails helpers need to be changed into flat HTML or Liquid tags. I hear that there are man pages in the current site's database, which may take some more effort to convert.
@spraints Cool! I will be glad to help though I don't know about rails but I do know about Jekyll. If you open up the issues on your repo, it would be awesome.
@spraints I had a question about your choice of Jekyll, is there any specific reason why you chose it over middleman?
@maxlazio because Jekyll is better than Middleman :P It's faster to compile and develop with, and generally more people know it. Plus, it's the main SSG for GitHub Pages.
@maxlazio because of GitHub Pages. It's what I use for static sites, and seemed like a good fit here.
@pranitbauva1997 why does https://github.com/spraints/git-scm.com need issues? Pull requests are available, just open a pull request with spraints:gh-pages as the base branch.
@spraints Thanks for your answer, makes sense. I'll send my PRs to the fork, and we can coordinate in issues here when necessary.
I like this idea very much.
But if this becomes a reality, doesn't it make more sense to do a gradual migration? Like having a small PR that adds Jekyll, and then I/others can start to migrate it step by step?
@Sicaine I would like to wait until the site is completely migrated before cutting this repo over to jekyll. The migration can go step-by-step in my fork. For now, I've been manually verifying that the migrated pages look the same on https://git-scm.com/ and http://pages-test-git-scm.pickardayune.com/. I have a small linter script, too.
@spraints I just think that it is harder to keep two repos up to date. But I also don't know how long it will take.
Had some time, so I've sent a couple of PRs.
To be honest - and I don't mean to demean the effort - I don't think this migration is worth it. As I suggested in the email chain, I think it makes more sense to simply generate static assets from the Rails codebase and dump them to a separate GH Pages repo for hosting. The main reason for that is simply because there's already a large body of static assets that need to be generated independently (the man pages and Pro Git) which you would somehow need to add to a Jekyll-based solution. On top of that, although the site is mostly static, there are some dynamic back-end components that need to get fleshed out first, the biggest one being the ElasticSearch layer.
@sxlijin Yes, I think the shortest way to get to a static site is to crawl the generated Rails site and dump it into a static repo for hosting (and in a sense, that's really what a CDN is doing; it's just a big cache). I'm not sure it's a good idea for long-term maintainability, though.
Going through Rails makes things a lot more complicated, mostly because the contents of this repository only tell half the story. Most of the actual site content is imported into the database, and its freshness is tracked in a totally separate way (and mostly a primitive one; updating the import code requires manually invalidating the database entries to pull in new versions of the manpages or book content).
So if you imagine that the conversion process is to occasionally run:

```shell
rails server
wget --mirror http://localhost:3000
git add -A
git commit -m 'regenerated static site'
```

That is heavily dependent on what happens to be in your database at the time you generate. We can work around that, but I think the end result will be simpler if we can just import directly to the filesystem and skip the round trip through the database. And then we can also perhaps leverage filesystem-aware tools like `make` during the build.
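To make the filesystem-aware idea concrete, here is a toy sketch of an incremental rebuild loop. The `render` function and the `src`/`out` layout are made up for illustration; the real converter would be asciidoctor or whatever tool the importers end up using.

```shell
# Stand-in for the real source-to-HTML converter.
render() { printf '<p>%s</p>\n' "$(cat "$1")" > "$2"; }

mkdir -p src out
echo 'about page' > src/about.txt

for s in src/*.txt; do
  o="out/$(basename "$s" .txt).html"
  # Rebuild only when the output is missing or older than the source,
  # which is exactly what make's dependency tracking buys you.
  if [ ! -e "$o" ] || [ "$s" -nt "$o" ]; then
    render "$s" "$o"
    echo "rebuilt $o"
  fi
done
```

Running this twice in a row would rebuild nothing the second time, since every output is newer than its source.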
In a sense, the bits of conversion that have happened so far aren't really the interesting part. There's some tedious work, but it's mostly just converting the templates from one form to another. The heavy lifting, I think, will be the conversion of the manpage and book importers.
I dunno. Maybe I am underestimating what value the database is providing to the display code.
I think the page itself is a great example of a good static website, independent of how often it is regenerated. A nightly build isn't hard, and when done properly the Heroku issue is gone.
But independent of any decision: as long as nothing in particular is decided, it's hard to tackle any issues or improvements. I don't want to start another approach while this one is going.
The heavy lifting, I think, will be the conversion of the manpage and book importers.
The other hard part, I think, is deciding what to do with search. That's the actual dynamic part of the app. I don't know what kinds of 3rd-party solutions we could use there. Obviously linking to `google.com?q=site:git-scm.com&search+terms` is one way, but that is less nice than the search that's there (which does type-ahead suggestions). I think something like Google's Site Search is a good match, but I don't know the pricing or complexity there.
I'm pretty unfamiliar with this space in general. That's one of the reasons I was soliciting opinions on the list. :)
It is free and easy to integrate Google Custom Search into a website.
@peff we can dynamically generate the site if we host it with GitLab Pages, which would automate the currently-manual update process (see this blog post for example). If you don't want to do this, that's fine too :P
Also we can use Algolia DocSearch for free search, only problem might be that we'd have an Algolia logo in the search results.
To be clear: I'm a fan of Jekyll. I've used it before and I find it, in combination with GH pages, to be absolutely fantastic for its purpose.
Pending a resolution to the search question, though, I would be hesitant to move forward with it. (The man pages and book matter too, but since those are simply static assets, there are definitely ways around that.)
One friend I spoke to suggested using hosted ElasticSearch if we really want to keep it (which would be $40/mo from a cursory search), but my honest opinion is that ES is absolutely bonkers overkill for searching the site, especially since, as it stands, search doesn't even search page contents, just page titles. And to be frank, that seems like something that's very easily migrated to just one big JS file.
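As a hedged sketch of the "one big JS file" idea: since search only covers page titles, the whole index can be one generated JSON file. The `_site/docs` layout and the page titles below are invented for illustration, not from the actual repo.

```shell
# Fake a couple of rendered pages so the example is self-contained.
mkdir -p _site/docs
printf '<html><head><title>git-commit</title></head></html>' > _site/docs/git-commit.html
printf '<html><head><title>git-config</title></head></html>' > _site/docs/git-config.html

# Build a title-only search index as a single JSON array.
{
  echo '['
  first=1
  for f in _site/docs/*.html; do
    title=$(sed -n 's:.*<title>\(.*\)</title>.*:\1:p' "$f")
    url=${f#_site}
    [ $first -eq 1 ] || echo ','
    first=0
    printf '  {"url": "%s", "title": "%s"}' "$url" "$title"
  done
  echo
  echo ']'
} > search-index.json

cat search-index.json
```

The front end would then fetch this file once and filter it client-side as the user types, which is all the current type-ahead really needs.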
JFTR, I am discussing with another person who is ready to translate the site to another language. Would migrating to Jekyll allow providing translated versions of the site?
There's no clear, obvious solution that I can find which is compatible with GH Pages (I'm having trouble building the `json-1.8.3` gem locally for some reason, so I can't check what gems are bundled in `github-pages`; I did turn this up while googling).
That being said, Jekyll allows you to have data files (stored in `repo_root/_data/`), the contents of which can be referenced from the page templates themselves. As an example, I manage this website and use one such file to control the links shown in the navbar, so I imagine it's easily possible to roll a custom framework around the `_data` files to support i18n work.
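To make the `_data` idea concrete, here is a hedged sketch; the file name, keys, and strings are all invented for illustration:

```yaml
# _data/i18n/fr.yml — hypothetical translation strings
about:
  title: "À propos"
  small_and_fast: "Léger et rapide"
```

A template could then look up `site.data.i18n[page.lang].about.title`, with `page.lang` set in each page's front matter; that lookup plus a per-language set of pages is roughly the custom framework being suggested.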
I agree that search is a big wildcard on the static-site transition. Moving the first few individual pages over has been a good validation that the end result looks good (both rendered and in code). But the next step is probably validating those other assumptions (search, imports, etc). I had always assumed we would move to something like Google's site search, but it's possible the "one big JS file" approach might be even simpler.
I hadn't given i18n much thought, as the site is not currently translated. But it does seem like an obvious future direction to go in. I'll ping GH pages folks and see what their thoughts are on the state of the art.
I think they're just discontinuing the on-premise Enterprise product. The thing we would use is probably https://cse.google.com/cse/. I hear they are also getting rid of the paid version of that, but I'm not sure we would have used it anyway. That leaves the "ad-supported" search. Which I guess is how normal Google works, but I'm not sure how ugly or intrusive it is on the site.
Searching using Algolia seems to be a solution worth exploring if we ever go with Jekyll: https://blog.frankel.ch/search-static-website/
👋
I got interested in this project ever since it was announced that Heroku is ending the free tier that we use for https://git-scm.com/. We currently have a sponsor, but it would make the most sense to spend that money more wisely, making https://git-scm.com/ a nice showcase for how powerful GitHub Pages is.
Thank you @spraints for starting your incredible work on this, it gave me quite the head start!
This is where I'm at right now:
- I pushed up a re-done version of @spraints' `gh-pages` branch. It is tree-same, i.e. it does all the same things, but it arrives there by a different set of commits, avoiding a "start from scratch" and instead moving and modifying the files. I did that to ensure that a rebase would catch changes in upstream that we would not want to undo.
- I rebased these patches on top of upstream and added a few chocolate bits on top:
  - There is now a script to update the Git version that is shown on the front page (meant to be run in a scheduled GitHub workflow that can also be triggered via a `workflow_dispatch`).
  - The links are now relative (or at least use `site.baseurl`) so that they work via the regular GitHub Pages, see the proof here: https://dscho.github.io/git-scm.com/
Note: This is just a start. There is still quite a bit to be done.
- [x] Site search needs to be done. I am not actually a fan of Algolia because I think we can do much better, by using `lunr.js` (this, this and this post should be helpful). The idea is to use Jekyll itself to output all pages as JSON and then build a static index that is usable via `lunr.js`. This will be no small task.
- [x] The Git documentation needs to be redone (it was deleted for the sake of getting something to work quickly). This will be a bit tricky because we will need to store plenty of different versions of the manual pages (for example, `git config`'s manual page was modified between v2.37.2 and v2.38.0). Which means that we have to commit generated pages, and keep some sort of record about previously-imported versions, so that we can add a GitHub workflow that adds a new version's manual pages incrementally, then commits and pushes the result.
- [x] The ProGit book was deleted, for now, and we need to reinstate it. Also a non-trivial task.
- [x] The Downloads section has not been done yet; this will also be a bit tricky, in particular because we will need a scheduled GitHub workflow to update the links, e.g. to Git for Windows' latest version.
- [x] Some links in the "About" pages of the current official https://git-scm.com/ are a bit funny: they pretend to be relative to the current page but then redirect to another one. I think we will need to reinstate that behavior.
- [ ] See also the comments like `NEEDS-WORK` in the commit history of my pushed-up branch, such as my suspicion that we would do well by using `permalink` and `redirect_from` directives.
Plenty of things still to do 😉
Just a few thoughts, as somebody who has thought about this off and on in the intervening 5 years:
- I think we'll probably want custom scripts rather than driving things from the top with Jekyll, just because of the complexity of the manpage and book imports. That doesn't mean we might not use something like Jekyll for the templating, but I think we'd end up driving the other updates with a script that's more aware of what needs to be updated and can do it incrementally.
- However we handle the document imports (whether importing into the repo, or generating a static site via a workflow), it needs to be easy to re-run. A frequent gotcha on the site is that we fix a bug in the import routines, but we have to manually kick off jobs to re-import existing versions. It would be really nice if this could be run locally and the results diff'd.
- Having investigated a bit in the interim, I'm definitely in favor of a pure static search index like lunr, as opposed to relying on a third party indexer.
- I don't think we strictly need scheduled jobs to do updates for things like manpage imports and updating the downloads list. If those updates can be done in an easy and automated way, having somebody press the button for "hey, there's a new release" is not so bad (in fact, we do that now anyway, because we prefer the update to happen at release time, rather than hours later). And once we have that, of course, kicking it off in a workflow should be a nice cherry on top.
Just my two cents, of course. I haven't looked carefully at the problem in a while.
- we'll probably want custom scripts
Yes, we will. Things like importing/pre-generating the ProGit book cannot be done by Jekyll.
What can be done by Jekyll is to take generated `.html` files and integrate them into the look-and-feel of git-scm.com.
- I don't think we strictly need scheduled jobs to do updates for things like manpage imports and updating the downloads list.
I agree, but it would be very nice to have. I vaguely recalled that GitHub makes this somewhat easy to do, and indeed, we can schedule a workflow to run every night (or on any schedule) here. Such workflows can also be triggered by "pressing a button", which is convenient to do at release time.
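For reference, the trigger section of such a workflow could look like the fragment below; the cron time is an arbitrary example, not a suggestion:

```yaml
# Sketch of the triggers for a nightly, manually-dispatchable workflow.
on:
  schedule:
    - cron: "17 2 * * *"   # every night at 02:17 UTC (arbitrary time)
  workflow_dispatch:        # "press the button" at release time
```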
In fact, it would be really nice to have those workflows automatically get kicked off whenever new tag(s) are pushed to git/git. But I don't know if such cross-repo monitoring is possible.
In fact, it would be really nice to have those workflows automatically get kicked off whenever new tag(s) are pushed to git/git. But I don't know if such cross-repo monitoring is possible.
I think you'd have to do it by setting up a workflow on git/git that triggers a workflow here; I'm not aware of any mechanism to subscribe to a different repo and I don't see one by skimming the workflow triggers docs.
It would be pretty easy to schedule an hourly or nightly job on git-scm to check for tag pushes to git/git though.
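A sketch of what that tag-polling job could do, under stated assumptions: to keep the example self-contained it polls a local throwaway repo instead of https://github.com/git/git, and the `known-tags.txt` file (which would live in the git-scm repo) is invented for illustration.

```shell
# Simulate the upstream repository with two release tags.
git init -q upstream
git -C upstream -c user.name=ci -c user.email=ci@example.com \
  commit -q --allow-empty -m init
git -C upstream tag v2.38.0
git -C upstream tag v2.38.1

# Tags we have already imported (kept sorted).
printf 'v2.38.0\n' > known-tags.txt

# List the remote's tags without cloning; the real job would point
# ls-remote at https://github.com/git/git instead of ./upstream.
git ls-remote --tags --refs ./upstream \
  | sed 's:.*refs/tags/::' | sort > current-tags.txt

# Anything remote-only is a new release that needs an import run.
comm -13 known-tags.txt current-tags.txt > new-tags.txt
cat new-tags.txt
```

If `new-tags.txt` is non-empty, the job kicks off the import and appends the handled tags to the known list.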
I guess the guts of the processing of the sources of ProGit and the manpages can be kept the same. One roadblock, however, is that instead of feeding a DB, we want to spit out files, add content, commit, and push, all from within a GitHub workflow. Not sure this is easy...
The other issue, as @dscho pointed out, is that unlike ProGit, the Git manpages are versioned, at least in English, and need to be flattened into the filesystem (instead of relying on the hash in a DB).
But I don't know if such cross-repo monitoring is possible.
It's not possible right now with GitHub workflows as they are, not unless you convince the Git maintainer to integrate a workflow whose sole beneficiary is git-scm.com.
However, there is a very easy way to do this: register a webhook with git/git pointing to an Azure Function that uses a PAT to trigger the `workflow_dispatch` workflow in git-scm.com. It's no wizardry, really; it is very similar to what we do with GitGitGadget (with minor variations: it's not a webhook but a GitHub App, and it's not a GitHub workflow but an Azure Pipeline that is triggered).
I guess the guts of the processing of sources of progit and the manpages can be kept the same.
That matches my understanding.
One roadblock, however, is that instead of feeding a DB, we want to spit out files, add content, commit, and push, all from within a GitHub workflow. Not sure this is easy...
It is actually very easy. All you need to do is mark the workflow in question as being permitted to write contents.
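A sketch of what that looks like in the workflow file; the `./update-manpages` script is a hypothetical stand-in for the real import step:

```yaml
# Sketch: letting a workflow commit generated pages back to the repo.
permissions:
  contents: write   # allow the GITHUB_TOKEN to push

jobs:
  import:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./update-manpages   # hypothetical import step
      - run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add -A
          git diff --cached --quiet || git commit -m 'import new manual pages'
          git push
```

The `git diff --cached --quiet ||` guard makes the commit step a no-op when the import produced no changes, so the scheduled run stays green.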
the Git manpages are versioned, at least in English, and need to be flattened into the filesystem (instead of relying on the hash in a DB).
They do not really need to be flattened, but we do need to put them into the correct locations. As pointed out in my earlier comment, the URLs for the versioned manual pages add a suffix `/<version>` to https://git-scm.com/docs/<command>. What is tricky about this is that `docs/git-config` is kind of a file, but also kind of a directory, in the Rails app. We will need to figure out how to handle this in Jekyll (it might be necessary to write the current version into `docs/git-config/index.html` and have `docs/git-config` auto-resolve to that file, somehow).
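One way the "file and also directory" problem can flatten out, assuming the static host serves `<dir>/index.html` for a request to `<dir>` (GitHub Pages does this): make everything a directory. The content strings below are placeholders.

```shell
# Proposed layout: the current manual page lives at the directory's
# index, and each versioned page lives in a subdirectory.
mkdir -p docs/git-config/2.38.0
echo 'git-config manual (current version)' > docs/git-config/index.html
echo 'git-config manual (v2.38.0)' > docs/git-config/2.38.0/index.html

# /docs/git-config        -> docs/git-config/index.html
# /docs/git-config/2.38.0 -> docs/git-config/2.38.0/index.html
cat docs/git-config/index.html
```

With this layout, no Jekyll-side trickery is needed beyond writing the current version's page into the `index.html` of each command's directory.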
My biggest wish for the import/build system is that I be able to run it locally, and that the results be available in the filesystem for local inspection. And ideally also intermediate results, like manpages that have been rendered via asciidoctor but not yet Jekyll-ified (or whatever template / presentation system we use).
The reason is that most of our bugs (including the one I introduced a few days ago!) come from refactoring or fixing a problem in the import code, which have unforeseen effects. I.e., they are things that could be caught easily with a simple diff of the results before and after the change, but our current build flow makes that really painful to do.
I suspect everyone is on board with that direction, and hopefully it just falls out naturally from any static-site build plan, but that was what I was trying to get at earlier.
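The before/after diff of build results can be sketched very simply. The two trees below stand in for the output of the import/build step run before and after a change to the import code; the file names are made up.

```shell
# Fake the two build results.
mkdir -p before after
echo '<p>old render</p>' > before/git-config.html
cp before/git-config.html after/git-config.html
echo '<p>changed render</p>' > after/git-commit.html

# diff -rq exits non-zero when the trees differ, which is exactly the
# signal needed to catch unintended fallout from an import refactor.
if diff -rq before after > changes.txt; then
  echo 'no changes'
else
  echo 'pages changed:'
  cat changes.txt
fi
```

Once the build can be run locally against the filesystem, this kind of check is a one-liner before sending a fix to the import routines.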