dvc.org
regular: fix expired and broken links
See https://github.com/iterative/dvc.org/actions/workflows/link-check-all.yml
Run iterative/[email protected]
* content/blog/2020-07-16-devops-for-data-scientists.md
- http://engineering.microsoft.com/devops/ (404)
* content/blog/2020-11-11-november-20-dvc-heartbeat.md
- https://torontomachinelearning.com/ (409)
- https://torontomachinelearning.com/ (409)
* content/blog/2020-12-30-december-20-community-gems.md
- https://github.com/iterative/cml/blob/master/docker/Dockerfile (404)
* content/blog/2021-02-22-cml-runner-prerelease.md
- https://github.com/iterative/cml/blob/master/docker/Dockerfile (404)
* content/blog/2021-04-16-april-21-dvc-heartbeat.md
- https://weworkremotely.com/remote-jobs/iterative-senior-frontend-engineer (404)
* content/docs/cml/self-hosted-runners.md
- https://github.com/iterative/cml/blob/master/docker/Dockerfile (404)
* content/docs/cml/usage.md
- https://github.com/iterative/cml/blob/master/docker/Dockerfile (404)
Seems like the most important ones (last 2) are about CML cc @casperdcl
Do we want to maintain old blogs like this though? I guess these are not so old so why not fix them indeed...
Do we want to maintain old blogs like this though?
yes, I would maintain. This is good to have a healthy website to my mind.
The link-check is showing some more broken doc links (went ahead and tested them to confirm they weren't false positives):
* content/docs/command-reference/status.md
- /doc/command-reference/reproduce = https://dvc.org/doc/command-reference/reproduce (404)
* content/docs/dvclive/dvclive-with-dvc.md
- /doc/dvclive/api-reference/get_step = https://dvc.org/doc/dvclive/api-reference/get_step (404)
* content/docs/dvclive/ml-frameworks/mmcv.md
- https://github.com/iterative/dvclive/blob/master/dvclive/mmcv.py (404)
* content/docs/dvclive/ml-frameworks/pytorch-lightning.md
- https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html#automatic-saving (404)
cc @iterative/docs
I'll fix the broken links from content. @jendefig what should we do about all the broken links from old blog posts, want to look or re-assign within devrel? Thanks
p.s. latest list: https://github.com/iterative/dvc.org/runs/5530668423?check_suite_focus=true
Also, @daavoo what about this broken link?
* content/docs/dvclive/ml-frameworks/mmcv.md
- https://github.com/iterative/dvclive/blob/master/dvclive/mmcv.py (404)
Do we even still support that ML framework in DVCLive? Thanks
Ping @jendefig and @daavoo 🙂
yes, I would maintain. This is good to have a healthy website to my mind.
To my mind it's more important to have a healthy workforce. Spending time updating links on old blogs that aren't being looked at very much, and links that probably have a low click-through rate anyway, falls into the P2-nice-to-have category. Our mountain of higher priority tasks is large. This is very low priority for me.
It's in my backlog of to-dos, in simmer mode.
Maybe @iterative/websites should be free to remove broken links from old blogs as needed so the check stops failing.
I can take this on while refreshing the list and investigating why link check fails these.
Thank you @rogermparent 🙏
Thanks @rogermparent , ping me if you have doubts about certain links, I can help with them in Slack. Should be quick.
Maybe https://github.com/orgs/iterative/teams/websites should be free to remove broken links from old blogs as needed so the check stops failing.
yes, websites (and everyone tbh) is free to take and edit any blog post and create a PR
This is very low priority for me.
yep, it's def p2. We should keep our home clean though. Together. If no one pays attention to anything, website quality will deteriorate quickly, and this is not acceptable. There should be an easy process to do this. And if it's done more or less regularly, I doubt it would take more than 1-5 minutes a week from anyone.
One suggestion btw is to remove or archive (put a message that it's outdated and not maintained, remove it from search and the landing page) older blog posts, especially things like gems and heartbeats. Eventually garbage collect them completely. This way we can put an exception into the link checker to skip these outdated posts entirely.
Btw, another reason for this, and this is p1 (even p0): the whole intention of having a link checker is to detect broken links that are important (e.g. on a landing page, on a recent blog post, in docs, etc.). It's super important to keep notifications denoised, otherwise we get fatigue and can miss important things among older / non-relevant links. Again, it can be addressed with some simple measures: remove old blog posts, archive them, remove / fix links as we find them.
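To illustrate the denoising idea, here is a minimal sketch (not the actual link checker) that splits a report in the `* file` / `- link (status)` format shown earlier in this thread into high-priority and low-priority groups. The function names and the choice to treat only `content/docs/` as high priority are assumptions made for the example.

```python
import re

# Matches report lines such as "- https://example.com/page (404)".
LINK_LINE = re.compile(r"^- (.+) \((\d{3})\)$")


def parse_report(text):
    """Map each reported file to the (link, status) pairs listed under it."""
    report = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("* "):
            current = line[2:].strip()
            report.setdefault(current, [])
        elif current is not None:
            match = LINK_LINE.match(line)
            if match:
                report[current].append((match.group(1), int(match.group(2))))
    return report


def split_by_priority(report, high_priority_prefixes=("content/docs/",)):
    """Return (high, low) dicts; the prefixes are a hypothetical policy."""
    high, low = {}, {}
    for path, links in report.items():
        target = high if path.startswith(high_priority_prefixes) else low
        target[path] = links
    return high, low
```

With a split like this, a failing check could be reserved for the high-priority group, while the low-priority group is only logged.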
And to add more color to this, some actual stats, top visited pages, according to plausible:
/blog/iterative-studio-model-registry 1.8k
/blog 694
/blog/aws-remotes-in-dvc 505
/blog/DVC-VS-Code-extension 455
/blog/shtab-completion-release 214
/blog/azure-remotes-in-dvc
/blog/ml-experiment-versioning 154
/blog/using-gcp-remotes-in-dvc 100
Some of them are quite old (shtab release!), and some of them will be used as tutorials (GCP, AWS; we should have done these in docs, not as blogs, in the first place, and I would love the dev rel team to also participate in this, since it's clear that those topics are important).
Everything else has <100 visits, and those that are very old we can start archiving at some point.
@rogermparent per @shcheklein's suggestion, to help with this one: is it possible to set up the blog to continually archive Heartbeat and Community Gems posts that are more than a year old? (Archive somehow, as opposed to just trash; there may be reasons we want to revisit these.)
This should help with link problems, clean up the old images that differ in style, and hopefully eliminate most of the noise problem. Whatever links show up after that would be worth changing, as the material is either more recent or from a tutorial or release.
Old Heartbeat and Gems posts are not really revisited like the release and tutorial posts, so just those two types of posts would drop off.
I suppose it would depend on the definition of "archive", but most definitions I can imagine are things we can do.
- automatically not creating pages that are older than a certain date, while still retaining them in the repo, would be an easy change to the blog engine, but I imagine that's not what we're thinking
- automatically making some sort of lesser page for archived pages is also on the table, but doesn't solve the link check issue.
- automatically not checking links on pages older than an arbitrary date could be also done, the implementation would involve adding date-checking functionality either to link-check itself or just the GitHub Action that calls it (the latter would probably be better, at least if you subscribe to the unix philosophy as an ideal like I do)
I'm thinking we want the latter, so I'll default to thinking about link check improvements to enable that. Adding in the ability to specify input files is an easy start; we'll need at least that even if we have the date checking done by GitHub Actions.
If we want a post archiving feature that's more involved I can do that too, but it seems we're mostly just talking about link check here.
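As a rough sketch of the date-checking step, assuming it runs as a small script before link-check is called: blog filenames in this repo start with a `YYYY-MM-DD-` prefix, so the post date can be read from the path. The function names and the one-year default are hypothetical, and how the resulting list would be passed to link-check is left open, since that input-files option does not exist yet.

```python
import re
from datetime import date, timedelta
from pathlib import PurePosixPath

# Blog posts are named like "2020-07-16-devops-for-data-scientists.md".
DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-")


def post_date(path):
    """Return the date encoded in a filename, or None if there is none."""
    match = DATE_PREFIX.match(PurePosixPath(path).name)
    if match is None:
        return None
    try:
        return date(*map(int, match.groups()))
    except ValueError:
        # A prefix such as "2020-13-40-" is not a real date; treat as undated.
        return None


def files_to_check(paths, today, max_age_days=365):
    """Keep undated files (e.g. docs) and posts newer than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for path in paths:
        posted = post_date(path)
        if posted is None or posted >= cutoff:
            kept.append(path)
    return kept
```

Files without a date prefix, such as everything under `content/docs/`, are always kept, so only old dated posts drop out of the check.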