Support sitemaps with more than 50,000 items
What this PR does / why we need it:
Actual generation sitemap file doesn't take care the numbers of entries. But sitemap specification indicate "each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes)" (see sitemap.org)
Which issue(s) this PR closes:
Closes #8936
Special notes for your reviewer:
Sitemap generation is done by sitemapgen4j library, which has the advantage of simplifying the code. I also use the new java 8 date formatter DateTimeFormatter which is threadsafe.
I make this PR as a draft because issue have been moved in "SPRINT READY" in IQSS Dataverse Project.
Suggestions on how to test this:
A new Unit Test have been added. And if the code is lunch into a running Dataverse, you can also call the API (for ex with curl -X POST http://localhost:8080/api/admin/sitemap).
Is there a release notes update needed for this change?:
No, I don't thinks.
Additional documentation:
Installation configuration need to be updated with the new file name of the sitemap.
coverage: 20.682% (+0.1%) from 20.57% when pulling f6b0438baf8f6455116ec667b8c159bcd299688f on Recherche-Data-Gouv:8936_huge_sitemap into 4f46d157263312d65d9c6688831425ec1b437c4a on IQSS:develop.
Hi @jeromeroucou @pdurbin is this set to be QAed? (or does it still need more review / follwoup for the review).
Also, can we change this from not being a draft?
Just a general note here: the dependency added has not been updated since 2019 (no activity, stale PRs and issues) and uses old Java APIs. At least it doesn't introduce any other transitive dependencies. There are some forks with small patches.
My suggestion: let's create a fork at github.com/gdcc and have our own package at Maven Central. In case we need patches because of bugs or changes at search engines, we would be much quicker to get this done. It also allows us to modernize the Java code over time.
Good catch @poikilotherm ! If there is possible to have a specific fork with patches, it could be nice, and I will be happy to change my PR with this dependency.
@poikilotherm not a bad suggestion. So we'd fork https://github.com/dfabulich/sitemapgen4j under https://github.com/gdcc
@scolapasta and others reading this, what do you think?
I see a 👍 from @scolapasta and no objections in Slack so I forked it to https://github.com/gdcc/sitemapgen4j
Now what? Fix it up and release to Maven Central? Do we want a separate issue to track this? @jeromeroucou are you interested in helping?
@pdurbin Yes I am interested. And I also think a separate issue on the forked project can be better to review patchs applications with PR
@jeromeroucou great thanks. @poikilotherm is already working away in this "modernize" branch, if you'd like to try building it: https://github.com/gdcc/sitemapgen4j/tree/modernize
He doesn't even have to. Just enable Maven Central Snaps and point to the new coordinates: io.gdcc:sitemapgen4j:2.0.0-SNAPSHOT
https://github.com/gdcc/sitemapgen4j has been released as v2.0.0 to Maven Central. (Might take another hour until available from all mirrors.)
Have fun! (If we need fixes, updated -SNAPSHOT versions will be auto-released to Central now, too. Hopefully will make things a lot easier!)
I have just released a version 2.1.0 that has again more optimizations. Most important: addressed loads of code smells and removed obsolete, no longer supported Google Sitemap Extensions.
@jeromeroucou Is this all ready for QA now? (if so can you change it from draft - I jsut want to be sure there's not something else we're waiting for here)
Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18
Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18
Ah, OK, that makes sense. We'll keep an eye on for when you switch it out of draft state.
@jeromeroucou I merged your PR and a new SNAPSHOT version is available. Would you mind testing this in Dataverse again before I cut a release?
@poikilotherm Thanks for the update ! On my laptop with the latest SNAPSHOT version, unit tests and sitemap generation with some dataverses and datasets work fine. When the new release will be published, I'll mention it on pom.xml file.
@jeromeroucou sitemap4j 2.1.2 is released
@poikilotherm thanks for you work on the sitemapgen4j ! The 2.1.2 version is now used on pom.xml
@jeromeroucou I took a close look at the documentation and I'm suggesting a significant rewrite here:
- https://github.com/Recherche-Data-Gouv/dataverse/pull/6
If you like the changes, please feel free to merge. Otherwise, I'm happy to talk about it! 😄
I looked at the code and poked around with the new testHugeSiteMap test. I'm basically ready to approve this and move it to "ready for QA" but I'd like to see the doc PR I made above get merged first.
After discussion with @pdurbin, a new commit have been done to fix the sitemap URL in case of sitemap_index.xml file. A example below :
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://domain.tld/sitemap/sitemap1.xml</loc>
<lastmod>2024-04-22</lastmod>
</sitemap>
<sitemap>
<loc>https://domain.tld/sitemap/sitemap2.xml</loc>
<lastmod>2024-04-22</lastmod>
</sitemap>
</sitemapindex>
In <loc> tag, if /sitemap/ isn't present, the location is invalid. There is no "pretty-faces" rewrite for these files (sitemap1.xml, ...), unlike sitemap.xml and sitemap_index.xml files.
I've also modified the unit tests to validate this modification, and I've also added some comments.
Merging. (I checked out and synced the branch with develop locally; but it seems safe to merge without updating it in back in Recherche-Data-Gouv fork... one way to find out).