dataverse icon indicating copy to clipboard operation
dataverse copied to clipboard

Support sitemaps with more than 50,000 items

Open jeromeroucou opened this issue 1 year ago • 18 comments

What this PR does / why we need it:

Actual generation sitemap file doesn't take care the numbers of entries. But sitemap specification indicate "each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes)" (see sitemap.org)

Which issue(s) this PR closes:

Closes #8936

Special notes for your reviewer:

Sitemap generation is done by sitemapgen4j library, which has the advantage of simplifying the code. I also use the new java 8 date formatter DateTimeFormatter which is threadsafe. I make this PR as a draft because issue have been moved in "SPRINT READY" in IQSS Dataverse Project.

Suggestions on how to test this:

A new Unit Test have been added. And if the code is lunch into a running Dataverse, you can also call the API (for ex with curl -X POST http://localhost:8080/api/admin/sitemap).

Is there a release notes update needed for this change?:

No, I don't thinks.

Additional documentation:

Installation configuration need to be updated with the new file name of the sitemap.

jeromeroucou avatar Feb 14 '24 15:02 jeromeroucou

Coverage Status

coverage: 20.682% (+0.1%) from 20.57% when pulling f6b0438baf8f6455116ec667b8c159bcd299688f on Recherche-Data-Gouv:8936_huge_sitemap into 4f46d157263312d65d9c6688831425ec1b437c4a on IQSS:develop.

coveralls avatar Feb 14 '24 15:02 coveralls

Hi @jeromeroucou @pdurbin is this set to be QAed? (or does it still need more review / follwoup for the review).

Also, can we change this from not being a draft?

scolapasta avatar Feb 28 '24 20:02 scolapasta

Just a general note here: the dependency added has not been updated since 2019 (no activity, stale PRs and issues) and uses old Java APIs. At least it doesn't introduce any other transitive dependencies. There are some forks with small patches.

My suggestion: let's create a fork at github.com/gdcc and have our own package at Maven Central. In case we need patches because of bugs or changes at search engines, we would be much quicker to get this done. It also allows us to modernize the Java code over time.

poikilotherm avatar Feb 29 '24 08:02 poikilotherm

Good catch @poikilotherm ! If there is possible to have a specific fork with patches, it could be nice, and I will be happy to change my PR with this dependency.

jeromeroucou avatar Feb 29 '24 14:02 jeromeroucou

@poikilotherm not a bad suggestion. So we'd fork https://github.com/dfabulich/sitemapgen4j under https://github.com/gdcc

@scolapasta and others reading this, what do you think?

pdurbin avatar Feb 29 '24 19:02 pdurbin

I see a 👍 from @scolapasta and no objections in Slack so I forked it to https://github.com/gdcc/sitemapgen4j

Now what? Fix it up and release to Maven Central? Do we want a separate issue to track this? @jeromeroucou are you interested in helping?

pdurbin avatar Feb 29 '24 19:02 pdurbin

@pdurbin Yes I am interested. And I also think a separate issue on the forked project can be better to review patchs applications with PR

jeromeroucou avatar Mar 01 '24 16:03 jeromeroucou

@jeromeroucou great thanks. @poikilotherm is already working away in this "modernize" branch, if you'd like to try building it: https://github.com/gdcc/sitemapgen4j/tree/modernize

pdurbin avatar Mar 01 '24 16:03 pdurbin

He doesn't even have to. Just enable Maven Central Snaps and point to the new coordinates: io.gdcc:sitemapgen4j:2.0.0-SNAPSHOT

poikilotherm avatar Mar 01 '24 18:03 poikilotherm

https://github.com/gdcc/sitemapgen4j has been released as v2.0.0 to Maven Central. (Might take another hour until available from all mirrors.)

Have fun! (If we need fixes, updated -SNAPSHOT versions will be auto-released to Central now, too. Hopefully will make things a lot easier!)

poikilotherm avatar Mar 04 '24 08:03 poikilotherm

I have just released a version 2.1.0 that has again more optimizations. Most important: addressed loads of code smells and removed obsolete, no longer supported Google Sitemap Extensions.

poikilotherm avatar Mar 04 '24 12:03 poikilotherm

@jeromeroucou Is this all ready for QA now? (if so can you change it from draft - I jsut want to be sure there's not something else we're waiting for here)

scolapasta avatar Mar 18 '24 20:03 scolapasta

Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18

jeromeroucou avatar Mar 19 '24 13:03 jeromeroucou

Hi @scolapasta , the PR is not yet ready as it's waiting for the sitemap4j library namespaces fix, referenced by gdcc/sitemapgen4j#18

Ah, OK, that makes sense. We'll keep an eye on for when you switch it out of draft state.

scolapasta avatar Mar 19 '24 15:03 scolapasta

@jeromeroucou I merged your PR and a new SNAPSHOT version is available. Would you mind testing this in Dataverse again before I cut a release?

poikilotherm avatar Mar 20 '24 09:03 poikilotherm

@poikilotherm Thanks for the update ! On my laptop with the latest SNAPSHOT version, unit tests and sitemap generation with some dataverses and datasets work fine. When the new release will be published, I'll mention it on pom.xml file.

jeromeroucou avatar Mar 20 '24 14:03 jeromeroucou

@jeromeroucou sitemap4j 2.1.2 is released

poikilotherm avatar Mar 21 '24 08:03 poikilotherm

@poikilotherm thanks for you work on the sitemapgen4j ! The 2.1.2 version is now used on pom.xml

jeromeroucou avatar Mar 21 '24 10:03 jeromeroucou

@jeromeroucou I took a close look at the documentation and I'm suggesting a significant rewrite here:

  • https://github.com/Recherche-Data-Gouv/dataverse/pull/6

If you like the changes, please feel free to merge. Otherwise, I'm happy to talk about it! 😄

pdurbin avatar Apr 18 '24 19:04 pdurbin

I looked at the code and poked around with the new testHugeSiteMap test. I'm basically ready to approve this and move it to "ready for QA" but I'd like to see the doc PR I made above get merged first.

pdurbin avatar Apr 18 '24 19:04 pdurbin

After discussion with @pdurbin, a new commit have been done to fix the sitemap URL in case of sitemap_index.xml file. A example below :

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap1.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://domain.tld/sitemap/sitemap2.xml</loc>
    <lastmod>2024-04-22</lastmod>
  </sitemap>
</sitemapindex>

In <loc> tag, if /sitemap/ isn't present, the location is invalid. There is no "pretty-faces" rewrite for these files (sitemap1.xml, ...), unlike sitemap.xml and sitemap_index.xml files.

I've also modified the unit tests to validate this modification, and I've also added some comments.

jeromeroucou avatar Apr 24 '24 14:04 jeromeroucou

Merging. (I checked out and synced the branch with develop locally; but it seems safe to merge without updating it in back in Recherche-Data-Gouv fork... one way to find out).

landreev avatar May 08 '24 18:05 landreev