
only a changed archive/dataset should result in a new version

Open jhpoelen opened this issue 7 years ago • 10 comments

IPT provides a way to periodically publish a new version of a dataset.

expected: periodically check for changes, but only publish a new version if changes occurred.

actual: new versions are published even though no changes occurred.

example:

The versions published through https://data.gbif.no/ipt/resource?r=trom_entomology suggest a daily publication period. However, the versions published on 2018-08-31 and 2018-09-03 (see attached dwca) have identical occurrence data. The only differences in the eml file are the publication date and version information.

When running a diff on the packaged eml.xml, we find:

5c5
<          packageId="f2a77c80-1e74-4c23-a3c9-c52cede89434/v1.231" system="http://gbif.org" scope="system"
---
>          packageId="f2a77c80-1e74-4c23-a3c9-c52cede89434/v1.234" system="http://gbif.org" scope="system"
38c38
<       2018-08-31
---
>       2018-09-03
79c79
<           <dc:replaces>f2a77c80-1e74-4c23-a3c9-c52cede89434/v1.231.xml</dc:replaces>
---
>           <dc:replaces>f2a77c80-1e74-4c23-a3c9-c52cede89434/v1.234.xml</dc:replaces>

When running a diff on occurrence.txt, we find that the files are identical.

dwca-f2a77c80-1e74-4c23-a3c9-c52cede89434-20180903.zip dwca-f2a77c80-1e74-4c23-a3c9-c52cede89434-20180831.zip

When publishing new versions on changed content only, a more intuitive and direct relationship between changes and versions is established.
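The comparison described above can be repeated without unpacking the archives by hand. The following is a minimal sketch, not part of the IPT: it hashes every file inside two Darwin Core archives and lists the ones that differ, skipping eml.xml by default because its version and date fields change on every publication. The function names and the `ignore` parameter are assumptions made for illustration.

```python
import hashlib
import zipfile


def member_digests(archive):
    """Map each file name in a zip archive to the SHA-256 of its bytes."""
    digests = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            sha = hashlib.sha256()
            with zf.open(info) as member:
                # Read in 1 MiB chunks so large occurrence files fit in memory.
                for chunk in iter(lambda: member.read(1 << 20), b""):
                    sha.update(chunk)
            digests[info.filename] = sha.hexdigest()
    return digests


def changed_members(old_archive, new_archive, ignore=("eml.xml",)):
    """Return the sorted names of files that differ between two archives.

    Files named in `ignore` are skipped (hypothetical default: eml.xml).
    A file present in only one archive counts as changed.
    """
    old = member_digests(old_archive)
    new = member_digests(new_archive)
    names = (set(old) | set(new)) - set(ignore)
    return sorted(name for name in names if old.get(name) != new.get(name))
```

Calling `changed_members` with the paths of the two attached archives would list any data files that differ between them; an empty list means only the ignored metadata changed.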

jhpoelen avatar Oct 29 '18 20:10 jhpoelen

By the way, I was able to do this analysis by using Preston @ https://preston.guoda.bio to track all of the datasets and emls in the GBIF network over time.

fyi @dagendresen

jhpoelen avatar Oct 29 '18 20:10 jhpoelen

Another way to see this is with the ingestion history interface — the records are always 100% unmodified for this dataset: https://management-tools.gbif.org/crawl-history?uuid=f2a77c80-1e74-4c23-a3c9-c52cede89434

Which is mostly this API call nicely formatted: https://api.gbif.org/v1/dataset/f2a77c80-1e74-4c23-a3c9-c52cede89434/process?limit=500

MattBlissett avatar Oct 30 '18 09:10 MattBlissett

@MattBlissett thanks for sharing! Perhaps a bit off-topic, but I was wondering whether there is a way to find these ingested datasets by their content hashes (e.g., md5, sha256)? If not, did you consider calculating sha256 hashes so that you can do things like https://github.com/bio-guoda/preston#generating-citations and https://github.com/bio-guoda/preston#finding-copies-with-hash-archiveorg ?

jhpoelen avatar Oct 30 '18 15:10 jhpoelen

@MattBlissett a question about the management tool and the process api - is there a way to retrieve the content that was crawled in the past?

jhpoelen avatar Oct 30 '18 15:10 jhpoelen

whether there is a way to find these ingested datasets by their content hashes

Not that we maintain. The GBIF registry and the IPT support adding alternative identifiers to datasets, but at the moment I think they're overwritten when we re-read EML from a DWCA. Normally the identifier goes inside the DWCA, but in this case that's no good.

is there a way to retrieve the content that was crawled in the past?

Maybe.

We keep a quarterly archive of the GBIF occurrence database, though there isn't any public interface for this. (There's no internal interface either, other than SQL queries.)

GBIF downloads are kept for as long as is practical (it's not practical to keep all the large ones); the DWCA format has the verbatim records.

Neither of these really provides old versions of records.

MattBlissett avatar Oct 30 '18 16:10 MattBlissett

So, in summary: data consumers and/or IPTs are responsible for archiving their downloaded or source data, respectively.

Also, I learned that a method exists in the GBIF management tool suite to detect changes in (occurrence) records. Would it be possible to use that change detection method and apply it such that IPT software only publishes archives on content updates?

jhpoelen avatar Oct 30 '18 22:10 jhpoelen

After some digging, I think I might have found the piece of code that determines whether an occurrence record (or fragment?) is new, was updated, or was left unchanged:

https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/FragmentProcessor.java#L104

Still some work needed to figure out the details...

jhpoelen avatar Dec 10 '18 16:12 jhpoelen

The idea of determining if there is a changeset seems sensible to consider, although in 10 years of operation it's only been raised once that I can see.

Are there others who'd like it?

Relatedly, it would perhaps be sensible to store hash codes next to the archives since they reside on the same URL. This would allow clients to determine if there are changes before downloading.
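One way to picture this suggestion is a small "sidecar" file holding the archive's digest. The sketch below is an assumption-laden illustration, not an existing IPT feature: the `.sha256` suffix, the function names, and the choice of SHA-256 are all hypothetical. A publisher would write the sidecar when it writes the archive, and a client would fetch only the small sidecar and compare it against its local copy.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()


def write_sidecar(archive_path):
    """Publisher side: write `<archive>.sha256` next to the archive.

    The sidecar naming is a hypothetical convention for this sketch.
    """
    archive_path = Path(archive_path)
    sidecar = archive_path.with_name(archive_path.name + ".sha256")
    sidecar.write_text(sha256_of(archive_path) + "\n", encoding="ascii")
    return sidecar


def needs_download(published_digest, local_archive):
    """Client side: True if there is no local copy or its digest differs."""
    local_archive = Path(local_archive)
    if not local_archive.exists():
        return True
    return published_digest.strip().lower() != sha256_of(local_archive)
```

Note that a digest over the whole zip changes whenever eml.xml changes, so in this form it only saves downloads if an unchanged dataset is not repackaged with new version metadata.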

timrobertson100 avatar Apr 28 '21 10:04 timrobertson100

One of the things that bugs me a bit is that changing the metadata (e.g. correcting a typo, adding something per request of the author) results in a new version. At Zenodo for example changes to the metadata don't require (but allow) a new version, while changes to the data always do. I would like that approach for the IPT as well.
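Combined with the original request, this policy could be sketched as a small decision function. This is only an illustration under stated assumptions, not how the IPT or Zenodo work internally: the dictionary keys "data" and "metadata", the returned action names, and the use of content digests as inputs are all hypothetical.

```python
def publication_action(previous, current):
    """Decide what a scheduled publication run should do.

    `previous` and `current` are dicts with the hypothetical keys "data" and
    "metadata", each holding a content digest. `previous` is None when
    nothing has been published yet.
    """
    if previous is None or previous["data"] != current["data"]:
        # Data changes always produce a new version.
        return "new_version"
    if previous["metadata"] != current["metadata"]:
        # Metadata-only changes may be applied without a new version.
        return "update_metadata"
    # Nothing changed: do not publish.
    return "skip"
```

For this to work, the metadata digest would have to be computed with the generated fields (package ID, publication date, dc:replaces) left out, since those differ on every run, as the eml.xml diff at the top of this issue shows.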

peterdesmet avatar Apr 28 '21 11:04 peterdesmet

@timrobertson100 thanks for taking another look at the IPT version idea I shared years ago. I appreciate that you take care to revisit older issues.

Also, I'd say that a reason that few brought up this idea is that versioning is hard and it takes a while for a community like ours to gradually get better at data (management) hygiene. I still remember the pains of getting started with cvs/subversion to do version control. Oef. That was hard!

@peterdesmet I much like the idea of adding content hashes (e.g., hash://sha256/abc123...) to reliably identify content, for the reasons mentioned. Perhaps this could also be a way to identify the data source version in the GBIF "mediated" records?

jhpoelen avatar Apr 28 '21 14:04 jhpoelen