
Publishing a new version by a script (command line)

meliezer opened this issue 4 years ago • 19 comments

Hello, would it be possible to add an option to create a new version after I overwrite the source files with a script, or simply because the database content has changed, perhaps via an API called with curl? The HTTP response code would tell my script if something went wrong, and the details would be found in the validation log.

Cheers, Menashè

meliezer avatar Jun 17 '20 07:06 meliezer

Hello @meliezer

I totally agree with this feature request. On my side, I'm using a Python script to publish and register a list of resources automatically.

I list the resource IDs in a file (one ID per row). The Python script then makes the HTTP calls to log in and to publish or register each resource. https://github.com/gbiffrance/ipt-batch-import/blob/master/src/py/Automate-IPT-INPN.py
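Roughly, the batch loop looks like this (a simplified sketch only; the login and publish paths below are placeholders, see the linked Automate-IPT-INPN.py for the actual IPT URLs and parameters):

    # Sketch of a batch publish loop against an IPT. The paths are placeholders,
    # not verified IPT routes; Automate-IPT-INPN.py has the real calls.
    import requests

    IPT_BASE = "https://my-ipt.example.org"  # placeholder IPT base URL

    def publish_all(resource_ids_file, email, password):
        session = requests.Session()
        # Log in once and reuse the session cookie for the following calls
        session.post(f"{IPT_BASE}/login.do", data={"email": email, "password": password})

        with open(resource_ids_file) as f:
            for resource_id in (line.strip() for line in f if line.strip()):
                # Trigger publication of one resource and check the HTTP status
                r = session.post(f"{IPT_BASE}/manage/publish.do", data={"r": resource_id})
                print(resource_id, r.status_code)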

But I agree that a REST API would be better!

sylvain-morin avatar Nov 02 '20 10:11 sylvain-morin

I'm not sure a full REST API for the IPT makes sense, as it would quickly become the GBIF registry, but having a small management API for key features like the ones @meliezer and @sylmorin-gbif need would make the tool more useful for their workflows.

I suggest we keep the API small initially so that it can be included in a 2.5.1 or 2.5.2 release

timrobertson100 avatar Apr 28 '21 10:04 timrobertson100

Thank you @timrobertson100 ! Personally I would like to:

  1. Upload source files, overwriting them if they already exist.
  2. Map them (auto-mapping, for simple cases). If that is not possible, then only allow remapping after the mapping has first been defined in the web interface.
  3. Publish.
  4. Register.

The first two steps are the most important.

meliezer avatar Apr 28 '21 14:04 meliezer

Perhaps this could also be possible?

  1. Update metadata (i.e. passing a JSON dictionary of metadata as the request body)

abubelinha avatar Apr 29 '21 07:04 abubelinha

Update metadata (i.e. passing a JSON dictionary of metadata as the request body)

That relates to #955.

MattBlissett avatar Apr 29 '21 08:04 MattBlissett

I'm not sure a full REST API for the IPT makes sense, as it quickly becomes the GBIF registry

@timrobertson100: did you mean it would be easier to publish datasets directly to the registry using its current API?

i.e., is there a way to avoid IPT hosting and keep our datasets and their metadata somewhere (e.g. a GitHub repository) that we can tell the GBIF registry to read from? Is anyone using this approach? Any example scripts or pseudocode?

Thanks @abubelinha

abubelinha avatar Nov 17 '22 11:11 abubelinha

If you have a Darwin Core Archive (or just an EML file for a metadata-only dataset) on a public URL you can register it directly with GBIF.

https://github.com/gbif/registry/tree/dev/registry-examples/src/test/scripts has an example using Bash, which is obviously not production-ready in any way! There are probably 20 or so publishers registering datasets in this way.

Dataset metadata is read from the EML file within the DWCA. You need to keep track of what GBIF dataset UUID is assigned to datasets you have registered, so you can update them.
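In Python, those registration calls boil down to roughly this (an untested sketch against the UAT environment; the Bash script above is the reference):

    # Rough sketch of registering an externally hosted DwC-A with the GBIF registry
    # (UAT environment shown). Untested; based on the registry-examples Bash script.
    import requests

    API = "https://api.gbif-uat.org/v1"
    AUTH = ("your_gbif_username", "your_password")  # account authorized for the publisher

    def register_dataset(organization_key, installation_key, title, archive_url):
        # 1) Create the dataset with minimal metadata; GBIF overwrites it from the EML later
        dataset = {
            "publishingOrganizationKey": organization_key,
            "installationKey": installation_key,
            "type": "OCCURRENCE",
            "title": title,
            "language": "eng",
            "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
        }
        r = requests.post(f"{API}/dataset", json=dataset, auth=AUTH)
        r.raise_for_status()
        dataset_key = r.json()  # the registry returns the new dataset UUID

        # 2) Tell GBIF where the archive lives
        endpoint = {"type": "DWC_ARCHIVE", "url": archive_url}
        requests.post(f"{API}/dataset/{dataset_key}/endpoint",
                      json=endpoint, auth=AUTH).raise_for_status()

        # Store dataset_key somewhere permanent so the dataset is never re-registered as new
        return dataset_key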

MattBlissett avatar Nov 17 '22 11:11 MattBlissett

There are probably 20 or so publishers registering datasets in this way.

Thanks a lot! Is it possible to somehow search for or find out who those publishers are?
(If anyone publishes DwC-A files on GitHub, maybe they share their publishing protocols/code too.)

abubelinha avatar Nov 17 '22 12:11 abubelinha

To sum up, we should:

  1. put our DwC-A archives (zip files) on a web server; each archive will be accessible at a direct URL like http://myhost/dwca12345.zip

  2. register (once) this web server as an "installation" to get an installationKey; this means calling POST https://api.gbif.org/v1/installation with a body like this one: https://api.gbif.org/v1/installation/a957a663-2f17-415f-b1c8-5cf6398df8ed but with the installation type set to HTTP_INSTALLATION

  3. if the dataset is a new one (not yet registered), we have two calls to make:

a) POST https://api.gbif-uat.org/v1/dataset with the following body:

    {
      "publishingOrganizationKey": "$ORGANIZATION",
      "installationKey": "$INSTALLATION",
      "type": "OCCURRENCE",
      "title": "Example dataset registration",
      "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
      "language": "eng",
      "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
    }

(copied from the scripts @MattBlissett mentioned, thanks!)

This creates the dataset on GBIF.org and returns the GBIF UUID of the dataset (let's call it aaaa-bbbb-cccc-dddd).

b) POST https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd/endpoint with the following body:

    {
      "type": "DWC_ARCHIVE",
      "url": "http://myhost/dwca12345.zip"
    }

This tells GBIF.org how to access the DwC-A archive on our web server.

  4. if the dataset is already registered, we are doing an update. Let's say we overwrite the file on our web server (so the URL to access it stays the same).

We call PUT https://api.gbif-uat.org/v1/dataset/aaaa-bbbb-cccc-dddd with the correct GBIF UUID and roughly the same body as in 3a).

Is that enough to trigger the update? Since the URL has not changed, we don't have to call the "endpoint" URL again, right?

sylvain-morin avatar Nov 17 '22 12:11 sylvain-morin

as written in @mike-podolskiy90's register.sh script:

Using a process like this, you should make sure you store the UUID GBIF assigns to your dataset, so you don't accidentally re-register existing datasets as new ones.

If our archives are named with stable internal UUIDs, we can even rely on the GBIF.org API alone.

Calling http://api.gbif.org/v1/organization/1928bdf0-f5d2-11dc-8c12-b8a03c50a862/publishedDataset for your publishing organization gives you all the information needed to map your internal identifiers to the GBIF UUIDs (using the endpoints section).

That's what I do with some scripts to compare what is on GBIF.org with what is on my IPT after an update.
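For example, something like this (rough sketch):

    # Sketch: map archive URLs (which contain our internal UUIDs) to GBIF dataset UUIDs
    # by paging through the organization's published datasets.
    import requests

    API = "https://api.gbif.org/v1"
    ORG_KEY = "1928bdf0-f5d2-11dc-8c12-b8a03c50a862"  # your publishing organization

    def gbif_keys_by_endpoint_url():
        mapping, offset, limit = {}, 0, 100
        while True:
            r = requests.get(f"{API}/organization/{ORG_KEY}/publishedDataset",
                             params={"limit": limit, "offset": offset})
            r.raise_for_status()
            page = r.json()
            for dataset in page["results"]:
                for endpoint in dataset.get("endpoints", []):
                    # e.g. http://myhost/dwca<internal-uuid>.zip -> GBIF dataset key
                    mapping[endpoint["url"]] = dataset["key"]
            if page["endOfRecords"]:
                return mapping
            offset += limit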

sylvain-morin avatar Nov 17 '22 12:11 sylvain-morin

@timrobertson100 you told me to do this 2 years ago... but I love the IPT too much to abandon it :-)

I guess it's time for me to migrate to this solution - having 10K datasets on my IPT is becoming difficult to handle.

I'm just wondering whether the migration will be easy. For each of my current IPT datasets, I will call:

  1. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

    {
      "publishingOrganizationKey": "$ORGANIZATION",
      "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
      "type": "OCCURRENCE",
      "title": "Example dataset registration",
      "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
      "language": "eng",
      "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
    }

to update the installationKey

  2. PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/endpoint

    {
      "type": "DWC_ARCHIVE",
      "url": "http://myhost/dwca12345.zip"
    }

to update the URL endpoint of the archive

Can we update the installationKey of an existing dataset? Won't GBIF.org block me?

Which account should be used for these calls? Can I do these operations with any account, or is there some specific registration needed for the account?

Currently, I don't care about this, since it's the IPT that is doing the registry calls.

sylvain-morin avatar Nov 17 '22 12:11 sylvain-morin

If there are no modifications to make you can call (with authentication) GET https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc/crawl to request we re-crawl/interpret the dataset. Please don't do this with 10000 datasets at once -- for that many either run batches of about 200 and wait for them to complete, or just wait for the weekly crawl which will happen within 7 days anyway.
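For that many, a batched loop might look roughly like this (sketch only; a real script should check that each batch has actually finished crawling rather than just sleeping):

    # Sketch: request re-crawls in batches (UAT shown). A real script should verify
    # that each batch has finished before starting the next one.
    import time
    import requests

    API = "https://api.gbif-uat.org/v1"
    AUTH = ("username", "password")
    BATCH_SIZE = 200

    def recrawl(dataset_keys):
        for i in range(0, len(dataset_keys), BATCH_SIZE):
            for key in dataset_keys[i:i + BATCH_SIZE]:
                # Ask GBIF to re-crawl/interpret this dataset (authenticated call)
                requests.get(f"{API}/dataset/{key}/crawl", auth=AUTH).raise_for_status()
            time.sleep(30 * 60)  # crude pause between batches; adjust to the actual crawl time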

To migrate, you would only need to include the changed field:

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc
    {
      "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
    }

(I think, haven't done this for a while.) Updates cause a crawl after 1 minute, in case there are more updates.

You should write to [email protected] to get authorization to make these requests. It's usually best to create a new institutional account on gbif.org for this. Create one on gbif-uat.org too, so you can test everything there first.

MattBlissett avatar Nov 17 '22 12:11 MattBlissett

Thank you @MattBlissett. I guess I will switch to this before leaving (I have just asked the GBIF France team for confirmation), so it will ease the maintenance. I will share my scripts with the community, for sure.

sylvain-morin avatar Nov 17 '22 14:11 sylvain-morin

Wow ... tons of information today. Thanks a lot @MattBlissett & @sylmorin-gbif ... are you planning to use Python for this too?

abubelinha avatar Nov 17 '22 20:11 abubelinha

Hi @MattBlissett, I'm testing the migration, as we discussed above: PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY"
}

Here is the result:

  • Validation of [publishingOrganizationKey] failed: must not be null
  • Validation of [title] failed: must not be null
  • Validation of [type] failed: must not be null

So I added them: PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE"
}

Here is the new result:

  • Validation of [created] failed: must not be null
  • Validation of [key] failed: must not be null
  • Validation of [modified] failed: must not be null

I don't think having to add "created" or "modified" dates is normal... but I did it :)

PUT https://api.gbif-uat.org/v1/dataset/aaaaa-bbbbbbb-cccccc

{
   "installationKey": "NEW_WEB_SERVER_INSTALLATION_KEY",
   "publishingOrganizationKey": "xxxx",
   "title": "xxxx",
   "type": "OCCURRENCE",
   "key": "xxxxx",
   "created": "2022-11-25T16:18:46.134+00:00",
   "modified": "2022-11-25T16:18:46.134+00:00"
}

And the result is... 400 BAD REQUEST

Any idea?
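One thing I have not tried yet: GET the full dataset record first, change only the installationKey, and PUT the whole object back so that all the required fields are present. Something like this (untested):

    # Untested idea: fetch the complete dataset record, change only installationKey,
    # and PUT the whole object back so the required fields are all present.
    import requests

    API = "https://api.gbif-uat.org/v1"
    AUTH = ("username", "password")

    def move_dataset(dataset_key, new_installation_key):
        dataset = requests.get(f"{API}/dataset/{dataset_key}").json()
        dataset["installationKey"] = new_installation_key
        r = requests.put(f"{API}/dataset/{dataset_key}", json=dataset, auth=AUTH)
        r.raise_for_status()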

sylvain-morin avatar Nov 30 '22 10:11 sylvain-morin

I guess I will switch to this before leaving (I have just asked the GBIF France team for confirmation), so it will ease the maintenance. I will share my scripts with the community, for sure.

@sylvain-morin did you finally end up with a solution you can share? I'd love to find a way to upload DwCA files to a public repository (Zenodo, Github, whatever) and let GBIF registry read directly from them whenever we publish updates.

abubelinha avatar Feb 10 '24 19:02 abubelinha

I built a very simple Python app server to handle our needs at GBIF France: https://github.com/gbiffrance/apt

In short:

  • you register this "APT" as a GBIF "HTTP installation"
  • you set the GBIF keys (publisher, installation, ...) in the APT config (via environment variables)
  • you define the folder where the datasets are stored (mounted as a Docker volume)
  • you use the POST endpoint to push a ZIP dataset to the APT

the POST endpoint will:

  • store the file in the folder you defined
  • register (or update) the dataset on gbif.org

It's really basic, but it has been handling more than 15 000 datasets for a year (https://www.gbif.org/installation/e44d0fd7-0edf-477f-aa82-50a81836ab46)

Our goal was to have a simple tool to handle the GBIF publication/update at the end of our dataset pipeline.
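Pushing a dataset then looks roughly like this (the endpoint path and upload field name below are illustrative; check the APT repository for the actual route):

    # Sketch: push a DwC-A zip to the APT, which stores it and registers/updates
    # the dataset on GBIF.org. The path and "file" field name are illustrative only.
    import requests

    APT_URL = "https://apt.example.org"  # placeholder APT base URL

    def push_dataset(dataset_id, zip_path):
        with open(zip_path, "rb") as f:
            r = requests.post(f"{APT_URL}/dataset/{dataset_id}", files={"file": f})
        r.raise_for_status()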

sylvain-morin avatar Feb 12 '24 11:02 sylvain-morin

Oh great! Thanks a lot for the summarized explanations.
I suggest you add them to the repository (in INSTALLATION.md or wherever you think is best).

I understand APT basically replicates the IPT behaviour, except that you must create the DwCA files yourself beforehand and then use APT to both serve and register them (and their updates). So this means the APT is expected to be up and running at all times, just like an IPT. Am I right?

This is great, but I am mostly interested in "serving" datasets from a different place (e.g. an institutional repository, or Zenodo), while using Python/APT only to register them.
I guess this might be possible by changing dataset_url(id) in registry.py:

def dataset_url(id):
    # your current code:
    # return CONFIG.APT_PUBLIC_URL + "/dataset/"+id 

    # use my own function to get dataset urls from wherever I store them (i.e. database, excel, ...):
    return get_remote_dataset_url(id)

Also, post_dataset(id) in defaultserver.py would need to be changed to upload the file to a repository instead of storing it on the APT server.

In such a scenario, would it be possible to run APT from a local machine (not accessible to the GBIF registry), so that I only run it when publishing, while Zenodo or my institution's repository takes care of keeping the DwC data source accessible online 24x7? I encourage GBIF staff (@timrobertson100 @MattBlissett ...?) to give their opinions on this approach.

I suppose the main concern would be checking for a valid DwCA file structure before uploading it to a public URL and registering it. But perhaps python-dwca-reader might do the trick. @sylvain-morin did you use any particular approach for that?
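Something like this, maybe (an untested sketch):

    # Untested sketch: basic sanity check of a DwC-A with python-dwca-reader
    # before uploading and registering it.
    from dwca.read import DwCAReader

    def looks_like_valid_dwca(path):
        try:
            with DwCAReader(path) as dwca:
                core_type = dwca.descriptor.core.type     # core row type URI (e.g. Occurrence)
                has_metadata = dwca.metadata is not None  # EML parsed successfully
                has_rows = len(dwca.rows) > 0             # at least one core row
            return bool(core_type) and has_metadata and has_rows
        except Exception as e:
            print(f"Invalid archive {path}: {e}")
            return False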

Of course, creating valid DwCA files on your own might not be trivial (especially the metadata part), but that is a different question.

abubelinha avatar Feb 12 '24 12:02 abubelinha

I think I missed the "HTTP installation" role of APT in my previous message.

I guess both APT and the IPT are expected to be accessible online, so they act as a kind of index page for the datasets they serve. So I would slightly change my question: could this "HTTP installation" be a simple HTML index page of those datasets?

In other words, can we just store (and keep updated) both the "installation" and its datasets on a static website? (So we could use any repository that can keep them available online at permanent URLs.)

abubelinha avatar Feb 12 '24 22:02 abubelinha