dataverse-client-python

Add ability to instantiate a dataset without instantiating a dataverse

Open pdurbin opened this issue 10 years ago • 16 comments

In working on an internal ticket that ultimately led to https://github.com/IQSS/dataverse/issues/2599 being opened, I wrote the following to @garthg about my perception that it is impossible to instantiate a dataset without first instantiating a dataverse:

It looks like the get_dataset_by_doi method is really a loop over get_datasets, which seems to be a representation of the list of datasets from the operation above that's failing:

$ grep get_dataset_by_doi dataverse/dataverse.py -A2
    def get_dataset_by_doi(self, doi, refresh=False):
        return next((s for s in self.get_datasets(refresh) if s.doi == doi), None)

I only bring this up because it seems like I can operate on your dataset DOIs if I skip the "List datasets in a dataverse" operation and go directly to these API endpoints:

http://guides.dataverse.org/en/4.2/api/sword.html#display-a-dataset-atom-entry

http://guides.dataverse.org/en/4.2/api/sword.html#display-a-dataset-statement

I'd be happy to be told I'm wrong about this.
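
For illustration, here is a minimal sketch of how the two SWORD URLs could be built straight from a DOI, with no dataverse involved. The path layout is copied from the SWORD responses quoted later in this thread; the function name `sword_urls` is hypothetical and is not part of dataverse-client-python.

```python
# Path prefix as it appears in the SWORD responses quoted in this thread.
SWORD_BASE = "/dvn/api/data-deposit/v1.1/swordv2"


def sword_urls(host, doi):
    """Return the SWORD Atom entry and statement URLs for a dataset DOI.

    Hypothetical helper: the name and return shape are assumptions
    made for this sketch.
    """
    # Accept both "10.5072/FK2/LZGXQ8" and "doi:10.5072/FK2/LZGXQ8".
    global_id = doi if doi.startswith("doi:") else "doi:" + doi
    root = "https://{0}{1}".format(host, SWORD_BASE)
    return {
        "entry": "{0}/edit/study/{1}".format(root, global_id),
        "statement": "{0}/statement/study/{1}".format(root, global_id),
    }


urls = sword_urls("apitest.dataverse.org", "10.5072/FK2/LZGXQ8")
print(urls["entry"])
print(urls["statement"])
```

Fetching either URL still needs an API token, which is left out here.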

Especially once datasets can be found via search (#21), I imagine that datasets will be able to be instantiated without instantiating a dataverse, but I thought I'd go ahead and create this issue so we can talk about it.

pdurbin • Oct 06 '15 13:10

While it is possible to get some information from those endpoints, it appears that it's insufficient to create a Dataset object.

Example outputs from those endpoints:

<entry xmlns="http://www.w3.org/2005/Atom">
  <bibliographicCitation xmlns="http://purl.org/dc/terms/">[email protected], 2015, "Study of Cats", http://dx.doi.org/10.5072/FK2/LZGXQ8,  API Test Dataverse,  DRAFT VERSION</bibliographicCitation>
  <generator uri="http://www.swordapp.org/" version="2.0"/>
  <id>https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8</id>
  <link href="https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8" rel="edit"/>
  <link href="https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8" rel="http://purl.org/net/sword/terms/add"/>
  <link href="https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit-media/study/doi:10.5072/FK2/LZGXQ8" rel="edit-media"/>
  <link href="https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/statement/study/doi:10.5072/FK2/LZGXQ8" rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml; type=feed"/>
  <treatment xmlns="http://purl.org/net/sword/terms/">no treatment information available</treatment>
  <link href="http://dx.doi.org/10.5072/FK2/LZGXQ8" rel="alternate"/>
</entry>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8</id>
  <link href="https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8" rel="self"/>
  <title type="text">Study of Cats</title>
  <author>
    <name>[email protected]</name>
  </author>
  <updated>2015-10-10T12:12:00.569Z</updated>
  <category term="isMinorUpdate" scheme="http://purl.org/net/sword/terms/state" label="State">true</category>
  <category term="locked" scheme="http://purl.org/net/sword/terms/state" label="State">false</category>
  <category term="latestVersionState" scheme="http://purl.org/net/sword/terms/state" label="State">DRAFT</category>
</feed>
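
To make concrete what the statement does and does not carry, here is a small sketch that parses a trimmed copy of the feed above with the standard library. The function name `parse_statement` and the shape of its return value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the statement feed shown above.
STATEMENT = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>https://apitest.dataverse.org/dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.5072/FK2/LZGXQ8</id>
  <title type="text">Study of Cats</title>
  <category term="locked" scheme="http://purl.org/net/sword/terms/state" label="State">false</category>
  <category term="latestVersionState" scheme="http://purl.org/net/sword/terms/state" label="State">DRAFT</category>
</feed>"""


def parse_statement(xml_text):
    """Extract title, DOI, and state flags from a SWORD statement feed."""
    feed = ET.fromstring(xml_text)
    # Each <category> holds one state flag; "term" is its name.
    states = {c.get("term"): c.text for c in feed.findall(ATOM + "category")}
    # The DOI is the tail of the edit URI, after ".../study/".
    edit_uri = feed.findtext(ATOM + "id")
    return {
        "title": feed.findtext(ATOM + "title"),
        "doi": edit_uri.split("/study/", 1)[1],
        "states": states,
    }


info = parse_statement(STATEMENT)
print(info["title"], info["doi"], info["states"]["latestVersionState"])
```

Everything recoverable here is keyed by DOI. The numeric database id never appears in the feed.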

One issue is that there is no reference to the dataset id (the non-DOI id) in either of those endpoints, but the id is needed for some of the operations to take advantage of the native API. Normally the client gets that information from the parent Dataverse object using the native endpoint /dataverses/<alias>/contents and iterating through the datasets. If we haven't instantiated a Dataverse object, though, I don't think there's a way to get that id.
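
As a rough sketch of that lookup: given the parsed JSON from /dataverses/<alias>/contents, the id can be found by matching on the DOI. The field names used below (`type`, `id`, `authority`, `identifier`) are assumptions about the response layout, the sample payload is made up, and `find_dataset_id` is a hypothetical name rather than a function in the client.

```python
def find_dataset_id(contents, doi):
    """Return the database id of the dataset with this DOI, or None.

    `contents` is the parsed JSON body of /dataverses/<alias>/contents.
    The keys read here are assumed, not confirmed against the API.
    """
    wanted = doi[4:] if doi.startswith("doi:") else doi
    for item in contents.get("data", []):
        if item.get("type") != "dataset":
            continue  # skip child dataverses
        candidate = "{0}/{1}".format(item.get("authority"), item.get("identifier"))
        if candidate == wanted:
            return item["id"]
    return None


# Made-up payload for illustration; the ids are placeholders.
SAMPLE_CONTENTS = {
    "status": "OK",
    "data": [
        {"type": "dataverse", "id": 7, "title": "Sub Dataverse"},
        {"type": "dataset", "id": 42, "protocol": "doi",
         "authority": "10.5072/FK2", "identifier": "LZGXQ8"},
    ],
}

print(find_dataset_id(SAMPLE_CONTENTS, "doi:10.5072/FK2/LZGXQ8"))
```

The lookup needs the parent dataverse alias up front, which is exactly the dependency this issue is about.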

rliebz • Oct 10 '15 12:10

> One issue is that there is no reference to the dataset id (the non-DOI id) in either of those endpoints, but the id is needed for some of the operations to take advantage of the native API.

Right, it's absolutely an issue that SWORD operates on DOIs and the native API operates on database IDs. We want them both to operate on DOIs, which is what https://github.com/IQSS/dataverse/issues/1837 is about.

Meanwhile, I've had the same thought that as a workaround (until the native API supports DOIs) perhaps SWORD could somehow expose the database ID of a dataset. I played around with this in a branch at https://github.com/IQSS/dataverse/commit/639d8c37e432d5920a5fd3a2c6f74df429ae50f2 as I commented at https://github.com/IQSS/dataverse/issues/1837#issuecomment-106909159 . In a comment at https://github.com/IQSS/dataverse/commit/639d8c37e432d5920a5fd3a2c6f74df429ae50f2 you can see that the XML contains "datasetEntityId".

By the way, another way to get database IDs is via the Search API if you use the (undocumented) show_entity_ids=true query parameter: https://github.com/IQSS/dataverse/blob/v4.2/src/main/java/edu/harvard/iq/dataverse/api/Search.java#L66 . It's probably time to simply document this since (again) the native API currently doesn't support DOIs.
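
A hedged sketch of that Search API route: build a query with show_entity_ids=true, then pick the database id out of the results by matching on the DOI. The response keys `items`, `global_id`, and `entity_id`, and the `key` query parameter for the API token, are assumptions in this sketch. The sample response is made up and both function names are hypothetical.

```python
from urllib.parse import urlencode


def search_url(base_url, doi, api_key):
    """Build a Search API URL that asks for database ids as well."""
    params = {
        "q": '"{0}"'.format(doi),   # quoted so the DOI is searched as a phrase
        "type": "dataset",
        "show_entity_ids": "true",  # the undocumented flag mentioned above
        "key": api_key,             # assumed name of the token parameter
    }
    return "{0}/api/search?{1}".format(base_url.rstrip("/"), urlencode(params))


def entity_id_for(search_json, doi):
    """Return the database id of the search hit with this DOI, or None."""
    for item in search_json.get("data", {}).get("items", []):
        if item.get("global_id") == doi:
            return item.get("entity_id")
    return None


# Made-up response for illustration; 42 is a placeholder id.
SAMPLE_RESPONSE = {
    "status": "OK",
    "data": {"items": [
        {"type": "dataset", "global_id": "doi:10.5072/FK2/LZGXQ8", "entity_id": 42},
    ]},
}

url = search_url("https://apitest.dataverse.org", "doi:10.5072/FK2/LZGXQ8", "SECRET")
print(url)
print(entity_id_for(SAMPLE_RESPONSE, "doi:10.5072/FK2/LZGXQ8"))
```

Matching on the DOI client-side keeps the sketch from depending on any particular search field name.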

Thanks for your patience with all this, @rliebz ! And thanks for your continued involvement in the Dataverse Python client!

pdurbin • Oct 10 '15 13:10

Hi @pdurbin and @rliebz ,

I'm not sure this will be useful, but I wanted to make it available to you both just in case. I have a helper class that can handle some of the Dataset API functionality without instantiating a Dataset object. It also has a wrapper for finding the database ID using the undocumented "show_entity_ids=true" feature.

The code has minimal comments and some strange workarounds, so it's not production ready and I haven't released it properly anywhere, but I dumped it on Pastebin here if you'd like to look at it: http://pastebin.com/ipdhEPXA .

garthg • Oct 12 '15 14:10

@garthg wow, you've got a whole repo at https://github.com/garthg/petitions-dataverse you're working on! Great! I found this in your pastebin. :)

pdurbin • Oct 15 '15 14:10

Hi @pdurbin ,

Yep! That's part of our ongoing project around the Antislavery Petitions. Kevin Condon helped me set up a process where the code in that repo creates a zip archive with a suitable structure of XML files for importing into a Dataverse by a non-public backend script. My recent work with you has been to migrate that to use the Dataverse API instead of a backend script.

If you're curious, you can also check out the front end prototype we built for the data at http://antislaverypetitions.pythonanywhere.com/map, which links back to the Dataverse studies.

garthg • Oct 16 '15 12:10

@garthg wow! That's fantastic! @mcrosas @thegaryking et al. should check out http://antislaverypetitions.pythonanywhere.com/map and how it links back to datasets under https://dataverse.harvard.edu/dataverse/antislaverypetitionsma . I love the timeline feature. :)

Yes, please keep reminding us of anything you need API-wise. I know https://github.com/IQSS/dataverse/issues/2599 was a big issue and it's slated for the next release (4.2.1). Please keep the feedback coming!

pdurbin • Oct 16 '15 12:10

:+1: Yes, really nice, @garthg. Great way to integrate visualization with the supporting data.

mercecrosas • Oct 16 '15 13:10

Truly excellent! Is there a schema for the minimum elements in the JSON needed for the front end to work? I would like to create another example based on the mapviewdb.json, though some of the elements seem specific to the dataset. THANKS for this!

vajlex • Oct 16 '15 13:10

Hi @vajlex and @mcrosas , thanks for the supportive comments!

@vajlex Regarding your question of minimum schema, the visualization is currently fairly tightly integrated with the petitions dataset. It would work with minimal changes on another petition dataset, or with some work you could adapt it to use different columns. Right now it expects the following to be defined per row:

  • time start
  • time end
  • signatures
  • title
  • topic
  • pds url
  • dataverse id
  • location

And it also expects pre-built maps of rowsForPlace, rowsForYear, and latLngForPlace.
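
As a loose illustration only (the real mapviewdb.json layout may well differ), the per-row fields and two of the pre-built maps could be checked and derived along these lines. The key spellings, the use of row indexes as map values, and the sample rows are all assumptions made for this sketch. latLngForPlace is left out because it needs geocoded coordinates.

```python
from collections import defaultdict

# Field names taken from the list above; the exact JSON keys are assumed.
REQUIRED = ["time start", "time end", "signatures", "title",
            "topic", "pds url", "dataverse id", "location"]


def missing_fields(row):
    """List the required fields that a row does not define."""
    return [name for name in REQUIRED if name not in row]


def build_indexes(rows):
    """Derive rowsForPlace and rowsForYear as maps to lists of row indexes."""
    rows_for_place = defaultdict(list)
    rows_for_year = defaultdict(list)
    for index, row in enumerate(rows):
        rows_for_place[row["location"]].append(index)
        # A row counts for every year from its start to its end, inclusive.
        for year in range(row["time start"], row["time end"] + 1):
            rows_for_year[year].append(index)
    return dict(rows_for_place), dict(rows_for_year)


# Placeholder rows, not real petition data.
ROWS = [
    {"time start": 1838, "time end": 1839, "signatures": 10, "title": "A",
     "topic": "t", "pds url": "u", "dataverse id": "d", "location": "Boston"},
    {"time start": 1839, "time end": 1839, "signatures": 5, "title": "B",
     "topic": "t", "pds url": "u", "dataverse id": "d", "location": "Salem"},
]

rows_for_place, rows_for_year = build_indexes(ROWS)
print(rows_for_place)
print(rows_for_year)
```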

If you're interested in looking at the source code for generating the mapviewdb.json file as well as the html/css/js source, it's all available in another repo at: https://github.com/garthg/petitions-visualization

@pdurbin I sure will keep hassling you! I really do appreciate how responsive you and the team have been.

garthg • Oct 16 '15 21:10

Great, @garthg! Actually, I think I can hack this with some of my data, which is historical placename data, so I think I can populate the rowsForPlace, rowsForYear, and latLngForPlace elements. Will let you know how my experiment goes. Meanwhile, awesome work you've done!

vajlex • Oct 16 '15 22:10

@vajlex That sounds promising! I'd love to see it when you get it up and running. Very cool!

garthg • Oct 16 '15 23:10

@garthg do you mind if I use http://antislaverypetitions.pythonanywhere.com/map as an example in a presentation on ways people are building off the Dataverse APIs?

It is for the Increasing Openness and Connections portion of this session: https://dlfforum2015.sched.org/event/62384c349f7a6aaf6aa5b3e7d6b5bd88#.VivEvBCrT-Y

eaquigley • Oct 24 '15 17:10

@eaquigley Please feel free to include any of that work in your presentation! I'm excited that you're interested in it. If you receive any interesting comments or questions on it, I'd love to hear about that afterwards as well.

garthg • Oct 24 '15 18:10

@garthg I'm using it as one of the examples too in a talk on Monday, among other visualizations and analysis from data in Dataverse. Thanks!

mercecrosas • Oct 25 '15 15:10

@mcrosas That's great to hear that you chose to include it as an example! Really exciting. As with Elizabeth, if you get any comments or questions on the work, I'd love to hear about it afterwards.

garthg • Oct 26 '15 01:10

Will gladly report back any comments @garthg! Thanks for building it so we can show it off!


eaquigley • Oct 26 '15 17:10