pyDataverse
Add import and export of OAISTree
As I mentioned at https://github.com/IQSS/dataverse/issues/5235#issuecomment-492875277 I'm curious if the "DVTree" (Dataverse Tree) format could be used to upload sample data to a brand new Dataverse installation for use in demos and usability testing.
I would love to see some docs, or a pointer to the code for now. Thanks!
The structure is not defined yet; it's just a rough idea I had, inspired by @petermr's CTree structure. I definitely want to talk with some of the Dataverse devs about the idea, to see whether it would work right now and in the long run. The general idea is that the filenames and the structure of the folders and files alone tell you what should or can be inside and how to treat the different files. For example, every dataverse folder must contain a metadata file, and the same holds for datasets. The content of the metadata file need not be strictly defined, but will most likely also have some mandatory attributes. This can then be used to create a local export that is independent of the operating system and the connecting programming language, and that can also be read by humans.
Here my first draft of the structure:
Naming Conventions:
- Dataverse: dv_IDENTIFIER, prefix dv_, id = alias
- Dataset: ds_IDENTIFIER, prefix ds_, id = id
- Datafile: FILENAME
├── dv_harvard/
│   ├── metadata.json
│   └── dv_iqss/
│       ├── metadata.json
│       └── ds_microcensus-2018/
│           ├── metadata.json
│           └── datafiles/
│               ├── documentation.pdf
│               ├── data.csv
│               └── metadata.json
└── dv_aussda/
    └── ds_survey-labour-2016/
        ├── metadata.json
        └── datafiles/
            ├── docs.pdf
            └── data.tsv
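The naming conventions above could be checked in code. The following is a minimal sketch (my own, not part of pyDataverse) that classifies a DVTree entry by its name; the `dv_`/`ds_` prefixes come from the draft above, everything else is an assumption:

```python
# Hypothetical helper: derive the DVTree node type implied by an
# entry's name, following the draft naming conventions (dv_ prefix
# for dataverses, ds_ prefix for datasets).

def dvtree_node_type(name: str) -> str:
    """Return the DVTree node type implied by an entry's name."""
    if name.startswith("dv_"):
        return "dataverse"
    if name.startswith("ds_"):
        return "dataset"
    if name == "metadata.json":
        return "metadata"
    return "datafile"

print(dvtree_node_type("dv_harvard"))  # dataverse
print(dvtree_node_type("data.csv"))    # datafile
```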
Some open questions:
- are the file-naming conventions compatible? e.g. is it always okay/possible to convert the dataverse alias to a filename string and store it on every operating system?
- is the filename the best identifier for the datafiles, or is its hash better?
- how to handle versioning? is a DVTree only possible for one version, or should there be another level of folders, like `v1/`?
- do we need to separate metadata into 1) general metadata and 2) metadata for API upload (add an `api.json` or similar)?
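On the first question, one conservative answer is to sanitize the alias before using it as a folder name. A hedged sketch (the function name and the whitelist are my assumptions, not an established convention):

```python
import re

def alias_to_dirname(alias: str) -> str:
    """Map a Dataverse alias to a DVTree folder name, keeping only
    characters that are safe on common filesystems (letters, digits,
    '-' and '_'); anything else is replaced with '-'."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", alias)
    return "dv_" + safe

print(alias_to_dirname("harvard"))  # dv_harvard
```

Sanitizing is lossy (two different aliases could collide), which is exactly why the question of filename vs. hash as identifier matters.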
@skasberger thanks for this great write up! I just posted on the (ancient) "Round tripping the contents of DVN" thread at https://groups.google.com/d/msg/dataverse-community/07h0Ca-Ai1I/qyq3l-lakc0J with a link to this issue. I'm hoping to spur some good discussion.
As mentioned several times at the Dataverse Community Conference: BagIt seems to be very similar, and could be a good inspiration. https://en.wikipedia.org/wiki/BagIt
After developing the first proof of concept, I recommend renaming it to OAISTree (Open Archival Information System Tree), because the related OAIS processes are the guidance for the directory structure and its conventions. Here is my current draft.
ROOT_DIR
├── YYYYMMDD_dataverses.csv
├── YYYYMMDD_datasets.csv
├── YYYYMMDD_datafiles.csv
├── terms-of-access.html
├── terms-of-use.html
├── PICKLE_FILE.pickle
└── OAISTrees/
    ├── DATASET_ID/
    │   ├── DATASET_ID_history.json
    │   ├── SIP/
    │   │   └── RAW_DATA_FILENAME
    │   ├── AIP/
    │   │   ├── DATASET_ID_metadata.json
    │   │   └── DATASET_ID_DATAFILE_ID_metadata.json
    │   └── DIP/
    │       ├── terms-of-use.html
    │       └── terms-of-access.html
    ├── DATASET_ID/
    ├── DATASET_ID/
    └── DATASET_ID/
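The per-dataset part of the draft can be sketched in a few lines. This is an illustration only (the function name is hypothetical, and `ROOT_DIR`/`DATASET_ID` are the placeholders from the tree above):

```python
from pathlib import Path

def create_oaistree(root: Path, dataset_id: str) -> Path:
    """Create the SIP/AIP/DIP skeleton for one dataset under
    ROOT_DIR/OAISTrees/DATASET_ID/, as in the draft above."""
    base = root / "OAISTrees" / dataset_id
    for package in ("SIP", "AIP", "DIP"):
        (base / package).mkdir(parents=True, exist_ok=True)
    return base
```

SIP, AIP and DIP here follow the OAIS terminology for submission, archival and dissemination information packages.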
Is this dependent on https://en.wikipedia.org/wiki/Open_Archival_Information_System? This is a standard and a community.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Hi Peter,
Not really dependent. It's a first proposal for a standardized folder/data structure for OAIS-related data, which can then be used to convert data from and to different systems. We use Dataverse, but others use iRODS or other software solutions.
Fine, so Dataverse encapsulates OAIS. The important thing in using standards is to be consistent with their specs, otherwise it causes confusion with software.
@skasberger this looks great. Have you considered how to support dataverses of arbitrary depth?
For example the dataset below about test taking is in a dataverse called "JOPD" which is inside a dataverse called "Ubiquity Press":

For our "dataverse-sample-data" repo I ended up using nested directories to support this. The idea is that there can be an arbitrary number of "dataverses" directories to trigger the next level:

I tend to have the sample data loaded up at https://dev2.dataverse.org if you'd like to take a look.
Here's the "sample data" repo: https://github.com/IQSS/dataverse-sample-data
In it I use pyDataverse! Thanks! π
Some notes on how to develop the OAISTree. I already have some code for this running locally for AUSSDA purposes, so if you want to contribute, please get in touch with me first.
Workflow
- Create Directory structure
- Copy Raw data
- Create History file
- Create Dataset JSON
- Create Datafile JSON
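The workflow steps above can be sketched end to end. A minimal, hedged sketch (the function name and the placeholder JSON contents are my assumptions; file names follow the OAISTree draft):

```python
import json
import shutil
from pathlib import Path

def run_workflow(root: Path, dataset_id: str, raw_files: list) -> Path:
    """Sketch of the workflow: create the directory structure, copy
    the raw data into SIP/, and create the history and dataset JSON
    files. The JSON contents here are placeholders."""
    base = root / "OAISTrees" / dataset_id
    # 1. Create directory structure
    for package in ("SIP", "AIP", "DIP"):
        (base / package).mkdir(parents=True, exist_ok=True)
    # 2. Copy raw data into the submission package
    for raw in raw_files:
        shutil.copy2(raw, base / "SIP" / Path(raw).name)
    # 3. Create history file
    (base / f"{dataset_id}_history.json").write_text(json.dumps({"events": []}))
    # 4./5. Create dataset (and per-datafile) JSON in the AIP
    (base / "AIP" / f"{dataset_id}_metadata.json").write_text(json.dumps({}))
    return base
```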
Functionalities:
- sub-folders in DIP: use the categories (data, documentation) or file tags/filenames to create sub-folders
Development
- create classes to organize the OAISTree
- Function names: from_oaistree(), to_oaistree()
- Question: can the Dataverse alias, dataset id or datafile id always be used for directory or filename naming? Look for a fitting solution for organizing this (look at BagIt). Must work on different operating systems.
- Easy synchronization of pyDataverse objects and the OAISTree.
- Preservation:
- how to manage delete and destroy of data?
- new pid allowed?
- history must be preserved!
- Manage Create, Update and Delete steps: JSON creation, history.
- Think together with history feature
- integrate with history function (#43)
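A class organizing the OAISTree, with the from_oaistree()/to_oaistree() names proposed above, could look roughly like this. Everything besides those two names is my own placeholder (in particular, the plain dict stands in for the real pyDataverse metadata objects):

```python
import json
from pathlib import Path

class OAISTree:
    """Hypothetical skeleton for a class organizing one dataset's
    OAISTree; the dict-based metadata is a placeholder."""

    def __init__(self, dataset_id: str, metadata: dict):
        self.dataset_id = dataset_id
        self.metadata = metadata

    def to_oaistree(self, root: Path) -> Path:
        """Export this dataset to ROOT/OAISTrees/DATASET_ID/."""
        base = root / "OAISTrees" / self.dataset_id
        (base / "AIP").mkdir(parents=True, exist_ok=True)
        path = base / "AIP" / f"{self.dataset_id}_metadata.json"
        path.write_text(json.dumps(self.metadata))
        return base

    @classmethod
    def from_oaistree(cls, base: Path) -> "OAISTree":
        """Import a dataset back from its OAISTree directory."""
        dataset_id = base.name
        path = base / "AIP" / f"{dataset_id}_metadata.json"
        return cls(dataset_id, json.loads(path.read_text()))
```

A round trip (export, then import) should reproduce the same object state, which is also a natural first test for the synchronization point above.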
As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python