pyDataverse
Add import and export of OAISTree
As I mentioned at https://github.com/IQSS/dataverse/issues/5235#issuecomment-492875277 I'm curious if the "DVTree" (Dataverse Tree) format could be used to upload sample data to a brand new Dataverse installation for use in demos and usability testing.
I would love to see some docs, or a pointer to the code for now. Thanks!
The structure is not defined yet; it's just a rough idea I had, inspired by @petermr's CTree structure. I definitely want to talk with some of the Dataverse devs about the idea, to see whether it would work right now and in the long run. The general idea is that the filenames and the structure of the folders and files alone tell you what should or can be inside and how to treat the different files. For example, every dataverse folder must contain a metadata file, and the same holds for datasets. The content of the metadata file need not be strictly defined, but will most likely also have some mandatory attributes. This can then be used to create a local export that is independent of the operating system and the connecting programming language, and that can also be read by humans.
Here my first draft of the structure:
Naming Conventions:
- Dataverse: dv_IDENTIFIER, prefix dv_, id = alias
- Dataset: ds_IDENTIFIER, prefix ds_, id = id
- Datafile: FILENAME
├── dv_harvard/
│   ├── metadata.json
│   └── dv_iqss/
│       ├── metadata.json
│       └── ds_microcensus-2018/
│           ├── metadata.json
│           └── datafiles/
│               ├── documentation.pdf
│               ├── data.csv
│               └── metadata.json
└── dv_aussda/
    └── ds_survey-labour-2016/
        ├── metadata.json
        └── datafiles/
            ├── docs.pdf
            └── data.tsv
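The naming conventions above could be checked in code. The following is a minimal sketch (my own, not part of pyDataverse) that classifies a DVTree entry by its name; the `dv_`/`ds_` prefixes come from the draft above, everything else is an assumption:

```python
# Hypothetical helper: derive the DVTree node type implied by an
# entry's name, following the draft naming conventions (dv_ prefix
# for dataverses, ds_ prefix for datasets).

def dvtree_node_type(name: str) -> str:
    """Return the DVTree node type implied by an entry's name."""
    if name.startswith("dv_"):
        return "dataverse"
    if name.startswith("ds_"):
        return "dataset"
    if name == "metadata.json":
        return "metadata"
    return "datafile"

print(dvtree_node_type("dv_harvard"))  # dataverse
print(dvtree_node_type("data.csv"))    # datafile
```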
Some open questions:
- are the file-naming conventions compatible? e.g. is it always okay/possible to convert the dataverse alias to a filename string and store it on every operating system?
- is the filename the best identifier for the datafiles, or is its hash better?
- how to handle versioning? is a DVTree only possible for one version, or should there be another level of folders, like `v1/`?
- do we need to separate metadata into 1) general metadata and 2) metadata for API upload (add an `api.json` or similar)?
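On the first question, one conservative answer is to sanitize the alias before using it as a folder name. A hedged sketch (the function name and the whitelist are my assumptions, not an established convention):

```python
import re

def alias_to_dirname(alias: str) -> str:
    """Map a Dataverse alias to a DVTree folder name, keeping only
    characters that are safe on common filesystems (letters, digits,
    '-' and '_'); anything else is replaced with '-'."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", alias)
    return "dv_" + safe

print(alias_to_dirname("harvard"))  # dv_harvard
```

Sanitizing is lossy (two different aliases could collide), which is exactly why the question of filename vs. hash as identifier matters.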
@skasberger thanks for this great write up! I just posted on the (ancient) "Round tripping the contents of DVN" thread at https://groups.google.com/d/msg/dataverse-community/07h0Ca-Ai1I/qyq3l-lakc0J with a link to this issue. I'm hoping to spur some good discussion.
As mentioned several times at the Dataverse Community Conference: BagIt seems to be very similar, and could be a good inspiration. https://en.wikipedia.org/wiki/BagIt
After developing the first proof of concept, I recommend renaming it to OAISTree (Open Archival Information System Tree), because the related OAIS processes are the guidance for the directory structure and its conventions. Here is my current draft.
ROOT_DIR
├── YYYYMMDD_dataverses.csv
├── YYYYMMDD_datasets.csv
├── YYYYMMDD_datafiles.csv
├── terms-of-access.html
├── terms-of-use.html
├── PICKLE_FILE.pickle
└── OAISTrees/
    ├── DATASET_ID/
    │   ├── DATASET_ID_history.json
    │   ├── SIP/
    │   │   └── RAW_DATA_FILENAME
    │   ├── AIP/
    │   │   ├── DATASET_ID_metadata.json
    │   │   └── DATASET_ID_DATAFILE_ID_metadata.json
    │   └── DIP/
    │       ├── terms-of-use.html
    │       └── terms-of-access.html
    ├── DATASET_ID/
    ├── DATASET_ID/
    └── DATASET_ID/
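The per-dataset part of the draft can be sketched in a few lines. This is an illustration only (the function name is hypothetical, and `ROOT_DIR`/`DATASET_ID` are the placeholders from the tree above):

```python
from pathlib import Path

def create_oaistree(root: Path, dataset_id: str) -> Path:
    """Create the SIP/AIP/DIP skeleton for one dataset under
    ROOT_DIR/OAISTrees/DATASET_ID/, as in the draft above."""
    base = root / "OAISTrees" / dataset_id
    for package in ("SIP", "AIP", "DIP"):
        (base / package).mkdir(parents=True, exist_ok=True)
    return base
```

SIP, AIP and DIP here follow the OAIS terminology for submission, archival and dissemination information packages.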
Is this dependent on https://en.wikipedia.org/wiki/Open_Archival_Information_System? This is a standard and a community.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Hi Peter,
Not really dependent. It's a first proposal for a standardized folder/data structure for OAIS-related data, which can then be used to convert data from and to different systems. We use Dataverse, but others use iRODS or other software solutions.
Fine, so Dataverse encapsulates OAIS. The important thing in using standards is to be consistent with their specs, otherwise it causes confusion with software.
@skasberger this looks great. Have you considered how to support dataverses of arbitrary depth?
For example the dataset below about test taking is in a dataverse called "JOPD" which is inside a dataverse called "Ubiquity Press":

For our "dataverse-sample-data" repo I ended up using nested directories to support this. The idea is that there can be an arbitrary number of "dataverses" directories to trigger the next level:

I tend to have the sample data loaded up at https://dev2.dataverse.org if you'd like to take a look.
Here's the "sample data" repo: https://github.com/IQSS/dataverse-sample-data
In it I use pyDataverse! Thanks! π
Some notes on how to develop the OAISTree. I already have some code for this running locally for AUSSDA purposes, so if you want to contribute, please get in touch with me first.
Workflow
- Create Directory structure
- Copy Raw data
- Create History file
- Create Dataset JSON
- Create Datafile JSON
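The workflow steps above can be sketched end to end. A minimal, hedged sketch (the function name and the placeholder JSON contents are my assumptions; file names follow the OAISTree draft):

```python
import json
import shutil
from pathlib import Path

def run_workflow(root: Path, dataset_id: str, raw_files: list) -> Path:
    """Sketch of the workflow: create the directory structure, copy
    the raw data into SIP/, and create the history and dataset JSON
    files. The JSON contents here are placeholders."""
    base = root / "OAISTrees" / dataset_id
    # 1. Create directory structure
    for package in ("SIP", "AIP", "DIP"):
        (base / package).mkdir(parents=True, exist_ok=True)
    # 2. Copy raw data into the submission package
    for raw in raw_files:
        shutil.copy2(raw, base / "SIP" / Path(raw).name)
    # 3. Create history file
    (base / f"{dataset_id}_history.json").write_text(json.dumps({"events": []}))
    # 4./5. Create dataset (and per-datafile) JSON in the AIP
    (base / "AIP" / f"{dataset_id}_metadata.json").write_text(json.dumps({}))
    return base
```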
Functionalities:
- sub-folders in DIP: use the categories (data, documentation) or file tags/filenames to create sub-folders
Development
- create classes to organize the OAISTree
- Function names: from_oaistree(), to_oaistree()
- Question: can the Dataverse alias, dataset id or datafile id always be used for directory or filename naming? Look for a fitting solution for organizing this (look at BagIt). Must work on different operating systems.
- Easy synchronization of pyDataverse objects and the OAISTree.
- Preservation:
- how to manage delete and destroy of data?
- new pid allowed?
- history must be preserved!
- Manage Create, Update and Delete steps: JSON creation, history.
- Think together with history feature
- integrate with history function (#43)
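A class organizing the OAISTree, with the from_oaistree()/to_oaistree() names proposed above, could look roughly like this. Everything besides those two names is my own placeholder (in particular, the plain dict stands in for the real pyDataverse metadata objects):

```python
import json
from pathlib import Path

class OAISTree:
    """Hypothetical skeleton for a class organizing one dataset's
    OAISTree; the dict-based metadata is a placeholder."""

    def __init__(self, dataset_id: str, metadata: dict):
        self.dataset_id = dataset_id
        self.metadata = metadata

    def to_oaistree(self, root: Path) -> Path:
        """Export this dataset to ROOT/OAISTrees/DATASET_ID/."""
        base = root / "OAISTrees" / self.dataset_id
        (base / "AIP").mkdir(parents=True, exist_ok=True)
        path = base / "AIP" / f"{self.dataset_id}_metadata.json"
        path.write_text(json.dumps(self.metadata))
        return base

    @classmethod
    def from_oaistree(cls, base: Path) -> "OAISTree":
        """Import a dataset back from its OAISTree directory."""
        dataset_id = base.name
        path = base / "AIP" / f"{dataset_id}_metadata.json"
        return cls(dataset_id, json.loads(path.read_text()))
```

A round trip (export, then import) should reproduce the same object state, which is also a natural first test for the synchronization point above.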
As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python