
Re-factor models module

Open skasberger opened this issue 4 years ago • 4 comments

Re-factor the models module.

Goals are:

  • integrate it better with other tools
  • integrate it better with API module
  • implement setters/getters
  • Integrate with Pandas

Requirements

  • create Dataset with default metadatablocks when constructed
  • create Dataset with custom metadatablocks by passing mdb to Dataset construction
  • store internal information about the metadatablock inside the metadatablock object: mdb_type, mdb_version, date_created
  • visualize data flow and architecture
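A minimal sketch of the requirements above (the `Dataset` and `MetaDataBlock` names follow this issue's proposal, not the current pyDataverse API; the attribute names are assumptions):

```python
# Hypothetical sketch: Dataset gets default metadatablocks when constructed,
# or custom ones passed in; each block stores its own internal information.
from datetime import datetime


class MetaDataBlock:
    """Carries its own bookkeeping: type, version, creation date."""

    def __init__(self, mdb_type="custom", mdb_version="4.18.1"):
        self.mdb_type = mdb_type
        self.mdb_version = mdb_version
        self.date_created = datetime.now()


class Dataset:
    def __init__(self, mdbs=None):
        # Default metadatablocks unless custom ones are passed to the constructor.
        self.mdbs = mdbs or {"citation": MetaDataBlock(mdb_type="citation")}


ds = Dataset()
assert ds.mdbs["citation"].mdb_type == "citation"

custom = Dataset(mdbs={"geo": MetaDataBlock(mdb_type="custom")})
assert custom.mdbs["geo"].mdb_type == "custom"
```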

ACTIONS

0. Pre-Requisites

1. Research

Design

  • [ ] https://stackoverflow.com/questions/50041661/bidirectional-data-structure-conversion-in-python
  • [ ] CSV
  • [ ] JSON
    • [ ] Dataverse Upload JSON default
    • [ ] Dataverse Download JSON default
    • [ ] DSpace
  • [ ] XML
    • [ ] DDI
  • [ ] think it with use-cases in mind
  • [ ] think it with mappings in mind
  • [ ] think it with integrations in mind

Schema

  • [ ] required, data type, unique, limits, formats (email, date), minItems
  • [ ] controlledVocabulary: subject, authorIdentifierScheme, contributorType, country, journalArticleType, language, publicationIDType
  • [ ] pydataverse models
    • [ ] metadatablocks: mdb.citation?
    • [ ] make usage of mdb's easy
  • [ ] CSV
  • [ ] JSON
    • [ ] Dataverse Upload JSON default
    • [ ] Dataverse Download JSON default
    • [ ] DSpace
  • [ ] XML
    • [ ] DDI

Tools

  • [ ] pydantic
  • [ ] https://avro.apache.org/docs/current/gettingstartedpython.html
  • [ ] https://jmespath.org/
  • [ ] https://daffodil.incubator.apache.org/
  • [ ] main.create_model()
  • [ ] https://github.com/koxudaxi/datamodel-code-generator
  • [ ] pydantic_sqlalchemy
    • [ ] sqlalchemy_to_pydantic
  • [ ] Idea: For JSON, build the pydantic data model from a JSON Schema file and the mapping from a JSON file, where each key is the location of the source variable and each value the name of the pyDataverse attribute.
  • [ ] Idea: For CSV, build the pydantic data model from the CSV header row and type row and the mapping from a JSON file, where each key is the name of the source variable and each value the name of the pyDataverse attribute.
  • [ ] Offer the default schemas used by Dataverse: DataverseDefault(), DatasetDefault(), DatafileDefault()
  • [ ] use pydantic models with validators for internal data structure https://pydantic-docs.helpmanual.io/usage/models/
    • [ ] Validate incoming data dicts
    • [ ] Export data with .dict() and .json()
  • [ ] Create a JSON Schema with pydantic: https://pydantic-docs.helpmanual.io/usage/schema/#schema-customization
  • [ ] jsonpath https://pypi.org/project/jsonpath-ng/
  • [ ] schema
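The mapping idea above (a JSON mapping whose keys are source-variable locations and whose values are pyDataverse attribute names) could look like this. The dotted-path format, field paths, and `resolve` helper are illustrative assumptions, not an existing pyDataverse feature:

```python
# Hypothetical mapping: source location (dotted path into the Dataverse
# download JSON) -> pyDataverse attribute name.
mapping = {
    "datasetVersion.metadataBlocks.citation.fields.title": "title",
    "datasetVersion.metadataBlocks.citation.fields.author": "author",
}

# A minimal stand-in for a Dataverse download JSON document.
source = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {"fields": {"title": "My Study", "author": "Doe"}}
        }
    }
}


def resolve(doc, dotted_path):
    """Walk a nested dict along a dotted path."""
    for key in dotted_path.split("."):
        doc = doc[key]
    return doc


# Apply the mapping: collect pyDataverse attributes from the source JSON.
attrs = {attr: resolve(source, path) for path, attr in mapping.items()}
assert attrs == {"title": "My Study", "author": "Doe"}
```

For more complex path expressions, the jsonpath-ng or JMESPath libraries listed above would replace the hand-rolled `resolve` helper.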

Architecture

  • [ ] Idea: Create base class (ABC or normal), called BaseModel()
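One way the proposed base class could look (ABC variant). The `validate`/`json` members mirror the models.py sketch later in this issue; everything else is an assumption for illustration:

```python
# Hypothetical BaseModel() base class shared by Dataverse, Dataset, Datafile.
from abc import ABC, abstractmethod
import json


class BaseModel(ABC):
    """Shared validation and export behaviour for all model classes."""

    @abstractmethod
    def validate(self):
        """Return True if the internal data structure is consistent."""

    @property
    def json(self):
        # Export the public attributes as a JSON string.
        data = {k: v for k, v in vars(self).items() if not k.startswith("_")}
        return json.dumps(data)


class Dataset(BaseModel):
    def __init__(self, title=None):
        self.title = title

    def validate(self):
        # Minimal example rule: a dataset needs a title.
        return self.title is not None


ds = Dataset(title="My Study")
assert ds.validate()
assert json.loads(ds.json) == {"title": "My Study"}
```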

2. Plan

  • [ ] Identify use-cases
  • [ ] Define requirements
  • [ ] Collect mappings
    • [ ] Mapping DSpace JSON #47
    • [ ] Mapping DDI XML #18
    • [ ] Mapping custom JSON #48
    • [ ] Mapping CSV templates #107
    • [ ] Mapping upload JSON #108
    • [ ] Mapping Download JSON #109
  • [ ] Collect Integrations
    • [ ] Pandas #97
    • [ ] Model - API #98
  • [ ] Prioritize features, especially mappings. Only most important for this release

Prioritize

  • In:
    • Dataset Default Download JSON: import and export
    • Dataset Custom Download JSON: import and export
    • Dataset Upload JSON: import and export
    • CSV templates
  • Out:
    • DSpace
    • DDI XML
    • Custom JSON

3. Implement

  • [ ] Write tests
    • [ ] Integration tests
  • [ ] Write/Update code
    • [ ] Create base class (ABC or normal), called BaseModel()
    • [ ] Implement Dataverse Upload JSON mappings and schemas
    • [ ] Implement Dataverse Download JSON mappings and schemas
    • [ ] Prototype: CSV templates mappings and schemas
    • [ ] Prototype: DSpace mappings and schemas
    • [ ] Prototype: DDI XML mappings and schemas
  • [ ] Write/Update Docs
  • [ ] Write/Update Docstrings
  • [ ] Run pytest
  • [ ] Run tox
  • [ ] Run pylint
  • [ ] Run mypy

Visualize data flow / architecture

Draw all functions, paths, etc.

  • use set_mdb() / get_mdb() accessor methods instead of a generic get/set function
    • https://en.wikipedia.org/wiki/Mutator_method#Python
    • Dataset.set_mdb(mdb_name, mdb)
    • Dataset.mdb(mdb_name, mdb)
    • Dataset.mdbs (@property)
    • validate()
    • json: @property
    • json(): setters
    • dict: different name?, @property
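The accessor ideas above could be sketched as follows (hypothetical, following the `set_mdb`/`mdbs` bullets and the mutator-method pattern linked above):

```python
# Hypothetical accessor design: explicit set_mdb()/get_mdb() methods plus
# a read-only `mdbs` @property, instead of generic get/set functions.
class Dataset:
    def __init__(self):
        self.__mdbs = {}

    def set_mdb(self, mdb_name, mdb):
        self.__mdbs[mdb_name] = mdb

    def get_mdb(self, mdb_name):
        return self.__mdbs[mdb_name]

    @property
    def mdbs(self):
        # Read-only view; mutation goes through set_mdb().
        return dict(self.__mdbs)


ds = Dataset()
ds.set_mdb("citation", {"title": "My Study"})
assert ds.get_mdb("citation")["title"] == "My Study"
assert "citation" in ds.mdbs
```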

models.py

class Dataverse():
  .__created_at
  .json
  .validate
  .json()
  .metadata()
  .metadata
  .__get_dataverse_download_json()
  .__validate()

class Dataset():
  .__created_at
  .json
  .validate
  .json()
  .mdb()
  .mdbs
  .__get_dataverse_download_json()
  .__validate()

class Datafile():
  .__created_at
  .json
  .validate
  .json()
  .metadata()
  .metadata
  .dataframe
  .__get_dataverse_download_json()
  .__validate()

class BaseMetaDataBlock(ABC):
  self.__mdb_type = "custom" # options: `citation`, `journal`, etc. for Dataverse metadatablock types, and `custom`
  self.__mdb_version = "4.18.1" # for Dataverse mdb types: the Dataverse version as a semantic version string; for custom types: the block's own semantic version string
  self.__mdb_date_created = datetime.now() # timestamp of when the metadatablock object was created

class MetaDataBlock(BaseMetaDataBlock):
  .__created_at
  .__name

class MetaDataBlockEntry():
  .__created_at
  .__value
  .__multiple
  .__type_class
  .__class

class Roles():
  .__created_at

class Group():
  .__created_at

class User():
  .__created_at
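A runnable rendition of the BaseMetaDataBlock sketch above (the attribute names come straight from the pseudocode in this issue; the constructor signature and the subclass's `name` handling are assumptions):

```python
# Hypothetical implementation of the BaseMetaDataBlock/MetaDataBlock sketch.
from abc import ABC
from datetime import datetime


class BaseMetaDataBlock(ABC):
    def __init__(self, mdb_type="custom", mdb_version="4.18.1"):
        # `citation`, `journal`, etc. for Dataverse blocks; `custom` otherwise.
        self.mdb_type = mdb_type
        # Dataverse version for built-in blocks, own semver for custom ones.
        self.mdb_version = mdb_version
        self.mdb_date_created = datetime.now()


class MetaDataBlock(BaseMetaDataBlock):
    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name


mdb = MetaDataBlock("citation", mdb_type="citation", mdb_version="4.18.1")
assert mdb.mdb_type == "citation"
assert mdb.name == "citation"
```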

4. Follow Ups

  • [ ] Review
    • [ ] Code
    • [ ] Tests
    • [ ] Docs

skasberger avatar Feb 16 '21 12:02 skasberger

Just discovered this issue, and the idea seems to align very well with what has already been done in EasyDataverse. The library also utilizes pydantic and generates objects according to the metadatablock schemas found at api/metadatablocks/blockname.

Wouldn't it make sense to merge the functionality into pyDataverse? In my opinion, a single Python library makes more sense, since both are heading in the same direction. What do you think @skasberger @pdurbin @poikilotherm?

JR-1991 avatar Mar 15 '23 12:03 JR-1991

I think a single library would be easier for the community, sure.

pdurbin avatar Mar 15 '23 14:03 pdurbin

Agree on that, especially as pyDataverse is very lightweight and is meant to be built upon by other, more specialized services/functions.

skasberger avatar Mar 21 '23 17:03 skasberger

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python

pdurbin avatar Mar 04 '24 16:03 pdurbin