# Re-factor the models module

Goals:
- integrate it better with other tools
- integrate it better with API module
- implement setters/getters
- integrate with Pandas
## Requirements
- create Dataset with default metadatablocks when constructed
- create Dataset with custom metadatablocks by passing mdb to Dataset construction
- store internal information about the metadatablock inside the metadatablock object: mdb_type, mdb_version, date_created
- visualize data flow and architecture
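The construction requirements above could look like this minimal sketch. All names here (`MetaDataBlock`, `DEFAULT_MDB_NAMES`, the `metadatablocks` parameter) are hypothetical illustrations, not the final pyDataverse API:

```python
from datetime import datetime

# Hypothetical default block names, for illustration only.
DEFAULT_MDB_NAMES = ["citation", "geospatial", "socialscience"]

class MetaDataBlock:
    def __init__(self, name, mdb_type="custom", mdb_version="4.18.1"):
        self.name = name
        self.mdb_type = mdb_type          # e.g. `citation` or `custom`
        self.mdb_version = mdb_version    # semantic version string
        self.date_created = datetime.now()

class Dataset:
    def __init__(self, metadatablocks=None):
        if metadatablocks is None:
            # requirement: default metadatablocks when none are passed in
            metadatablocks = [MetaDataBlock(n, mdb_type=n) for n in DEFAULT_MDB_NAMES]
        # requirement: custom metadatablocks by passing them to the constructor
        self.metadatablocks = {mdb.name: mdb for mdb in metadatablocks}

ds = Dataset()                                 # defaults
custom = Dataset([MetaDataBlock("my_block")])  # custom mdb passed at construction
```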
## Actions

### 0. Pre-Requisites

### 1. Research

#### Design
- [ ] https://stackoverflow.com/questions/50041661/bidirectional-data-structure-conversion-in-python
- [ ] CSV
- [ ] JSON
- [ ] Dataverse Upload JSON default
- [ ] Dataverse Download JSON default
- [ ] DSpace
- [ ] XML
- [ ] DDI
- [ ] think it with use-cases in mind
- [ ] think it with mappings in mind
- [ ] think it with integrations in mind
#### Schema
- [ ] required, data type, unique, limits, formats (email, date), minItems
- [ ] controlledVocabulary: subject, authorIdentifierScheme, contributorType, country, journalArticleType, language, publicationIDType
- [ ] pydataverse models
- [ ] metadatablocks: mdb.citation?
- [ ] make usage of mdbs easy
- [ ] metadatablocks:
- [ ] CSV
- [ ] JSON
- [ ] Dataverse Upload JSON default
- [ ] Dataverse Download JSON default
- [ ] DSpace
- [ ] XML
- [ ] DDI
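A plain-Python sketch of the schema constraints listed above (required, data type, date format, controlled vocabulary). The field names and vocabulary values are illustrative assumptions, not the actual Dataverse citation schema:

```python
from datetime import date

# Illustrative vocabulary only, not the real Dataverse subject list.
SUBJECT_VOCAB = {"Arts and Humanities", "Engineering", "Other"}

def validate_fields(data):
    """Collect schema violations instead of failing on the first one."""
    errors = []
    if "title" not in data:  # required
        errors.append("title is required")
    elif not isinstance(data["title"], str):  # data type
        errors.append("title must be a string")
    if "subject" in data and data["subject"] not in SUBJECT_VOCAB:  # controlled vocabulary
        errors.append("subject not in controlled vocabulary")
    if "dateOfDeposit" in data:  # format: date
        try:
            date.fromisoformat(data["dateOfDeposit"])
        except ValueError:
            errors.append("dateOfDeposit must be an ISO date (YYYY-MM-DD)")
    return errors
```

With pydantic, the same constraints would move into field types and validators instead of a hand-written function.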
#### Tools
- [ ] pydantic
- [ ] https://avro.apache.org/docs/current/gettingstartedpython.html
- [ ] https://jmespath.org/
- [ ] https://daffodil.incubator.apache.org/
- [ ] main.create_model()
- [ ] https://github.com/koxudaxi/datamodel-code-generator
- [ ] pydantic_sqlalchemy
- [ ] sqlalchemy_to_pydantic
- [ ] Idea: For JSON, get the pydantic data model from a JSON schema file and the mapping from a JSON file. The key is the location of the source variable, the value the name of the pyDataverse attribute.
- [ ] Idea: For CSV, get the pydantic data model from the CSV header row and type row and the mapping from a JSON file. The key is the name of the source variable, the value the name of the pyDataverse attribute.
- [ ] Offer the default schemas used by Dataverse: DataverseDefault(), DatasetDefault(), DatafileDefault()
- [ ] Use pydantic models with validators for the internal data structure: https://pydantic-docs.helpmanual.io/usage/models/
- [ ] Validate incoming data dicts
- [ ] Export data with .dict() and .json()
- [ ] Create a JSON Schema with pydantic: https://pydantic-docs.helpmanual.io/usage/schema/#schema-customization
- [ ] jsonpath https://pypi.org/project/jsonpath-ng/
- [ ] schema
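The mapping idea from the items above can be sketched like this. The mapping format, paths, and attribute names are assumptions for illustration, not actual pyDataverse mapping files (a library like jsonpath-ng could replace the hand-rolled path walker):

```python
# Hypothetical mapping: keys are the location of the source variable in
# the download JSON, values the name of the pyDataverse attribute.
MAPPING = {
    "datasetVersion.metadataBlocks.citation.displayName": "citation_display_name",
    "identifier": "pid",
}

def resolve(data, dotted_path):
    """Walk a nested dict along a dotted path."""
    for key in dotted_path.split("."):
        data = data[key]
    return data

def map_to_attributes(source_json, mapping=MAPPING):
    return {attr: resolve(source_json, path) for path, attr in mapping.items()}

source = {
    "identifier": "doi:10.5072/FK2/ABC123",
    "datasetVersion": {
        "metadataBlocks": {"citation": {"displayName": "Citation Metadata"}}
    },
}
```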
#### Architecture
- [ ] Idea: Create a base class (ABC or normal), called BaseModel()
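A minimal sketch of that idea, assuming an ABC whose subclasses implement validate() and dict() and inherit json(). The method set is inferred from the class listings later in this issue, not a settled design:

```python
from abc import ABC, abstractmethod
import json

class BaseModel(ABC):
    """Shared base for Dataverse, Dataset, Datafile, etc. (sketch)."""

    @abstractmethod
    def validate(self):
        """Return True if the internal data structure is valid."""

    @abstractmethod
    def dict(self):
        """Export the internal data as a dict."""

    def json(self):
        """Export the internal data as a JSON string, built on dict()."""
        return json.dumps(self.dict())

class Datafile(BaseModel):
    def __init__(self, filename):
        self.filename = filename

    def validate(self):
        return bool(self.filename)

    def dict(self):
        return {"filename": self.filename}
```

Instantiating BaseModel directly raises TypeError, which enforces the interface on every model class.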
### 2. Plan
- [ ] Identify use-cases
- [ ] Define requirements
- [ ] Collect mappings
- [ ] Mapping DSpace JSON #47
- [ ] Mapping DDI XML #18
- [ ] Mapping custom JSON #48
- [ ] Mapping CSV templates #107
- [ ] Mapping upload JSON #108
- [ ] Mapping Download JSON #109
- [ ] Collect Integrations
- [ ] Pandas #97
- [ ] Model - API #98
- [ ] Prioritize features, especially mappings. Only the most important ones for this release.
#### Prioritize
- In:
- Dataset Default Download JSON: import and export
- Dataset Custom Download JSON: import and export
- Dataset Upload JSON: import and export
- CSV templates
- Out:
- DSpace
- DDI XML
- Custom JSON
### 3. Implement
- [ ] Write tests
- [ ] Integration tests
- [ ] Write/Update code
- [ ] Create a base class (ABC or normal), called BaseModel()
- [ ] Implement Dataverse Upload JSON mappings and schemas
- [ ] Implement Dataverse Download JSON mappings and schemas
- [ ] Prototype: CSV templates mappings and schemas
- [ ] Prototype: DSpace mappings and schemas
- [ ] Prototype: DDI XML mappings and schemas
- [ ] Write/Update Docs
- [ ] Write/Update Docstrings
- [ ] Run pytest
- [ ] Run tox
- [ ] Run pylint
- [ ] Run mypy
#### Visualize data flow / architecture
- Draw all functions, paths, etc.
- Use set_mdb() or get_mdb() instead of a generic get()/set() function
  - https://en.wikipedia.org/wiki/Mutator_method#Python
  - Dataset.set_mdb(mdb_name, mdb)
  - Dataset.mdb(mdb_name, mdb)
  - Dataset.mdbs (@property)
- validate()
- json: @property
- json(): setter
- dict: different name?, @property
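The accessor style above, sketched with explicit methods plus a read-only @property. This is a design sketch of the option under discussion, not the final API:

```python
class Dataset:
    def __init__(self):
        self.__mdbs = {}  # name-mangled internal store of metadatablocks

    def set_mdb(self, mdb_name, mdb):
        """Explicit setter for a single metadatablock."""
        self.__mdbs[mdb_name] = mdb

    def get_mdb(self, mdb_name):
        """Explicit getter for a single metadatablock."""
        return self.__mdbs[mdb_name]

    @property
    def mdbs(self):
        """Read-only view of all metadatablocks."""
        return dict(self.__mdbs)
```

The @property returns a copy, so callers cannot mutate the internal dict by accident; all writes go through set_mdb().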
`models.py` sketch:
class Dataverse():
.__created_at
.json
.validate
.json()
.metadata()
.metadata
.__get_dataverse_download_json()
.__validate()
class Dataset():
.__created_at
.json
.validate
.json()
.mdb()
.mdbs
.__get_dataverse_download_json()
.__validate()
class Datafile():
.__created_at
.json
.validate
.json()
.metadata()
.metadata
.dataframe
.__get_dataverse_download_json()
.__validate()
class BaseMetaDataBlock(ABC):
self.__mdb_type = "custom" # options: `citation`, `journal` etc and `custom`
self.__mdb_version = "4.18.1" # options: for Dataverse mdb types, the Dataverse version as a semantic version string; for custom, a custom semantic version string
self.__mdb_date_created = datetime.now() # date the metadatablock was created
class MetaDataBlock(BaseMetaDataBlock):
.__created_at
.__name
class MetaDataBlockEntry():
.__created_at
.__value
.__multiple
.__type_class
.__class
class Roles():
.__created_at
class Group():
.__created_at
class User():
.__created_at
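A runnable sketch of the BaseMetaDataBlock/MetaDataBlock pair above. The private attributes and defaults follow the listing; the constructor signature and properties are assumptions:

```python
from abc import ABC
from datetime import datetime

class BaseMetaDataBlock(ABC):
    def __init__(self, mdb_type="custom", mdb_version="4.18.1"):
        self.__mdb_type = mdb_type                # `citation`, `journal`, ... or `custom`
        self.__mdb_version = mdb_version          # semantic version string
        self.__mdb_date_created = datetime.now()  # date the metadatablock was created

    @property
    def mdb_type(self):
        return self.__mdb_type

    @property
    def mdb_version(self):
        return self.__mdb_version

class MetaDataBlock(BaseMetaDataBlock):
    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.__name = name  # mangled to _MetaDataBlock__name

    @property
    def name(self):
        return self.__name
```

The double-underscore attributes rely on Python name mangling, so subclasses and callers can only read them through the properties.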
### 4. Follow-Ups
- [ ] Review
- [ ] Code
- [ ] Tests
- [ ] Docs
Just discovered this issue, and the idea seems to align very well with what has been done with EasyDataverse already. That library also utilizes pydantic and generates objects according to the metadatablock schemes found at api/metadatablocks/blockname.
Wouldn't it make sense to merge the functionality into pyDataverse? In my opinion, having a single Python library makes more sense, since both are heading in the same direction. What do you think @skasberger @pdurbin @poikilotherm?
I think a single library would be easier for the community, sure.
Agree on that, especially as pyDataverse is very lightweight and is made to be built upon by other, more specialized services/functions.
As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python