disco
disco copied to clipboard
dataset storage feature and database integration
Feature: Dataset metadata storage in Disco
This PR adds new API endpoints to the server allowing to push dataset description and retrieve them. The frontend is updated to use these new endpoints.
Currently, a dataset is simply some features description and some information about the intended usage of features. We store the URL and the source of the data (currently only direct url or google drive link) to be able to retrieve the data when needed, the data is not stored in Disco.
As of now, we can just add new datasets and browse them. Because there isn't any user authentication in Disco, we can't add the security measures that would allow deleting or updating datasets.
Server change
New router
A new router has been added in src/router/controllers
, it is managing all the requests related to the datasets. Its role is defining API endpoint exposed by the server for dataset metadata related task. These endpoint are meant to be used by the UI frontend, but could be used by any client
Database integration
Storing the data for this feature is mandatory, however Disco did not have any database until now. This is not a trivial change, especially if integration must be done properly. Making the disco server depending on a database is a new requirement for starting the application. The database schema must also be maintained with migrations (see src/database/migrations/*
).
Indeed, changing the database (i.e. adding a new column) must be done via a new migration if we want to keep old data. When the new version is deployed we have to take into account that there will already be some data stored according to the old schema and update them accordingly.
To manage the migrations, the schema, and the database queries from Typescript, I have chosen to use MikroORM for its good integration with Typescript (preventing untyped SQL query surprise error at runtime) and PostgreSQL as a robust relational database backend.
For better separation of concerns, most of the code related to the integration with the database is located in src/database/entities
(Objects representing database rows) and src/database/repositories
(services abstracting database queries to be used by the rest of the application)
DTOs are objects representing different views of the entities and can be mapped to database entities internally when needed to do a requests. We don't let the user directly input entities to have more control (at the cost of maintaining mappers) and to decouple our business logic from the database schema.
As a security measure, user input validation has been added on DTOs at the service layer thanks to class-validator. This prevent any exploits based on bypassing the frontend and sending requests directly to the server API.
ESM module and update dependencies
- Typescript has been upgraded to the last version
- Migrated server codebase to an ESM module.
- https://pawelgrzybek.com/all-you-need-to-know-to-move-from-commonjs-to-ecmascript-modules-esm-in-node-js/
- "ECMAScript modules are the official standard format to package JavaScript code for reuse"
- Unfortunately this also means mandatory file extensions in imports
- ESM modules can import both ESM module and legacy CommonJS modules
Frontend changes
Deployment considerations
- Until now Disco didn't use database, and is deployed via Google App Engine, to be able to deploy these change, a database must be added first.
- I made the changes to the GitHub Actions workflow to handle secret storage (Database password...) and to provide it to GCP at deploy time.
- The current implementation of the datasets still make the server download and parse the files provided by the user at least once (needed to count the number of lines, check validity of user provided columns etc..)
- This could mean that big files would be downloaded often by the server, consuming a lot of bandwidth.
- We can discuss alternative ways if this is a problem.
Possible improvements
- Frontend
- It's my first time doing frontend changes on Disco and I'm not really a Vue expert, the UI side might need a better pass.
- Some types (like the DTO) could be shared between server and frontend for type-safety by adding them to the core library, but that would mean also adding class-validator to the library
- Same problem with entities and MikroORM
- Some pagination would be nice for the dataset listing.
These are some big changes, and it makes the Disco architecture a bit more complex. So let's discuss some points if needed.
This can't be merged until approved and a database has been deployed
Will separate the frontend part in another PR as discussed with @martinjaggi
Closing as there is no current plan/need to integrate the dataset storage feature and the PR is getting old.