FAIRDataPoint Add bootstrapping FDP from fixtures

With this, there will be a folder of YAML/RDF files with a specific structure that can be used to pre-populate an FDP instance when it is empty. This will remove the need for the legacy internal RDF migrations and the DB migrations that control default data inside FDP. Moreover, it will be possible to change these fixture files, making it easier to create new FDP instances with different content.

This should cover:

Metadata records (RDF)
Users
Metadata schemas
Resource definitions
Settings
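
To make the "pre-populate if empty" idea concrete, here is a minimal sketch using only the Java standard library. All names (`FixtureBootstrap`, `FixtureTarget`, `bootstrapIfEmpty`) are hypothetical and do not exist in FDP; how emptiness is determined and how a single file is parsed are left to the target.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: none of these names exist in FDP.
final class FixtureBootstrap {

    /** Minimal view of a store that can report emptiness and accept one fixture file. */
    interface FixtureTarget {
        boolean isEmpty();
        void load(Path fixtureFile) throws IOException;
    }

    /**
     * Loads every regular file in fixtureDir (sorted by path) into the target,
     * but only when the target is still empty. Returns the files that were loaded.
     */
    static List<Path> bootstrapIfEmpty(FixtureTarget target, Path fixtureDir) throws IOException {
        List<Path> loaded = new ArrayList<>();
        if (!target.isEmpty() || !Files.isDirectory(fixtureDir)) {
            return loaded; // existing data is never overwritten
        }
        List<Path> files;
        try (Stream<Path> entries = Files.list(fixtureDir)) {
            files = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            target.load(file);
            loaded.add(file);
        }
        return loaded;
    }
}
```

Running the bootstrap a second time is a no-op once the target holds data, which is the property that would let it replace default-data migrations.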

Also see post by @MarekSuchanek in #581

Feb 26 '25 13:02 MarekSuchanek

related to

#581

Feb 27 '25 09:02 dennisvang

Of the file types listed above, the representation of the metadata records and metadata schemas is clear (RDF Turtle and SHACL, respectively). The other files (Users, Resource definitions and Settings) should be represented as JSON files. Internally, the software can read the JSON files and use the information in them to create SQL queries or other procedures. @TODO: create JSON Schemas for the users, resource definitions and settings.
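
As a rough illustration of the mapping above, the fixture kinds and their formats could be captured in one place. This is only a sketch: the enum, its field names, and the assumed file extensions are hypothetical, not FDP code.

```java
import java.util.Locale;

// Hypothetical sketch: the enum and its names are illustrative, not FDP code.
enum FixtureKind {
    METADATA_RECORDS("RDF Turtle", ".ttl"),
    METADATA_SCHEMAS("SHACL (Turtle)", ".ttl"),
    USERS("JSON", ".json"),
    RESOURCE_DEFINITIONS("JSON", ".json"),
    SETTINGS("JSON", ".json");

    final String format;
    final String extension; // assumed file extension for this kind

    FixtureKind(String format, String extension) {
        this.format = format;
        this.extension = extension;
    }

    /** True when the file name ends with the extension assumed for this kind (case-insensitive). */
    boolean accepts(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(extension);
    }
}
```

Because metadata records and SHACL schemas would both be Turtle files, the extension alone cannot tell them apart; a loader would need something else, such as one subfolder per kind, to decide which kind a file belongs to.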

Apr 14 '25 07:04 luizbonino

A summary of the present situation, as I understand it.

(@MarekSuchanek, @luizbonino please correct me if I'm wrong)

[!NOTE] The links in the text point to relevant parts of the source.

What is stored where

The FDP (v2) uses two types of database to store various kinds of information (user data and application data):

relational db (PostgreSQL, also see #519):
- API keys
- membership (Undocumented... What's this for? Related to access control?)
- index
- metadata schemas (SHACL in .ttl format, stored as plain text in the definition column)
- resource definitions
- search queries
- settings
- user accounts
RDF triple store (user-configurable graph database, e.g. GraphDB):
- catalog metadata
- generic metadata

Relational db

The relational database is accessed using JPA, and (relational) schema migrations are managed using Flyway, which is configured for automatic migration on application startup. Relational schema migrations for the production profile are defined in resources/db/migration (DDL in .sql files). The development profile (i.e. a Spring profile) uses the schema migrations from the production profile and adds data migrations, defined in resources/dev/db/migration, for users, metadata schemas, etc. As far as I know, there are no data migrations yet for the production profile.

These data migrations include pre-determined unique identifiers (UUID as primary key). The app currently relies on these identifiers as defined in KnownUUIDs.

The metadata schemas for the development profile are hardcoded in the .sql files, e.g. V0001.2__dev-data-schemas.sql. There is also a database/fixtures dir that has actual .ttl files, but I'm not sure if they are already used. Similar fixtures were used in v1 for MongoDB migrations.

RDF triple store

The triple store is accessed using RDF4J.

In v1, RDF migrations were implemented using the custom spring-rdf-migration package. See v1 RDF migrations.

In the current (v2) develop branch, the spring-rdf-migration dependency is still present, and production migrations based on that package are still defined. However, there is no runner to execute these production migrations. Moreover, the production migration files are not fully implemented and are tagged for removal. Also see the discussion in #581.

For the development profile, the current (v2) develop branch uses a different mechanism to execute RDF migrations, implemented in the RDFDevelopmentMigrationRunner and RdfMetadataMigration classes. When the development profile is active, this migration runner also executes AclMigrations (What for, and why only for development?).

[WORK IN PROGRESS]

Notes

Currently, the relational migrations are stored in src/main/resources, whereas the RDF migrations are stored in src/main/java. This seems inconsistent, and it leads to recurring problems such as #252.
I think we should make a clear distinction between:
1. Information that is truly indispensable to the normal operation of the app, i.e. information without which the app cannot run, such as default settings or metadata that are essential to FDP spec compliance. Users should not be able to modify any of those. (not quite the same, but closely related to static data)
2. Other information, i.e. information that the app can run without, even if that limits functionality, such as user accounts. This should not be part of the source, other than test fixtures or examples for documentation purposes.
Currently, there is no clear distinction, as e.g. KnownUUIDs includes identifiers for both types.
I think in many cases the use of fixtures (as defined in terminology) is inconvenient, because they contain both domain information and internal information that is relevant to the implementation only, such as UUIDs. Often, ORM-based data migrations are better suited. These do not require hard-coding of any unique identifiers, can handle relationships with existing data, and can be versioned like schema migrations. We could probably use Flyway Java-based migrations for this.
Some options to consider for initializing database content:
- using the API
- "import" functionality (although we want to be independent of the fdp-client)
- data migrations and/or fixture loading triggered from the deployment workflow
Perhaps we should discuss whether Flyway auto-migration should remain enabled, or whether we should pull this into the deployment workflow instead.
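
To illustrate the note above about data migrations that avoid hard-coded identifiers, here is a minimal sketch of the "look up by natural key, insert only if missing" pattern. It uses an in-memory stand-in for the user table so that it runs with the standard library only. All names are hypothetical, and treating the email address as the natural key is an assumption. In a real Flyway Java-based migration, equivalent logic would sit in the `migrate` method of a class extending `BaseJavaMigration`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: not FDP code, and not tied to any real table layout.
final class DefaultUserMigration {

    /** In-memory stand-in for a user table; ids are generated on insert, never hard-coded. */
    static final class UserTable {
        private final Map<String, UUID> idByEmail = new LinkedHashMap<>();

        UUID findIdByEmail(String email) {
            return idByEmail.get(email);
        }

        UUID insert(String email) {
            UUID id = UUID.randomUUID();
            idByEmail.put(email, id);
            return id;
        }

        int size() {
            return idByEmail.size();
        }
    }

    /** Ensures a user with the given email (assumed natural key) exists and returns its id. */
    static UUID ensureUser(UserTable users, String email) {
        UUID existing = users.findIdByEmail(email);
        return existing != null ? existing : users.insert(email);
    }
}
```

Because the migration resolves the record by its natural key, later migrations can reference the same user without knowing its UUID in advance, and re-running the step does not create duplicates.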

Terminology

Some terminology, to make sure we are talking about the same thing.

fixture: My interpretation is in line with Django's definition:

A fixture is a collection of files that contain the serialized contents of the database. [...]

This includes unique identifiers such as primary keys or natural keys.

Apr 24 '25 10:04 dennisvang