
Optimized directory and file layout for Collections

Open mpetris opened this issue 3 years ago • 0 comments

Currently, a Collection is a directory containing a header.json file with metadata and a subdirectory holding the Annotations of that Collection. Each Annotation sits in its own file. For Collections (or, even worse, for whole CATMA Projects) with tens or hundreds of thousands of Annotations, this is a performance bottleneck when loading the Annotations.

The one-Annotation-per-file layout was chosen over an all/many-Annotations-in-one-file layout to avoid Git conflicts when creating new Annotations.

The goal is therefore to reach good read and write performance without git conflicts on Annotation creation:

  • We introduce user-specific sections within a Collection for creating new Annotations. Each user will have a dedicated JSON file, named after the username, for writing new Annotations.
  • Once an Annotation has been created in a file, all edit and delete operations on it will happen in that same file.
  • Edit and delete operations can be executed by all participating members of that CATMA Project, not just the user to whom the dedicated JSON file belongs.
  • To avoid infinite growth of the dedicated JSON files, a paging mechanism will be introduced that splits them into smaller files which are still large enough for sufficient read performance. The maximum file size should stay below 180 KB to enable conflict resolution in the GitLab UI, which has a limit of 200 KB including conflict markers.
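
The paging rule described above could be sketched roughly as follows. This is only an illustration in Python, not CATMA's actual implementation: the function name `append_annotation`, the assumption that each page is a JSON array of Annotation objects, and the reading of "180 KB" as 180,000 bytes are all hypothetical. Only the `<username>_<page>.json` naming follows the layout proposed in this issue.

```python
import json
import re
from pathlib import Path

# Assumption: "below 180 KB" is read as fewer than 180,000 bytes per page file.
MAX_PAGE_BYTES = 180_000


def _user_pages(annotations_dir: Path, username: str) -> list[tuple[int, Path]]:
    """Return (page_number, path) pairs for one user, sorted by page number."""
    pattern = re.compile(re.escape(username) + r"_(\d+)\.json")
    pages = []
    for path in annotations_dir.iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            pages.append((int(match.group(1)), path))
    return sorted(pages)


def append_annotation(annotations_dir: Path, username: str, annotation: dict,
                      max_bytes: int = MAX_PAGE_BYTES) -> Path:
    """Write a new Annotation to the user's latest page, or start a new page.

    Assumption: every page file holds a JSON array of Annotation objects.
    Returns the path of the page the Annotation was written to.
    """
    annotations_dir.mkdir(parents=True, exist_ok=True)
    pages = _user_pages(annotations_dir, username)
    if pages:
        number, path = pages[-1]
        existing = json.loads(path.read_text(encoding="utf-8"))
        candidate = json.dumps(existing + [annotation]).encode("utf-8")
        if len(candidate) < max_bytes:
            # Still below the limit: keep writing to the current page.
            path.write_bytes(candidate)
            return path
        number += 1  # current page would exceed the limit, so roll over
    else:
        number = 1
    # A single oversized Annotation still gets a page of its own.
    new_page = annotations_dir / f"{username}_{number}.json"
    new_page.write_bytes(json.dumps([annotation]).encode("utf-8"))
    return new_page
```

Because only new Annotations are routed by username, two users creating Annotations at the same time never touch the same file. Edits and deletes would still go to whichever page already contains the Annotation, so conflicts remain possible there and would have to be resolved, for example in the GitLab UI.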

Example file and folder layout for a Collection with two users A and B with B having created more Annotations than fit in one page:

a_collection/
├─ header.json
└─ annotations/
   ├─ A_1.json
   ├─ B_1.json
   └─ B_2.json
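
With this layout, loading a Collection means reading a handful of page files instead of one file per Annotation. A minimal sketch in Python, assuming (hypothetically) that each page is a JSON array of Annotation objects and that `load_collection` is just an illustrative name:

```python
import json
from pathlib import Path


def load_collection(collection_dir: Path) -> tuple[dict, list[dict]]:
    """Read header.json and all Annotation pages of one Collection.

    Assumption: each file under annotations/ is a JSON array of Annotations.
    """
    header = json.loads((collection_dir / "header.json").read_text(encoding="utf-8"))
    annotations: list[dict] = []
    # One read per page file (A_1.json, B_1.json, B_2.json, ...).
    for page in sorted((collection_dir / "annotations").glob("*.json")):
        annotations.extend(json.loads(page.read_text(encoding="utf-8")))
    return header, annotations
```

For the example above this performs four file reads (the header plus three pages), regardless of how many Annotations the pages contain.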

mpetris · Jan 19 '22 16:01