Store list of published files in the database
We would like to improve our search functionality to allow discovery of published files that meet a set of requirements (and we have promised this as a deliverable in one of our funded projects).
The idea is that someone could search for all WFDB format files that are over 1 minute long and that contain ECG (and later, perhaps more sophisticated searches).
*I think* the first step is to track published files within our database.
To begin, my suggestion is that we create a table that tracks file path and format (e.g., .hea, .dat, .csv). Next we could consider tracking additional file-level information, like:
- Signal types
- Duration
- Sampling frequency
- Number of channels
- ...
These additional fields would be format-specific, so I'm not sure whether we would track them in the same table, or in a set of linked tables (probably this).
Anyway, as a first step, my suggestion is to add the following model and populate it with a list of files upon publication:
PublishedFile (
    id ID PK,
    path TEXT,
    project FOREIGN_KEY,
    format TEXT,
    file_size_bytes BIGINT,
    created_at TIMESTAMP,
)
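For concreteness, a minimal Django sketch of that model is below. This is only an illustration: the `PublishedProject` foreign-key target and the exact field types are assumptions on my part, not a final design.

```python
from django.db import models


class PublishedFile(models.Model):
    """Sketch: one row per file in a published project."""

    # Assumes the existing PublishedProject model in the project app;
    # field names and types are illustrative only.
    project = models.ForeignKey(
        'project.PublishedProject',
        on_delete=models.CASCADE,
        related_name='files',
    )
    path = models.TextField()                  # path relative to the project root
    format = models.CharField(max_length=16)   # e.g. '.hea', '.dat', '.csv'
    file_size_bytes = models.BigIntegerField()
    created_at = models.DateTimeField(auto_now_add=True)
```

Rows would be created (in bulk) as part of the publication step.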
@bemoody @Chrystinne any thoughts?
Storing such information in the physionet database is not a good idea. The main reasons that come to mind are:
- Size: ensure it is possible to keep full daily backups, and to transfer the physionet database between servers quickly.
- Compartmentalization: ensure that the backups avoid storing information that might need to be permanently deleted at some point, such as when somebody decides to use sensitive identifiers in their filenames.
Better options might be:
- Storing file metadata in a Postgres database that is separate from physionet.
- Storing the information in multiple SQLite databases, i.e. a separate sqlite file for each project.
- Not using SQL at all, and using something purpose-built.
If the information is stored in a separate SQLite file per project, then people can download that file and query it locally. (As we did with pbsearch in the old days.)
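For illustration, a minimal sketch of what "query it locally" could look like with a per-project SQLite file. The file name (files.db) and the schema (a single files table with path/format/size columns) are assumptions made up for this example, not an existing PhysioNet format:

```python
import sqlite3

# Hypothetical per-project index file downloaded alongside the data.
conn = sqlite3.connect("files.db")

# Find all WFDB header files, largest first; the "files" table and its
# columns are assumed for illustration.
rows = conn.execute(
    "SELECT path, file_size_bytes FROM files "
    "WHERE format = ? ORDER BY file_size_bytes DESC",
    (".hea",),
).fetchall()

for path, size in rows:
    print(f"{path}\t{size} bytes")

conn.close()
```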
Although it may sound nice to use Django ORM and rely on Django for data migrations, I'm also wary that that might mean having migrations that take days to run.
Thanks Benjamin.

- Size: ensure it is possible to keep full daily backups, and to transfer the physionet database between servers quickly.

It doesn't feel like it would be huge. Let's say 200 bytes per row; 100 million files would then be 20 GB.

- Compartmentalization: ensure that the backups avoid storing information that might need to be permanently deleted at some point, such as when somebody decides to use sensitive identifiers in their filenames.

If we delete files, we delete the data. We should never get to the point where we are publishing sensitive filenames, and we should introduce additional checks to avoid this.

- Better options might be: Storing file metadata in a Postgres database that is separate from physionet.

I'm not against this, but it feels like it adds unnecessary overhead right now. If we want to create an additional database later, we can migrate.

- Better options might be: Storing the information in multiple SQLite databases, i.e. a separate sqlite file for each project.

I think this makes implementing a cross-platform search less straightforward, but we could adopt it later if there were benefits to this approach.

- Not using SQL at all, and using something purpose-built.

We could just create a files.json metadata file within each folder (and I think this could be worth doing regardless), but I'm not sure the file is especially helpful for implementing the cross-site search.

Side note, but the file-based list may also be at least partially solved if we implement: https://github.com/MIT-LCP/physionet-build/issues/2184 (so there would be a croissant.json file with file-level metadata).
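In case it helps the discussion, here is a rough sketch of how a per-folder files.json could be generated at publication time. The file name and the fields listed are assumptions for illustration only, not a decided format:

```python
import json
from pathlib import Path


def write_files_json(project_dir):
    """Hypothetical sketch: write a files.json listing every file in a
    published project directory, with basic per-file metadata."""
    project_dir = Path(project_dir)
    manifest = [
        {
            "path": str(p.relative_to(project_dir)),
            "format": p.suffix,
            "file_size_bytes": p.stat().st_size,
        }
        for p in sorted(project_dir.rglob("*"))
        if p.is_file()
    ]
    with open(project_dir / "files.json", "w") as f:
        json.dump(manifest, f, indent=2)
```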
Yes, 20 GB is huge compared to 200 MB.
Are you thinking that "file metadata" would be public? Certainly someone could make that argument, but I don't think that's wise.
Let's discuss! My intent was that some of the information would likely be made public, yes. The metadata would only apply to published files, which are already public. Keeping this information in the main database keeps the system simple and allows us to deliver the search improvements we've committed to. Even at large scale, the size seems manageable (~20 GB is bigger than 200 MB, but not much of a difference practically, I don't think?).
@bemoody a few notes on points raised in the dev meeting today:
- The proposed model wouldn't support "search for all WFDB format files that are over 1 minute long and that contain ECG".
I agree, and sorry for suggesting otherwise. The idea is that we implement the first step: a basic index of published files. Once we have this, we can add richer file-level metadata in linked tables or files (see the sketch after this list).
- There may be better alternatives to storing the data in a relational database.
That's possible, but Postgres feels like a good fit. The data is simple, structured, and tightly coupled to project models. Using Django keeps things simple, and we can rethink the backend later if needed.
- Storing the data may strain the database/Django
With ~17.7M files, our storage size would be ~3.5 GB. If we open the data to public queries and connections become a strain, then we could look at indexing, caching, or a replica database.
- Unclear benefit.
This is a first step towards supporting file-level search. Even basic filtering would be a step forward. Longer term, it will (I think!) lead to a better user experience and reduce unnecessary load from users downloading full corpora just to find relevant files.
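To make the linked-tables idea concrete, here is a hedged sketch of a format-specific table on top of the basic index, plus the example search from the top of this issue. The WFDBRecordMetadata model, its fields, and the query are hypothetical, and assume the PublishedFile sketch above:

```python
from django.db import models


class WFDBRecordMetadata(models.Model):
    """Hypothetical format-specific metadata linked to the basic file index."""

    file = models.OneToOneField(
        'PublishedFile', on_delete=models.CASCADE, related_name='wfdb_metadata')
    signal_types = models.JSONField(default=list)      # e.g. ['ECG', 'PPG']
    duration_seconds = models.FloatField(db_index=True)
    sampling_frequency = models.FloatField()
    num_channels = models.PositiveIntegerField()


# Example: "all WFDB header files over 1 minute long that contain ECG".
matches = WFDBRecordMetadata.objects.filter(
    file__format='.hea',
    duration_seconds__gt=60,
    signal_types__contains=['ECG'],   # JSONField containment (Postgres)
).select_related('file')
```

If public queries did become heavy, adding indexes on the filter columns (as with db_index above) and, later, a read replica would be natural first steps.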