sgn icon indicating copy to clipboard operation
sgn copied to clipboard

partition nd_experiment table for defined phenotype storage use-case

Open nickmorales opened this issue 4 years ago • 6 comments

Expected Behavior

There are use-cases, such as storing phenotypic values, where the nd_experiment table is becoming overloaded in large databases.

I propose adding a new 'nd_experiment_phenotype_bridge' table to provide the defined links (e.g. linking phenotype to stock to project in the case of storing phenotypes).

This table would have foreign keys to the stock, project, phenotype, metadata.md_files, metadata.md_image, and metadata.md_json tables and could have fields of operator and upload_date that are now stored in nd_experimentprop.

This would reduce the number of tables to join over by not using the nd_experiment_* linking tables (e.g. nd_experiment_stock, nd_experiment_project, nd_experiment_phenotype) or the phenome.nd_experiment_md_* tables (e.g. nd_experiment_md_files, nd_experiment_md_images, nd_experiment_md_json) in this very rigidly defined scenario. In summary, this would bring the number of intermediate tables to join over from 7 to 1.

For Bugs:

Environment

Steps to Reproduce

nickmorales avatar Sep 24 '20 13:09 nickmorales

We should use database -level partition of the nd_experiment table.

Also, maybe we can delete the nd_experiment_phenotype linking table? Isn't it useless? (A phenotype will never go to two experiments at the same time, so the relationship is not many to many, but one to many.

lukasmueller avatar Sep 25 '20 13:09 lukasmueller

don't delete any of the chado tables (e.g. nd_experiment_phenotype) because it will break some GMOD tools and Bio::Chado::Schema We can just not use them. If you add any tables to the public schema we won't be able to easily use Bio::Chado::Schema for queries

nmenda avatar Sep 25 '20 14:09 nmenda

More info on db level table partitions: https://www.postgresql.org/docs/10/ddl-partitioning.html

lukasmueller avatar Sep 25 '20 15:09 lukasmueller

There is a create_date in the nd_epxeriment table that could be used to partition the table. We would create a partition for each year. Data is usually created in year intervals and that would improve performance majorly. Before you implement the different schema, I would definitely first try something like that.

lukasmueller avatar Sep 29 '20 18:09 lukasmueller

Just revisited this idea today with Lukas. I'll look into partitioning again in more depth.

bellerbrock avatar Jun 15 '22 20:06 bellerbrock

Amber: We should have a hackathon about this

lukasmueller avatar Jun 22 '23 18:06 lukasmueller