ODM2
ships_track advice needed
Hello, I'm a developer at the Norwegian Institute for Water Research (NIVA). We are in an ongoing project to modernize our IT infrastructure. One of the current pain points is our data model, which has slowly evolved over the past decades. We are considering how to progress from here. One obvious path is to keep evolving the data model we currently use. However, one of our developers has used ODM1 before (shoutout to @JamesSample), so ODM2 is under consideration as well. I'm a big fan of open source software, and getting access to the ODM2 tools and interoperability with other scientific organizations seems very worthwhile.
So, with that background out of the way, here is the problem I'm currently running into. One of the aims of our infrastructure development is to merge large time series data into the main database; currently these are stored separately. The time series are largely generated by two ongoing projects, namely the ferrybox and glider projects. Both are slated for future expansion. The ferrybox project concerns a number of ships that have water quality sensors onboard (ongoing for more than 10 years); the gliders are a set of small autonomous research platforms that also carry a fairly extensive sensor suite. All of the time series generated by these projects have real-time quality control applied according to a number of tests, roughly as defined in this document by NOAA.
So my first hunch was to try to map this to the ODM2 schema using data_quality, time_series_result and the ships track sampling feature. However, after digging into the data model a little, it seems that this won't be a very good fit. The current time series data looks approximately like this:
- 12 million location rows
- 255 million individual measurements (these include operational parameters of the sensor systems, so not all of these measurements have quality control flags associated with them)
- 408 million quality control flags
There does not seem to be a convenient mechanism to reuse measurement locations for multiple ResultValues in any of the result types. On average there are about 21 measurements per location, so I think it would be too inefficient to store the location on every measurement. In addition, there does not seem to be a mechanism to assign a data_quality level per ResultValue; it's only possible to assign a data_quality code to a result. However, results are not granular enough to describe the quality of the data while the time series is growing.
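As a rough back-of-the-envelope check of that inefficiency, here is a small Python sketch using the row counts above. The byte sizes are assumptions on my part (a location as three 8-byte floats, a reference to a shared location row as one 8-byte integer key), and it ignores indexes and per-row overhead, so treat the numbers as order-of-magnitude only.

```python
# Row counts quoted above (approximate).
LOCATION_ROWS = 12_000_000
MEASUREMENT_ROWS = 255_000_000

# Assumption: a location is three 8-byte floats (latitude, longitude, depth),
# and a reference to a shared location row is one 8-byte integer key.
BYTES_PER_LOCATION = 3 * 8
BYTES_PER_KEY = 8

measurements_per_location = MEASUREMENT_ROWS / LOCATION_ROWS

# Option A: repeat the location columns on every measurement row.
inline_bytes = MEASUREMENT_ROWS * BYTES_PER_LOCATION

# Option B: store each location once and reference it from each measurement.
shared_bytes = (LOCATION_ROWS * BYTES_PER_LOCATION
                + MEASUREMENT_ROWS * BYTES_PER_KEY)

print(f"measurements per location: {measurements_per_location:.2f}")
print(f"inline location columns:   {inline_bytes / 1e9:.2f} GB")
print(f"shared location table:     {shared_bytes / 1e9:.2f} GB")
```

Under these assumptions, repeating the location on every measurement costs roughly 6.1 GB of raw column data, against roughly 2.3 GB when locations are stored once and referenced by key.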
I have been experimenting a little with creating a new TrackResult that has two child tables, TrackResultValues and TrackResultLocations, and making it so that the Locations table can be referenced from several TrackResults that reference different results. This seems fairly promising, but it is also a rather radical break from the current structure, especially since I would like to add a nullable data_quality column to both of the child tables.
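To make the idea concrete, here is a minimal sketch of that layout using Python's built-in sqlite3 module. Only the three table names come from my experiment; every column name (ResultID, LocationID, DataQualityCode and so on) is a hypothetical placeholder, not taken from the ODM2 DDL, and the inserted values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TrackResults (
    ResultID INTEGER PRIMARY KEY
);
-- One row per position fix, shared by all values measured there.
CREATE TABLE TrackResultLocations (
    LocationID      INTEGER PRIMARY KEY,
    ValueDateTime   TEXT NOT NULL,
    Latitude        REAL NOT NULL,
    Longitude       REAL NOT NULL,
    DataQualityCode TEXT              -- nullable QC flag for the position
);
-- One row per measured value; many values may point at one location.
CREATE TABLE TrackResultValues (
    ValueID         INTEGER PRIMARY KEY,
    ResultID        INTEGER NOT NULL REFERENCES TrackResults(ResultID),
    LocationID      INTEGER NOT NULL
                    REFERENCES TrackResultLocations(LocationID),
    DataValue       REAL NOT NULL,
    DataQualityCode TEXT              -- nullable QC flag for the value
);
""")

# Two results (say, temperature and salinity) measured at the same position.
conn.executemany("INSERT INTO TrackResults (ResultID) VALUES (?)",
                 [(1,), (2,)])
conn.execute(
    "INSERT INTO TrackResultLocations VALUES (?, ?, ?, ?, ?)",
    (1, "2018-01-01T12:00:00Z", 59.9, 10.7, None),
)
conn.executemany(
    "INSERT INTO TrackResultValues "
    "(ResultID, LocationID, DataValue, DataQualityCode) VALUES (?, ?, ?, ?)",
    [(1, 1, 7.4, "good"), (2, 1, 33.1, None)],  # second value has no flag
)

rows = conn.execute("""
    SELECT v.ResultID, l.Latitude, l.Longitude, v.DataValue, v.DataQualityCode
    FROM TrackResultValues AS v
    JOIN TrackResultLocations AS l ON v.LocationID = l.LocationID
    ORDER BY v.ResultID
""").fetchall()
for row in rows:
    print(row)
```

The point of the sketch is that both values resolve to the same location row through the join, and that a quality flag can be present or absent per value (and per location) instead of per result.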
I'm sorry for the lengthy post, and I understand if you guys don't have time to consider this. I would, however, really like to try to use ODM2, especially since it seems like an incredible fit for the more regular monitoring data we collect. These are mostly manually performed in-situ measurements, species counts, and lab analyses on biota and water or sediment samples. So if anyone has advice on how to fit these time series into the ODM2 schema, that would be very much appreciated.