Add short term storage expiration indicator to history items
xref #20169
This simple approach should not be too expensive and can help the user identify when a dataset might be gone because it is stored in a short-term object store.
This works just by annotating the object store config with a new property, `object_expires_after_days`:
```yaml
- id: scratch
  type: disk
  device: device2
  weight: 0
  allow_selection: true
  private: true
  name: Scratch Storage
  description: >
    This object store is connected to institutional scratch storage. This disk space is not backed up and private to
    your user, and datasets belonging to this storage will be automatically deleted after one month.
  quota:
    source: second_tier
  files_dir: /home/dlopez/sandbox/data-gx/dev/objects/temp
  badges:
    - type: faster
    - type: less_stable
    - type: not_backed_up
    - type: short_term
      message: The data stored here is purged after a month.
  object_expires_after_days: 30
```
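For illustration, the indicator the UI shows could be derived from the dataset's `create_time` plus this new property. A minimal sketch (the `expiration_info` helper is hypothetical, not actual Galaxy code):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def expiration_info(
    create_time: datetime, object_expires_after_days: Optional[int]
) -> Optional[Tuple[datetime, int]]:
    """Return (expiration time, whole days left), or None for permanent storage."""
    if object_expires_after_days is None:
        return None  # the object store never expires data
    expires_at = create_time + timedelta(days=object_expires_after_days)
    days_left = (expires_at - datetime.now(timezone.utc)).days
    return expires_at, days_left
```

A negative `days_left` would mean the dataset is likely already purged.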
There are still some drawbacks to consider/resolve:
- [ ] Synchronize the object store config property `object_expires_after_days` with the actual expiration time of the object store. It seems the cleanup of the object store is handled by external processes, so this value must be kept in sync with the actual expiration time of the object store.
- [ ] Collections do not have an `object_store_id` property. I wonder if we could "estimate" or "assume" the object store ID of a collection by looking at the object store ID of the first dataset in the collection. This is not ideal, but maybe it could be a good enough workaround? I'm not sure how often collection elements are stored in mixed object stores, but I guess it could happen.
How to test the changes?
(Select all options that apply)
- [ ] I've included appropriate automated tests.
- [ ] This is a refactoring of components with existing test coverage.
- [ ] Instructions for manual testing are as follows:
- [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]
License
- [x] I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.
I'm anxious about this idea for a few reasons but Anton is the boss 🤷♀️.
If you descend into collections in the history panel - do you get the icon on individual datasets there? The query that summarizes states across the whole collection could accumulate the object store IDs at the same time - it would be a wild query, but it would probably be the easiest and most correct way to summarize the dataset collection. I guess we couldn't get a countdown in that case, but we could add a temporary-storage icon with more information, like a per-dataset breakdown, shown by clicking on it.
I understand and share your concerns, especially with collections. I also think the other proposed solutions, like sending emails, are even more concerning. It would be really hard to do it right and not turn it into a massive spam generator, so likely not worth it :sweat_smile:
If you descend into collections in the history panel - do you get the icon on individual datasets there?
I would say no. My idea was to do something less accurate, but enough to "inform" the user that the datasets or collections used will be temporary. I thought displaying something at the top level would be enough; if you drilled down into that collection, you must have already seen the "indication", and we could still display it at the top.
I know it is technically possible to mix elements from different object stores in the same collection, but will this be a common case? I was hoping we could assume a single common object store for the HDCA by peeking into just one of its datasets. But yeah, in the worst case, we could do what you suggest, aggregating the object store IDs in the summarize query, and if there is at least one object store ID known to be short-term, just display some warning at the top. This would probably already be a huge improvement in raising awareness of the temporary nature of the selected storage without needing many more features.
And we're certain we cannot just take scratch away from people who complain? We "promote" them to a "higher tier" of user where all their data is in permanent storage and advanced options are disabled. Not going to fly, huh?
I know it is technically possible to mix elements from different object stores in the same collection, but will this be a common case?
It is probably uncommon but they are pretty easy to create and it would be my guess that they would be more common/have more obvious use cases than say mixing dbkeys or file extensions and we deal with a mix of those in the UI in a mostly "correct" fashion.
The other option is to not show the indicator at all for collections, and only show it when the user drills down to the dataset level.
I made an attempt to include the set of object_store_ids as we do with dbKeys and extensions in https://github.com/galaxyproject/galaxy/pull/20332/commits/048e2d6dad9658e7242f49599a30a20574ec8f44, and then find the shortest time to expiration in any of them. I assume that as soon as one of the elements of a collection expires, the whole collection can be considered expired, as it can no longer be used completely.
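Under that assumption, picking the warning for an HDCA could look roughly like this (names hypothetical; the mapping from object store ID to `object_expires_after_days` would come from the object store config):

```python
from typing import Iterable, Optional

# Hypothetical mapping derived from the object store config above;
# None means the store never expires data.
EXPIRES_AFTER_DAYS = {"scratch": 30, "permanent": None}


def shortest_expiration_days(object_store_ids: Iterable[str]) -> Optional[int]:
    """Shortest expiration across the stores used by a collection, or None
    if every store is permanent. As soon as one element expires, the whole
    collection is effectively expired, so we warn with the smallest value."""
    finite = [
        days
        for os_id in object_store_ids
        if (days := EXPIRES_AFTER_DAYS.get(os_id)) is not None
    ]
    return min(finite) if finite else None
```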
Let me know if this is still a bad idea :sweat_smile:
I've run benchmarks on three different dataset collections: 1K, 5K, and 10K datasets. For each collection, I issued 100 requests to the endpoint:
`api/dataset_collections/{collection_id}?view=summary`
and recorded the minimum, maximum, and average response times (in milliseconds).
Without the `object_store_ids` field (the current code):
| Collection | Min (ms) | Max (ms) | Avg (ms) |
|---|---|---|---|
| 1K | 25.85 | 81.60 | 44.77 |
| 5K | 59.69 | 110.31 | 74.90 |
| 10K | 107.08 | 182.18 | 125.92 |
With the `object_store_ids` field (the changes proposed in https://github.com/galaxyproject/galaxy/pull/20332/commits/d105defac26f50f8507325855e060655944ad844):
| Collection | Min (ms) | Max (ms) | Avg (ms) |
|---|---|---|---|
| 1K | 26.84 | 56.01 | 38.92 |
| 5K | 66.64 | 128.78 | 80.28 |
| 10K | 119.39 | 171.10 | 137.09 |
There is a slight increase in response time for larger collections, but maybe it's worth the tradeoff?
On the other hand, I noticed there is still an inaccuracy in this approach. It tracks all the `object_store_ids`, but it "assumes" the `create_time` of the whole HDCA is the reference time for calculating the expiration, when it should instead consider the oldest `create_time` among the datasets stored in each of those object stores.
I will try to explore and benchmark adding the oldest creation_time to each object_store_id and see what we get...
The new approach for collections in https://github.com/galaxyproject/galaxy/pull/20332/commits/1b8acc2965116fd4210b7d8617909794a2bcbeb5 is more accurate as it takes into account the "oldest create_time" of the datasets associated with each object store used in the collection.
Of course, it is slightly slower too, but again, it may be worth the extra time.
| Collection | MIN (ms) | MAX (ms) | AVG (ms) |
|---|---|---|---|
| 1K | 28.97 | 57.46 | 39.95 |
| 5K | 69.89 | 132.95 | 86.03 |
| 10K | 125.86 | 170.74 | 144.08 |
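To make the idea concrete, the per-object-store summary could reduce each store to the oldest `create_time` of its datasets, since the earliest dataset is the first to expire. A sketch with hypothetical names (not the actual query, which would do this aggregation in SQL):

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple


def store_times_summary(
    datasets: Iterable[Tuple[str, datetime]]
) -> Dict[str, datetime]:
    """Map each object_store_id to the oldest create_time among its datasets."""
    oldest: Dict[str, datetime] = {}
    for os_id, create_time in datasets:
        if os_id not in oldest or create_time < oldest[os_id]:
            oldest[os_id] = create_time
    return oldest
```

The expiration for each store would then be computed from its oldest `create_time` plus that store's `object_expires_after_days`.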
Average Response Time Comparison (ms)

| Collection | Without object_store_ids | With object_store_ids | With store_times_summary |
|---|---|---|---|
| 1K | 44.77 | 38.92 | 39.95 |
| 5K | 74.90 | 80.28 | 86.03 |
| 10K | 125.92 | 137.09 | 144.08 |
This is finally ready for review. Includes some selenium tests to verify the basic behavior.
Selenium test failure unrelated.
```
FAILED lib/galaxy_test/selenium/test_data_source_tools.py::TestDataSource::test_ucsc_table_direct1_data_source - selenium.common.exceptions.TimeoutException: Message: Timeout waiting on CSS selector [#org] to become present.
```
Thank you for the review!