User-based ObjectStore
(1) Initially started at https://github.com/galaxyproject/galaxy/pull/4314; (2) all the commits of this PR were squashed into a single commit on Dec 11, 2019; the history of the changes is preserved via this branch.
Introduction
This PR extends Galaxy's ObjectStore to enable users to bring their own resources: users can plug in a storage medium (e.g., an Amazon S3 bucket) on which Galaxy will persist their datasets.
Motivations
- unlimited storage: users on Galaxy instances with limited storage resources (e.g., a storage quota) can have virtually unlimited storage by plugging their own (cloud-based) storage into Galaxy;
- data sharing: having datasets generated by Galaxy stored on a user's cloud-based storage makes it easier to share analysis results with collaborators;
- flexible persistence location: members of different labs using a common Galaxy instance hosted at their institute can have their data stored on their lab’s network attached storage (NAS).
Highlights
- For users without a plugged storage media, Galaxy will continue to use an instance-wide configuration for their data storage needs;
- A user's storage media (e.g., an S3 bucket) will be used for their data storage needs only, and will not be accessed for other users' storage needs;
- A storage media can be a local path, or an Amazon S3 bucket;
- Users can plug multiple media (e.g., two different local paths and three Amazon S3 buckets), assign an `order` and `quota` attribute to each, and Galaxy will use them based on the given `order`, falling back from one to another when a quota limit is reached;
- Leveraging the `order` attribute of storage media, users can use both instance-wide storage and their own media. For instance, they can direct Galaxy to use the instance-wide storage until their quota limit is reached (e.g., 250GB on Galaxy Main), then use their own media for the rest of their data storage needs;
- Storage media are defined leveraging Galaxy's cloud authorization model, hence Galaxy does not ask for users' credentials.
- This PR implements all the necessary models, managers, functions, and APIs; and there will be a separate PR for UI;
- The functionality leverages the Hierarchical ObjectStore; hence, it is functional only if the Hierarchical ObjectStore is configured. However, the hierarchy is applied instance-wide only, and does not affect a user's plugged media configuration;
- Each storage media has its own staging path (mainly used for the S3 backend), independent from the instance-wide ObjectStore and other storage media; an admin can define a default staging path.
What's next?
We're aiming to keep this PR "minimally functional"; hence, features such as ability to mount a cloud-based storage and user interfaces will be implemented in subsequent PRs.
How to use
- Configure the ObjectStore to use the hierarchical backend (e.g., in config/object_store_conf.xml):
<?xml version="1.0"?>
<object_store type="hierarchical">
<backends>
<object_store type="distributed" id="primary" order="0">
<backends>
<backend id="files1" type="disk" weight="1">
<files_dir path="database/files1"/>
<extra_dir type="temp" path="database/tmp1"/>
<extra_dir type="job_work" path="database/job_working_directory1"/>
</backend>
<backend id="files2" type="disk" weight="1">
<files_dir path="database/files2"/>
<extra_dir type="temp" path="database/tmp2"/>
<extra_dir type="job_work" path="database/job_working_directory2"/>
</backend>
</backends>
</object_store>
<object_store type="disk" id="secondary" order="1">
<files_dir path="database/files3"/>
<extra_dir type="temp" path="database/tmp3"/>
<extra_dir type="job_work" path="database/job_working_directory3"/>
</object_store>
</backends>
</object_store>
- Log in and get your API key;
- POST a payload like the following to /api/storage_media (you may use Postman to send API requests; a curl sketch is shown after this list):
{
    "category": "local",
    "path": "A_PATH_ON_LOCAL_DISK",
    "order": "1",
    "quota": "1000.0",
    "usage": "0.0"
}
- Then any dataset you create will be stored under A_PATH_ON_LOCAL_DISK; e.g.:
.
└── d
    └── b
        └── 1
            └── dataset_db1b29ae-524a-46c1-af8d-e3e9e6861a4e.dat
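For reference, a minimal curl sketch of the POST request above (the local URL, port, and the key query parameter are assumptions about a default local setup and your own API key):

```sh
curl -H "Content-Type: application/json" \
     -d '{"category": "local", "path": "A_PATH_ON_LOCAL_DISK", "order": "1", "quota": "1000.0", "usage": "0.0"}' \
     'http://localhost:8080/api/storage_media?key=YOUR_API_KEY'
```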
I started a branch with fixes here: https://github.com/jgoecks/galaxy/tree/UserBasedObjectStore2
Specifically, there are fixes for anonymous access. I can't seem to find your fork to initiate a pull request however—perhaps because your repo is restricted somehow and/or is so far behind the main repo?
Thanks for the updates @jgoecks. Please see if you can make a PR against this branch; if not, I can update this branch. Besides, I guess we could avoid your last commit.
@VJalili I still cannot find your fork to make a PR against. I'll try to look into this more soon.
@jgoecks I applied the changes you made on your branch on this branch.
It will be nice to have user-based storage. @VJalili I wonder whether you use SQL tables to manage users and their corresponding storage. I haven't looked deeply into this project yet, but my first feeling is to build a table on top of the current storage management system.
@VJalili I opened a PR that I think will fix tests for this PR. It's actually, I think, an issue we have always had and it was just never surfaced until this PR.
https://github.com/VJalili/galaxy/pull/6
@dannon Thank you! I guess that has fixed it as all tests passed locally.
@dannon I think the patch works fine for integration tests, but it breaks CI unit tests.
@VJalili Ahh, sure enough. I was laser focused on that one issue, let me dig deeper since there's more to the picture here.
Yeah, the error here has popped up again:
Parent instance <HistoryDatasetAssociation at 0x7fe8e45b0250> is not bound to a Session; lazy load operation of attribute 'history' cannot proceed
I'll try to figure out how we're getting an hda handle that's no longer bound.
The orphan HDA handle is the issue causing the integration test's failure; I guess that is happening when Galaxy is writing metadata to a file.
Thanks for refactoring the concept of ownership out of the dataset instance level (HDA/LDDA) and for the integration tests. These are serious improvements I believe.
Can you add an integration test of copying data on storage media between users? Based on my reading, I assume that if a user copies my data and then I delete the storage media, the data will disappear for that user as well; I want that verified and stated explicitly with a test case. Is that fair?
@jmchilton given the challenges that using this feature for shared data may introduce (e.g., authorization issues), we previously decided to postpone the ability to use this feature for shared data. Do you think we should add some warnings for users who attempt to use this feature for shared data?
Do you think we should add some warnings for users who attempt to use this feature for shared data?
Yes, ideally. I'm not sure yet if that should be required for this PR but that is a good idea in general if we're going to impose that restriction.
@jmchilton I disabled sharing for user storage media (a history that contains a dataset stored on a user-owned storage, cannot be shared); please see a272454. Any other thoughts?
This needs to be merged with dev so we can see the integration tests pass before it is merged. I'm not comfortable with merging it for 19.09 though - these abstractions still seem intrusive and off and I'm worried about performance hits in many critical paths this code touches.
I don't think I mentioned it in the inline comments, but I feel like the quota pieces shouldn't be disparate - the if logic, every place this code touches quota, should be moved into an existing or a new abstraction.
My reading of the code is that it is at least still two big, largely independent chunks of work - one answering the question of how we inject user credentials into the objectstore layer when needed, and one answering the question of how we switch between objectstores when some condition is active and manage quota in multiple places. I'd really still like to see this decomposed further into those two pieces. Again - that may not be possible or practical - but I think it would crystallize the review process and make it easier to provide concrete directions on working toward an implementation that... at least I... would feel comfortable merging?
How do you envision separating choosing/assigning a media from credentials?
Additionally, I guess one of your main concerns is the performance penalty. Would it be possible to share your current performance evaluation results and methods, and set a threshold for performance penalty that we aim for?
One can imagine many scenarios where one would want to dispatch between different object stores that have nothing to do with the cloud object store.
- Two object stores are available with different quotas - user opts to write new datasets to faster, less backed up disk to get more quota/faster performance for a short period of time.
- This same two object store scenario but that second one is used if the main quota/object store is full without user intervention (like this PR).
- This same two object store scenario, but that second one is used if the main quota/object store is full and the user is on some access list.
I feel like the fact that we'd like Galaxy to work in these different modalities means we need to provide some sort of extension point or configuration point for how quota works and interacts with object stores.
I don't really feel anything in this PR is establishing abstractions that make this easier. I feel like you're modifying quota in different ways in different places and making it harder to do this. Alternatively - if we had a way to handle the above scenarios presumably this PR would glide right into that pretty naturally.
I'm sorry you feel like separating the quota pieces out would take months - I would suggest rebasing the whole PR into a single commit and then using `git reset -p HEAD~` to pull just the quota-related pieces out. I'd recommend this video - https://johnkary.net/blog/git-add-p-the-most-powerful-git-feature-youre-not-using-yet/ - I use this technique daily and it is really great at breaking big PRs into smaller ones. I've broken dozens of atomic PRs out of my remote data staging branch this way, and hundreds from my CWL branch.
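For context, one plausible command sequence for that technique (the branch name is made up; this is only a sketch of the approach, not part of the original suggestion):

```sh
# On the feature branch, already squashed into a single commit on top of dev.
git checkout -b split-out-quota

# Interactively pick the quota-related hunks; chosen hunks are reverted in the
# index only, while the working tree keeps every change.
git reset -p HEAD~

# Amend the squashed commit so it no longer contains the selected hunks.
git commit --amend --no-edit

# The quota changes now remain as unstaged edits in the working tree and can
# be committed on their own.
git commit -am "Extract the quota-related pieces"
```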
I would suggest rebasing the whole PR into a single commit
@jmchilton done. I will get to other points later.
Thanks for the rebase! Pushing the milestone but hopefully we can make some substantial progress this next release cycle.
My reading of the code is that it is at least still two big, largely independent chunks of work - one answering the question of how do we inject user credentials into the objectstore layer when needed and ...
@jmchilton as you asked I removed any code related to credentials and AWS S3 support (the media that requires authZ credentials), see https://github.com/VJalili/galaxy/commit/eda73341ef25f7e1872ce6b6d40dfd0cbfc09a0d.
... one answering a question of how do we switch between objectstores when some condition is active and manage quota in multiple places.
@jmchilton following your suggestion I removed all the quota-related logic; see https://github.com/VJalili/galaxy/commit/be3b31fe5111794320e756e83fa35c1717388fa6
I guess these changes are in alignment with the three points you raised (i.e., remove quota, remove authnz, add an option to disable this feature for performance concerns). If these address the aforementioned points, I can next work on other comments and failing tests. Any suggestions?
ping @jmchilton
@jmchilton What is your path forward on this?
Here is what I did:
- Enable the setting in galaxy.yml: enable_user_based_object_store: true
- cp config/object_store_conf.xml.sample config/object_store_conf.xml
- Start Galaxy.
- Post via curl: curl -H "Content-Type: application/json" -d '{"category": "local", "path": "/Volumes/shared/dan/galaxy-object-store-test/dan1", "order": "1", "quota": "1000.0", "usage": "0.0"}' 'http://localhost:8080/api/storage_media?key=apiKeyHere'
Some initial comments:
Lack of UI and clear documentation makes evaluation somewhat difficult, but I understand why they are not present.
Not sure why enable_user_based_object_store: true needs to exist; can't this just come implicitly from the store being defined and enabled in the object_store_conf.xml?
It is not clear if I am able to (and if so, how to) define and enable some rulesets for this object store. It looks like I can just turn it on and off. If it is on, anybody can try to create directories and files in any place that they want -- success/failure is based upon the permissions of the Galaxy process user. There has to be a configurable way to limit/allow what can be done, and by whom.
The first thing I might suggest is to consider how this all is being enabled/defined. My first inclination is to remove the new settings from galaxy.yml and extend object_store_conf.xml. Something perhaps like:
<object_store type="local_path">
<filter type="allow_path"/>someThing</filter>
<filter type="restrict_path"/>someThing</filter>
<filter type="allow_users"/>someThing</filter>
<config name="storage_media_jobs_directory" />database/job_working_directory_storage_media</>
<config name="storage_media_cache_path" />database/storage_media_cache</>
<config name="storage_media_cache_size" />100</>
...other config things here ...
</object_store>
It is probably worthwhile to spend more time thinking about naming and structure than what I just made up here.
I would then also consider what the config would look like for enabling/disabling/configuring the user-based Cloud storage. Then keep both of these use-cases in mind when reworking the code behavior. It could also be helpful to see the implementation of the cloud-based user store -- to demonstrate the abstractness and isolation of the system.
I’ve also added some comments inline with the code.
Please let me know if I can expand on anything. Can do a video call as well.
@VJalili: the reason the 2 unit tests fail is that you need to add these 2 lines to the _expected_paths dictionary in this method in the test file test/unit/config/test_config_values.py: https://github.com/galaxyproject/galaxy/blob/dev/test/unit/config/test_config_values.py#L54
'default_storage_media_jobs_directory': self._in_data_dir('job_working_directory_storage_media'),
'default_storage_media_cache_path': self._in_data_dir('storage_media_cache'),
(ideally, please place them so they appear in alphabetical order)
This comment explains why this is needed: https://github.com/galaxyproject/galaxy/blob/dev/test/unit/config/test_config_values.py#L108
Essentially, this tells the unit test what values should be expected for these config properties. As you can see, _in_data_dir is a helper that places the value inside the data_dir (whatever temp dir it is in the current testing environment), and the value itself is what's specified in the schema as the default.
Sorry. This all should be documented better than a comment in the test file.
@ic4f Thanks! It will take me a while to learn the new config structure, but I like it more than what we had. Thank you.
I'll ping you when I have some documentation up; my hope is the config will just get out of the way and we won't have to think about it much, if at all, while implementing our (more interesting/important) stuff.
@jmchilton I implemented the UserObjectStore using the abstraction introduced in https://github.com/galaxyproject/galaxy/pull/9554. The user objectstore has a much smaller footprint now, and it is invoked much more intuitively; thank you!
One sticking point at a time - I think User._calculate_or_set_disk_usage is going to be a problem here, right? As soon as the user's disk usage is recalculated, all the dataset usage for all the attached disks is going to be added to the user's quota, even though you very carefully prevented it from being initially added.
I'm trying to work on this in the context of creating like scratch storage object stores - I think what we need is more abstractions around quota calculation that ties them closer to object stores and is extensible for applications like this. I'll see if I can come up with something.