Add pagination support to Files Source plugins
Currently, when listing the contents of a remote file source, we retrieve all files and directories inside, then perform client-side pagination and search. While this approach is enough for small sources with a limited number of files, larger repositories can contain a vast number of files, making navigation within these sources very slow and difficult.
This PR addresses the issue by passing `limit` and `offset` parameters to paginate the contents. However, this approach comes with its own drawbacks. Specifically, sorting, filtering, and searching must now be handled carefully server-side, which may prove challenging depending on the plugin implementation. Also, pagination only applies to a "non-recursive" view of the file source.
The pull request includes the following changes:
- A new serializable property for each plugin called `supports_pagination`. This property determines whether the plugin can paginate the contents server-side so the client can delegate pagination to it.
- Implementation of pagination for every PyFilesystem2-based and Invenio-based plugin.
- A new plugin named `TempFileSource`, which simplifies testing of PyFilesystem2-based plugins. This plugin uses the built-in OS filesystem. I wonder if we could even replace the `posix` plugin with this... unless there are specific functionalities unique to the `posix` plugin :thinking:
- A fix for an issue in PyFilesystem2 plugins when listing contents recursively, which was detected during testing with the new `TempFileSource` plugin.
- A new serializable property for each plugin called `supports_search`. This property determines whether the plugin can filter results by a search query. The syntax of the query is up to the plugin implementation, but a basic "search by name" should be supported as a minimum.
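As a rough illustration of how `supports_pagination`/`supports_search` and the `limit`/`offset` parameters fit together, here is a minimal sketch. The class and method names below are made up for illustration and are not the actual Galaxy plugin interface:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RemoteFile:
    name: str
    path: str


class BaseFileSource:
    # Serializable capability flags described in this PR. When a plugin
    # sets these to True, the client can delegate pagination/search to it.
    supports_pagination: bool = False
    supports_search: bool = False


class InMemoryFileSource(BaseFileSource):
    """Toy plugin that paginates and filters server-side."""

    supports_pagination = True
    supports_search = True

    def __init__(self, entries: List[RemoteFile]):
        self._entries = entries

    def list(
        self,
        path: str = "/",
        limit: Optional[int] = None,
        offset: Optional[int] = None,
        query: Optional[str] = None,
    ) -> Tuple[List[RemoteFile], int]:
        # Basic "search by name", the minimum a plugin should support.
        matches = [e for e in self._entries if not query or query in e.name]
        # Total count before slicing, so the client can render a pager.
        total = len(matches)
        start = offset or 0
        end = start + limit if limit is not None else None
        return matches[start:end], total


source = InMemoryFileSource([RemoteFile(f"file{i}.txt", f"/file{i}.txt") for i in range(10)])
page, total = source.list(limit=3, offset=3)
print([f.name for f in page], total)  # → ['file3.txt', 'file4.txt', 'file5.txt'] 10
```

The key point is that the plugin returns both the requested slice and the total match count, since the client needs the latter to build page controls.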
TODO
- [x] Support server-side pagination.
- [x] Support server-side filtering/search.
- [ ] Support server-side sorting.
- [ ] Update UI to use server-side pagination
How to test the changes?
- [x] I've included appropriate automated tests.
- [ ] This is a refactoring of components with existing test coverage.
- [ ] Instructions for manual testing are as follows:
- [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]
License
- [x] I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.
Looking great so far!
I guess this test failure:
```
tests/files/test_gcsfs.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/files/_util.py:193: in assert_simple_file_realize
    res, count = file_source.list("/", recursive=recursive, user_context=user_context)
galaxy/files/sources/__init__.py:441: in list
    return self._list(path, recursive, user_context, opts, limit, offset, query)
galaxy/files/sources/_pyfilesystem2.py:76: in _list
    count = self._get_total_matches_count(h, path, filter)
galaxy/files/sources/_pyfilesystem2.py:93: in _get_total_matches_count
    directories_count = fs.glob(directory_glob_pattern).count()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Globber(GCSFS('genomics-public-data', root_path=''), '//*/')

    def count(self):
        # type: () -> Counts
        """Count files / directories / data in matched paths.

        Example:
            >>> my_fs.glob('**/*.py').count()
            Counts(files=2, directories=0, data=55)

        Returns:
            `~Counts`: A named tuple containing results.
        """
        directories = 0
        files = 0
        data = 0
        for _path, info in self._make_iter(namespaces=["details"]):
            if info.is_dir:
                directories += 1
            else:
                files += 1
>               data += info.size
E               TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
```
is caused either by an upstream implementation error or by this known limitation of GCSFS: https://fs-gcsfs.readthedocs.io/en/stable/#limitations
As a workaround, I will either catch the error and default to 0, since directories are not a "real" thing in Google Cloud Storage, or try calling fix_storage() as stated in the documentation, though that might make navigating this file source slower :thinking:
Update
Using fix_storage() didn't solve the issue and created more problems, so I'll try the other workaround, even if pagination might not work correctly with the Google Cloud Storage plugin. In the worst case, we can set `supports_pagination` to `false` for this plugin and fall back to the existing behavior.
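The catch-and-default workaround could look roughly like this. This is only a sketch: `safe_directory_count` is a hypothetical helper, and the exact fallback is an assumption based on the traceback above:

```python
def safe_directory_count(fs, directory_glob_pattern):
    """Count directories matching a glob on a PyFilesystem2 filesystem.

    Globber.count() sums info.size over every match, which raises
    TypeError on GCSFS because directory entries there report size=None
    (a documented limitation of fs-gcsfs).
    """
    try:
        return fs.glob(directory_glob_pattern).count().directories
    except TypeError:
        # Known GCSFS limitation: count the matches ourselves instead of
        # relying on count(). The pattern ends in "/", so every match
        # returned by glob() is already a directory.
        return sum(1 for _ in fs.glob(directory_glob_pattern))
```

This avoids the size accumulation entirely on backends where it breaks, at the cost of iterating the matches a second time.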
This should be finally ready for review. The remaining converter and integration test failures seem unrelated.
Suspect Issues
This pull request was deployed and Sentry observed the following issues:
- ‼️ InternalServerError: Problem listing file source path gxfiles://noaa-ufs-prototype5-pds/
  `/api/remote_files`
Sentry, you are so cool! :laughing:
This seems like a user configuration error:
FileNotFoundError: The specified bucket does not exist
It probably should be a 400 instead of a 500 though. I wonder how we could discriminate better between those here :thinking:
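One way to discriminate would be to translate the backend's `FileNotFoundError` into a client error at the API boundary. A sketch, where `ClientConfigError` is a hypothetical exception standing in for whatever 400-level error type Galaxy's API layer actually uses:

```python
class ClientConfigError(Exception):
    """Hypothetical 400-style error; the real code would use Galaxy's own exception types."""

    status_code = 400


def list_remote_path(file_source, path):
    """List a remote path, mapping 'not found' backend errors to a 4xx."""
    try:
        return file_source.list(path)
    except FileNotFoundError as exc:
        # A nonexistent bucket points at a bad URI or plugin configuration,
        # not a server fault, so surface it as a client error instead of a 500.
        raise ClientConfigError(f"Problem listing file source path {path}: {exc}") from exc
```

The open question remains how to tell an admin-level misconfiguration apart from a user-supplied bad URI, since both can surface as the same backend exception.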
I'm not sure ... in any case, eu isn't running 24.1 yet, so this PR is unrelated. If this is an admin config that is wrong, it seems correct to raise that error; if this is a user config, then I agree we don't need to make this a 500 ...
Ahh! You're right! I didn't notice it was on EU. I thought it was on test with someone testing the new user-defined file sources.
Then this does indeed look like a typo, or maybe the bucket was renamed. https://registry.opendata.aws/noaa-ufs-s2s/