galaxy icon indicating copy to clipboard operation
galaxy copied to clipboard

Add pagination support to Files Source plugins

Open davelopez opened this issue 1 year ago • 1 comments

Currently, when listing contents in a remote file source, we retrieve all possible files and directories inside, then perform client-side pagination and search. While this approach is enough for small sources with a limited number of files, larger repositories could potentially contain a vast number of files, making navigation within these sources very slow and difficult.

This addresses the issue by passing limit and offset parameters to paginate the contents. However, this approach comes with its own drawbacks. Specifically, we must carefully handle sorting, filtering, and searching server-side, which may prove challenging depending on the plugin implementation. Also, pagination is only applicable in a "non-recursive" view of the file source.

The pull request includes the following changes:

  • A new serializable property for each plugin called supports_pagination. This property determines if the plugin can paginate the contents server-side so the client can delegate that.
  • Implementation of pagination for every PyFilesystem2-based and Invenio-based plugin.
  • A new plugin named TempFileSource, which simplifies testing of PyFilesystem2-based plugins. This plugin utilizes the built-in OS Filesystem. I wonder if we could even replace the posix plugin with this... unless there are specific functionalities unique to the posix plugin :thinking:
  • Fix an issue in PyFilesystem2 plugins when listing them recursively, which was detected during testing with the new TempFileSource plugin.
  • A new serializable property for each plugin called supports_search. This property determines if the plugin can filter results by a search query. The syntax of the query is up to the plugin implementation, but a basic "search by name" should be supported as a minimum.

TODO

  • [x] Support server-side pagination.
  • [x] Support server-side filtering/search.
  • [ ] Support server-side sorting.
  • [ ] Update UI to use server-side pagination

How to test the changes?

  • [x] I've included appropriate automated tests.
  • [ ] This is a refactoring of components with existing test coverage.
  • [ ] Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • [x] I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

davelopez avatar Apr 26 '24 11:04 davelopez

Looking great so far!

jmchilton avatar Apr 26 '24 18:04 jmchilton

I guess this test failure:

tests/files/test_gcsfs.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/files/_util.py:193: in assert_simple_file_realize
    res, count = file_source.list("/", recursive=recursive, user_context=user_context)
galaxy/files/sources/__init__.py:441: in list
    return self._list(path, recursive, user_context, opts, limit, offset, query)
galaxy/files/sources/_pyfilesystem2.py:76: in _list
    count = self._get_total_matches_count(h, path, filter)
galaxy/files/sources/_pyfilesystem2.py:93: in _get_total_matches_count
    directories_count = fs.glob(directory_glob_pattern).count()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Globber(GCSFS('genomics-public-data', root_path=''), '//*/')

    def count(self):
        # type: () -> Counts
        """Count files / directories / data in matched paths.
    
        Example:
            >>> my_fs.glob('**/*.py').count()
            Counts(files=2, directories=0, data=55)
    
        Returns:
            `~Counts`: A named tuple containing results.
    
        """
        directories = 0
        files = 0
        data = 0
        for _path, info in self._make_iter(namespaces=["details"]):
            if info.is_dir:
                directories += 1
            else:
                files += 1
>           data += info.size
E           TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

Is caused either by an upstream implementation error or by this known limitation of GCSFS https://fs-gcsfs.readthedocs.io/en/stable/#limitations

As a workaround, I will catch the error and default to 0 as directories are not a "real" thing in Google Cloud Storage or try to call fix_storage() as stated in the documentation but this might make navigating this file source slower :thinking:

Update

Using fix_storage() didn't solve the issue and created more issues, so I'll try the other workaround even if pagination might not work correctly with the Google Cloud Storage plugging. In the worst case, we can set supports_pagination to false for this plugin and fall back to the existing behavior.

davelopez avatar May 17 '24 08:05 davelopez

This should be finally ready for review. The remaining converter and integration test failures seem unrelated.

davelopez avatar May 17 '24 13:05 davelopez

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ InternalServerError: Problem listing file source path gxfiles://noaa-ufs-prototype5-pds/ /api/remote_files View Issue

Did you find this useful? React with a 👍 or 👎

Sentry, you are so cool! :laughing:

This seems like a user configuration error:

FileNotFoundError: The specified bucket does not exist

It probably should be a 400 instead of a 500 though. I wonder how we could discriminate better between those here :thinking:

davelopez avatar May 31 '24 08:05 davelopez

I'm not sure ... in any case eu isn't running 24.1 yet, so this PR is unrelated. If this is an admin config that is wrong it seems correct to raise that error, if this is a user than I agree we don't need to make this a 500 ...

mvdbeek avatar May 31 '24 09:05 mvdbeek

Ahh! You're right! I didn't notice it was on EU. I thought it was on test with someone testing the new user-defined file sources.

Then this does look indeed like a typo or maybe the bucket was renamed. https://registry.opendata.aws/noaa-ufs-s2s/

davelopez avatar May 31 '24 12:05 davelopez