Store test datasets in this repo

Open TomNicholas opened this issue 1 year ago • 3 comments

Our current approach to testing involves a bunch of fixtures, each of which downloads a tutorial dataset from xarray (and caches it, because it uses pooch), saves it to a temporary directory, then opens that dataset from disk. This is not ideal for a few reasons:

  1. The datasets aren't minimal, so they contain more complexity than is really needed to test a single bug / feature. This can make debugging more complicated.
  2. We're using the network when we don't need to be.
  3. vz.open_virtual_dataset calls xr.open_dataset, but because of our test setup, xr.open_dataset can potentially be called more than once in the same test invocation, even if the code we are testing only calls it once. Again, this can make debugging more confusing than it needs to be.

We do need to test our ability to read files from disk, but it might be better just to make some really tiny netCDF files and save them in this repo.

EDIT: Xarray actually does this and no-one seems to complain because the files are only ~1kB in size, which is smaller than the text files containing the actual code.
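
A tiny file of the kind suggested above could be generated once with a short script and then committed. The following is only a minimal sketch: the function name `write_tiny_netcdf`, the variable name `air`, and the dimension names are hypothetical, and it assumes xarray, numpy, and at least one netCDF backend are installed.

```python
from pathlib import Path

import numpy as np
import xarray as xr


def write_tiny_netcdf(path: Path) -> xr.Dataset:
    """Write a very small, deterministic netCDF file and return the dataset.

    Hypothetical helper for illustration; names and shapes are assumptions.
    """
    ds = xr.Dataset(
        data_vars={
            # One 2x3 float32 variable is enough to exercise a read path.
            "air": (("y", "x"), np.arange(6, dtype="float32").reshape(2, 3)),
        },
        coords={
            "y": np.array([0.0, 1.0]),
            "x": np.array([10.0, 20.0, 30.0]),
        },
    )
    # xarray picks whichever installed netCDF backend it finds first.
    ds.to_netcdf(path)
    return ds
```

A test would then open the committed file directly, with no download step and no copy to a temporary directory.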

TomNicholas avatar Aug 23 '24 21:08 TomNicholas

Note that the way we have been doing this so far is good in that we haven't committed any large files to git, so we don't have to do any cleaning of the git history (which is a PITA).

TomNicholas avatar Dec 30 '24 20:12 TomNicholas

Just a note that @TomNicholas suggested in https://github.com/zarr-developers/VirtualiZarr/issues/365 that, as part of this issue, we store a smaller alternative to the NISAR file used in the failing test virtualizarr/tests/test_backend.py::TestReadFromURL::test_virtualizarr_vs_local_nisar.

On this and https://github.com/zarr-developers/VirtualiZarr/pull/235, should we consider moving the tests outside the source code directory or explicitly excluding the data files from the release manifest? Some people care a lot about having small release sizes, for example when using Lambda, though I'm not sure whether this applies to both the sdist and wheels or just to wheels.

maxrjones avatar Dec 30 '24 21:12 maxrjones

> On this and https://github.com/zarr-developers/VirtualiZarr/pull/235, should we consider moving the tests outside the source code directory or explicitly excluding the data files from the release manifest?

I don't think there is any need, is there? Currently we don't ship large files with the release, and if we switch to using very small test files (~kB) then we still won't be shipping large files with the release. As long as we actually make sure the files are that small, we don't need to separate them out.

OTOH if this is a commonly done thing then sure let's split them apart.
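
The "make sure the files are that small" part could be enforced by a test rather than by convention. This is a rough sketch using only the standard library; the 10 kB limit and the helper name `oversized_files` are assumptions, not anything the project has agreed on.

```python
from pathlib import Path

# Hypothetical limit; the thread only says test files should be "~kB".
MAX_TEST_FILE_BYTES = 10_000


def oversized_files(data_dir: Path, limit: int = MAX_TEST_FILE_BYTES) -> list[Path]:
    """Return every file under data_dir that is larger than limit bytes."""
    return sorted(
        p for p in data_dir.rglob("*") if p.is_file() and p.stat().st_size > limit
    )
```

A test could assert that `oversized_files(...)` returns an empty list for whichever directory ends up holding the committed data files, so an accidentally large file fails CI before it reaches a release.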

TomNicholas avatar Dec 31 '24 01:12 TomNicholas