How to keep track of filenames for field source?

Open ThibHlln opened this issue 2 years ago • 2 comments

Hi Sadie, Hi David, 🙂

In unifhy we have been using cf.Field.get_filenames for a while to track down the source files of the user input fields so that they can be stored in a configuration file for potential later reuse. I have only recently been faced with the scenario where "If all of the data are in memory then an empty set is returned", meaning that get_filenames no longer returns the information we are looking for (https://github.com/unifhy-org/unifhy/issues/80).

So, I am wondering:

  • is there another attribute/property/method of cf.Field that always keeps the filenames of a field regardless of whether its data fits in memory?
  • if not, would it make sense for cf-python to provide such functionality? e.g. not to drop the filenames even if the data fits in memory (I am guessing it wouldn't, otherwise you would already have implemented it 🙂)

Thank you in advance for your help, Thibault

ThibHlln avatar Mar 28 '22 14:03 ThibHlln

Hi Thibault - good to hear from you. As you've guessed, it's complicated!

The short answer to your particular problem is perhaps to manually save the file names straight after the read step:

>>> import cf
>>> cf.write(cf.example_field(0), '~/delme.nc')
>>> f = cf.read('~/delme.nc')[0]
>>> f._custom['saved_filenames'] = f.get_filenames()
>>> f._custom['saved_filenames']
{'/home/david/delme.nc'}

Doing it this way, by adding it to the _custom dictionary as opposed to just setting the non-reserved attribute f.saved_filenames = ..., ensures that they'll get copied if you do g = f.copy().
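A minimal sketch of that manual step wrapped into two helpers. The names remember_filenames and source_filenames and the key 'saved_filenames' are hypothetical, not part of cf-python; the code only assumes the field has get_filenames() and the private _custom dictionary used above.

```python
SAVED_KEY = "saved_filenames"  # hypothetical key, not reserved by cf-python


def remember_filenames(field, key=SAVED_KEY):
    """Store the field's current file names in its _custom dictionary.

    Intended to be called straight after cf.read, while get_filenames()
    still reports the source files. Returns the field for convenience.
    """
    field._custom[key] = set(field.get_filenames())
    return field


def source_filenames(field, key=SAVED_KEY):
    """Return the live file names if there are any, else the remembered ones."""
    live = set(field.get_filenames())
    if live:
        return live
    # Fall back to whatever was saved earlier (empty set if nothing was).
    return set(field._custom.get(key, ()))
```

Usage would be along the lines of f = remember_filenames(cf.read('~/delme.nc')[0]), followed later by source_filenames(f). Note that the remembered names are only a snapshot, so they say nothing about whether the data have been modified since the read.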

I can't think of a reason why we couldn't formalise this to, say:

>>> f.get_filenames(save=True)
>>> f.saved_filenames  # now a reserved attribute
'/home/david/delme.nc'

This method with save=True would save the output of get_filenames (which could be an empty set) regardless of whether or not names had previously been saved. Would that be useful?

So what are the complications? As usual, it's ambiguities and corner cases. If array values have been entirely overwritten (f += 1), then the presence of saved filenames could be misleading to some people, but not others. Similarly if only some of the contributing files have been made "redundant". Note also that the file names include those of files that contain coordinates and other metadata - these will usually, but not always, be in the same files as the data ....

A final note, which might make all this moot (at least for you!), is that soon we'll be releasing the first dask version of cf-python. Because the dask data stores up operations lazily, the original filenames will still be present and returnable by get_filenames, even if you did f += 1. Of course, you can still lose this information by forcing the operations to be computed internally (cf. dask.array.Array.persist()), but it could open up more possibilities.

Anyway, let us know if you'd like f.get_filenames(save=True) implemented, and we'll get right on it - it will be a very quick implementation.

All the best, David

davidhassell avatar Mar 30 '22 15:03 davidhassell

Hi David,

Thank you for your detailed reply, as always. 🙂

I think the manual option you suggest is perfectly acceptable. I agree with you that "saved filenames" could be misleading to some if the field has been altered in such a way that it is no longer the same as the one in the file. So it is probably best not to implement it, although it is not up to me to decide!
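For the configuration-file use case mentioned at the start of the thread, a minimal sketch of storing and reloading the saved names. The JSON format, the 'source_files' key and both function names are assumptions made for illustration, not how unifhy actually writes its configuration.

```python
import json
from pathlib import Path


def write_sources(config_path, filenames):
    """Write file names to a JSON file.

    Sets are not JSON-serialisable, so the names are stored as a
    sorted list, which also keeps the file stable between runs.
    """
    payload = {"source_files": sorted(filenames)}
    Path(config_path).write_text(json.dumps(payload, indent=2))


def read_sources(config_path):
    """Read the file names back from the JSON file as a set."""
    payload = json.loads(Path(config_path).read_text())
    return set(payload["source_files"])
```

With the manual option, the first call would look something like write_sources('config.json', f._custom['saved_filenames']).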

Thank you for your help! Take care, Thibault

ThibHlln avatar Mar 31 '22 13:03 ThibHlln

Closing now, since at 3.14.0 we will have both original filenames (#448) and "live" filenames (#498).

davidhassell avatar Nov 15 '22 10:11 davidhassell