cf-python
How to keep track of filenames for field source?
Hi Sadie, Hi David, 🙂
In unifhy we have been using cf.Field.get_filenames for a while to track down the source files of the user input fields so that they can be stored in a configuration file for potential later reuse. I have only recently been faced with the scenario where "If all of the data are in memory then an empty set is returned", meaning that get_filenames no longer returns the information we are looking for (https://github.com/unifhy-org/unifhy/issues/80).
So, I am wondering:
- is there another attribute/property/method of cf.Field that always keeps the filenames of a field regardless of whether its data fits in memory?
- if not, would it make sense for cf-python to provide such functionality? e.g. not to drop the filenames even if the data fits in memory (I am guessing it wouldn't, otherwise you would already have implemented it 🙂)
Thank you in advance for your help, Thibault
Hi Thibault - good to hear from you. As you've guessed, it's complicated!
The short answer to your particular problem is perhaps to manually save the file names straight after the read step:
>>> import cf
>>> cf.write(cf.example_field(0), '~/delme.nc')
>>> f = cf.read('~/delme.nc')[0]
>>> f._custom['saved_filenames'] = f.get_filenames()
>>> f._custom['saved_filenames']
{'/home/david/delme.nc'}
Doing it this way, by adding it to the _custom dictionary as opposed to just setting the non-reserved attribute f.saved_filenames = ..., ensures that they'll get copied if you do g = f.copy().
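If the aim is only to write the names into a configuration file, the same snapshot could also be kept outside the field altogether. The following is a minimal sketch under stated assumptions, not part of cf-python: StubField is a hypothetical stand-in that only mimics get_filenames() so the example runs without cf-python installed, and snapshot_filenames and load_into_memory are illustrative names.

```python
import json


class StubField:
    """Hypothetical stand-in for cf.Field so the sketch runs without cf-python."""

    def __init__(self, filenames):
        self._filenames = set(filenames)

    def get_filenames(self):
        return set(self._filenames)

    def load_into_memory(self):
        # Mimics the behaviour quoted above: once all of the data are in
        # memory, get_filenames() returns an empty set.
        self._filenames = set()


def snapshot_filenames(fields):
    """Return {name: sorted list of filenames} as reported right now."""
    return {name: sorted(field.get_filenames()) for name, field in fields.items()}


# Capture the names straight after the (simulated) read step ...
fields = {"rainfall": StubField(["/data/rainfall.nc"])}
saved = snapshot_filenames(fields)

# ... so that later changes to the field do not lose them.
fields["rainfall"].load_into_memory()

# The snapshot is plain data, so it can go directly into a config file.
config_text = json.dumps({"sources": saved}, indent=2)
```

With a real field, any object exposing get_filenames() could be passed in place of StubField. The trade-off is that the snapshot does not travel with f.copy(), which is what the _custom approach provides.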
I can't think of a reason why we couldn't formalise this to, say:
>>> f.get_filenames(save=True)
>>> f.saved_filenames # now a reserved attribute
'/home/david/delme.nc'
This method with save=True would save the output of get_filenames (which could be an empty set) regardless of whether or not names had previously been saved. Would that be useful?
So what are the complications? As usual it's ambiguities and corner cases. If array values have been entirely overwritten (f += 1), then the presence of saved filenames could be misleading to some people, but not others. Similarly if only some of the contributing files have been made "redundant". Note also that the filenames include the names of files which contain coordinates and other metadata - these will usually, but not always, be the same files as the data.
A final note, which might make all this moot (at least for you!), is that soon we'll be releasing the first dask version of cf-python. Because the dask data stores up operations lazily, the original filenames will still be present and returnable by get_filenames, even if you did f += 1. Of course, you can still lose this information by forcing the operations to be computed internally (cf. dask.array.Array.persist()), but it could open up more possibilities.
Anyway, let us know if you'd like f.get_filenames(save=True) implemented, and we'll get right on it - it will be a very quick implementation.
All the best, David
Hi David,
Thank you for your detailed reply, as always. 🙂
I think the manual option you suggest is perfectly acceptable. I agree with you that "saved filenames" could be misleading to some if the field has been altered in such a way that it is no longer the same as the one in the file. So it is probably best not to implement it, although it is not up to me to decide!
Thank you for your help! Take care, Thibault
Closing now, since at 3.14.0 we will have both original filenames (#448) and "live" filenames (#498).