Bug in adding entries to catalog

Open CatarinaSilva opened this issue 5 years ago • 3 comments

When taking an empty catalog:

metadata: {}
sources: {}

And running:

catalog = intake.open_catalog(catalog_uri)   # catalog_uri points to empty catalog
source = intake.open_csv(csvfile)
exported_source = source.export(target_uri)
catalog.add(exported_source)

The code runs successfully. However, if we have a non-empty catalog such as:

metadata: {}
sources:
  source1:
    args:
      meta: {}
      urlpath: s3://dummy-bucket/dummy-uri
    cache: []
    catalog_dir: s3://dummy-bucket/dummy-uri-catalogues
    description: ''
    direct_access: forbid
    driver: intake_parquet.source.ParquetSource
    getenv: true
    getshell: true
    metadata: {}
    name: source1
    parameters: {}

The previous code breaks with AttributeError: 'LocalCatalogEntry' object has no attribute '_yaml', raised at line 628 of intake/catalog/local.py. I managed to fix it by changing that line to

data['sources'][e] = list(entries[e].get()._yaml()['sources'].values())[0]

But I am not sure whether this will generalize well to all cases.

CatarinaSilva avatar Jul 29 '20 08:07 CatarinaSilva

I ended up realizing that the problem is that the entries loaded from the catalog are LocalCatalogEntry objects, while the expected input is a source object (e.g. ParquetSource). To make things consistent, my final fix for my use case was to convert each entry in the list into its source:

        entries = self._entries.copy()
        for e in entries:
            entries[e] = entries[e].get()
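
To make the mismatch concrete, here is a minimal sketch that does not import intake at all. StubSource, StubEntry and sources_section are hypothetical stand-ins invented for illustration; they only mimic the two behaviours described above (a source exposes _yaml(), an entry only exposes get()), and this is not a claim about how intake itself should resolve the problem.

```python
class StubSource:
    """Hypothetical stand-in for a data source such as ParquetSource."""

    def __init__(self, name, driver, args):
        self.name = name
        self.driver = driver
        self.args = args

    def _yaml(self):
        # Mimics the shape indexed in the one-line fix: {'sources': {name: spec}}
        spec = {"driver": self.driver, "args": dict(self.args)}
        return {"sources": {self.name: spec}}


class StubEntry:
    """Hypothetical stand-in for LocalCatalogEntry: no _yaml(), only get()."""

    def __init__(self, name, driver, args):
        self._name = name
        self._driver = driver
        self._args = args

    def get(self):
        # Materialize the entry into a source object.
        return StubSource(self._name, self._driver, self._args)


def sources_section(entries):
    """Build the 'sources' mapping from a mix of entries and sources."""
    data = {}
    for name, obj in entries.items():
        # Objects loaded from the catalog file lack _yaml(); convert them first.
        source = obj if hasattr(obj, "_yaml") else obj.get()
        data[name] = list(source._yaml()["sources"].values())[0]
    return data


# One entry that came from the existing catalog, one freshly created source.
entries = {
    "source1": StubEntry("source1", "parquet", {"urlpath": "s3://dummy-bucket/dummy-uri"}),
    "new_source": StubSource("new_source", "csv", {"urlpath": "data.csv"}),
}
result = sources_section(entries)
```

With this normalization, both kinds of object end up in the resulting mapping, whereas calling _yaml() directly on the StubEntry would raise the same kind of AttributeError as reported.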

CatarinaSilva avatar Jul 29 '20 08:07 CatarinaSilva

Thanks for the report. I find your solution might be the wrong way around: in the common case where a data source definition contains more information than the derived entry (e.g., because of user parameters), you would lose information. The source itself may have been made from an entry object (which is available as the source's _entry attribute).

The entries themselves used to produce YAML, but that seems to have been lost recently. However, most of the info is there in entry.describe().

I think the best thing to do would be to write a test case showing how add() ought to work for various scenarios, including yours above, and then we can code to make sure they all work.

martindurant avatar Jul 29 '20 13:07 martindurant

I think this is fixed on master, can you check?

martindurant avatar Aug 12 '20 14:08 martindurant