Unable to load ORC table using `read_orc`
**What happened**:

I am trying to use `read_orc` to read an ORC table located in a Kerberised HDFS. I can use `read_parquet` successfully on the same cluster, but `read_orc` fails with the following `AttributeError`:
```
/projects/gds/chavesrl/condapv/envs/N2D/lib/python3.7/site-packages/dask/dataframe/io/orc.py in read_orc(path, columns, storage_options)
     74                     raise ValueError("Incompatible schemas while parsing ORC files")
     75                 nstripes_per_file.append(o.nstripes)
---> 76     schema = _get_pyarrow_dtypes(schema, categories=None)
     77     if columns is not None:
     78         ex = set(columns) - set(schema)

/projects/gds/chavesrl/condapv/envs/N2D/lib/python3.7/site-packages/dask/dataframe/io/utils.py in _get_pyarrow_dtypes(schema, categories)
      9
     10     # Check for pandas metadata
---> 11     has_pandas_metadata = schema.metadata is not None and b"pandas" in schema.metadata
     12     if has_pandas_metadata:
     13         pandas_metadata = json.loads(schema.metadata[b"pandas"].decode("utf8"))

AttributeError: 'NoneType' object has no attribute 'metadata'
```
The error is raised after running the following code:
```python
from dask.dataframe import read_orc

read_orc(
    info['location'] + '/*',
    storage_options={
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```
where `info['location']` is of the form `hdfs://cluster/db/schema/table_name`.
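For comparison, the equivalent Parquet read on the same cluster works. This is only a minimal sketch of that call, assuming an analogous Parquet table location:

```python
from dask.dataframe import read_parquet

# Hypothetical analogous call: the same storage_options succeed for Parquet.
df = read_parquet(
    info['location'] + '/*',
    storage_options={
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```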
**What you expected to happen**:

The table to be loaded as a Dask DataFrame, in the same way that Parquet tables are read successfully.
**Anything else we need to know?**:

**Environment**:

- Dask version: 2021.04.0
- Python version: 3.7.10
- Operating System: Linux (Red Hat)
- Install method (conda, pip, source): conda
cc @dask/io
The error seems consistent with an empty path list. Can you share what is printed by the following code?
```python
from fsspec.core import get_fs_token_paths

path = <your path input for read_orc>
storage_options = <your storage_options input for read_orc>

fs, fs_token, paths = get_fs_token_paths(
    path, mode="rb", storage_options=storage_options
)
print(paths)
```
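For context, the logic around the failing line in `orc.py` looks roughly like the following. This is a paraphrased sketch reconstructed from the traceback, not the exact source: if `paths` comes back empty, the loop body never runs, `schema` is still `None` at line 76, and `_get_pyarrow_dtypes` raises exactly the `AttributeError` shown above.

```python
# Paraphrased sketch of dask/dataframe/io/orc.py (here `orc` is pyarrow.orc).
schema = None
nstripes_per_file = []
for path in paths:  # with an empty `paths`, this loop never executes
    with fs.open(path, "rb") as f:
        o = orc.ORCFile(f)
        if schema is None:
            schema = o.schema
        elif schema != o.schema:
            raise ValueError("Incompatible schemas while parsing ORC files")
        nstripes_per_file.append(o.nstripes)
# `schema` is still None here, so this raises:
# AttributeError: 'NoneType' object has no attribute 'metadata'
schema = _get_pyarrow_dtypes(schema, categories=None)
```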
When running:
```python
from fsspec.core import get_fs_token_paths

path = table_info('vca_europe.tcaf_iss_cs')['location']
storage_options = {
    'hdfs_port': 8020,
    'user': '{username}',
    'kerb_ticket': '/tmp/krb5cc_132855',
}
fs, fs_token, paths = get_fs_token_paths(
    path, mode="rb", storage_options=storage_options
)
print(paths)
```
The output is:

```
['/hive/vca_europe.db/tcaf_iss_cs']
```
Is `/hive/vca_europe.db/tcaf_iss_cs` an ORC file, or a directory? Given that the path starts with `hive/`, I suspect that this is a directory. Note that `read_orc` does not (yet) support hive-partitioned datasets, but I am hoping to address this in the near future (see #7756). If that is what you are trying to do here, you can explicitly pass in a list of files (along the lines of the sketch below), but the directory structure will not be used to populate "partition" columns.
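As a rough illustration of the list-of-files approach, something like the following might work. It is only a sketch: it assumes the data files sit under per-partition subdirectories and end in `.orc` (adjust the glob to match your layout), and the `hdfs://cluster` prefix is a stand-in for whatever scheme/host prefix your `location` actually carries.

```python
from fsspec.core import get_fs_token_paths
from dask.dataframe import read_orc

location = table_info('vca_europe.tcaf_iss_cs')['location']
storage_options = {
    'hdfs_port': 8020,
    'user': '{username}',
    'kerb_ticket': '/tmp/krb5cc_132855',
}

# Reuse the filesystem object from before to expand the directory.
fs, _, _ = get_fs_token_paths(location, mode="rb", storage_options=storage_options)

# Recursively collect the data files under the partition subdirectories
# ('**/*.orc' is an assumption about the layout).
files = fs.glob(location + '/**/*.orc')

# fs.glob returns paths without the scheme/host, so add the same prefix
# as `location` back before handing the list to read_orc.
prefix = 'hdfs://cluster'  # hypothetical; take this from your `location`
df = read_orc([prefix + f for f in files], storage_options=storage_options)
```

Again, the hive partition directory names will not become columns in the resulting DataFrame.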
It is indeed a directory! I will try to pass the list of all partitions under that directory. Also, I am assuming that passing `'/hive/vca_europe.db/tcaf_iss_cs/*'` (with `/*` appended) would not work, right?
Naively, I would expect `'/hive/vca_europe.db/tcaf_iss_cs/*'` to work. Did you try it out?
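For concreteness, that attempt would look like this (a sketch reusing the same `storage_options` as above; the `hdfs://cluster` prefix is again a stand-in for your actual table location):

```python
from dask.dataframe import read_orc

# Hypothetical: append '/*' to the table directory and let fsspec expand it.
df = read_orc(
    'hdfs://cluster/hive/vca_europe.db/tcaf_iss_cs/*',
    storage_options={
        'hdfs_port': 8020,
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```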