Unable to load ORC table using `read_orc`
**What happened**:

I am trying to use `read_orc` to read an ORC table located in a Kerberised HDFS. I can use `read_parquet` successfully on the same cluster, but `read_orc` fails with the following `AttributeError`:
```
/projects/gds/chavesrl/condapv/envs/N2D/lib/python3.7/site-packages/dask/dataframe/io/orc.py in read_orc(path, columns, storage_options)
     74                     raise ValueError("Incompatible schemas while parsing ORC files")
     75                 nstripes_per_file.append(o.nstripes)
---> 76     schema = _get_pyarrow_dtypes(schema, categories=None)
     77     if columns is not None:
     78         ex = set(columns) - set(schema)

/projects/gds/chavesrl/condapv/envs/N2D/lib/python3.7/site-packages/dask/dataframe/io/utils.py in _get_pyarrow_dtypes(schema, categories)
      9
     10     # Check for pandas metadata
---> 11     has_pandas_metadata = schema.metadata is not None and b"pandas" in schema.metadata
     12     if has_pandas_metadata:
     13         pandas_metadata = json.loads(schema.metadata[b"pandas"].decode("utf8"))

AttributeError: 'NoneType' object has no attribute 'metadata'
```
The error is raised after running the following code:
```python
from dask.dataframe import read_orc

read_orc(
    info['location'] + '/*',
    storage_options={
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```
where `info['location']` is of the form `hdfs://cluster/db/schema/table_name`.
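For comparison, the equivalent Parquet read on the same cluster works. This is only a minimal sketch of that call, assuming an analogous Parquet table location:

```python
from dask.dataframe import read_parquet

# Hypothetical analogous call: the same storage_options succeed for Parquet.
df = read_parquet(
    info['location'] + '/*',
    storage_options={
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```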
**What you expected to happen**:

The table to be loaded as a Dask DataFrame, in the same way that Parquet tables are read successfully.
**Anything else we need to know?**:

**Environment**:

- Dask version: 2021.04.0
- Python version: 3.7.10
- Operating System: Linux (Red Hat)
- Install method (conda, pip, source): conda
cc @dask/io
The error seems consistent with an empty path list. Can you share what is printed by the following code?
```python
from fsspec.core import get_fs_token_paths

path = <your path input for read_orc>
storage_options = <your storage_options input for read_orc>

fs, fs_token, paths = get_fs_token_paths(
    path, mode="rb", storage_options=storage_options
)
print(paths)
```
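For context, the logic around the failing line in `orc.py` looks roughly like the following. This is a paraphrased sketch reconstructed from the traceback, not the exact source: if `paths` comes back empty, the loop body never runs, `schema` is still `None` at line 76, and `_get_pyarrow_dtypes` raises exactly the `AttributeError` shown above.

```python
# Paraphrased sketch of dask/dataframe/io/orc.py (here `orc` is pyarrow.orc).
schema = None
nstripes_per_file = []
for path in paths:  # with an empty `paths`, this loop never executes
    with fs.open(path, "rb") as f:
        o = orc.ORCFile(f)
        if schema is None:
            schema = o.schema
        elif schema != o.schema:
            raise ValueError("Incompatible schemas while parsing ORC files")
        nstripes_per_file.append(o.nstripes)
# `schema` is still None here, so this raises:
# AttributeError: 'NoneType' object has no attribute 'metadata'
schema = _get_pyarrow_dtypes(schema, categories=None)
```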
When running:
```python
from fsspec.core import get_fs_token_paths

path = table_info('vca_europe.tcaf_iss_cs')['location']
storage_options = {
    'hdfs_port': 8020,
    'user': '{username}',
    'kerb_ticket': '/tmp/krb5cc_132855',
}
fs, fs_token, paths = get_fs_token_paths(
    path, mode="rb", storage_options=storage_options
)
print(paths)
```
The output is:

```
['/hive/vca_europe.db/tcaf_iss_cs']
```
Is `/hive/vca_europe.db/tcaf_iss_cs` an ORC file, or a directory? Given that the path starts with `hive/`, I suspect that this is a directory. Note that `read_orc` does not (yet) support hive-partitioned datasets, but I am hoping to address this in the near future (see #7756). If that is what you are trying to do here, you can explicitly pass in a list of files (along the lines of the sketch below), but the directory structure will not be used to populate "partition" columns.
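As a rough illustration of the list-of-files approach, something like the following might work. It is only a sketch: it assumes the data files sit under per-partition subdirectories and end in `.orc` (adjust the glob to match your layout), and the `hdfs://cluster` prefix is a stand-in for whatever scheme/host prefix your `location` actually carries.

```python
from fsspec.core import get_fs_token_paths
from dask.dataframe import read_orc

location = table_info('vca_europe.tcaf_iss_cs')['location']
storage_options = {
    'hdfs_port': 8020,
    'user': '{username}',
    'kerb_ticket': '/tmp/krb5cc_132855',
}

# Reuse the filesystem object from before to expand the directory.
fs, _, _ = get_fs_token_paths(location, mode="rb", storage_options=storage_options)

# Recursively collect the data files under the partition subdirectories
# ('**/*.orc' is an assumption about the layout).
files = fs.glob(location + '/**/*.orc')

# fs.glob returns paths without the scheme/host, so add the same prefix
# as `location` back before handing the list to read_orc.
prefix = 'hdfs://cluster'  # hypothetical; take this from your `location`
df = read_orc([prefix + f for f in files], storage_options=storage_options)
```

Again, the hive partition directory names will not become columns in the resulting DataFrame.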
It is indeed a directory! I will try to pass the list of all partitions under that directory. Also, I am assuming that passing `'/hive/vca_europe.db/tcaf_iss_cs/*'` (with `/*` appended) would not work, right?
Naively, I would expect `'/hive/vca_europe.db/tcaf_iss_cs/*'` to work. Did you try it out?
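For concreteness, that attempt would look like this (a sketch reusing the same `storage_options` as above; the `hdfs://cluster` prefix is again a stand-in for your actual table location):

```python
from dask.dataframe import read_orc

# Hypothetical: append '/*' to the table directory and let fsspec expand it.
df = read_orc(
    'hdfs://cluster/hive/vca_europe.db/tcaf_iss_cs/*',
    storage_options={
        'hdfs_port': 8020,
        'user': '{username}',
        'kerb_ticket': '/tmp/krb5cc_132855',
    },
)
```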