Consider using "Sequence" instead of "List"
Feature request
Hi, please consider using the Sequence type annotation instead of List in function arguments such as in Dataset.from_parquet(). Using List leads to type checking errors, see below.
How to reproduce
from datasets import Dataset

list_of_filenames = ["foo.parquet", "bar.parquet"]
ds = Dataset.from_parquet(list_of_filenames)
Expected mypy output:
Success: no issues found
Actual mypy output:
test.py:19: error: Argument 1 to "from_parquet" of "Dataset" has incompatible type "List[str]"; expected "Union[Union[str, bytes, PathLike[Any]], List[Union[str, bytes, PathLike[Any]]]]" [arg-type]
test.py:19: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
test.py:19: note: Consider using "Sequence" instead, which is covariant
Env: mypy 0.991, Python 3.10.0, datasets 2.7.1
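For reference, here is a minimal, self-contained sketch of the variance issue; the load_with_* functions and the PathLike alias are illustrative stand-ins, not the actual datasets API. Running mypy on it reproduces the same pattern of errors.

# Save as e.g. variance_demo.py and run: mypy variance_demo.py
import os
from typing import List, Sequence, Union

PathLike = Union[str, bytes, os.PathLike]

def load_with_list(path_or_paths: Union[PathLike, List[PathLike]]) -> None: ...
def load_with_sequence(path_or_paths: Union[PathLike, Sequence[PathLike]]) -> None: ...

files: List[str] = ["foo.parquet", "bar.parquet"]
load_with_list(files)      # mypy error: List is invariant, so List[str] is not a List[PathLike]
load_with_sequence(files)  # OK: Sequence is covariant, so a List[str] is a valid Sequence[PathLike]

Because Sequence is read-only and therefore covariant, a List[str] argument is accepted wherever a Sequence[Union[str, bytes, os.PathLike]] is expected, which is exactly the situation in the call above.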
Hi! Linking a comment to provide more info on the issue: https://stackoverflow.com/a/39458225. This means we should replace all (most of) the occurrences of List with Sequence in function signatures.
@tranhd95 Would you be interested in submitting a PR?
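One caveat worth keeping in mind when doing this replacement (my own note, not from the thread): str is itself a Sequence[str], so for parameters annotated List[str] a blanket switch to Sequence[str] would also let a bare string pass the type checker where a list of strings is intended, which may be one reason to replace most rather than all occurrences.

from typing import List, Sequence

def takes_list(names: List[str]) -> None: ...
def takes_sequence(names: Sequence[str]) -> None: ...

takes_list("foo.parquet")      # mypy error: str is not List[str]
takes_sequence("foo.parquet")  # accepted by mypy, even though a single string is probably a mistake here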
Hi all! I tried to reproduce this issue and it didn't work for me. Also, in your example I noticed that the variables have different names (list_of_filenames and list_of_files); could this be related to that?
# I found random data in parquet format:
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet"
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

# Then I try to reproduce:
from datasets import Dataset

list_of_files = ["userdata1.parquet", "userdata2.parquet"]
ds = Dataset.from_parquet(list_of_files)
My output:
WARNING:datasets.builder:Using custom data configuration default-e287d097dc54e046
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%
1/1 [00:00<00:00, 40.38it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.43it/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
P.S. This is my first experience with open source, so please don't judge too strictly if I misunderstand something.
@dantema There is indeed a typo in the variable names. Nevertheless, I'm sorry if I was not clear, but the output is from the mypy type checker. You can run the code snippet without issues; the problem is with the type checking.
However, I found out that the type annotation is actually misleading. The from_parquet method should also accept a list of PathLike objects, which includes os.PathLike. But if I run the code snippet below, an exception is thrown.
Code
from pathlib import Path
from datasets import Dataset

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
ds = Dataset.from_parquet(list_of_filenames)
Output
/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py in from_parquet(path_or_paths, split, features, cache_dir, keep_in_memory, columns, **kwargs)
1071 from .io.parquet import ParquetDatasetReader
1072
-> 1073 return ParquetDatasetReader(
1074 path_or_paths,
1075 split=split,
/usr/local/lib/python3.8/dist-packages/datasets/io/parquet.py in __init__(self, path_or_paths, split, features, cache_dir, keep_in_memory, streaming, **kwargs)
35 path_or_paths = path_or_paths if isinstance(path_or_paths, dict) else {self.split: path_or_paths}
36 hash = _PACKAGED_DATASETS_MODULES["parquet"][1]
---> 37 self.builder = Parquet(
38 cache_dir=cache_dir,
39 data_files=path_or_paths,
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in __init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, name, **config_kwargs)
298
299 if data_files is not None and not isinstance(data_files, DataFilesDict):
--> 300 data_files = DataFilesDict.from_local_or_remote(
301 sanitize_patterns(data_files), base_path=base_path, use_auth_token=use_auth_token
302 )
/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
794 for key, patterns_for_key in patterns.items():
795 out[key] = (
--> 796 DataFilesList.from_local_or_remote(
797 patterns_for_key,
798 base_path=base_path,
/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
762 ) -> "DataFilesList":
763 base_path = base_path if base_path is not None else str(Path().resolve())
--> 764 data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
765 origin_metadata = _get_origin_metadata_locally_or_by_urls(data_files, use_auth_token=use_auth_token)
766 return cls(data_files, origin_metadata)
/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
357 data_files = []
358 for pattern in patterns:
--> 359 if is_remote_url(pattern):
360 data_files.append(Url(pattern))
361 else:
/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py in is_remote_url(url_or_filename)
62
63 def is_remote_url(url_or_filename: str) -> bool:
---> 64 parsed = urlparse(url_or_filename)
65 return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")
66
/usr/lib/python3.8/urllib/parse.py in urlparse(url, scheme, allow_fragments)
373 Note that we don't break the components up in smaller bits
374 (e.g. netloc is a single string) and we don't expand % escapes."""
--> 375 url, scheme, _coerce_result = _coerce_args(url, scheme)
376 splitresult = urlsplit(url, scheme, allow_fragments)
377 scheme, netloc, url, query, fragment = splitresult
/usr/lib/python3.8/urllib/parse.py in _coerce_args(*args)
125 if str_input:
126 return args + (_noop,)
--> 127 return _decode_args(args) + (_encode_result,)
128
129 # Result objects are more helpful than simple tuples
/usr/lib/python3.8/urllib/parse.py in _decode_args(args, encoding, errors)
109 def _decode_args(args, encoding=_implicit_encoding,
110 errors=_implicit_errors):
--> 111 return tuple(x.decode(encoding, errors) if x else '' for x in args)
112
113 def _coerce_args(*args):
/usr/lib/python3.8/urllib/parse.py in <genexpr>(.0)
109 def _decode_args(args, encoding=_implicit_encoding,
110 errors=_implicit_errors):
--> 111 return tuple(x.decode(encoding, errors) if x else '' for x in args)
112
113 def _coerce_args(*args):
AttributeError: 'PosixPath' object has no attribute 'decode'
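The traceback points at the root cause: is_remote_url passes the value straight to urllib.parse.urlparse, which only handles str and bytes, so a PosixPath falls through to the bytes-decoding branch and raises the AttributeError above. Until the annotation and the implementation agree, converting the paths to strings first is a simple workaround (a sketch, assuming the parquet files exist locally):

from pathlib import Path
from datasets import Dataset

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]

# Workaround for the runtime error: pass plain strings. Note that mypy will still
# flag the List invariance issue reported above until the annotation is changed.
ds = Dataset.from_parquet([str(p) for p in list_of_filenames])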
@mariosasko Should I create a new issue?
@mariosasko I would like to take this issue up.
@avinashsai Hi, I've assigned you the issue.
@tranhd95 Yes, feel free to report this in a new issue.
@avinashsai Are you still working on this? If not, I would like to give it a try.
@mariosasko I would like to take this issue up!
Hi @tranhd95 @mariosasko, I hope you all are doing well.
I am interested in this issue; is it still open and unresolved?
Thanks and regards