TopicNet icon indicating copy to clipboard operation
TopicNet copied to clipboard

Dataset "ruwiki_good" does not want to be downloaded

Open Alvant opened this issue 2 years ago • 0 comments

Well, the dataset is currently unavailable. It should be fixed — load_dataset('ruwiki_good'). ~~Or... it should at least download and tell which way the .txt file lies (so that it would be possible to do something manually with the file)~~.

If you try this:

>>> d = load_dataset('ruwiki_good')

you get something like this:

Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
    raise exception
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
    return Dataset(save_path, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
    self._data = self._read_data(data_path)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
    data = data_handle.read_csv(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

OS is:

Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

Expected Result

The dataset is 1) downloaded and 2) ready to use for topic modeling.

Current "Workaround"

If you set sep='###' in this code:

data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)

then everything seems to work fine.

Alvant avatar Nov 25 '22 14:11 Alvant