TopicNet
Dataset "ruwiki_good" does not want to be downloaded
The dataset is currently unavailable: load_dataset('ruwiki_good') fails and should be fixed. ~~Or... it should at least download the file and report where the .txt file is saved (so that the file could be processed manually)~~.
If you try this:
>>> d = load_dataset('ruwiki_good')
you get something like this:
Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
    raise exception
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
    return Dataset(save_path, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
    self._data = self._read_data(data_path)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
    data = data_handle.read_csv(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.
OS is:
Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
Expected Result
The dataset is 1) downloaded and 2) ready to use for topic modeling.
Current "Workaround"
If you set sep='###' instead of sep='\n' in this code (_read_data in topicnet/cooking_machine/dataset.py):
data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)
then everything seems to work fine.
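
The '###' workaround presumably works only because that string never occurs in the file, so each line ends up as a single field. For anyone who wants to handle the downloaded .txt manually in the meantime, here is a minimal sketch that checks this assumption and reads one document per line without any CSV parser. The function names are hypothetical, and the value of VW_TEXT_COL is an assumption about TopicNet's constant; this is not the library's own fix.

```python
# Assumption: the column name TopicNet stores in VW_TEXT_COL; verify it
# against the installed version before relying on it.
VW_TEXT_COL = 'vw_text'


def separator_is_safe(path, sep, encoding='utf-8'):
    """Return True if `sep` never occurs in the file, so a parser that
    splits on it would leave every line as a single field."""
    with open(path, encoding=encoding) as handle:
        return all(sep not in line for line in handle)


def read_documents(path, encoding='utf-8'):
    """Read one document per line, skipping blank lines.

    Returns a list of {VW_TEXT_COL: line} records, which is the shape the
    read_csv call above was meant to produce.
    """
    with open(path, encoding=encoding) as handle:
        stripped = (line.rstrip('\n') for line in handle)
        return [{VW_TEXT_COL: line} for line in stripped if line.strip()]
```

If separator_is_safe(path, '###') returns False for the downloaded file, the workaround would split some documents into several fields, and a different separator (or the line-by-line reader) would be the safer choice. The records returned by read_documents can be passed to pandas.DataFrame if a data frame is needed.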