OCTIS
Error loading custom dataset
- OCTIS version: 1.11.0
- Python version: 3.8
- Operating System: Windows 10
Description
Hello,
I am having trouble loading my custom dataset. I followed the guide in the main README and am getting the below errors.
What I Did
from octis.dataset.dataset import Dataset
import pandas as pd
df = pd.read_csv("/mnt/mydata/notebooks/data.csv")
df.to_csv('corpus.tsv', sep="\t", header=False, columns=['documents'])
dataset = Dataset()
dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:330: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
final_df = df[df[1] == 'train'].append(df[df[1] == 'val'])
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:331: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
final_df = final_df.append(df[df[1] == 'test'])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
335
--> 336 self.__corpus = [d.split() for d in final_df[0].tolist()]
337 if len(final_df.keys()) > 2:
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in <listcomp>(.0)
335
--> 336 self.__corpus = [d.split() for d in final_df[0].tolist()]
337 if len(final_df.keys()) > 2:
AttributeError: 'int' object has no attribute 'split'
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-16-28e6bd2fc3cd> in <module>
1 dataset = Dataset()
----> 2 dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
356 self._load_document_indexes(self.dataset_path + "/indexes.txt")
357 except:
--> 358 raise Exception("error in loading the dataset:" + self.dataset_path)
359
360 def fetch_dataset(self, dataset_name, data_home=None, download_if_missing=True):
Exception: error in loading the dataset:/mnt/mydata/notebooks
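A note on the traceback: the AttributeError is consistent with `to_csv` writing the DataFrame index as the first column, which is its default behaviour. The loader then reads column 0 as integers and calls `.split()` on them. The warning lines also show the loader filtering column 1 on 'train', 'val' and 'test', so the file appears to need a partition column after the text. Below is a minimal sketch of writing such a file, not a confirmed fix. The helper name `to_octis_corpus` is hypothetical, the `documents` column name is taken from the report, and putting every row in 'train' is only a placeholder assumption.

```python
import pandas as pd


def to_octis_corpus(df, text_column="documents"):
    """Return tab-separated text: one document per row, then a partition label.

    Hypothetical helper; the all-'train' partition is a placeholder assumption.
    """
    out = pd.DataFrame({
        "document": df[text_column].astype(str),  # force text so .split() works downstream
        "partition": "train",                     # scalar is broadcast to every row
    })
    # index=False keeps the integer row index out of the first column
    return out.to_csv(sep="\t", header=False, index=False)


df = pd.DataFrame({"documents": ["first short document", "second short document"]})
tsv_text = to_octis_corpus(df)
first_field = tsv_text.splitlines()[0].split("\t")[0]
```

To write the file itself, the returned string can be saved as corpus.tsv inside the folder passed to `load_custom_dataset_from_folder`.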
In the [Load a Custom Dataset] section, the README says the dataset should include a vocabulary file, but my dataset is just a CSV file. How can we generate this vocabulary file? Does the pipeline generate it automatically?
Per the readme, the custom dataset is a tsv file, which is what our csv is. I'm uncertain what the vocab file should be.
Hi, the vocabulary file is just the list of words contained in the documents. See #92 for how to generate it from the TSV file.
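For illustration, here is a minimal sketch of building such a word list from the first column of a TSV corpus using only the standard library. It is not necessarily the approach described in #92. The function name `build_vocabulary` is hypothetical, and whitespace tokenization is an assumption chosen to match the `d.split()` call visible in the traceback.

```python
import csv
import io


def build_vocabulary(tsv_lines):
    """Collect the sorted set of whitespace-separated tokens in the first TSV column."""
    vocab = set()
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if row:  # skip blank lines
            vocab.update(row[0].split())
    return sorted(vocab)


# Small in-memory sample standing in for corpus.tsv (document, partition)
sample = io.StringIO("the cat sat\ttrain\nthe dog ran\ttest\n")
vocabulary = build_vocabulary(sample)
```

The resulting list can then be written one word per line, for example with `"\n".join(vocabulary)`, to a vocabulary file next to the corpus. The exact file name the loader expects should be checked against the README.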