Reranker

Problem with reading dataset

Open HerrKrishna opened this issue 3 years ago • 4 comments

I tried to follow the training section of the readme. I get the following error:

Traceback (most recent call last):
  File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 22, in <module>
    train_dataset = GroupedTrainDataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
    self.nlp_dataset = datasets.load_dataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
    for obj in iterable:
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
    pa_table = pa_table.cast(self.config.schema)
  File "pyarrow\table.pxi", line 1409, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']

train.zip

I've attached the training file that I use. It follows the format described in the README.

HerrKrishna, Jun 24 '21 12:06

What version of datasets are you using?

luyug, Jun 24 '21 13:06

Check https://github.com/huggingface/datasets/issues/2548
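For readers hitting the same `ValueError`, one possible workaround is to rewrite the training file so that every record lists its top-level fields in a single fixed order. The sketch below assumes a JSON Lines file (one JSON object per line); the helper names `reorder_record` and `reorder_jsonl` are hypothetical, and the default order is an assumption taken from one of the two lists in the error message, so swap it if the loader expects the other one.

```python
import json

# Assumption: one of the two orders printed in the ValueError above.
FIELD_ORDER = ("neg", "pos", "qry")


def reorder_record(record, order=FIELD_ORDER):
    """Return a copy of `record` whose top-level keys follow `order`.

    Keys not named in `order` are kept and appended afterwards.
    """
    missing = [key for key in order if key not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    ordered = {key: record[key] for key in order}
    ordered.update((k, v) for k, v in record.items() if k not in ordered)
    return ordered


def reorder_jsonl(src_path, dst_path, order=FIELD_ORDER):
    """Rewrite a JSON Lines file so every line uses the same key order."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                record = reorder_record(json.loads(line), order)
                dst.write(json.dumps(record) + "\n")
```

Calling `reorder_jsonl("train.json", "train.reordered.json")` (both file names are placeholders) writes a new file and leaves the original untouched. This only changes key order; it does not validate the contents of each field.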

luyug, Jun 24 '21 18:06

Thank you for helping. I'm using datasets 1.8.0. I've reordered neg, pos and qry. Now I get this error:

Traceback (most recent call last):
  File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 25, in <module>
    train_dataset = GroupedTrainDataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
    self.nlp_dataset = datasets.load_dataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
    for obj in iterable:
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
    pa_table = pa_table.cast(self.config.schema)
  File "pyarrow\table.pxi", line 1414, in pyarrow.lib.Table.cast
  File "pyarrow\table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\pyarrow\compute.py", line 281, in cast
    return call_function("cast", [arr], options)
  File "pyarrow\_compute.pyx", line 465, in pyarrow._compute.call_function
  File "pyarrow\_compute.pyx", line 294, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, passage: list<item: int64>> to struct using function cast_struct

Can you help with that?

HerrKrishna, Jun 25 '21 09:06

Please first try our tested environment setup (torch==1.6.0, transformers==4.2.0, datasets==1.1.3, plus pyarrow==2.0.0) to see where the regression comes from. Meanwhile, your data does not seem to be in the correct format.
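A quick way to see how the active environment differs from those pins is to compare installed versions against them. This is only a sketch: the version numbers are copied from the comment above, the helper names are hypothetical, and local build suffixes such as `+cpu` are ignored on the assumption that they do not matter for this comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins quoted from the comment above.
TESTED = {
    "torch": "1.6.0",
    "transformers": "4.2.0",
    "datasets": "1.1.3",
    "pyarrow": "2.0.0",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cpu" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(TESTED).items():
        print(f"{package}: tested with {wanted}, found {found or 'not installed'}")
```

Running the script inside the environment used for training prints one line per package that differs from the tested setup, and prints nothing when all four match.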

luyug, Jun 25 '21 13:06
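To check the data format, it can help to print the nested shape of each record and compare it by eye with the layout in the README. The second traceback reports a field of shape `struct<qid: string, passage: list<item: int64>>`, and a summary like the one below shows which top-level field carries that shape. This is a minimal sketch assuming a JSON Lines file; `describe` and `shapes_in_jsonl` are hypothetical helpers, and the type names are Python's, not Arrow's.

```python
import json
from collections import Counter


def describe(value):
    """Summarise the nested shape of a JSON value, ignoring the data."""
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {describe(v)}" for k, v in value.items())
        return f"struct<{inner}>"
    if isinstance(value, list):
        # Collect the distinct shapes of the list's elements.
        shapes = sorted({describe(item) for item in value})
        return f"list<{' | '.join(shapes)}>"
    return type(value).__name__


def shapes_in_jsonl(path):
    """Count how many records in a JSON Lines file share each shape."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                counts[describe(json.loads(line))] += 1
    return counts
```

If `shapes_in_jsonl` returns more than one shape for a file, the records are not uniform, which is one plausible cause of a failed struct cast; a single shape that differs from the README layout points at the field names or types instead.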