structural-probes

AssertionError appears when trying to align embeddings

Open Awkwafina opened this issue 4 years ago • 6 comments

I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run this code, an 'AssertionError' appears. I don't know whether this is a bug or not; nevertheless, could you check my yaml file? I am working in Google Colab. bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam bert_train

Awkwafina • Aug 27 '20 12:08

Hi! I think the issue is with this line:

    type: token #{token,subword}

which should be

    type: subword #{token,subword}

since BERT uses subword tokenization (and hence the subwords need to be mapped to corpus tokens).
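To illustrate what mapping subwords to corpus tokens involves, here is a toy sketch. It is not the repository's code: the function names `group_subwords` and `average_by_token` are made up for this example, and it assumes WordPiece-style pieces where a leading `##` marks the continuation of the previous token, with each token's vector taken as the average of its pieces.

```python
from typing import List, Sequence


def group_subwords(subwords: Sequence[str]) -> List[List[int]]:
    """Group subword indices by token (assumes '##' marks a continuation piece)."""
    groups: List[List[int]] = []
    for i, piece in enumerate(subwords):
        if piece.startswith("##") and groups:
            groups[-1].append(i)  # continuation of the current token
        else:
            groups.append([i])  # first piece of a new token
    return groups


def average_by_token(
    vectors: Sequence[Sequence[float]], groups: Sequence[Sequence[int]]
) -> List[List[float]]:
    """Average the subword vectors in each group to get one vector per token."""
    averaged: List[List[float]] = []
    for group in groups:
        dim = len(vectors[group[0]])
        averaged.append(
            [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
        )
    return averaged


# Hypothetical example: 6 subword pieces that make up 3 corpus tokens.
subwords = ["the", "em", "##bed", "##ding", "##s", "align"]
vectors = [[float(i), float(i) * 2] for i in range(len(subwords))]

groups = group_subwords(subwords)
token_vectors = average_by_token(vectors, groups)
print(groups)         # [[0], [1, 2, 3, 4], [5]]
print(token_vectors)  # [[0.0, 0.0], [2.5, 5.0], [5.0, 10.0]]
```

With `type: token`, no such grouping step would be expected, so the number of stored vectors per sentence would have to equal the number of corpus tokens directly.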

This is an annoying aspect of the config structure as of now, since you might expect that the BERT-disk flag alone would specify this, but alas.

john-hewitt • Aug 27 '20 18:08

Hi, I am still facing this issue. Maybe there is something wrong with the hdf5 file? Here, the assertion fails with single_layer_features.shape[0] = 68 and len(tokenized_sent) = 74:

    68 74
    [aligning embeddings]: 25% 3173/12543 [00:09<00:27, 339.59it/s]
    Traceback (most recent call last):
      File "/content/structural-probes/structural-probes/run_experiment.py", line 242, in <module>
        execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
      File "/content/structural-probes/structural-probes/run_experiment.py", line 170, in execute_experiment
        expt_dataset = dataset_class(args, task)
      File "/content/structural-probes/structural-probes/data.py", line 34, in __init__
        self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
      File "/content/structural-probes/structural-probes/data.py", line 65, in read_from_disk
        train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
      File "/content/structural-probes/structural-probes/data.py", line 408, in optionally_add_embeddings
        embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
      File "/content/structural-probes/structural-probes/data.py", line 398, in generate_subword_embeddings_from_hdf5
        assert single_layer_features.shape[0] == len(tokenized_sent)
    AssertionError

Awkwafina • Aug 30 '20 08:08

Or maybe I need to disable assertions to proceed?

Awkwafina • Sep 12 '20 11:09

Was this ever resolved? I'm experiencing the same issue.

joebartusek • May 01 '21 02:05

Not sure, but it seems that the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing different numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing step; I'm not 100% sure.
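One way to narrow this down might be to list every sentence whose stored vector count disagrees with its tokenized length, rather than stopping at the first failed assertion. The sketch below is a hypothetical helper, not part of this codebase; `find_length_mismatches` and its inputs are assumptions made for illustration, and the example numbers simply reuse the 68 vs. 74 reported above.

```python
from typing import List, Sequence, Tuple


def find_length_mismatches(
    feature_row_counts: Sequence[int],
    tokenized_sentences: Sequence[Sequence[str]],
) -> List[Tuple[int, int, int]]:
    """Return (sentence_index, n_stored_vectors, n_tokens) for every disagreement."""
    mismatches: List[Tuple[int, int, int]] = []
    for index, (n_rows, tokens) in enumerate(
        zip(feature_row_counts, tokenized_sentences)
    ):
        if n_rows != len(tokens):
            mismatches.append((index, n_rows, len(tokens)))
    return mismatches


# Hypothetical inputs. In practice the row counts would be read from the hdf5
# file (how they are keyed and shaped there is an assumption to verify), and the
# token lists would come from the same tokenizer call the data loader uses.
feature_row_counts = [5, 68]
tokenized_sentences = [
    ["[CLS]", "a", "b", "c", "[SEP]"],
    ["tok"] * 74,
]

mismatches = find_length_mismatches(feature_row_counts, tokenized_sentences)
print(mismatches)  # [(1, 68, 74)]
```

If only a few sentences mismatch, a tokenizer or preprocessing difference seems the more likely cause; if nearly every sentence after some index mismatches, the ordering or sentence count of the two files may have drifted apart.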

Also, added uncertainty: I'm not sure how the tokenizer API of the huggingface transformers module has changed since I wrote this code, back when it was still pytorch-pretrained-BERT, not transformers.

john-hewitt • May 02 '21 00:05

In case someone finds it useful, I wrote a version compatible with the new transformers library. I posted the main changes in issue #13.

caspillaga • Dec 22 '21 21:12