
Question about the data preprocessing step

Open Linjz1 opened this issue 3 years ago • 5 comments

Hello, a question about the data preprocessing step: my datasets are imdb, yelp2013, and yelp2014, which are 10-, 5-, and 5-class classification problems respectively. Should the data be organized like the files under raw_data/sent/imdb, i.e. sentence + label? (Does multi-class classification require changing the corresponding code?) And is preprocess/prep_sent.py then used to run the preprocessing?
Thank you for reading, and I look forward to your reply!

Linjz1 avatar Apr 22 '21 06:04 Linjz1

Hello. Organize the data into the format under raw_data/sent/imdb and then run preprocess/prep_sent.py; multi-class classification needs no code changes at this stage.

Fine-tuning does require a code change, because my imdb task is binary classification. You can either modify my imdb data-processing class directly or write your own modeled on it. See finetune/sent_data_utils_sentilr.py, lines 143-169; the main change is the set of class labels returned by the get_labels function.
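A minimal sketch of what that change might look like for a 10-class imdb variant (the class name and string labels here are hypothetical, not the repository's actual code; the authoritative reference remains lines 143-169 of finetune/sent_data_utils_sentilr.py):

```python
# Hypothetical processor for a 10-class imdb variant. Only get_labels is
# sketched here; a real processor in the repo also reads the data files.
class ImdbTenClassProcessor:
    def get_labels(self):
        # A binary processor would return two labels; for a 10-class
        # task, return all ten label strings used in your data files.
        return [str(i) for i in range(10)]

processor = ImdbTenClassProcessor()
assert len(processor.get_labels()) == 10
```

The 5-class yelp2013/yelp2014 tasks would follow the same pattern with range(5).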

kepei1106 avatar Apr 22 '21 10:04 kepei1106

OK, the data has been organized into the format you provided, but running prep_sent.py raises the following error. How can this be resolved?

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traceback (most recent call last):
  File "prep_sent.py", line 210, in <module>
    convert_sentence(path, task, sentinet, gloss_embedding, gloss_embedding_norm)
  File "prep_sent.py", line 194, in convert_sentence
    clean_text_list, pos_list, senti_list, clean_label_list = process_text(text_list, label_list, sentinet, gloss_embedding, gloss_embedding_norm)
  File "prep_sent.py", line 117, in process_text
    corpus_embedding = model.encode(sent_list_str, batch_size=64)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py", line 194, in encode
    out_features = self.forward(features)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 969, in forward
    past_key_values_length=past_key_values_length,
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 207, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (2142) must match the size of tensor b (512) at non-singleton dimension 1
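For reference, the final RuntimeError says that a tokenized input of length 2142 was fed to a BERT model whose position embeddings cover only 512 positions. Whatever the root cause, one defensive workaround is to cap sentence length before calling model.encode; this is a rough sketch (the word-based cutoff is an assumption, since subword tokenization expands words, and this helper is not part of the repository's code):

```python
def truncate_for_bert(sentences, max_words=300):
    """Roughly cap sentence length before sentence embedding.

    BERT-style encoders accept at most 512 positions. Subword
    tokenization expands each word into one or more tokens, so a
    conservative word-level cap (300 is an arbitrary assumed value)
    keeps inputs safely under the limit.
    """
    return [" ".join(s.split()[:max_words]) for s in sentences]

# A 2000-word sentence is cut down to 300 words before encoding.
capped = truncate_for_bert(["word " * 2000])
assert len(capped[0].split()) == 300
```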

Linjz1 avatar Apr 23 '21 11:04 Linjz1

Are you able to run the raw_data I provided? It runs without problems on my side. Your traceback suggests that something went wrong inside sentence-transformers while encoding the sentences:

File "prep_sent.py", line 117, in process_text
    corpus_embedding = model.encode(sent_list_str, batch_size=64)

My guess is that the versions of sentence-transformers and huggingface transformers do not match. My preprocessing environment is: transformers (huggingface) 2.3.0; sentence-transformers 0.2.6.

I suggest first checking whether your versions match those, and then debugging from the traceback.
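A quick way to verify that the two versions above are the ones actually installed; a sketch (importlib.metadata requires Python 3.8+, so on the Python 3.6 environments shown in the tracebacks you would use pip show or pkg_resources instead):

```python
import importlib.metadata as md  # Python 3.8+

# Versions the maintainer reports for the preprocessing environment.
EXPECTED = {"transformers": "2.3.0", "sentence-transformers": "0.2.6"}

def check_versions(expected):
    """Return {package: installed_version_or_None} for every mismatch."""
    mismatches = {}
    for pkg, wanted in expected.items():
        try:
            installed = md.version(pkg)
        except md.PackageNotFoundError:
            installed = None  # package not installed at all
        if installed != wanted:
            mismatches[pkg] = installed
    return mismatches

print(check_versions(EXPECTED))  # empty dict when the environment matches
```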

kepei1106 avatar Apr 25 '21 04:04 kepei1106

I can't run the raw_data you provided either; the error is the same as with my own dataset. Please see the email I sent you (at the address given in your paper). I rebuilt the preprocessing environment as you described and reinstalled everything. Running pip install sentence-transformers==0.2.6 directly fails with this error:

Using cached https://pypi.tuna.tsinghua.edu.cn/packages/51/9d/cef25b5faabdc1b54d218012ee821292312e139e76cc40105c824ad024bb/sentence-transformers-0.2.6.tar.gz (55 kB)
ERROR: Command errored out with exit status 1:
 command: /home/fzuirdata/anaconda3/envs/py37Lin/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-6e0stt18
 cwd: /tmp/pip-install-5xf_3m8x/sentence-transformers/
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py", line 6, in <module>
    with open('requirements.txt', mode="r", encoding="utf-8") as req_file:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

So I used another approach and downloaded version 0.2.6 from GitHub. The transformers (huggingface) 2.3.0 you mentioned is not compatible with sentence-transformers 0.2.6:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 0.2.6 requires transformers>=2.8.0, but you have transformers 2.3.0 which is incompatible.

So I had to accept the default transformers version pinned in the requirements.txt of the GitHub 0.2.6 release. My current environment is sentence-transformers 0.2.6 with transformers 4.5.1, and running now raises the error below. How can I resolve it?

Traceback (most recent call last):
  File "prep_sent.py", line 15, in <module>
    model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
  File "/home/fzuirdata/anaconda3/envs/py36torch/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
    with open(os.path.join(model_path, 'modules.json')) as fIn:
FileNotFoundError: [Errno 2] No such file or directory: 'sentence-transformers/bert-base-nli-mean-tokens/modules.json'

Could you send me the sentence-transformers 0.2.6 package you used? My email is [email protected]

Linjz1 avatar Apr 25 '21 06:04 Linjz1

The requirements.txt of sentence-transformers 0.2.6 pins transformers==2.3.0, at least in the copy I downloaded, and I haven't hit any incompatibility errors using that combination. I'll send the package to your email shortly.

As for the last error you mention, FileNotFoundError: [Errno 2] No such file or directory: 'sentence-transformers/bert-base-nli-mean-tokens/modules.json': the cause is that you haven't downloaded the sentence-transformers model bert-base-nli-mean-tokens, or the path to the downloaded model is set incorrectly. My code uses the path I set up on my machine; you need to change it to your own.
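A small sanity check along those lines, sketched under the assumption that the model was downloaded to a local folder (this path-checking helper is hypothetical, not part of the repository or of sentence-transformers):

```python
import os

def check_st_model_dir(model_path):
    """Verify that a local sentence-transformers 0.2.x model folder
    contains modules.json; its absence produces exactly the
    FileNotFoundError quoted above."""
    modules = os.path.join(model_path, "modules.json")
    if not os.path.isfile(modules):
        raise FileNotFoundError(
            "No modules.json under %r; download bert-base-nli-mean-tokens "
            "and point the path at that folder." % model_path
        )
    return modules

# Example (replace with your own download location):
# check_st_model_dir("/data/models/bert-base-nli-mean-tokens")
```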

kepei1106 avatar Apr 25 '21 07:04 kepei1106