CURE
beamsearch.py script is broken
Hi @jiang719 @lin-tan
We have managed to train the model, but the inference step fails. In beamsearch.py, we keep getting the same error when attempting to generate the hypothesis (in both CPU and GPU mode) by running src/tester/generator.py. Regardless of the device, we always get the error below.
/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "src/tester/generator.orig.py", line 134, in <module>
generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
File "src/tester/generator.py", line 89, in generate_gpt_conut
generator.generate(output_file)
File "src/tester/generator.py", line 39, in generate
hypothesis = self.beamsearch.generate_gpt_conut(sample)
File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 570, in generate_gpt_conut
logits = self.model.decode(
File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 114, in decode
logits = self.model.decoder(
File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/../models/gpt_conut.py", line 313, in forward
embed = share_embed_model.transformer(
File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/nnashid/.local/lib/python3.8/site-packages/transformers/modeling_openai.py", line 429, in forward
inputs_embeds = self.tokens_embed(input_ids)
File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
@jiang719 we are stuck with this problem. We trained the GPT-CoNuT model and ran inference, but during inference we keep getting the above error. We would really appreciate your insight into it.
-
This is likely due to a problem with the vocabulary. Are you using the pre-trained GPT model I shared when training your own GPT-CoNuT model?
-
It could also be a problem with the data format. Make sure you follow the three steps in CURE/data/data/prepare_testing_data.py to prepare the test data in the required format.
We have trained GPT-CoNuT with our own dataset. My colleague @msintaha has already gone through the dataset-creation steps to ensure we are following the same format, but we will cross-check again on our side.
@jiang719 thanks for your feedback, we really appreciate it.
The point of the first possible cause is this: when you trained your own GPT-CoNuT model, did you only change the train_file and valid_file in src/trainer/gpt_conut_trainer.py and keep the vocab_file and gpt_file unchanged? If that's the case, the model should be fine and the problem is more likely to be in the test data.
Could you share one test instance from the input_file used in src/tester/generator.py, along with its corresponding lines in the identifier_txt_file and identifier_token_file, so I can check whether they look correct?
Yes, here you go
input_bpe.txt
app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; <CTX> var _ = require ( $STRING$ ) ; var express = require ( $STRING$ ) ; var app = express ( ) ; var http = require ( $STRING$ ) . Server ( app ) ; var path = require ( $STRING$ ) ; var io = require ( $STRING$ ) ( http ) ; const PORT = process . env . PORT || $NUMBER$ ; var users = [ ] ; app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( users ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( _ . find ( users , ( user ) = > user . id == == req . query . user CaMeL Id ) ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { console . log ( $STRING$ ) ; res . send CaMeL File ( path . join ( _ _ dirname + $STRING$ ) ) ; } ) ; io . on ( $STRING$ , ( socket ) = > { console . log ( ` a user connected : $ { socket . id } ` ) ; socket . on ( $STRING$ , ( player ) = > { users . push ( player ) ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; socket . on ( $STRING$ , ( payload ) = > { var user = _ . find ( users , ( user ) = > user . id == == payload . user CaMeL Id ) ; user . life = payload . life ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; } ) ; http . listen ( PORT , ( ) = > { console . log ( $STRING$ ) ; } ) ;@@
identifier.txt
send if throw 1 return post code http ++ express , Server exports var function router Router static ] msg err extends $NUMBER$ PORT from __dirname ) getElementById > find obj _ 0xffffffff catch io async class type content get JSON options continue document 0x7f push switch || env id use break ! res + body [ listen user connect ; result else PropTypes error for T const 0 typeof sendFile - app Route undefined key payload import on } React req ( connection < while console join e false broadcast i life value users emit length host 0x1f style name set state message do = $STRING$ action userId log await of data in url => socket <<unk>> true module path config node stringify process done new axios query { . player require === :
identifier.tokens
send <SEP> if <SEP> throw <SEP> 1 <SEP> return <SEP> post <SEP> code <SEP> http <SEP> ++ <SEP> express <SEP> , <SEP> Server <SEP> exports <SEP> var <SEP> function <SEP> router <SEP> Router <SEP> static <SEP> ] <SEP> msg <SEP> err <SEP> extends <SEP> $NUMBER$ <SEP> PORT <SEP> from <SEP> _ _ dirname <SEP> ) <SEP> get CaMeL Element CaMeL By CaMeL Id <SEP> > <SEP> find <SEP> obj <SEP> _ <SEP> 0 xffffffff <SEP> catch <SEP> io <SEP> async <SEP> class <SEP> type <SEP> content <SEP> get <SEP> JSON <SEP> options <SEP> continue <SEP> document <SEP> 0 x $NUMBER$ f <SEP> push <SEP> switch <SEP> || <SEP> env <SEP> id <SEP> use <SEP> break <SEP> ! <SEP> res <SEP> + <SEP> body <SEP> [ <SEP> listen <SEP> user <SEP> connect <SEP> ; <SEP> result <SEP> else <SEP> Prop CaMeL Types <SEP> error <SEP> for <SEP> T <SEP> const <SEP> 0 <SEP> typeof <SEP> send CaMeL File <SEP> - <SEP> app <SEP> Route <SEP> undefined <SEP> key <SEP> payload <SEP> import <SEP> on <SEP> } <SEP> React <SEP> req <SEP> ( <SEP> connection <SEP> < <SEP> while <SEP> console <SEP> join <SEP> e <SEP> false <SEP> broadcast <SEP> i <SEP> life <SEP> value <SEP> users <SEP> emit <SEP> length <SEP> host <SEP> 0 x 1 f <SEP> style <SEP> name <SEP> set <SEP> state <SEP> message <SEP> do <SEP> = <SEP> $STRING$ <SEP> action <SEP> user CaMeL Id <SEP> log <SEP> await <SEP> of <SEP> data <SEP> in <SEP> url <SEP> = > <SEP> socket <SEP> <<unk>> <SEP> true <SEP> module <SEP> path <SEP> config <SEP> node <SEP> stringify <SEP> process <SEP> done <SEP> new <SEP> axios <SEP> query <SEP> { <SEP> . <SEP> player <SEP> require <SEP> == == = <SEP> :
@msintaha Looks like you only ran the prepare_cure_input function. There are two remaining steps:
- Run subword-nmt to tokenize these lines into subwords.
- Run clean_testing_bpe to finalize the input files.
Please check the readme file under CURE/data/data; the Prepare Test Input section shows the steps. If possible, I recommend integrating these three steps into your own script, for example as sketched below.
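A minimal sketch of such a driver script (assumptions: prepare_cure_input and clean_testing_bpe can be imported from prepare_testing_data.py and called without arguments, which you should verify against the actual signatures, and subword-nmt is on the PATH):

import subprocess
from prepare_testing_data import prepare_cure_input, clean_testing_bpe

# Step 1: build input.txt, identifier.txt and identifier.tokens.
prepare_cure_input()

# Step 2: apply the shared BPE codes so the subwords match the vocabulary.
with open('input.txt') as src, open('input_bpe.txt', 'w') as dst:
    subprocess.run(['subword-nmt', 'apply-bpe', '-c', 'subword.txt'],
                   stdin=src, stdout=dst, check=True)

# Step 3: finalize the BPE files into the format generator.py expects.
clean_testing_bpe()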
I have actually run those as well, using the generated subword.txt; those steps were mentioned at the end of the prepare_cure_input script.
First I generated the vocab using:
subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt
Then I ran:
subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens
Then you should have a file called identifier_bpe.tokens, which should not contain <SEP>, as the input to generator.py.
But now I suspect the problem is the vocabulary: since you trained your own subword-nmt model, your vocabulary file also changed. How many unique lines do you have in your own vocabulary.txt?
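For a quick count of unique entries (assuming one entry per line in vocabulary.txt):

# Count unique lines in the vocabulary file.
with open('vocabulary.txt') as f:
    print(len({line.rstrip('\n') for line in f}))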
If you changed the vocabulary file, you need to re-train the GPT model first (i.e., re-train a new Huggingface GPT model), since the one I shared only recognizes the 50057-entry vocabulary in data/vocabulary/vocabulary.txt. If your new vocabulary file contains more entries, that causes the index out of range error.
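One quick way to confirm this is to compare the largest token id that reaches the model against the number of rows in the embedding table. A minimal sketch (tokens_embed is the OpenAI GPT embedding layer named in the traceback; gpt and samples are placeholders for your loaded GPT module and prepared inputs):

# Check that every token id fits inside the GPT embedding table.
vocab_rows = gpt.tokens_embed.num_embeddings  # 50057 for the shared GPT
for i, input_ids in enumerate(samples):  # input_ids: LongTensor of token ids
    top = int(input_ids.max())
    if top >= vocab_rows:
        print(f'sample {i}: max token id {top} >= {vocab_rows}')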
We have 46,247 lines in vocabulary.txt. And yes, the generated identifier_bpe.tokens file does not contain <SEP>.
That looks reasonable. Could you enclose the call to generate_gpt_conut in a try-except block and see if it crashes for every input or just some?
Another possibility I can imagine is that the input exceeds the maximum length (1024 tokens) set for the GPT model, but that would only cause the long inputs to crash.
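A minimal sketch of such a wrapper (the generate_gpt_conut call matches the traceback above; samples and beamsearch are placeholders for the objects generator.py already builds):

# Wrap each beam-search call so one failing sample does not stop the run.
for i, sample in enumerate(samples):
    try:
        hypothesis = beamsearch.generate_gpt_conut(sample)
    except IndexError as err:
        print(f'sample {i} raised IndexError: {err}')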
The maximum input length is within 1022 tokens. We have enclosed the call in a try-except block, and it crashes on all the inputs.
Hi there @lin-tan, I just cloned your code and tried to run it with the trained model you shared, but it always fails with the following error.
D:\Python\python.exe E:/cure/CURE/src/tester/generator.py
50061
Traceback (most recent call last):
File "E:/cure/CURE/src/tester/generator.py", line 135, in
@ozzydong I also met the same problem. Have you solved it? Thank you.
@studypython33 I also met the same problem. Have you solved it? Thanks in advance.