awesome-align
1. Input size limit. 2. Generate alignments with words along with indices
run_train.py: Skip parallel instances that have more than 512 tokens when combined. This avoids exceeding the input length limit of Transformer models.
run_align.py: In addition to the word-index output file, this module now generates a second output file containing the words themselves: args.output_file (alignments shown as word indices) and args.output_file+".outtxt" (alignments shown as words). A sketch of the index-to-word conversion follows below.
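For illustration, here is a minimal sketch of that index-to-word conversion. This is illustrative, not the actual run_align.py code, and the "<sep>" separator between aligned words is an assumption:

def indices_to_words(src_sent, tgt_sent, align_line, sep="<sep>"):
    # "0-0 1-1" plus the two sentences -> "srcword<sep>tgtword ..."
    src_words, tgt_words = src_sent.split(), tgt_sent.split()
    pairs = []
    for ij in align_line.split():
        i, j = map(int, ij.split("-"))
        pairs.append(src_words[i] + sep + tgt_words[j])
    return " ".join(pairs)

# indices_to_words("das Haus", "the house", "0-0 1-1")
# -> "das<sep>the Haus<sep>house"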
Thanks for the contribution!
-
I actually considered that case in https://github.com/neulab/awesome-align/blob/master/run_train.py#L170-L189: I cut each sentence to max_len/2 when combining them, so this change might not be necessary?
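A rough sketch of that existing behavior (variable names here are illustrative; the real code is at the linked lines):

half_len = max_len // 2
ids_src = ids_src[:half_len]  # cut each side to half the limit
ids_tgt = ids_tgt[:half_len]
combined = ids_src + ids_tgt  # at most max_len tokens (special tokens aside)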
-
That looks OK, but could you make it an option? Also, could you implement it in the word_align function (https://github.com/neulab/awesome-align/blob/master/run_align.py#L81)?
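For example, the requested option could look something like this. The flag name --max_tokens_per_pair and its placement inside word_align are assumptions, not existing code:

parser.add_argument("--max_tokens_per_pair", type=int, default=None,
                    help="Skip sentence pairs whose combined subword "
                         "length exceeds this value.")

# inside word_align, before feeding a pair to the model:
if (args.max_tokens_per_pair is not None
        and len(ids_src) + len(ids_tgt) > args.max_tokens_per_pair):
    continue  # skip this over-long pair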
This is messing too much with the workings of distributed training (which is broken anyway). Setting environment variables to the rank of the GPU? Python code should not set environment variables; that is exactly why they are environment variables. GPU selection should not happen through a flag but via CUDA_VISIBLE_DEVICES. args.spc_gpu is not necessary at all.
I agree that DDP is broken and should be fixed, but not by making things even more complex. Probably something like this:
if args.n_gpu > 1 or args.local_rank != -1:
    guides = model.module.get_aligned_word(
        examples_src, examples_tgt, bpe2word_map_src, bpe2word_map_tgt,
        args.device, src_len, tgt_len, align_layer=args.align_layer,
        extraction=args.extraction, softmax_threshold=args.softmax_threshold)
else:
    guides = model.get_aligned_word(
        examples_src, examples_tgt, bpe2word_map_src, bpe2word_map_tgt,
        args.device, src_len, tgt_len, align_layer=args.align_layer,
        extraction=args.extraction, softmax_threshold=args.softmax_threshold)
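Since the two branches differ only in whether the model is wrapped, an equivalent and slightly tighter form would unwrap once (the hasattr check is a common PyTorch idiom, not code from this repo):

unwrapped = model.module if hasattr(model, "module") else model
guides = unwrapped.get_aligned_word(
    examples_src, examples_tgt, bpe2word_map_src, bpe2word_map_tgt,
    args.device, src_len, tgt_len, align_layer=args.align_layer,
    extraction=args.extraction, softmax_threshold=args.softmax_threshold)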
@BramVanroy True, thanks. I undid the changes.
@zdou0830 Regarding 1 (input size limit): when I use longer sentences, I get an error:
result = self.forward(*input, **kwargs)
File "/media/mano/1E66BB6D66BB4475/UPV/code/mt-ner-code/mt-ner-20210211T132629Z-001/mt-ner/awesome-align/modeling.py", line 178, in forward
position_embeddings = self.position_embeddings(position_ids)
File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
return F.embedding(
File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
I printed the tokenizer's max_len and it is 1000000000000. In the code, we can use config.max_position_embeddings instead of tokenizer.max_len.
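For example, with the transformers API (the truncation call here is the assumed fix, not code from this repo):

from transformers import AutoConfig, AutoTokenizer

model_name = "bert-base-multilingual-cased"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

max_len = config.max_position_embeddings  # 512 for BERT models
long_sentence = " ".join(["word"] * 2000)
# Truncate to the position-embedding limit instead of tokenizer.max_len,
# which defaults to 1000000000000 when the model does not set it.
ids = tokenizer.encode(long_sentence, truncation=True, max_length=max_len)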
Regarding 2 (generate alignments with words along with indices): I updated the code accordingly.