
1. Input size limit. 2. Generate alignments with words along with indices

gitlost-murali opened this issue 4 years ago · 3 comments

run_train.py: Skip parallel instances that exceed 512 tokens when the source and target are combined, since that exceeds the input limit of transformer models. run_align.py: In addition to the word-index output file, this module now generates a second output file containing the aligned words themselves.

run_align.py now generates two files: args.output_file (alignments shown as word indices) and args.output_file + ".outtxt" (alignments shown as word pairs).
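
For illustration, here is a minimal sketch of the mapping from the index file to the word file (the helper name and the separator are hypothetical, not the PR's actual code):

    # Hypothetical helper: map "i-j" index pairs to the actual word pairs.
    def indices_to_words(src_sent, tgt_sent, alignment_line):
        src_words = src_sent.split()
        tgt_words = tgt_sent.split()
        pairs = []
        for pair in alignment_line.split():
            i, j = map(int, pair.split("-"))
            # The separator is illustrative; any unambiguous delimiter works.
            pairs.append(f"{src_words[i]}<sep>{tgt_words[j]}")
        return " ".join(pairs)

    # "0-0 1-2" over "das Haus" / "the big house"
    print(indices_to_words("das Haus", "the big house", "0-0 1-2"))
    # -> das<sep>the Haus<sep>house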

gitlost-murali avatar Feb 16 '21 12:02 gitlost-murali

Thanks for the contribution!

  1. I actually considered that case in https://github.com/neulab/awesome-align/blob/master/run_train.py#L170-L189: basically, I cut each sentence to max_len/2 before combining them (see the sketch after this list), so that change might not be necessary?

  2. That looks OK, but could you make it an option? Also, could you implement it in the word_align function (https://github.com/neulab/awesome-align/blob/master/run_align.py#L81)?
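
A minimal sketch of that truncation idea (names are illustrative, not the repo's exact code; real code would also reserve room for special tokens):

    # Cap each side at half the model limit so the concatenated
    # pair never exceeds max_len tokens.
    max_len = 512
    half = max_len // 2  # 256 token ids per side

    def combine(ids_src, ids_tgt):
        ids_src = ids_src[:half]
        ids_tgt = ids_tgt[:half]
        return ids_src + ids_tgt  # length <= max_len

    assert len(combine(list(range(600)), list(range(600)))) == max_len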

zdou0830 avatar Feb 17 '21 21:02 zdou0830

This is messing too much with the workings of distributed training (which is broken anyway). Setting environment variables to the rank of the GPU? Python should not set environment variables; that's why they are environment variables... GPU selection should not happen through a flag but via CUDA_VISIBLE_DEVICES (e.g. `CUDA_VISIBLE_DEVICES=0 python run_train.py ...`). args.spc_gpu is not necessary at all.

I agree that DDP is broken and should be fixed, but not by making things even more complex. Probably something like this:

        # DataParallel (n_gpu > 1) and DistributedDataParallel (local_rank != -1)
        # wrap the model, so the underlying module lives at model.module.
        if args.n_gpu > 1 or args.local_rank != -1:
            guides = model.module.get_aligned_word(examples_src, examples_tgt, bpe2word_map_src, bpe2word_map_tgt,
                                                   args.device, src_len, tgt_len, align_layer=args.align_layer,
                                                   extraction=args.extraction,
                                                   softmax_threshold=args.softmax_threshold)
        else:
            guides = model.get_aligned_word(examples_src, examples_tgt, bpe2word_map_src, bpe2word_map_tgt,
                                            args.device, src_len, tgt_len, align_layer=args.align_layer,
                                            extraction=args.extraction, softmax_threshold=args.softmax_threshold)
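
The only real difference between the two branches is the `.module` indirection: both `torch.nn.DataParallel` and `torch.nn.parallel.DistributedDataParallel` wrap the model, so a custom method like `get_aligned_word` is only reachable through `model.module` once wrapped. It could be collapsed further by picking the target once, e.g. `aligner = model.module if (args.n_gpu > 1 or args.local_rank != -1) else model`.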

BramVanroy avatar Mar 20 '21 13:03 BramVanroy

@BramVanroy True, thanks. I undid the changes.

@zdou0830 Regarding 1 (the input size limit): when I feed in longer sentences, I get the following error:

    result = self.forward(*input, **kwargs)
  File "/media/mano/1E66BB6D66BB4475/UPV/code/mt-ner-code/mt-ner-20210211T132629Z-001/mt-ner/awesome-align/modeling.py", line 178, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
    return F.embedding(
  File "/home/mano/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I printed the tokenizer's max_len and it is 1000000000000, i.e. effectively unbounded, so nothing gets truncated and position ids beyond the model's position-embedding table (512 for BERT) trigger the IndexError above.

In the code, we can use config.max_position_embeddings instead of tokenizer.max_len.
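
A minimal sketch of that substitution (the model name is just an example; max_position_embeddings is the standard Hugging Face config attribute):

    from transformers import AutoConfig

    # Read the real positional limit from the model config instead of
    # tokenizer.max_len, which can be a huge sentinel value.
    config = AutoConfig.from_pretrained("bert-base-multilingual-cased")
    max_len = config.max_position_embeddings  # 512 for BERT-style models
    print(max_len)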

Regarding 2 (generate alignments with words along with indices): I updated the code accordingly.

gitlost-murali avatar Mar 22 '21 13:03 gitlost-murali