awesome-align
A neural word aligner based on multilingual BERT
Hi! This is a really great tool and it's been fun using it. I am trying to train the model 'bert-base-multilingual-uncased' using a tokenized dataset in the correct format. But...
Added multiprocessing to the LineByLineTextDataset class, since tokenizer.prepare_for_model takes a lot of time to process large datasets
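For reference, here is a minimal sketch of that idea, not the actual PR code: tokenize each parallel line in a worker pool before handing the IDs to tokenizer.prepare_for_model. The file name, model name, and pool size are placeholders.
```
# Rough sketch of parallel preprocessing; not the actual awesome-align change.
# "train.src-tgt", the model name, and the pool size are placeholders.
from multiprocessing import Pool
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode_line(line):
    src, tgt = line.split(" ||| ")
    src_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(src))
    tgt_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tgt))
    # prepare_for_model adds special tokens and builds the final model inputs
    return (tokenizer.prepare_for_model(src_ids, max_length=512, truncation=True),
            tokenizer.prepare_for_model(tgt_ids, max_length=512, truncation=True))

if __name__ == "__main__":
    with open("train.src-tgt") as f:
        lines = [l.strip() for l in f if " ||| " in l]
    with Pool(processes=8) as pool:
        examples = pool.map(encode_line, lines, chunksize=1000)
```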
Hi awesome-align team, first of all, thanks for the great tool; it has a lot of potential. I am following your Colab demo, and I tried to align English to Arabic. Here are...
`run_train.py`: Skip parallel instances whose source and target exceed 512 tokens when combined, since they run into the input limit of Transformer models. `run_align.py`: In addition to the word-indices output...
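As an illustration of the first point, a hypothetical standalone pre-filter is sketched below; the tokenizer, the margin reserved for special tokens, and the file names are assumptions, not the actual patch.
```
# Illustrative pre-filter, not the actual run_train.py change: drop parallel
# lines whose combined subword count would exceed the 512-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
MAX_LEN = 512

def fits(line):
    src, tgt = line.split(" ||| ")
    n_subwords = len(tokenizer.tokenize(src)) + len(tokenizer.tokenize(tgt))
    return n_subwords + 3 <= MAX_LEN  # leave room for [CLS]/[SEP] special tokens

with open("train.src-tgt") as fin, open("train.filtered", "w") as fout:
    for line in fin:
        if " ||| " in line and fits(line.strip()):
            fout.write(line)
```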
Is it possible to load a seq-to-seq model to produce word alignments with this work? I'm stuck on getting proper out_src and out_tgt layers to work with for the next step. I...
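In case it helps, one way to get per-token representations out of a seq-to-seq checkpoint is to run only its encoder. The sketch below uses google/mt5-small purely as an example model and is not awesome-align's supported path.
```
# Hypothetical sketch: per-token encoder states (out_src / out_tgt style
# matrices) from a seq-to-seq model; "google/mt5-small" is only an example.
import torch
from transformers import AutoModel, AutoTokenizer

name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).get_encoder()
encoder.eval()

def encode(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # shape: (sequence_length, hidden_size)
        return encoder(**inputs).last_hidden_state[0]

out_src = encode("we like music")
out_tgt = encode("wir mögen Musik")
similarity = out_src @ out_tgt.T  # token-level similarity matrix to align over
```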
Hi! You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multibert_cased. With the multibert I didn't have any...
Ideally I'd like to keep the model in memory and call it with something approaching the syntax used by Simalign:
```
myaligner = SentenceAligner(model="model_path", token_type="bpe", **model_parameters)
# ... and later...
```
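For anyone looking for a workaround in the meantime, below is a rough, unofficial sketch of such a wrapper. It follows the general recipe described for awesome-align (multilingual BERT layer-8 embeddings, softmax over the similarity matrix in both directions, intersection above a small threshold), but the class name, defaults, and threshold are my own choices, not an awesome-align API.
```
# Unofficial keep-in-memory aligner sketch; not awesome-align's API.
# Class name, layer index, and threshold are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer

class InMemoryAligner:
    def __init__(self, model_name="bert-base-multilingual-cased", layer=8, threshold=1e-3):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
        self.layer = layer
        self.threshold = threshold

    def _embed(self, words):
        # Tokenize word by word so subword positions can be mapped back to words.
        ids, sub2word = [], []
        for i, word in enumerate(words):
            pieces = self.tokenizer.tokenize(word)
            ids.extend(self.tokenizer.convert_tokens_to_ids(pieces))
            sub2word.extend([i] * len(pieces))
        input_ids = torch.tensor([[self.tokenizer.cls_token_id] + ids + [self.tokenizer.sep_token_id]])
        with torch.no_grad():
            hidden = self.model(input_ids).hidden_states[self.layer]
        return hidden[0, 1:-1], sub2word  # drop [CLS]/[SEP]

    def align(self, src_words, tgt_words):
        src_vec, src_map = self._embed(src_words)
        tgt_vec, tgt_map = self._embed(tgt_words)
        sim = src_vec @ tgt_vec.T
        prob = torch.softmax(sim, dim=-1) * torch.softmax(sim, dim=0)
        pairs = (prob > self.threshold).nonzero(as_tuple=False).tolist()
        return sorted({(src_map[i], tgt_map[j]) for i, j in pairs})

aligner = InMemoryAligner()  # model stays loaded between calls
print(aligner.align("we like music".split(), "wir mögen Musik".split()))
```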
Hello, I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs and it takes 4 hours to process them before...
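Not an official answer, but a common workaround is to run the expensive preprocessing once and cache the result to disk, so later runs just load the cache. The cache path and build function below are placeholders.
```
# Rough caching sketch (not part of awesome-align): preprocess once, then
# reuse the cached tensors on subsequent runs. The cache path is a placeholder.
import os
import torch

CACHE = "train.cache.pt"

def load_or_build(build_fn):
    if os.path.exists(CACHE):
        return torch.load(CACHE)
    examples = build_fn()  # the slow tokenization pass runs only once
    torch.save(examples, CACHE)
    return examples

# usage: examples = load_or_build(my_slow_preprocessing_function)
```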
Hi, dear Ziyi~ I found that in your code, the BERT output weights are not set to be the same as the input embeddings, as can be seen [here](https://github.com/neulab/awesome-align/blob/5f150d45bbe51e167daf0a84abebaeb07c3323d1/awesome_align/modeling.py#L374) (in detail,...
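For context, here is a generic PyTorch illustration of what weight tying looks like; this is not the awesome-align code the link points to, just the standard pattern of reusing the embedding matrix as the output projection.
```
# Generic weight-tying illustration in plain PyTorch; not awesome-align's code.
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, vocab_size=30000, hidden_size=768):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # output weights share the input embedding matrix
        self.decoder.weight = self.embeddings.weight

    def forward(self, hidden_states):
        return self.decoder(hidden_states)  # logits over the vocabulary

head = TiedLMHead()
assert head.decoder.weight is head.embeddings.weight
```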
Hello, your README states: > Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples...
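To make that format concrete, here is a tiny, hypothetical check that each line is a tokenized "source ||| target" pair; the file name and the assumption of spaces around ||| are placeholders.
```
# Tiny format check for the "source ||| target" input described in the README;
# "train.src-tgt" is a placeholder file name.
with open("train.src-tgt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        parts = line.rstrip("\n").split(" ||| ")
        assert len(parts) == 2, f"line {lineno}: expected 'source ||| target'"
        src_tokens, tgt_tokens = parts[0].split(), parts[1].split()
```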