awesome-align

A neural word aligner based on multilingual BERT

15 awesome-align issues

Hi! This is a really great tool and it's been fun using it. I am trying to train the model 'bert-base-multilingual-uncased' using a tokenized dataset in the correct format. But...

Added multiprocessing to the LineByLineTextDataset class, since tokenizer.prepare_for_model takes a lot of time to process large datasets
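
The change described above amounts, roughly, to tokenizing lines in worker processes instead of serially. A minimal sketch under that assumption (the file name, pool size, and the helper `worker_tokenize` are illustrative, not the PR's actual code):

```python
# Illustrative sketch: parallelize per-line tokenization with multiprocessing.
from multiprocessing import Pool

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

def worker_tokenize(line):
    # Each worker handles one "source ||| target" pair independently.
    src, tgt = line.split(" ||| ")
    ids_src = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(src))
    ids_tgt = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tgt))
    # prepare_for_model adds special tokens; this is the slow step for large corpora.
    return (
        tokenizer.prepare_for_model(ids_src)["input_ids"],
        tokenizer.prepare_for_model(ids_tgt)["input_ids"],
    )

if __name__ == "__main__":
    with open("train.src-tgt") as f:
        lines = [l.strip() for l in f if l.strip()]
    with Pool(processes=8) as pool:
        examples = pool.map(worker_tokenize, lines, chunksize=1000)
```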

Hi awesome-align team, First, thanks for the great tool. It has really great potential. I am following your Colab demo, and I tried to align English to Arabic. Here are...

enhancement

`run_train.py`: Skip parallel instances that have more than 512 tokens when combined. This is a problem considering the input limit of transformers. `run_align.py`: In addition to the word indices output...
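
A minimal sketch of the kind of length filter described above (the `combined_length` helper and file names are illustrative; only the 512-token limit comes from the issue):

```python
# Illustrative sketch: drop parallel instances whose combined length exceeds the model limit.
from transformers import AutoTokenizer

MAX_TOKENS = 512  # mBERT's maximum input length

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def combined_length(line):
    src, tgt = line.split(" ||| ")
    # +3 roughly accounts for the [CLS]/[SEP] special tokens around the pair.
    return len(tokenizer.tokenize(src)) + len(tokenizer.tokenize(tgt)) + 3

with open("train.src-tgt") as fin, open("train.filtered.src-tgt", "w") as fout:
    for line in fin:
        line = line.strip()
        if line and combined_length(line) <= MAX_TOKENS:
            fout.write(line + "\n")
```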

Is it possible to load a seq-to-seq model to make word alignments with this work? I'm stuck on getting proper out_src and out_tgt layers to work with for the next step. I...

Hi! You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multibert_cased. With the multibert I didn't have any...

Ideally I'd like to keep the model in memory and call it with something approaching the syntax used by SimAlign:

```python
myaligner = SentenceAligner(model="model_path", token_type="bpe", **model_parameters)
# ... and later...
```
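
For reference, SimAlign's documented usage looks roughly like this (the parameter values follow SimAlign's README; they are not part of awesome-align):

```python
# Illustrative SimAlign usage: the aligner stays in memory and is reused per sentence pair.
from simalign import SentenceAligner

myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

src = ["this", "is", "a", "test", "."]
tgt = ["das", "ist", "ein", "Test", "."]

# Returns a dict mapping each matching method to a list of (src_index, tgt_index) pairs.
alignments = myaligner.get_word_aligns(src, tgt)
```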

enhancement

Hello, I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs and it takes 4 hours to process them before...

Hi, dear Ziyi~ I found that in your code the BERT output weights are not set to be the same as the input embeddings, which can be seen [here](https://github.com/neulab/awesome-align/blob/5f150d45bbe51e167daf0a84abebaeb07c3323d1/awesome_align/modeling.py#L374) (in detail,...
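
The behavior being asked about is standard weight tying; a minimal sketch of what tying looks like in PyTorch (generic BERT-style code, not a quote of awesome_align/modeling.py):

```python
# Illustrative sketch: tie the MLM output projection to the input embedding matrix.
import torch.nn as nn

vocab_size, hidden_size = 30522, 768  # example mBERT-like dimensions

embeddings = nn.Embedding(vocab_size, hidden_size)
decoder = nn.Linear(hidden_size, vocab_size, bias=False)

# With tying, the output layer reuses the embedding weights, so gradients from the
# prediction head also update the input embeddings. If this line is absent, the two
# matrices are learned independently.
decoder.weight = embeddings.weight
```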

Hello, Your README states: > Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples...
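
Based on that description, a line of the input file would look something like this (illustrative example, not taken from the README):

```
we have a cat . ||| nous avons un chat .
```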