awesome-align
A neural word aligner based on multilingual BERT
Hi! This is a really great tool and it's been fun using it. I am trying to train the model 'bert-base-multilingual-uncased' using a tokenized dataset in the correct format. But...
Added multiprocessing to the LineByLineTextDataset class, since tokenizer.prepare_for_model takes a lot of time to process large datasets
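For reference, here is a minimal sketch of that idea, not the actual PR code: tokenize each parallel line in a worker pool before handing the IDs to tokenizer.prepare_for_model. The file name, model name, and pool size are placeholders.
```
# Rough sketch of parallel preprocessing; not the actual awesome-align change.
# "train.src-tgt", the model name, and the pool size are placeholders.
from multiprocessing import Pool
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode_line(line):
    src, tgt = line.split(" ||| ")
    src_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(src))
    tgt_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tgt))
    # prepare_for_model adds special tokens and builds the final model inputs
    return (tokenizer.prepare_for_model(src_ids, max_length=512, truncation=True),
            tokenizer.prepare_for_model(tgt_ids, max_length=512, truncation=True))

if __name__ == "__main__":
    with open("train.src-tgt") as f:
        lines = [l.strip() for l in f if " ||| " in l]
    with Pool(processes=8) as pool:
        examples = pool.map(encode_line, lines, chunksize=1000)
```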
Hi awesome-align team, first of all, thanks for the great tool; it has a lot of potential. I am following your Colab demo, and I tried to align English to Arabic. Here are...
`run_train.py`: Skip parallel instances whose source and target exceed 512 tokens when combined, since they run into the input limit of Transformer models. `run_align.py`: In addition to the word-indices output...
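As an illustration of the first point, a hypothetical standalone pre-filter is sketched below; the tokenizer, the margin reserved for special tokens, and the file names are assumptions, not the actual patch.
```
# Illustrative pre-filter, not the actual run_train.py change: drop parallel
# lines whose combined subword count would exceed the 512-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
MAX_LEN = 512

def fits(line):
    src, tgt = line.split(" ||| ")
    n_subwords = len(tokenizer.tokenize(src)) + len(tokenizer.tokenize(tgt))
    return n_subwords + 3 <= MAX_LEN  # leave room for [CLS]/[SEP] special tokens

with open("train.src-tgt") as fin, open("train.filtered", "w") as fout:
    for line in fin:
        if " ||| " in line and fits(line.strip()):
            fout.write(line)
```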
Is it possible to load a seq-to-seq model to produce word alignments with this work? I'm stuck on getting proper out_src and out_tgt layers to work with for the next step. I...
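In case it helps, one way to get per-token representations out of a seq-to-seq checkpoint is to run only its encoder. The sketch below uses google/mt5-small purely as an example model and is not awesome-align's supported path.
```
# Hypothetical sketch: per-token encoder states (out_src / out_tgt style
# matrices) from a seq-to-seq model; "google/mt5-small" is only an example.
import torch
from transformers import AutoModel, AutoTokenizer

name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).get_encoder()
encoder.eval()

def encode(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # shape: (sequence_length, hidden_size)
        return encoder(**inputs).last_hidden_state[0]

out_src = encode("we like music")
out_tgt = encode("wir mögen Musik")
similarity = out_src @ out_tgt.T  # token-level similarity matrix to align over
```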
Hi! You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multibert_cased. With the multibert I didn't have any...
Ideally I'd like to keep the model in memory and call it with something approaching the syntax used by Simalign:
```
myaligner = SentenceAligner(model="model_path", token_type="bpe", **model_parameters)
# ... and later...
```
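For anyone looking for a workaround in the meantime, below is a rough, unofficial sketch of such a wrapper. It follows the general recipe described for awesome-align (multilingual BERT layer-8 embeddings, softmax over the similarity matrix in both directions, intersection above a small threshold), but the class name, defaults, and threshold are my own choices, not an awesome-align API.
```
# Unofficial keep-in-memory aligner sketch; not awesome-align's API.
# Class name, layer index, and threshold are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer

class InMemoryAligner:
    def __init__(self, model_name="bert-base-multilingual-cased", layer=8, threshold=1e-3):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
        self.layer = layer
        self.threshold = threshold

    def _embed(self, words):
        # Tokenize word by word so subword positions can be mapped back to words.
        ids, sub2word = [], []
        for i, word in enumerate(words):
            pieces = self.tokenizer.tokenize(word)
            ids.extend(self.tokenizer.convert_tokens_to_ids(pieces))
            sub2word.extend([i] * len(pieces))
        input_ids = torch.tensor([[self.tokenizer.cls_token_id] + ids + [self.tokenizer.sep_token_id]])
        with torch.no_grad():
            hidden = self.model(input_ids).hidden_states[self.layer]
        return hidden[0, 1:-1], sub2word  # drop [CLS]/[SEP]

    def align(self, src_words, tgt_words):
        src_vec, src_map = self._embed(src_words)
        tgt_vec, tgt_map = self._embed(tgt_words)
        sim = src_vec @ tgt_vec.T
        prob = torch.softmax(sim, dim=-1) * torch.softmax(sim, dim=0)
        pairs = (prob > self.threshold).nonzero(as_tuple=False).tolist()
        return sorted({(src_map[i], tgt_map[j]) for i, j in pairs})

aligner = InMemoryAligner()  # model stays loaded between calls
print(aligner.align("we like music".split(), "wir mögen Musik".split()))
```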
Hello, I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs and it takes 4 hours to process them before...
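Not an official answer, but a common workaround is to run the expensive preprocessing once and cache the result to disk, so later runs just load the cache. The cache path and build function below are placeholders.
```
# Rough caching sketch (not part of awesome-align): preprocess once, then
# reuse the cached tensors on subsequent runs. The cache path is a placeholder.
import os
import torch

CACHE = "train.cache.pt"

def load_or_build(build_fn):
    if os.path.exists(CACHE):
        return torch.load(CACHE)
    examples = build_fn()  # the slow tokenization pass runs only once
    torch.save(examples, CACHE)
    return examples

# usage: examples = load_or_build(my_slow_preprocessing_function)
```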
Hi, dear Ziyi~ I found that in your code, the BERT output weights are not set to be the same as the input embeddings, as can be seen [here](https://github.com/neulab/awesome-align/blob/5f150d45bbe51e167daf0a84abebaeb07c3323d1/awesome_align/modeling.py#L374) (in detail,...
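For context, here is a generic PyTorch illustration of what weight tying looks like; this is not the awesome-align code the link points to, just the standard pattern of reusing the embedding matrix as the output projection.
```
# Generic weight-tying illustration in plain PyTorch; not awesome-align's code.
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, vocab_size=30000, hidden_size=768):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # output weights share the input embedding matrix
        self.decoder.weight = self.embeddings.weight

    def forward(self, hidden_states):
        return self.decoder(hidden_states)  # logits over the vocabulary

head = TiedLMHead()
assert head.decoder.weight is head.embeddings.weight
```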
Hello, your README states: > Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples...
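To make that format concrete, here is a tiny, hypothetical check that each line is a tokenized "source ||| target" pair; the file name and the assumption of spaces around ||| are placeholders.
```
# Tiny format check for the "source ||| target" input described in the README;
# "train.src-tgt" is a placeholder file name.
with open("train.src-tgt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        parts = line.rstrip("\n").split(" ||| ")
        assert len(parts) == 2, f"line {lineno}: expected 'source ||| target'"
        src_tokens, tgt_tokens = parts[0].split(), parts[1].split()
```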