More info on configuration options
Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?
- max_align=5
- top_k=3
- win=5
- skip=-0.1
- margin=True
- len_penalty=True
- is_split=False
Thank you in advance! Rachele
max_align sets the largest alignment type that is allowed (alignment types are 1:1, 1:2, etc.). With max_align=5, the allowed alignments are 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.
top_k is the number of nearest target neighbors retrieved for each source sentence in the first-step alignment.
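To make the first step more concrete, here is a minimal sketch of a top-k neighbor search over sentence vectors. This is not bertalign's actual implementation: the function names `cosine` and `top_k_neighbors` and the toy 2-D vectors are assumptions for illustration, and a real run would use sentence embeddings instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(src_vecs, tgt_vecs, k=3):
    """For each source vector, return the indices of the k most similar target vectors."""
    neighbors = []
    for s in src_vecs:
        # Score every target against this source, then keep the k best.
        scored = [(cosine(s, t), j) for j, t in enumerate(tgt_vecs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy vectors standing in for sentence embeddings (hypothetical data).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]]

print(top_k_neighbors(src, tgt, k=2))  # [[0, 2], [1, 2]]
```

A larger k keeps more candidate targets per source sentence, at the cost of more work in the later step.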
win is the search window of dynamic programming in the second-step alignment.
skip is the predefined similarity score for 1:0 and 0:1 alignments. If your corpus contains many omissions and insertions, you can set this to a larger value, e.g. skip=0.
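To illustrate why a larger skip produces more 1:0 and 0:1 alignments, here is a toy dynamic program. It is a simplified sketch, not bertalign's code: it only allows 1:1, 1:0 and 0:1 steps, it ignores win and the first-step candidates, and the function `align` and the similarity matrix are hypothetical.

```python
def align(sim, skip=-0.1):
    """Toy DP over sim[i][j] (source i vs target j) with 1:1, 1:0 and 0:1 steps."""
    n, m = len(sim), len(sim[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # 1:1 step, scored by sentence similarity
                cand = score[i - 1][j - 1] + sim[i - 1][j - 1]
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, (1, 1)
            if i > 0:  # 1:0 step (source sentence left unaligned)
                cand = score[i - 1][j] + skip
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, (1, 0)
            if j > 0:  # 0:1 step (target sentence left unaligned)
                cand = score[i][j - 1] + skip
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, (0, 1)
    # Follow the back pointers from the end to recover the chosen steps.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        path.append((di, dj))
        i, j = i - di, j - dj
    path.reverse()
    return path


# The second sentence pair is a poor match (slightly negative similarity).
sim = [[0.9, 0.1],
       [0.1, -0.05]]

strict = align(sim, skip=-0.1)
loose = align(sim, skip=0.0)
print(strict)  # [(1, 1), (1, 1)]
print(loose)   # now includes a (1, 0) and a (0, 1) step
```

With skip=-0.1, leaving both sentences unaligned costs more than pairing them badly, so the weak pair is kept as 1:1. With skip=0, skipping is free and the weak pair is dropped.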
margin=True uses the modified cosine similarity proposed in https://doi.org/10.1093/llc/fqac089.
len_penalty considers the length difference between source and target sentences when calculating similarity between sentence pairs.
is_split=True means the corpus has already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.
Hi. Is there a way to specify max_align with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments.
EDIT: Never mind, modifying get_alignment_types or hardcoding second_alignment_types seems to have done the trick.
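For anyone with the same need, a minimal sketch of what such a replacement list might look like. The helper name `one_to_many_types` is hypothetical, and the assumption that alignment types are represented as [source_count, target_count] pairs is mine; check what get_alignment_types returns in your version before swapping anything in.

```python
def one_to_many_types(n):
    """Return [src, tgt] pairs for 1:1, 1:2, ..., 1:n and 2:1, ..., n:1.

    Excludes 1:0, 0:1 and all many-to-many types.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    types = [[1, 1]]
    for k in range(2, n + 1):
        types.append([1, k])  # one source sentence, k target sentences
        types.append([k, 1])  # k source sentences, one target sentence
    return types


print(one_to_many_types(3))  # [[1, 1], [1, 2], [2, 1], [1, 3], [3, 1]]
```

If the aligner expects a NumPy array rather than a plain list, wrapping the result in an array would be the obvious adaptation.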