bertalign

More info on configuration options

Open • RacheleSprugnoli opened this issue 2 years ago • 2 comments

Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?

  • max_align=5
  • top_k=3
  • win=5
  • skip=-0.1
  • margin=True
  • len_penalty=True
  • is_split=False

Thank you in advance! Rachele

RacheleSprugnoli • Jul 24 '23 09:07

max_align caps the size of the alignment types (1:1, 1:2, etc.). With max_align=5, the allowed alignments include 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.
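To make the list above concrete, here is a minimal sketch that enumerates alignment types under the assumption that max_align bounds the total number of source plus target sentences in one alignment. The function name `enumerate_alignment_types` is hypothetical, and this is not bertalign's own code. Under this assumption, max_align=5 also yields 1:3, 3:1, 1:4, and 4:1, which the list above does not mention, so check the result against `get_alignment_types` in the library before relying on it.

```python
def enumerate_alignment_types(max_align):
    """Hypothetical helper: list (n_src, n_tgt) pairs with n_src + n_tgt <= max_align.

    ASSUMPTION: max_align limits the sum of source and target sentences
    in a single alignment. This is inferred from the thread, not confirmed.
    """
    # Deletions (1:0) and insertions (0:1) are always included.
    types = [(1, 0), (0, 1)]
    for n_src in range(1, max_align):
        for n_tgt in range(1, max_align):
            if n_src + n_tgt <= max_align:
                types.append((n_src, n_tgt))
    return types


types_for_5 = enumerate_alignment_types(5)
print(types_for_5)
```

Raising max_align grows this list roughly quadratically, which is why a larger value helps with complex alignments but makes the search more expensive.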

top_k is the number of nearest target neighbors retrieved for each source sentence in the first-step alignment.
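As an illustration of what a top-k neighbor search does, the sketch below ranks target sentence vectors by cosine similarity for each source vector and keeps the k best. It uses toy 2-D vectors and pure Python; the names `cosine` and `top_k_targets` are hypothetical, and the library's actual search backend and embeddings are not shown in this thread.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_targets(src_vecs, tgt_vecs, k):
    """For each source vector, return the indices of its k most similar targets."""
    neighbors = []
    for s in src_vecs:
        scored = [(cosine(s, t), j) for j, t in enumerate(tgt_vecs)]
        best = heapq.nlargest(k, scored)  # highest similarity first
        neighbors.append([j for _, j in best])
    return neighbors


# Toy embeddings (made up for illustration only).
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
tgt_vecs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.0]]
neighbors = top_k_targets(src_vecs, tgt_vecs, k=2)
print(neighbors)
```

A larger k gives the second step more candidate anchors to work with, at the cost of a wider search.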

win is the size of the dynamic-programming search window in the second-step alignment.
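One way to picture a search window: instead of scoring every source/target cell, dynamic programming only visits target positions within a fixed distance of an anchor found in the first step. The sketch below is a simplified illustration under that assumption; `window_bounds` is a hypothetical helper, and the exact way bertalign derives its search path is not described in this thread.

```python
def window_bounds(anchor, win, tgt_len):
    """Return the half-open range [lo, hi) of target indices to search.

    ASSUMPTION: the window is centered on an anchor target index taken
    from the first-step alignment and clipped to the document bounds.
    """
    lo = max(0, anchor - win)
    hi = min(tgt_len, anchor + win + 1)
    return lo, hi


# With win=5 and 100 target sentences, each source sentence
# considers at most 11 target positions instead of all 100.
print(window_bounds(0, 5, 100))
print(window_bounds(50, 5, 100))
print(window_bounds(98, 5, 100))
```

A wider window tolerates first-step anchors that are further off, but increases the number of cells the second step has to score.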

skip is the predefined similarity score for 1:0 and 0:1 alignments. If your corpus contains many omissions and insertions, you can set this to a larger value, e.g. skip=0.
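The effect of skip can be seen in a toy dynamic program that only knows 1:1, 1:0, and 0:1 moves and maximizes the total score. This is a deliberately simplified sketch, not bertalign's scoring: the function `align_toy` and the similarity values are made up for illustration. With skip=-0.1 a weak pair is still aligned 1:1, while with skip=0 it is cheaper to leave both sentences unaligned.

```python
def align_toy(sim, skip):
    """Return (score, moves) of the best path through a similarity matrix.

    sim[i][j] is the similarity of source sentence i and target sentence j.
    Each 1:0 or 0:1 move earns the fixed score `skip`.
    """
    n, m = len(sim), len(sim[0])
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, [])
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # align source i-1 with target j-1
                score, moves = best[i - 1][j - 1]
                candidates.append((score + sim[i - 1][j - 1], moves + ["1:1"]))
            if i > 0:  # leave source i-1 unaligned
                score, moves = best[i - 1][j]
                candidates.append((score + skip, moves + ["1:0"]))
            if j > 0:  # leave target j-1 unaligned
                score, moves = best[i][j - 1]
                candidates.append((score + skip, moves + ["0:1"]))
            best[i][j] = max(candidates, key=lambda c: c[0])
    return best[n][m]


# The second pair is a poor match (negative similarity).
sim = [[0.9, 0.0],
       [0.0, -0.15]]

_, moves_default = align_toy(sim, skip=-0.1)
_, moves_zero = align_toy(sim, skip=0.0)
print(moves_default)
print(moves_zero)
```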

margin=True uses the modified cosine similarity proposed in https://doi.org/10.1093/llc/fqac089.
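For readers unfamiliar with margin-based scoring, the sketch below shows one common "distance" form: the raw cosine similarity of a pair minus the average similarity each sentence has to its k nearest candidates. This is a generic illustration with a hypothetical `margin_score` function; the exact neighborhood and formula bertalign uses are defined in the linked paper and may differ.

```python
import heapq


def margin_score(sim, i, j, k):
    """Raw similarity of (i, j) minus the mean of the two k-nearest averages.

    sim is a matrix of cosine similarities (rows: source, columns: target).
    ASSUMPTION: a generic distance-margin form, not bertalign's exact formula.
    """
    row = sim[i]                       # similarities of source i to all targets
    col = [r[j] for r in sim]          # similarities of target j to all sources
    row_avg = sum(heapq.nlargest(k, row)) / min(k, len(row))
    col_avg = sum(heapq.nlargest(k, col)) / min(k, len(col))
    return sim[i][j] - (row_avg + col_avg) / 2


# Source 0 is similar to every target, so its raw 0.9 is discounted.
sim = [[0.9, 0.8],
       [0.3, 0.2]]
score = margin_score(sim, 0, 0, k=2)
print(score)
```

The point of the adjustment is that a sentence which looks similar to everything should not win purely on raw cosine similarity.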

len_penalty=True takes the length difference between source and target sentences into account when calculating the similarity of sentence pairs.
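A length penalty of this kind is typically a factor between 0 and 1 that shrinks the similarity when one side is much longer than the other. The sketch below uses log2(1 + shorter/longer) as one plausible form; the function name `length_penalty` and the choice of formula are assumptions for illustration, not confirmed by this thread.

```python
import math


def length_penalty(src_len, tgt_len):
    """Factor in (0, 1]: 1.0 for equal lengths, smaller as lengths diverge.

    ASSUMPTION: lengths are measured in any consistent unit
    (characters, tokens, bytes) and the formula is illustrative.
    """
    shorter, longer = sorted((src_len, tgt_len))
    if longer == 0:
        return 1.0  # two empty segments: nothing to penalize
    return math.log2(1 + shorter / longer)


equal = length_penalty(40, 40)
half = length_penalty(20, 40)
print(equal, half)

# The penalty would then scale a raw similarity, e.g. 0.8 * half.
```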

If is_split=True, the corpus is assumed to have already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.
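Putting the options together, a call might look like the sketch below. The option names and default values come from the question above; passing them as keyword arguments to `Bertalign`, the `align()` and `print_sents()` calls, and the one-sentence-per-line input format for is_split=True are assumptions based on the project's README and should be verified against the installed version. The library call is left commented out so the snippet runs without bertalign installed.

```python
# Pre-split input: one sentence per line (assumed format for is_split=True).
src_text = "Hello world.\nHow are you?"
tgt_text = "Bonjour le monde.\nComment allez-vous ?"

# Defaults from the question, except is_split, which is switched on here
# because the texts above are already split.
options = dict(
    max_align=5,       # cap on alignment size
    top_k=3,           # nearest target neighbors per source sentence
    win=5,             # second-step search window
    skip=-0.1,         # score for 1:0 and 0:1 alignments
    margin=True,       # modified cosine similarity
    len_penalty=True,  # account for length differences
    is_split=True,     # input is already split into sentences
)

# ASSUMPTION: constructor and method names as in the project's README.
# from bertalign import Bertalign
# aligner = Bertalign(src_text, tgt_text, **options)
# aligner.align()
# aligner.print_sents()

print(len(src_text.splitlines()), len(tgt_text.splitlines()))
```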

bfsujason • Jul 24 '23 16:07

Hi. Is there a way to specify max_align with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments.

EDIT: nvm, modifying get_alignment_types or hardcoding second_alignment_types seems to have done the trick.
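For anyone with the same use case, here is a rough sketch of what such a restricted list of alignment types could look like. The function `one_to_many_types` and its `max_n` parameter are hypothetical; how the result has to be wired into get_alignment_types or second_alignment_types (for example, as a list or an array) depends on the library's code and is not shown here.

```python
def one_to_many_types(max_n):
    """Hypothetical helper: only 1:1, 1:n, and n:1 alignments, with n <= max_n.

    Excludes 1:0, 0:1, and many-to-many types such as 2:2.
    """
    types = [(1, 1)]
    for n in range(2, max_n + 1):
        types.append((1, n))  # one source sentence to n target sentences
        types.append((n, 1))  # n source sentences to one target sentence
    return types


restricted = one_to_many_types(4)
print(restricted)
```

Note that without 1:0 and 0:1, every sentence must end up in some alignment, so this only suits corpora with no omissions or insertions.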

jdough1982 • May 12 '24 12:05