hunalign
Hunalign gives very high priority to matching numbers (e.g. "Book 1" to "Chapter 1")
I have two files, a source text and its translation, structured like this (source first, then translation):
The Author
The Book Name
Book I
The introduction text.
Chapter 1 The Beginning
The first sentence.
La autoro
La nomo di libro
Libro 1
La prefaca texto.
Chapitro I La Komenco
La unesma frazo.
The result of hunalign -text is:
The Author La autoro 0.266667
The Book Name ~~~ La nomo di libro 0
Book I -0.3
~~~ The introduction text. -0.3
0.3
Chapter 1 The Beginning Libro 1 10.7
0.3
The first sentence. La prefaca texto. ~~~ 0
Chapitro I La Komenco -0.3
0.3
La unesma frazo. -0.3
0.3
I.e., "Chapter 1" is incorrectly matched to "Libro 1" (with a very high confidence score of 10.7), skewing the whole alignment, apparently just because the number "1" appears on both sides.
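To check that hypothesis on the sample above, here is a small diagnostic sketch that lists every source/target line pair sharing a digit token. It does not reproduce hunalign's scoring; the helper names (number_tokens, shared_number_pairs) are made up for illustration, and it only assumes that a "number" means a run of ASCII digits.

```python
import re

# The two sample files from this report, one entry per line.
source = [
    "The Author",
    "The Book Name",
    "Book I",
    "The introduction text.",
    "Chapter 1 The Beginning",
    "The first sentence.",
]
target = [
    "La autoro",
    "La nomo di libro",
    "Libro 1",
    "La prefaca texto.",
    "Chapitro I La Komenco",
    "La unesma frazo.",
]


def number_tokens(line):
    """Return the set of digit runs in a line (assumption: ASCII digits only)."""
    return set(re.findall(r"\d+", line))


def shared_number_pairs(src, tgt):
    """List (source_index, target_index, shared_numbers) for every pair of
    lines that have at least one digit token in common."""
    pairs = []
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            common = number_tokens(s) & number_tokens(t)
            if common:
                pairs.append((i, j, sorted(common)))
    return pairs


for i, j, nums in shared_number_pairs(source, target):
    print(f"{source[i]!r} <-> {target[j]!r} share {nums}")
```

On this sample the only pair reported is "Chapter 1 The Beginning" with "Libro 1", which is exactly the pair hunalign scored 10.7. The Roman numerals in "Book I" and "Chapitro I" are not digit tokens, so the correct pairs share no number at all.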