Hunalign gives very high priority to matching numbers (e.g. "Book 1" to "Chapter 1")

Open kyegupov opened this issue 6 years ago • 0 comments

I have two files structured like this:

The Author
The Book Name

Book I

The introduction text.

Chapter 1 The Beginning

The first sentence.

La autoro
La nomo di libro

Libro 1

La prefaca texto.

Chapitro I La Komenco

La unesma frazo.

The result of hunalign -text is:

The Author	La autoro	0.266667
The Book Name ~~~ 	La nomo di libro	0
Book I		-0.3
 ~~~ The introduction text.		-0.3
		0.3
Chapter 1 The Beginning	Libro 1	10.7
		0.3
The first sentence.	La prefaca texto. ~~~ 	0
	Chapitro I La Komenco	-0.3
		0.3
	La unesma frazo.	-0.3
		0.3

That is, "Chapter 1 The Beginning" is incorrectly matched to "Libro 1" with a very high confidence score (10.7), which skews the whole alignment. Apparently this happens just because the number "1" appears on both sides.
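To spot rows like this in a larger output, a small post-processing check might help. The sketch below is not part of hunalign. It assumes the three tab-separated columns seen in the output above (source, target, score). The helper names and the 1.0 score threshold are hypothetical and chosen only for illustration.

```python
def parse_text_output(text):
    """Parse rows of the form source<TAB>target<TAB>score (assumed layout)."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not a three-column row
        source, target, score = parts
        rows.append((source.strip(), target.strip(), float(score)))
    return rows


def shared_tokens(source, target):
    """Whitespace-delimited tokens that appear verbatim on both sides."""
    return set(source.split()) & set(target.split())


def number_only_matches(rows, min_score=1.0):
    """Rows at or above min_score whose only shared tokens are digit strings.

    min_score is an arbitrary illustrative threshold, not a hunalign setting.
    """
    flagged = []
    for source, target, score in rows:
        common = shared_tokens(source, target)
        if score >= min_score and common and all(tok.isdigit() for tok in common):
            flagged.append((source, target, score))
    return flagged


# Three rows copied from the output in this issue.
sample = "\n".join([
    "The Author\tLa autoro\t0.266667",
    "Chapter 1 The Beginning\tLibro 1\t10.7",
    "The first sentence.\tLa prefaca texto. ~~~ \t0",
])

suspicious = number_only_matches(parse_text_output(sample))
print(suspicious)
```

On the sample rows, only the "Chapter 1 The Beginning" / "Libro 1" pair is flagged, because "1" is the single token the two sides have in common.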

kyegupov, Jan 21 '19 01:01