ParlaMint

FI: significant Swedish text not marked as such

Open TomazErjavec opened this issue 1 year ago • 0 comments

While doing MT, Taja noticed that quite a lot of the text in the FI transcriptions is in fact in Swedish but is not marked as such. This is especially bad for MT, which applies the Finnish model to everything marked as Finnish, and here that includes the Swedish text. The result is that the Swedish text remains untranslated. A quick count (the Swedish words in the MTed corpus are analysed as unknown PoS, i.e. 'X') shows that this affects 1,743,576 tokens (7.6%). Obviously this can't be corrected for 4.0, so setting it to the Future milestone.
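For reference, a quick count of this kind could be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the annotated corpus is available as CoNLL-U with the PoS in the fourth (UPOS) column, and the function name `count_upos` and the sample sentence are made up for the example.

```python
from typing import Iterable, Tuple


def count_upos(lines: Iterable[str], tag: str = "X") -> Tuple[int, int]:
    """Return (tokens tagged `tag`, total tokens) for CoNLL-U lines.

    Assumption: tab-separated CoNLL-U, UPOS in the fourth column.
    """
    tagged = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment/metadata lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        total += 1
        if cols[3] == tag:
            tagged += 1
    return tagged, total


# Made-up sample: one Swedish word inside a sentence tagged by a Finnish model.
sample = [
    "# text = Arvoisa puhemies tack.",
    "1\tArvoisa\tarvoisa\tADJ\t_\t_\t_\t_\t_\t_",
    "2\tpuhemies\tpuhemies\tNOUN\t_\t_\t_\t_\t_\t_",
    "3\ttack\ttack\tX\t_\t_\t_\t_\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
]

tagged, total = count_upos(sample)
print(f"{tagged} of {total} tokens tagged X ({100 * tagged / total:.1f}%)")
```

Note that 'X' is only a proxy: it will also catch non-Swedish unknown words, so the figure is an upper-bound style estimate rather than an exact count of Swedish tokens.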

TomazErjavec, Sep 28 '23 08:09