ParlaMint
FI: significant Swedish text not marked as such
While doing MT, Taja noticed that quite a lot of the text in the FI transcriptions is in fact Swedish but is not marked as such. This is especially bad for MT, because the Finnish model is applied to everything marked as Finnish, which here includes the Swedish text, so the Swedish text remains untranslated. A quick count (the Swedish words in the MTed corpus are analysed as unknown PoS, i.e. 'X') shows that this affects 1,743,576 tokens (7.6%). Obviously this can't be corrected for 4.0, so setting it to the Future milestone.
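For anyone who wants to reproduce a count like the one above, here is a minimal sketch. It is not the script that produced the figure in this issue. It assumes the MTed corpus is available as CoNLL-U lines with the UPOS tag in the fourth column; the function name `count_unknown_pos` and the sample sentences are made up for illustration.

```python
def count_unknown_pos(lines, tag="X"):
    """Return (tokens tagged `tag`, total tokens) for CoNLL-U lines.

    Assumption: tab-separated CoNLL-U with UPOS in column 4.
    Comment lines, blank lines, multiword-token ranges (IDs like "1-2")
    and empty nodes (IDs like "1.1") are not counted as tokens.
    """
    tagged = 0
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not a token line
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # range or empty node, not a syntactic word
        total += 1
        if cols[3] == tag:
            tagged += 1
    return tagged, total


# Hypothetical sample: one Swedish sentence analysed as X, one Finnish word.
sample = [
    "# text = Tack så mycket",
    "1\tTack\tTack\tX\t_\t_\t0\troot\t_\t_",
    "2\tså\tså\tX\t_\t_\t1\tflat\t_\t_",
    "3\tmycket\tmycket\tX\t_\t_\t1\tflat\t_\t_",
    "",
    "# text = Kiitos",
    "1\tKiitos\tkiitos\tNOUN\t_\t_\t0\troot\t_\t_",
]

tagged, total = count_unknown_pos(sample)
print(f"{tagged} of {total} tokens tagged X ({100 * tagged / total:.1f}%)")
```

Note that 'X' can also be assigned to tokens that are not Swedish, so a count like this is only a rough estimate of the amount of unmarked Swedish text.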