Michael Kohler

87 comments by Michael Kohler

Yes, certainly would be an option, but that would need to be implemented. Overall this would mean going over the sentences multiple times for the case where it won't find...

> sorting sentences by length can help performance. Mh, this made me think. Now I wonder if the legal requirement is just "maximum 3 sentences per article" or if there...

Right now it's fully random, but rejecting what does not fit the rules. So generally, by analyzing the full Wikipedia dump, you could optimize the minimum words rule to get...
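To make "fully random, but rejecting what does not fit the rules" concrete, here is a minimal Python sketch. It is not the extractor's actual code: `pick_sentences`, `min_words_rule`, and the limit of 3 per article are hypothetical names and values chosen for illustration, based only on the "maximum 3 sentences per article" and "minimum words" rules mentioned in this thread.

```python
import random


def min_words_rule(min_words):
    """Hypothetical rule: accept sentences with at least `min_words` words."""
    return lambda sentence: len(sentence.split()) >= min_words


def pick_sentences(sentences, is_valid, max_per_article=3, rng=None):
    """Visit sentences in random order and keep the first valid ones.

    Sentences that fail `is_valid` are rejected and never revisited,
    so an article can yield fewer than `max_per_article` sentences.
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(sentences)  # copy, so the caller's list stays untouched
    rng.shuffle(shuffled)

    picked = []
    for sentence in shuffled:
        if is_valid(sentence):
            picked.append(sentence)
            if len(picked) == max_per_article:
                break
    return picked
```

Raising `min_words` in such a rule rejects more sentences per article, which is the trade-off that tuning against a full dump would have to balance.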

@jessicarose Analogous to the other question I tagged you in, could you also check here if we in theory would be allowed to always take the 3 *longest* sentences per...

@HarikalarKutusu Thanks for keeping track of this. I agree. Do you know what the correct value for EN would be and then we set that as default? And do you...

> but you might need to point to them in case somebody decides on a re-run... I can try to keep this in mind :)

Thanks for your efforts here. This shows perfectly well how broken the sentence segmentation is for some languages :( There's #11 already on file for this issue. I've also created...

@bact I've created a proof of concept to use a Python-based sentence splitting algorithm, to make sure that the Sentence Extractor can also be used for languages that `rust-punkt`...

The segmenter PR has now been merged, check out https://github.com/common-voice/cv-sentence-extractor#using-a-different-segmenter-to-split-sentences for more info. Looking forward to hearing if that helps with Thai :)

Seems not to work in the initial comment, as that doesn't count as a newly created issue comment. Should work here though: /action blocklist sv 80