Greg Tatum

371 comments by Greg Tatum

Yes please, I've tried to do this already when using the tool.

This issue is labelled as a bug, but from the discussion it sounds like a feature request, since a new transform is being requested. I believe we should either...

In Gecko we're using the `Intl.Segmenter` JavaScript API. For the CJK work in Gecko, we're exploring using this for all of our sentence segmentation needs as well, rather than ssplit....
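For reference, sentence segmentation with `Intl.Segmenter` could look roughly like the sketch below. This is not the Gecko code. `splitSentences` is a hypothetical helper name, and the sketch assumes a runtime and TypeScript `lib` setting (ES2022 or later) that expose `Intl.Segmenter`.

```typescript
// Hypothetical helper: split text into sentences with the built-in
// Intl.Segmenter API. The locale is passed in by the caller.
function splitSentences(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  // Each segment keeps its trailing whitespace, so joining the
  // pieces back together reproduces the original text.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const english = splitSentences("Hello world. How are you?", "en");
console.log(english.length);

const japanese = splitSentences("こんにちは。元気ですか？", "ja");
console.log(japanese.length);
```

Because the segments preserve whitespace, callers that want clean sentences would still need to trim each piece themselves.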

Do we validate that our input data is valid UTF-8 anywhere?
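One way such a check could be done, as a minimal sketch rather than anything the project currently does, is a strict decode with `TextDecoder`. `isValidUtf8` is a hypothetical helper name, and the sketch assumes a runtime where `TextDecoder` is a global (browsers, recent Node.js).

```typescript
// Hypothetical helper: returns true when `bytes` is well-formed UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    // With `fatal: true`, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD replacement characters.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Well-formed: the ASCII bytes for "hi".
console.log(isValidUtf8(new Uint8Array([0x68, 0x69])));
// Malformed: 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
console.log(isValidUtf8(new Uint8Array([0xc3, 0x28])));
```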

I think we can port cyrtranslit to JavaScript and add it to the Gecko implementation. It's MIT licensed, and looks fairly straightforward.

I guess the other issue is that the web is pretty messy, and it's hard to know what script the page is using, especially if it's mixed. If we can...

Two things that could be helpful here are the ICU segmenter (which is equivalent to the powerful `Intl.Segmenter` in JavaScript). I believe there are ICU bindings available in Python. Tutorial:...

With this work we should make sure we utilize the "normalization tables" in SentencePiece. These can augment the default Unicode normalization. This way we don't have to do Gecko mitigations...

I wonder if this could be caused by the decoder being too shallow, or the decoder not being big enough. This could be good to experiment with, and also test...