
Find and integrate a text tokenisation library.

Open saurabhshri opened this issue 8 years ago • 2 comments

The current implementation of text tokenisation is naive and misses many cases. A good tokenisation library should be able to generate all possible text tokens for currency, dates, numbers, symbols, etc.

For example, the line

In 1996, 1996 people sent emails at someone @ example . com at 1:30 PM.

should expand to

In nineteen ninety six, one thousand nine hundred and ninety six people sent emails at someone at example dot com at one thirty p m

and all the alternative versions.
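To make the "alternative versions" idea concrete, here is a minimal sketch of expanding a single numeric token into more than one spoken form. It is only an illustration: `twoDigits`, `cardinal`, and `spokenForms` are hypothetical names, not part of CCAligner, srtparser.h, or any existing library, and the sketch handles only plain integers from 0 to 9999. A real library would also need to cover currency, dates, times, symbols, and so on.

```cpp
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not CCAligner or library code.
static const char* const kOnes[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
static const char* const kTens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"};

// Spell out an integer in the range 0..99.
static std::string twoDigits(int n) {
    if (n < 20) return kOnes[n];
    std::string out = kTens[n / 10];
    if (n % 10 != 0) out += std::string(" ") + kOnes[n % 10];
    return out;
}

// Spell out an integer in the range 0..9999 as a cardinal number.
static std::string cardinal(int n) {
    if (n < 100) return twoDigits(n);
    std::string out;
    if (n >= 1000) {
        out = std::string(kOnes[n / 1000]) + " thousand";
        n %= 1000;
    }
    if (n >= 100) {
        if (!out.empty()) out += " ";
        out += std::string(kOnes[n / 100]) + " hundred";
        n %= 100;
    }
    if (n > 0) out += " and " + twoDigits(n);
    return out;
}

// Return every spoken form this sketch knows for a token.
// Tokens that are not plain integers of at most four digits are
// returned unchanged as the only alternative.
std::vector<std::string> spokenForms(const std::string& token) {
    if (token.empty() || token.size() > 4 ||
        token.find_first_not_of("0123456789") != std::string::npos) {
        return {token};
    }
    const int n = std::stoi(token);
    std::vector<std::string> forms;
    // Year-style reading ("nineteen ninety six"); skipped when the last
    // two digits are below ten, where a pairwise reading sounds unnatural.
    if (n >= 1000 && n % 100 >= 10) {
        forms.push_back(twoDigits(n / 100) + " " + twoDigits(n % 100));
    }
    forms.push_back(cardinal(n));
    return forms;
}
```

With this sketch, `spokenForms("1996")` yields both "nineteen ninety six" and "one thousand nine hundred and ninety six", while a token such as "1:30" is passed through untouched because time handling is out of scope here.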

The library needs to be integrated into the subtitle parser (srtparser.h).

saurabhshri, Sep 30 '17 18:09

https://github.com/google/sparrowhawk

nshmyrev, Oct 13 '17 22:10

@nshmyrev Thanks! That looks really nice! :)

saurabhshri, Oct 20 '17 04:10