tshatrov comments

Results: 22 comments of tshatrov

Limits of hiragana-based romanisation

Yeah, this sounds like a good idea. One problem, though: in the JMdict database, hiragana readings are not separated by kanji, so this wasn't possible to implement at the time I...

Limits of hiragana-based romanisation

@tslater I think if you do `(setf ichiran:*default-romanization-method* ichiran:*hepburn-basic*)` _before_ building the executable, then `-f` will use basic romanization.

Katakana proper nouns are being split up

Yeah, it doesn't parse proper nouns at all, because they aren't in JMdict. There isn't a word フレッド, but there is a word レッド. There could be all sorts of...
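To illustrate why a missing dictionary entry leads to a split, here is a minimal sketch of a greedy longest-match segmenter over a toy word set. This is not ichiran's actual algorithm (which scores candidate segmentations); the function name `segment` and the toy dictionaries are assumptions made purely for illustration.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (illustrative sketch only).

    At each position, take the longest dictionary word that starts there;
    if nothing matches, fall back to a single character.
    """
    pieces = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        pieces.append(match)
        i += len(match)
    return pieces


# Toy dictionary that knows レッド but not フレッド, as in the comment above.
without_name = segment("フレッド", {"レッド"})
# Hypothetical dictionary that also contains the proper noun.
with_name = segment("フレッド", {"レッド", "フレッド"})
print(without_name, with_name)
```

With only レッド known, the leading フ has nothing to attach to and is emitted on its own, which is the kind of split the issue describes.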

Incorporate jmnedict database

I decided not to do this because it would likely degrade segmenting a lot. Proper nouns can't be consistently romanized anyway. I'll be adding things that *can* be romanized such...

More portable version using SQLite

Well, the main problem is that the [postmodern library](https://github.com/marijnh/Postmodern) which is used to access the database only supports Postgres, and it also happens to be the best db library, and...

More portable version using SQLite

Hm, I don't know, depends on how "invasive" it is to the existing codebase. Also it might make adding new features more difficult as I'd have to test if each...

JSON returned by ichiran/cli

The gloss is available for root words only. It is a list of definitions, each definition is itself a dictionary which has a part of speech (`pos`) and the definition...

Paper/Explanation of algorithm used

There's a [blog post](https://readevalprint.tumblr.com/post/97467849358/who-needs-graph-theory-anyway) about the segmentation algorithm, but the secret sauce is really the scoring algorithm, which was built in an ad-hoc manner over the years to split sentences...

Paper/Explanation of algorithm used

Yes, this is the suite. It's not particularly thorough; I was mostly including corner cases for the segmentations I wanted to fix. https://github.com/tshatrov/ichiran/blob/master/tests.lisp

Whitespace/punctuation inconsistency

The punctuation substitutions are listed here: https://github.com/tshatrov/ichiran/blob/master/characters.lisp#L75 Because Japanese texts don't generally use spaces, I just automatically add spaces after relevant punctuation marks. It's a bit lazy, I guess. Before each...
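As a rough sketch of the approach described, substituting each punctuation mark and appending a space could look like the following. The mapping `PUNCTUATION` and the function `substitute_punctuation` are hypothetical and do not reproduce the actual table in `characters.lisp`.

```python
# Hypothetical mapping for illustration; the real substitutions live in characters.lisp.
PUNCTUATION = {
    "、": ",",
    "。": ".",
    "！": "!",
    "？": "?",
}


def substitute_punctuation(text):
    """Replace each mapped punctuation mark and add a space after it."""
    out = []
    for ch in text:
        if ch in PUNCTUATION:
            out.append(PUNCTUATION[ch] + " ")  # space is always appended
        else:
            out.append(ch)
    return "".join(out)


result = substitute_punctuation("こんにちは、世界。")
print(repr(result))
```

Note that this naive version also leaves a trailing space after sentence-final punctuation, which is one way this kind of whitespace inconsistency can arise.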