Shane Carroll comments

Results: 7 comments by Shane Carroll

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

@okuchaiev It's probably worth waiting another week. I got rid of the hacky char tokenizer and cleaned some things up; I'll make sure it still works and check it all in...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

@okuchaiev It's probably in a reasonable place for a review. I presume my decision to use a character-level language model will be controversial, but it works. Some things aren't done...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

If the character-level language model is too constraining (I think it is), I have an alternative branch that uses arbitrary subword tokenization and LM, but generates character-level predictions in the...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

@ekmb thanks for the feedback. The token-based model that makes character-level predictions is in the branch `pcs2`. A better description can be found in this model card: https://huggingface.co/1-800-BAD-CODE/pcs_multilang_bert_base. I now...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

> Hi @1-800-BAD-CODE, are there any updates on this PR?

I have:
* Matured the branch that uses regular subwords, and moved on from the character-based LM constraints
* Got...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

@ekmb This is probably as far as I should take it on my own. Recent updates focus primarily on single-pass training and inference, as well as reducing the amount of...

[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection

I'm ok with letting this one die. The code turned out more complicated than I prefer.