Adding more languages: How to get started?

Open adewes opened this issue 4 years ago • 4 comments

Congrats on writing this great library! I'm thinking about integrating this into our (soon to be released) own open-source toolkit for privacy & security engineering, where we have a component for detecting personal and sensitive information for which the NER functionality would be super interesting. I'd love to use a Golang-only library like prose for this as our tool is also written in Go and integrating a toolkit like Spacy would drastically complicate the deployment of our tool.

How would one go about building a POS/NER model for other languages? I'm happy to have a stab at this and open a PR to add e.g. a German language model, but right now I'm not entirely sure how to get started. My basic understanding is that I could use e.g. Prodigy or a similar tool to label German text and then train a custom POS & NER model. Reading your code, it seems I would also need to add a custom tokenization/stemming routine for German (and possibly a way to make this generic). Is that correct? If you can give me a few pointers on how to get started, I'd be happy to help!

adewes, Jul 03 '20 16:07

It's hard to provide concrete steps for adding more languages because there are still a number of higher-level design considerations to tackle before the library will be ready for this.

That said, there are a few related improvements to be made:

  1. Expose an API for training a POS tagger (similar to the existing one for NER); and
  2. add support for providing custom components (tokenizer, segmenter, etc.).

jdkato, Jul 07 '20 18:07

Ok, I understand. As soon as I have some time I'll go through the code in detail and try to understand it better. I might then make a proposal on how to implement 1. and 2. above, if that sounds good? I'm not sure when I can actually contribute, but NER & POS are on our own roadmap and I'd prefer to support another open-source project rather than develop this ourselves.

adewes, Jul 09 '20 10:07

Hi all!

I was just wondering if there's been any change to this in the past months? I'd like to try moving my NLP stack from Python to Go, but without a tokenizer that can handle multiple languages (in different instances, of course) the idea is kind of DOA.

@adewes did you find something else that fits your needs?

Thanks

the-holger, Apr 24 '21 11:04

Has anyone made any progress on bringing other languages to this library?

tobinski, May 10 '22 15:05