stanza
Stanza needs better documentation
Is your feature request related to a problem? Please describe. It is really really hard to do any advanced stuff with stanza. The documentation only explains the most basic usage, but nothing else.
Describe the solution you'd like This package really needs a documented API (like the one in sklearn, for instance, based on docstrings). I am trying to create custom processors for the pipeline, and the documentation has only 2 examples, which don't explain what things mean or what I can do.
I had to dive into the source code to try to understand what is happening. However, the code does not help much and there are no docstrings at all, so I end up having to guess and check by trial and error. That really slows down anything I want to try.
Apart from a documented API, good examples of the functionality would also help people a lot.
Additional context If stanza wants to become a good NLP alternative in Python, it has to take itself more seriously. If users can barely understand what is happening, they will switch to simpler alternatives.
PS: Sorry if this sounds like a rant. I understand that this is relatively new and I really love some features and models, that is why I'm opening the issue
Admittedly there's not a lot of doc on how to create an entirely new processor. There is this, which tells you how to create a simple processor:
https://stanfordnlp.github.io/stanza/pipeline.html#building-your-own-processors-and-using-them-in-the-neural-pipeline
If you want to load models as part of the pipeline initialization, that is also possible. You can look at stanza/pipeline/sentiment_processor.py for an example.
https://github.com/stanfordnlp/stanza/blob/master/stanza/pipeline/sentiment_processor.py
If that doesn't work, please let us know where you get stuck, and we'll try to expand on the documentation. In the meantime, I'll try to factor out the documentation and add a bit on how to make the processors load models.
it has to take itself more seriously
I don't think that's necessary or helpful.
Yes, I saw that example. What I would like to know is whether I can select the order of the processors in the pipeline. For example, having a custom processor before the depparser.
Yes, there's no reason you can't do that. You'll need to specify the processors manually anyway in order to use your new processor. I expect this will work:
stanza.Pipeline("lang", processors="tokenize,mwt,pos,lemma,custom,depparse")
Here depparse has been loaded before my processor. I have also used the debugger, and when I call process, it also runs before mine. I guess I'm doing something wrong when creating the processor.
Yes, I see the issue. Apparently there is a section which hard codes the order to load new processors. Presumably this is to enforce the dependencies (the "lemma" depends on "tokenize" dependencies, not the depparse dependencies). In fact, there may be a subtle bug here - if you add two new processors, one which depends on the other, they may or may not wind up in the right order.
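To make the ordering problem concrete, here is a minimal sketch of dependency-driven ordering, as opposed to a hard-coded load order. This is not stanza's code: the `REQUIRES` table, the `load_order` helper, and the `custom_a` / `custom_b` processor names are all made up for illustration, and the dependency of depparse on `custom_b` is an assumption standing in for "I want my processor to run before depparse". It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical requirement table (NOT taken from stanza): each processor
# name maps to the set of processors that must run before it.
# "custom_a" and "custom_b" are made-up custom processors.
REQUIRES = {
    "tokenize": set(),
    "mwt": {"tokenize"},
    "pos": {"tokenize", "mwt"},
    "lemma": {"tokenize", "mwt", "pos"},
    "custom_a": {"lemma"},
    "custom_b": {"custom_a"},
    "depparse": {"tokenize", "mwt", "pos", "lemma", "custom_b"},
}


def load_order(requested, requires=REQUIRES):
    """Return the requested processors in an order that respects `requires`."""
    requested = list(requested)
    unknown = [name for name in requested if name not in requires]
    if unknown:
        raise ValueError(f"unknown processors: {unknown}")
    wanted = set(requested)
    # Keep only dependencies that were actually requested, so optional
    # steps can be left out without breaking the sort.
    graph = {name: requires[name] & wanted for name in requested}
    # TopologicalSorter takes {node: predecessors}; static_order() yields
    # every node after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


# The order in which names are requested does not matter.
order = load_order("depparse,custom_b,custom_a,lemma,pos,mwt,tokenize".split(","))
print(order)
```

With an approach like this, two new processors where one depends on the other always end up in the right order, regardless of how the processors string is written. A cycle in the requirements would raise `graphlib.CycleError` instead of silently picking an order.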
For a long time I've wanted to separate out some of the modules built into models, such as having the pretrained embeddings separate from the pos, depparse, and ner. This would require a minor rewrite of the module loading in the pipeline so that the separated modules get silently loaded and put in the right order. That would also be a good time to fix your issue. I think I probably won't try to change this until I make that change, though.
In the meantime, I can see two possibilities:
- you want the "cool" processor to change the document in a way that affects the results the depparse processor produces. In that case, you might look into making a "cooldepparse" processor variant. https://stanfordnlp.github.io/stanza/pipeline.html#processor-variants If that doesn't work for your use case or if you get stuck, please circle back and we'll figure it out.
- the "cool" processor doesn't affect the depparse results, in which case hopefully it is okay to have it appear after the depparse processor
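The idea behind the first option can be sketched in plain Python: fold the custom step into the parsing step itself, so the pipeline's load order no longer matters. This is only an illustration of the pattern and deliberately does not use stanza's classes or registration mechanism (see the processor variants link above for those). `BaseDepparse`, `cool_step`, `CoolDepparse`, and the dict-based document are all hypothetical stand-ins.

```python
class BaseDepparse:
    """Stand-in for an existing parsing step; it only records its input."""

    def process(self, doc):
        doc["parsed_from"] = doc["text"]
        return doc


def cool_step(doc):
    """Hypothetical custom step that has to run before parsing."""
    doc["text"] = doc["text"].lower()
    return doc


class CoolDepparse:
    """Variant-style wrapper: run the custom step, then the wrapped parser."""

    def __init__(self, base, pre_step):
        self.base = base
        self.pre_step = pre_step

    def process(self, doc):
        # The custom step is guaranteed to run first because it lives
        # inside the same process() call as the parser.
        return self.base.process(self.pre_step(doc))


doc = CoolDepparse(BaseDepparse(), cool_step).process({"text": "Hello World"})
print(doc["parsed_from"])
```

The trade-off is that the custom logic is now tied to depparse, so it cannot be reused as a standalone pipeline step without extra work.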
On Thu, Jan 28, 2021 at 3:28 AM Guillem García Subies wrote:
Here depparse has been loaded before my processor. I have also used the debugger, and when I call process, it also runs before mine. I guess I'm doing something wrong when creating the processor.
[image: image] https://user-images.githubusercontent.com/37592763/106132252-04cb7a80-6164-11eb-8625-3102e0449874.png
The first suggestion worked, thank you.