see if we can migrate to either nltk or wn and not both
c.f. https://github.com/leondz/garak/pull/764#discussion_r1688505031
- Some garak plugins use nltk.wordnet, others use wn.
- These two use different APIs
- It would be good to reduce our footprint by requiring as few packages as possible
- Consider updating the relevant code (autodan for nltk, topic for wn) to unify on just one of these packages
nltk is the broader library and, IIRC, we use it for a few other things. I would suggest wn is the one to remove, which will require us to refactor topic slightly.
indeed. annoyingly, `wn` and `nltk` provide interfaces with different functionality, and there's a core part built on `wn` that needs to be rewritten.
this PR is probably also a fine place to build an `nltk` API mixin, to unify nltk access following the pattern that uses our cache dir.
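One possible shape for such a mixin, as a rough sketch only. The class name `NLTKDataMixin`, the default `nltk_data_dir`, and the helper method names are hypothetical and not existing garak code; the only library calls used are `nltk.data.path`, `nltk.data.find`, and `nltk.download`.

```python
from pathlib import Path


class NLTKDataMixin:
    """Hypothetical mixin that routes all nltk data access through one cache dir."""

    # Assumed location; a real implementation would derive this from the
    # project's configured cache directory.
    nltk_data_dir = Path.home() / ".cache" / "garak" / "data" / "nltk_data"

    def _nltk(self):
        # Import lazily so plugins that never touch nltk do not pay for it.
        import nltk

        data_dir = str(self.nltk_data_dir)
        if data_dir not in nltk.data.path:
            nltk.data.path.insert(0, data_dir)  # search our cache dir first
        return nltk

    def _ensure_nltk_resource(self, locator: str, package: str) -> None:
        """Download `package` into the cache dir unless `locator` already resolves."""
        nltk = self._nltk()
        try:
            nltk.data.find(locator)
        except LookupError:
            self.nltk_data_dir.mkdir(parents=True, exist_ok=True)
            nltk.download(package, download_dir=str(self.nltk_data_dir), quiet=True)
```

A plugin using this would call something like `self._ensure_nltk_resource("corpora/wordnet", "wordnet")` before touching `nltk.corpus.wordnet`.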
Hi, I'm currently working on the next version of Wn and came across this issue when I searched GitHub to see how Wn's dependents are using the library. I'm not trying to influence your decision about whether or not to continue using it, but I am curious about what you found difficult, if you don't mind sharing.
Regarding the API differences with the NLTK: the NLTK's wordnet module is built around the original Princeton WordNet in the WNDB format, whereas Wn works with the newer WN-LMF XML resources. The different features between WNDB and WN-LMF (explicitly modeled senses, interlingual indices, pronunciations, etc.) necessitate a slightly different API.
If you are only looking up synonyms, hypernyms, and hyponyms in English, either the NLTK or Wn will work well. I did notice you were getting alternative forms with Wn, and AFAIK that feature is not available in the NLTK: https://github.com/NVIDIA/garak/blob/e599eb0e6545ac3cfb00de1fcbb92c1939bc6d05/garak/probes/topic.py#L64-L67
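To make the overlap concrete, here is a hedged side-by-side sketch of the same lookup in both libraries. The function names, the `normalize` helper, and the default lexicon `oewn:2023` are assumptions for illustration, and both functions need the relevant WordNet data to be downloaded beforehand. Only the Wn version picks up alternative forms, via `Word.forms()`.

```python
from typing import Set


def normalize(name: str) -> str:
    # nltk joins multiword lemmas with underscores; use spaces instead.
    return name.replace("_", " ")


def related_terms_nltk(term: str) -> Set[str]:
    from nltk.corpus import wordnet  # lazy import; needs the "wordnet" corpus

    terms: Set[str] = set()
    for synset in wordnet.synsets(term):
        # the synset itself (synonyms), plus its hypernyms and hyponyms
        for related in [synset, *synset.hypernyms(), *synset.hyponyms()]:
            terms.update(normalize(name) for name in related.lemma_names())
    return terms


def related_terms_wn(term: str, lexicon: str = "oewn:2023") -> Set[str]:
    import wn  # lazy import; needs the lexicon to be downloaded already

    terms: Set[str] = set()
    for synset in wn.Wordnet(lexicon).synsets(term):
        for related in [synset, *synset.hypernyms(), *synset.hyponyms()]:
            for word in related.words():
                # forms() covers the lemma plus any alternative forms
                terms.update(normalize(str(form)) for form in word.forms())
    return terms
```

The two functions are not guaranteed to return identical sets, since the underlying resources (Princeton WordNet in WNDB versus a WN-LMF lexicon) differ.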
Oh, hey! Cool software, thank you for it. Happy to elaborate.
- Preamble - can we get some means of suppressing the admittedly very pretty progress bars?
- Prefer wn's API. It'd be even nicer if either nltk's or wn's API were a strict superset of the other
- Our main goal with this issue is, like many projects, to reduce dependency count
- External resources we use depend on nltk, so it might not go out immediately, though I suspect it is overkill compared to the functions we actually use
- Maybe nltk.wn is better off being replaced by wn? But given nltk's status today (I know it's been like fifteen years since I last committed..) a strong advocate and proactive project member may be hard to find
- One thing we like to do is reuse data. Each external dep that does downloads needs its own wrapper to fit in our system, which I guess adds pressure to #2.
@leondz thanks!
- Preamble - can we get some means of suppressing the admittedly very pretty progress bars
If you want to simply suppress the progress bars, you can pass None:
wn.download(lexicon, progress_handler=None)
It looks like you are currently customizing the format of the default progress handler: https://github.com/NVIDIA/garak/blob/e599eb0e6545ac3cfb00de1fcbb92c1939bc6d05/garak/probes/topic.py#L99-L101
You can also subclass the basic wn.util.ProgressHandler class if you want something more custom, such as what the Wordbook app does for a GUI progress bar.
If the above are not sufficient and you want a more persistent way to disable/change the progress bar, I might be able to add a setting to the wn.config object?
- Prefer wn's API. It'd be even nicer if either nltk's or wn's API were a strict superset of the other
I appreciate that. I tried to keep Wn's API similar to the NLTK's when it made sense, but I also didn't restrict myself when there was an opportunity to improve things. I have an old, but nearly complete shim module on the nltk branch to replicate the NLTK's API in Wn. If there is interest, I might be able to finish that up and merge it.
- Our main goal with this issue is, like many projects, to reduce dependency count
- External resources we use depend on nltk, so it might not go out immediately, though I suspect it is overkill compared to the functions we actually use
I totally understand and sympathize with the desire to keep dependencies minimal.
- Maybe nltk.wn is better off being replaced by wn? But given nltk's status today (I know it's been like fifteen years since I last committed..) a strong advocate and proactive project member may be hard to find
Originally there was a plan for Wn to be such a replacement, but since then Eric Kafe has been putting in some nice work to update the NLTK's wordnet module, so a merge seems less likely.
- One thing we like to do is reuse data. Each external dep that does downloads needs its own wrapper to fit in our system, which I guess adds pressure to 2.
I can't do much about this, unfortunately. Wn doesn't use the same resources as the NLTK, so I can't just point it to the nltk_data/ directory. I tried to make it easy for people to configure the download directory, though.
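As a minimal sketch of how the two suggestions in this reply could fit together on the caller's side: point Wn at a project-owned directory via `wn.config.data_directory`, and download with `progress_handler=None` only when the lexicon is missing. The function name `open_wordnet` and the `wn_module` parameter (which only exists so the logic can be exercised with a stub) are hypothetical; it assumes `wn.Wordnet` raises `wn.Error` for a lexicon that has not been added.

```python
from pathlib import Path
from typing import Any, Optional


def open_wordnet(lexicon: str, data_dir: Path, wn_module: Optional[Any] = None) -> Any:
    """Open `lexicon` from `data_dir`, downloading it quietly on first use."""
    if wn_module is None:
        import wn as wn_module  # lazy import of the real library

    # Keep Wn's database and downloads inside the directory we control.
    data_dir.mkdir(parents=True, exist_ok=True)
    wn_module.config.data_directory = str(data_dir)

    try:
        return wn_module.Wordnet(lexicon)
    except wn_module.Error:
        # Lexicon not available yet: fetch it without a progress bar, then retry.
        wn_module.download(lexicon, progress_handler=None)
        return wn_module.Wordnet(lexicon)
```

A caller might use it as `open_wordnet("oewn:2023", cache_dir / "data" / "wn")`, where `cache_dir` stands in for whatever cache location the project already uses.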
Thanks again for taking the time to share your thoughts!