see if we can migrate to either nltk or wn and not both

Open leondz opened this issue 1 year ago • 5 comments

c.f. https://github.com/leondz/garak/pull/764#discussion_r1688505031

  • Some garak plugins use nltk.wordnet, others use wn.
  • These two use different APIs
  • It would be good to reduce our footprint by requiring as few packages as possible
  • Consider updating the relevant code (autodan for nltk, topic for wn) to unify on just one of these packages

leondz • Aug 15 '24 15:08

nltk is the broader library and we use it for a few other things IIRC. I would suggest wn is the one that should be removed, which will require us to refactor topic slightly.

erickgalinkin • Oct 25 '24 19:10

indeed. annoyingly, wn and nltk provide interfaces with different functionality, and there's a core part built on wn that needs to be rewritten.

this PR is probably also a fine place to build an nltk api mixin, to unify nltk access following the pattern that uses our cache dir.
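
For illustration only, the mixin idea might look roughly like the sketch below. The class name NLTKAccessMixin, its method names, and the default cache location are hypothetical and not taken from garak's code; the only real NLTK calls assumed are nltk.data.path, nltk.data.find, and nltk.download.

```python
from pathlib import Path


class NLTKAccessMixin:
    """Hypothetical mixin that routes all NLTK data access through one cache dir."""

    # Assumed default location; a real plugin would read this from project config.
    nltk_cache_dir: Path = Path.home() / ".cache" / "garak" / "data" / "nltk_data"

    def _nltk(self):
        """Import nltk lazily and make sure the cache dir is searched first."""
        import nltk

        self.nltk_cache_dir.mkdir(parents=True, exist_ok=True)
        cache = str(self.nltk_cache_dir)
        if cache not in nltk.data.path:
            nltk.data.path.insert(0, cache)
        return nltk

    def ensure_nltk_resource(self, resource_path: str, package: str) -> None:
        """Download `package` into the cache dir only if `resource_path` is missing."""
        nltk = self._nltk()
        try:
            nltk.data.find(resource_path)
        except LookupError:
            nltk.download(package, download_dir=str(self.nltk_cache_dir), quiet=True)
```

A plugin inheriting from this could call self.ensure_nltk_resource("corpora/wordnet", "wordnet") before touching nltk.corpus.wordnet, so every plugin shares one download location.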

leondz • Oct 30 '24 18:10

Hi, I'm currently working on the next version of Wn and came across this issue when I searched GitHub to see how Wn's dependents are using the library. I'm not trying to influence your decision about whether or not to continue using it, but I am curious about what you found difficult, if you don't mind sharing.

Regarding the API differences with the NLTK: the NLTK's wordnet module is built around the original Princeton WordNet in the WNDB format, whereas Wn works with the newer WN-LMF XML resources. The different features between WNDB and WN-LMF (explicitly modeled senses, interlingual indices, pronunciations, etc.) necessitate a slightly different API.

If you are only looking up synonyms, hypernyms, and hyponyms in English, either the NLTK or Wn will work well. I did notice you were getting alternative forms with Wn, and AFAIK that feature is not available in the NLTK: https://github.com/NVIDIA/garak/blob/e599eb0e6545ac3cfb00de1fcbb92c1939bc6d05/garak/probes/topic.py#L64-L67
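
To make the API difference concrete, here is a small side-by-side sketch (not garak's code). The two function names are hypothetical; the first expects nltk.corpus.wordnet and the second a wn.Wordnet instance, and both rely only on the synsets, hypernyms, hyponyms, lemma_names, words, and forms calls as I understand them.

```python
from itertools import chain


def related_terms_nltk(wordnet, term):
    """Collect hypernym and hyponym lemma names via the NLTK-style API.

    `wordnet` is expected to be `nltk.corpus.wordnet`.
    """
    found = set()
    for synset in wordnet.synsets(term):
        for neighbour in chain(synset.hypernyms(), synset.hyponyms()):
            found.update(neighbour.lemma_names())
    return found


def related_terms_wn(lexicon, term):
    """Collect hypernym and hyponym word forms via the Wn-style API.

    `lexicon` is expected to be a `wn.Wordnet` instance.
    """
    found = set()
    for synset in lexicon.synsets(term):
        for neighbour in chain(synset.hypernyms(), synset.hyponyms()):
            for word in neighbour.words():
                # forms() covers the lemma plus any alternative forms
                found.update(word.forms())
    return found
```

The traversal is the same in both; the Wn version has one extra level (synset, then word, then forms), which is where the alternative forms come from.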

goodmami • Nov 21 '24 05:11

Oh, hey! Cool software, thank you for it. Happy to elaborate.

  1. Preamble - can we get some means of suppressing the admittedly very pretty progress bars

  2. Prefer wn's API. It'd even be nice if either nltk's or wn's possible API actions were a strict superset of the other's

  3. Our main goal with this issue is, like many projects, to reduce dependency count

  4. External resources we use depend on nltk, so it might not go out immediately, though I suspect it is overkill compared to the functions we actually use

  5. Maybe nltk.wn is better off being replaced by wn? But given nltk's status today (I know it's been like fifteen years since I last committed..) a strong advocate and proactive project member may be hard to find

  6. One thing we like to do is reuse data. Each external dep that does downloads needs its own wrapper to fit in our system, which I guess adds pressure to #2.

leondz • Nov 21 '24 19:11

@leondz thanks!

  1. Preamble - can we get some means of suppressing the admittedly very pretty progress bars

If you want to simply suppress the progress bars, you can pass None:

wn.download(lexicon, progress_handler=None)

It looks like you are currently customizing the format of the default progress handler: https://github.com/NVIDIA/garak/blob/e599eb0e6545ac3cfb00de1fcbb92c1939bc6d05/garak/probes/topic.py#L99-L101

You can also subclass the basic wn.util.ProgressHandler class if you want something more custom, such as what the Wordbook app does for a GUI progress bar.

If the above are not sufficient and you want a more persistent way to disable/change the progress bar, I might be able to add a setting to the wn.config object?

  2. Prefer wn's API. It'd even be nice if either nltk's or wn's possible API actions were a strict superset of the other's

I appreciate that. I tried to keep Wn's API similar to the NLTK's when it made sense, but I also didn't restrict myself when there was an opportunity to improve things. I have an old, but nearly complete shim module on the nltk branch to replicate the NLTK's API in Wn. If there is interest, I might be able to finish that up and merge it.

  3. Our main goal with this issue is, like many projects, to reduce dependency count
  4. External resources we use depend on nltk, so it might not go out immediately, though I suspect it is overkill compared to the functions we actually use

I totally understand and sympathize with the desire to keep dependencies minimal.

  5. Maybe nltk.wn is better off being replaced by wn? But given nltk's status today (I know it's been like fifteen years since I last committed..) a strong advocate and proactive project member may be hard to find

Originally there was a plan for Wn to be such a replacement, but since then Eric Kafe has been putting in some nice work to update the NLTK's wordnet module, so a merge seems less likely.

  6. One thing we like to do is reuse data. Each external dep that does downloads needs its own wrapper to fit in our system, which I guess adds pressure to 2.

I can't do much about this, unfortunately. Wn doesn't use the same resources as the NLTK, so I can't just point it to the nltk_data/ directory. I tried to make it easy for people to configure the download directory, though.
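
As a rough sketch of what such a wrapper could look like on the Wn side: the function name load_lexicon and its arguments are hypothetical, and the sketch assumes that wn.config.data_directory is settable and that wn.Wordnet raises wn.Error when the lexicon is not yet in the local database.

```python
from pathlib import Path


def load_lexicon(lexicon: str, data_dir: Path):
    """Return a wn.Wordnet for `lexicon`, keeping Wn's data under `data_dir`."""
    import wn  # imported lazily so this module loads even without Wn installed

    data_dir.mkdir(parents=True, exist_ok=True)
    # Point Wn at the caller's cache directory instead of its default location.
    wn.config.data_directory = str(data_dir)
    try:
        return wn.Wordnet(lexicon)
    except wn.Error:
        # Assumed failure mode for a missing lexicon: fetch it with no progress bar.
        wn.download(lexicon, progress_handler=None)
        return wn.Wordnet(lexicon)
```

For example, load_lexicon("oewn:2023", cache_dir / "wn") would download on first use and read from the cache afterwards, where cache_dir stands for whatever directory the host project already uses.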

Thanks again for taking the time to share your thoughts!

goodmami • Nov 21 '24 23:11