gensim icon indicating copy to clipboard operation
gensim copied to clipboard

[WIP] Clean up of FlsaModel: fixing bugs + formatting + efficiency

Open piskvorky opened this issue 2 years ago • 15 comments

Fixes #3423. Supersedes #3435, #3436.

This is still work-in-progress and needs finishing up. Namely:

  1. Missing user-friendly docstrings and overall model motivation: what is this, who should use it? What do the various parameters mean?
  2. As input, accept standard streaming corpora in the bag-of-words (BoW) format. Drop all the in-memory handling of the entire corpus in RAM as "list of list of strinks" and "scipy DOK matrix", that doesn't scale.
  3. Complete the cleanup of the code formatting that I started. Especially use more helpful error messages in ValueErrors, showing what values are expected vs what the user supplied.
  4. Related to that, focus all the parameter validation to a single place in code = the module entrypoints where users pass in these parameters. Currently the checks (even the same checks?) appear in multiple places, even in internal methods, where we should be in control of what the input values are, so we're doublechecking ourselves which makes no sense.

piskvorky avatar Jan 20 '23 11:01 piskvorky

CC @ERijck are you able to continue and finish this up?

All the points above, plus all the FIXME notes I left in the code, must be resolved if we are to keep FlsaModel in Gensim.

piskvorky avatar Jan 20 '23 11:01 piskvorky

@piskvorky yes, I will do that.

ERijck avatar Jan 20 '23 13:01 ERijck

Finishing up 1, 3 and 4 will be a great start. I can then assist with 2 (input streaming), to bring flsamodel in line with the rest of Gensim.

piskvorky avatar Jan 20 '23 13:01 piskvorky

To get up to speed with Git, I followed the Codecademy Git&Github pro course today. Afterwards, I just tried to fetch and merge the work in your branch. To do so, I used the following:

image

I assumed to see your code when opening flsamodel.py. However, this is not the case. Then, I tried the following steps:

image

This does not work. Which command can I use to pull cbfd972257f83d2d64803059e6585c00184f784c refs/heads/flsa_fixes?

ERijck avatar Jan 23 '23 16:01 ERijck

Yeah git can be frustrating when you're starting out.

Probably best to discard any existing mess in your local fork and start fresh:

git checkout develop && git fetch upstream && git reset --hard upstream/develop  # discard local changes in your develop branch, if any.
git branch -D flsa_fixes  # delete your existing local flsa_fixes branch, if any.
git checkout -b my_flsa_fixes  # create a new local branch for your changes, named "my_flsa_fixes"
git reset --hard upstream/flsa_fixes  # set the content of "my_flsa_fixes" to match the remote "flsa_fixes", to begin with.

At that point you should be at commit cbfd97225 on branch my_flsa_fixes so you can make your changes and commit them and push them into your Github fork repository.

When your changes are ready for review, open a new pull request (PR) from your my_flsa_fixes branch against Gensim's flsa_fixes branch. You can do this from Github's UI, no need for CLI at this point.

Let me know how it goes :)

piskvorky avatar Jan 23 '23 17:01 piskvorky

Thank you @piskvorky, I will follow your steps!

ERijck avatar Jan 24 '23 06:01 ERijck

Hi guys,

I have been checking licensing in some of my projects and I got FuzzyTM+pyFUME popping up in one using gensim. If correctly, they are following GPL, importing them in gensim would make gensim GPL as well, rather than LGPL.

Are you aware of this? If I'm wrong concerning the licensing, please let me know.

Thanks!

victox5 avatar Feb 14 '23 18:02 victox5

Plus, FuzzyTM is a GPL2/3 license which has a strong copy left requirement. Recently we let poetry update all our dependencies and our corporate scan tool reported a high concern to us with the dependency scan. We would not be able to continue to use Gensim if that library stayed in (I believe this would be the case for most companies/organizations where their IP is in software.) (ahh, I see @victox5 comment on this now as well)

damonmerrill avatar Mar 08 '23 15:03 damonmerrill

Gensim itself has a strong copy left license too – LGPL. I'm afraid freeloading corporate concerns are not our primary motivator when choosing dependencies.

We offer a commercial (paid) dual licensing for such cases.

piskvorky avatar Mar 08 '23 16:03 piskvorky

ahh, thanks for the clarification. A mis-understanding on my part with gensims (RaRe-Technologies) position. The company I work for would gladly purchase commercial licensing as needed.

damonmerrill avatar Mar 08 '23 17:03 damonmerrill

@damonmerrill that would be great – we welcome contributions on all levels: https://github.com/sponsors/piskvorky

piskvorky avatar Mar 09 '23 12:03 piskvorky

I note that the license link in the file points at LGPLv3 instead of LGPLv2.1, that should get updated.

pabs3 avatar Mar 13 '23 09:03 pabs3

@ERijck can you please fix the merge conflict & update the LGPL link as per @pabs3 's comment above? Thanks.

piskvorky avatar May 10 '23 09:05 piskvorky

Yes, I will do this tomorrow!

ERijck avatar May 10 '23 14:05 ERijck

See PR #3471 where I apply the required changes to flsa_fixes

ERijck avatar May 11 '23 11:05 ERijck

We don't have the bandwidth to keep pushing this along.

mpenkov avatar Jun 11 '24 11:06 mpenkov