gensim
[WIP] Clean up of FlsaModel: fixing bugs + formatting + efficiency
Fixes #3423. Supersedes #3435, #3436.
This is still work-in-progress and needs finishing up. Namely:
1. Missing user-friendly docstrings and overall model motivation: what is this, who should use it? What do the various parameters mean?
2. As input, accept standard streaming corpora in the bag-of-words (BoW) format. Drop all the in-memory handling of the entire corpus in RAM as "list of lists of strings" and "scipy DOK matrix"; that doesn't scale.
3. Complete the cleanup of the code formatting that I started. In particular, use more helpful error messages in ValueErrors, showing what values are expected vs. what the user supplied.
4. Related to that, move all the parameter validation to a single place in the code: the module entrypoints where users pass in these parameters. Currently the checks (even the same checks?) appear in multiple places, even in internal methods where we should be in control of the input values, so we're double-checking ourselves, which makes no sense.
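
For illustration only, the last two points above (helpful error messages, validation in a single place) could look roughly like the sketch below. The class name `SketchModel`, the helper `validate_params`, the parameter names and the allowed option values are all hypothetical and are not taken from the actual FlsaModel code.

```python
# Hypothetical option names, used only for this sketch.
ALLOWED_WEIGHTINGS = ("normal", "idf", "entropy")


def validate_params(num_topics, word_weighting):
    """Validate user-supplied parameters once, at the public entrypoint."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(num_topics, bool) or not isinstance(num_topics, int) or num_topics < 1:
        raise ValueError(
            f"num_topics must be a positive integer, got {num_topics!r}"
        )
    if word_weighting not in ALLOWED_WEIGHTINGS:
        raise ValueError(
            f"word_weighting must be one of {ALLOWED_WEIGHTINGS}, got {word_weighting!r}"
        )


class SketchModel:
    """Hypothetical model class: the constructor is the only place that validates."""

    def __init__(self, num_topics=10, word_weighting="normal"):
        validate_params(num_topics, word_weighting)
        self.num_topics = num_topics
        self.word_weighting = word_weighting

    def _describe(self):
        # Internal method: trusts the values checked in __init__, no re-validation.
        return f"{self.num_topics} topics, {self.word_weighting} weighting"
```

With this layout, a bad call such as `SketchModel(num_topics=0)` fails immediately with a message that names both the expected and the supplied value, and the internal methods stay free of duplicate checks.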
 
CC @ERijck are you able to continue and finish this up?
All the points above, plus all the FIXME notes I left in the code, must be resolved if we are to keep FlsaModel in Gensim.
@piskvorky yes, I will do that.
Finishing up 1, 3 and 4 will be a great start. I can then assist with 2 (input streaming), to bring flsamodel in line with the rest of Gensim.
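
As a rough sketch of what "input streaming" means here: a streamed BoW corpus is an iterable that yields one document at a time as a list of `(token_id, count)` tuples, so the whole corpus never has to sit in RAM. The class name `StreamingBowCorpus`, the one-document-per-line file layout and the pre-built `token2id` mapping are assumptions made for this example, not part of the PR.

```python
from collections import Counter


class StreamingBowCorpus:
    """Hypothetical streaming corpus: reads one document per line from disk."""

    def __init__(self, path, token2id):
        self.path = path          # text file, one whitespace-tokenized document per line
        self.token2id = token2id  # mapping from token string to integer id

    def __iter__(self):
        # The file is re-opened on every pass, so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf8") as lines:
            for line in lines:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id  # silently skip out-of-vocabulary tokens
                )
                # BoW document: list of (token_id, count), sorted by token id.
                yield sorted(counts.items())
```

A model that only ever iterates over such an object (instead of indexing into a list or a sparse matrix) can handle corpora larger than memory, at the cost of needing one pass over the data per training sweep.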
To get up to speed with Git, I followed the Codecademy Git & GitHub pro course today. Afterwards, I tried to fetch and merge the work in your branch. To do so, I used the following:

I expected to see your code when opening flsamodel.py. However, this is not the case. Then, I tried the following steps:

This does not work. Which command can I use to pull cbfd972257f83d2d64803059e6585c00184f784c refs/heads/flsa_fixes?
Yeah git can be frustrating when you're starting out.
Probably best to discard any existing mess in your local fork and start fresh:
git checkout develop && git fetch upstream && git reset --hard upstream/develop  # discard local changes in your develop branch, if any.
git branch -D flsa_fixes  # delete your existing local flsa_fixes branch, if any.
git checkout -b my_flsa_fixes  # create a new local branch for your changes, named "my_flsa_fixes"
git reset --hard upstream/flsa_fixes  # set the content of "my_flsa_fixes" to match the remote "flsa_fixes", to begin with.
At that point you should be at commit cbfd97225 on branch my_flsa_fixes, so you can make your changes, commit them and push them to your GitHub fork.
When your changes are ready for review, open a new pull request (PR) from your my_flsa_fixes branch against Gensim's flsa_fixes branch. You can do this from GitHub's UI, no need for CLI at this point.
Let me know how it goes :)
Thank you @piskvorky, I will follow your steps!
Hi guys,
I have been checking licensing in some of my projects and FuzzyTM+pyFUME popped up in one that uses gensim. If I understand correctly, they are licensed under the GPL, so importing them in gensim would make gensim GPL as well, rather than LGPL.
Are you aware of this? If I'm wrong concerning the licensing, please let me know.
Thanks!
Plus, FuzzyTM is under a GPL2/3 license, which has a strong copyleft requirement. Recently we let poetry update all our dependencies, and our corporate dependency scan reported this to us as a high concern. We would not be able to continue to use Gensim if that library stayed in (I believe this would be the case for most companies/organizations whose IP is in software). (Ahh, I see @victox5's comment on this now as well.)
Gensim itself has a strong copyleft license too – LGPL. I'm afraid freeloading corporate concerns are not our primary motivator when choosing dependencies.
We offer a commercial (paid) dual licensing for such cases.
Ahh, thanks for the clarification. A misunderstanding on my part of Gensim's (RaRe-Technologies) position. The company I work for would gladly purchase commercial licensing as needed.
@damonmerrill that would be great – we welcome contributions on all levels: https://github.com/sponsors/piskvorky
I note that the license link in the file points at LGPLv3 instead of LGPLv2.1; that should get updated.
@ERijck can you please fix the merge conflict & update the LGPL link as per @pabs3's comment above? Thanks.
Yes, I will do this tomorrow!
See PR #3471 where I apply the required changes to flsa_fixes
We don't have the bandwidth to keep pushing this along.