Upgrade to CLD2

Open cbandy opened this issue 11 years ago • 17 comments

See #8.

Before this is merged, we should update our licensing. The library has changed to the Apache license.

The size of the bundled library has grown significantly. The source itself is over 90 MiB. The gem is now 35 MiB (up from 6 MiB) and installed it uses 93 MiB (up from 17 MiB). If CLD2 ever releases a tarball, we can stop bundling it here and shrink the installed size to 2 MiB.

There are two possible CLD2 libraries to link against: libcld2.so and libcld2_full.so. The latter can detect twice as many languages and is 4 MiB larger. I arbitrarily chose the former, smaller library in this PR. Which would you prefer to be used by default? In either case, we can also make this configurable during gem install.
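A minimal sketch of how the choice could be exposed at install time, assuming a hypothetical `--with-cld2-full` flag passed after `--` on the `gem install` command line. The flag name and the helper `cld2_library_name` are illustrative only and are not part of this PR.

```ruby
# Hypothetical helper for extconf.rb: choose which CLD2 library to link against.
# The "--with-cld2-full" flag name is an assumption made for illustration.
# In a real extconf.rb, mkmf's with_config("cld2-full") reads the same style of flag.
def cld2_library_name(args)
  # Default to the smaller library, matching the arbitrary choice made in this PR.
  args.include?("--with-cld2-full") ? "cld2_full" : "cld2"
end
```

Under that assumption, `gem install cld -- --with-cld2-full` would select libcld2_full.so, and a plain `gem install cld` would keep the smaller libcld2.so.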

cbandy avatar May 30 '14 15:05 cbandy

wow, that is a very large gem! is there any way we can reduce this? 6mb was already too much.

jtoy avatar Jun 03 '14 15:06 jtoy

I found that some of the CLD2 source files are not necessary to build the libraries. The gem is now 17 MiB and installed uses 46 MiB. If we commit to just one of libcld2.so or libcld2_full.so, we can reduce this further.

The unavoidable fact is that the source contains large tables of pre-computed n-grams. cld2_generated_quad0122.cc is required to build libcld2_full.so and is 27 MiB. Gems are already compressed, so minimizing the number of these source files in the shipped gem is the only way to save bits.

If CLD2 were to release an archive/tarball, we could ship zero source files and download it before compiling the extension using something like mini_portile.

I looked into downloading bare files from the project repository, but we either need to

  1. depend on more tools (e.g. svn or wget) or
  2. maintain something approaching their complexity or
  3. maintain a list of source files/URLs to download.
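Option 3 could look roughly like the sketch below, using only the Ruby standard library. The base URL is left as a parameter because no stable download location is given here, and the manifest is a placeholder: only cld2_generated_quad0122.cc is named in this thread, so a real list would be longer. The names `SOURCE_FILES`, `source_urls`, and `fetch_sources` are hypothetical.

```ruby
require "net/http"
require "uri"
require "fileutils"

# Hypothetical manifest. Only this one file name comes from the discussion;
# a real manifest would list every source file the build needs.
SOURCE_FILES = %w[
  cld2_generated_quad0122.cc
].freeze

# Build one URI per file under base_url (assumed to be a directory-style URL).
def source_urls(base_url, files = SOURCE_FILES)
  base = base_url.end_with?("/") ? base_url : "#{base_url}/"
  files.map { |name| URI.join(base, name) }
end

# Download every file into dest_dir before the extension is compiled.
# Note: Net::HTTP.get_response does not follow redirects, and the whole
# response body is held in memory, which matters for the larger files.
def fetch_sources(base_url, dest_dir, files = SOURCE_FILES)
  FileUtils.mkdir_p(dest_dir)
  source_urls(base_url, files).each do |uri|
    response = Net::HTTP.get_response(uri)
    unless response.is_a?(Net::HTTPSuccess)
      raise "failed to fetch #{uri}: HTTP #{response.code}"
    end
    File.binwrite(File.join(dest_dir, File.basename(uri.path)), response.body)
  end
end
```

The cost of this approach is the one noted above: the manifest has to be kept in sync with upstream by hand.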

cbandy avatar Jun 04 '14 05:06 cbandy

Another option is to ship binary/pre-compiled gems. At first pass, it looks like the smaller gem would be less than 2 MiB and the larger would be less than 5 MiB.
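As a rough sketch of what a binary gem involves: the gemspec sets a concrete platform, lists the pre-compiled library among its files, and declares no extension to build. The helper `binary_gemspec`, the version string, and the library path below are hypothetical and only illustrate the shape.

```ruby
require "rubygems"

# Hypothetical helper: build a specification for a pre-compiled gem.
# The file layout (a shared library shipped under lib/) is an assumption.
def binary_gemspec(version, library_path)
  Gem::Specification.new do |spec|
    spec.name     = "cld"
    spec.version  = version
    spec.summary  = "Pre-compiled CLD2 bindings (illustrative sketch)"
    spec.authors  = ["placeholder"]
    # Tie the gem to the platform it was compiled on instead of "ruby".
    spec.platform = Gem::Platform.local
    # Ship the compiled library; spec.extensions stays empty, so nothing
    # is compiled on the installing machine.
    spec.files    = [library_path]
  end
end
```

A separate gem would need to be built and pushed for each supported platform, alongside the source gem as a fallback.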

I don't have any experience releasing a binary gem.

cbandy avatar Jun 04 '14 06:06 cbandy

Any chance there has been any progress or updates with this? I'd love to help out with this if possible.

mattdoller avatar May 15 '15 22:05 mattdoller

I would also like to contribute. Let's solve this issue ASAP. It has been pending for more than a year just because of the size of CLD.

adityapatadia avatar May 31 '15 13:05 adityapatadia

Here is a similar implementation in JavaScript. We can take cues from it: https://github.com/dachev/node-cld

adityapatadia avatar May 31 '15 14:05 adityapatadia

@jtoy can we reconsider this? The gem did get larger, but so did the source library. I don't think there is a clean way to avoid this and still allow anyone to use the gem.

craig-day avatar Jun 26 '15 16:06 craig-day

any update on this?

mmahalwy avatar Oct 11 '15 02:10 mmahalwy

@craig-day can you merge and release this ?

grosser avatar Oct 11 '15 03:10 grosser

I'll take a look hopefully tomorrow or Monday morning at the latest.

craig-day avatar Oct 11 '15 03:10 craig-day

CLD2 project has moved to https://github.com/CLD2Owners/cld2/

cbandy avatar Oct 11 '15 04:10 cbandy

@cbandy is this still ready to go? I'd like to merge and release a new major version.

craig-day avatar Jun 27 '16 16:06 craig-day

It has been a long time since I looked at this.

  • Something still needs to be done about the licensing.
  • The project moved, so any links should be updated. I see one in the README.
  • Should we pull in any changes to CLD2 since May 2014, if any?

If CLD2 were to release an archive/tarball...

I still don't see a tarball; at least not one provided by GitHub tags/releases.

I looked into downloading bare files from the project repository...

Maybe this is more reasonable now that it is hosted in Git? I forget how common it is for gem installers to have git available.

cbandy avatar Jun 27 '16 20:06 cbandy

Should we pull in any changes to CLD2 since May 2014, if any?

This appears to be the revision/commit that I imported in this PR: https://github.com/CLD2Owners/cld2/commit/d076f5eda223ac568639d6288f2e2d70d908f282

cbandy avatar Jun 27 '16 20:06 cbandy

@cbandy can you update the readme link and pull in any new changes? I'm not sure if the tarball is a concern right now. I'd rather avoid a git dependency because not all places gems get installed have git (like production servers).

craig-day avatar Jun 28 '16 19:06 craig-day

As for licensing, I think you can copy the Apache license from the CLD2 owners. It looks like our original license was just copied from them anyway.

craig-day avatar Jun 28 '16 19:06 craig-day

@cbandy I don't think this project will be updated. I suggest you release your code as a new cld2 gem.

guilleiguaran avatar Jun 11 '18 06:06 guilleiguaran