stdlib string package pulls all datasets

Description

This is not a bug per se, but something worth fixing nonetheless.

@stdlib/string depends on @stdlib/datasets (~150MB) only to use @stdlib/datasets/stopwords-en (<100kB).

Related Issues

No response

Questions

Side note: should @stdlib/datasets even exist as a single package?

Demo

No response

Reproduction

No response

Expected Results

No response

Actual Results

No response

Version

No response

Environments

Node.js

Browser Version

No response

Node.js / npm Version

No response

Platform

No response

Checklist

[X] Read and understood the Code of Conduct.
[X] Searched for existing issues and pull requests.

Dec 18 '21 15:12 lightmare

@lightmare Yeah, we're in a bit of a catch-22 here. We've discussed this internally and, by my assessment, we have three (less than ideal) options.

1. Publish top-level namespaces so that they depend only on the individual datasets packages they actually use. This has the downside that, upon installing @stdlib/stdlib, users would then have duplication of datasets, as @stdlib/datasets-foo would be installed alongside @stdlib/datasets/foo. Duplication bloat could become significant depending on how @stdlib/datasets/* are used throughout the project in the future.
2. Move certain dataset packages out of datasets. This is not great, as it fragments dataset packages across the project, leads to other namespaces increasing in heft, and would require package deprecation. Furthermore, moving packages around is something of a slippery slope. While certain datasets packages may not be depended on anywhere in the project today, this is not guaranteed to remain that way forever. Meaning, if we "patch" this issue now, there is nothing to prevent us from running into the same issue later with one or more other datasets packages, leading to a top-level namespace depending on all of @stdlib/datasets. In general, IMO, I don't think this is a viable option.
3. Leave the status quo.

Given the above, I'd lean toward (3). If you want to avoid installing all of @stdlib/datasets, I'd recommend just installing those individual @stdlib/string-* packages you actually need (e.g., @stdlib/string-camelcase, etc). In turn, this should only install those datasets packages which are actually used, not the entire top-level namespace.

Dec 18 '21 21:12 kgryte

Ad 1.

> This has the downside that, upon installing @stdlib/stdlib, users would then have duplication of datasets as @stdlib/datasets-foo would be installed alongside @stdlib/datasets/foo. Duplication bloat could become significant depending on how @stdlib/datasets/* are used throughout the project in the future.

This duplication is happening already. If I install @stdlib/string and some other package depends on @stdlib/string-acronym, I end up with @stdlib/datasets and @stdlib/datasets-stopwords-en.
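To make the overlap concrete, here is a minimal sketch that resolves the set of packages pulled in by two direct dependencies. The dependency graph is a simplified illustration built only from the package names mentioned in this thread, and `resolveInstalled` is a hypothetical helper, not part of npm or stdlib.

```javascript
'use strict';

// Simplified, hypothetical dependency graph using only the packages named in this thread.
var DEPS = {
	'@stdlib/string': [ '@stdlib/datasets' ],
	'@stdlib/string-acronym': [ '@stdlib/datasets-stopwords-en' ],
	'@stdlib/datasets': [],
	'@stdlib/datasets-stopwords-en': []
};

// Returns the sorted list of all packages reachable from the given root packages.
function resolveInstalled( roots, deps ) {
	var children;
	var stack = roots.slice();
	var seen = {};
	var name;
	var i;
	while ( stack.length > 0 ) {
		name = stack.pop();
		if ( seen[ name ] ) {
			continue;
		}
		seen[ name ] = true;
		children = deps[ name ] || [];
		for ( i = 0; i < children.length; i++ ) {
			stack.push( children[ i ] );
		}
	}
	return Object.keys( seen ).sort();
}

var installed = resolveInstalled( [ '@stdlib/string', '@stdlib/string-acronym' ], DEPS );
console.log( installed );
```

In this toy model, both the full datasets namespace and the standalone stop words package end up in the tree, which is the mixed-mode duplication described above.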

Which is partly why I asked whether @stdlib/datasets should even exist as a single monster package. It's a random collection of stuff that's never going to be needed whole. And it just duplicates all the datasets it's comprised of.

Ad 2. agreed, that doesn't seem viable.

> If you want to avoid installing all of @stdlib/datasets, I'd recommend just installing those individual @stdlib/string-* packages you actually need (e.g., @stdlib/string-camelcase, etc). In turn, this should only install those datasets packages which are actually used, not the entire top-level namespace.

That sounds like the wrong trade-off. The convenience of having @stdlib/string as a single package providing various operations on strings greatly outweighs the convenience of @stdlib/datasets as a single package containing an assortment of data, from the Napoleonic wars, through Boston house prices, to monochrome photos of a :cat:

Dec 19 '21 00:12 lightmare

Re: 1. Yeah, that is a fair point for the case where a user has in their package tree both @stdlib/string and, say, @stdlib/string-*. I suppose our intuition is that the consumption distribution is mainly trimodal, with one mode being @stdlib/stdlib, the second being top-level namespaces, and the third being individual packages. As soon as consumption patterns overlap, we're kind of stuck in a sub-optimal minimum, where no one solution is optimal. If we actually controlled the package tree à la npm, we could probably figure out an optimal tree, but alas!

In general, I agree that we should consider whether there is a way to achieve some sort of happy medium for the mixed mode case, but I am not terribly optimistic. 😞

@Planeshifter may have other ideas. 🙏

Dec 19 '21 00:12 kgryte

@Planeshifter Looks like in @stdlib/string/*, the only package which explicitly depends on a dataset is @stdlib/string/acronym, which defaults to using English stopwords. We could just default to an empty array. This would alleviate the issue of @stdlib/string depending on @stdlib/datasets.

This would be somewhat akin to @stdlib/string/remove-words where we don’t default to any list of particular words to remove.
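As a rough illustration of the proposal, the sketch below shows an acronym function whose stop word list defaults to an empty array, so nothing needs to be loaded from a datasets package unless the caller supplies a list. This is a hypothetical, simplified stand-in and not the actual @stdlib/string/acronym implementation; the `stopwords` option name and the tokenization rule are assumptions made for the example.

```javascript
'use strict';

// Hypothetical sketch; NOT the actual @stdlib/string/acronym implementation.
// The `stopwords` option name is an assumption made for illustration.
function acronym( str, options ) {
	// Default to an empty list, so no dataset dependency is required:
	var stopwords = ( options && options.stopwords ) || [];
	var words = str.split( /[\s-]+/ );
	var out = '';
	var w;
	var i;
	for ( i = 0; i < words.length; i++ ) {
		w = words[ i ];
		// Skip empty tokens and any caller-supplied stop words:
		if ( w === '' || stopwords.indexOf( w.toLowerCase() ) !== -1 ) {
			continue;
		}
		out += w.charAt( 0 ).toUpperCase();
	}
	return out;
}

console.log( acronym( 'the quick brown fox' ) );
console.log( acronym( 'the quick brown fox', { 'stopwords': [ 'the' ] } ) );
```

With this design, callers who want English stop words removed would pass the list explicitly, for example one obtained from the standalone stop words dataset package, rather than having it pulled in by default.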

Dec 29 '21 18:12 kgryte

This has now been addressed in https://github.com/stdlib-js/stdlib/commit/d96a0da70fcca6f13fdb2c87a9cef1a751c6b545. Thanks, @Planeshifter!

Jan 19 '22 02:01 kgryte

Closing as the change is incorporated in the latest release of the @stdlib/string package.

Sep 16 '22 14:09 Planeshifter
