stdlib
string package pulls all datasets
Description
This is not a bug per se, but something worth fixing nonetheless.
@stdlib/string depends on @stdlib/datasets (~150MB) only to use @stdlib/datasets/stopwords-en (<100kB).
Related Issues
No response
Questions
Side note: should @stdlib/datasets even exist as a single package?
Demo
No response
Reproduction
No response
Expected Results
No response
Actual Results
No response
Version
No response
Environments
Node.js
Browser Version
No response
Node.js / npm Version
No response
Platform
No response
Checklist
- [X] Read and understood the Code of Conduct.
- [X] Searched for existing issues and pull requests.
@lightmare Yeah, we're in a bit of a catch-22 here. We've discussed this internally and, by my assessment, we have three (less than ideal) options.
1. Publish top-level namespaces to depend on only those individual datasets packages they actually use. This has the downside that, upon installing @stdlib/stdlib, users would then have duplication of datasets, as @stdlib/datasets-foo would be installed alongside @stdlib/datasets/foo. Duplication bloat could become significant depending on how @stdlib/datasets/* are used throughout the project in the future.
2. Move certain dataset packages out of datasets. This is not great as it then fragments dataset packages across the project, leads to other namespaces increasing in heft, and would require package deprecation. Furthermore, moving packages around is something of a slippery slope. While certain datasets packages may not be depended on anywhere in the project, this is not guaranteed to remain that way forever. Meaning, if we "patch" this issue now, there is nothing to prevent us from running into the same issue later with one or more other datasets packages, leading to a top-level namespace depending on all of @stdlib/datasets. In general, IMO, I don't think this is a viable option.
3. Leave the status quo.
Given the above, I'd lean toward (3). If you want to avoid installing all of @stdlib/datasets, I'd recommend just installing those individual @stdlib/string-* packages you actually need (e.g., @stdlib/string-camelcase, etc.). In turn, this should only install those datasets packages which are actually used, not the entire top-level namespace.
Ad 1.
> This has the downside that, upon installing @stdlib/stdlib, users would then have duplication of datasets as @stdlib/datasets-foo would be installed alongside @stdlib/datasets/foo. Duplication bloat could become significant depending on how @stdlib/datasets/* are used throughout the project in the future.
This duplication is happening already. If I install @stdlib/string and some other package depends on @stdlib/string-acronym, I end up with @stdlib/datasets and @stdlib/datasets-stopwords-en.
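The mixed case described above can be sketched with a toy dependency walk. This is only an illustration: the `DEPENDENCIES` map is reduced to the packages named in this thread, `some-other-package` and `installedPackages` are hypothetical names, and a real package manager resolves versions and hoisting in ways this sketch ignores.

```javascript
// Hypothetical, heavily reduced dependency map (assumption: only the
// edges mentioned in this thread are modelled).
const DEPENDENCIES = {
    '@stdlib/string': [ '@stdlib/datasets' ],
    '@stdlib/string-acronym': [ '@stdlib/datasets-stopwords-en' ],
    'some-other-package': [ '@stdlib/string-acronym' ],
    '@stdlib/datasets': [],
    '@stdlib/datasets-stopwords-en': []
};

// Collects every package reachable from a list of direct dependencies.
function installedPackages( direct ) {
    const seen = new Set();
    const stack = direct.slice();
    while ( stack.length > 0 ) {
        const name = stack.pop();
        if ( seen.has( name ) ) {
            continue;
        }
        seen.add( name );
        const deps = DEPENDENCIES[ name ] || [];
        for ( const dep of deps ) {
            stack.push( dep );
        }
    }
    return Array.from( seen ).sort();
}

// Installing the top-level namespace next to a package that pulls in the
// standalone acronym package yields both copies of the stopwords data.
const mixedTree = installedPackages( [ '@stdlib/string', 'some-other-package' ] );
console.log( mixedTree );
```

In this toy model, the resulting list contains both @stdlib/datasets and @stdlib/datasets-stopwords-en, which is the duplication being pointed out.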
Which is partly why I asked whether @stdlib/datasets should even exist as a single monster package. It's a random collection of stuff that's never going to be needed whole. And it just duplicates all the datasets it's comprised of.
Ad 2. agreed, that doesn't seem viable.
> If you want to avoid installing all of @stdlib/datasets, I'd recommend just installing those individual @stdlib/string-* packages you actually need (e.g., @stdlib/string-camelcase, etc). In turn, this should only install those datasets packages which are actually used, not the entire top-level namespace.
That sounds like the wrong trade-off. The convenience of having @stdlib/string as a single package providing various operations on strings greatly outweighs the convenience of @stdlib/datasets as a single package containing an assortment of data from the Napoleonic wars, through Boston house prices, to monochrome photos of a :cat:
Re: 1. Yeah, that is a fair point where a user has in their package tree both @stdlib/string and, say, @stdlib/string-*. I suppose our intuition is that the consumption distribution is mainly trimodal, with one mode being @stdlib/stdlib, the second being top-level namespaces, and the third being individual packages. As soon as consumption patterns overlap, we're kind of stuck in a sub-optimal minimum, where no one solution is optimal. If we actually controlled the package tree à la npm, we could probably figure out an optimum tree, but alas!
In general, I agree that we should consider whether there is a way to achieve some sort of happy medium for the mixed mode case, but I am not terribly optimistic. 😞
@Planeshifter may have other ideas. 🙏
@Planeshifter Looks like in @stdlib/string/*, the only package which explicitly depends on a dataset is @stdlib/string/acronym, which defaults to using English stopwords. We could just default to an empty array. This would alleviate the issue of @stdlib/string depending on @stdlib/datasets.
This would be somewhat akin to @stdlib/string/remove-words where we don’t default to any list of particular words to remove.
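To make the proposal concrete, here is a minimal sketch of an acronym function whose stopword list defaults to an empty array, so that no dataset is needed unless the caller supplies one. This is a hypothetical stand-in written for illustration, not the @stdlib/string/acronym implementation; the `options.stopwords` name and the tokenization rules are assumptions.

```javascript
// Hypothetical stand-in, not the @stdlib/string/acronym implementation.
// Assumption: stopwords are passed via `options.stopwords` and default to [].
function acronym( str, options ) {
    const stopwords = ( options && options.stopwords ) || [];
    const skip = new Set( stopwords.map( ( w ) => w.toLowerCase() ) );
    return str
        .split( /\s+/ )
        // Drop empty tokens and any caller-supplied stopwords:
        .filter( ( word ) => word.length > 0 && !skip.has( word.toLowerCase() ) )
        // Keep the upper-cased first character of each remaining word:
        .map( ( word ) => word.charAt( 0 ).toUpperCase() )
        .join( '' );
}

// No stopwords by default, so every word contributes a letter:
console.log( acronym( 'the quick brown fox' ) ); // prints "TQBF"

// Callers who want filtering bring their own list:
console.log( acronym( 'the quick brown fox', { 'stopwords': [ 'the' ] } ) ); // prints "QBF"
```

With this shape, a caller who still wants English stopwords could install a stopwords dataset themselves and pass it in, rather than the string package depending on it.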
This has now been addressed in https://github.com/stdlib-js/stdlib/commit/d96a0da70fcca6f13fdb2c87a9cef1a751c6b545. Thanks, @Planeshifter!
Closing as the change is incorporated in the latest release of the @stdlib/string package.