Commitment to keep package size down

Open MarcoGorelli opened this issue 11 months ago • 6 comments

Narwhals started off with the objective of being a lightweight compatibility layer. As we've been adding features and supported backends, the package size has been growing.

There's a lot of essential dataframe functionality, and a lot of libraries that people want to support, so some increase in size since the earliest days is expected. But we do need to monitor it, it does need to stay under control, and Narwhals does need to stay lightweight.

Commitment: I'd like to suggest a hard commitment that:

  • The Narwhals wheel size will never go above 500 kB. It's currently 305 kB
  • Narwhals' size on disk will never go above 5000 kB, measured as the difference in a virtual environment's size before vs. after running pip install narwhals. This includes some cached files which Python generates, but I think it's still good to monitor the overall size. It's currently 3789 kB.
  • Never introduce any required dependencies nor compiled code
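
A minimal sketch of how the two size limits above could be checked, for example in CI. The function names and the choice of 1 kB = 1000 bytes are assumptions made for illustration; this is not a script from the Narwhals repository.

```python
from pathlib import Path

# Limits proposed in this issue, in kB (assumed here to mean 1000 bytes).
WHEEL_LIMIT_KB = 500
INSTALLED_LIMIT_KB = 5000


def size_kb(path: Path) -> float:
    """Size of a file, or total size of all files under a directory, in kB."""
    if path.is_file():
        return path.stat().st_size / 1000
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) / 1000


def check_limit(path: Path, limit_kb: float) -> bool:
    """Return True if `path` is at or below `limit_kb`."""
    return size_kb(path) <= limit_kb
```

For the wheel, `check_limit` can be pointed at the built wheel file with `WHEEL_LIMIT_KB`. For the on-disk figure as defined above, one option is to call `size_kb` on the virtual environment directory before and after installing, and compare the difference against `INSTALLED_LIMIT_KB`.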

https://github.com/narwhals-dev/narwhals/issues/1886 will probably increase our size a bit more. I think that's OK, as Ibis is a library that a few maintainers have said that they want to support. But it does bring us closer to the limits.

Some strategies to reduce size are:

  • Reduce overly-long docstrings. Some examples of how to do this are in https://github.com/narwhals-dev/narwhals/pull/1939 and https://github.com/narwhals-dev/narwhals/pull/1915
  • More code-sharing. https://github.com/narwhals-dev/narwhals/pull/1876 is a nice example, and I think there are more opportunities to do this
  • Directly implement some methods at the Narwhals level, instead of at the compliant level. For example, is is_duplicated just the negation of is_unique?
  • Freeze new features which don't have a use case. Series.hist is fine because it's been requested by Marimo, so it has a clear use case. Anything without a clear use-case, I think we may need to put the brakes on, at least until https://github.com/narwhals-dev/narwhals/issues/1886 is resolved
  • See if there's any linter configurations that would reduce the size. I really don't want to minify Narwhals - legibility is important - but maybe there's some simple settings we can tweak, like line length / grouping imports / commas, that can reduce size a bit "for free"
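
To illustrate the third strategy, here is a toy sketch of deriving is_duplicated from is_unique at the top level, so that each backend only has to implement one of the two. `ToyCompliantSeries` and `ToySeries` are hypothetical stand-ins, not Narwhals classes, and whether the negation holds for null values in every backend would still need checking.

```python
from collections import Counter


class ToyCompliantSeries:
    """Hypothetical stand-in for a backend-specific ("compliant") series."""

    def __init__(self, values):
        self._values = list(values)

    def is_unique(self):
        # True for values that occur exactly once.
        counts = Counter(self._values)
        return [counts[v] == 1 for v in self._values]


class ToySeries:
    """Hypothetical stand-in for the user-facing, Narwhals-level series."""

    def __init__(self, compliant):
        self._compliant = compliant

    def is_unique(self):
        return self._compliant.is_unique()

    def is_duplicated(self):
        # Derived once here, so no backend needs its own is_duplicated.
        return [not flag for flag in self._compliant.is_unique()]
```

With this layout, adding a backend means writing only `is_unique`; `is_duplicated` comes for free from the shared layer.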

Any help towards this goal would be appreciated - thank you, and thank you to everyone who has contributed in any way to Narwhals 🙏

MarcoGorelli avatar Feb 05 '25 10:02 MarcoGorelli

This might only be a negligible win, but # type: ignore[no-any-return] shows up 159 times in narwhals.

I think you could remove these by adding warn_return_any = false here:

https://github.com/narwhals-dev/narwhals/blob/78f8c0a28f5e5a11ed57a082e57f071398e9f5ef/pyproject.toml#L211-L213

That rule seems like a poor fit for narwhals anyway. You'd need to remove all those comments afterwards to avoid 159 of these though:

narwhals/stable/v1/__init__.py:2325: error: Unused "type: ignore" comment  [unused-ignore]
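
A rough sketch of how those comments could be counted and stripped once the rule is disabled. The helper names are made up for illustration, and the pattern only matches the exact `no-any-return` code on its own: combined ignores such as `[no-any-return, arg-type]` are deliberately left untouched and would need manual review.

```python
import re

# Matches the exact ignore comment, plus any spaces or tabs before it.
_IGNORE = re.compile(r"[ \t]*#\s*type:\s*ignore\[no-any-return\]")


def count_ignores(source: str) -> int:
    """Number of `# type: ignore[no-any-return]` comments in `source`."""
    return len(_IGNORE.findall(source))


def strip_ignores(source: str) -> str:
    """Return `source` with those comments (and preceding spaces) removed."""
    return _IGNORE.sub("", source)
```

Applied file by file (reading each `.py` file, calling `strip_ignores`, and writing the result back), this would remove the comments without touching the code they annotate.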

dangotbanned avatar Feb 07 '25 13:02 dangotbanned

This is an interesting one. Thinking about it, I'm not sure how much saving on whitespace characters and docstrings would help, compared to other packages which take about 100MB after installation.

I very much appreciate narwhals being very lightweight, but I'd say as long as it doesn't have any dependencies, it's VERY lightweight, almost independent of saving on little things.

As for including features only if a use case is there, I think of it as a double-edged sword kinda thing:

  • on one hand, you're right that it makes sense to have some sort of a compass on what to include and what not to include
  • on the other hand, if we include features only when they come up, then we'd be in the situation where, if package X needs a feature in narwhals, they'd need to trigger its inclusion, wait for a release, and even then "depend" on a very recent narwhals release. Let's say the feature involves support for library Y: does the feature support only the latest release of library Y, or does every feature support multiple versions of the corresponding library from the start? Requiring very recent minimum dependency versions makes dependency resolution hell.

So I guess the tl;dr here for me is:

  • yes to no dependency
  • yes to having some sort of an "inclusion criteria"
  • yes to not being a 100MB kinda package

but I wouldn't worry about the package having a 5MB download size.

adrinjalali avatar Feb 12 '25 15:02 adrinjalali

Thanks for your comments, much appreciated!

Maybe we indeed don't need a strict cap, but I would like to keep closely monitoring size - I don't think any dataframe library started out thinking they'd get 400MB wheels, but it is where things tend to go if unchecked (seriously, the PySpark 4.0 wheel is >400MB, wut 🤯 https://pypi.org/project/pyspark/4.0.0.dev2/#files)

I'd still like to suggest slowing down on new features, so that we can focus on:

  • static typing, which has helped find useful issues
  • making tests more consistent, so it's easier to add backends
  • refactoring ExprKind parsing, so there's less duplication of logic for when expressions need broadcasting
  • setting up performance tracking, and checking in CI that overhead for some common operations is below x %
  • better docs, better organised API reference, more varied tutorials
  • setting things up for stable.v2 (especially, determining how to support order-dependent operations for SQL-like backends). We should at least have a POC ready for SQLFrame / PySpark
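
The performance-tracking item could be sketched roughly as below. This is only an illustration of the idea: the helper names, the best-of-N timing approach, and the threshold are all assumptions, and no such check is claimed to exist in the repository.

```python
import timeit


def best_time(func, repeat=5, number=1000):
    """Best-of-`repeat` wall time in seconds for `number` calls of `func`."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


def overhead_percent(baseline_s, wrapped_s):
    """Extra time of the wrapped call relative to the baseline, in percent."""
    return (wrapped_s - baseline_s) / baseline_s * 100


def within_budget(baseline_func, wrapped_func, max_percent):
    """Return True if the wrapped call's overhead is below `max_percent`."""
    baseline = best_time(baseline_func)
    wrapped = best_time(wrapped_func)
    return overhead_percent(baseline, wrapped) < max_percent
```

Here `baseline_func` would call the native library directly and `wrapped_func` would do the same operation through Narwhals; timing noise on shared CI runners means any real threshold would need some slack.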

We can then resume adding features (filling out the .list namespace would be very nice, for example). But for the past year, things have moved very fast, and we only have limited attention and time, so I think we should spend some focused time on "important but not urgent" tasks like the ones above before expanding further

MarcoGorelli avatar Feb 14 '25 11:02 MarcoGorelli

On the topic of docstrings, downstream users could always just use -OO at runtime if they are trying to strip those. That may not be fully what you are after, but it may be close enough

https://docs.python.org/3.12/using/cmdline.html#cmdoption-OO
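
A small sketch of the effect, using the built-in `compile` with its `optimize` argument to mimic what -OO does at the interpreter level (`optimize=2` corresponds to -OO). The `greet` function is just an example. Note that this affects the compiled bytecode, not the size of the source files in the wheel.

```python
SOURCE = '''
def greet():
    """Return a greeting."""
    return "hello"
'''


def docstring_after_compile(optimize: int):
    """Compile SOURCE at the given optimization level and return greet.__doc__."""
    namespace = {}
    code = compile(SOURCE, "<example>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["greet"].__doc__
```

Calling `docstring_after_compile(0)` returns the docstring, while `docstring_after_compile(2)` returns `None` because the docstring has been dropped.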

WillAyd avatar Feb 18 '25 17:02 WillAyd

Could some ruff rules work, e.g. flake8-simplify (SIM): https://docs.astral.sh/ruff/rules/#flake8-simplify-sim?

mikeweltevrede avatar Mar 29 '25 16:03 mikeweltevrede

> Could some ruff rules work, e.g. flake8-simplify (SIM): https://docs.astral.sh/ruff/rules/#flake8-simplify-sim?

Thanks @mikeweltevrede, I'm a big fan of SIM, but I think we've already got them enabled. I haven't checked to see if there was anything interesting in the preview rules though 🤔

These currently ignored rules would likely help, if only to nudge towards writing shorter, more reusable functions/methods:

https://github.com/narwhals-dev/narwhals/blob/5550ad897992e3a639cea3fb0cca5bede98da6d7/pyproject.toml#L149

https://github.com/narwhals-dev/narwhals/blob/5550ad897992e3a639cea3fb0cca5bede98da6d7/pyproject.toml#L165-L166

https://github.com/narwhals-dev/narwhals/blob/5550ad897992e3a639cea3fb0cca5bede98da6d7/pyproject.toml#L169

dangotbanned avatar Mar 29 '25 16:03 dangotbanned

Thanks all for comments

Been thinking about this, and tbh I'd still like to make a non-binding commitment to keep the wheel under 500 kB. Even an arbitrary and imperfect upper limit is better than no limit at all. At the very least, it forces us to keep track of how big the project is getting. The latest release is 377.3 kB, so we've still got plenty of wiggle room.

MarcoGorelli avatar Jul 26 '25 13:07 MarcoGorelli