altair icon indicating copy to clipboard operation
altair copied to clipboard

feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF

Open MarcoGorelli opened this issue 1 year ago • 12 comments

Suggested in https://github.com/vega/altair/issues/3445#issuecomment-2194585068 closes #3456 closes #3445

Demo

Polars can work without PyArrow nor pandas installed (EDIT: and without NumPy either!):

Screenshot 2024-06-30 153556

Pyarrow table is still supported

(notice the .to_arrow)

Screenshot 2024-06-30 150956

Ibis is still supported

Screenshot 2024-06-30 150914

Performance

There's also a perf improvement to not repeatedly converting Polars to PyArrow - for example, if I run

import polars as pl
import altair as alt
from datetime import date

# dataset: https://github.com/Mcompetitions/M6-methods/blob/main/assets_m6.csv
df = pl.read_csv('../scratch/assets.csv', try_parse_dates=True)
df = df.filter(pl.col('date')>date(2022, 11, 15))
def func():
    alt.Chart(df).mark_line().encode(
        x='date', y='price', color='symbol',
    ).to_json()
results = %timeit -o -q func()
results.best

in IPython, then I see:

  • Altair main branch: 0.033584889399935494
  • Altair MarcoGorelli:narwhalify branch: 0.024283270300657023

That's for a (4475, 3)-shaped dataframe - for dataframe with many columns (esp. with strings / categoricals) I'd expect the difference to grow

MarcoGorelli avatar Jun 30 '24 15:06 MarcoGorelli

@MarcoGorelli if all the tests are passing locally for you, try running these two:

hatch run generate-schema-wrapper
hatch run update-init-file

If there are any changes other than whitespace, you'll get a more informative error than https://github.com/vega/altair/actions/runs/9733301918/job/26860053484?pr=3452

Otherwise those two should clean everything up.

dangotbanned avatar Jun 30 '24 16:06 dangotbanned

https://github.com/vega/altair/actions/runs/9734437115/job/26862450764?pr=3452

@MarcoGorelli try this to run the tests on the matrix locally:

hatch test --all

The help for hatch is pretty useful as well:

hatch test -h

dangotbanned avatar Jun 30 '24 19:06 dangotbanned

@MarcoGorelli these were two minor things. I couldn't figure out a way to request these another way.

fix: add missing pandas import to doctest

519     Examples 520     -------- 521     >>> data = pd.DataFrame({'foo': ['A', 'B', 'A', 'B'], UNEXPECTED EXCEPTION: NameError("name 'pd' is not defined") Traceback (most recent call last):   File "..\Local\hatch\env\virtual\.pythons\3.12\python\Lib\doctest.py", line 1361, in __run           exec(compile(example.source, filename, "single",   File "<doctest altair.utils.core.parse_shorthand[0]>", line 1, in <module> NameError: name 'pd' is not defined C:\Users\danie\Documents\GitHub\altair\altair\utils\core.py:521: UnexpectedException ============================================== short test summary info ==============================================  FAILED altair/utils/core.py::altair.utils.core.parse_shorthand ========================= 1 failed, 1250 passed, 1 skipped, 2 xfailed, 1 xpassed in 30.64s ==========================
diff --git a/altair/utils/core.py b/altair/utils/core.py
index e6620146..d5d753db 100644
--- a/altair/utils/core.py
+++ b/altair/utils/core.py
@@ -520,6 +520,7 @@ def parse_shorthand(
 
     Examples
     --------
+    >>> import pandas as pd
     >>> data = pd.DataFrame({'foo': ['A', 'B', 'A', 'B'],
     ...                      'bar': [1, 2, 3, 4]})
 

refactor(typing): Reuse _typing alias as InferredVegaLiteType

diff --git a/altair/utils/core.py b/altair/utils/core.py
index e6620146..f3a5469a 100644
--- a/altair/utils/core.py
+++ b/altair/utils/core.py
@@ -42,6 +42,7 @@ else:
 if TYPE_CHECKING:
     from types import ModuleType
     import typing as t
+    from altair.vegalite.v5.schema._typing import StandardType_T as InferredVegaLiteType
     from altair.utils._dfi_types import DataFrame as DfiDataFrame
     from altair.utils.data import DataType
     from narwhals.typing import IntoExpr
@@ -199,9 +200,6 @@ TIMEUNITS = [
 ]
 
 
-InferredVegaLiteType = Literal["ordinal", "nominal", "quantitative", "temporal"]
-
-
 def infer_vegalite_type_for_pandas(
     data: object,
 ) -> InferredVegaLiteType | tuple[InferredVegaLiteType, list[Any]]:

dangotbanned avatar Jul 02 '24 11:07 dangotbanned

Thank you very much for making this Pull Request. That is very much appreciated!

This PR introduces a new dependency on narwhals. Narwhals is described as a lightweight and extensible compatibility layer between multiple dataframe libraries. Narwhals can replace our hard dependency on pandas, the size of pandas is ~11 MB (without mentioning its subdependencies). This by itself is a very interesting development already.

Next to this it enables support for native dataframes (pandas still included) without converting these to a pyarrow Tables through the dataframe interchange protocol.

The latest release of Altair is ~850 kB and after this PR, Altair will have the following hard dependencies:

  • jinja2 : ~130 kB
  • jsonschema : ~90 kB
  • typing_extensions: ~35 kB
  • packaging : ~55 kB
  • narwhals : ~80 kB
  • ~~numpy : 18 MB~~ (also removed as part of this PR)

When using Altair with only remote endpoints (like APIs that are returning JSON-data), I've the feeling that we do not need to rely on numpy. ~~Therefor I'm still hopefull that we can reduce this hard dependency on numpy as well one day.~~

I've a few questions regarding this PR and narwhals.

  1. Just curious, why is the lower limit of pandas in narwhals version 1.1.5? As the supported lower limit of pandas in Altair before this PR was version 0.25.

  2. You serialize the nw.Date type to:

if dtype == nw.Date:
	columns.append(nw.col(name).dt.to_string("%Y-%m-%d"))

And change a test from:

"a": {"date": "2019-01-01T00:00:00"},

To:

"a": {"date": "2019-01-01"},

Upon first sight, I think we cannot introduce this breaking change. Does this change still adhere to full ISO-8601 dates representation as is required by Javascript? See also past issues like this: https://github.com/vega/altair/issues/1027.

  1. Can you explain the behavior of nw.from_native()? When I look to this code:
data = nw.from_native(data, eager_only=True, strict=False)
if isinstance(data, nw.DataFrame):
	return data
if isinstance(data, DataFrameLike):
	from altair.utils.data import arrow_table_from_dfi_dataframe

	pa_table = arrow_table_from_dfi_dataframe(data)
	return nw.from_native(pa_table, eager_only=True)

I'm a bit surprised that the data object, as a result of passing it into nw.from_native can be a non-narwhals dataframe. If it is not supported by narwhals it returns the original data object?

  1. Maybe a naive question, but do you have any intention to support dataframes that support the dataframe interface protocol within narwhals?

  2. Not being a type expert, but I cannot really understand why there needs to be quotes around the type "IntoDataFrame", like in:

DataType: TypeAlias = Union[
    Dict[Any, Any], "IntoDataFrame", SupportsGeoInterface, DataFrameLike
]

I remember I have discussed this with @dangotbanned recently in https://github.com/vega/altair/pull/3431, see https://github.com/vega/altair/pull/3431#issuecomment-2162877321 onwards, but I can't find a clear answer anymore on this topic.

Thanks again for this PR! Really promising! (also this PR will supersede https://github.com/vega/altair/pull/3384)

mattijn avatar Jul 02 '24 21:07 mattijn

  1. Not being a type expert, but I cannot really understand why there needs to be quotes around the type "IntoDataFrame", like in:
DataType: TypeAlias = Union[
    Dict[Any, Any], "IntoDataFrame", SupportsGeoInterface, DataFrameLike
]

I remember I have discussed this with @dangotbanned recently in #3431, see #3431 (comment) onwards, but I can't find a clear answer anymore on this topic.

Thanks for the ping @mattijn

The good news is you can trust the ruff config now for managing cases like this (particularly TCH), what @MarcoGorelli has written is correct.

https://github.com/MarcoGorelli/altair/blob/2df9422dca75a26eb16fd0c8030b6e89b4d4fa0f/altair/utils/data.py#L44-L66

All of the imports within the TYPE_CHECKING block must be quoted when used in a runtime alias definition. IntoDataFrame is used in: DataType, TIntoDataFrame, SampleReturnType.

AFAIK, this should be fine to move to a regular import though

  • alt.utils.data requires explicit exports in alt.utils.__init__.__all__
  • narwhals is already imported at runtime in https://github.com/MarcoGorelli/altair/blob/2df9422dca75a26eb16fd0c8030b6e89b4d4fa0f/altair/utils/data.py#L26

dangotbanned avatar Jul 03 '24 09:07 dangotbanned

Thanks so much for your quick and careful reviews @mattijn and @dangotbanned 🙏 ! Much appreciated

  1. No reason :) I just lowered the minimum version of pandas in Narwhals - I set it to 0.25.3 rather than 0.25 though, as when I tried using 0.25 in CI it just froze indefinitely. 0.25.3 is also the one that Altair is currently testing against in CI - is it OK to set that as the minimum?

  2. Interesting, thanks, and I'm sorry for not having picked up on that! Quoting the issue

    If you were east of London when running this code, you wouldn't see this issue 😄.

    that might be why I didn't see it when testing it out 😳 . Polars doesn't let us format Dates with time directives (because the underlying Rust crate chrono doesn't either) - but, we can just cast to datetime, I've added a comment about why this is done. On my laptop, pyarrow.compute.strftime is about 200 times slower than casting Date to Datetime - so, the extra cast to ensure consistency, in the grand scheme of the things, is quite negligible, and not the bottleneck of the sanitize function

  3. That's because of strict=False in the first line of the function https://narwhals-dev.github.io/narwhals/api-reference/narwhals/#narwhals.from_native, the idea is:

    • strict=False: if the object isn't supported by Narwhals, then it just gets passed through as-is
    • strict=True: if the object isn't supported by Narwhals, then raise
  4. We'd like to have 2 levels of supported dataframes in Narwhals:

    • full API
    • basic. I.e. an extended and more user-friendly version of the dataframe interchange protocol. Anything which already supports the interchange protocol would be in scope. My main gripe with the interchange protocol is that it's bundled with dataframe libraries themselves, which means is super slow to change - for example, Jon reported https://github.com/pandas-dev/pandas/issues/54239, and it got fixed, but it's going to be several years until the minimum pandas version gets bumped up enough that it can be used... 😭
  5. thanks @dangotbanned for helping out with this one!

I've the feeling that we do not need to rely on numpy

"in for a penny, in for a pound" - I've added a commit to do this. NumPy is used very little, mostly isinstance checks and np.ceil, so I think it can already be made optional. Then Altair would truly be the lightest visualisation library in the ecosystem?

I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas, so we may want to have a conversation later about how that should look (personally, I think df.plot.line(x='x', y='y') as a shorthand for alt.Chart(df).mark_line().encode(x='x', y='y').interactive() would be really nice, and users would ❤️ love ❤️ it)

MarcoGorelli avatar Jul 03 '24 11:07 MarcoGorelli

Thanks @MarcoGorelli

When I originally suggested narwhals I didn't expect you'd take an active role - and this PR is certainly far beyond that! 😄 I get the impression that this is beneficial for both projects, as any issues you've encountered have been quickly resolved in subsequent narwhals releases

I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas, so we may want to have a conversation later about how that should look (personally, I think df.plot.line(x='x', y='y') as a shorthand for alt.Chart(df).mark_line().encode(x='x', y='y').interactive() would be really nice, and users would ❤️ love ❤️ it)

I think most people would agree on the value of this, but probably best for another issue. @joelostblom would probably be able to chime in, there was also this discussion https://github.com/vega/altair/discussions/2789#discussioncomment-4579799

I would expect there would be more support if the implementation was different to the existing hvPlot (polars) / hvPlot accessor. Currently, the burden would be on altair to define what kinds of charts (e.g. line, scatter, etc), but this isn't a 1-to-1 with how many of the other plotting libraries work.

My suggestion would be for polars to define a protocol with some chart methods, with altair implementing those that make sense for quick EDA. That way, these methods/functions may have less of a chance of leaking out into the rest of the altair api.

dangotbanned avatar Jul 03 '24 11:07 dangotbanned

Thanks @dangotbanned !

I would expect there would be more support if the implementation was different to the existing hvPlot (polars) / hvPlot accessor.

Definitely - plot is marked as unstable in Polars 😎 What I had in mind was just something like

class PlotNameSpace:
    def __init__(self, df: DataFrame) -> None:
        self.df = df
    def line(self, *args, **kwargs):
        return alt.Chart(self.df).mark_line().encode(*args, **kwargs).interactive()
    def point(self, *args, **kwargs):
        return alt.Chart(self.df).mark_point().encode(*args, **kwargs).interactive()
    ...

Anyway, we can figure out the details later, whether or not Altair decides to go ahead with Narwhals

(for the record, I think hvplot is amazing and its developers are really friendly, so I'd still advocating for it to be advertised in the Polars docs - but if Altair could provide a more lightweight alternative which doesn't require other heavy libraries, I think it'd be a more suitable default)

MarcoGorelli avatar Jul 03 '24 13:07 MarcoGorelli

Really awesome to see you were able to remove numpy as a hard dependency too! Its great to see this happening!

A few more comments while you might still have this PR open.

In my opinion, there is no need to serialize microseconds for type Date (also reducing characters in the vega-lite specification). If we define local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S" and use that when nw.Date applies and when nw.Datetime applies we can use f"{local_iso_fmt_string}%.f" or something similar.

A single comment like # using `strict=False` will return the `data` as-is if the object cannot be converted. would be appreciated for my future-self or other people who will be reading the following line: https://github.com/vega/altair/blob/7312f25eea31b6c86e273d591641b6c2fb494e3b/altair/utils/core.py#L477

This line can be removed: https://github.com/vega/altair/blob/7312f25eea31b6c86e273d591641b6c2fb494e3b/altair/utils/core.py#L451

Thanks again for pushing this!

mattijn avatar Jul 03 '24 14:07 mattijn

A bit off topic, but regarding polars.DataFrame.plot functionality: (1) you also might learn some lessons from this archived repo: https://github.com/altair-viz/pdvega, (2) there is also some work underway to extend the .interactive() method within altair https://github.com/vega/altair/pull/3394.

mattijn avatar Jul 03 '24 14:07 mattijn

This is exciting! :) I'd also like to have a quick look. Hopefully get to it this weekend. @jonmmease if you have a moment you might want to have a look as well as you've worked on the interchange protocol implementation and hence might be aware of edge cases we need to consider.

binste avatar Jul 03 '24 14:07 binste

I'm very interested in this PR, but I'm on holiday and won't have time to look at it until mid next week

jonmmease avatar Jul 04 '24 01:07 jonmmease

thanks for your lovely reviews! 🙏

I've updated to:

  • use the narwhals.stable.v1 (see here for the stable api policy)
  • Polars is in the dev section of pyproject.toml, whilst pandas and NumPy in all (but pandas is also in dev as per https://github.com/vega/altair/pull/3452#discussion_r1664492208)

~~EDIT: marking as draft as I think it should be possible to address https://github.com/vega/altair/issues/3471 with this PR too~~ ✅ done, and fixed another bug in the process

MarcoGorelli avatar Jul 07 '24 21:07 MarcoGorelli

@dangotbanned, I totally agree agree that you have been going very deep in the altair code recently! Which is very much appreciated. If being a maintainer would make it more easy for you to contribute more to altair, I would like to propose you to become an official maintainer.

mattijn avatar Jul 11 '24 19:07 mattijn

@MarcoGorelli

FYI, I think you should hold off on updating this branch until https://github.com/vega/altair/pull/3475 is resolved, to avoid getting an unrelated mypy error - like I did in a567c35 (#3467)

dangotbanned avatar Jul 12 '24 10:07 dangotbanned

Thanks @jonmmease for your careful review, really appreciate it! 🙏

But when the "vegafusion" data transformer is enabled, VegaFusion is able to detect which columns are needed and it will request only the required columns when loading into pyarrow (See here).

Aah, that's amazing! I'd missed that. To address this here, we've made a new release of Narwhals which includes interchange-level support - that is to say, if the input object implements the interchange protocol, then we let you inspect the schema

Like this, parse_shorthand doesn't need to do any data conversion, and so:

  • for Ibis users using the default data transformers, we fix the Date dtype bug
  • for Ibis users using the vegafusion data transformer, we preserves the status-quo 😇

(and maybe polars.LazyFrame should also implement the interchange protocol just at the metadata level, but that's one for a separate discussion)

MarcoGorelli avatar Jul 14 '24 19:07 MarcoGorelli

Thanks for the changes @MarcoGorelli!

I still have one question which I cannot really understand. Given the following example:

import datetime
import polars as pl
import pyarrow as pa
import narwhals as nw

data_type_date = [{"date": datetime.date(2024,7,15)}]
local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S"
nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]

pl_df = pl.DataFrame(data_type_date)
nw_pl_df = nw.from_native(pl_df, eager_only=True)
nw_pl_df.select(nw_selector).rows(named=True)

This returns

[{'date': '2024-07-15T00:00:00'}]

But if I do:

py_tb = pa.Table.from_pylist(data_type_date)
nw_py_tb = nw.from_native(py_tb, eager_only=True)
nw_py_tb.select(nw_selector).rows(named=True)

It returns

[{'date': '2024-07-15T00:00:00.000000'}]

I expected these output to be equal.

mattijn avatar Jul 15 '24 09:07 mattijn

thanks @mattijn for taking a look!

Libraries differ in how they interpret strftime directives - there's an issue about this in Arrow https://github.com/apache/arrow/issues/20146 We also have a note about this to the Narwhals docs: https://narwhals-dev.github.io/narwhals/api-reference/expressions_dt/#narwhals.expression.ExprDateTimeNamespace.to_string

Practically speaking, if '2024-07-15' gets transformed to

  • '2024-07-15T00:00:00' for Polars and pandas-like libraries, and
  • '2024-07-15T00:00:00.000000' for PyArrow

does that cause any issues?

Note that, if the datetime contains fractional seconds, then the outputs match

In [2]: import datetime
   ...: import polars as pl
   ...: import pyarrow as pa
   ...: import narwhals as nw
   ...:
   ...: data_type_date = [{"date": datetime.datetime(2024,7,15,1,2,3,123456)}]
   ...: local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S%.f"
   ...: nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]
   ...:
   ...: pl_df = pl.DataFrame(data_type_date)
   ...: nw_pl_df = nw.from_native(pl_df, eager_only=True)
   ...: nw_pl_df.select(nw_selector).rows(named=True)
Out[2]: [{'date': '2024-07-15T01:02:03.123456'}]

In [3]: py_tb = pa.Table.from_pylist(data_type_date)
   ...: nw_py_tb = nw.from_native(py_tb, eager_only=True)
   ...: nw_py_tb.select(nw_selector).rows(named=True)
Out[3]: [{'date': '2024-07-15T01:02:03.123456'}]

MarcoGorelli avatar Jul 15 '24 09:07 MarcoGorelli

Yes for type nw.Datetime it works correctly. The question I have is about the nw.Date type.

One change with your latest comment compared to my comment above is that for me the local_iso_fmt_string does not contain a %.f.

I thought it was an issue with pyarrow, but if I do:

import pyarrow.compute as pc
pc.strftime(py_tb['date'], format=local_iso_fmt_string).to_pylist()

It returns

['2024-07-15T00:00:00']

That seems right. And in the narwhals docs you link to (thanks for the link!) it is mentioned that:

for PyArrow, we replace "%S.%f" with "%S".

That makes me even more confused with your statement:

Practically speaking, if '2024-07-15' gets transformed to

  • '2024-07-15T00:00:00' for Polars and pandas-like libraries, and
  • '2024-07-15T00:00:00.000000' for PyArrow

Thanks for answering so promptly! Appreciated.

mattijn avatar Jul 15 '24 10:07 mattijn

The difference is the .cast(nw.Datetime) 😉 If you start with a datetime, then PyArrow's strftime outputs fractional seconds even if we just specify '%S':


In [1]: import pyarrow as pa

In [2]: from datetime import date, datetime

In [3]: import pyarrow.compute as pc

In [4]: pc.strftime(pa.array([datetime(2020, 1, 1)]), '%Y-%m-%dT%H%:%M%:%S')
Out[4]:
<pyarrow.lib.StringArray object at 0x7fd513149cc0>
[
  "2020-01-01T00%:00%:00.000000"
]

So, in summary:

  • if we're starting with dates, then we cast to datetime and format with '%Y-%m-%dT%H:%M:%S'. PyArrow differs from other libraries here in that it outputs '.000000' too
  • if we start with datetimes, then we format with '%Y-%m-%dT%H:%M:%S%.f'

MarcoGorelli avatar Jul 15 '24 10:07 MarcoGorelli

Interesting... Let me see if I can find something on this at the pyarrow library. Thanks!

mattijn avatar Jul 15 '24 10:07 mattijn

We have to cast the nw.Date type to nw.Datetime only because polars does not support time directives isn't it? Maybe we check upfront if we have to cast or not. @jonmmease, how important is this? Do you have some suggestions here. Its about this piece: https://github.com/vega/altair/blob/8b4b3dbeadbe4459eecbe0ad4459e6fb701dcdfe/altair/utils/core.py#L451-L456

mattijn avatar Jul 15 '24 11:07 mattijn

Sure - if you're OK with a teeny-tiny bit of Polars-specific logic, this can work 👍

        if dtype == nw.Date and nw.get_native_namespace(data) is get_polars():
            # Polars doesn't allow formatting `Date` with time directives.
            # The date -> datetime cast is extremely fast compared with `to_string`
            columns.append(
                nw.col(name).cast(nw.Datetime).dt.to_string(local_iso_fmt_string)
            )
        elif dtype == nw.Date:
            columns.append(nw.col(name).dt.to_string(local_iso_fmt_string))
        elif dtype == nw.Datetime:
            columns.append(nw.col(name).dt.to_string(f"{local_iso_fmt_string}%.f"))

Just pushed a commit to do that

MarcoGorelli avatar Jul 15 '24 11:07 MarcoGorelli

thanks @jonmmease ! review comments look great to me, much appreciated 🙏 ✅

MarcoGorelli avatar Jul 15 '24 15:07 MarcoGorelli

Thanks to everyone involved! This is very exciting and we should cut a release soon :) I'll open a new discussion for it.

binste avatar Jul 15 '24 17:07 binste