
Feature Request: Direct Read/Write Support for Polars DataFrames

Open albertxli opened this issue 11 months ago • 6 comments

Hi there,

I’d like to request a feature that allows direct reading and writing of data files into a Polars DataFrame, in addition to the current support for pandas. While I’m aware that converting between pandas and Polars is a viable workaround, it can be costly—particularly with large files—due to the overhead of switching formats. I think native Polars support would let us fully leverage Polars’ performance advantages without incurring extra conversion steps.

Thank you for considering this request, and I look forward to any feedback or discussion on potentially adding Polars integration!

albertxli avatar Jan 22 '25 03:01 albertxli

Implementing reading into a polars dataframe would be easy, as everything is in place. If you check the README, it describes how to do it without passing through pandas. I can put that inside the function easily.

Writing from polars would be much more involved, because the writer needs to be redesigned to switch between polars and pandas. I can think about it and do it in the future, but it will take time. I will also prioritize writing pandas 3 dataframes.

ofajardo avatar Jan 27 '25 15:01 ofajardo

I believe it would be extremely helpful if you could demonstrate or add a function that allows direct reading of a dataset into a Polars DataFrame. I fully understand the complexity involved in enabling direct file writing from Polars, but I would love to see this feature sometime in the near future—especially given how Polars is gaining significant traction. Its intuitive, straightforward syntax makes a fully integrated experience for reading and writing DataFrames a natural next step.

Thank you again for all your excellent work on this library. It has been a fantastic alternative for those of us who frequently work with statistical files.

albertxli avatar Feb 09 '25 21:02 albertxli

@ofajardo @ninjaalbie @jccguma At this point, to support Pandas, Polars, and DuckDB with a single implementation, maybe you could consider outputting an Apache Arrow RecordBatch which implements the Python dataframe interchange protocol? This also means pyreadstat need no longer be coupled to Pandas. To deal with files larger than working memory (typical of SAS), you could also output a RecordBatchReader.

mettekou avatar Feb 18 '25 08:02 mettekou

thanks @mettekou that is a very interesting suggestion, I was not aware of this feature; it can indeed be a good solution to have a unified interface to deal with Pandas and Polars. Do you know by chance how the memory management works? I.e., when converting, do they copy all the data or just reference the original dataframe? What about the speed of the conversion? Those are the only possible drawbacks I can see.

ofajardo avatar Feb 21 '25 16:02 ofajardo

I'd also be interested in reading SPSS files without the Pandas dependency since I've switched to Polars. Using output_format="dict" already makes this possible, but you might want to consider removing pandas from the dependencies (and make it optional instead).

cbrnr avatar Mar 12 '25 09:03 cbrnr

> thanks @mettekou that is a very interesting suggestion, I was not aware of this feature; it can indeed be a good solution to have a unified interface to deal with Pandas and Polars. Do you know by chance how the memory management works? I.e., when converting, do they copy all the data or just reference the original dataframe? What about the speed of the conversion? Those are the only possible drawbacks I can see.

@ofajardo For Pandas, it seems not all conversions from Arrow are zero-copy. For Polars, coming from Arrow is mostly zero-copy.

> I'd also be interested in reading SPSS files without the Pandas dependency since I've switched to Polars. Using output_format="dict" already makes this possible, but you might want to consider removing pandas from the dependencies (and make it optional instead).

@cbrnr Indeed, and also make Polars optional and just provide Arrow by default, so you cover every current and future data frame library.

mettekou avatar Mar 12 '25 10:03 mettekou

I am documenting some of my experiments here:

First, I did memory profiling using memray, writing a pandas dataframe to an SPSS file. First a very small dataframe of 1 MB: in addition to the 1 MB consumed by the dataframe, write_sav added 4 MB. Then a bigger dataframe of 1 GB: in addition to the 1 GB consumed by the dataframe, write_sav again added only 4 MB. So it seems that write_sav is currently very lean on memory consumption, which is good news.

Now, if I used pyarrow.interchange.from_dataframe in write_sav to first transform the pandas dataframe into a pyarrow structure, I would need to make a copy of the original pandas dataframe, as I cannot modify the pandas dataframe provided by the user. The question is how much memory that copying consumes, i.e. does it make a full copy, so that memory consumption doubles? Or does it somehow reference the underlying structures in memory, so that memory does not double? If the latter, how large is the memory increase?

Using the 1 GB pandas dataframe, I transformed it into a new pyarrow structure using pyarrow.interchange.from_dataframe. The pyarrow structure added 1 GB on top of the 1 GB consumed by the pandas dataframe. This happens regardless of whether the backend of the pandas dataframe is numpy or pyarrow. So it seems that it does a full copy, doubling the memory consumption, which is not good news. pyarrow.Table.from_pandas was even worse, somehow adding 2 GB instead of 1 to the process.
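A similar measurement can be approximated with the stdlib tracemalloc instead of memray (a rough sketch, not the original measurement setup; the dataframe here is only ~8 MB so the snippet runs quickly):

```python
import tracemalloc
import numpy as np
import pandas as pd

# Build a dataframe of roughly 8 MB (one million float64 values).
df = pd.DataFrame({"x": np.arange(1_000_000, dtype="float64")})

# Measure the peak extra memory taken by a copying step; a full copy
# should add roughly the size of the dataframe itself.
tracemalloc.start()
copied = df.copy(deep=True)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak additional memory: {peak / 1e6:.1f} MB")
```

The same pattern, with pyarrow.interchange.from_dataframe(df) in place of df.copy(deep=True), reproduces the doubling described above.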

ofajardo avatar Jun 25 '25 15:06 ofajardo

Hi,

Thanks for looking into this. I am not too familiar with how the backend processes work, but I recently came across a library named "narwhals" https://github.com/narwhals-dev/narwhals that allows you to write "dataframe-agnostic" packages that will work with almost any popular Python dataframe library out there, including pandas, polars, duckdb, and so on. From reading the GitHub repo, it seems this could provide some ways to make the pyreadstat library fit with any dataframe library much more easily.

Here is an interesting read from the Polars website on how "Altair" (a plotting library), which previously only supported Pandas and PyArrow, was able to extend its compatibility to Polars through the Narwhals expressive API: https://pola.rs/posts/lightweight_plotting/

Again, I am not a software developer myself, so I really can't comment on how easy it would be to use Narwhals to make pyreadstat a completely "dataframe-agnostic" library, but I hope this provides some new ways of thinking about different approaches.

Thanks again!

albertxli avatar Jun 25 '25 17:06 albertxli

Thanks for the suggestion, I will look into it

ofajardo avatar Jun 25 '25 18:06 ofajardo

hey, narwhals looks really awesome! It looks really fit for purpose for the problem at hand. I did a quick test, writing the large 1 GB dataframe of integers to an SPSS sav file, internally "narwhalifying" the dataframe (I have not tested polars yet). In terms of memory usage, the overhead added by narwhals was minimal! In terms of speed it was equivalent! And it definitely simplifies the code.

So, it looks promising and therefore I will continue looking into it. I have to say there are many things to do here before it would be ready, and I may find a blocker down the road. In that case I would reach out to the narwhals team for help. Unfortunately, the time I can spend on this project right now is also limited, so it will definitely take a while.

ofajardo avatar Jun 26 '25 16:06 ofajardo

Thanks for the update and I am really glad to hear that Narwhals can help to achieve the goals! Of course, I totally understand this will take some time and can't wait until the updated package that will support all mainstream python dataframe libraries.

Good luck, let me know if I can help with looking into any other tools to make the transition a bit easier, happy to help anytime!

albertxli avatar Jul 07 '25 21:07 albertxli

Good news! Support for polars is ready on the branch test_narwhals! I would be very grateful if people could test it before release, with both polars and pandas. You can either build from that branch or download and install a wheel from here. Please provide feedback!

At the moment, all the documentation still has to be updated.

In order to write, just pass the polars dataframe. In order to read, set the argument output_format="polars".

Narwhals has helped a lot and simplified things. The narwhals team was really helpful and approachable.

Still, there was a lot of work on my side involved in adjusting things to work for polars, and particularly in making things consistent between pandas and polars. So it is not as if narwhals makes things work automatically for other dataframe packages. Right now, none of the other packages supported by narwhals works out of the box with pyreadstat, therefore support is restricted to pandas and polars, with no plan to expand for now.

ofajardo avatar Aug 02 '25 14:08 ofajardo

Thanks a lot for the update, this is so exciting! I will spend some time this upcoming week doing some testing on both pandas and polars dataframes with the test 'ver. 1.3.1' and report back with any feedback.

albertxli avatar Aug 03 '25 14:08 albertxli

Hi, I am happy to report that the tests are pretty positive for the most part. I tested the read and write functions with both pandas and polars on sav files. On the pandas side, things are pretty consistent and the performance is stable. However, I've found an error while using polars, see below:

=== Package Versions ===
polars         : 1.32.2
pandas         : 2.3.1
numpy          : 2.3.2
pyreadstat     : 1.3.1

=== System Info ===
Python         : 3.13.5
Platform       : Windows-11-10.0.26100-SP0
Architecture   : 64bit


df, meta = pyreadstat.read_sav(
    r"C:\Users\user\Downloads\latin1_encoing_spss.sav",
    output_format="polars", 
    encoding="latin1",
    apply_value_formats=True)
ERROR MESSAGE:
Traceback (most recent call last):
  File "C:\Users\user\Downloads\uvtesting1\.venv\Lib\site-packages\marimo\_runtime\executor.py", line 138, in execute_cell
    exec(cell.body, glbls)
    ~~~~^^^^^^^^^^^^^^^^^^
  Cell "marimo://pyrstesting.py#cell=cell-1", line 1, in <module>
    df, meta = pyreadstat.read_sav(
        r"C:\Users\user\Downloads\latin1_encoing_spss.sav",
        output_format="polars",
        encoding="latin1",
        apply_value_formats=True)
  File "pyreadstat/pyreadstat.pyx", line 397, in pyreadstat.pyreadstat.read_sav
  File "C:\Users\user\Downloads\uvtesting1\.venv\Lib\site-packages\pyreadstat\pyfunctions.py", line 72, in set_value_labels
    df_copy = df_copy.with_columns(nw.col(var_name).cast(nw.Categorical))
  File "C:\Users\user\Downloads\uvtesting1\.venv\Lib\site-packages\narwhals\dataframe.py", line 1481, in with_columns
    return super().with_columns(*exprs, **named_exprs)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\uvtesting1\.venv\Lib\site-packages\narwhals\dataframe.py", line 165, in with_columns
    return self._with_compliant(self._compliant_frame.with_columns(*compliant_exprs))
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\uvtesting1\.venv\Lib\site-packages\narwhals\_polars\dataframe.py", line 357, in func
    raise catch_polars_exception(e) from None
narwhals.exceptions.ComputeError: cannot cast 'Object' type 

I used both uv and conda environments to test the same script and got the same error. This error only arose with encoding="latin1" and apply_value_formats=True while using Polars. If I set apply_value_formats=False, then reading with Polars works. The same parameters work well with pandas. This seems to be an issue tied to narwhals' data types. Anyway, this is the only major thing I found in the limited time I have for testing. I will report back if I encounter any other errors. Thanks a lot!

albertxli avatar Aug 08 '25 21:08 albertxli

Thanks for testing! Would it be possible for you to share the file for me to debug? If not, maybe you could create a dummy file to reproduce the error. I think the key is that the labels somehow produce a column of type Object, because it mixes strings and numbers, and polars refuses to transform that into a category.

ofajardo avatar Aug 09 '25 05:08 ofajardo

What happens if you set formats_as_category=False? What happens in polars and in pandas?

ofajardo avatar Aug 09 '25 05:08 ofajardo

This is a data file produced by one of our partners, so unfortunately it's not sharable. I tried to re-create a dummy data file with some latin1 characters in both column labels and value labels, and saved it using the latest SPSS version 31 with latin1 encoding. The dummy data worked really well, with no issues when reading with Polars and apply_value_formats set to True. I assume our partner might be using a really old SPSS version, so that when reading it into label formats some parts just broke.

I also tried formats_as_category=False as you suggested; pyreadstat was then able to successfully read it into a polars dataframe with the labels as data. All the categorical variables are read as type 'object', but I can convert them into categorical data if needed.

I'll try to see if I can make a dummy for you to test, but my system default is utf8 and SPSS doesn't allow changing the encoding while writing. Hope this helps.

albertxli avatar Aug 09 '25 19:08 albertxli

thanks for the info. What I will do is check if the series is of object type and, if so, not try to cast to categorical, i.e. you would get the same as formats_as_category=False. How are you testing right now, from the test_narwhals branch or from the wheels?

ofajardo avatar Aug 10 '25 08:08 ofajardo

Ok, I have added some code to circumvent the issue; it is available on the test_narwhals branch, and the wheels are in the same place. Could you please try it and let me know?

ofajardo avatar Aug 11 '25 08:08 ofajardo

Great, thanks for the update. I will test again today from the wheels and report back the results.

albertxli avatar Aug 11 '25 14:08 albertxli

I managed to create a dummy data set that showcases some of the issues described below; you can download it from here. In testing, the dummy data yields the same warning message as the actual data.

=== Package Versions ===
polars         : 1.32.2
pandas         : 2.3.1
pyreadstat     : 1.3.1 #From the wheels

=== System Info ===
Python         : 3.13.5
Platform       : Windows-11-10.0.26100-SP0
Architecture   : 64bit
1st run-->

df, meta = pyreadstat.read_sav("latin1_dummy_2.sav",
                               output_format="polars",
                               encoding='windows-1252',
                               apply_value_formats=True,
                               formats_as_category=True)

And got the following warnings:
C:\Users\user\Downloads\pyrstesting_3.13\.venv\Lib\site-packages\pyreadstat\pyfunctions.py:61: RuntimeWarning: You requested formats_as_category=True or formats_as_ordered_category=True, but it was not possible to cast variable 'S4' to category
  warnings.warn(msg, RuntimeWarning)

You can see the print of the data looks like this. Column "S4" is supposed to be read as a "cat" column but is instead read as an "object". The rest of the columns are read as "int" columns, but they should be "cat" columns. S4 looks like the breaking point.

2nd run -->
df, meta = pyreadstat.read_sav("latin1_dummy_2.sav",
                               output_format="polars",
                               encoding='windows-1252',
                               apply_value_formats=True,
                               formats_as_category=False)

See print here. No warning messages this time.

I also tested both scenarios with pandas, and things just worked with no warning message. One theory I have concerns how Polars treats missing data, given that column "S4" contains missing data despite being a categorical column. Polars' default missing value is null instead of pandas' NaN; do you think this could play a role here?

I have sent you an invite, so you should be able to see the dummy file and all result links inside this reply. Let me know if I can help with anything else.

albertxli avatar Aug 11 '25 17:08 albertxli

Thanks a lot for the sample file, that was really useful! The problem was that I have to check whether the elements in the column, after applying the labels, are all strings or of the same type, and I was not taking into account that some values could be null. So indeed, as you said, the problem was in the nulls. In my opinion the problem has nothing to do with the encoding, because for me the warning also arose without using the encoding argument. BTW, without the encoding argument all the data look fine, so to me it looks more like a UTF-8 file than one in another encoding, but I cannot say for sure.

In any case, I have solved the issue (in my hands) and produced new wheels. Would you be so kind as to test the new wheels once again?

Thanks!

ofajardo avatar Aug 12 '25 10:08 ofajardo

Of course. I have done another round of testing with both the dummy_2 data and my original data. The "dummy_2" data passed the test, so all good there. However, I ran into the following error with my full data, and I have created dummy_3.sav so you can reproduce it:

df, meta = pyreadstat.read_sav("latin1_dummy_3.sav",
                                   output_format="polars",
                                   apply_value_formats=True,
                                   formats_as_category=False)
------------------------------------------------------------------------------------------------------------------------------

Traceback (most recent call last):
  File "pyrstesting_3.13_aug_12\.venv\Lib\site-packages\marimo\_runtime\executor.py", line 138, in execute_cell
    exec(cell.body, glbls)
    ~~~~^^^^^^^^^^^^^^^^^^
  Cell "marimo://test.py#cell=cell-2", line 1, in <module>
    df, meta = pyreadstat.read_sav("latin1_dummy_3.sav",
                                       output_format="polars",
                                       apply_value_formats=True,
                                       formats_as_category=False
                                )
  File "pyreadstat/pyreadstat.pyx", line 397, in pyreadstat.pyreadstat.read_sav
  File "C:\Users\user\Downloads\pyrstesting_3.13_aug_12\.venv\Lib\site-packages\pyreadstat\pyfunctions.py", line 64, in set_value_labels
    df_copy = df_copy.with_columns(nw.col(var_name).replace_strict(labels))
  File "C:\Users\user\Downloads\pyrstesting_3.13_aug_12\.venv\Lib\site-packages\narwhals\dataframe.py", line 1487, in with_columns
    return super().with_columns(*exprs, **named_exprs)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\pyrstesting_3.13_aug_12\.venv\Lib\site-packages\narwhals\dataframe.py", line 171, in with_columns
    return self._with_compliant(self._compliant_frame.with_columns(*compliant_exprs))
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\pyrstesting_3.13_aug_12\.venv\Lib\site-packages\narwhals\_polars\dataframe.py", line 348, in func
    raise catch_polars_exception(e) from None
narwhals.exceptions.InvalidOperationError: conversion from `f64` to `null` failed in column 'literal' for 1 out of 2 values: [1.0]

In the new dummy_3 data, I have added two additional columns in which all values are missing. This is how our full data looks, due to variation across data collection phases; in certain phases these columns will not be completely empty. But in this dataset, S5_1 and S5_2 contain only missing data, so I think that is what is causing the issue here. Happy to test it again once fixed!

albertxli avatar Aug 12 '25 21:08 albertxli

Thanks! I was already wondering whether the null columns could cause problems, so the example is really helpful. It is fixed now (basically, I am ignoring null columns, because there is no format to apply). Please test again and let me know what you think.

ofajardo avatar Aug 13 '25 10:08 ofajardo

Happy to let you know the error is gone and the null columns are read as intended, thank you for fixing it! While running tests on the actual data file, I got the following warning message, so I created dummy_4 data for you to test:

df, meta = pyreadstat.read_sav("latin1_dummy_4.sav",
                               output_format="polars",
                               apply_value_formats=True,
                            )

C:\Users\user\Downloads\pyrstesting_3.13_aug_13\.venv\Lib\site-packages\pyreadstat\pyfunctions.py:64: RuntimeWarning: You requested formats_as_category=True or formats_as_ordered_category=True, but it was not possible to cast variable 'D1' to category
  warnings.warn(msg, RuntimeWarning)

In the new dummy_4.sav file, I added a new D1 column. The tricky part about this column is that it's supposed to be a categorical column by definition, but the data somehow involves a value '6' which doesn't have a value label in the meta. So polars turned it into an object column, which is fair given the mixed data types. Interestingly, with the same read config, pandas reads D1 as a categorical column, so I am curious to hear your opinion on this. I don't think this is an error, but I just want to point it out to make the read experience more robust. This metadata/data value mismatch happens very often when we receive sav data from a partner, so I think it is nice to get a warning that a certain data value doesn't have a value label in the metadata.

albertxli avatar Aug 13 '25 16:08 albertxli

Right, the warning comes from me, and it appears because you have (by default) formats_as_category=True. If you set it to False, the warning is gone. Pandas is happy to create categories out of mixed-type columns, but polars raises an exception (it is not possible to cast Object to Categorical), so I do not see any alternative. Do you have any suggestions?

ofajardo avatar Aug 13 '25 16:08 ofajardo

Got it, and all good from my end. I think this warning message serves a good purpose. A big reason I am migrating from pandas to polars is its stricter data types: ideally each column should contain just a single data type, especially for categorical data. So it is great to have the warning there; because of it I realized we have unlabeled data values, so I can go fix them.

Thank you again for adding the native experience of reading and writing with Polars!

albertxli avatar Aug 13 '25 20:08 albertxli

Ok cool. I will release it soon!

ofajardo avatar Aug 13 '25 20:08 ofajardo

Ready in version 1.3.1! If new problems arise, please open a separate issue.

ofajardo avatar Aug 14 '25 14:08 ofajardo

Thank you, will do!

albertxli avatar Aug 14 '25 14:08 albertxli