[Issue-308]: Add support for reading directly from file handles

Open slobodan-ilic opened this issue 3 months ago • 3 comments

File Handle Support for SAV Files

Summary

Adds support for reading SAV files directly from file-like objects (e.g., zip archives, BytesIO) without extracting to disk.

Use Case

Read large SAV files from zip archives without temporary file extraction:

    import zipfile
    import pyreadstat

    with zipfile.ZipFile('data.zip', 'r') as zf:
        with zf.open('survey.sav') as f:
            df, meta = pyreadstat.read_sav(f)  # No extraction needed!
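For contrast, a path-only reader typically forces a detour through a temporary file. The following is a minimal sketch of that workaround, not code from this PR: `read_member_via_tempfile` is a hypothetical helper, and `reader` stands for any callable that accepts a filesystem path (in practice, `pyreadstat.read_sav`).

```python
import tempfile
import zipfile


def read_member_via_tempfile(zip_path, member, reader):
    """Extract one archive member to a temporary directory and hand its path to `reader`.

    `reader` is an assumption of this sketch: any callable taking a file path.
    """
    with zipfile.ZipFile(zip_path, "r") as zf:
        with tempfile.TemporaryDirectory() as tmpdir:
            # extract() writes the member to disk and returns the resulting path.
            extracted_path = zf.extract(member, path=tmpdir)
            # The temporary copy is deleted when the block exits, so the
            # reader must finish with the file before returning.
            return reader(extracted_path)
```

The extra disk write and cleanup are exactly what reading from the open handle avoids, which matters most for large archives.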

Implementation

  • Leverages ReadStat's built-in I/O handler system
  • Custom Cython handlers bridge Python file objects to C callbacks
  • Unified parser function handles both paths and file objects
  • Zero breaking changes - existing code works unchanged
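The Cython bridge itself is not shown in this thread. As a rough, Python-only illustration of the idea, the sketch below wraps a file object behind open/read/seek/close methods that mirror what a C-style I/O handler set would call. `FileObjectIO`, its method names, and the `0`/`-1` return conventions are assumptions made for this example, not the PR's actual code.

```python
import io


class FileObjectIO:
    """Hypothetical Python stand-in for the callbacks a C I/O handler set would invoke."""

    def __init__(self, fileobj):
        self._f = fileobj

    def open(self):
        # The handle is already open; rewind so parsing starts at byte 0.
        self._f.seek(0)
        return 0

    def read(self, nbytes):
        # Return up to nbytes; a short or empty result signals end of data.
        return self._f.read(nbytes)

    def seek(self, offset, whence=io.SEEK_SET):
        # Return the new absolute position, or -1 on failure (C-style signalling).
        try:
            return self._f.seek(offset, whence)
        except (OSError, ValueError):
            return -1

    def close(self):
        # The caller owns the file object, so it is deliberately left open.
        return 0
```

Leaving the underlying object open in `close()` keeps ownership with the caller's `with` block, which seems the least surprising choice for handles passed in from outside.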

Testing

  • All existing tests pass (92 tests)
  • New integration test validates zip archive reading
  • Data integrity verified (identical output to path-based reading)

Changes

  • pyreadstat.read_sav() now accepts file-like objects with read() and seek() methods
  • +132 net lines (after refactoring to eliminate duplication)
  • Extensible to other formats (DTA, SAS7BDAT, etc.)
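The "file-like object with read() and seek()" requirement amounts to a duck-typing check in front of the parser. A minimal sketch of such a dispatch follows, under the assumption that paths may be `str`, `bytes`, or `os.PathLike`; `is_file_like` and `classify_source` are hypothetical names, not functions from pyreadstat.

```python
import os


def is_file_like(obj):
    """True if obj exposes callable read() and seek(), the two methods the PR requires."""
    return callable(getattr(obj, "read", None)) and callable(getattr(obj, "seek", None))


def classify_source(source):
    """Return 'fileobj' for file-like inputs and 'path' for str, bytes, or os.PathLike."""
    if is_file_like(source):
        return "fileobj"
    if isinstance(source, (str, bytes, os.PathLike)):
        return "path"
    raise TypeError(
        f"expected a path or a file-like object, got {type(source).__name__}"
    )
```

Checking for the methods rather than for a concrete class lets zip members, `BytesIO` buffers, and ordinary open files all take the same route.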

slobodan-ilic, Oct 13 '25 17:10

hi, thanks a lot for the PR! It is really great!

Is this ready or WIP?

I read your PR quickly, and I have a couple of questions, a couple of asks and one suggestion:

questions:

  • Why is _read_fileobj implemented only for SAV? Would the other reading functions benefit from it as well?
  • In read_sav, a file object is routed to _read_fileobj, which skips apply_value_formats. Why? I think that step is still needed.

suggestions:

  • Would it be possible to just use the regular run_conversion instead of having a separate _read_fileobj? It looks to me as if run_conversion could be used out of the box with the changes you made. Maybe _read_fileobj is something you added for testing purposes and can now be discarded? This would resolve both of my questions above.

asks:

  • Could you add a new sub-section to the README, under Usage -> more reading options, briefly describing this new feature, with a couple of examples of how it could be used? I am thinking of your zip use case, and also of what is described in #279 (reading from a remote file), if it applies.
  • Could you move test_read_sav_from_zip_file_handle into test_narwhalified.py and make sure it works for both pandas and polars?

ofajardo, Oct 14 '25 13:10

Hey @ofajardo, thanks for your comment. I think all of your observations are valid. I should've created the PR as a draft (I converted it to one just now). I first want to make sure our team can use it for what they need, and if it proves usable, I'll get back to the PR, tighten it up, and address all of your concerns.

slobodan-ilic, Oct 14 '25 14:10

Awesome! Thanks!

ofajardo, Oct 14 '25 15:10