stata-parquet
Rust-based parquet reader in stata
I hope it isn't presumptuous of me to post this here. I've developed a parquet reader/writer for Stata using Rust and polars: stata_parquet_io. I tried your excellent package with some success (it took some work updating the C++ code to build against more recent arrow releases). However, the arrow library just isn't that fast at reading parquet files relative to alternatives like polars or duckdb (or I did something wrong). I figured it was better to start from scratch and rely heavily on polars for the parquet file handling and batching than to rework this code to use something besides arrow (maybe the wrong decision, but it's too late now...).
Before that, I had also tried using python, but Stata's python interface is very slow at serializing data from pandas except for relatively small files. For anything larger, I found it was better to use file I/O: have pandas write dta files in separate threads while Stata loaded the completed chunks in the main thread (actually parquet->polars->pandas->dta->stata). That was the fastest I could get things for large files (100 million+ rows), but it was still 100x or more slower than Stata loading the equivalent dta file (and even slower than polars loading the parquet file in python). Processing that would take a minute in polars could take an hour or more in Stata, mostly because of slow reads.
Anyways, I'd be very curious to hear your thoughts, if you have the interest and time, as I'm sure I've left things out or missed things you already thought of or addressed. I'm also curious for you and others to test whether my pre-built binaries work. I'm hoping that by pre-compiling things, they'll be a little easier for others to use. I only have access to a single Windows computer and Linux setup to test on - you can find the latest binaries I'm testing at https://github.com/jrothbaum/stata_parquet_io/actions under "build and release"->Artifacts.
@jrothbaum I should have time over the summer. I've had the occasional e-mail about parquet so I do have some interest in a working interface for Stata. This is an interesting idea: I have no experience with Rust or polars but it would be good to learn a bit. I don't have time for the next two weeks but I'll have a bit after.
In the meantime I do have a question: I did this so long ago now, but my recollection was that Stata can read .dta files into memory at basically the read speed of the disk where the data is stored. I thought the main limitation was reading chunks therein: subsets or specific columns, where Stata became painfully slow and other formats really shone. Is that not the case?
The overhead of serializing into Stata does make reading from parquet slower than reading the same full dta file, often 10x or more; Stata is genuinely fast at reading dta files. On the other hand, Stata speeds up only slightly when reading a subset of columns from a dta file, whereas reading from parquet speeds up roughly in proportion to the share of columns skipped (I have some basic benchmarks at https://github.com/jrothbaum/stata_parquet_io).
Honestly, for me the bigger benefit of having reasonably performant parquet IO in Stata is in multi-language pipelines. Stata is so much slower at working with large datasets (joins, recodes, etc.) that I'd much rather work in python/polars or duckdb, but I don't have any interest in running regressions in python. I started on the Rust-based plugin less to make Stata more efficient on its own than to make it usable at all in my work. I regularly work with datasets of 100 million+ records, and saving a separate dta copy of everything is not feasible. Polars (and hence Rust) just saved me a lot of the overhead of figuring out how to stream-read/write parquet files with reasonable performance.
Constantly needing to convert files to dta and back is a major bottleneck, and I wish Stata would do a better job of making tools available for interfacing with other file formats (how can there be no way to assign values to strL columns in the C plugin API!!!!) or developing IO tools for other formats that aren't so slow. Having said that, I get that they have limited resources and a lot of directions in which they could improve their product.