fastparquet
Write without loading to RAM (skip pandas?)
My understanding of the pandas library is that it requires loading the entire dataset into memory. Is there any way to avoid this requirement and write data from a stream or a stored file, without having to preload the entire table into RAM as a pandas DataFrame?
My concern with using this library is that it may fail with larger source data files. Is there any collective best practice or mitigation to avoid such failures? Note that this concern applies to very large datasets, but also to small worker nodes (e.g. in a CI/CD stack) with small amounts of RAM (1-4 GB).
In short: yes, it is often possible to load and process pandas datasets by chunk, and some of the loaders (CSV, ...) have methods for doing that. For this library, you can use fastparquet.ParquetFile.iter_row_groups. A "row group" is a logical unit within parquet, and you cannot iterate in smaller pieces.
However, you might find that dask is your best bet for processing bigger-than-memory datasets in a more general sense than iterating over row groups.
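A minimal sketch of reading one row group at a time (the file name and the process() function are placeholders, not from the original thread):

import fastparquet

pf = fastparquet.ParquetFile("large_data.parquet")

# Each iteration yields one row group as a pandas DataFrame,
# so only that row group has to fit in memory at a time
for df in pf.iter_row_groups():
    process(df)  # hypothetical per-chunk processing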
Awesome, thank you. I think using the chunksize argument for read_csv() should mitigate this issue. I should be able to define a configurable variable like max_partition_rows or chunksize and then pass one "chunk" at a time to fastparquet's write() function. (Also, I should have clarified in my original post that I'm specifically looking to write parquet files with this library.)
Pseudocode for anyone else interested:
import pandas as pd
from fastparquet import write

# Read the CSV lazily; each chunk is a pandas DataFrame
data_iterator = pd.read_csv("large_data.csv", chunksize=100000)

for i, data_chunk in enumerate(data_iterator):
    # Write the first chunk to a new file, then append subsequent chunks
    write("large_data.parquet", data_chunk, append=(i > 0))
Feel free to close this issue as needed.
For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.
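A sketch of that separate-files approach with dask (the file names, blocksize and output directory are illustrative assumptions):

import dask.dataframe as dd

# Read the CSV lazily in partitions and write one parquet file per partition
ddf = dd.read_csv("large_data.csv", blocksize="64MB")
ddf.to_parquet("output_parquet/", engine="fastparquet")

# The directory of files can later be loaded together as one logical dataset
ddf2 = dd.read_parquet("output_parquet/", engine="fastparquet")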
For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.
Hi Martin, please, could you explain why you advise against using append in such a case? I intend to do the same, but if it is not recommended, I would prefer to know why.
Basically, my understanding is that when you append to an existing parquet dataset, the metadata gets updated. Later on, I can then use the metadata to select the data I want to load again (speaking about time series: I will be able to know in which parquet file the timestamp range of interest is located, thanks to the min/max timestamp per file recorded in the metadata).
If I write one file at a time, then the metadata does not get consolidated, and selective loading / loading by chunk becomes more difficult, does it not?
Thanks for your advice on this, best
Please, could you explain why you advise against using append in such a case?
Each append requires reading the whole metadata, altering it in memory, and then writing it all back to file again. With detailed delving into the thrift code, it *would* be possible to read up to a certain row group in the metadata and start writing the new metadata there; but this code doesn't exist, and I think it would be hard to write. The usefulness of a _metadata file for a dataset that is evolving is questionable.
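For the separate-files approach, selective loading by timestamp should still be possible without a consolidated _metadata file, since fastparquet can open a list of files as one dataset and use per-row-group column statistics when filtering. A hedged sketch (the directory layout and the "timestamp" column are assumptions for illustration):

import glob
import pandas as pd
from fastparquet import ParquetFile

# Open all the separately written part files as one logical dataset
parts = sorted(glob.glob("output_parquet/*.parquet"))
pf = ParquetFile(parts)

# Column min/max statistics let fastparquet skip row groups that cannot match
df = pf.to_pandas(filters=[("timestamp", ">=", pd.Timestamp("2021-01-01"))])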