Memory consumption?
Does this extension need to uncompress the Parquet file fully into memory, or can it "stream" the Parquet file by row groups? E.g. in the case of SELECT * FROM parquet('file.parquet').
In other words, does the engine need to have memory available for a full uncompressed Parquet data blob?
The intent was that it doesn't uncompress the Parquet file fully into memory, although I don't think that I specifically tested that.
The library itself doesn't create any buffers for more than a row of data at once.
IIRC, the underlying parquet-cpp library loads an entire row group's worth of data into memory at once for each column that is referenced.
Sorry I can't be more helpful here!
That was very helpful, thank you. I guess I need to play with some PRAGMA settings to get to the point where SQLite does not OOM when just transferring a big Parquet file into an SQLite database (see the sketch below).
The process is quite slow, probably due to single-threaded SQLite I/O (it is also CPU heavy, so I guess it does decompression and I/O on the same thread). Even with Snappy.
As I don't think the .copy command is available for Parquet files.
Do you think it would make sense to launch a separate thread for decompression to make use of more CPU cores on multicore machines?
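For reference, a hedged sketch of the kind of PRAGMA settings mentioned above. These are standard SQLite PRAGMAs commonly tuned for large bulk loads, not anything specific to this extension, and whether they actually avoid the OOM here is untested:

```sql
-- Standard SQLite settings sometimes used for one-off bulk imports.
-- Effect on the OOM described above is untested; adjust to taste.
PRAGMA journal_mode = OFF;    -- skip the rollback journal during a rebuildable import
PRAGMA synchronous = OFF;     -- fewer fsyncs; acceptable when the import can be re-run
PRAGMA cache_size = -200000;  -- cap the page cache at roughly 200 MB (negative = KiB)
PRAGMA temp_store = FILE;     -- keep temporary structures on disk rather than in memory
```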
Separate threads would make sense - I think someone else mentioned that since each column is compressed separately, there's a lot of opportunity for parallelism. If we did add that, it'd probably make sense to make the parallelism configurable.
I wonder if SELECT * might actually load everything into memory. When we emit values, we emit them with SQLITE_TRANSIENT (see https://github.com/cldellow/sqlite-parquet-vtable/blob/d44c88ad64499265ca64ac0d1ffc456a5b262092/parquet/parquet.cc#L294), which tells SQLite that it can't rely on the pointer being valid for the duration of the query, so it must make a defensive copy. This is described at https://sqlite.org/c3ref/result_blob.html
It sounds like you're attaching the parquet file and then trying to ingest it into a regular SQLite table? I can imagine that if you were doing INSERT INTO xyz SELECT * FROM abc, SQLite could do that with a fixed amount of memory, but maybe it doesn't. If that's the case, a workaround might be to run N queries that use WHERE clauses to split the table into N disjoint sets.
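To illustrate, a minimal sketch of that chunked approach, assuming a Parquet virtual table abc with a numeric column id whose range is known, and a regular target table xyz (all names hypothetical):

```sql
-- Each statement touches a disjoint slice of the virtual table,
-- so peak memory should be bounded by one slice rather than the whole file
-- (assuming the WHERE clause can be pushed down to the vtable).
INSERT INTO xyz SELECT * FROM abc WHERE id <  1000000;
INSERT INTO xyz SELECT * FROM abc WHERE id >= 1000000 AND id < 2000000;
INSERT INTO xyz SELECT * FROM abc WHERE id >= 2000000;
```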
BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(
Would you like to be added as a collaborator so you can directly merge stuff?
That would be nice. I could refine the PR and merge it.
Great, invite sent!
The issue was with using a UNION ALL view over two Parquet files, which caused the whole view to be materialised when accessed. Resolved it by not using the view in the first place.
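For anyone hitting the same thing, a sketch of the pattern described above (table and view names are hypothetical):

```sql
-- Pattern that triggered the OOM: querying a UNION ALL view over two
-- Parquet virtual tables materialised the whole view in memory.
CREATE VIEW combined AS
  SELECT * FROM parquet_a
  UNION ALL
  SELECT * FROM parquet_b;

-- Workaround: skip the view and ingest each virtual table directly.
INSERT INTO target SELECT * FROM parquet_a;
INSERT INTO target SELECT * FROM parquet_b;
```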