sqlite-parquet-vtable

Memory consumption?

Open dforsber opened this issue 4 years ago • 8 comments

Does this extension need to uncompress the Parquet file fully into memory, or can it "stream" the data row group by row group? E.g. in the case of SELECT * FROM parquet('file.parquet').

In other words, does the engine need enough memory for the full uncompressed Parquet data?

dforsber avatar Apr 20 '21 06:04 dforsber

The intent was that it doesn't uncompress the Parquet file fully into memory, although I don't think that I specifically tested that.

The library itself doesn't create any buffers for more than a row of data at once.

IIRC, the underlying parquet-cpp library loads an entire row group's worth of data into memory at once for each column that is referenced.

Sorry I can't be more helpful here!

cldellow avatar Apr 20 '21 13:04 cldellow
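
For context, the row-group-at-a-time behaviour described above looks roughly like this through parquet-cpp's Arrow bindings. A minimal sketch, assuming Apache Arrow is installed; the file name and column index are placeholders:

```cpp
#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <iostream>
#include <memory>

int main() {
  // Opening the file only reads the footer/metadata; nothing is decompressed yet.
  std::shared_ptr<arrow::io::ReadableFile> infile =
      arrow::io::ReadableFile::Open("file.parquet").ValueOrDie();

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Each ReadRowGroup call decompresses one row group's worth of the
  // requested columns (here just column 0); the previous group's buffers
  // can be released before the next call, so peak memory stays bounded.
  for (int i = 0; i < reader->num_row_groups(); i++) {
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadRowGroup(i, {0}, &table));
    std::cout << "row group " << i << ": " << table->num_rows() << " rows\n";
  }
  return 0;
}
```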

That was very helpful, thank you. I guess I need to play with some PRAGMA settings to get to the point where SQLite does not OOM when just transferring a big Parquet file into an SQLite database.

The process is quite slow, probably due to single-threaded SQLite I/O (it is also CPU-heavy, so I guess it does decompression and I/O on the same thread). Even with Snappy.

And I don't think .copy is available for Parquet files.

dforsber avatar Apr 20 '21 18:04 dforsber
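
For reference, the PRAGMAs most relevant here are generic SQLite memory settings; a sketch, untested with this extension:

```sql
-- Generic SQLite memory knobs (untested with this extension); they bound
-- SQLite's own memory use but not parquet-cpp's per-row-group buffers.
PRAGMA cache_size = -65536;          -- cap the page cache at ~64 MiB (negative = KiB)
PRAGMA temp_store = FILE;            -- spill temp tables/indices to disk
PRAGMA soft_heap_limit = 268435456;  -- advisory ~256 MB heap ceiling
```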

Do you think it would make sense to launch a separate thread for decompression, to make use of more cores on multicore machines?

dforsber avatar Apr 20 '21 18:04 dforsber

Separate threads would make sense - I think someone else mentioned that since each column is compressed separately, there's a lot of opportunity for parallelism. If we did add that, it'd probably make sense to make the parallelism configurable.

I wonder if SELECT * might actually load everything into memory. When we emit values, we emit them with SQLITE_TRANSIENT (see https://github.com/cldellow/sqlite-parquet-vtable/blob/d44c88ad64499265ca64ac0d1ffc456a5b262092/parquet/parquet.cc#L294) which tells SQLite that it can't rely on the pointer being valid for the duration of the query, and so it must do a defensive copy. This is described at https://sqlite.org/c3ref/result_blob.html
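
For illustration, the distinction looks like this against the public SQLite C API (a sketch with a hypothetical emit_string helper, not the exact code in parquet.cc):

```cpp
#include <sqlite3.h>
#include <string>

// With SQLITE_TRANSIENT, SQLite copies the bytes before the call returns,
// so the source buffer is free to disappear; the copy lives only as long
// as SQLite needs the value.
void emit_string(sqlite3_context* ctx, const std::string& value) {
  sqlite3_result_text(ctx, value.data(), static_cast<int>(value.size()),
                      SQLITE_TRANSIENT);
  // The zero-copy alternative is SQLITE_STATIC, which is only legal when
  // the buffer is guaranteed to outlive SQLite's use of the result.
}
```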

It sounds like you're attaching the Parquet file and then trying to ingest it into a regular SQLite table? I'd expect that if you were doing INSERT INTO xyz SELECT * FROM abc, SQLite could do that with a fixed amount of memory, but maybe it doesn't. If that's the case, a workaround might be to run N queries whose WHERE clauses filter the table into N disjoint sets.
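
Hypothetically, that workaround could look like this, using the table names from above (splitting on rowid assumes the virtual table exposes a stable rowid; any selective WHERE clause that partitions the rows would do):

```sql
INSERT INTO xyz SELECT * FROM abc WHERE rowid < 1000000;
INSERT INTO xyz SELECT * FROM abc WHERE rowid >= 1000000 AND rowid < 2000000;
-- ...one statement per disjoint slice, up to the file's row count.
```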

cldellow avatar Apr 20 '21 18:04 cldellow

BTW - this repo is largely inactive now. In fact, I see that you have a PR from a few months ago that I failed to notice :(

Would you like to be added as a collaborator so you can directly merge stuff?

cldellow avatar Apr 20 '21 18:04 cldellow

That would be nice. I could refine the PR and merge it.

dforsber avatar Apr 21 '21 17:04 dforsber

Great, invite sent!

cldellow avatar Apr 21 '21 18:04 cldellow

The issue turned out to be a UNION ALL VIEW over two Parquet files, which caused the whole VIEW to be materialised whenever it was accessed. I resolved it by not using the VIEW in the first place.

dforsber avatar Apr 30 '21 20:04 dforsber
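
For anyone hitting the same problem, an illustrative reconstruction (table and view names are made up, assuming the extension's CREATE VIRTUAL TABLE ... USING parquet(...) syntax):

```sql
-- The problematic pattern: a UNION ALL view over two Parquet virtual tables
-- was fully materialised whenever it was queried.
CREATE VIRTUAL TABLE part1 USING parquet('file1.parquet');
CREATE VIRTUAL TABLE part2 USING parquet('file2.parquet');
CREATE VIEW both_parts AS
  SELECT * FROM part1
  UNION ALL
  SELECT * FROM part2;

-- Workaround: drop the view and query each virtual table directly.
SELECT count(*) FROM part1;
SELECT count(*) FROM part2;
```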