vortex icon indicating copy to clipboard operation
vortex copied to clipboard

`time_elapsed_opening` spends too much time

Open xudong963 opened this issue 3 months ago • 20 comments

I have two tables, one with vortex format, another parquet.

Vortex:

dft -c "explain analyze select * from vortex_table where ticker = 'EAOK' order by market limit 10" --flightsql
Registering deltalake
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CooperativeExec, metrics=[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |   FileScanExec: table = vortex_table, filters=ticker@0 = EAOK, output_ordering=["market@2 ASC NULLS LAST", "start_date@39 ASC NULLS LAST", "last_updated_utc@16 ASC NULLS LAST"], scan_direction=Native, page_size=1024, partitions=[22], fetch=10, metrics=[output_rows=10, elapsed_compute=1ns, batches_splitted=0, file_open_errors=0, file_scan_errors=0, time_elapsed_opening=23.989515ms, time_elapsed_processing=33.761589ms, time_elapsed_scanning_total=15.77185ms, time_elapsed_scanning_until_data=15.77165ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Parquet:

 dft -c "explain analyze select * from parquet_table where ticker = 'EAOK' order by market limit 10" --flightsql
Registering deltalake
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CooperativeExec, metrics=[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |   FileScanExec: table = parqut_table, filters=ticker@0 = EAOK, output_ordering=["market@2 ASC NULLS LAST", "start_date@39 ASC NULLS LAST", "last_updated_utc@16 ASC NULLS LAST"], scan_direction=Native, page_size=1024, partitions=[22], fetch=10, metrics=[output_rows=10, elapsed_compute=1ns, batches_splitted=0, bytes_scanned=622370, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=1902, pushdown_rows_pruned=63634, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=41, bloom_filter_eval_time=26.722µs, metadata_load_time=133.167µs, page_index_eval_time=2ns, row_pushdown_eval_time=260.039µs, statistics_eval_time=84.213µs, time_elapsed_opening=397.295µs, time_elapsed_processing=3.295531ms, time_elapsed_scanning_total=3.29974ms, time_elapsed_scanning_until_data=3.299529ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

xudong963 avatar Sep 17 '25 11:09 xudong963

Image

The open about 34%

xudong963 avatar Sep 17 '25 11:09 xudong963

Is this with the most recent release?

AdamGS avatar Sep 17 '25 11:09 AdamGS

yes, 0.52.1

xudong963 avatar Sep 17 '25 11:09 xudong963

The parquet opener

Image

xudong963 avatar Sep 17 '25 11:09 xudong963

I dare not say that the SQLs corresponding to the flamegraph are exactly the same, but they should be similar. They are collected in the online environment (There is a sequence of events.)

xudong963 avatar Sep 17 '25 11:09 xudong963

We have a known fix to avoid constructing the full scan plan up-front, but seems like the other metrics are a little slow too. Do you mind sharing some properties about the schema? Number of columns, nested types or not, etc.

gatesn avatar Sep 17 '25 14:09 gatesn

@gatesn thank you, I've sent @AdamGS the related vortex files.

xudong963 avatar Sep 17 '25 14:09 xudong963

I ran another SQL query and found that the DataSourceExec will result in an OOM, whereas Parquet does not, which may also be related to the full scan.

xudong963 avatar Sep 18 '25 01:09 xudong963

Hey, is there anything I could do about this issue?

xudong963 avatar Sep 24 '25 03:09 xudong963

Sorry I got buried in other issues. I checked now and I don't see any new data, do you see this behavior with the files you sent us for #4466?

AdamGS avatar Sep 24 '25 10:09 AdamGS

No worries, yes, but I only gave you one file; the actual table contains 32 files.

xudong963 avatar Sep 26 '25 02:09 xudong963

I think @onursatici is looking into this and similar issues now, it should be better by the next release.

AdamGS avatar Oct 07 '25 14:10 AdamGS

Thank you, sweet to hear that!

xudong963 avatar Oct 08 '25 13:10 xudong963

I saw you released a new version last week. Do you think it deserves trying? @AdamGS

xudong963 avatar Oct 16 '25 09:10 xudong963

There are defiantly a bunch of good improvements related to DataFusion and to wider file schemas. I'm not sure we got to the root if this issue yet, but I think there are reason to believe you'll get a performance win.

AdamGS avatar Oct 16 '25 13:10 AdamGS

I upgraded to 0.54.0, but got the error:

RUST_BACKTRACE=1 vx browse data.01.vortex
Error: Array encoding vortex.masked not found in registry vortex.bool, vortex.ext, vortex.varbin, vortex.chunked, fastlanes.bitpacked, vortex.datetimeparts, fastlanes.delta, vortex.fixed_size_list, vortex.alprd, vortex.bytebool, vortex.dict, vortex.fsst, fastlanes.for, vortex.pco, vortex.runend, vortex.zigzag, vortex.sparse, vortex.varbinview, vortex.decimal_byte_parts, vortex.null, vortex.sequence, vortex.primitive, vortex.zstd, vortex.list, vortex.struct, vortex.constant, vortex.alp, vortex.decimal
Backtrace:
   0: std::backtrace::Backtrace::create
   1: vortex_file::footer::Footer::from_flatbuffer
   2: vortex_file::footer::deserializer::FooterDeserializer::deserialize
   3: vortex_file::open::VortexOpenOptions::open::{{closure}}
   4: tokio::runtime::park::CachedParkThread::block_on
   5: tokio::runtime::runtime::Runtime::block_on
   6: vx::main
   7: std::sys::backtrace::__rust_begin_short_backtrace
   8: std::rt::lang_start::{{closure}}
   9: std::rt::lang_start_internal
  10: _main

xudong963 avatar Oct 17 '25 03:10 xudong963

The table seems to be queryable, but I can't open the file by vx

xudong963 avatar Oct 17 '25 05:10 xudong963

The cost is still

Image

xudong963 avatar Oct 17 '25 05:10 xudong963

The left side shows many drop_in_place.

I went through the codebase, and found that each vortex error contains a backtrace: https://github.com/vortex-data/vortex/blob/develop/vortex-error/src/lib.rs#L72, which will definitely increase memory/drop cost, and some err types may be large without Box

xudong963 avatar Oct 17 '25 06:10 xudong963

I think what's happening is that you're using older version than 0.54.0 to read file generated by 0.54.0. Vortex doesn't have forward compatibility yet

robert3005 avatar Oct 17 '25 10:10 robert3005