datafusion
datafusion copied to clipboard
Perf: load default Utf8View for CSV datatype
Which issue does this PR close?
- Closes #16241
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
๐ค ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing default_utf8_for_unkown_type (7586ca2b84e3312f08281d27d4aa25a5c6eba339) to 5b08b843cc52fc843ef54063bebd4c30b9b0f3a0 diff
Benchmarks: h2o_small_window
Results will be posted here when complete
๐ค: Benchmark completed
Details
Comparing HEAD and default_utf8_for_unkown_type
--------------------
Benchmark h2o_window.json
--------------------
โโโโโโโโโโโโโโโโณโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโ
โ Query โ HEAD โ default_utf8_for_unkown_type โ Change โ
โกโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ
โ QQuery 1 โ 1907.12ms โ 1874.72ms โ no change โ
โ QQuery 2 โ 3435.06ms โ 3435.28ms โ no change โ
โ QQuery 3 โ 3836.65ms โ 4366.66ms โ 1.14x slower โ
โ QQuery 4 โ 1008.06ms โ 959.61ms โ no change โ
โ QQuery 5 โ 2238.31ms โ 2291.12ms โ no change โ
โ QQuery 6 โ 2998.08ms โ 2978.58ms โ no change โ
โ QQuery 7 โ 2948.78ms โ 2920.08ms โ no change โ
โ QQuery 8 โ 11870.28ms โ 12098.03ms โ no change โ
โ QQuery 9 โ 866.09ms โ 852.79ms โ no change โ
โ QQuery 10 โ 938.85ms โ 940.99ms โ no change โ
โ QQuery 11 โ 910.25ms โ 894.20ms โ no change โ
โ QQuery 12 โ 1769.99ms โ 1741.19ms โ no change โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโ
โ Benchmark Summary โ โ
โกโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ
โ Total Time (HEAD) โ 34727.52ms โ
โ Total Time (default_utf8_for_unkown_type) โ 35353.24ms โ
โ Average Time (HEAD) โ 2893.96ms โ
โ Average Time (default_utf8_for_unkown_type) โ 2946.10ms โ
โ Queries Faster โ 0 โ
โ Queries Slower โ 1 โ
โ Queries with No Change โ 11 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโ
It looks like no performance improvement for h2o_window benchmark result...
It looks like no performance improvement for h2o_window benchmark result...
Now that I think about it, the h2o benchmark may not have any string columns ๐ค
Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?
It looks like no performance improvement for h2o_window benchmark result...
Now that I think about it, the h2o benchmark may not have any string columns ๐ค
Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?
Thank you @alamb , this is a good point. Do some investigation from benchmark code now.
# Runs the tpch benchmark
run_tpch() {
SCALE_FACTOR=$1
if [ -z "$SCALE_FACTOR" ] ; then
echo "Internal error: Scale factor not specified"
exit 1
fi
TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"
RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running tpch benchmark..."
# Optional query filter to run specific query
QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}
/// File format: `csv` or `parquet`
#[structopt(short = "f", long = "format", default_value = "csv")]
file_format: String,
It looks like we default to parquet for tpch, but it also supports csv, i will try to create a PR to support csv from tpch benchmark parameters.
Because from the generator code for tpch, we also generate the CSV format, so it's reasonable for us to support CSV benchmark also, i will create a PR soon. Thanks
# Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
echo " tbl files exist ($FILE exists)."
else
echo " creating tbl files with tpch_dbgen..."
docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
fi
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look
Thank you @alamb , i created a ticket for the support for csv tpch benchmark first. https://github.com/apache/datafusion/issues/16370
Updated:
Submitted a PR for review:
https://github.com/apache/datafusion/pull/16373
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.