datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Perf: load default Utf8View for CSV datatype

Open zhuqi-lucas opened this issue 6 months ago โ€ข 7 comments

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

zhuqi-lucas avatar Jun 03 '25 13:06 zhuqi-lucas

๐Ÿค– ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing default_utf8_for_unkown_type (7586ca2b84e3312f08281d27d4aa25a5c6eba339) to 5b08b843cc52fc843ef54063bebd4c30b9b0f3a0 diff Benchmarks: h2o_small_window Results will be posted here when complete

alamb avatar Jun 03 '25 18:06 alamb

๐Ÿค–: Benchmark completed

Details

Comparing HEAD and default_utf8_for_unkown_type
--------------------
Benchmark h2o_window.json
--------------------
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Query        โ”ƒ       HEAD โ”ƒ default_utf8_for_unkown_type โ”ƒ       Change โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ QQuery 1     โ”‚  1907.12ms โ”‚                    1874.72ms โ”‚    no change โ”‚
โ”‚ QQuery 2     โ”‚  3435.06ms โ”‚                    3435.28ms โ”‚    no change โ”‚
โ”‚ QQuery 3     โ”‚  3836.65ms โ”‚                    4366.66ms โ”‚ 1.14x slower โ”‚
โ”‚ QQuery 4     โ”‚  1008.06ms โ”‚                     959.61ms โ”‚    no change โ”‚
โ”‚ QQuery 5     โ”‚  2238.31ms โ”‚                    2291.12ms โ”‚    no change โ”‚
โ”‚ QQuery 6     โ”‚  2998.08ms โ”‚                    2978.58ms โ”‚    no change โ”‚
โ”‚ QQuery 7     โ”‚  2948.78ms โ”‚                    2920.08ms โ”‚    no change โ”‚
โ”‚ QQuery 8     โ”‚ 11870.28ms โ”‚                   12098.03ms โ”‚    no change โ”‚
โ”‚ QQuery 9     โ”‚   866.09ms โ”‚                     852.79ms โ”‚    no change โ”‚
โ”‚ QQuery 10    โ”‚   938.85ms โ”‚                     940.99ms โ”‚    no change โ”‚
โ”‚ QQuery 11    โ”‚   910.25ms โ”‚                     894.20ms โ”‚    no change โ”‚
โ”‚ QQuery 12    โ”‚  1769.99ms โ”‚                    1741.19ms โ”‚    no change โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Benchmark Summary                           โ”ƒ            โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Total Time (HEAD)                           โ”‚ 34727.52ms โ”‚
โ”‚ Total Time (default_utf8_for_unkown_type)   โ”‚ 35353.24ms โ”‚
โ”‚ Average Time (HEAD)                         โ”‚  2893.96ms โ”‚
โ”‚ Average Time (default_utf8_for_unkown_type) โ”‚  2946.10ms โ”‚
โ”‚ Queries Faster                              โ”‚          0 โ”‚
โ”‚ Queries Slower                              โ”‚          1 โ”‚
โ”‚ Queries with No Change                      โ”‚         11 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

alamb avatar Jun 03 '25 18:06 alamb

It looks like no performance improvement for h2o_window benchmark result...

zhuqi-lucas avatar Jun 04 '25 02:06 zhuqi-lucas

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns ๐Ÿค”

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

alamb avatar Jun 04 '25 10:06 alamb

It looks like no performance improvement for h2o_window benchmark result...

Now that I think about it, the h2o benchmark may not have any string columns ๐Ÿค”

Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks ?

Thank you @alamb , this is a good point. Do some investigation from benchmark code now.

# Runs the tpch benchmark
run_tpch() {
    SCALE_FACTOR=$1
    if [ -z "$SCALE_FACTOR" ] ; then
        echo "Internal error: Scale factor not specified"
        exit 1
    fi
    TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"

    RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running tpch benchmark..."
    # Optional query filter to run specific query
    QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}
    /// File format: `csv` or `parquet`
    #[structopt(short = "f", long = "format", default_value = "csv")]
    file_format: String,

It looks like we default to parquet for tpch, but it also supports csv, i will try to create a PR to support csv from tpch benchmark parameters.

Because from the generator code for tpch, we also generate the CSV format, so it's reasonable for us to support CSV benchmark also, i will create a PR soon. Thanks

 # Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
    FILE="${TPCH_DIR}/supplier.tbl"
    if test -f "${FILE}"; then
        echo " tbl files exist ($FILE exists)."
    else
        echo " creating tbl files with tpch_dbgen..."
        docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
    fi

zhuqi-lucas avatar Jun 04 '25 10:06 zhuqi-lucas

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

alamb avatar Jun 11 '25 03:06 alamb

Thank you @alamb , i created a ticket for the support for csv tpch benchmark first. https://github.com/apache/datafusion/issues/16370

Updated:

Submitted a PR for review:

https://github.com/apache/datafusion/pull/16373

zhuqi-lucas avatar Jun 11 '25 14:06 zhuqi-lucas

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 12 '25 02:08 github-actions[bot]