iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Integration tests performance degradation

Open kevinjqliu opened this issue 1 year ago • 6 comments

Apache Iceberg version

main (development)

Please describe the bug 🐞

Integration tests feel significantly slower than before.

make test-integration

Running on the latest main branch took 266 seconds, which is more than 4 minutes

/usr/bin/time make test-integration

=============================================================== 765 passed, 4 skipped, 2415 deselected, 24 warnings in 217.18s (0:03:37) ================================================================
      265.95 real        17.03 user         7.11 sys

Compare this to pyiceberg-0.6.0rc6 tag, integration tests only took 100 seconds.

git checkout pyiceberg-0.6.0rc6
/usr/bin/time make test-integration

============================================================== 508 passed, 4 skipped, 2305 deselected in 48.61s ==============================================================
       99.89 real         9.43 user         4.16 sys

Here are the top 10 slowest tests:

PYTEST_ARGS="--durations=10" make test-integration

============================================================================ slowest 10 durations ============================================================================
5.93s call     tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
3.52s call     tests/integration/test_reads.py::test_ray_nan[catalog_hive]
3.12s setup    tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
2.66s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-int]
2.62s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[int]
2.46s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_appended_null_partitioned[2-binary]
2.35s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-float]
2.26s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[bool]
2.21s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-long]
2.21s call     tests/integration/test_reads.py::test_table_properties_dict[catalog_hive]

kevinjqliu avatar Apr 13 '24 19:04 kevinjqliu

Some potential optimizations:

  • parallelize pytest execution, this requires that each test can be independently run
  • un-paratermized some tests, I noticed that tests like test_query_filter_only_nulls are parameterized by many args making the number of tests equal to the args multiplied together.

kevinjqliu avatar Apr 13 '24 19:04 kevinjqliu

Interesting, @kevinjqliu are you already working on this?

Gowthami03B avatar Apr 18 '24 14:04 Gowthami03B

Yeah, I started looking at ways to optimize all the tests. I'm blocked on making tests run in isolation. There's a wip PR #598

kevinjqliu avatar Apr 18 '24 16:04 kevinjqliu

Here's what I'm blocked on specifically.

Parallelize this test test_query_filter_appended_null,

PYTEST_ARGS="-n auto -k test_query_filter_appended_null" /usr/bin/time make test-integration

The problem is the table default.arrow_table_v1_with_null used has scope="session", autouse=True, which because of the session scope, can't run in parallel

kevinjqliu avatar Apr 18 '24 16:04 kevinjqliu

I would love to check where the time is being spent.

Concerning parallelization, I'm a bit hesitant since I think this will also apply more pressure on the rest catalog. I'm doubtful if it will really speed up the tests, but the numbers will tell.

Fokko avatar Apr 19 '24 10:04 Fokko

Making all the tests run in parallel requires lots and lots of changes.

For the short term, the best way forward is to look at individual tests. pytest's --duration shows parameterized tests as separate tests. I wrote a script to roll up the parameterized tests

Here are the top 10 tests by duration, group by test name for parameterized tests

tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_appended_null_partitioned, 22 tests, took 47.68
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_null_partitioned, 22 tests, took 39.73
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_only_nulls_partitioned, 22 tests, took 29.37
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_v1_v2_append_null, 11 tests, took 22.38
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_without_data_partitioned, 22 tests, took 17.06
tests/integration/test_add_files.py:test_add_files_to_unpartitioned_table, 3 tests, took 13.00
tests/integration/test_partitioning_key.py:test_partition_key, 28 tests, took 6.25
tests/integration/test_rest_schema.py:test_disallowed_updates, 337 tests, took 5.88
tests/integration/test_reads.py:test_ray_nan, 2 tests, took 3.88
tests/integration/test_writes/test_writes.py:test_query_filter_appended_null, 24 tests, took 3.66

Top 5 tests are all from test_partitioned_writes.py, totaling 156.22 seconds (2.6 minutes). These tests are heavily parameterized

@pytest.mark.parametrize(
    "part_col", ['int', 'bool', 'string', "string_long", "long", "float", "double", "date", "timestamptz", "timestamp", "binary"]
)

kevinjqliu avatar Apr 21 '24 17:04 kevinjqliu

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Oct 19 '24 00:10 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Nov 02 '24 00:11 github-actions[bot]