Integration tests performance degradation
Apache Iceberg version
main (development)
Please describe the bug 🐞
Integration tests feel significantly slower than before.
make test-integration
Running on the latest main branch took 266 seconds, which is more than 4 minutes
/usr/bin/time make test-integration
=============================================================== 765 passed, 4 skipped, 2415 deselected, 24 warnings in 217.18s (0:03:37) ================================================================
265.95 real 17.03 user 7.11 sys
Compare this to pyiceberg-0.6.0rc6 tag, integration tests only took 100 seconds.
git checkout pyiceberg-0.6.0rc6
/usr/bin/time make test-integration
============================================================== 508 passed, 4 skipped, 2305 deselected in 48.61s ==============================================================
99.89 real 9.43 user 4.16 sys
Here are the top 10 slowest tests:
PYTEST_ARGS="--durations=10" make test-integration
============================================================================ slowest 10 durations ============================================================================
5.93s call tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
3.52s call tests/integration/test_reads.py::test_ray_nan[catalog_hive]
3.12s setup tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
2.66s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-int]
2.62s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[int]
2.46s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_appended_null_partitioned[2-binary]
2.35s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-float]
2.26s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[bool]
2.21s call tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-long]
2.21s call tests/integration/test_reads.py::test_table_properties_dict[catalog_hive]
Some potential optimizations:
- parallelize
pytestexecution, this requires that each test can be independently run - un-paratermized some tests, I noticed that tests like
test_query_filter_only_nullsare parameterized by many args making the number of tests equal to the args multiplied together.
Interesting, @kevinjqliu are you already working on this?
Yeah, I started looking at ways to optimize all the tests. I'm blocked on making tests run in isolation. There's a wip PR #598
Here's what I'm blocked on specifically.
Parallelize this test test_query_filter_appended_null,
PYTEST_ARGS="-n auto -k test_query_filter_appended_null" /usr/bin/time make test-integration
The problem is the table default.arrow_table_v1_with_null used has scope="session", autouse=True, which because of the session scope, can't run in parallel
I would love to check where the time is being spent.
Concerning parallelization, I'm a bit hesitant since I think this will also apply more pressure on the rest catalog. I'm doubtful if it will really speed up the tests, but the numbers will tell.
Making all the tests run in parallel requires lots and lots of changes.
For the short term, the best way forward is to look at individual tests. pytest's --duration shows parameterized tests as separate tests. I wrote a script to roll up the parameterized tests
Here are the top 10 tests by duration, group by test name for parameterized tests
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_appended_null_partitioned, 22 tests, took 47.68
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_null_partitioned, 22 tests, took 39.73
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_only_nulls_partitioned, 22 tests, took 29.37
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_v1_v2_append_null, 11 tests, took 22.38
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_without_data_partitioned, 22 tests, took 17.06
tests/integration/test_add_files.py:test_add_files_to_unpartitioned_table, 3 tests, took 13.00
tests/integration/test_partitioning_key.py:test_partition_key, 28 tests, took 6.25
tests/integration/test_rest_schema.py:test_disallowed_updates, 337 tests, took 5.88
tests/integration/test_reads.py:test_ray_nan, 2 tests, took 3.88
tests/integration/test_writes/test_writes.py:test_query_filter_appended_null, 24 tests, took 3.66
Top 5 tests are all from test_partitioned_writes.py, totaling 156.22 seconds (2.6 minutes).
These tests are heavily parameterized
@pytest.mark.parametrize(
"part_col", ['int', 'bool', 'string', "string_long", "long", "float", "double", "date", "timestamptz", "timestamp", "binary"]
)
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'