[BUG] Rapids tools test on Databricks fails
Describe the bug
The Rapids tools integration test build on Databricks is failing. We are just running against the event logs created by the integration tests, and I'm guessing those changed, but we need to investigate more.
10:24:32 ++ ls -1 rapids_4_spark_profile/local-1661692137602/profile.log rapids_4_spark_profile/local-1661692137651/profile.log rapids_4_spark_profile/local-1661692137944/profile.log rapids_4_spark_profile/local-1661692137979/profile.log rapids_4_spark_profile/local-1661698019032/profile.log rapids_4_spark_profile/local-1661698019035/profile.log rapids_4_spark_profile/local-1661698519710/profile.log
10:24:32 + '[' 7 -lt 8 ']'
10:24:32 + echo 'Number of '\''profile.log'\'' less than expected 8'
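For context, the failing assertion amounts to a count check along these lines (a minimal sketch reconstructed from the trace above; the real script's glob and expected count may differ):

```bash
#!/bin/bash
# Hypothetical reconstruction of the profile.log count check seen in the trace.
EXPECTED=8   # one profile.log per application profiled by the tools jar

# Count the profile.log files produced under the profiler output directory.
count=$(ls -1 rapids_4_spark_profile/*/profile.log 2>/dev/null | wc -l)

if [ "$count" -lt "$EXPECTED" ]; then
    echo "Number of 'profile.log' less than expected $EXPECTED"
    exit 1
fi
```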
Not sure if one of them changed; Alex noticed:
[2022-08-24T15:24:59.431Z] 22/08/24 15:24:59 INFO ApplicationInfo: Parsing Event Log: file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-9.1-65/local-1661347072168
[2022-08-24T15:25:05.954Z] 22/08/24 15:25:05 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-9.1-65/local-1661347072168)
[2022-08-24T15:25:05.954Z] 22/08/24 15:25:05 INFO Profiler: No application to process. Exiting
If it's just the integration tests changing, we need to fix our build or come up with a more reliable test.
It looks like we are getting JSON parse exceptions on 2 files; one is:
exception com.fasterxml.jackson.core.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value
Since this is Databricks, it's very possible they changed the events and parsing, so we need to investigate whether we need to use their version to parse it properly.
The latest build passed on this, so perhaps something got corrupted here. I'll reopen if we see it again.
Happened again.
In the recent failure, the event log JSON is malformed, which causes the profiler to fail:
{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecution{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLE{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":2,"description":"src/main/python/conditionals_test.py::test_case_when[Float] (call)...
It might be an issue with Databricks logging, in that it's intermittent and a race condition could be hit. Not sure it makes sense for us to adjust the tools jar to handle malformed JSON in the event log. Previous logs that caused a failure had similar issues with malformed JSON, as validated by jq.
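For reference, the malformation is easy to confirm outside the profiler (a sketch, assuming the event log on DBFS is plain JSON Lines, one event per line, as Spark writes it):

```bash
#!/bin/bash
# Path taken from the failure above; substitute the event log under test.
EVENT_LOG=/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-9.1-65/local-1661347072168

# jq parses the file as a stream of JSON values and exits non-zero on the
# first parse error, reporting the offending line number.
if ! jq empty "$EVENT_LOG"; then
    echo "Malformed JSON in event log: $EVENT_LOG"
fi
```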
I'd rather not have us ignore this, as depending on the event it could be critical. I would rather wait to see if it becomes an issue with real event logs for users. If it's not very often, we may be better off fixing up the event logs so we get all the information.
The only concern here is if it keeps failing in our testing, which is pretty annoying. We might want to make sure our testing isn't doing something weird... like with the parallel testing, for instance.
Happened again (with runtime 10.4 this time):
[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 INFO ApplicationInfo: Parsing Event Log: file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-44/local-1665325816567
[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-44/local-1665325816567)
[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 INFO Profiler: No application to process. Exiting
Happened again with the 22.12-SNAPSHOT jar:
[2022-10-16T17:50:44.990Z] 22/10/16 17:50:44 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-46/local-1665941642741)
[2022-10-16T17:50:44.990Z] 22/10/16 17:50:44 INFO Profiler: No application to process. Exiting
For the recent tools job failures, I see these exceptions show up near the end of the preceding pytest job (and the exceptions are not seen on days where all jobs pass):
09:39:25 22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25 java.util.concurrent.TimeoutException
09:39:25 at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25 22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25 java.util.concurrent.TimeoutException
09:39:25 at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25 22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25 java.util.concurrent.TimeoutException
09:39:25 at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25 22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25 java.util.concurrent.TimeoutException
09:39:25 at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:57 22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57 22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57 22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57 22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
Not sure if the exception is connected to the invalid event log output being written, but I'm wondering if there's some way to reduce concurrency for pytest or at least simplify the event log output for the subsequent tools jar test.
Is parallelism set to 4 for the pytest execution? I'm not sure how to interpret this command, but I see -n 4:
exec python /home/ubuntu/integration_tests/runtests.py --rootdir /home/ubuntu/integration_tests /home/ubuntu/integration_tests/src/main/python -n 4 -v -rfE '' --std_input_path=/home/ubuntu/integration_tests/src/test/resources --color=yes '' --junitxml=TEST-pytest-1665935200368720540.xml --runtime_env=databricks --test_type=nightly
Yes, it means starting 4 pytest workers to execute tests in parallel. For more details, please refer to https://pytest-xdist.readthedocs.io
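For illustration, halving the worker count is just a matter of changing -n in the command above (the junitxml and empty placeholder arguments are omitted here for brevity):

```bash
python /home/ubuntu/integration_tests/runtests.py \
  --rootdir /home/ubuntu/integration_tests \
  /home/ubuntu/integration_tests/src/main/python \
  -n 2 -v -rfE \
  --std_input_path=/home/ubuntu/integration_tests/src/test/resources \
  --color=yes \
  --runtime_env=databricks --test_type=nightly
```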
Could you estimate how much longer the pytest execution would take if n were set to 1 or 2, compared to how long it currently takes? That is likely what is causing the event log corruption in Databricks.
A non-parallel IT run would take ~8+ hours; with parallelism of 2 it would take ~4+ hours.
I am fine with disabling the parallel run for this CI, but it also means double/quadruple the cost, since Databricks charges us by instance uptime. cc @sameerz
I would rather not disable parallelism. We were just using these files out of convenience; we should come up with a different way to test the tools that is more reliable.
Perhaps leverage Jenkins parallelism or another tool (e.g. the parallel command) to run tests instead of pytest-xdist? That would mean separate Spark jobs for different tests, but still running in parallel.
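A rough sketch of that idea, assuming runtests.py accepts individual test files as positional arguments the same way the test directory is passed above (the paths, -j value, and file glob are illustrative only):

```bash
cd /home/ubuntu/integration_tests
# Run each test module as its own pytest invocation (and thus its own Spark
# application with its own event log), with GNU parallel capping concurrency.
ls src/main/python/*_test.py | \
  parallel -j 2 \
    "python runtests.py --rootdir . {} -v -rfE --runtime_env=databricks --test_type=nightly"
```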
A new CI job that leverages Jenkins parallelism is deployed. We will see whether this issue can be reproduced.
@zhanga5 Thanks for the update! Should we close this issue for now and re-open if the issue shows up again?
Let's close this one as the fix was merged. We can reopen/file new tickets if we find new issues.