
[BUG] RAPIDS tools tests on Databricks fail

Open tgravescs opened this issue 3 years ago • 9 comments

Describe the bug
The RAPIDS tools integration test build on Databricks is failing. We are just running against the event logs created by the integration tests, and I'm guessing those changed, but we need to investigate more.


10:24:32  ++ ls -1 rapids_4_spark_profile/local-1661692137602/profile.log rapids_4_spark_profile/local-1661692137651/profile.log rapids_4_spark_profile/local-1661692137944/profile.log rapids_4_spark_profile/local-1661692137979/profile.log rapids_4_spark_profile/local-1661698019032/profile.log rapids_4_spark_profile/local-1661698019035/profile.log rapids_4_spark_profile/local-1661698519710/profile.log
10:24:32  + '[' 7 -lt 8 ']'
10:24:32  + echo 'Number of '\''profile.log'\'' less than expected 8'
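The check that fails is just a count of profiler outputs against an expected number of applications. For reference, a minimal sketch of that assertion, reconstructed from the log above (the expected count of 8 and the rapids_4_spark_profile layout come from the log; the actual CI script may differ):

# Sketch of the CI assertion shown in the log above: count the profile.log
# files produced by the profiling tool and fail if fewer than expected.
EXPECTED=8
ACTUAL=$(ls -1 rapids_4_spark_profile/*/profile.log 2>/dev/null | wc -l)
if [ "$ACTUAL" -lt "$EXPECTED" ]; then
    echo "Number of 'profile.log' less than expected $EXPECTED (found $ACTUAL)"
    exit 1
fi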

Not sure if one of them changed; Alex noticed:

[2022-08-24T15:24:59.431Z] 22/08/24 15:24:59 INFO ApplicationInfo: Parsing Event Log: file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-9.1-65/local-1661347072168
[2022-08-24T15:25:05.954Z] 22/08/24 15:25:05 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-9.1-65/local-1661347072168)
[2022-08-24T15:25:05.954Z] 22/08/24 15:25:05 INFO Profiler: No application to process. Exiting

If it's just the integration tests changing, we need to fix our build or come up with a more reliable test.

tgravescs avatar Aug 30 '22 13:08 tgravescs

It looks like we are getting JSON parse exceptions on 2 files; one is: com.fasterxml.jackson.core.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value

tgravescs avatar Aug 30 '22 14:08 tgravescs

Since this is Databricks, it's very possible they changed the events and parsing, so we need to investigate whether we need to use their version to parse it properly.

tgravescs avatar Aug 30 '22 14:08 tgravescs

The latest build passed on this, so perhaps something got corrupted here. I'll reopen if we see it again.

tgravescs avatar Aug 30 '22 20:08 tgravescs

Happened again.

tgravescs avatar Sep 02 '22 18:09 tgravescs

In the recent failure, the event log JSON is malformed, which causes the profiler to fail:

{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecution{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLE{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":2,"description":"src/main/python/conditionals_test.py::test_case_when[Float] (call)...

It might be an issue with Databricks logging, in that it's intermittent and a race condition could be hit. Not sure it makes sense for us to adjust the tools jar to handle malformed JSON in the event log. Previous logs that caused a failure had similar issues with malformed JSON, as validated by jq.
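For reference, Spark event logs are one JSON object per line, so malformed entries can be located line by line, the same way jq validates them. A minimal sketch, where EVENTLOG is a placeholder for the local-* file under /dbfs:

# Report the line numbers of any event log entries that fail to parse as JSON.
EVENTLOG=/path/to/local-XXXXXXXXXXXXX   # placeholder path
n=0
while IFS= read -r line; do
    n=$((n + 1))
    echo "$line" | jq empty 2>/dev/null || echo "line $n is not valid JSON"
done < "$EVENTLOG"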

mattahrens avatar Sep 21 '22 17:09 mattahrens

I'd rather not have us ignore this, as depending on the event it could be critical. I would rather wait to see if it becomes an issue with real event logs from users. If it's not very frequent, we may be better off fixing up the event logs so we get all the information.

tgravescs avatar Sep 21 '22 17:09 tgravescs

The only concern here is that if it keeps failing in our testing, that is pretty annoying. We might want to make sure our testing isn't doing something weird... like with the parallel testing, for instance.

tgravescs avatar Sep 21 '22 17:09 tgravescs

Happened again (with runtime 10.4 this time):

[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 INFO ApplicationInfo: Parsing Event Log: file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-44/local-1665325816567
[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-44/local-1665325816567)
[2022-10-09T15:08:15.578Z] 22/10/09 15:08:15 INFO Profiler: No application to process. Exiting

zhanga5 avatar Oct 10 '22 01:10 zhanga5

Happened again with the 22.12-SNAPSHOT jar:

[2022-10-16T17:50:44.990Z] 22/10/16 17:50:44 WARN Profiler: Error parsing JSON: ApacheSparkEventLog(file:/dbfs/test_rapids_tools/jenkins-rapids_it-awsDatabricks-10.4-46/local-1665941642741)
[2022-10-16T17:50:44.990Z] 22/10/16 17:50:44 INFO Profiler: No application to process. Exiting

zhanga5 avatar Oct 17 '22 02:10 zhanga5

For the recent tools job failures, I see these exceptions show up near the end of the preceding pytest job (and the exceptions are not seen on days when all jobs pass):

09:39:25  22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25  java.util.concurrent.TimeoutException
09:39:25  	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25  22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25  java.util.concurrent.TimeoutException
09:39:25  	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25  22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25  java.util.concurrent.TimeoutException
09:39:25  	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:25  22/10/09 14:39:24 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
09:39:25  java.util.concurrent.TimeoutException
09:39:25  	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
09:39:25  	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
09:39:57  22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57  22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57  22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.
09:39:57  22/10/09 14:39:54 ERROR ShutdownHookManager: ShutdownHookManager shutdown forcefully after 30 seconds.

Not sure if the exception is connected to invalid event log output being written, but I'm wondering if there's some way to reduce concurrency for the pytest run or at least simplify the event log output for the subsequent tools jar test.

mattahrens avatar Oct 18 '22 15:10 mattahrens

Is parallelism set to 4 for the pytest execution? Not sure how to interpret this command, but I see -n 4:

exec python /home/ubuntu/integration_tests/runtests.py --rootdir /home/ubuntu/integration_tests /home/ubuntu/integration_tests/src/main/python -n 4 -v -rfE '' --std_input_path=/home/ubuntu/integration_tests/src/test/resources --color=yes '' --junitxml=TEST-pytest-1665935200368720540.xml --runtime_env=databricks --test_type=nightly

mattahrens avatar Oct 18 '22 16:10 mattahrens

Yes, it means starting 4 pytest workers to execute tests in parallel. For more details, please refer to https://pytest-xdist.readthedocs.io
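For reference, -n is the pytest-xdist worker count, so reducing the parallelism would only mean changing that one argument in the invocation above. A minimal illustration (all other arguments stay exactly as in the original command):

# Same runtests.py invocation as above with the xdist worker count reduced;
# -n 1 effectively disables parallel workers, -n 2 halves them.
python /home/ubuntu/integration_tests/runtests.py \
    --rootdir /home/ubuntu/integration_tests \
    /home/ubuntu/integration_tests/src/main/python \
    -n 1 -v   # remaining arguments as in the original command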

zhanga5 avatar Oct 21 '22 02:10 zhanga5

Could you estimate how much longer the pytest execution would take if -n were set to 1 or 2, compared to how long it currently takes? That is likely what is causing the event log corruption in Databricks.

mattahrens avatar Oct 21 '22 11:10 mattahrens

> Could you estimate how much longer the pytest execution would take if -n were set to 1 or 2, compared to how long it currently takes? That is likely what is causing the event log corruption in Databricks.

A non-parallel IT run would take ~8+ hours; with parallelism of 2 it would take ~4+ hours.

I am fine with disabling the parallel run for this CI, but it also means double or quadruple the cost, since Databricks charges us by instance uptime. cc @sameerz

pxLi avatar Oct 24 '22 07:10 pxLi

I would rather not disable parallelism; we were just using these files out of convenience. We should come up with a different way to test the tools that is more reliable.

tgravescs avatar Oct 24 '22 13:10 tgravescs

Perhaps leverage Jenkins parallelism or another tool (e.g., the parallel command) to run the tests instead of pytest-xdist? They would be separate Spark jobs for the different tests, but still run in parallel.
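A rough sketch of that idea, assuming GNU parallel and one pytest invocation per test module (the module glob, -j count, and options are illustrative only; the real invocation would still go through runtests.py):

# Illustrative only: run each test module as its own pytest session (and hence
# its own Spark application and event log), with GNU parallel bounding the
# number of concurrent sessions instead of pytest-xdist workers.
cd /home/ubuntu/integration_tests/src/main/python
ls *_test.py | parallel -j 4 pytest -v {}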

zhanga5 avatar Oct 25 '22 01:10 zhanga5

A new CI job that leverages Jenkins parallelism has been deployed. We will see whether this issue can be reproduced.

zhanga5 avatar Dec 05 '22 01:12 zhanga5

@zhanga5 Thanks for the update! Should we close this issue for now and re-open if the issue shows up again?

mattahrens avatar Dec 05 '22 14:12 mattahrens

Let's close this one as the fix was merged. We can reopen or file new tickets if we find a new issue.

pxLi avatar Dec 09 '22 02:12 pxLi