[TASK] Figure out a testing plan for enterprisiness

Open revans2 opened this issue 4 years ago • 2 comments

We recently had an issue where contiguousSplit started to fail on 2GB partitions. We know that shuffle has issues with similar limits (#45), but the unknown unknowns are more problematic: without knowing which limits exist, we cannot make informed decisions about prioritizing fixes.

We need a test plan that really hammers on size limits in both cudf and this plugin, so that we understand what limits exist and can come up with a proper plan to address them.
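As a rough illustration of what hammering on a size limit could look like, the sketch below picks byte sizes just below, at, and just above a suspected limit and converts them to row counts. It assumes the 2GB failure is tied to the signed 32-bit maximum, which this issue does not confirm, and the helper names (`boundary_sizes`, `rows_for_size`, `fits_in_int32`) are hypothetical and not part of cudf or the plugin.

```python
# Assumption: the suspected limit is the signed 32-bit maximum (2**31 - 1 bytes).
INT32_MAX = 2**31 - 1


def boundary_sizes(limit=INT32_MAX, deltas=(-1, 0, 1)):
    """Return byte sizes just below, at, and just above a suspected limit."""
    return [limit + d for d in deltas]


def rows_for_size(target_bytes, bytes_per_row):
    """Smallest row count whose total size reaches target_bytes (ceiling division)."""
    return -(-target_bytes // bytes_per_row)


def fits_in_int32(n):
    """True if n can be stored in a signed 32-bit integer."""
    return -2**31 <= n <= INT32_MAX


if __name__ == "__main__":
    # Hypothetical 8-byte rows, e.g. a single 64-bit integer column without nulls.
    for size in boundary_sizes():
        rows = rows_for_size(size, bytes_per_row=8)
        print(size, rows, fits_in_int32(size))
```

A list like this could feed `pytest.mark.parametrize`, so that each suspected limit gets one test case on either side of the boundary.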

Avoid Crashes:

Highest priority:

  • [x] https://github.com/NVIDIA/spark-rapids/issues/5028
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4968
  • [ ] https://github.com/rapidsai/cudf/issues/10368
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/548
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/2065
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5029
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4061
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5140
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5108

Next on the list:

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/325
  • [x] https://github.com/NVIDIA/spark-rapids/issues/836
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/1501
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/1940
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/2354
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/2708
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/3300
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4034
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/45
  • [x] https://github.com/NVIDIA/spark-rapids/issues/302
  • [x] https://github.com/NVIDIA/spark-rapids/issues/3328

Test for new issues:

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/86
  • [ ] QUERY FUZZ TESTING ISSUE
  • [ ] Test common scenarios: avro #5657
  • [ ] Test common scenarios: Notebook, REPL #5704
  • [ ] Test configs that are too late to test via pytest.mark.parametrize https://github.com/NVIDIA/spark-rapids/issues/5703

Recover from crashes:

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4210

Auto Tune:

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/635
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/1399
  • [x] https://github.com/NVIDIA/spark-rapids/issues/2424
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4164

Better Error Reporting:

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/53
  • [x] https://github.com/rapidsai/cudf/issues/10553
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/1405

revans2, Mar 04 '21 19:03

Discussed and need to break down larger work items into tasks.

sameerz, May 28 '21 01:05

This is an epic that is being supported by other issues. Not specific to a release.

sameerz, Jul 13 '21 21:07

This became a dumping ground for a lot of reliability issues. I am going to rename this, remove everything that is not done, and then file new epics to track each of the individual issues.

revans2, Apr 04 '23 18:04