spark-rapids
[TASK] Figure out a testing plan for enterprisiness
We recently hit an issue where contiguousSplit started to fail on 2GB partitions. We know of similar size-limit issues in shuffle (#45), but the unknown unknowns are the bigger problem: we cannot make informed decisions about prioritizing fixes for limits we have not found yet.
We need a test plan that really hammers on size limits in both cudf and this plugin, so that we understand which limits exist and can put together a proper plan to address them.
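As a rough starting point for such a plan, the sketch below enumerates data sizes just below, at, and just above a suspected limit. It assumes the 2GB failure relates to a signed 32-bit size limit, which this issue does not confirm. The names `INT32_MAX`, `boundary_sizes`, and `rows_for_bytes` are hypothetical helpers for illustration, not part of spark-rapids or cudf.

```python
# Hypothetical helpers for planning size-limit tests; not part of spark-rapids or cudf.

# Assumed limit: the largest signed 32-bit value, one byte short of 2 GiB.
INT32_MAX = 2**31 - 1


def boundary_sizes(limit=INT32_MAX, deltas=(1, 1024)):
    """Return sorted byte sizes just below, at, and just above `limit`."""
    sizes = {limit}
    for d in deltas:
        sizes.add(limit - d)
        sizes.add(limit + d)
    return sorted(sizes)


def rows_for_bytes(target_bytes, bytes_per_row):
    """Smallest row count whose fixed-width payload reaches `target_bytes`."""
    return -(-target_bytes // bytes_per_row)  # ceiling division


if __name__ == "__main__":
    for size in boundary_sizes():
        # 8 bytes per row corresponds to a single 64-bit integer column.
        print(size, rows_for_bytes(size, 8))
```

The resulting sizes could be fed to `pytest.mark.parametrize` so that each suspected limit gets a test on both sides of the boundary. The actual limits and row widths worth testing would need to come from the linked issues.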
Avoid Crashes:
Highest priority:
- [x] https://github.com/NVIDIA/spark-rapids/issues/5028
- [ ] https://github.com/NVIDIA/spark-rapids/issues/4968
- [ ] https://github.com/rapidsai/cudf/issues/10368
- [ ] https://github.com/NVIDIA/spark-rapids/issues/548
- [ ] https://github.com/NVIDIA/spark-rapids/issues/2065
- [x] https://github.com/NVIDIA/spark-rapids/issues/5029
- [x] https://github.com/NVIDIA/spark-rapids/issues/4061
- [x] https://github.com/NVIDIA/spark-rapids/issues/5140
- [x] https://github.com/NVIDIA/spark-rapids/issues/5108
Next on the list:
- [ ] https://github.com/NVIDIA/spark-rapids/issues/325
- [x] https://github.com/NVIDIA/spark-rapids/issues/836
- [ ] https://github.com/NVIDIA/spark-rapids/issues/1501
- [ ] https://github.com/NVIDIA/spark-rapids/issues/1940
- [ ] https://github.com/NVIDIA/spark-rapids/issues/2354
- [ ] https://github.com/NVIDIA/spark-rapids/issues/2708
- [ ] https://github.com/NVIDIA/spark-rapids/issues/3300
- [ ] https://github.com/NVIDIA/spark-rapids/issues/4034
- [ ] https://github.com/NVIDIA/spark-rapids/issues/45
- [x] https://github.com/NVIDIA/spark-rapids/issues/302
- [x] https://github.com/NVIDIA/spark-rapids/issues/3328
Test for new issues:
- [ ] https://github.com/NVIDIA/spark-rapids/issues/86
- [ ] QUERY FUZZ TESTING ISSUE
- [ ] Test common scenarios: avro #5657
- [ ] Test common scenarios: Notebook, REPL #5704
- [ ] Test configs that are too late to test via pytest.mark.parametrize https://github.com/NVIDIA/spark-rapids/issues/5703
Recover from crashes:
- [ ] https://github.com/NVIDIA/spark-rapids/issues/4210
Auto Tune:
- [ ] https://github.com/NVIDIA/spark-rapids/issues/635
- [ ] https://github.com/NVIDIA/spark-rapids/issues/1399
- [x] https://github.com/NVIDIA/spark-rapids/issues/2424
- [ ] https://github.com/NVIDIA/spark-rapids/issues/4164
Better Error Reporting:
- [ ] https://github.com/NVIDIA/spark-rapids/issues/53
- [x] https://github.com/rapidsai/cudf/issues/10553
- [ ] https://github.com/NVIDIA/spark-rapids/issues/1405
Discussed and need to break down larger work items into tasks.
This is an epic that is being supported by other issues. Not specific to a release.
This became a dumping ground for a lot of reliability issues. I am going to rename this, remove everything that is not done, and then file new epics to track each of the individual issues.