spark-rapids
[FEA] Force use PERFILE scan in low shuffle merge.
Currently low shuffle merge requires the scan mode to be set to PERFILE, which is not user friendly since users need to change this Spark session-level conf to enable the feature. We could force the PERFILE scan during low shuffle merge so that it works even if the session's Spark conf is not set to PERFILE.
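To make the two-setting requirement concrete, here is a minimal sketch of the session confs a user currently has to manage by hand. The exact conf keys below are assumptions for illustration (check the spark-rapids configuration docs for the authoritative names), and the setter is abstracted so the snippet stands alone:

```python
# Hypothetical sketch: the two session-level settings discussed above.
# Conf key names are assumptions, not verified against the plugin docs.
required_confs = {
    # force the Parquet reader to read one file per batch
    "spark.rapids.sql.format.parquet.reader.type": "PERFILE",
    # opt in to the low shuffle merge implementation (assumed key name)
    "spark.rapids.sql.delta.lowShuffleMerge.enabled": "true",
}

def apply_confs(conf_setter, confs):
    """Apply each conf via a setter such as spark.conf.set."""
    for key, value in confs.items():
        conf_setter(key, value)
```

The proposal in this issue would collapse this to the single low-shuffle-merge flag, with PERFILE forced internally for the scans that need it.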
cc @jlowe Do you think this solution is reasonable?
@liurenjie1024 why do we need to force it to be per-file? Is it because we don't want to merge files so we can insert a count of the number of rows? We already do this for operators like input_file_name, but I really would prefer to have a solution that does not require this. Especially because the only reason we do it for input_file_name is because Spark uses a thread local value to hold the name of the file instead of asking the input format to provide it as a column. But for me it is an acceptable solution. Just not an ideal one.
> Do you think this solution is reasonable?
The answer depends on how well it performs in practice. The problems we've seen in the past with PERFILE are aggravated by partitioned tables, since they tend to generate small files which perform poorly on the GPU (not enough data per file loaded to get the necessary parallelism for speedup). Auto compaction helps as more data gets written to the partitions, but sometimes there simply isn't enough data in partitions and tasks are handed files across partitions. We can combine those today, but not when PERFILE is forced.
Another potential problem with this approach is whether it will be selectively applied only to the tables that need it within the query. The merge source could be a quite complex query composed of many scans, and ideally we wouldn't want those scans to be forced to PERFILE since it's not necessary for the low shuffle merge logic.
> @liurenjie1024 why do we need to force it to be per-file? Is it because we don't want to merge files so we can insert a count of the number of rows? We already do this for operators like input_file_name, but I really would prefer to have a solution that does not require this. Especially because the only reason we do it for input_file_name is because Spark uses a thread local value to hold the name of the file instead of asking the input format to provide it as a column. But for me it is an acceptable solution. Just not an ideal one.
@revans2 Currently low shuffle merge uses filename + row_id to identify a row in the target file, and currently we only enable this feature when the file scan mode has been set to PERFILE. I can send you the design doc if you are interested in more details.
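The (filename, row_id) identity described above can be sketched as follows. This is a plain-Python illustration of the keying scheme only, not the plugin's actual implementation; the names and the `files` mapping are made up for the example:

```python
# Illustrative sketch: a target row is uniquely keyed by the pair
# (file name, row index within that file), as described in the comment above.
def row_key(file_name: str, row_id: int) -> tuple:
    return (file_name, row_id)

def build_index(files):
    """files: mapping of file name -> row count in that file.

    Returns a dict keyed by (file name, row id); with PERFILE scan the
    row id is simply the row's position within its single source file.
    """
    index = {}
    for name, num_rows in files.items():
        for rid in range(num_rows):
            index[row_key(name, rid)] = None
    return index
```

The PERFILE requirement exists because, when each batch comes from exactly one file, the within-file row index is trivial to track; once files are coalesced or split, recovering it takes extra bookkeeping.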
> The answer depends on how well it performs in practice. The problems we've seen in the past with PERFILE are aggravated by partitioned tables, since they tend to generate small files which perform poorly on the GPU (not enough data per file loaded to get the necessary parallelism for speedup). Auto compaction helps as more data gets written to the partitions, but sometimes there simply isn't enough data in partitions and tasks are handed files across partitions. We can combine those today, but not when PERFILE is forced.
Hi @jlowe, sorry, I may have been unclear about the context. The background is that some customers want to try low shuffle merge now, but to enable it we need to set two options: the PERFILE scan mode and the low shuffle merge flag. This may not be a problem when running a merge statement alone, but it's inconvenient when our customers use scripts/notebooks to run many queries together, e.g. they need to enable PERFILE before the merge statement and disable it afterwards. That's why I want to enforce the PERFILE scan in the low shuffle merge algorithm, so that users only need to add one option for all jobs to enable it. This doesn't mean we will not implement it for other scan modes; they are still on the roadmap.
> Another potential problem with this approach is whether it will be selectively applied only to the tables that need it within the query. The merge source could be a quite complex query composed of many scans, and ideally we wouldn't want those scans to be forced to PERFILE since it's not necessary for the low shuffle merge logic.
Yeah, I will ensure it's only applied to the target table; it's only necessary for the target table.
Thanks for clarifying what this is doing. Note that an alternative to this approach is to fully implement a row ID metadata column. If we had that, we wouldn't need the PERFILE+nosplit hack for low-shuffle merge.
I agree we should make it easier to use low-shuffle merge, and making the PERFILE hack more specific to the scan would be a step towards that (both in terms of usability and performance, since not all scans would need to be PERFILE).
An alternative is to implement a row ID column properly which shouldn't be too difficult. We already know the start/end row IDs of each rowgroup we're loading, so it should be easy to generate the segmented sequences to cover all the rows. We already support the input filename as a column, and the combination of these uniquely identifies the row. One drawback is the need to manifest the filename for each row, which IIRC PERFILE+nosplit is avoiding.
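The segmented-sequence idea above can be shown with a minimal sketch: given the start/end row IDs of each row group being loaded (which the reader already knows), emit the row ID column for exactly the rows read. A real implementation would generate this on the GPU (e.g. as a segmented sequence in cuDF); plain Python is used here only to show the shape of the computation:

```python
# Sketch of a row ID metadata column built from row-group boundaries.
# rowgroups: list of (start_row, end_row) half-open intervals, one per
# row group actually loaded; rows not read contribute no IDs.
def row_id_column(rowgroups):
    ids = []
    for start, end in rowgroups:
        # one contiguous run (a "segment") of row IDs per row group
        ids.extend(range(start, end))
    return ids
```

Paired with the input filename column, each emitted ID uniquely identifies a row, regardless of how files are split or coalesced across batches.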
> An alternative is to implement a row ID column properly which shouldn't be too difficult. We already know the start/end row IDs of each rowgroup we're loading, so it should be easy to generate the segmented sequences to cover all the rows. We already support the input filename as a column, and the combination of these uniquely identifies the row. One drawback is the need to manifest the filename for each row, which IIRC PERFILE+nosplit is avoiding.
I think when we have PERFILE, we can implement the row ID column without disabling file splits, as you said. But it's not easy to remove the PERFILE scan entirely in the first step, since in multithreaded mode we need to merge several small files into one, and we can no longer get input_file_name as easily as we do now. I guess we'd also need extra effort to implement the row ID column properly.
I think as a first step, we can limit the PERFILE scan to target files only and remove the dependency on the file scan mode, to make this feature both easier to use and more performant. I will also avoid disabling splits by using the method you mentioned to implement the row ID. What do you think?
> I think as a first step, we can limit the PERFILE scan to target files only
I agree this would be a valuable, incremental step.