
LoadCSV (with warnings) kills the GC during query optimization on sufficiently wide datasets.

okennedy opened this issue on Oct 07 '22 · 0 comments

Describe the bug
Spark's optimizer aggressively inlines projections: it replicates the entire inlined subtree at each spot where the projected column is referenced. That's fine when the expression being inlined is small, or is inlined a fixed number of times. On the other hand, LoadSparkCSV creates the query:

Project[ csv.col1, csv.col2, csv.col3, ... ]
  Project [ from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ) as csv ]

which optimizes to:

Project[ 
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col1,
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col2,
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col3,
   ...
]

In other words, the size of the resulting tree is $O(n^2)$ in the number of columns: each of the $n$ output columns carries its own copy of the $n$-field schema. For a sufficiently wide dataset, this thrashes the GC (~150 columns seems to be enough to do it).
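For reference, a minimal sketch that reproduces the blowup outside of Vizier (the object name, session setup, and exact column count are illustrative, not part of the project):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_csv}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object WideCsvRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("wide-csv").getOrCreate()
        import spark.implicits._

        val n = 150  // around the width where GC thrash starts
        val schema = StructType((1 to n).map { i => StructField(s"col$i", StringType) })

        // One row holding the raw CSV line as a single string column.
        val raw = Seq((1 to n).map(_.toString).mkString(",")).toDF("input_data")

        // Project[ csv.col1, csv.col2, ... ] over Project[ from_csv(...) as csv ]
        val parsed = raw.select(from_csv($"input_data", schema, Map.empty[String, String]).as("csv"))
        val wide = parsed.select((1 to n).map { i => col(s"csv.col$i") }: _*)

        // The optimizer inlines one full copy of from_csv(..., schema) into
        // each of the n output columns, yielding an O(n^2)-node plan.
        println(wide.queryExecution.optimizedPlan.treeString.length)
      }
    }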

Expected behavior
Don't thrash the GC. Some thoughts on a fix:

  • Find a way to "cache" the CSV once it's been parsed and read in... this may require a preprocessing step during load, but that dupes the CSV... which isn't really what we want.
  • Reimplement LoadSparkCSV in RDDs.
  • Find some way to prevent Spark from optimizing below a specific point: "hand"-optimize the query up to the load, and then ask Spark nicely to stop optimizing past it (see the sketch below).
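
On the last point, one possible workaround (an assumption, not something Vizier does today) is to round-trip the parsed DataFrame through an RDD. The reconstructed DataFrame is rooted at a LogicalRDD node, which the optimizer treats as a leaf, so the from_csv expression below it can no longer be inlined into each projection (continuing the names from the sketch above):

    // Workaround sketch (assumption): break the lineage by converting to
    // an RDD and back. from_csv then runs once per row below the barrier,
    // and the per-column projections above it stay O(n) in total.
    val barrier = spark.createDataFrame(parsed.rdd, parsed.schema)
    val wideSafe = barrier.select((1 to n).map { i => col(s"csv.col$i") }: _*)

The trade-off is that the barrier also hides the subtree from any genuinely useful optimizations (e.g., pushing filters down to the load), so it would need to be applied selectively.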
