
Consider removing spark job parameters for number of emitted files

Open przemyslawjacewicz opened this issue 5 years ago • 1 comments

Currently, some IIS Spark jobs have a parameter that sets the number of output files. This functionality was recently revised when we introduced an explicit repartition step when creating datastores for writing (#1103). The number of output files can also affect the execution time of integration tests: when a production value for the number of files is used in tests, a large number of files is created for a small test datastore. This in turn can spawn a large number of unnecessary madis processes and vastly increase the execution time of a test, from a couple of seconds to tens of minutes.

The original motivation for this parameter was to avoid an explosion of files when performing unions of datastores. I think we could investigate whether this is still an issue for our Spark jobs, as we have a large number of Spark jobs without an explicit number of output files that run without any problems.

przemyslawjacewicz avatar Oct 05 '20 17:10 przemyslawjacewicz

Refactoring ideas to avoid setting an explicit number of output files

  • set the size of an output file
  • set the number of records in an output file
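
A rough sketch of how either idea could replace a fixed file count: derive the number of partitions from the data itself, so a small test datastore ends up with a single file. The function names and the example limits below are hypothetical and not part of IIS; the total record count or byte size would have to be obtained or estimated by the job. For the record-based variant, Spark's `maxRecordsPerFile` write option may already cover this without any custom code.

```python
def partitions_for_record_limit(total_records: int, max_records_per_file: int) -> int:
    """Partition count such that no output file exceeds max_records_per_file (hypothetical helper)."""
    if max_records_per_file <= 0:
        raise ValueError("max_records_per_file must be positive")
    # Integer ceiling division; always keep at least one partition.
    return max(1, (total_records + max_records_per_file - 1) // max_records_per_file)


def partitions_for_size_limit(total_bytes: int, target_file_bytes: int) -> int:
    """Partition count such that each output file is at most about target_file_bytes (hypothetical helper)."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Assumes data is spread roughly evenly across partitions.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# A tiny test datastore collapses to one file, a large one scales up.
small = partitions_for_record_limit(total_records=10, max_records_per_file=1_000_000)
large = partitions_for_record_limit(total_records=2_500_000, max_records_per_file=1_000_000)
by_size = partitions_for_size_limit(total_bytes=256 * 1024 * 1024, target_file_bytes=128 * 1024 * 1024)
```

With this approach the same job configuration could be used in production and in integration tests, since the partition count follows the input size instead of a hard-coded value.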

przemyslawjacewicz avatar Apr 01 '21 11:04 przemyslawjacewicz