Consider removing Spark job parameters for the number of emitted files
Currently some IIS Spark jobs have a parameter for setting the number of output files. This functionality was recently revised when we introduced an explicit repartition step when creating datastores for writing (#1103). The number of output files can also affect the execution time of integration tests: when a production value for the number of files is used in tests, a large number of files is created for a small test datastore. This in turn can spawn a large number of unnecessary madis processes and vastly increase the execution time of a test, from a couple of seconds to tens of minutes.
The original motivation for this parameter was to avoid an explosion in the number of files when performing unions of datastores. I think we could investigate whether this is still an issue for our Spark jobs, as we have a large number of Spark jobs without an explicit number of output files that run without any problems.
Refactoring ideas to avoid setting an explicit number of output files
- set the size of an output file
- set the number of records in an output file
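
A rough sketch of how both ideas could replace a fixed file count: derive the number of output partitions from the data itself, so a small test datastore ends up in a single file. The function names and the idea of passing in a pre-computed record count or byte size are assumptions for illustration, not existing IIS code, and Python is used only to keep the arithmetic short.

```python
def partitions_for_record_limit(total_records: int, max_records_per_file: int) -> int:
    """Number of output files needed so that no file exceeds max_records_per_file.

    Hypothetical helper: total_records would have to come from a count on the
    datastore being written.
    """
    if max_records_per_file <= 0:
        raise ValueError("max_records_per_file must be positive")
    # Integer ceiling division; always emit at least one file, even for empty input.
    return max(1, -(-total_records // max_records_per_file))


def partitions_for_size_limit(total_bytes: int, target_file_bytes: int) -> int:
    """Number of output files needed so that each file is roughly target_file_bytes.

    Hypothetical helper: total_bytes is an estimate of the serialized output size.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return max(1, -(-total_bytes // target_file_bytes))


# A test datastore with 10 records collapses to one file,
# while 2500 records with a limit of 1000 per file need three.
small = partitions_for_record_limit(10, 1000)
large = partitions_for_record_limit(2500, 1000)
```

The resulting value could then be passed to the existing repartition step in place of the configured constant. For jobs that write through the DataFrame API, Spark's `spark.sql.files.maxRecordsPerFile` setting may cover the record-based variant without custom code; whether it applies to our datastore writers would need to be checked.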