Improve load-functionality: load multiple files into one DNDarray
Feature functionality
In HPC data analytics we often encounter the problem that the data does not come as one large .h5 file but as many small individual files (e.g., CSV files, images, etc.). It is therefore necessary to implement a load routine for this situation, i.e., a routine that loads multiple files into a single DNDarray (of course, some rebalancing has to be done at the end).
Note: Using the existing single-file load routines is not an option, because extensive stacking of DNDarrays along the split axis is very expensive!
Example of a scientific data set split across many files: http://cdn.gea.esac.esa.int/Gaia/
A possible function signature:

```python
load(
    foldername,  # path of a folder containing multiple .csv, .npy, .h5, etc.
    dtype,
    balance,     # (re)balance after loading
    split,       # axis along which the data is split (and thus concatenated)
    device,
    comm,
)
```
The pseudocode would look something like this:

```python
file_list = ...        # all files contained in the directory foldername, as a list of strings
file_list = sort(file_list)  # depending on argument "order"
local_file_list = ...  # part of file_list that belongs to the current MPI process
local_array_list = [load(file).to(device) for file in local_file_list]
local_array = stack(local_array_list)
array = DNDarray(local_array, ...)
```
Todos:
- [x] basic load functionality as described above for .npy-files
- [x] unittests for this
- [ ] scaling tests (cluster?) and performance optimization
- [ ] think about extension to .csv (?)
- [ ] think about extension to images (?)
- [ ] think about distribution of file list to processes depending on file sizes (?)
@krajsek (tagged you because I closed old #740 where you were assigned)
reviewed and updated within #1109
(assigned to me as a reservation for @Reisii)