Improve load-functionality: load multiple files into one DNDarray

Open coquelin77 opened this issue 3 years ago • 4 comments

Feature functionality

In HPC data analytics we often face the situation that the data to be processed is not stored in one large .h5 file but is spread over a large number of individual files (e.g., CSV files or images). A load routine for this situation is therefore needed, i.e., a routine that loads multiple files into a single DNDarray (of course, some rebalancing has to be done at the end).

Note: Calling the existing single-file load routines once per file and combining the results is not an option, because repeatedly stacking DNDarrays along the split axis is very expensive!

Example of a scientific data set split across a large number of files: http://cdn.gea.esac.esa.int/Gaia/

An idea for the function signature:

load( foldername --> path of a folder containing multiple .csv, .npy, .h5 etc., 
          dtype, 
          balance --> (re)balance after loading, 
          split --> axis along which data is split (and thus concatenated), 
          device, 
          comm) 

Pseudocode would be something like this:

file_list = list of all files contained in the directory foldername, as strings
file_list = sort(file_list) # depending on argument "order"
local_file_list = part of file_list that belongs to current MPI-process 
local_array_list = [load(file).to(device) for file in local_file_list]
local_array = stack(local_array_list) 
array = DNDarray(local_array,...) 
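
The pseudocode above can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions, not Heat's implementation: `chunk_bounds`, `load_local_npy` and `demo` are hypothetical names, `rank` and `size` are passed in explicitly (in a real run they would come from the MPI communicator), only .npy files are handled, and the final step of wrapping the local array into a DNDarray is left out. The local arrays are concatenated along `split`, as in the description of the `split` argument.

```python
import os
import tempfile

import numpy as np


def chunk_bounds(n_items, rank, size):
    """Return [start, stop) of the contiguous block of items owned by `rank`."""
    base, extra = divmod(n_items, size)
    # The first `extra` ranks get one additional item each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def load_local_npy(foldername, rank, size, split=0):
    """Load and concatenate the .npy files that belong to one process."""
    # Sort so that every process sees the same global file order.
    file_list = sorted(f for f in os.listdir(foldername) if f.endswith(".npy"))
    start, stop = chunk_bounds(len(file_list), rank, size)
    local_arrays = [
        np.load(os.path.join(foldername, f)) for f in file_list[start:stop]
    ]
    if not local_arrays:
        return None  # this process received no files
    # One local concatenation per process; no distributed stacking is involved.
    return np.concatenate(local_arrays, axis=split)


def demo(n_files=5, size=2):
    """Simulate `size` processes in one interpreter and return their local shapes."""
    with tempfile.TemporaryDirectory() as folder:
        for i in range(n_files):
            np.save(os.path.join(folder, f"part_{i:03d}.npy"), np.full((2, 3), i))
        return [load_local_npy(folder, rank, size).shape for rank in range(size)]
```

With five files of shape (2, 3) and two simulated processes, the first process ends up with three files and the second with two, so the local chunks are unequal and a rebalancing step may still be needed afterwards.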

Todos:

  • [x] basic load functionality as described above for .npy-files
  • [x] unit tests for this
  • [ ] scaling tests (cluster?) and performance optimization
  • [ ] think about extension to .csv (?)
  • [ ] think about extension to images (?)
  • [ ] think about distribution of file list to processes depending on file sizes (?)

coquelin77, Jan 17 '22 08:01

@krajsek (tagged you because I closed old #740 where you were assigned)

mrfh92, Aug 14 '23 08:08

reviewed and updated within #1109

mrfh92, Aug 14 '23 08:08

(assigned to me as reservation for @Reisii)

mrfh92, Jan 11 '24 15:01