Optimizing the RAM consumption when preparing data for training
The `load_chunk_data` method aggressively consumes huge amounts of RAM when concatenating np arrays.
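To give a concrete picture of the problem, here is a minimal illustration (with toy shapes, not the real SALSA feature dimensions) of why the concatenation step spikes RAM:

```python
import numpy as np

# Toy shapes only, to illustrate the memory behaviour.
# np.concatenate always allocates a brand-new output array, so peak
# RAM is roughly all of the input chunks *plus* the full result,
# i.e. about 2x the data size while both are alive.
chunks = [np.zeros((7, 100, 200), dtype=np.float32) for _ in range(10)]
merged = np.concatenate(chunks, axis=1)  # fresh (7, 1000, 200) buffer
# `chunks` and `merged` coexist here until `chunks` is released.
```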
I am currently trying to implement something that will reduce the RAM consumption.

@karnwatcharasupat @thomeou I am happy to open a PR when I am done, if that's acceptable to you.
PS: I noticed that the previous method never worked, and I apologize for not properly testing it; I am trying something new now.
@karnwatcharasupat The splitting idea didn't work, even after I fixed it to actually concatenate the chunks, because in the end I am still concatenating np arrays that eventually reach a shape of (7, 1920000, 200), which is unmanageable anyway. I had an idea to not concatenate them at all, but instead to export them in the `db_data` dictionary built in the `get_split` method, like this for example:
```python
db_data = {
    'features': features,
    'features_2': features_2,
    'features_3': features_3,
    'features_4': features_4,
    'sed_targets': sed_targets,
    'doa_targets': doa_targets,
    'feature_chunk_idxes': feature_chunk_idxes,
    'gt_chunk_idxes': gt_chunk_idxes,
    'filename_list': filename_list,
    'test_batch_size': test_batch_size,
    'feature_chunk_len': self.chunk_len,
    'gt_chunk_len': self.chunk_len // self.label_upsample_ratio
}
```
where `features`, `features_2`, `features_3`, and `features_4` are just `features` split into 4 chunks, and then adjusting the use of `features` throughout the project so that the other feature arrays are consumed sequentially (a rough sketch of what I mean follows below). I have already developed such a method to export 4 arrays, but I am still exploring the code to better understand it before changing how it works.
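A rough sketch of the splitting idea, assuming the per-file chunk arrays are available as a plain Python list; `all_feature_chunks` below is a hypothetical stand-in, not an identifier from the repo:

```python
import numpy as np

# Hypothetical stand-in for the per-file chunk arrays that
# load_chunk_data would otherwise merge into one giant array.
all_feature_chunks = [np.zeros((7, 1000, 200), dtype=np.float32)
                      for _ in range(8)]

# Split the chunk list into 4 groups and concatenate each group
# separately, so no single concatenation ever has to build the full
# (7, 1920000, 200) array in one allocation.
groups = np.array_split(np.arange(len(all_feature_chunks)), 4)
features, features_2, features_3, features_4 = (
    np.concatenate([all_feature_chunks[i] for i in idxes], axis=1)
    for idxes in groups
)
```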
Currently, I can see that the `get_split` method is called during training in the `datamodule.py` file, specifically in
```python
train_db = self.feature_db.get_split(split=self.train_split, split_meta_dir=self.split_meta_dir, stage='fit')
```

and in

```python
val_db = self.feature_db.get_split(split=self.val_split, split_meta_dir=self.split_meta_dir, stage='inference')
```
The call that assigns the `train_db` variable is currently my problem. If you have an idea of how to add the chunks part to the code, please let me know.
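In case it helps, this is the kind of indexing wrapper I imagine for making the rest of the project consume the four arrays sequentially; `SplitFeatures` and everything inside it is hypothetical, not code from the repo:

```python
import numpy as np

class SplitFeatures:
    """Hypothetical wrapper that presents several arrays, e.g.
    features, features_2, ..., as one virtually concatenated array
    along axis 1, without ever materializing the full array."""

    def __init__(self, parts):
        self.parts = parts
        # Cumulative start offsets of each part along axis 1.
        self.offsets = np.cumsum([0] + [p.shape[1] for p in parts])

    def get_chunk(self, start, length):
        # Locate the part containing `start`; assumes chunk boundaries
        # are aligned with the split points so that a chunk never
        # straddles two parts.
        part = np.searchsorted(self.offsets, start, side='right') - 1
        local = start - self.offsets[part]
        return self.parts[part][:, local:local + length]

# Usage with toy data: two parts standing in for features/features_2.
feats = SplitFeatures([np.zeros((7, 500, 200)), np.zeros((7, 500, 200))])
print(feats.get_chunk(start=500, length=100).shape)  # (7, 100, 200)
```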