GMM with Mini-Batches

Open • justuswill opened this issue 1 year ago • 1 comment

Hi,

Like #7 and #19, I am trying to fit a GMM to a large dataset of shape [10^10, 50] and want to (in fact, need to) use mini-batching.

However, in contrast to the previous answers, gmm.fit only accepts a TensorLike and won't work with my data, which is a torch.utils.data.DataLoader. Even if I pass in a torch.utils.data.Dataset, it only computes a GMM on the first batch.

What is the preferred way to do what I want to do?

Ideally, I would want my code to work like this:

from pycave.bayes import GaussianMixture as GMM
from torch.utils.data import Dataset, DataLoader

data = Data(DATA_PATH).dataloader(batch_size=256)
assert isinstance(data, DataLoader)

gmm = GMM(num_components=3, batch_size=256, trainer_params=dict(accelerator='gpu', devices=1))
class_labels = gmm.fit_predict(data)
means, covs = gmm.model_.means, gmm.model_.covariances
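
For reference, Data above is just a stand-in for a custom Dataset. A minimal sketch of what I have in mind (all names, the flat float32 file format, and the memory-mapping are my own assumptions, not pycave API), where each item is a pre-batched 2-D chunk so the full [10^10, 50] matrix never has to sit in RAM:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class Data(Dataset):
    """Hypothetical memory-mapped dataset; each item is a chunk of rows."""

    def __init__(self, path, num_features=50, chunk_size=256):
        # mode='r' memory-maps the file read-only instead of loading it.
        self.samples = np.memmap(path, dtype=np.float32, mode='r').reshape(-1, num_features)
        self.chunk_size = chunk_size

    def __len__(self):
        # Number of chunks per epoch, not number of rows.
        return (len(self.samples) + self.chunk_size - 1) // self.chunk_size

    def __getitem__(self, idx):
        lo = idx * self.chunk_size
        # Copies one [chunk_size, num_features] slice out of the memory map.
        return torch.from_numpy(np.array(self.samples[lo:lo + self.chunk_size]))

    def dataloader(self, batch_size):
        # Reuse batch_size as the chunk size; batch_size=None tells the
        # DataLoader that each dataset item is already a full batch, which
        # is why data.dataset[0].shape[1] below gives the number of features.
        self.chunk_size = batch_size
        return DataLoader(self, batch_size=None)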

Manually changing the code in gmm/estimator.py (among other places) from

num_features = len(data[0])
...
loader = DataLoader(
    dataset_from_tensors(data),
    batch_size=self.batch_size or len(data),
    collate_fn=collate_tensor,
)
is_batch_training = self._num_batches_per_epoch(loader) == 1          # Also, shouldn't this be > anyway?

to

num_features = data.dataset[0].shape[1]
...
loader = data
is_batch_training = True

allows for error-free fitting and prediction, but I am not sure whether the output is trustworthy.
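
One way I plan to sanity-check it: fit the stock in-memory path and the patched DataLoader path on the same subsample that does fit in memory, and compare the recovered parameters. A rough sketch (the random data is just a placeholder for a real slice of my dataset):

import torch
from torch.utils.data import DataLoader
from pycave.bayes import GaussianMixture as GMM

# Stand-in for a slice of the real data that does fit in memory.
subsample = torch.randn(10_240, 50)

# Stock code path: fit directly on the tensor.
ref = GMM(num_components=3, batch_size=256)
ref.fit(subsample)

# Patched code path: a DataLoader over the same rows, pre-chunked into
# [256, 50] batches so data.dataset[0].shape[1] works as in the patch.
# This requires the estimator.py changes above.
chunks = subsample.reshape(-1, 256, 50)
patched = GMM(num_components=3, batch_size=256)
patched.fit(DataLoader(chunks, batch_size=None))

# If the recovered parameters roughly agree (components may come out
# in a different order), the mini-batch path is at least plausible.
print(ref.model_.means)
print(patched.model_.means)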

justuswill · Mar 09 '23 17:03

Hi, were you able to solve this issue? I am also trying to train a GMM with mini-batches. My dataset is huge and I cannot load all of it into memory.

hashim19 · Jan 02 '24 21:01