
Running ALIGNN on Multi-GPUs

Open aydinmirac opened this issue 2 years ago • 16 comments

Dear All,

I would like to run ALIGNN on multiple GPUs. When I checked the code, I could not find any option for this.

Is there any way to run ALIGNN on multiple GPUs, for example with PyTorch Lightning or PyTorch's DistributedDataParallel (DDP)?

Best regards, Mirac

aydinmirac avatar Jan 22 '23 10:01 aydinmirac

There is a config option to wrap the model in DistributedDataParallel, but I think that may be a dead code path at this point. https://github.com/usnistgov/alignn/blob/736ec739dfd697d64b1c2a01dc84678a24bcfacd/alignn/train.py#L651

I think the most straightforward path is through ignite.distributed. Lightning would probably be pretty straightforward to use as well if you wrap our models and dataloaders (I tried using Lightning once, but it's been a while).

I have some experimental code for training on multiple GPUs that hasn't been merged into the main repo yet. It more or less follows the ignite.distributed CIFAR example. I don't think much was required beyond wrapping the model, optimizer, and dataloader in ignite's auto distributed helpers, and fixing the training entry point to have the right call signature.

Cleaning that up is actually on my near-term to-do list. Is there a particular form or use case for multi-GPU training/inference that you think would be useful?
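
For reference, here is roughly what that looks like (a minimal sketch following the ignite.distributed CIFAR pattern, not the actual ALIGNN training code; the model and dataset below are dummy stand-ins for the ALIGNN model and graph dataloaders):

import torch
import ignite.distributed as idist
from torch.utils.data import TensorDataset

def training(local_rank, config):
    device = idist.device()
    # stand-ins: replace with the ALIGNN model and graph dataset/dataloader
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    model = idist.auto_model(torch.nn.Linear(16, 1))      # moves to device, wraps in DDP
    optimizer = idist.auto_optim(torch.optim.AdamW(model.parameters(), lr=config["lr"]))
    loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)
    for epoch in range(config["epochs"]):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()

if __name__ == "__main__":
    # one process per GPU; use backend="gloo" and nproc_per_node=1 for a CPU smoke test
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"lr": 1e-3, "batch_size": 48, "epochs": 2})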

bdecost avatar Jan 22 '23 20:01 bdecost

Hi @bdecost,

Thank you so much for your reply.

Let me try your solution. I will inform you as soon as possible.

I am using 2x RTX A4500 GPUs and distributing data with the following PyTorch Lightning snippet. It is very useful and simple:

import pytorch_lightning as pl

trainer = pl.Trainer(
    callbacks=callbacks,
    logger=logger,
    default_root_dir=root_dir,
    max_epochs=200,
    devices=2,              # number of GPUs
    accelerator="gpu",
    strategy="ddp",         # one process per GPU, gradients synced with DDP
)
trainer.fit(task, datamodule=data)

My model does not fit into a single GPU because of large molecular structures. These use cases are pretty common when studying big structures. I think adding a feature like that (I mean, being able to split a model easily across GPUs) would be very helpful when using ALIGNN.

aydinmirac avatar Jan 23 '23 17:01 aydinmirac

My model does not fit into a single GPU because of large molecular structures. These use cases are pretty common when studying big structures. I think adding a feature like that (I mean, being able to split a model easily across GPUs) would be very helpful when using ALIGNN.

Are you looking to shard your model across multiple GPUs? In that case, it looks like Lightning might be able to set that up automatically. I've only tried using data parallelism with ignite to get better throughput; I'm not sure ignite supports model parallelism.

What kind of batch sizes are you working with? I'm wondering if you could reduce your batch size and model size to work within your GPU memory limit, and maybe use data parallelism to get higher throughput / a larger effective batch size. The layer configuration study we did indicates that you can drop to a configuration with 2 ALIGNN and 2 GCN layers without really sacrificing performance, and this substantially reduces the model size (not quite by a factor of 2). See also the cost vs accuracy summary of that experiment. I'm not sure this tradeoff would hold for larger molecular structures, since the study was done on the JARVIS dft_3d dataset, but it might be worth trying.
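
For example, a sketch of that reduced configuration (assuming the alignn_layers / gcn_layers fields of ALIGNNConfig in the main repo; the same keys should sit under the model section of config.json, but check your installed version):

# sketch only -- field names assume the ALIGNNConfig in the main repo
from alignn.models.alignn import ALIGNN, ALIGNNConfig

small = ALIGNN(ALIGNNConfig(name="alignn", alignn_layers=2, gcn_layers=2))
default = ALIGNN(ALIGNNConfig(name="alignn"))  # defaults to 4 ALIGNN + 4 GCN layers

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(small), n_params(default))  # compare parameter counts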

bdecost avatar Jan 24 '23 15:01 bdecost

Hi @bdecost,

Let me elaborate on my case.

I am using the OMDB dataset, which contains large molecules with 82 atoms per molecule on average. That is quite a bit bigger than the QM9 dataset. I am using SchNetPack and ALIGNN as models to train on the dataset.

I trained SchNet and ALIGNN on an A100 (40 GB GPU memory) before. The batch sizes and GPU memory usage on the A100 are below:

SchNet --> batch size = 64, memory usage = 39 GB
ALIGNN --> batch size = 48, memory usage = 32 GB

If I increase the batch size, both models crash on the A100. Now I have to train these models on 2x RTX A4500 GPUs (20 GB memory each) and split them across the 2 GPUs without running out of memory. SchNetPack uses Lightning to handle this.

I just wondered if I could do the same with ALIGNN. Otherwise, as you mentioned, I have to reduce the batch size. I will also try the suggestions you mentioned above.

aydinmirac avatar Jan 24 '23 16:01 aydinmirac

Ok, cool. I wasn't sure if you had such large molecules that fitting a single instance in GPU memory was a problem.

What you want is more straightforward than model parallelism, I think. Our current training setup doesn't make it as simple as a configuration setting, but it would be nice for us to support it.

bdecost avatar Jan 24 '23 20:01 bdecost

We are also interested in training on some larger datasets. What is the current state of distributed training?

JonathanSchmidt1 avatar Jan 04 '24 22:01 JonathanSchmidt1

Hi,

did you try distributed: true, which uses the accelerate package to manage multi-GPU training?
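
Something along these lines should do it (a minimal illustration, assuming "distributed" is a boolean key in the training config file; check the config schema of your installed version):

# minimal illustration -- assumes "distributed" is a boolean key in config.json;
# check the config schema of the installed alignn version
import json

with open("config.json") as f:
    cfg = json.load(f)

cfg["distributed"] = True  # enable accelerate-based multi-GPU training

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)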

knc6 avatar Jan 07 '24 05:01 knc6

With distributed: true, the training hangs at "building line graphs" even when using only one node. There seem to be quite a lot of issues with hanging processes in accelerate, and unfortunately the kernel version of the supercomputing system I am on is 5.3.18, for which this seems to be a known issue: https://huggingface.co/docs/accelerate/basic_tutorials/troubleshooting

Have you had any reports of the training hanging at this stage?

JonathanSchmidt1 avatar Jan 09 '24 10:01 JonathanSchmidt1

@knc6 just a quick check-in concerning DataParallel, since the distributed option was removed. I am getting device errors with DataParallel, and I am also not sure whether DataParallel can properly split batches of DGL graphs:

dgl._ffi.base.DGLError: Cannot assign node feature "e_src" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

Is DataParallel training working for you? Thank you so much!

JonathanSchmidt1 avatar Apr 23 '24 14:04 JonathanSchmidt1

There is a WIP DistributedDataParallel implementation of ALIGNN in the ddp branch. Currently, it prints outputs multiple times, which should hopefully be an easy fix. I also need to check reproducibility and a few other aspects of DDP.
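
The duplicated output is the usual symptom of every DDP rank logging; one common fix is to guard printing and checkpointing on rank 0, something like the sketch below (not necessarily how the ddp branch will handle it):

# sketch of a rank-0 guard for logging under DDP -- not necessarily how the
# ddp branch resolves it
import torch.distributed as dist

def is_main_process():
    # true for single-GPU runs and for rank 0 of a distributed run
    return not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    print("only rank 0 prints progress / writes checkpoints")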

knc6 avatar Apr 28 '24 02:04 knc6

Great, will try it out.

JonathanSchmidt1 avatar Apr 28 '24 09:04 JonathanSchmidt1

Thanks for sharing the branch. I tested it with cached datasets and 2 GPUs, and it was reproducible and consistent with what I would expect from 1 GPU.

However, I am already noticing that RAM will become an issue at some point (2 tasks with 400k materials already take 66 GB; 8 tasks with 4M materials would be ~3.2 TB) due to the in-memory datasets.

JonathanSchmidt1 avatar Apr 29 '24 18:04 JonathanSchmidt1

@JonathanSchmidt1 Thanks for checking. I have a possible solution to lessen the memory footprint using an LMDB dataset; I will get back to you soon.

knc6 avatar Apr 29 '24 19:04 knc6

That's a good idea; LMDB datasets definitely work for this. If you would like to use them, there are a few examples in e.g. https://github.com/IntelLabs/matsciml/tree/main for both DGL and PyG. I have also implemented one of them, but unfortunately I just don't have the capacity right now to promise anything. However, taking one of their classes that parse pymatgen structures (e.g. the MaterialsProjectDataset or the Alexandria dataset) and switching to your line graph conversion might be easy; a sketch of that swap follows the example below. E.g. for M3GNet preprocessing they just use:

@registry.register_dataset("M3GMaterialsProjectDataset")
class M3GMaterialsProjectDataset(MaterialsProjectDataset):
    def __init__(
        self,
        lmdb_root_path: str | Path,
        threebody_cutoff: float = 4.0,
        cutoff_dist: float = 20.0,
        graph_labels: list[int | float] | None = None,
        transforms: list[Callable[..., Any]] | None = None,
    ):
        super().__init__(lmdb_root_path, transforms)
        self.threebody_cutoff = threebody_cutoff
        self.graph_labels = graph_labels
        self.cutoff_dist = cutoff_dist
        self.clear_processed = True

    def _parse_structure(
        self,
        data: dict[str, Any],
        return_dict: dict[str, Any],
    ) -> None:
        super()._parse_structure(data, return_dict)
        structure: None | Structure = data.get("structure", None)
        self.structures = [structure]
        self.converter = Structure2Graph(
            element_types=element_types(),
            cutoff=self.cutoff_dist,
        )
        # note: older matgl versions return 3 values from process() instead of 4
        graphs, lattices, lg, sa = M3GNetDataset.process(self)
        return_dict["graph"] = graphs[0]
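
And swapping in the ALIGNN conversion would roughly look like this (a sketch; the parameter names are assumed from alignn.graphs.Graph.atom_dgl_multigraph and jarvis-tools and may differ between versions):

# sketch of replacing the M3GNet conversion with ALIGNN's line-graph construction;
# parameter names assumed from alignn.graphs / jarvis-tools, check your versions
from jarvis.core.atoms import pmg_to_atoms
from alignn.graphs import Graph

def structure_to_alignn_graphs(structure, cutoff=8.0, max_neighbors=12):
    atoms = pmg_to_atoms(structure)      # pymatgen Structure -> jarvis Atoms
    g, lg = Graph.atom_dgl_multigraph(
        atoms,
        cutoff=cutoff,
        max_neighbors=max_neighbors,
        compute_line_graph=True,         # also build the line graph ALIGNN needs
    )
    return g, lg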

JonathanSchmidt1 avatar Apr 29 '24 19:04 JonathanSchmidt1

@JonathanSchmidt1 I have implemented a basic LMDB dataset here; feel free to give it a try.
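
For anyone following along, the general pattern is roughly the following (a generic sketch, not the actual implementation; key format and serialization details will differ):

# generic sketch of the LMDB-backed dataset pattern: graphs are serialized once
# to LMDB, then read lazily so the full dataset never has to sit in RAM
import lmdb
import pickle
from torch.utils.data import Dataset

class LMDBGraphDataset(Dataset):
    def __init__(self, lmdb_path):
        # readonly + lock=False so multiple DDP workers can read concurrently
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False, subdir=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            # each record holds a pickled (graph, line_graph, target) tuple
            return pickle.loads(txn.get(f"{idx}".encode()))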

knc6 avatar Apr 30 '24 18:04 knc6

Thank you very much. Will give it a try this week.

JonathanSchmidt1 avatar May 02 '24 21:05 JonathanSchmidt1