
Error on training with multiple GPUs

Open zachgk opened this issue 2 years ago • 3 comments

Copying an issue from a Slack thread:

oleogin 1 month ago Hi, I'm encountering an issue when trying to run training on multiple GPUs using DJL. I'm using the following code:

DefaultTrainingConfig config = new DefaultTrainingConfig(loss);
config.optExecutorService(Executors.newFixedThreadPool(16));
...
EasyTrain.trainBatch(trainer, batch);
trainer.step();

However, I'm getting the following error: Gradient values are all zeros, please call gradientCollector.backward() on your target NDArray (usually loss), before calling step().


Zach Kimberg 1 month ago Can you share more of the training script? Is it failing on the first training step or a later one?


oleogin 1 month ago Thanks for your response. Yes, the error occurs on the first step of training. But the script works perfectly on a single GPU. Here's the code snippet from the trainBatch method:

if (splits.length > 1 && trainer.getExecutorService().isPresent()) {
    // multi-threaded
    ExecutorService executor = trainer.getExecutorService().get();
    List<CompletableFuture<Boolean>> futures = new ArrayList<>(splits.length);
    for (Batch split : splits) {
        futures.add(
                CompletableFuture.supplyAsync(
                        () -> trainSplit(trainer, collector, batchData, split),
                        executor));
    }
    CompletableFuture.allOf(futures.stream().toArray(CompletableFuture[]::new));
} else {
    // sequence
    for (Batch split : splits) {
        trainSplit(trainer, collector, batchData, split);
    }
}
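One thing stands out in the snippet above: CompletableFuture.allOf(...) only builds the combined future, and the result is never joined, so trainBatch can return before the asynchronous trainSplit calls have finished. That would leave the gradients untouched (all zeros) by the time trainer.step() runs. A minimal sketch of waiting for the futures, assuming the rest of the method stays unchanged (whether this is the intended fix is not confirmed in this thread):

    // Hedged sketch: block until every trainSplit(...) task has completed
    // before trainBatch returns and step() is called.
    CompletableFuture.allOf(futures.stream().toArray(CompletableFuture[]::new)).join();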

It appears that the trainSplit method is not being executed properly when multiple GPUs are used and an executor is specified. To address this issue, I've replaced the EasyTrain.trainBatch method with the following code:

Arrays.stream(splits)
        .parallel()
        .forEach(split -> {
            NDList data = split.getData();
            NDList labels = split.getLabels();
            NDList preds = trainer.forward(data, labels);
            NDArray lossValue = trainer.getLoss().evaluate(labels, preds);
            collector.backward(lossValue);
            batchData.getLabels().put(labels.get(0).getDevice(), labels);
            batchData.getPredictions().put(preds.get(0).getDevice(), preds);
        });

However, this has resulted in another error:

Check failed: !AGInfo::IsNone(*i): Cannot differentiate node because it is not in a computational graph. You need to set is_recording to true or use autograd.record() to save computational graphs for backward. If you want to differentiate the same graph twice, you need to pass retain_graph=True to backward.
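A likely explanation (not confirmed here): MXNet tracks the autograd recording flag per thread, so the forward passes that .parallel() schedules on the common ForkJoinPool run on threads where the GradientCollector never enabled recording, and there is no graph to differentiate. A minimal sketch of a workaround that keeps the per-split loop on the calling thread, mirroring EasyTrain's sequential branch (collector and batchData are assumed to come from the surrounding trainBatch code):

    // Hedged sketch: run every split on the calling thread so the
    // GradientCollector's recording state covers each forward/backward pass.
    for (Batch split : splits) {
        NDList data = split.getData();
        NDList labels = split.getLabels();
        NDList preds = trainer.forward(data, labels);
        NDArray lossValue = trainer.getLoss().evaluate(labels, preds);
        collector.backward(lossValue);
        batchData.getLabels().put(labels.get(0).getDevice(), labels);
        batchData.getPredictions().put(preds.get(0).getDevice(), preds);
    }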

I hope this provides more context and helps resolve the issue. Let me know if you need further information.


oleogin 14 days ago Here is a toy example that works on a single GPU but fails to run on multiple GPUs.

public class Example {

    public static void main(String[] args) throws TranslateException {
        final int batchSize = 1024;
        final float lr = 0.001f;

        SequentialBlock block = new SequentialBlock()
                .add(Linear.builder()
                        .setUnits(2048)
                        .build()
                )
                .add(Linear.builder()
                        .setUnits(2048)
                        .build()
                );

        Model model = Model.newInstance("model", Device.gpu());
        model.setBlock(block);

        TrainingConfig config = setupTrainingConfig(new L2Loss("l2"), lr);

        try (Trainer trainer = model.newTrainer(config)){
            trainer.initialize(new Shape(2, 2048));

            Stream.generate(() -> model.getNDManager().newSubManager())
                    .map(mgr -> {
                        Batchifier dataBatchifier = new StackBatchifier();
                        Batchifier labelBatchifier = new StackBatchifier();

                        NDList[] input = IntStream.range(0, batchSize)
                                .mapToObj(value -> mgr.randomNormal(new Shape(1, 2048)))
                                .map(ndArray -> new NDList(ndArray))
                                .toArray(NDList[]::new);

                        NDList[] output = IntStream.range(0, batchSize)
                                .mapToObj(value -> mgr.randomNormal(new Shape(1, 2048)))
                                .map(ndArray -> new NDList(ndArray))
                                .toArray(NDList[]::new);

                        NDList batchifyData = dataBatchifier.batchify(input);
                        NDList batchifyLabels = labelBatchifier.batchify(output);

                        return new Batch(
                                mgr,
                                batchifyData,
                                batchifyLabels,
                                batchSize,
                                dataBatchifier,
                                labelBatchifier,
                                0,
                                0);
                    })
                    .peek(batch -> EasyTrain.trainBatch(trainer, batch))
                    .peek(batch -> trainer.step())
                    .forEach(Batch::close);
        }
    }

    private static TrainingConfig setupTrainingConfig(Loss loss, float lr) {

        Tracker learningRateTracker = Tracker.fixed(lr);

        Optimizer optimizer = Optimizer.adam()
                .optBeta1(0.9f)
                .optBeta2(0.999f)
                .optEpsilon(1e-7f)
                .optLearningRateTracker(learningRateTracker)
                .build();
        return new DefaultTrainingConfig(loss)
                .optExecutorService(Executors.newFixedThreadPool(16))
                .optInitializer(new NormalInitializer(0.01f), p -> true)
                .optOptimizer(optimizer)
                .addEvaluator(new Accuracy());
    }

}
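Note that the config above never selects devices explicitly, so the trainer splits each batch across every GPU the engine reports. A hedged sketch of pinning the devices instead, assuming two GPUs purely for illustration (optDevices is part of DefaultTrainingConfig):

    // Hedged sketch: restrict training to the first two GPUs so each batch
    // is split across exactly those devices.
    return new DefaultTrainingConfig(loss)
            .optDevices(new Device[] {Device.gpu(0), Device.gpu(1)})
            .optExecutorService(Executors.newFixedThreadPool(16))
            .optInitializer(new NormalInitializer(0.01f), p -> true)
            .optOptimizer(optimizer)
            .addEvaluator(new Accuracy());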

zachgk commented Mar 14 '23 21:03

I ran this example code as examples/src/main/java/ai/djl/examples/training/Example.java but ran into the following error. Does anyone know how to set the MXNet engine's USE_CUDA=1?

[03:35:16] ../src/imperative/./imperative_utils.h:93: GPU support is disabled. Compile MXNet with USE_CUDA=1 to enable GPU support.
Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: GPU is not enabled
Stack trace:
  File "../src/resource.cc", line 167

KexinFeng commented Mar 29 '23 04:03

Here's some extra information regarding the bug. I used the following dependencies:

compile "ai.djl.mxnet:mxnet-engine:0.21.0"
runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"

I ran the example on an A100 GPU with CUDA 11.0 and libcudnn8 version 8.0.4.30-1+cuda11.0 on an AMD64 machine.

user50 commented Mar 29 '23 15:03

USE_CUDA=1 is set when the MXNet binaries are compiled. If it is disabled, it means you are using the CPU build of MXNet rather than the GPU build. Can you try switching to CUDA 11.2? (Our supported CUDA versions for MXNet are shown here.)
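For reference, a hedged sketch of dependency declarations that pin a CUDA 11.2 build of the MXNet native library instead of relying on auto-detection; the artifact name and version (mxnet-native-cu112mkl 1.9.1 with the linux-x86_64 classifier) are assumptions that should be checked against the DJL documentation for the engine version in use:

    // Hedged sketch (Gradle): pin the CUDA 11.2 MXNet native binaries explicitly.
    implementation "ai.djl.mxnet:mxnet-engine:0.21.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-cu112mkl:1.9.1:linux-x86_64"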

zachgk commented Mar 30 '23 18:03