Is it possible to train a PyTorch SSD model on an M1 Mac - or is this not yet implemented? PtNDArrayEx.multiBoxPrior(PtNDArrayEx.java:697) UnsupportedOperationException: Not implemented

Open juliangamble opened this issue 1 year ago • 5 comments

Description

When running TrainPikachuTest on an M1 Mac, I get the error UnsupportedOperationException: Not implemented.

Expected Behavior

The TrainPikachuTest runs as expected and a model is produced.

Error Message

Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented
	at ai.djl.pytorch.engine.PtNDArrayEx.multiBoxPrior(PtNDArrayEx.java:697)
	at ai.djl.modality.cv.MultiBoxPrior.generateAnchorBoxes(MultiBoxPrior.java:68)
	at ai.djl.basicmodelzoo.cv.object_detection.ssd.SingleShotDetection.forwardInternal(SingleShotDetection.java:84)
	at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:128)
	at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
	at ai.djl.training.Trainer.forward(Trainer.java:189)
	at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
	at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
	at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
	at ai.djl.examples.training.TrainPikachu.runExample(TrainPikachu.java:93)
	at ai.djl.examples.training.TrainPikachuTest.testDetection(TrainPikachuTest.java:52)
	at ai.djl.examples.training.TrainPikachuTest.main(TrainPikachuTest.java:30)

How to Reproduce?

Run the class TrainPikachuTest on an M1 Mac.

Steps to reproduce

  1. Run the TrainPikachuTest class with DJL_DEFAULT_ENGINE=PyTorch

What have you tried to solve it?

  1. Debugging through the code - and looking at the implementation of the class.
  2. Looking for other examples of training doing SingleShotDetection. (Didn't find any).

Environment Info

DJL_DEFAULT_ENGINE=PyTorch
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.jdk/Contents/Home

juliangamble avatar Jul 05 '23 12:07 juliangamble

MXNet has several helper operators specific to SSD, and the DJL SSD model you are using relies on them. Unfortunately, MXNet doesn't support M1, and the model doesn't run on PyTorch because the PyTorch engine doesn't implement those operators.

If you are interested in contributing here, you could build an implementation of SSD that does not rely on those operators or you could add the missing implementations as part of PtNDArrayEx.

zachgk avatar Jul 05 '23 22:07 zachgk

@zachgk thanks for getting back to me. Thanks for creating an opportunity to contribute.

I'm sizing it up and working out a specification and a way to measure whether it is working. For the specification, the reference seems to be this file: https://github.com/apache/mxnet/blob/master/src/operator/contrib/multibox_prior.cc Please let me know if you know a better one.

For measuring whether it is working, I looked in here and couldn't find a corresponding test: https://github.com/apache/mxnet/tree/master/tests/cpp/operator

Can you help me out with how you would measure a working implementation?

juliangamble avatar Jul 06 '23 11:07 juliangamble

Probably the easiest way to test whether it is working is to use a hard-coded value for inputs and outputs. We have some examples in OptimizerTest.

So, find some known sample data and put it into the integration suite so that it runs on all engines. This ensures that all engines have matching behavior (including between the MXNet version and your new implementation). It also guards against the behavior changing unnoticed, because any change would also require changing the values in the test.
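As a rough illustration of the hard-coded approach, here is a minimal sketch in plain Java with no DJL dependency. The class `AnchorSketch` and its `multiBoxPrior` method are hypothetical names, and the anchor formula is an assumption based on the usual SSD anchor layout (every size paired with the first ratio, then the first size paired with each remaining ratio, centers at pixel offset 0.5). It has not been verified against the MXNet operator. In a real integration test, the expected values would come from running the MXNet engine rather than from hand calculation.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of DJL or MXNet.
class AnchorSketch {

    // Returns one {xmin, ymin, xmax, ymax} row per anchor, in fractions of the image size.
    // Assumes sizes and ratios each contain at least one element.
    static float[][] multiBoxPrior(int height, int width, float[] sizes, float[] ratios) {
        int perPixel = sizes.length + ratios.length - 1;
        float[][] out = new float[height * width * perPixel][];
        int count = 0;
        for (int r = 0; r < height; r++) {
            float cy = (r + 0.5f) / height; // assumed: offset 0.5, step 1/height
            for (int c = 0; c < width; c++) {
                float cx = (c + 0.5f) / width; // assumed: offset 0.5, step 1/width
                // Every size paired with the first ratio.
                for (float size : sizes) {
                    out[count++] = box(cx, cy, size, ratios[0], height, width);
                }
                // The first size paired with each remaining ratio.
                for (int j = 1; j < ratios.length; j++) {
                    out[count++] = box(cx, cy, sizes[0], ratios[j], height, width);
                }
            }
        }
        return out;
    }

    // Builds one corner-format box around the center (cx, cy).
    private static float[] box(float cx, float cy, float size, float ratio,
                               int height, int width) {
        float sqrtRatio = (float) Math.sqrt(ratio);
        float halfW = size * height / width * sqrtRatio / 2;
        float halfH = size / sqrtRatio / 2;
        return new float[] {cx - halfW, cy - halfH, cx + halfW, cy + halfH};
    }

    public static void main(String[] args) {
        // Hard-coded input: a 1x1 feature map, one size, one ratio.
        float[][] boxes = multiBoxPrior(1, 1, new float[] {0.5f}, new float[] {1f});
        // Hard-coded expected output, worked out by hand from the formula above.
        float[] expected = {0.25f, 0.25f, 0.75f, 0.75f};
        for (int i = 0; i < expected.length; i++) {
            if (Math.abs(boxes[0][i] - expected[i]) > 1e-6f) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println(Arrays.toString(boxes[0]));
    }
}
```

The same pattern, fixed inputs compared against fixed expected values within a small tolerance, could then be applied to NDArray outputs so that every engine is checked against the same numbers.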

zachgk avatar Jul 10 '23 22:07 zachgk

I'll get back to you - I'm writing a test.

juliangamble avatar Jul 11 '23 12:07 juliangamble

I've opened a pull request for this: https://github.com/deepjavalibrary/djl/pull/2715 The two unit tests nearly match, but not quite, so I'm asking for some help with this.

juliangamble avatar Jul 15 '23 07:07 juliangamble