djl Tensorflow & MxNet Sparse bugs

Description

MxNet's shape handling uses int and then casts the result to long, which crashes when a dimension is larger than an int can hold. This does not happen with the PyTorch variant of sparse matrices. It's a simple fix.
The PyTorch COO matrix does not support sum() / sum(axis), although both are supported according to the PyTorch documentation.

Expected Behavior

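getShape() should work on MxNet sparse matrices with a dimension larger than int, and sum() / sum(axis) should work on PyTorch COO matrices.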

Error Message

It's easy to reproduce

How to Reproduce?

Can't shorten it down as it's in my pipeline with my data.

Steps to reproduce

MxNet

Have a sparse matrix with more columns than int can cover.
Ask for getShape()
Crash
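
The MxNet steps above come down to a dimension passing through a 32-bit int before being widened to long. A minimal sketch of that failure mode in plain Java follows. It is not DJL or MXNet source code; the class and method names are hypothetical and only illustrate how the value wraps, and how a checked conversion would fail loudly instead.

```java
// Hypothetical illustration only; this is not DJL or MXNet source code.
class ShapeOverflowDemo {

    // Mimics a binding that stores a dimension as a 32-bit int
    // and only widens it to long afterwards.
    static long roundTripThroughInt(long dimension) {
        int narrowed = (int) dimension; // silently wraps above Integer.MAX_VALUE
        return narrowed;                // widening keeps the wrapped value
    }

    // Checked variant: fails loudly instead of wrapping.
    static int checkedNarrow(long dimension) {
        return Math.toIntExact(dimension); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        long columns = 3_000_000_000L; // more columns than an int can hold
        System.out.println(roundTripThroughInt(columns)); // prints -1294967296
        try {
            checkedNarrow(columns);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

A negative dimension like this is the kind of value that can make downstream shape code crash.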

PyTorch

Try to call .sum() on a COO matrix
Fails
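
Until sum() works on the COO matrix, one possible workaround is to reduce over the raw COO triplets yourself. The sketch below is plain Java and does not use the DJL API. It assumes you can extract the row indices, column indices and values as arrays; the class and method names are hypothetical.

```java
// Hypothetical workaround sketch; operates on raw COO arrays, not on DJL NDArrays.
class CooSumSketch {

    // Sum of all entries: implicit zeros contribute nothing,
    // so only the stored values matter.
    static double sumAll(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Sum along one axis of a 2-D COO matrix.
    // axis = 0 collapses rows (one result per column);
    // axis = 1 collapses columns (one result per row).
    static double[] sumAxis(int[] rows, int[] cols, double[] values,
                            int numRows, int numCols, int axis) {
        if (axis != 0 && axis != 1) {
            throw new IllegalArgumentException("axis must be 0 or 1");
        }
        double[] result = new double[axis == 0 ? numCols : numRows];
        for (int i = 0; i < values.length; i++) {
            int target = (axis == 0) ? cols[i] : rows[i];
            result[target] += values[i]; // duplicate coordinates accumulate
        }
        return result;
    }

    public static void main(String[] args) {
        // Dense equivalent: [[1, 0, 2], [0, 3, 0]]
        int[] rows = {0, 0, 1};
        int[] cols = {0, 2, 1};
        double[] values = {1.0, 2.0, 3.0};
        System.out.println(sumAll(values));
        System.out.println(java.util.Arrays.toString(sumAxis(rows, cols, values, 2, 3, 0)));
        System.out.println(java.util.Arrays.toString(sumAxis(rows, cols, values, 2, 3, 1)));
    }
}
```

Note that the per-axis result is dense, so this only helps when the output dimension fits in memory.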

What have you tried to solve it?

Tried using MxNet, which hit a new crash
Tried using Tensorflow, which does not support sparsity (even though TF has SparseTensor)

Environment Info

Using Windows 10

Aug 11 '21 13:08 Lundez

@Lundez Thanks for reporting this issue, will take a look.

Aug 15 '21 03:08 frankfliu

@Lundez

I created a PR to address the MXNet large tensor issue: #1183. Unfortunately, getShape() will still crash. By default, MXNet is compiled without large tensor support for performance reasons. You have to manually compile MXNet with the USE_INT64_TENSOR_SIZE=1 flag. You can then set the MXNET_LIBRARY_PATH environment variable to load your customized libmxnet.so. See: http://docs.djl.ai/docs/development/troubleshooting.html#4-how-to-run-djl-using-other-versions-of-apache-mxnet
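
Before starting the JVM with a custom build, it can help to confirm that the environment variable is visible to the process. The following is a small pre-flight sketch in plain Java. It assumes the variable is spelled `MXNET_LIBRARY_PATH`, does not call any DJL API, and uses a hypothetical class name.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical pre-flight check; assumes the variable name MXNET_LIBRARY_PATH.
class MxnetLibraryPathCheck {

    // Takes the environment as a parameter so the check is easy to test.
    static String describe(Map<String, String> env) {
        String value = env.get("MXNET_LIBRARY_PATH");
        if (value == null || value.isEmpty()) {
            return "MXNET_LIBRARY_PATH is not set";
        }
        if (!Files.exists(Paths.get(value))) {
            return "MXNET_LIBRARY_PATH points to a missing path: " + value;
        }
        return "MXNET_LIBRARY_PATH found: " + value;
    }

    public static void main(String[] args) {
        System.out.println(describe(System.getenv()));
    }
}
```

This only checks that the path exists; it does not verify that the library was built with USE_INT64_TENSOR_SIZE=1.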

Aug 22 '21 04:08 frankfliu

@frankfliu I see. Thank you for the assistance! 🤗

Aug 22 '21 05:08 Lundez

@frankfliu did you ever get around to validating the PyTorch COOMatrix.sum() issue?

Aug 24 '21 16:08 Lundez
