Tensorflow & MxNet Sparse bugs

Open Lundez opened this issue 4 years ago • 4 comments

Description

  1. The MXNet shape code uses int and then casts the result to long, which crashes when the data has a dimension larger than int can represent. This does not happen with the PyTorch variant of sparse matrices. It's a simple fix.
  2. The PyTorch COO matrix does not support sum() / sum(axis), both of which are supported according to the PyTorch documentation.

Expected Behavior

Both should work: getShape() on a large sparse MXNet matrix, and sum() / sum(axis) on a PyTorch COO matrix.

Error Message

It's easy to reproduce

How to Reproduce?

Can't shorten it down as it's in my pipeline with my data.

Steps to reproduce

MXNet

  1. Have a sparse matrix with more columns than an int can hold.
  2. Call getShape().
  3. It crashes.
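
For illustration, here is a minimal sketch of the kind of failure described above, assuming the crash comes from an ordinary int overflow. It is plain Java and does not use any DJL or MXNet classes; `ShapeOverflowDemo`, `narrowThenWiden` and `checkedNarrow` are hypothetical names, not part of either library.

```java
// Illustration only: plain Java, no DJL or MXNet classes involved.
final class ShapeOverflowDemo {

    // Hypothetical helper: mimics a code path that holds a dimension as int
    // and only afterwards widens it to long.
    static long narrowThenWiden(long dimension) {
        int narrowed = (int) dimension; // the high 32 bits are discarded here
        return narrowed;                // widening to long cannot restore them
    }

    // Hypothetical helper: fails loudly instead of silently wrapping around.
    static int checkedNarrow(long dimension) {
        return Math.toIntExact(dimension); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        long columns = 3_000_000_000L; // larger than Integer.MAX_VALUE (2_147_483_647)

        System.out.println(narrowThenWiden(columns)); // prints -1294967296

        try {
            checkedNarrow(columns);
        } catch (ArithmeticException e) {
            System.out.println("dimension does not fit in int: " + columns);
        }
    }
}
```

Once the value has passed through an int, casting it to long afterwards keeps the wrapped (negative) number, so any later size check or allocation works with a bogus dimension.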

PyTorch

  1. Call .sum() on a COO matrix.
  2. The call fails.
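
To make the expected result concrete, here is a rough sketch of what sum() and sum(axis) mean for a 2-D matrix in COO form (row indices, column indices, values). It is plain Java over arrays, not the DJL or PyTorch API; `CooSumSketch` and its methods are hypothetical names, and it assumes the indices are valid and the matrix is small enough for int indexing.

```java
import java.util.Arrays;

// Illustration only: a COO matrix held as three parallel arrays.
final class CooSumSketch {

    // Sum of all stored values; the implicit zeros contribute nothing.
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Sum along one axis of a 2-D COO matrix.
    // axis 0 collapses the rows (one result per column),
    // axis 1 collapses the columns (one result per row).
    static double[] sumAlongAxis(int[] rows, int[] cols, double[] values,
                                 int numRows, int numCols, int axis) {
        if (axis != 0 && axis != 1) {
            throw new IllegalArgumentException("axis must be 0 or 1");
        }
        double[] out = new double[axis == 0 ? numCols : numRows];
        for (int i = 0; i < values.length; i++) {
            int target = (axis == 0) ? cols[i] : rows[i];
            out[target] += values[i]; // duplicate coordinates simply accumulate
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 x 3 matrix with entries (0,0)=1, (0,2)=2, (1,2)=3
        int[] rows = {0, 0, 1};
        int[] cols = {0, 2, 2};
        double[] values = {1.0, 2.0, 3.0};

        System.out.println(sum(values)); // prints 6.0
        System.out.println(Arrays.toString(sumAlongAxis(rows, cols, values, 2, 3, 0))); // prints [1.0, 0.0, 5.0]
        System.out.println(Arrays.toString(sumAlongAxis(rows, cols, values, 2, 3, 1))); // prints [3.0, 3.0]
    }
}
```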

What have you tried to solve it?

  1. Tried using MXNet, which ran into a new crash.
  2. Tried using TensorFlow, which does not support sparsity (even though TF has SparseTensor).

Environment Info

Using Windows 10

Lundez avatar Aug 11 '21 13:08 Lundez

@Lundez Thanks for reporting this issue, will take a look.

frankfliu avatar Aug 15 '21 03:08 frankfliu

@Lundez

I created a PR to address the MXNet large tensor issue: #1183. Unfortunately, getShape() will still crash. By default, MXNet is compiled without large tensor support for performance reasons. You have to manually compile MXNet with the USE_INT64_TENSOR_SIZE=1 flag. Then you can set the MXNET_LIBRARY_PATH environment variable to load your customized libmxnet.so. See: http://docs.djl.ai/docs/development/troubleshooting.html#4-how-to-run-djl-using-other-versions-of-apache-mxnet
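
As a rough sanity check before starting the application, something like the sketch below can confirm that MXNET_LIBRARY_PATH is set and points at something that exists. It only inspects the environment with the Java standard library; it does not call DJL, and it does not verify that the library was built with USE_INT64_TENSOR_SIZE=1. `MxnetLibraryPathCheck` and `resolve` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Illustration only: checks the environment, does not load any native library.
final class MxnetLibraryPathCheck {

    static final String ENV_NAME = "MXNET_LIBRARY_PATH";

    // Returns the configured path when the variable is set and the path exists,
    // otherwise an empty Optional.
    static Optional<Path> resolve(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        Path path = Paths.get(value.trim());
        return Files.exists(path) ? Optional.of(path) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> library = resolve(System.getenv());
        if (library.isPresent()) {
            System.out.println("Custom MXNet library path: " + library.get());
        } else {
            System.out.println(ENV_NAME + " is not set or does not point to an existing path");
        }
    }
}
```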

frankfliu avatar Aug 22 '21 04:08 frankfliu

@frankfliu I see. Thank you for the assistance! 🤗

Lundez avatar Aug 22 '21 05:08 Lundez

@frankfliu did you ever get around to validating the PyTorch COOMatrix.sum() issue?

Lundez avatar Aug 24 '21 16:08 Lundez