umap icon indicating copy to clipboard operation
umap copied to clipboard

Invalid array access in transform()

Open msedwar opened this issue 6 years ago • 10 comments

When using the fit() then transform() workflow, umap shows undefined behavior such as intermittently returning a matrix of all NaN values, segfaulting, or even returning a correct value.

Looking into the issue, it appears as though the transform() function shows the undefined behavior when a training data set is used to fit() the model and a test set (with a cardinality greater than the training set) is passed to the transform() function.

The culprit appears to be the modulus n_vertices in the optimize_layout() function. When the size of tail_embedding is less than n_vertices, the accelerated code may perform an invalid array access (leading to segfaults, occassionally the correct value, etc.).

I believe the problem can be fixed by passing to optimize_layout() the shape of the original embedding in the transform() function instead of the shape of the new transformation data.

I am currently unsure if this would still be valid for the algorithm (though it still seems like a valid use case). I would appreciate any feedback on this discovery, thanks!

msedwar avatar Jul 20 '18 16:07 msedwar

Thanks that's good spotting and I believe a correct analysis of the issue. I'll have a look to be sure and see if I can get a robust fix in place.

On Fri, Jul 20, 2018 at 12:45 PM Matthew Edwards [email protected] wrote:

When using the fit() then transform() workflow, umap shows undefined behavior such as intermittently returning a matrix of all NaN values, segfaulting, or even returning a correct value.

Looking into the issue, it appears as though the transform() function shows the undefined behavior when a training data set is used to fit() the model and a test set (with a cardinality greater than the training set) is passed to the transform() function.

The culprit appears to be the modulus n_vertices in the optimize_layout() function. When the size of tail_embedding is less than n_vertices, the accelerated code may perform an invalid array access (leading to segfaults, occassionally the correct value, etc.).

I believe the problem can be fixed by passing to optimize_layout() the shape of the original embedding in the transform() function instead of the shape of the new transformation data.

I am currently unsure if this would still be valid for the algorithm (though it still seems like a valid use case). I would appreciate any feedback on this discovery, thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/89, or mute the thread https://github.com/notifications/unsubscribe-auth/ALaKBa3wxd88gmNUbQEXqZoroIQ9x9Rvks5uIgkjgaJpZM4VYZHd .

lmcinnes avatar Jul 20 '18 17:07 lmcinnes

Great analysis of the issue. It is a very valid use case. I ran into this exact problem yesterday and was going to spend some time over the weekend tracking down what was causing it. Thank you very much for saving me the time and effort.

jc-healy avatar Jul 20 '18 21:07 jc-healy

So I believe I just pushed what I believe is a fix for this, if either of you can verify that the current master now works I would appreciate it.

Sorry for the delay -- I've been busy with a few other things.

On Fri, Jul 20, 2018 at 5:18 PM jc-healy [email protected] wrote:

Great analysis of the issue. It is a very valid use case. I ran into this exact problem yesterday and was going to spend some time over the weekend tracking down what was causing it. Thank you very much for saving me the time and effort.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/89#issuecomment-406729527, or mute the thread https://github.com/notifications/unsubscribe-auth/ALaKBaNk_UgewU1347dk4qvZarnysRqLks5uIkk2gaJpZM4VYZHd .

lmcinnes avatar Jul 24 '18 21:07 lmcinnes

It appeared to still have an issue when using fit, then transform. I submitted a fix for it (the similar fix to fit, but for the transform function, and again only one line). I can verify that it works on my local for both dataflow cases.

msedwar avatar Jul 25 '18 15:07 msedwar

I seem to be encountering this problem too. I'm running a 8193x6 dataset to fit then transform, which is working fine. Then I am trying to transform a ~43000x6 dataset and it's just returning NaNs. When I take a subset of the data (<8193, arbitrarily picked slice) it works fine again. Once I get to around 8734x6 the transform almost exclusively returns NaNs.

tbloch1 avatar Aug 31 '18 15:08 tbloch1

This seems like a quirky bug :-( I did manage to fix some issues with NaN results, so I'm hoping that this turns out to be similar or related. I'll try to take a look at this on Monday.

lmcinnes avatar Sep 01 '18 00:09 lmcinnes

@tbloch1 I believe I pushed a commit for another issue that may have also fixed this one. If you could test the latest master and see if you still get the same problem I would appreciate it.

lmcinnes avatar Sep 05 '18 13:09 lmcinnes

It appears to be working fine now!

tbloch1 avatar Sep 06 '18 09:09 tbloch1

That's great news. I'm glad this worked. Hopefully this is the last of these issues to track down.

On Thu, Sep 6, 2018 at 5:36 AM tbloch1 [email protected] wrote:

It appears to be working fine now!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/89#issuecomment-419030189, or mute the thread https://github.com/notifications/unsubscribe-auth/ALaKBYoRdbDVsFVoMVBg6M-zP6o52zGyks5uYOyJgaJpZM4VYZHd .

lmcinnes avatar Sep 06 '18 13:09 lmcinnes

@lmcinnes Hello, we stumbled upon on the same NaNs error using transform over larger dataset than the one used to fit. Maybe you ought to take a look at this :) Thanks for your work guys, I very much appreciate this library !

benaisc avatar Dec 03 '21 10:12 benaisc