atlas [BUG] Using`log_metric()` with numpy types or other invalid values crashes Redis 😰

We have faced this issue before, but for some reason it was ignored. Recently it was resurfaced by https://github.com/SebastianGergen again in the Slack community channels.

Doing log_metric() with anything other than ints or floats essentially break the entire system and requires you to re-install Atlas because Atlas goes into a non-recoverable state.

This happens because our Flask API is expecting a certain contract with Redis which is not being met.

We should fail more gracefully, or support numpy types.

Related: https://github.com/dessa-oss/atlas/issues/96

Other details This was his message:

Hi Atlas Team, one more issue that I faced today in my "on premise" atlas installation: when I make "foundations.log_metric('accuracy', accuracy)" and add a value which is somehow "invalid" (like accuracy = np.random.rand(1)[0]) it seems that the whole database crashes or something. I tried several simple "valid" values like accuracy = 0.5, and it works fine.  The result after the crash is problematic: The project overview page does not show up any more (actually it only shows up for a brief short milisecond and then disappears), and I am left with a blank GUI page in the browser. Only way to recover for me until now is re-installing :disappointed: Sorry. this description is not very clear. I hope you can get my problem :slightly_smiling_face:

when I use accuracy = float(np.random.rand(1)[0]) and foundations.log_metric('accuracy', accuracy) it works fine. So i guess that numpy makes a problem, anyways it is pretty unconvenient, that at least from my perspective I can't recover atlas after once appending a numpy based value.

Apr 22 '20 18:04 mohammedri

Does this PR https://github.com/dessa-oss/atlas/pull/130 fix this issue as well @pippinlee ?

May 05 '20 13:05 mohammedri

Don't believe the colon issue was the same root cause, but @nicknagi correct me if I'm wrong.

May 05 '20 13:05 pippinlee

Sounds good! Did you have an expected timeline for this?

May 05 '20 14:05 mohammedri

Tested out log_metric() using np.float64 types specifically, trying to reproduce previous errors using .log_metric('key', np.random.rand(1)[0]).

I was able to successfully log without error. This may have been thanks to Nick’s fix. I'll add a test to prove np.float64 support. I think we should be clearer with what types we do allow logging for though. Here’s a list of values to help us understand current state WRT numpy:

Valid np.float64(29)

Invalid type(np.random.rand(1)[0]) np.array([1,2,3,4]) np.ndarray([1,2,3])

Note: all of the above invalid values gracefully fail with an error like the following: TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.

Ongoing questions:

Is that error message helpful?
Can we better define types we wish to support? That'll help us with these one offs "does log_metric() support type X", and we can provide this in docs to make this clearer to users.

May 11 '20 18:05 pippinlee

Initial thoughts:

Is that error message helpful?

I think the error message is helpful as is. The last message clearly specifies it all. TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.

Can we better define types we wish to support?

The intention behind log metric is to support recording metrics such as precision, recall, loss, Area Under Curve, f1-score et.al. NOT to support the outputs of a task e.g. the output of a BERT model, or the output of a generative NLP model - as such it seems weird to me that we support strings and a list of strings / numbers.

On the contrary I can see - why someone would wish to view the output on the GUI.

There are two routes to go from here:

We be more flexible: support strings, lists of string / numbers, np.array et. al.
We stick to what log_metric is supposed to be there for and strip off support for strings / lists.

Can anyone recall why we decided to support strings?

May 11 '20 18:05 mohammedri

My vote is to strip support for strings / lists and use log_metrics for what it is meant to be used for

May 11 '20 18:05 mohammedri

The decision of whether to change the behaviour of this feature (e.g. modify the types supported by log_metric) looks to be out of scope of this issue. I suggest that if this bug has been fixed we close it out and open a new issue to address that separate concern.

May 11 '20 18:05 ekhl

Works for me. Going to submit a PR to cover us for np.float64 and then we can close.

May 11 '20 18:05 pippinlee

I think this issue also came up when Damien and others in the DS team were exploring Atlas. I suspect their use case is a valid concern for us. We should understand better what it is before we mark this as closed.

May 11 '20 18:05 kyledef

@mohammedri & @pippinlee ,

@cindyzczhao and I were able to reproduce this error on 0.1.1 using the following code snippet

import foundations
import numpy as np
x = np.float64(5)
fooundations.log_metric('metric', x)

It broke the GUI and we had to flush redis to get things back up.

May 15 '20 16:05 shazraz

It broke the GUI and we had to flush redis to get things back up.

Thanks for the heads up. Have confirmed this is because we need to do a new release––using current release 0.1.1 still has log_metric bug.

Latest master works without error with logging np.float64(5). Have added a cypress test to confirm this.

May 18 '20 16:05 pippinlee

Hey @pippinlee okay to close this issue as this is fixed?

May 21 '20 14:05 mohammedri

atlas atlas copied to clipboard

[BUG] Using`log_metric()` with numpy types or other invalid values crashes Redis 😰

atlas
atlas copied to clipboard