atlas
atlas copied to clipboard
[BUG] Using`log_metric()` with numpy types or other invalid values crashes Redis 😰
We have faced this issue before, but for some reason it was ignored. Recently it was resurfaced by https://github.com/SebastianGergen again in the Slack community channels.
Doing log_metric()
with anything other than ints or floats essentially break the entire system and requires you to re-install Atlas because Atlas goes into a non-recoverable state.
This happens because our Flask API is expecting a certain contract with Redis which is not being met.
We should fail more gracefully, or support numpy types.
Related: https://github.com/dessa-oss/atlas/issues/96
Other details This was his message:
Hi Atlas Team, one more issue that I faced today in my "on premise" atlas installation: when I make "foundations.log_metric('accuracy', accuracy)" and add a value which is somehow "invalid" (like accuracy = np.random.rand(1)[0]) it seems that the whole database crashes or something. I tried several simple "valid" values like accuracy = 0.5, and it works fine. The result after the crash is problematic: The project overview page does not show up any more (actually it only shows up for a brief short milisecond and then disappears), and I am left with a blank GUI page in the browser. Only way to recover for me until now is re-installing :disappointed: Sorry. this description is not very clear. I hope you can get my problem :slightly_smiling_face:
when I use accuracy = float(np.random.rand(1)[0]) and foundations.log_metric('accuracy', accuracy) it works fine. So i guess that numpy makes a problem, anyways it is pretty unconvenient, that at least from my perspective I can't recover atlas after once appending a numpy based value.
Does this PR https://github.com/dessa-oss/atlas/pull/130 fix this issue as well @pippinlee ?
Don't believe the colon issue was the same root cause, but @nicknagi correct me if I'm wrong.
Sounds good! Did you have an expected timeline for this?
Tested out log_metric()
using np.float64 types specifically, trying to reproduce previous errors using .log_metric('key', np.random.rand(1)[0])
.
I was able to successfully log without error. This may have been thanks to Nick’s fix. I'll add a test to prove np.float64 support. I think we should be clearer with what types we do allow logging for though. Here’s a list of values to help us understand current state WRT numpy:
Valid np.float64(29)
Invalid type(np.random.rand(1)[0]) np.array([1,2,3,4]) np.ndarray([1,2,3])
Note: all of the above invalid values gracefully fail with an error like the following: TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.
Ongoing questions:
- Is that error message helpful?
- Can we better define types we wish to support? That'll help us with these one offs "does log_metric() support type X", and we can provide this in docs to make this clearer to users.
Initial thoughts:
Is that error message helpful?
I think the error message is helpful as is. The last message clearly specifies it all.
TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.
Can we better define types we wish to support?
The intention behind log metric is to support recording metrics such as precision, recall, loss, Area Under Curve, f1-score et.al. NOT to support the outputs of a task e.g. the output of a BERT model, or the output of a generative NLP model - as such it seems weird to me that we support strings and a list of strings / numbers.
On the contrary I can see - why someone would wish to view the output on the GUI.
There are two routes to go from here:
- We be more flexible: support strings, lists of string / numbers, np.array et. al.
- We stick to what log_metric is supposed to be there for and strip off support for strings / lists.
Can anyone recall why we decided to support strings?
My vote is to strip support for strings / lists and use log_metrics for what it is meant to be used for
The decision of whether to change the behaviour of this feature (e.g. modify the types supported by log_metric
) looks to be out of scope of this issue. I suggest that if this bug has been fixed we close it out and open a new issue to address that separate concern.
Works for me. Going to submit a PR to cover us for np.float64
and then we can close.
I think this issue also came up when Damien and others in the DS team were exploring Atlas. I suspect their use case is a valid concern for us. We should understand better what it is before we mark this as closed.
@mohammedri & @pippinlee ,
@cindyzczhao and I were able to reproduce this error on 0.1.1 using the following code snippet
import foundations
import numpy as np
x = np.float64(5)
fooundations.log_metric('metric', x)
It broke the GUI and we had to flush redis to get things back up.
It broke the GUI and we had to flush redis to get things back up.
Thanks for the heads up. Have confirmed this is because we need to do a new release––using current release 0.1.1 still has log_metric bug.
Latest master works without error with logging np.float64(5)
. Have added a cypress test to confirm this.
Hey @pippinlee okay to close this issue as this is fixed?