telemanom

Dataset questions

Open khundman opened this issue 5 years ago • 3 comments

Answers to questions received via email:

Q1. The anomaly_sequences column in labeled_anomalies.csv gives the start and end indices of the true anomalies in each stream. However, I don’t know whether the indices in your files begin at 0 or 1. For example, given [[6000,8127]] for channel id “D-2”, does the start index “6000” refer to row 6000 (counting from 1) or row 6001 (counting from 0) of the file “test/D-2.txt”?

The indices begin at 0.
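A minimal sketch of how the 0-based indices can be turned into per-row labels. It assumes the anomaly_sequences cell is a string such as "[[6000, 8127]]" and that the end index is inclusive (an assumption, though it matches the num_values - 1 convention in the Q2 answer). The function name `anomaly_mask` is hypothetical and not part of telemanom.

```python
import ast

def anomaly_mask(anomaly_sequences, num_values):
    """Return a list of booleans, True where a row lies inside an anomaly.

    Indices are 0-based; the end index is treated as inclusive (assumption).
    """
    ranges = ast.literal_eval(anomaly_sequences)  # e.g. "[[6000, 8127]]"
    mask = [False] * num_values
    for start, end in ranges:
        for i in range(start, end + 1):
            mask[i] = True
    return mask

# Hypothetical channel with 10 rows and one anomaly covering rows 2..4.
mask = anomaly_mask("[[2, 4]]", 10)
print([i for i, flag in enumerate(mask) if flag])  # [2, 3, 4]
```

With 0-based indexing, a start index of 6000 is the 6001st row of the file, i.e. line 6001 in an editor that numbers lines from 1.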

Q2. For the anomaly_sequences and num values columns in labeled_anomalies.csv, I found that some end indices are larger than num values: A-8.txt, A-9.txt, D-9.txt, F-2.txt. Is this a mistake?

This was an error and has been cleaned up. The anomalies go to the end of the sequence and the end of the range should equal num_values - 1.
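If you want to confirm this on your own copy of the labels, a quick sanity check along these lines may help. It is only a sketch: `out_of_range_sequences` is a hypothetical helper, and the ranges are assumed to be already parsed into lists of [start, end] pairs with 0-based, inclusive indices.

```python
def out_of_range_sequences(ranges, num_values):
    """Return the [start, end] pairs that do not fit a stream of num_values rows."""
    last_index = num_values - 1  # largest valid 0-based index
    return [
        [start, end]
        for start, end in ranges
        if start < 0 or end > last_index or start > end
    ]

# Hypothetical stream of 100 rows: the second range runs one past the end.
print(out_of_range_sequences([[10, 20], [90, 100]], 100))  # [[90, 100]]
```

An empty result means every range fits inside the stream; a range that runs to the end of the sequence should end at exactly num_values - 1.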

Q3. In both your test and train files, I found that most values are 0. I would like more background on the data to understand why.

The “Raw experiment data” section of the readme explains this: “Model input data also includes one-hot encoded information about commands that were sent or received by specific spacecraft modules in a given time window. No identifying information related to the timing or nature of commands is included in the data.” So you see lots of zeroes where commands weren’t sent to or received by a specific spacecraft module in a time window. At most timesteps, for most of the spacecraft submodules, there is no command activity. The first dimension holds the prior telemetry values for that channel (the -1.000s in the example you screenshotted) and will be primarily nonzero.
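To see this split in a channel, one option is to measure how sparse each column is. The sketch below uses made-up rows in plain Python lists (the real files would need to be loaded first); the helper name `zero_fraction_per_column` and the sample values are purely illustrative.

```python
def zero_fraction_per_column(rows):
    """Fraction of exactly-zero entries in each column of a 2-D list."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [
        sum(1 for row in rows if row[col] == 0) / n_rows
        for col in range(n_cols)
    ]

# Made-up rows: column 0 is the telemetry value, the rest are one-hot command flags.
rows = [
    [-1.0, 0, 0, 0],
    [-1.0, 0, 1, 0],
    [0.5, 0, 0, 0],
    [0.7, 0, 0, 0],
]
print(zero_fraction_per_column(rows))  # [0.0, 1.0, 0.75, 1.0]
```

In this toy example the telemetry column has no zeroes, while the command columns are almost entirely zero, which is the pattern described above.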

Q4. What is the time interval between the adjacent rows?

For the anomalies from the SMAP spacecraft, values are aggregated into 1 minute buckets. For MSL, the time bucket size is variable as data rates are inconsistent and no interpolation between values was performed to fill missing buckets. This is one factor in the poorer performance seen for MSL anomalies and something we will be addressing in future iterations.
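For SMAP channels, the 1 minute bucket size gives a rough way to convert an index range into a duration. This is a sketch under two assumptions that the answer above does not confirm: that consecutive rows are consecutive buckets with no gaps, and that the end index is inclusive. It does not apply to MSL, where the bucket size varies. `smap_span` is a hypothetical helper.

```python
from datetime import timedelta

SMAP_BUCKET = timedelta(minutes=1)  # bucket size stated for SMAP only

def smap_span(start, end):
    """Duration covered by rows start..end (inclusive, 0-based) of a SMAP channel."""
    return (end - start + 1) * SMAP_BUCKET

# Sixty consecutive 1 minute buckets cover one hour.
print(smap_span(0, 59))  # 1:00:00
```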

Q5. I found that anomalies for channel id P-2 are described twice, with different values (in row 19 and row 53), whereas there is no description of any anomaly for channel id T-10.

P-2 is the same channel with two anomalies occurring at different points in time, which is why you see two separate anomalies for that channel. These are entirely separate events and data that happen to occur for the same channel at different points in time. The full ranges of values are non-overlapping and the fact that the anomalous sequences have overlapping indices is coincidental.

T-10 didn’t have enough values to include, so it was intentionally removed from the dataset; in the interest of time, we didn’t rename all the channels.

khundman, Aug 30 '18 14:08

Thank you for making your analysis and data available. I'd like to use the dataset to assess the performance of multivariate time series anomaly detection algorithms, using the telemetry values of each channel (so the dimension of my time series would be the number of channels). However, I noticed that not all channels have the same number of values (time steps). Were the values for different channels not collected over the same time sequence? If not, is there any way to use the data from the channels jointly as a multivariate time series?

jules-samaran, Nov 14 '19 02:11

@jules-samaran It is not possible to "stack" the telemetry values as you are suggesting. Values for each channel come from different, independent time windows. Also, for MSL, data for a given channel often arrives at irregular intervals, and no imputation of timesteps between values was performed.

khundman, Nov 22 '19 21:11

I am using my own raw data. I have put the train and test data in the train and test folders respectively. Currently I am using only a single channel, with the telemetry value in column 0. However, when I run the code I get the errors "No such file or directory:'data\train\T_test.npy" and "No such file or directory:'data\test\T_test.npy". Kindly guide me; I am new to AI/ML.

anand-gy, Jan 30 '23 06:01