Fei Gao comments

Results: 7 comments by Fei Gao

summarize(text) should give minimum 1 word

I ran into this problem while trying to summarize the following paragraph:

> ALBUQUERQUE, N.M. (AP) — The Latest on Martin Luther King Jr. Day celebrations around the...

error

> Sometimes it raises this error: ![image](https://user-images.githubusercontent.com/45503919/147079001-8f0b0685-41bb-47a9-9598-cb4ea498f62f.png)
>
> but sometimes it does not.

It seems like something is wrong with the data input. I'd suggest you print the outputs of...

RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation

> Hi @rusty1s, I am trying to make [mutag_gin.py](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/mutag_gin.py) produce a deterministic result. Following the suggestions above, I changed the dataset to use SparseTensor, replaced `edge_index` with `adj_t`, and changed line 53...

AUC decreases A LOT after re-generating cached data

Specifically, I found an inconsistency: the node features in the provided cached data are not aligned with the `feature_encoder`. For instance, as shown below, the `charge` attributes of nodes in...

Now I provide you the missing PREPROCESSING codes.

> Hello, I found some issues in your code, specifically: ① The timestamps of the logs after processing are not aligned with other data; ② In no_fault data, it seems...

Now I provide you the missing PREPROCESSING codes.

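> You appear to be a Chinese speaker, so for ease of communication I will write in Chinese directly (mainly because my English is not great; if you need it, I can put together an English version later).
>
> Let me first outline the data preprocessing flow: to build your chunks, you first extract start_time and end_time from records, then treat every 10 s as one time interval, iterate over the three kinds of data (log, metrics, spans), and take the data that falls inside each interval as one sample.
>
> ① Log timestamp problem: in the original paper the three kinds of data are not time-aligned, so your code also performs an alignment step. However, the log timestamps appear to be handled incorrectly after alignment. As a result, when you match log data against the time intervals generated from the records file, every match comes back empty, so the log data is never actually used. (This becomes clear if you compare the timestamps of the logs produced by preprocessing with those of the other data.)
>
> ② The paper has two datasets, TT and SN, and each is split into fault (fault injection) and no_fault (no fault). Your preprocessing code, however, seems to use only the fault data, and...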
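The timestamp check suggested in point ① can be sketched roughly as follows. This is not the repository's preprocessing code: `bucket_by_interval` and `empty_fraction` are hypothetical helpers, and the timestamps (including the 8-hour offset on the logs) are made up purely to show what a misaligned source looks like.

```python
def bucket_by_interval(timestamps, start_time, end_time, interval=10):
    """Group timestamps into consecutive [t, t + interval) buckets.

    Hypothetical helper for illustration; timestamps are in seconds.
    """
    span = end_time - start_time
    n_buckets = int((span + interval - 1) // interval)  # ceiling division
    buckets = [[] for _ in range(n_buckets)]
    for ts in timestamps:
        # Anything outside [start_time, end_time) matches no interval.
        if start_time <= ts < end_time:
            buckets[int((ts - start_time) // interval)].append(ts)
    return buckets


def empty_fraction(buckets):
    """Share of intervals that received no data at all."""
    if not buckets:
        return 0.0
    return sum(1 for b in buckets if not b) / len(buckets)


# Made-up example: one record spanning 30 s, i.e. three 10 s intervals.
start_time, end_time = 1000, 1030
metric_ts = [1001, 1012, 1025]
# Assumed bug for illustration: log timestamps shifted by 8 hours.
log_ts = [ts + 8 * 3600 for ts in metric_ts]

metric_buckets = bucket_by_interval(metric_ts, start_time, end_time)
log_buckets = bucket_by_interval(log_ts, start_time, end_time)

print("metrics empty fraction:", empty_fraction(metric_buckets))
print("logs empty fraction:", empty_fraction(log_buckets))
```

If one source reports an empty fraction close to 1 while the others do not, its timestamps are probably on a different clock or unit than the intervals derived from records.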

Now I provide you the missing PREPROCESSING codes.

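@YixiangTang @1258820789 @dinghanfei Thank you all for the lively discussion. I no longer work in this area, so I am sorry that I do not have the capacity to keep updating my preprocessing code; PRs that fix it are welcome 🙏. Also, if you plan to do research based on the authors' code, please be aware ⚠️ of another critical problem: data leakage in how the Eadro source code splits time windows. According to the [`get_basic`](https://github.com/BEbillionaireUSD/Eadro/blob/82ff9c9a7a58a9cd2b713aa78754a73c543a15db/codes/preprocess/align.py#L16) function, if the labeled fault period is `start_time=1, end_time=20` and `chunk_lenth=10`, this fault period is turned into 10 chunks: `[1-->10], [2-->11], ......, [10-->20]`, and all 10 chunks are treated as positive (faulty) samples. **Note ⚠️: a problem has already appeared at this point, namely that these 10 chunks are not independent of one another! They overlap heavily in time.** Even worse, [here](https://github.com/BEbillionaireUSD/Eadro/blob/82ff9c9a7a58a9cd2b713aa78754a73c543a15db/codes/preprocess/align.py#L135) all chunks are additionally shuffled. This may cause the two chunks `[1-->10]` and `[2-->11]`...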
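A minimal sketch of the overlap described above, for anyone who wants to see the effect in numbers. It is not the Eadro code: `sliding_chunks`, `overlap`, and `max_train_overlap` are hypothetical helpers, the windows are modeled as half-open intervals `[s, s + 10)` with stride 1 (an assumption chosen so that the example yields 10 chunks), and the 7/3 split ratio is arbitrary.

```python
import random


def sliding_chunks(start_time, end_time, chunk_length):
    """Stride-1 windows over a fault period, as half-open (begin, end) pairs."""
    return [(s, s + chunk_length)
            for s in range(start_time, end_time - chunk_length + 1)]


def overlap(a, b):
    """Number of time steps shared by two half-open windows."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def max_train_overlap(test_chunks, train_chunks):
    """For each test chunk, its largest overlap with any training chunk."""
    return {t: max(overlap(t, tr) for tr in train_chunks) for t in test_chunks}


chunks = sliding_chunks(start_time=1, end_time=20, chunk_length=10)

# Shuffle first, then split: the pattern that mixes neighbouring chunks.
rng = random.Random(0)
shuffled = chunks[:]
rng.shuffle(shuffled)
train, test = shuffled[:7], shuffled[7:]

leak = max_train_overlap(test, train)
for chunk, shared in sorted(leak.items()):
    print(f"test chunk {chunk} shares {shared} of 10 steps with a train chunk")
```

Adjacent chunks share 9 of their 10 time steps, so after a shuffle every test chunk has a near-duplicate in the training set. One common mitigation, which the comment above does not prescribe, is to assign whole fault periods to either train or test before windowing.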