Discussion - time series and info leak in Graph NN
Overview
I am referring to the great post below while trying to build a graph NN for credit card fraud detection. Thanks for sharing the great work! I have a question for the community about the time series problem and the potential information leak.
For instance, in the citation dataset, when we train the model we are in effect looking back and mixing all publications together. But in reality, a paper published in, say, 2021 cannot cite papers published in 2022. The dataset does not have a timestamp, so I am not sure how the edge data was processed to avoid a possible info leak.
https://github.com/keras-team/keras-io/blob/127613fbc24124bcf75d81c02a26655cf65b7902/examples/graph/gnn_citations.py#L74-L80
How to reduce info leak when preparing edges
I will have to deal with credit card application data with timestamps, so I assume I should be very careful when building the edges to avoid info leak. Past applications cannot link to future applications, because in production this will never happen.
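To make the constraint concrete, here is a minimal pure-Python sketch of what I mean by "past applications cannot link to future applications". The tuple layout `(app_id, entity_id, received_at)`, the sample ids, and the function name `build_backward_edges` are all made up for illustration; they are not from my real data or from any library.

```python
from collections import defaultdict
from itertools import combinations


def build_backward_edges(applications):
    """Return (later_app, earlier_app) pairs for applications sharing an entity.

    `applications` is a list of hypothetical (app_id, entity_id, received_at)
    tuples. Every edge points from the later application to the earlier one,
    so an application never gets an edge to something received after it.
    """
    by_entity = defaultdict(list)
    for app_id, entity_id, received_at in applications:
        by_entity[entity_id].append((received_at, app_id))

    edges = set()
    for apps in by_entity.values():
        apps.sort()  # oldest first
        for (t_early, early), (t_late, late) in combinations(apps, 2):
            if t_late > t_early:  # skip ties: same timestamp gives no edge
                edges.add((late, early))
    return sorted(edges)


# Hypothetical sample: integer timestamps, two shared entities.
applications = [
    ("a1", "card_9", 1),
    ("a2", "card_9", 5),
    ("a3", "card_9", 3),
    ("a4", "phone_7", 2),
    ("a2", "phone_7", 5),
]
edges = build_backward_edges(applications)
print(edges)
```

Comparing real timestamps instead of ids avoids relying on ids being assigned in order of receipt, which is an assumption I would otherwise have to verify.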
Currently I am using PySpark GraphFrames to prepare the node and edge datasets. In the motif finding I limit the linkage like this:
# Find pairs of applications that link to the same entity,
# keeping only pairs where app_1 has the larger id.
motifs = g.find("(app_1)-[edge_1]->(entity); (app_2)-[edge_2]->(entity)")\
.filter("app_1.id > app_2.id")
The id comparison in the filter requires that the source application was received later than the target application (this assumes ids increase with the time of receipt). Intuitively, I think it helps reduce the info leak from peeking at future info. But I am not sure whether this is sufficient. Does the graph NN take the direction of the edges into account? If so, it looks like this would block the propagation from future applications to past ones. How does the community deal with this info leak challenge when preparing data for a graph NN?
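To illustrate the direction question, here is a toy sketch of one round of sum aggregation over directed edges. It assumes the layer passes messages from the second node of each `(source, target)` pair into the first, which is how I read the linked Keras example; other layers or libraries may treat edges as undirected, so please take this as an assumption and not as a statement about any particular implementation. The function name and the feature values are made up.

```python
def aggregate_from_targets(features, edges):
    """One round of sum aggregation along directed (source, target) edges.

    Each source node receives the feature of every target it points to.
    Nothing flows in the opposite direction.
    """
    received = {node: 0.0 for node in features}
    for source, target in edges:
        received[source] += features[target]
    return received


# Hypothetical scalar features; a1 is the oldest application, a3 the newest.
features = {"a1": 1.0, "a2": 10.0, "a3": 100.0}

# Edges point from the later application to the earlier one.
edges = [("a2", "a1"), ("a3", "a1"), ("a3", "a2")]

directed = aggregate_from_targets(features, edges)

# If the edges are symmetrized, older nodes also receive from newer ones.
reversed_edges = [(target, source) for source, target in edges]
symmetric = aggregate_from_targets(features, edges + reversed_edges)

print(directed)
print(symmetric)
```

With directed edges the oldest application `a1` receives nothing, while with symmetrized edges it receives features from both newer applications, which is the leak I am worried about.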
Looking for your insights, thanks in advance!