pytorch_geometric
pytorch_geometric copied to clipboard
[Roadmap] Temporal Graph Support 🚀
Updated on Nov 27th.
This is a roadmap towards better temporal data and temporal GNN support in PyG (targeted for v2.5), which spans across data handling, data loading, models and examples.
Data Handling
Previously, we used TemporalData
as our abstraction to represent temporal graphs. However, this data structure is limited, i.e. it can only hold a stream of events, it mixes homogeneous and heterogeneous data, it cannot naturally deal with node features, etc. As such, we want to deprecate TemporalData
and move time handling to Data
and HeteroData
explicitly.
- [ ] Deprecate
TemporalData
- [x] Bring
data.time
support toData
/HeteroData
:- [x] Expose
time
as a property - [x] Support
data.up_to(time) -> Data
anddata.snapshot(start_time, end_time)
- [x] Support
data.sort_by_time
(or similar)
- [x] Expose
- [ ] Convert all datasets that currently utilize
TemporalData
- [ ] Support node-level training labels: This requires a data-structure for which we can query the label efficiently given a node ID and seed timestamp.
- [x] Support
Data.update(data)
/HeteroData.update(data)
in order to add new incoming edges.
Open Questions:
- How do we represent changing node features over time?
Data Loading
With Data
and HeteroData
having time support, we can utilize PyG's data loaders to take care of temporal subgraph sampling. Importantly, NeighborLoader
and LinkNeighborLoader
have been already equipped with temporal sampling support.
- [ ] Deprecate
TemporalDataLoader
- [x] Support node-level/edge-level temporal sampling in
NeighborLoader
- [x] Support node-level/edge-level temporal sampling in
LinkNeighborLoader
Models
Currently, we only expose TGN
as a temporal GNN model, which currently operates on TemporalData
and its own sampler implementation.
- [ ] Re-factor
TGN
to be able to operate onData
/HeteroData
while utilizingNeighborLoader
/LinkNeighborLoader
- [ ] Re-factor
TGN
to support multi-hop sampling (currently capped tonum_hops=1
) - [ ] Finish our
GraphMixer
implementation (#8304)- [ ] Out-source
MLPMixer
inGraphMixer
into anAggregation
module
- [ ] Out-source
- [ ] Finish our
EdgeBank
implementation (#7966) - [ ] Add the
NAT
model
Examples and Tutorials
Afterwards, we need clear and descriptive examples and tutorials on how to leverage PyG for node-level and link-level temporal applications.
- [ ] An example of
GraphMixer
for temporal node-level predictions on TGB dataset - [ ] An example of
EdgeBank
for temporal link-level predictions on TGB dataset - [ ] A tutorial on PyG's temporal GNN support
Advanced Features
One current limitation of most temporal GNN models is that they learn an embedding per node, which gets updated over time. However, it is not necessarily scalable to keep this embedding on the GPU due to memory constraints. As such, advanced features may include data-structures to query and train these embeddings on CPU, with efficient asynchronous device transfers to avoid host2device/device2host bottlenecks. An example of such a data-structure may be inspired by the GNNAutoScale paper.
This would be very useful for our work. Do you have an idea in mind of how this would look?
I sadly hadn't have time to fill in some concrete roadmap plans yet, but we are looking into ways to:
- Allow easy updates to
edge_index
for newly incoming edges - Allow for temporal sampling (either as part of current samplers or as its own sampler).
Ah got it, so the graph itself is temporal, you're not thinking of time series on a graph. Keen to learn more if and when you guys have further thoughts on this. Thanks!
Hi @rusty1s, in order to try to contribute to the Roadmap definition/planning, I want to add some topics/thoughts to be considered:
- Consider extending or updating the TemporalData entity to support all the combinations of static/dynamic graphs with static/temporal signals. That way the Signal implementations in PyGT may inherit from it in future.
- There are Discrete-time dynamic graphs - DTDG (sequences of static graph snapshots taken at intervals in time) and Continuous-time dynamic graphs - CTDG (timed lists of events). There are models that use DTDG and models that use CTDG. The differences between them in terms of implementation might be considered during the design of the structures for Dynamic Graphs.
- Currently, the TGNMemory module implements a way to deal with updates on the graph (item 1. in the list above). A more generic module (extracted from there) might be considered.
If you agree with any of these topics/thoughts, we can transform them into tasks to be done and I can help with some of them :)
Thank you @otaviocx, these are great thoughts. I'm still struggling to find a good way to unify DTDG and CTDG. Currently, TemporalData
refers to a list of events (which we could convert to DTDG but not vice versa). How do you think people want to define their temporal data in the first place? Shall we require lists of events as input and then simply allow conversion to snapshots? Do you think it's best to separate both representations from the start or do you think we should unify them?
Hi @rusty1s, thanks for answering!
How do you think people want to define their temporal data in the first place?
It is hard to say since this area is really recent and still evolving a lot. But, I would say we can start with something as flexible as possible to support many scenarios. That being said, I can think of 2 ways of temporal data representation:
- a list of events (which can already be represented by
TemporalData
) - a graph structure with some timestamp fields representing the temporal layer. In this approach, we can extend/update the
utils.to_networkx
andutils.from_networkx
methods to start supporting temporal info. It can be, for instance, a new param on each function to pass the name of the timestamp field for nodes and edges. Wdyt?- we can evolve this approach to deal with snapshots too. For instance, to have another method, let's say
utils.increment_from_networkx
, which will get a graph snapshot and a list of events starting from this first snapshot (just an idea that came to my mind now).
- we can evolve this approach to deal with snapshots too. For instance, to have another method, let's say
Shall we require lists of events as input and then simply allow conversion to snapshots?
Thinking on the second approach in the answer above, we can also support graph snapshots in a well-known format (such as networkx). But, I would say we can allow the conversion. Even not working with DTDG, I think the snapshots may be useful for the users/researchers as a debug tool. It can be used to show the graph at specific times and to do some manual comparisons and also plotting. Wdyt?
Do you think it's best to separate both representations from the start or do you think we should unify them?
I think we can have these 2 representations as separated structures and allow the conversion between them. Thinking about a Roadmap, as far as I can see, the DTDG are being used less and less by new models. Thus, we may consider the support of snapshots and conversions as further steps... Wdyt?
These are great thoughts, thank you. I'm currently leaning towards option 2, where we let the attribute time
(or t
) denote the timestamp information. This way, we can convert simple data
objects into a stream of events as well (and vice versa). We can also add a Data.update
to append new edges to the temporal graph.
Hi, @rusty1s. Nice, thanks! We can also add a TemporalData.snapshot
method which will receive a param time
and will return a snapshot of the Data until the given time, wdyt?
Thus, I think we can already get some tasks from this conversation to put in this Roadmap. I would say:
- Create a child of
Data
orBaseData
that can represent a Temporal Graph- it can be a refactoring of
TemporalData
, for instance.
- it can be a refactoring of
- Add an
update
method to this newTemporalData
class to allow users to append new edges to the temporal graph. - Add
utils
methods to convert from simpledata
objects into a stream of events (and vice versa).- The stream of events can be represented by the current
TemporalData
class, for instance.
- The stream of events can be represented by the current
- Add a
snapshot
method to this newTemporalData
class to take a snapshot of the graph given a timestamp. - Select at least 5 Temporal GNN models (as diverse as possible) and try to implement them using PyG to investigate what else can be needed.
Is this a good start?
One question: does it make sense to have TemporalData
inheriting from BaseData
?
Yes, this sounds great. I think the separation of Data
and TemporalData
makes sense in the beginning, but it's also possible to merge those two in later releases.
Yes, TemporalData
should inherit from BaseData
IMO. This + update
and conversion to snapshot Data
objects seem like a great starting point.
@otaviocx Thanks for this, its a great plan.
One thing to think of after we are done with the initial 5 steps you mentioned, is to think of how to incorporate node addition/ deletion to the update
method.
Really interesting discussion.
Have we got enough to put together a set of issues so that people like me can also help out :-)
On Sat, 18 Dec 2021, 8:47 pm Jinu Sunil, @.***> wrote:
@otaviocx https://github.com/otaviocx Thanks for this, its a great plan. One thing to think of after we are done with the initial 5 steps you mentioned, is to think of how to incorporate node addition/ deletion to the update method.
— Reply to this email directly, view it on GitHub https://github.com/pyg-team/pytorch_geometric/issues/3230#issuecomment-997197859, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGRPNYAXP7T7LY7QZ6KZNTURR7HJANCNFSM5EVNLVHA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you commented.Message ID: @.***>
--
By communicating with Grab Inc and/or its subsidiaries, associate companies and jointly controlled entities (“Grab Group”), you are deemed to have consented to the processing of your personal data as set out in the Privacy Notice which can be viewed at https://grab.com/privacy/ https://grab.com/privacy/
This email contains confidential information and is only for the intended recipient(s). If you are not the intended recipient(s), please do not disseminate, distribute or copy this email Please notify Grab Group immediately if you have received this by mistake and delete this email from your system. Email transmission cannot be guaranteed to be secure or error-free as any information therein could be intercepted, corrupted, lost, destroyed, delayed or incomplete, or contain viruses. Grab Group do not accept liability for any errors or omissions in the contents of this email arises as a result of email transmission. All intellectual property rights in this email and attachments therein shall remain vested in Grab Group, unless otherwise provided by law.
Thank you! I updated the initial description of the issue according to our discussion. If you are interested in contributing, feel free to send an initial PR or create a separate issue to discuss further :)
Nice! The roadmap looks great.
Thanks for updating the roadmap @rusty1s.
I'm currently working on items 1, 2, 4 and 5. Planning to submit some PRs this week.
Looking forward to it :)
Hey @otaviocx I thought I might try to start helping out with the last task. Do you have anything in mind for the methods that would be best to implement?
I'm not very familiar with what is being done in the field, but for CTDG I was thinking something that looks like the framework proposed in https://arxiv.org/abs/2006.10637 is probably worth having?
Hi @Padarn, sorry for the delay in replying to your message. For some reason, I didn't get notified. I'm not sure if I understood your question, please, excuse me if my answer goes on the wrong path.
So, the TGN is already implemented in https://github.com/pyg-team/pytorch_geometric/blob/master/torch_geometric/nn/models/tgn.py. Actually, this seems to be the only Temporal GNN model defined in PyG. There is a good list of other Temporal architectures here https://pytorch-geometric-temporal.readthedocs.io/en/latest/notes/resources.html
The TemporalData
class currently is used by the TGN implementation. But, it is more like an Event List/Log than a temporal graph. I would say that it may work for other CTDG architectures as it is, but not sure for DTDG. Wdyt?
@rusty1s sorry for the delay in sending the PRs. I was stuck on the implementation of all the missing methods that TemporalData
should have to inherit from BaseData
. I'm preparing two approaches to show soon:
- To add the missing methods in
TemporalData
and use aGlobalStorage
asself._store
similar to the one we have in theData
class. - To inherit from
Data
and have a method to receive the event log as we currently receive in theTemporalData
constructor (src, dst, time, y and message) and convert it to theData
signature (x, edge_index, edge_attr, y and position).
One thing that I realised is that we will also need an HeteroTemporalData
for some models, do you agree? The TemporalData
class as it is doesn't support different types of nodes or edges yet.
Hi @otaviocx no worries and thanks for your reply. Your reply does not go down the wrong path at all. My question was just to understand which architectures you thought would best represent what we want to make sure is supported. The reference you provided gives a good starting point.
I agree that we'll probably need some of the changes coming in your upcoming PRs to support DTDG architectures with TemporalData
without introducing some hacks.
I wasn't aware of https://pytorch-geometric-temporal.readthedocs.io/en/latest/index.html earlier - Is it fair to say we want to bring most of this functionality inside of pytorch geometric?
I wasn't aware of https://pytorch-geometric-temporal.readthedocs.io/en/latest/index.html earlier - Is it fair to say we want to bring most of this functionality inside of pytorch geometric?
Hi @Padarn, I would say that here we will have a more generic approach and elements in a way that pytorch-geometric-temporal
or other derivated libs can take advantage of. But, it is better to confirm with the maintainers (@rusty1s, for instance).
sorry for the delay in sending the PRs
Hi @rusty1s, I've created the first PR with items 1 and 2 of the Roadmap. Item 3 will need some refactoring in the JODIEDataset
and other classes. This dataset seems to be not supporting the DataLoader yet (we need to deal with the data param in the model since there is no train/val/test dataset split). I'm thinking to refactor this dataset to have a split
param in its constructor turning possible to have 3 (train/val/test) DataLoaders in the model example file.
Also, some classes (such as the DataLoader) expects an Union of Data
and HeteroData
as a param (Union[List[Data], List[HeteroData]]
or Union[Data, HeteroData]
). I'm refactoring where is needed to BaseData
.
One thing to think of after we are done with the initial 5 steps you mentioned, is to think of how to incorporate node addition/ deletion to the update method.
Hey @wsad1 and @rusty1s, regarding the update
method, if we follow the TGN strategy to model the temporal graph as a stream of events, we can actually have events to add/update edges or nodes or even to delete them:
Thus, I think the update in such a case will be just the addition of new events to the stream, wdyt?
The logic to compute the nodes/edges addition, update or deletion will be actually in the model (Memory in the case of TGN) or in the snapshot function to get a static version of the graph at the given time (considering all the additions, updates and deletions).
Does it make sense to you?
This is exactly what I was thinking when I read that paper. +1 from me.
On that note, I am looking at working on the temporal snapshots on top of the work you're doing. Let me know if you see any risk of overlapping efforts here.
Thus, I think the update in such a case will be just the addition of new events to the stream, wdyt?
Not sure I understand that. Node-wise and interactions events seem to have different representations, so how do you want to merge them?
Node-wise and interactions events seem to have different representations, so how do you want to merge them
Well... as per the TGN and Jodie papers, the datasets we currently have are all with a single type of event (interactions). Even so, we can model a node-wise event as an event that has the same value for the src
and dst
keys. Just checked and if we get an event like that, only the memory
and embeddings
of this node are changed (on the current TGN implementation).
Now that we are diving a bit deeper into that subject, some other topics/challenges appear:
- We may also need to change the
TGNMemory
module since as it is now, the number of nodes is fixed. Thus, if a new node appears, there is no memory for it on this model. - Most of the CTDG GNNs are using events stream with this same structure: source node id, destination node id, time and message. Thus, it seems the features of the nodes and edges features are mixed in the message. So, if my understanding is right, there is no way to extract a static graph (snapshot) with features in nodes and edges (all the features will be always in edges since they represent the messages/events). We may conclude that the support for DTDG may be based on a different structure (something more similar to what we already have in PyG Temporal, for instance).
- Most (if not all) CTDG GNNs are dealing with multi-graphs, which means that we can have many edges linking the same two nodes. The challenge here is again on the static graph (snapshot) extraction. Should we extract a multi-graph? Or... should we extract a graph considering only the last edge for each node pair?
I am looking at working on the temporal snapshots on top of the work you're doing
@Padarn, considering the topics above, I would say we may need some design definitions before starting to code changes regarding snapshots, wdyt?
@Padarn, Data.to_temporal(time_attr='time')
on the other side is something that I think we can start working on, wdyt?
Thanks for the clarification @otaviocx. So node-level events are treated as self-loops. So, IMO, to_snapshot
should probably simply give you a snapshot in the form of a multi-graph with self-loops (solely holding edge features rather than node features). After that, the user could make self-loop features to node features (e.g., based via some DNN module). In theory, we could have different options for to_snapshot
(like keep_ony_last=True
), but we should start simple and the least magically.
Updates to the number of nodes/edges is another interesting topic, and I have some thoughts on this. For example, one could sub-class Tensor
to a TemporalTensor
that actually allocates more memory in the node dimension than what is currently needed. To the outside, it utilizes a slicing/masking appraoch to only expose the parts of the tensor that are actually filled with information. In the inside, however, we do not have to allocate new memory (at least not all the time) whenever a new node appears.
Most of the CTDG GNNs are using events stream with this same structure: source node id, destination node id, time and message. Thus, it seems the features of the nodes and edges features are mixed in the message. So, if my understanding is right, there is no way to extract a static graph (snapshot) with features in nodes and edges (all the features will be always in edges since they represent the messages/events). We may conclude that the support for DTDG may be based on a different structure (something more similar to what we already have in PyG Temporal, for instance).
@otaviocx I don't think node events should be treated as self-loops, or at least I'm not sure what the rationale for that is. I'm not sure how a model can distinguish between actual self-loops and node events. You can generalize this format to have both node and edge events. This can be implemented by:
- Having an extra
event_type
column/key, indicating whether the event is a node or edge event, or more generally the type of node/edge event. - Similar to how we pad sequence lengths of sequential models like RNNs, when feature sizes are of different lengths, shorter events can be padded with zeros.
- Node events can have anything in the destination node (i.e.
dst
) column/key, such as the source node as you've suggested, although I think changing it to a fixed invalid node id (e.g.-1
) and building in checks to ensure thedst
node id is not valid may make things less error-prone.
There may be different networks to learn from updates to nodes and edges respectively, and we can use slicing/masking to get the relevant events.
This offers a few benefits:
- We can now distinguish between self-loops and node events
- We can have events with different numbers of features
- We can now handle arbitrarily many types of events in different ways, including events involving changes to the number of nodes/edges
What do you think?
@otaviocx Oh I realised that you have a PR https://github.com/pyg-team/pytorch_geometric/pull/3985 that's almost ready for merging. I'll take a closer look later this weekend to understand if what I said above makes sense given that it's to be merged :)
Also left comment abouts TemporalDataLoader
shuffling and testing in your PR!
@Padarn,
Data.to_temporal(time_attr='time')
on the other side is something that I think we can start working on, wdyt?
@Padarn @otaviocx Let me know if I can help with this :)
Hey @yulonglin I haven't started on this task. Please go ahead from my point of view. Sorry I missed the emails from this thread.
@otaviocx sorry for the slow reply
@Padarn, considering the topics above, I would say we may need some design definitions before starting to code changes regarding snapshots, wdyt?
Yeah agreed. I will take a stab at this in a few days if no one else has (not a lot of time around work for the next few days). Maybe I'll split it into a separate issue.