go-gtp
Thoughts on state persistence?
How/where should this be handled? How to survive a restart without losing all the sessions in case of scheduled and unscheduled maintenance?
To support scheduled maintenance, I am guessing the app can stop serving, get the list of sessions, encode it (using gob?) and write it to a file. On startup it would read the file, decode it, and start serving.
To support unscheduled maintenance, the app would have to bypass the library's session store and instead operate on a blob store (network/disk) on every message.
What do you think?
It'd be nice to have that kind of feature, but I have no idea how it should work. I've seen proprietary features for this in large telco vendors' systems, but they manage the state in a higher-level entity (not in each node, but across the whole EPC system with a management-specific node, maybe for consistency). LTE (EPC) is designed as a kind of "monolith", which makes it complicated to make it resilient enough. This is one of the biggest issues in EPC implementations, and I want to resolve it in the project with the CNF Testbed team by discussing it with cloud-native specialists (see https://github.com/cncf/cnf-testbed/milestone/56, https://github.com/cncf/cnf-testbed/milestone/55, etc.).
So, how about joining us to discuss how it should work? https://cloud-native.slack.com (see the #cnf-testbed-dev channel)
FYI, the only related functionality described in the spec is the restart counter in Echo messages, and I'm thinking about implementing it some day.
@wmnsk you will have to invite me according to the login page.
Oh OK, I sent an invitation to your email address shown in your commit message.
A GGSN/PGW/SMF that is not just a toy has to consider and implement 3GPP charging support and meet regulatory requirements when doing so. This ultimately leads to highly volatile state in the user plane (traffic counters, timers, and so on). Moving that state between nodes is doable, but far from trivial. The real question is whether that is really necessary. GTP contexts will be reestablished by the UE very quickly once it has been notified that the context was terminated. It will usually be enough to cleanly remove the contexts and let the UE handle the recovery. As long as the node is able to handle new requests, no one will notice that it was down for a moment or had a partial failure.
Do we want an EPC node to come back up with all its old sessions? If yes, then sessions are typically kept outside the process, either on the same server or on a different server in the same data centre. This is called local redundancy. The other aspect is handling a data centre failure. In this case sessions are pushed to a database, and state changes are frequently written to it. This database may be located in another data centre. This is geographical redundancy.
We should note that with VoLTE coming into the picture, an operator that has to support emergency calling needs to make sure that some resources are available at all times for these kinds of calls. The operator also needs to make sure there is no revenue loss (from lost charging records) due to any failure in EPC nodes.
The good news is that typically very few calls are active at any time, and most subscribers are just connected to LTE (always on).