Safe recreation of topics
Currently hermes-management has a bug that is triggered by deleting a topic and then quickly recreating it.
The story is similar every time:
- someone deletes a topic
- topic is recreated
- kafka producer has stale metadata for that topic
- kafka producer fails to send messages to the brokers
- messages are buffered in hermes frontend instances
- we need to restart frontend instances in order for the messages to be retransmitted
The issue was thought to be solved by the upgrade to Kafka client 2.8.2, but it appeared again recently. We would like to have a workaround for this.
One of the proposed solutions is to introduce a "grace period" for deleted topics: if someone deletes a topic, we should block the creation of a topic with the same name for long enough that the cluster and Kafka producers can reach a consistent state. Anything longer than 5 minutes is probably enough, because producer metadata is refreshed every 5 minutes.
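A minimal sketch of the grace-period check described above. The class `TopicGracePeriod` and its method are hypothetical names, not existing Hermes code, and the 6-minute value used in the usage note is only an assumption chosen to sit above the 5-minute metadata refresh interval (the Kafka client default for `metadata.max.age.ms`; a deployment may override it).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Hermes: decides whether a deleted
// topic's name may be reused yet.
final class TopicGracePeriod {
    private final Duration gracePeriod;

    TopicGracePeriod(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    // Returns true only once strictly more than the grace period has
    // elapsed between the deletion time and "now".
    boolean allowsRecreation(Instant deletedAt, Instant now) {
        return Duration.between(deletedAt, now).compareTo(gracePeriod) > 0;
    }
}
```

With `new TopicGracePeriod(Duration.ofMinutes(6))`, a recreation attempt 5 minutes after deletion would be rejected and one 7 minutes after deletion would be allowed. Passing `now` explicitly keeps the check easy to unit test.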
Hey @szczygiel-m , is this up for grabs?
Hi, sure 😄 Assigned
Hey, it may be too late for Hacktoberfest, but I can take a look at this for a couple of days. You can assign it to me if you want.
Looking at this, here is a high-level proposal:
- model: add a list of recently deleted topics to the group in ZK; each entry contains the deletion date and any other minimal metadata needed
- ZookeeperTopicRepository creation
  - check if the topic name exists in the list and whether it is beyond the threshold or not
  - if beyond the threshold, remove the old node from the deleted list and create normally
  - otherwise, throw a new exception, similar to but not the same as TopicAlreadyExistsException, and block creation
- ZookeeperTopicRepository deletion
  - add a node to the deleted list
  - also check if the topic was deleted in the past, and replace the old entry if so

Enhancements:
- have a reaper process that cleans deleted topics in the list periodically
- have the threshold configurable
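To make the proposal concrete, here is a rough in-memory sketch of the flow. It does not touch ZooKeeper or the real `ZookeeperTopicRepository`; `DeletedTopicsRegistry`, `TopicRecentlyDeletedException`, and all method names are made up for illustration, and a `ConcurrentHashMap` stands in for the deleted-topics nodes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical exception, deliberately distinct from TopicAlreadyExistsException.
class TopicRecentlyDeletedException extends RuntimeException {
    TopicRecentlyDeletedException(String topicName, Instant deletedAt) {
        super("Topic " + topicName + " was deleted at " + deletedAt
                + " and cannot be recreated yet");
    }
}

// In-memory stand-in for the proposed list of recently deleted topics in ZK.
class DeletedTopicsRegistry {
    private final Map<String, Instant> deletedAtByTopic = new ConcurrentHashMap<>();
    private final Duration threshold; // configurable, per the enhancement list

    DeletedTopicsRegistry(Duration threshold) {
        this.threshold = threshold;
    }

    // Deletion path: record the deletion, replacing any older entry.
    void markDeleted(String topicName, Instant now) {
        deletedAtByTopic.put(topicName, now);
    }

    // Creation path: pass silently if the name is unknown or past the
    // threshold (dropping the stale entry), otherwise block creation.
    void ensureCreationAllowed(String topicName, Instant now) {
        Instant deletedAt = deletedAtByTopic.get(topicName);
        if (deletedAt == null) {
            return;
        }
        if (isExpired(deletedAt, now)) {
            deletedAtByTopic.remove(topicName);
            return;
        }
        throw new TopicRecentlyDeletedException(topicName, deletedAt);
    }

    // Reaper: drop every entry past the threshold and report how many were
    // removed (the count is approximate under concurrent modification).
    int reap(Instant now) {
        int before = deletedAtByTopic.size();
        deletedAtByTopic.values().removeIf(deletedAt -> isExpired(deletedAt, now));
        return before - deletedAtByTopic.size();
    }

    private boolean isExpired(Instant deletedAt, Instant now) {
        return Duration.between(deletedAt, now).compareTo(threshold) > 0;
    }
}
```

In a real implementation the repository would call something like `ensureCreationAllowed` before creating the topic node and `markDeleted` after removing it, and the reaper would run on a schedule. How those calls map onto ZK transactions is left open here.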
I'll have a look at this next and see if I can build a concept PR
Looks like there are 3 design approaches:
- change topic data model to have a deletion flag, and change existence and removal logic deeply (complex)
- add a folder to groups to hold this data, and explicitly call it deletion info
- add a folder to groups to hold group historical data, and add the code and data to hold deletion info
I am trying to do 2 at this time, though 3 is kinda the same work in reality, depending on whether the philosophy is to be more generic or more concise in the data model. 3 is more generic and will make it easier to add more features (without migration impact).
Have a PoC written, going through unit tests ATM
Hi, sorry for the delay in responding. Great that you were willing to take on this task 😄 Assigned. I like the approach you proposed and are currently working on (the one that holds data about deleted topics in ZK under "deletion info").
Feel free to wait until TODOs are completed but early feedback welcome.
See the PR open for review. It'd be awesome if the hacktoberfest-accepted tag could be added :) Meanwhile I will look at some of the TODOs.
I will check the remaining TODOs tomorrow