Safe recreation of topics

Open szczygiel-m opened this issue 2 years ago • 2 comments

Currently, hermes-management has a bug that is triggered by deleting a topic and then quickly recreating it.

The story is similar every time:

  1. someone deletes a topic
  2. topic is recreated
  3. kafka producer has stale metadata for that topic
  4. kafka producer fails to send messages to the brokers
  5. messages are buffered in hermes frontend instances
  6. we need to restart frontend instances in order for the messages to be retransmitted

The issue was thought to be solved by the upgrade to Kafka client 2.8.2, but it appeared again recently. We would like to have a workaround for this.

One of the proposed solutions is to introduce a "grace period" for deleted topics: if someone deletes a topic, we should block the creation of a topic with the same name for long enough that the cluster and the Kafka producers reach a consistent state. Slightly more than 5 minutes is probably enough, because producer metadata is refreshed every 5 minutes.
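As a rough illustration of how such a grace period could be derived, here is a minimal sketch that takes the producer's `metadata.max.age.ms` setting (the Kafka producer default is 300000 ms, i.e. 5 minutes) and adds a safety margin. `TopicGracePeriod`, `fromProducerConfig` and `canRecreate` are hypothetical names, not existing Hermes code, and the size of the margin is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Hermes.
final class TopicGracePeriod {
    // Kafka producer default for metadata.max.age.ms (5 minutes).
    static final long DEFAULT_METADATA_MAX_AGE_MS = 300_000L;

    private TopicGracePeriod() {}

    // Grace period = producer metadata max age + a safety margin (the margin is an assumption).
    static Duration fromProducerConfig(Properties producerProps, Duration safetyMargin) {
        String raw = producerProps.getProperty("metadata.max.age.ms");
        long maxAgeMs = (raw == null) ? DEFAULT_METADATA_MAX_AGE_MS : Long.parseLong(raw.trim());
        return Duration.ofMillis(maxAgeMs).plus(safetyMargin);
    }

    // True once the whole grace period has elapsed since the topic was deleted.
    static boolean canRecreate(Instant deletedAt, Instant now, Duration gracePeriod) {
        return !now.isBefore(deletedAt.plus(gracePeriod));
    }
}
```

With default producer settings and a one-minute margin this yields a six-minute block, which matches the "> 5 minutes" estimate above.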

szczygiel-m avatar Sep 28 '23 09:09 szczygiel-m

Hey @szczygiel-m , is this up for grabs?

debanjanc01 avatar Oct 02 '23 17:10 debanjanc01

Hi, sure 😄 Assigned

szczygiel-m avatar Oct 17 '23 11:10 szczygiel-m

Hey, it may be too late for Hacktoberfest, but I can take a look at this for a couple of days. You can assign it to me if you want.

danigiri avatar Oct 28 '24 10:10 danigiri

Looking at this, here is a high-level proposal:

  • model: add a list of recently deleted topics in the group in ZK; each entry contains the deletion date and any other minimal metadata needed

  • ZookeeperTopicRepository creation

    • check whether the topic name exists in the deleted list and whether its deletion is older than the threshold
    • if beyond the threshold, remove the old node from the deleted list and create the topic normally
    • otherwise, throw a new exception (similar to, but not the same as, TopicAlreadyExistsException) and block creation
  • ZookeeperTopicRepository deletion

    • add a node to the deleted list
    • if the topic was already deleted in the past, replace the existing node
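The create/delete flow above could look roughly like the sketch below. It uses in-memory collections as a stand-in for the ZK nodes, so it only shows the decision logic, not the real `ZookeeperTopicRepository`. `GracePeriodTopicStore` and `TopicRecentlyDeletedException` are hypothetical names, and the current time is passed in explicitly to keep the example easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical exception, deliberately distinct from TopicAlreadyExistsException.
class TopicRecentlyDeletedException extends RuntimeException {
    TopicRecentlyDeletedException(String topic, Instant retryAfter) {
        super("Topic " + topic + " was deleted recently; creation is blocked until " + retryAfter);
    }
}

// In-memory stand-in for the proposed repository changes (assumption: one deleted-list per group).
class GracePeriodTopicStore {
    private final Set<String> topics = new HashSet<>();
    private final Map<String, Instant> deletedAt = new HashMap<>();
    private final Duration threshold;

    GracePeriodTopicStore(Duration threshold) {
        this.threshold = threshold;
    }

    void createTopic(String name, Instant now) {
        if (topics.contains(name)) {
            throw new IllegalStateException("Topic " + name + " already exists");
        }
        Instant deleted = deletedAt.get(name);
        if (deleted != null) {
            Instant retryAfter = deleted.plus(threshold);
            if (now.isBefore(retryAfter)) {
                // Still inside the grace period: block creation.
                throw new TopicRecentlyDeletedException(name, retryAfter);
            }
            // Beyond the threshold: drop the stale entry and create normally.
            deletedAt.remove(name);
        }
        topics.add(name);
    }

    void removeTopic(String name, Instant now) {
        if (topics.remove(name)) {
            // put() also replaces an entry left over from an earlier deletion.
            deletedAt.put(name, now);
        }
    }

    boolean exists(String name) {
        return topics.contains(name);
    }
}
```

In the real repository the map would be replaced by nodes under the group in ZK, and the check and the creation would need to happen atomically.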

enhancements

  • have a reaper process that cleans deleted topics in the list periodically
  • have the threshold configurable
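For the reaper enhancement, one possible shape is a periodic task that drops entries older than the configured threshold. This is only a sketch: `DeletedTopicsReaper` is a hypothetical class, the concurrent map stands in for the deleted list in ZK, and the scheduling period is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical reaper; the map is a stand-in for the deleted-topics list kept in ZK.
class DeletedTopicsReaper {
    private final ConcurrentMap<String, Instant> deletedAt;
    private final Duration threshold; // configurable grace period

    DeletedTopicsReaper(ConcurrentMap<String, Instant> deletedAt, Duration threshold) {
        this.deletedAt = deletedAt;
        this.threshold = threshold;
    }

    // Removes every entry whose grace period has elapsed; returns how many were removed.
    int reapOnce(Instant now) {
        int removed = 0;
        for (Map.Entry<String, Instant> entry : deletedAt.entrySet()) {
            boolean expired = !now.isBefore(entry.getValue().plus(threshold));
            // Conditional remove: skip entries that were replaced by a newer deletion meanwhile.
            if (expired && deletedAt.remove(entry.getKey(), entry.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    // Runs reapOnce periodically on the given scheduler.
    ScheduledFuture<?> start(ScheduledExecutorService scheduler, Duration period) {
        return scheduler.scheduleAtFixedRate(
                () -> reapOnce(Instant.now()),
                period.toMillis(), period.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

Since the creation path already ignores entries beyond the threshold, the reaper is purely housekeeping and can run at a much lower frequency than the threshold itself.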

I'll have a look at this next and see if I can build a concept PR

danigiri avatar Oct 28 '24 12:10 danigiri

Looks like there are 3 design approaches:

  1. change topic data model to have a deletion flag, and change existence and removal logic deeply (complex)
  2. add a folder to groups to hold this data, and explicitly call it deletion info
  3. add a folder to groups to hold group historical data, and add the code and data to hold deletion info

I am trying to do 2 at this time, though 3 is pretty much the same work in reality, depending on whether the philosophy is to be more generic or more concise in the data model. 3 is more generic and will make it easier to add more features (without migration impact).

danigiri avatar Oct 28 '24 16:10 danigiri

Have a PoC written, going through unit tests ATM

danigiri avatar Oct 28 '24 23:10 danigiri

Hi, sorry for the delay in responding. Great that you were willing to take on this task 😄 Assigned. I like the approach that you proposed and are currently working on (the one that holds data about deleted topics in ZK in "deletion info").

szczygiel-m avatar Oct 29 '24 08:10 szczygiel-m

Feel free to wait until TODOs are completed but early feedback welcome.

danigiri avatar Oct 29 '24 11:10 danigiri

See PR open for review; it'd be awesome if the hacktoberfest-accepted tag could be added :) Meanwhile, I will look at some of the TODOs.

danigiri avatar Oct 31 '24 10:10 danigiri

I will check the remaining TODOs tomorrow

danigiri avatar Nov 27 '24 15:11 danigiri