pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][broker]Ensure namespace deletion doesn't fail

Open eolivelli opened this issue 1 year ago • 3 comments

Motivation

Deleting a namespace sometimes fails and leaves znodes around, especially the "bundle owner" nodes /namespace/NAMESPACE/xxxxxx

Modifications

  • ensure that all the children of /namespace/NAMESPACE are deleted, especially the ephemeral nodes for ownership
  • ensure that on the code path that handles the deletion of the namespace we make progress, the main fact is about not relying on "cache.exists" before using a "delete": the new code is more pessimistic that goes with "deleteIfExists", handling "NotFound" gracefully
  • add more logs about the delete namespace execution path

With this patch

  • delete namespace seems to consistently work
  • there are no more spammy logs about errors around the topic policies event reader (changelog_events...)
  • in case we see an error we log at which point the procedure was (because we log all the paths that are deleted from zk)

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: https://github.com/eolivelli/pulsar/pull/27

eolivelli avatar May 01 '24 11:05 eolivelli

This script reproduces MANY problems with namespace deletions, just run it in a loop on a Pulsar cluster with a few brokers (like 6) and more than one ZK server (I am using 3 nodes)

reproduce.sh

#! /bin/bash
export PATH=$PATH:/pulsar/bin
set -x
pulsar-admin topics delete-partitioned-topic non-persistent://T1/N1/P1 -f
bin/pulsar zookeeper-shell -server pulsar-zookeeper-ca:2181 ls -R /T1/N1 | grep namespace
pulsar-admin namespaces delete T1/N1
pulsar-admin tenants delete T1

pulsar-admin tenants create T1
pulsar-admin namespaces create T1/N1 -b 64
pulsar-admin topics create-partitioned-topic non-persistent://T1/N1/P1 -p 20

exit 0

Execute it in a loop

while true ; do ./reproduce.sh ; done

eolivelli avatar May 01 '24 11:05 eolivelli

@lhotari I think that it is worth to port this fix to all the active branches, maybe up to 3.0. Wdyt?

eolivelli avatar May 02 '24 07:05 eolivelli

@lhotari I think that it is worth to port this fix to all the active branches, maybe up to 3.0. Wdyt?

@eolivelli makes sense. Btw. there's a rabbit hole in namespace deletion. Currently concurrency limits are missing and a namespace deletion will attempt to concurrency delete all possible topics in the namespace. This is something that would have to be addressed to improve the reliability of namespace deletion. More details in #22541 comments such as https://github.com/apache/pulsar/pull/22541#issuecomment-2071621213 and https://github.com/apache/pulsar/pull/22541#issuecomment-2071687995 .

lhotari avatar May 02 '24 07:05 lhotari

Codecov Report

Attention: Patch coverage is 43.13725% with 29 lines in your changes are missing coverage. Please review.

Project coverage is 32.47%. Comparing base (bbc6224) to head (23c39c1). Report is 261 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #22627       +/-   ##
=============================================
- Coverage     73.57%   32.47%   -41.10%     
+ Complexity    32624       87    -32537     
=============================================
  Files          1877     1745      -132     
  Lines        139502   133886     -5616     
  Branches      15299    14673      -626     
=============================================
- Hits         102638    43482    -59156     
- Misses        28908    84033    +55125     
+ Partials       7956     6371     -1585     
Flag Coverage Δ
inttests 27.52% <43.13%> (+2.94%) :arrow_up:
systests 24.80% <5.88%> (+0.47%) :arrow_up:
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
.../service/SystemTopicBasedTopicPoliciesService.java 52.26% <100.00%> (-21.93%) :arrow_down:
...he/pulsar/metadata/impl/AbstractMetadataStore.java 55.31% <100.00%> (-29.25%) :arrow_down:
...ulsar/broker/resources/LocalPoliciesResources.java 44.82% <0.00%> (-31.04%) :arrow_down:
...he/pulsar/broker/resources/NamespaceResources.java 53.90% <25.00%> (-26.25%) :arrow_down:
...apache/pulsar/broker/resources/TopicResources.java 39.58% <33.33%> (-57.03%) :arrow_down:
.../org/apache/pulsar/metadata/api/MetadataStore.java 60.00% <50.00%> (-20.00%) :arrow_down:
.../apache/pulsar/broker/resources/BaseResources.java 55.85% <50.00%> (-16.88%) :arrow_down:
...pache/pulsar/broker/admin/impl/NamespacesBase.java 15.22% <0.00%> (-57.86%) :arrow_down:

... and 1517 files with indirect coverage changes

codecov-commenter avatar May 13 '24 06:05 codecov-commenter