
HDDS-6663. remove container reference from scm if its state is DELETED

Open JacksonYao287 opened this issue 2 years ago • 8 comments

What changes were proposed in this pull request?

When the state of a container in SCM is DELETING, and all replicas of the container have been deleted from the corresponding datanodes, a CLEANUP event is fired in SCM and the container's state is changed to DELETED. Once a container's state is DELETED, its reference should be removed from SCM.
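The flow described above could be sketched roughly as follows. This is a hypothetical illustration only: the real SCM classes (ContainerManager, ContainerStateMap, the CLEANUP event wiring) differ in detail, and every name here is invented.

```java
// Hypothetical sketch of the DELETING -> DELETED -> removed flow; all
// names are invented for illustration, not the actual SCM code.
import java.util.HashMap;
import java.util.Map;

public class ContainerCleanupSketch {
  enum LifeCycleState { OPEN, CLOSED, DELETING, DELETED }

  // Stand-in for SCM's in-memory container state map.
  static final Map<Long, LifeCycleState> containerStates = new HashMap<>();

  // Fired once every replica of a DELETING container is gone from datanodes.
  static void onCleanupEvent(long containerId) {
    if (containerStates.get(containerId) == LifeCycleState.DELETING) {
      // DELETING -> DELETED, then the proposal drops the reference entirely.
      containerStates.put(containerId, LifeCycleState.DELETED);
      containerStates.remove(containerId);
    }
  }
}
```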

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6663

How was this patch tested?

UT

JacksonYao287 avatar Apr 28 '22 01:04 JacksonYao287

@sodonnel @ChenSammi PTAL, thanks!

JacksonYao287 avatar Apr 28 '22 01:04 JacksonYao287

@JacksonYao287 what is the motivation behind this change? Maintaining a record of all deleted containers in SCM has the following benefits:

  • If a dead node comes back online after a container is deleted and starts reporting a deleted container, SCM can delete it instead of falling back to the logic for unknown containers in ContainerReportHandler.
  • Easier debug-ability when working through potential HDDS bugs.

I do not see problems with keeping deleted container entries. Even if we had to track 10 billion containers over the system's lifetime (which I suspect is pushing the limits of what RocksDB can handle), that would correspond to about 5 GB * 10 billion = 50 exabytes of data deleted from the cluster during its lifetime, which is a massive amount of churn. If we are concerned about SCM memory usage, since it tracks all containers in memory, a better approach might be to track deleted containers in RocksDB only.
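The "RocksDB only" idea could look roughly like this. Illustrative sketch only: a HashMap stands in for the RocksDB container table, and the class and method names are invented.

```java
// Sketch: keep DELETED containers in the persistent store but out of the
// in-memory state map. A HashMap stands in for the RocksDB table.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeletedContainerStore {
  // Stand-in for the RocksDB container table (containerId -> state).
  final Map<Long, String> db = new HashMap<>();
  // The in-memory map holds live containers only.
  final Set<Long> inMemory = new HashSet<>();

  void markDeleted(long containerId) {
    db.put(containerId, "DELETED"); // persisted for debugging / dead-node handling
    inMemory.remove(containerId);   // frees the heap reference
  }

  boolean wasDeleted(long containerId) {
    // Point lookups go straight to the store; no RAM cost for old containers.
    return "DELETED".equals(db.get(containerId));
  }
}
```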

This is actually a somewhat major change in the way HDDS functions, so if we want to go through with this, I think we would need to answer some more questions:

  1. What are the specific problems/use cases where the current approach will cause problems?
  2. How serious are the problems with the existing approach compared to the potential problems of this new approach?
  3. What happens when a dead datanode comes back and starts reporting a container that was deleted and removed?
  4. How is case 3 reconciled with legitimate unknown containers that may be reported to SCM due to bugs or improper configuration?

errose28 avatar May 03 '22 23:05 errose28

Thanks @errose28 for the review! The main motivation behind this change is SCM memory usage. I originally thought it would take too much RAM if there were too many deleted containers, but it seems I overthought it. Will close this PR shortly.

a better approach might be to track deleted containers in RocksDB only.

Can you please explain this in more detail? Do you mean that if we have to track a massive number of containers, we hit RocksDB directly for both puts and gets of container state, rather than maintaining a state map in RAM?

JacksonYao287 avatar May 04 '22 09:05 JacksonYao287

I think it is reasonable to remove containers that are empty. There isn't a good reason to keep them beyond debugging, and if a DN comes back up with a container on it, we should probably handle it as an unknown container and remove it. I know the default Ozone behaviour is to ignore unknown containers and not remove them. I understand why (we don't want to accidentally delete data), but I feel that over time the number of empty containers in a cluster will grow. Also, if the cluster is out of safemode, it seems overly pessimistic to keep unknown containers.

It may seem unrealistic that we will end up with a lot of empty containers, but consider Hive workloads, for example. They can have many phases and stage the temp data between jobs on the cluster. This will fill containers and then they will be purged. A job could potentially stage 100's of GB in a large multi-table join.

Related to this, and a problem I foresee, is that a cluster running these Hive type workloads may end up filling containers partly with temp data and partly with real data. The containers get closed. The tmp data gets removed and we end up with lots of small containers that are never totally empty, but may have much less than 5GB of data in them. Then the SCM memory will get used with not just empty container references, but far more containers than we expected (small container problem).

We ultimately need to solve that small container problem by merging containers, but there is at least one major problem. We cannot easily update OM with the new containerID. One suggestion was to create a mapping in SCM. Another idea may be to find the keys associated with a container in Recon (reverse index of containerID -> key) and then update OM.

For this PR, perhaps removing the deleted container reference in SCM could be tied to the same config that lets us delete or not delete unknown containers, so it either logs a warning or deletes the SCM reference depending on the value?
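The config-gated behaviour suggested here could be sketched as below. The WARN/DELETE choices mirror the existing unknown-container config, but the class and method names are invented.

```java
// Sketch of tying deleted-reference removal to a warn-or-delete config
// action; everything here is an invented illustration, not SCM code.
public class DeletedRefPolicy {
  enum Action { WARN, DELETE }

  static String handle(long containerId, Action action) {
    if (action == Action.DELETE) {
      return "removed SCM reference for container " + containerId;
    }
    // Otherwise keep the reference but leave a trace in the logs.
    return "WARN: container " + containerId + " is DELETED, reference kept";
  }
}
```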

sodonnel avatar May 04 '22 14:05 sodonnel

Thanks @sodonnel for the comments, I agree with your opinion.

Then the SCM memory will get used with not just empty container references, but far more containers than we expected (small container problem).

this is a very good point.

One suggestion was to create a mapping in SCM. Another idea may be to find the keys associated with a container in Recon (reverse index of containerID -> key) and then update OM.

Since Recon may hold stale information, creating a mapping in SCM may be better.

perhaps removing the deleted container reference in SCM could be tied to the same config

yea, agree.

JacksonYao287 avatar May 05 '22 16:05 JacksonYao287

Another question comes to mind. Suppose we have container c1 with three replicas r1, r2, r3 on datanodes d1, d2, d3 respectively. If d3 dies, r3 is lost, so Replication Manager will detect this and copy r2 to datanode d4 as r4. After a while, block deletion commands arrive at d1, d2, d4 to delete some blocks in r1, r2, r4, and the delete transactions are removed from the deletedBlockLog. If d3 resurrects, r3 comes back with more blocks than r1, r2, r4, since d3 never received those deletion commands. When d3 sends a container report to SCM, the following function in ContainerReportHandler will change the container stats to match r3, since r3 has a bigger key count and used bytes.

  private long calculateUsage(ContainerInfo containerInfo, long lastValue,
      long thisValue) {
    if (containerInfo.getState().equals(HddsProtos.LifeCycleState.OPEN)) {
      // Open containers are generally growing in key count and size, the
      // overall size should be the min of all reported replicas.
      return Math.min(lastValue, thisValue);
    } else {
      // Containers which are not open can only shrink in size, so use the
      // largest values reported.
      return Math.max(lastValue, thisValue);
    }
  }

Since those zombie blocks will never be deleted from r3 (no deletion command for them will be received), the key count and used bytes will never reach zero, and thus the container replicas on the datanodes and the reference in SCM will never be removed.
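The effect of the max rule in calculateUsage can be seen with made-up numbers: once the resurrected r3 reports a non-zero count for a closed container, the merged value can never drop back to zero.

```java
// Toy illustration of the scenario above (numbers are made up).
// For a non-OPEN container, calculateUsage keeps the largest reported value.
public class ZombieReplicaExample {
  static long mergeClosed(long lastValue, long thisValue) {
    return Math.max(lastValue, thisValue); // same rule as the closed branch
  }

  public static void main(String[] args) {
    long keyCount = 0;
    keyCount = mergeClosed(keyCount, 0); // r1, r2, r4 report zero keys
    keyCount = mergeClosed(keyCount, 5); // r3 resurrects with 5 zombie blocks
    System.out.println(keyCount);        // stuck at 5, so cleanup never fires
  }
}
```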

JacksonYao287 avatar May 06 '22 02:05 JacksonYao287

@sodonnel I have updated this patch, please take a look!

JacksonYao287 avatar Jul 28 '22 06:07 JacksonYao287

a better approach might be to track deleted containers in RocksDB only.

@errose28 My thought is also to do this. Some heavy-load scenarios could benefit from it. If the cluster runs for more than several months, it accumulates a lot of deleted container info in memory that is almost never used. By saving the record in RocksDB, we keep a backup path to look up the container. This backup solution adds some query overhead but saves memory.

Xushaohong avatar Aug 30 '22 08:08 Xushaohong

@sodonnel @errose28 do you have any other comments?

JacksonYao287 avatar Oct 21 '22 03:10 JacksonYao287

IMO I think this change looks good. Perhaps we should add one more test that confirms the reference is not deleted if the config flag is false, just to catch any future changes which may break this.

I also believe that the best thing is to have this "delete empty container references" on by default as in the normal case, there is no real reason to want to keep an empty reference around.

We should probably log it is being removed however, so there is some trace or indication about why it disappeared.

@errose28 what do you think? You had some differing opinions on this before?

sodonnel avatar Oct 21 '22 09:10 sodonnel

Hi @JacksonYao287 @sodonnel I still have some concerns about the way this change is implemented.

No clear problem statement

Is the motive to reduce memory usage or RocksDB usage in SCM? If it is memory usage, then we can leave the deleted containers in RocksDB only and not load them into the ContainerStateMap. If it is to reduce the RocksDB footprint, I think this might be a premature optimization, as I'm not aware of any RocksDB scalability issues encountered on SCM currently.

Changing the config key fragments the container delete path

Containers deleted before this change is applied remain in SCM forever. Similarly, if the config is turned off and on and containers are deleted, they will not be cleaned up when the config is restored to on. I guess this is more of a debugging inconvenience than an error, but it will make examining existing clusters confusing nonetheless.

No integration with unknown container handling

  • This change only works when unknown container handling is set to delete. There would need to be some validation that the two configs are set in a way that works with each other, or they would need to be combined to a new config.

  • Even if the configs are unified, turning the unified config on then off will cause problems. SCM would log reports of unknown containers that were deleted while the config was on. Currently this log only happens when there is a bug in the system, but now there is no way to know whether the config changes or a bug caused the unknown containers.

Confusion about container delete vs. block delete

This is not really a comment on the code but a response to some of the discussion on this PR. The second half of this comment from Stephen and this comment from Jackson are definitely valid points, but they concern block deletion, not container deletion. Changes proposed here will not solve these problems. A change in this area will only affect SCM memory usage or rocksDB usage with potential debugging tradeoffs.

IMO the best thing to do is:

  • Do not load the deleted containers into SCM memory. Leave them in RocksDB to be queried if needed.
  • No config key for this behavior.
  • Leave unknown container handling as is.
    • The code path would only be triggered if there is a bug in the system.
    • In this case the config can be used to resolve the issue if it is determined the containers are not relevant.
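A rough sketch of the "leave them in RocksDB, don't load them" idea from the first bullet, with an invented record type standing in for the container table scan:

```java
// Sketch only: skip DELETED tombstones when rebuilding the in-memory
// state map at startup. ContainerRecord is an invented stand-in.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StartupLoadSketch {
  record ContainerRecord(long id, String state) {}

  static Map<Long, ContainerRecord> loadStateMap(List<ContainerRecord> dbRecords) {
    Map<Long, ContainerRecord> stateMap = new HashMap<>();
    for (ContainerRecord rec : dbRecords) {
      if (!"DELETED".equals(rec.state())) { // tombstones stay in the DB only
        stateMap.put(rec.id(), rec);
      }
    }
    return stateMap;
  }
}
```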

errose28 avatar Oct 21 '22 19:10 errose28

1. No clear problem statement: the main purpose of this patch is to reduce memory usage. In some scenarios, like Spark, Hive and MapReduce, a large number of temporary files will be deleted after the computing task is completed. If we keep the references of deleted containers, memory will be exhausted quickly. Yes, we can leave the deleted containers in RocksDB only and not load them into the ContainerStateMap, but I am not sure it is worth keeping those deleted container infos in the DB only for an infrequent debug case.

2. Changing the config key fragments the container delete path: I suggest we always remove the deleted container reference and remove the config key. I added the config key here only because we have not reached a consensus; it is just a transitional solution.

3. No integration with unknown container handling: since an unknown container is not recognized by SCM, it is not a valid container. In production, we can just ignore it, or send a force delete command to the datanode to delete it. In development, we can add a debug log to print the detailed info of the container and see why it happened. So I don't think unknown containers will be a problem.

Thanks @errose28 for the concerns. Generally, I agree with @stephen. Since this patch is not urgent, we can keep discussing it until we reach a consensus.

JacksonYao287 avatar Oct 25 '22 13:10 JacksonYao287

I can kind of see this both ways.

If we take it to one extreme: if we want to ensure stuff is kept around for debugging, why are we letting the empty replicas get deleted? On a real cluster blocks will get deleted all the time, and we will end up with empty containers. Some bug (which I think we already had) could erroneously delete replicas it should not, but it's also not scalable to keep all empty replicas forever. They will grow slowly but surely to a large number over months and years and cause problems in reporting and report processing.

So if we have deleted the replicas, what use is a ContainerInfo reference in SCM? Even if we have it, we cannot do anything with it - there is no replica for it. All it tells us is that a container once existed with that ID, and it's empty.

It's bad that it will use up memory slowly over time. Replication Manager will also need to check it on each pass.

We can keep it in the RocksDB table, unloaded on SCM startup, but (I think) we will still need to read out each record and decide whether or not to load it into memory. Even with millions of such references the overhead should be small - it will only slow down SCM startup by a tiny amount, but it's still a small overhead.

I'm in favor of just removing them, as I see little value in keeping them. We should certainly log that they have been removed - of course the logs will roll off, but at least that is some evidence they were removed for a reason.

sodonnel avatar Oct 25 '22 13:10 sodonnel

@errose28 does this make sense to you? or do you have any other points?

JacksonYao287 avatar Oct 27 '22 07:10 JacksonYao287

Ok we can continue with the proposal to remove the deleted containers from RocksDB. Does this description line up with what is being proposed?

  • When replication manager finds an empty container to delete, it will send the delete commands to replicas and remove the container from in memory and the DB.
    • If datanodes miss the command, the container will be deleted if empty on the next FCR by the container report handler.
  • When SCM gets a report for a container that it does not have in the DB it will delete it if it is empty (block count is 0), otherwise it will handle the container based on the value of hdds.scm.unknown-container.action.
  • DELETED containers already in SCM DB will remain there forever.
    • If we want to get rid of them on startup, we would need a layout feature for this as it would affect downgrades otherwise.
  • This flow will not be behind a config key.
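The report-handling rules in the bullets above can be sketched as a small decision tree. Only hdds.scm.unknown-container.action is a real config key; every other name below is invented.

```java
// Sketch of the proposed container-report decision tree (invented names).
public class UnknownContainerPolicy {
  enum Action { WARN, DELETE } // mirrors hdds.scm.unknown-container.action

  static String onReport(boolean knownToScm, long blockCount, Action configured) {
    if (knownToScm) {
      return "process normally";
    }
    if (blockCount == 0) {
      return "delete empty unknown container"; // always safe per the proposal
    }
    // Non-empty unknown container: defer to the configured action.
    return configured == Action.DELETE ? "delete" : "warn";
  }
}
```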

The catch here is that a container will get removed from SCM if all replicas in the current replica set are empty, but SCM cannot know about all replicas that may have a copy of the container. As @JacksonYao287 pointed out in this comment, an old replica whose container has been deleted could resurface with blocks that were already deleted and cleared. If we are only deleting unknown empty containers, then this container would not get deleted. However, deleting unknown non-empty containers seems dangerous. Removal of these containers might have to wait until we have orphan block cleanup on Recon.

errose28 avatar Oct 31 '22 22:10 errose28

My plan is to complete this work in two steps. First, in this patch, a DELETED container is removed from SCM by default (this can be configured), and an unknown container, whether it is empty or not, will be deleted by default (this can be configured). In a production environment using the default configuration, an unknown container, empty or not, will not be referenced by any key, so we can just delete it. So although deleting a non-empty unknown container by default is an aggressive operation, I think it makes sense here. In a development environment, we can set the default operation to log a WARN, which gives us a chance to investigate the problem. As for the old DELETED container references in RocksDB, we can just leave them there and do nothing; newly deployed clusters will not have this problem.

Second, I will create a new patch to solve this problem. My idea here is that after a container is closed, only the deletedBlockLog can shrink the container size, not calculateUsage in the ContainerReportHandler. The size of a container is reduced only after the delete transaction is committed in the deletedBlockLog, so even if a container replica has some orphan blocks and a bigger size, SCM has a definite size for the container. When the size of a container in SCM has shrunk to zero, SCM will delete all the replicas of the container. If a container is deleted and SCM receives a container report for it from a resurrected datanode, it will be treated as an unknown container and deleted by SCM by default.

@errose28 @sodonnel what do you think?

JacksonYao287 avatar Nov 04 '22 03:11 JacksonYao287

Hi @JacksonYao287 thanks for continuing the work on this. I do still have some concerns with this proposal though.

first, in this patch, DELETED container is removed from scm by default(this can be configured). and for those unknown container , no matter it is empty or not, we will deleted it by default(this can be configured).

This doesn't address my original point that these two configs conflict for some values. I also do not think that config keys should be used to fragment core internal operations like deletion. Ozone only needs one delete path and we should be confident that the one we choose works. If we are not confident in this delete container proposal and are trying to hide it behind a config then that is a sign we should not go forward with the change.

although Deleting a non-empty unknown container by default is an aggressive operation, i think it makes sense here. in development environment, we can set the default operation to log WARN, so it give us a chance to investigate the problem.

This assumes the issue shows up in dev first, which may not be the case. There's so many variables in an Ozone deployment it is possible that the dev environments work fine, but the perfect storm of issues does not happen until some production deployment. Having debugged multiple dev and prod Ozone deployments I can confirm this does happen for corner case bugs. This is why I have been advocating for a more cautious approach.

my idea here is that after a container is closed, only the deletedBlockLog can shrink the container size, not calculateUsage in the ContainerReportHandler. The size of a container is reduced only after the delete transaction is committed in the deletedBlockLog, so even if a container replica has some orphan blocks and a bigger size, SCM has a definite size for the container. When the size of a container in SCM has shrunk to zero, SCM will delete all the replicas of the container.

We should use block count to determine if the container is empty, as used bytes is not exact, as you mentioned. Block count is currently what SCM uses to determine if a container is empty. There are some invariants that I believe make it impossible to solve this problem without using Recon:

  • Deleted block log entries must eventually be deleted.
  • SCM cannot know all possible replicas at a given time. An older one could be resurrected at any time.
  • SCM's delete block API used by OM must be idempotent.

Given these properties I believe we cannot know the definite size of the container once blocks start getting deleted. We can only know the current size of each replica. For example, say a block is deleted from all known container replicas and SCM's "definite" block count decreases by 1. SCM removes the entry from the block log. Now maybe OM missed the ack for that delete so it resends it. SCM would repeat the same steps, incorrectly decrementing the "definite" block count again. It has no permanent record of which blocks were already deleted. Since old datanodes can be resurrected, they are also not a source of truth for which block deletions have already been issued by SCM.
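The resend scenario can be made concrete with a toy counter (all names invented): because SCM removes the log entry after the first delete, a retransmitted request is indistinguishable from a new one and decrements the count a second time.

```java
// Toy model of the double-decrement problem described above.
import java.util.HashSet;
import java.util.Set;

public class DefiniteCountSketch {
  long definiteBlockCount;
  final Set<Long> deleteLog = new HashSet<>(); // pending delete transactions

  DefiniteCountSketch(long initial) { definiteBlockCount = initial; }

  void requestDelete(long blockId) {
    // SCM has no permanent record of past deletes, so a resent request
    // for an already-deleted block looks exactly like a fresh one.
    deleteLog.add(blockId);
  }

  void commitDelete(long blockId) {
    if (deleteLog.remove(blockId)) { // entry is gone after the first commit
      definiteBlockCount--;
    }
  }
}
```

Starting from a count of 1, one request/commit cycle brings it to 0; if OM misses the ack and resends, a second cycle drives the "definite" count to -1, which is exactly the inconsistency described above.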

In summary, I feel that my proposal has the following advantages:

  • No config changes. Existing config keys remain with their original defaults and functionality. The only difference is that hdds.scm.unknown-container.action now only applies to unknown and empty containers. No new configs need to be added. No need to define behavior for unintuitive configuration combinations.

  • Only confirmed empty containers are deleted. This minimizes the risk of data loss due to current code bugs or future regressions that could for any reason cause SCM to "forget" a container, or datanodes to misreport a container.

The disadvantage of my proposal is that for the block deletion corner case discussed earlier, containers with those blocks will remain in the system. However this is true even today with the current container delete flow, since all replicas must have zero block count for deletion to happen. The orphan block problem is a block problem, not a container problem. IMO this means it should not be solved as an extension of container deletion by deleting non-empty containers. I think orphan block cleanup with Recon is the correct answer to this, even if it may not be implemented in the near future.

errose28 avatar Nov 09 '22 01:11 errose28