hdf5 icon indicating copy to clipboard operation
hdf5 copied to clipboard

Avoid attempting to close connector with zero references

Open mattjala opened this issue 1 year ago • 6 comments

Fix for #4634 (H5VL_t not freed properly if closed with no references).

I don't think an assertion would make sense here, since it's possible for regular internal library failures to elicit this issue.

mattjala avatar Jul 09 '24 21:07 mattjala

The previous solution left open the possibility for a double-free of a connector, since the routines that tried to create the first VOL object could fail, free the VOL connector, and be unable to propagate the fact that the connector was freed to their caller.

The current solution is to initialize H5VL_t with a 'dummy' reference count so it starts with nrefs = 1. After the first object is successfully created, the nrefs count will be manually decremented. This is not a perfect solution, but it seems cleaner than having subfunctions return pointers to the connector.

In every case, this should be safe:

  • Failures prior to the creation of the first object will properly free it with a single decrement operation
  • Failures between first object creation and the end of the function that created the connector will properly free it due to releasing the object (which decrements the reference count of the connector from 2 to 1) and then the function that created the connector will free it (1 -> 0)
  • The dummy reference is removed at the end of the creation process, so failures that should result in the destruction of the connector can't occur after this point.

mattjala avatar Jul 16 '24 18:07 mattjala

I can iterate on this with you to get to a better pattern. What is the failure you are attempting to address?

qkoziol avatar Jul 16 '24 18:07 qkoziol

I can iterate on this with you to get to a better pattern. What is the failure you are attempting to address?

If a failure occurs after the creation of a VOL object but before the conclusion of the routine that invoked H5VL_new_connector(), and H5VL_t starts off with a reference count of zero, then the release of the VOL object during cleanup will indirectly free the connector before the connector is freed directly (in H5ESinsert_request() or H5VL_register_using_vol_id()), resulting in a double free. This wasn't addressed by the previous solution of extracting the connector's actual release into its own routine.

The callers need to directly free the connector on failure in at least some cases - since failure can occur before or after the VOL object is created. Since the routines that eventually try to create the VOL object can fail before or after creating it, just checking the failure status is not enough to determine how the connector needs to be treated.

One potential solution to this is to update the routines which might result in the provided connector being freed due to the first VOL object creation failing (H5VL__new_vol_obj(), and H5VL_register()) to take an H5VL_t ** parameter which may be set to NULL if the connector is freed. This could then be checked to avoid double frees.

Another solution - the one currently implemented in this PR - is to use a 'dummy' ref count to avoid the possibility of a new connector being freed 'early' in subroutines.

mattjala avatar Jul 16 '24 19:07 mattjala

I can iterate on this with you to get to a better pattern. What is the failure you are attempting to address?

@mattjala - I traced this around a little more today and have some ideas. Would you like to set up a call to talk about it?

qkoziol avatar Aug 31 '24 17:08 qkoziol

@mattjala - I traced this around a little more today and have some ideas. Would you like to set up a call to talk about it?

Yes. We could pencil this in for discussion during tomorrow's working group, or set up a one on one discussion.

mattjala avatar Sep 04 '24 14:09 mattjala

@mattjala - I traced this around a little more today and have some ideas. Would you like to set up a call to talk about it?

Yes. We could pencil this in for discussion during tomorrow's working group, or set up a one on one discussion.

It's likely to be a fairly coding-intense discussion, so probably not a great fit for the community meeting. I'll send you an email for setting up a 1:1 meeting and we can try to work it forward.

qkoziol avatar Sep 04 '24 17:09 qkoziol

Quincey and I had a discussion about the best way to resolve this. We came to the conclusion that the root cause of this error is bad usage of VOL structures between different layers of the library, and that the solution is to:

  1. Replace usage of the VOL class H5I id with the class itself (H5VL_class_t) in internal routines.
  2. Instead of registering the VOL class structure with H5I, register each VOL connector instance (H5VL_t) with H5I.
  3. Connector creation should be pushed down to the same layer as object creation when possible, to remove the potential for free calls from two different layers during a failure, and in turn prevent the VOL connector from being passed around with a reference count of zero.

Quincey has started work on this set of changes, so I'm closing this PR for now.

mattjala avatar Sep 16 '24 14:09 mattjala