
[hailtop.batch] Restructure and use better types for resources

Open danking opened this issue 4 months ago • 0 comments

This was spawned by https://github.com/hail-is/hail/pull/14310, which slipped into 0.2.126 because we do not have types on some of our inner functions. In my attempt to type everything, I discovered a really perplexing function: _add_resource_to_set.

I was able to give it a type but it was really complicated, involved @overload, and did not seem to help me reason about that function. I presumed the root cause was that the Resource class had an insufficient interface.

Here is how I understand what that function was trying to achieve. For each job, we need to track three sets of resources: those produced by the job, those produced by the job and later consumed, and those consumed by the job:

  1. "Produced by" corresponds to Job._valid.
  2. "Produced-by-and-later-consumed" corresponds to Job._internal_outputs and Job._external_outputs.
  3. "Consumed by" corresponds to Job._inputs.

There is also Job._mentioned which I do not fully understand but which does not use _add_resource_to_set.

There is an important distinction between the first kind of resource set and the latter two. The latter two must contain only what I am now calling SingleResources: actual single files that need to be uploaded or downloaded. In contrast, the "produced by" set (and, I think, Job._mentioned) might include ResourceGroups, which are composite: one or more SingleResources that must be transmitted as a group.

For example, a VCF file and its TBI index file must always be transmitted as a group, even if a job only references one of those two files. That's the essential functionality of a ResourceGroup.

I introduced three operations to Resource which I think make this system simpler:

  1. get_resource_group: if a Resource is a group or a member of one, return that group.

  2. component_resources: if a Resource is composite, return the components, otherwise just return the resource itself.

  3. bonded_resources: the minimum set of resources that must be transmitted if this resource is transmitted. In particular, a member of a resource group returns itself and the other resources in the group. A resource group just returns its components.
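To make the three operations concrete, here is a minimal sketch of how they might fit together. It is not the hailtop.batch implementation: the class bodies, the constructor signatures, the `_group` and `_members` attributes, and the choice to return None from get_resource_group for an ungrouped file are all assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Dict, Optional, Set


class Resource(ABC):
    """Hypothetical base class exposing the three proposed operations."""

    @abstractmethod
    def get_resource_group(self) -> Optional[ResourceGroup]:
        """The group this resource is, or belongs to (None is an assumption)."""

    @abstractmethod
    def component_resources(self) -> Set[SingleResource]:
        """The concrete files that make up this resource."""

    @abstractmethod
    def bonded_resources(self) -> Set[SingleResource]:
        """The minimum set of files that must be transmitted together."""


class SingleResource(Resource):
    """One concrete file (hypothetical stand-in)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._group: Optional[ResourceGroup] = None

    def get_resource_group(self) -> Optional[ResourceGroup]:
        return self._group

    def component_resources(self) -> Set[SingleResource]:
        return {self}

    def bonded_resources(self) -> Set[SingleResource]:
        # A group member drags along every other file in its group.
        if self._group is None:
            return {self}
        return self._group.component_resources()


class ResourceGroup(Resource):
    """A composite of files that must travel together (hypothetical stand-in)."""

    def __init__(self, **members: SingleResource) -> None:
        self._members: Dict[str, SingleResource] = dict(members)
        for member in self._members.values():
            member._group = self

    def get_resource_group(self) -> Optional[ResourceGroup]:
        return self

    def component_resources(self) -> Set[SingleResource]:
        return set(self._members.values())

    def bonded_resources(self) -> Set[SingleResource]:
        return self.component_resources()


# The VCF/TBI case: referencing only the VCF still bonds the index to it.
vcf = SingleResource("calls.vcf.gz")
tbi = SingleResource("calls.vcf.gz.tbi")
lone = SingleResource("notes.txt")
group = ResourceGroup(vcf=vcf, tbi=tbi)

bonded_names = sorted(r.name for r in vcf.bonded_resources())
```

In this sketch, bonded_resources on either the VCF or the TBI yields both files, while an ungrouped file yields only itself.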

We can now add a type annotation which requires (2) and (3) to return sets of SingleResource. That is, (2) and (3) always yield concrete files that can be copied, as opposed to groups thereof.
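As a rough illustration of what that annotation buys us, the sketch below types a set of inputs as Set[SingleResource] and fills it through bonded_resources, with no @overload needed. The Transmittable protocol, the add_bonded helper, and the `group` attribute are hypothetical names invented here, and the SingleResource class is a deliberately tiny stand-in rather than the real one.

```python
from typing import Protocol, Set


class SingleResource:
    """Hypothetical stand-in: one concrete file that can be copied."""

    def __init__(self, path: str) -> None:
        self.path = path
        # Files that must travel with this one; by default just itself.
        self.group: Set["SingleResource"] = {self}

    def bonded_resources(self) -> Set["SingleResource"]:
        return set(self.group)


class Transmittable(Protocol):
    """Anything that can name the concrete files it must be sent with."""

    def bonded_resources(self) -> Set[SingleResource]: ...


def add_bonded(resource: Transmittable, target: Set[SingleResource]) -> None:
    """Add every file that must be transmitted with `resource` to `target`."""
    target.update(resource.bonded_resources())


vcf = SingleResource("out.vcf.gz")
tbi = SingleResource("out.vcf.gz.tbi")
vcf.group = tbi.group = {vcf, tbi}  # bond the VCF and its index together

# A type checker can now verify that only concrete files land in this set.
inputs: Set[SingleResource] = set()
add_bonded(vcf, inputs)
```

Because the helper only ever sees sets of SingleResource, a single signature covers both grouped and ungrouped resources.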

danking, Feb 16 '24 23:02