[hailtop.batch] Restructure and use better types for resources
This was spawned by https://github.com/hail-is/hail/pull/14310, which slipped into 0.2.126 because we do not have types on some of our inner functions. In my attempt to type everything, I discovered a really perplexing function, `_add_resource_to_set`.
I was able to give it a type, but it was really complicated, involved `@overload`, and did not seem to help me reason about that function. I presumed the root cause was that the `Resource` class had an insufficient interface.
Here is how I understand what that function was trying to achieve: we need to track the sets of resources that are produced by, produced-by-and-later-consumed, and consumed by a job:
- "Produced by" corresponds to `Job._valid`.
- "Produced-by-and-later-consumed" corresponds to `Job._internal_outputs` and `Job._external_outputs`.
- "Consumed by" corresponds to `Job._inputs`.
There is also `Job._mentioned`, which I do not fully understand but which does not use `_add_resource_to_set`.
There is an important distinction between the latter two and the first kind of resource set. The latter two must be what I am now calling `SingleResource`s. These are actual single files that need to be uploaded or downloaded. In contrast, the "produced by" set (and, I think, `Job._mentioned`) might include `ResourceGroup`s, which are composite: one or more `SingleResource`s that must be transmitted as a group.
For example, a VCF file and its TBI index file must always be transmitted as a group, even if a job only references one of those two files. That's the essential functionality of a `ResourceGroup`.
I introduced three operations to `Resource` which I think make this system simpler:
- `get_resource_group`: if a `Resource` is a group or a member thereof, return that group.
- `component_resources`: if a `Resource` is composite, return the components, otherwise just return the resource itself.
- `bonded_resources`: the minimum set of resources that must be transmitted if this resource is transmitted. In particular, a member of a resource group returns itself and the other resources in the group. A resource group just returns its components.
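To make the three operations concrete, here is a rough sketch of how they could fit together. This is not the actual `hailtop.batch` code: the constructors, the `name` attribute, the list return types, and returning `None` from `get_resource_group` for an ungrouped file are all assumptions made for illustration.

```python
from __future__ import annotations

import abc
from typing import List, Optional


class Resource(abc.ABC):
    """Hypothetical stand-in for the Resource interface described above."""

    @abc.abstractmethod
    def get_resource_group(self) -> Optional[ResourceGroup]:
        """Return the group this resource is, or belongs to (assumed None otherwise)."""

    @abc.abstractmethod
    def component_resources(self) -> List[SingleResource]:
        """Return the single files that make up this resource."""

    @abc.abstractmethod
    def bonded_resources(self) -> List[SingleResource]:
        """Return every single file that must travel with this resource."""


class SingleResource(Resource):
    def __init__(self, name: str, group: Optional[ResourceGroup] = None):
        self.name = name
        self._group = group  # set only when the file is a member of a group

    def get_resource_group(self) -> Optional[ResourceGroup]:
        return self._group

    def component_resources(self) -> List[SingleResource]:
        # A single file is not composite, so it is its own only component.
        return [self]

    def bonded_resources(self) -> List[SingleResource]:
        # A group member drags along every file in its group.
        if self._group is None:
            return [self]
        return self._group.component_resources()


class ResourceGroup(Resource):
    def __init__(self, **member_names: str):
        # Hypothetical constructor: one SingleResource per keyword argument.
        self._members = [SingleResource(n, group=self) for n in member_names.values()]

    def get_resource_group(self) -> Optional[ResourceGroup]:
        return self

    def component_resources(self) -> List[SingleResource]:
        return list(self._members)

    def bonded_resources(self) -> List[SingleResource]:
        # A group just returns its components.
        return self.component_resources()


# The VCF + TBI case: referencing either file bonds it to the other.
vcf_group = ResourceGroup(vcf="sample.vcf.gz", tbi="sample.vcf.gz.tbi")
vcf, tbi = vcf_group.component_resources()
bonded_names = [r.name for r in vcf.bonded_resources()]
```

In this sketch `component_resources` and `bonded_resources` both return `SingleResource`s on every branch, so callers no longer need to distinguish groups from single files themselves.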
We can now add a type annotation which requires sets (2) and (3) above, that is, the produced-by-and-later-consumed and consumed-by sets, to be sets of `SingleResource`. In particular, (2) and (3) are concrete files that can be copied, as opposed to groups thereof.
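As an illustration of what such an annotation buys us, here is a heavily reduced sketch. The `Job` and `SingleResource` classes below are hypothetical stand-ins, and the `add_input` helper and the `bonded` attribute are invented for this example; only the `Set[SingleResource]` annotation on `_inputs` reflects the idea described above.

```python
from typing import List, Set


class SingleResource:
    """Hypothetical stand-in: one concrete file that can be copied."""

    def __init__(self, name: str):
        self.name = name
        # Files that must travel together with this one (itself by default).
        self.bonded: List["SingleResource"] = [self]

    def bonded_resources(self) -> List["SingleResource"]:
        return self.bonded


class Job:
    """Hypothetical, reduced job that only tracks the files it consumes."""

    def __init__(self) -> None:
        # A type checker can now reject attempts to put a group in this set.
        self._inputs: Set[SingleResource] = set()

    def add_input(self, resource: SingleResource) -> None:
        # Assumed behavior: consuming one file pulls in everything bonded to it.
        self._inputs.update(resource.bonded_resources())


vcf = SingleResource("sample.vcf.gz")
tbi = SingleResource("sample.vcf.gz.tbi")
vcf.bonded = tbi.bonded = [vcf, tbi]  # emulate membership in one group

job = Job()
job.add_input(tbi)
input_names = sorted(r.name for r in job._inputs)
```

Because every element of `_inputs` is a concrete file, the code that later copies inputs can iterate over the set without checking for groups.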