Weight selection of `AlchemicalNetwork`s for `Task` claiming by `Transformation` count?
It is currently the case that when multiple users make heavy use of an alchemiscale instance with very limited compute, one or both of the following will occur:
- users will compete with each other for compute, likely by bumping up the weights of their `AlchemicalNetwork`s to the max values until all are set to `1` and all of them get equal attention.
- users will have to communicate with each other to coordinate priority via their `AlchemicalNetwork` weights, requiring human time and attention.
This is partially caused by alchemiscale's model for sharing compute among multiple `AlchemicalNetwork`s: if user A has 10 networks with 10 `Transformation`s each, and user B has 1 network with 100 `Transformation`s, user A's `Transformation`s will on average receive about 10x the amount of compute attention assuming equal weights on all networks. We don't consider this to be necessarily fair, even if the model itself makes a lot of practical sense.
To improve the fairness of compute allocation based on network size, does it make sense to include `Transformation` count for each `AlchemicalNetwork` as a factor in their weighted selection for `Task` claiming? In other words, the larger the `AlchemicalNetwork`, the more likely its `Task`s are to receive attention?
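The arithmetic above can be made concrete with a small simulation. This is a hypothetical illustration, not alchemiscale's actual claiming code: it assumes claiming amounts to a weighted random draw over networks, and compares per-`Transformation` attention with and without the proposed `Transformation`-count scaling.

```python
import random

# Hypothetical setup (assumption, not from alchemiscale's implementation):
# user A has 10 networks of 10 Transformations each; user B has 1 network of 100.
networks = [{"user": "A", "weight": 0.5, "n_transformations": 10} for _ in range(10)]
networks.append({"user": "B", "weight": 0.5, "n_transformations": 100})

def per_transformation_attention(networks, scale_by_size=False,
                                 n_claims=100_000, seed=0):
    """Simulate weighted network selection for Task claims; return the mean
    attention each user's Transformations receive."""
    rng = random.Random(seed)
    weights = [
        nw["weight"] * (nw["n_transformations"] if scale_by_size else 1)
        for nw in networks
    ]
    claims = {"A": 0, "B": 0}
    for nw in rng.choices(networks, weights=weights, k=n_claims):
        claims[nw["user"]] += 1
    # each user owns 100 Transformations in total
    return {user: n / 100 for user, n in claims.items()}
```

With equal weights, user A's `Transformation`s receive roughly 10x user B's attention (A owns 10 of the 11 equally weighted networks); turning on `scale_by_size` equalizes per-`Transformation` attention.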
From discussion today:
- would be helpful to write clearly how `Task` claiming works currently, and what the selection probability currently looks like
- from there, we can debate alternative formulations that perhaps align better with user expectations, such as:
  - the proposal above, scaling `AlchemicalNetwork` selection weight by `Transformation` count (i.e. size)
  - scaling selection based on completed `Task` counts or fraction relative to incomplete
  - etc.
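As a starting point for writing down the current selection probability, here is a hedged sketch assuming `Task` claiming does a simple normalized weighted draw over actioned `AlchemicalNetwork`s (an assumption for discussion, not taken from alchemiscale's code):

```python
# Assumed model: each network's claim probability is its weight divided by
# the sum of all actioned networks' weights.
def claim_probabilities(weights):
    """Normalize per-network weights into selection probabilities."""
    total = sum(weights)
    return [w / total for w in weights]

# eleven networks all left at weight 0.5 -> each selected with probability 1/11,
# regardless of how many Transformations each contains
probs = claim_probabilities([0.5] * 11)
```

Under this model the probability depends only on relative weights, which is why network size never enters the picture today.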
@JenkeScheen do you have thoughts on this one based on your usage over the last couple years?
@IAlibay do you have thoughts as well?
> To improve the fairness of compute allocation based on network size, does it make sense to include `Transformation` count for each `AlchemicalNetwork` as a factor in their weighted selection for `Task` claiming? In other words, the larger the `AlchemicalNetwork`, the more likely its `Task`s are to receive attention?
So this would be an improvement over the status quo, but in my opinion there are still things we should try to address.
I'll add some rough thoughts here:
**One high-priority `Task` should get more attention than lots of lower-priority `Task`s**
In practice, the idea of "I have one or two really urgent things I need to get through" is very common, particularly when you're always trying to catch up ahead of the next design meeting with your med chemists.
Having the ability to move up the queue and get that job through ahead of the sea of other `Task`s is very important. From what I understand, neither the current `Task` priority method nor the one proposed here allows you to do that?
**Priority should be scaled by the amount of compute you are using**
Someone that has 200 `Task`s running is in a completely different situation than someone that has 2 `Task`s running. Having a kind of "decreasing returns" scaling where folks with fewer currently running `Task`s can get some jobs through would be very helpful. It would effectively prevent folks from getting stuck behind a wave of high-priority `Task`s. It also means that if everyone has lots of equivalent-priority `Task`s, then the optimal state is that the compute pool is shared equally between each user/project (maybe with some kind of scaling based on project contribution to compute).
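One possible form of the "decreasing returns" scaling described above (a hypothetical formula, not something alchemiscale implements) is to divide a `Task`'s effective claim weight by the owner's current running-`Task` count:

```python
# Hypothetical "decreasing returns" fair-share scaling: a user's marginal
# claim weight shrinks as their number of currently running Tasks grows.
def fair_share_weight(priority: float, running_tasks: int) -> float:
    """Effective claim weight for a Task, down-weighted by how many Tasks
    the same user already has running."""
    return priority / (1 + running_tasks)

# a user with 200 running Tasks gets far less marginal weight than one with 2,
# so a small submission can still get through behind a large one
heavy_user = fair_share_weight(priority=1.0, running_tasks=200)
light_user = fair_share_weight(priority=1.0, running_tasks=2)
```

The `1 + running_tasks` denominator is one arbitrary choice; any monotonically decreasing function of running-`Task` count would give the same qualitative behavior.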
**Faster `Task`s should have priority over slower ones**
This is something that might not be very easy to do right now, but as we develop new `Protocol`s we are going to be in a situation where we will have some really fast jobs (e.g. SFEs) and much slower ones (e.g. ABFEs). When that happens, we need to avoid a case where someone comes in with lots of very slow jobs and clogs up the entire resource pool. If we can estimate the average `Task` execution time for a network, we might be able to avoid this issue by deprioritizing `Task`s from slow networks to allow higher-throughput `Task`s to clear through.
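The runtime-based deprioritization above could be as simple as dividing a network's base weight by its estimated mean `Task` runtime. This is a sketch under that assumption; the function name and the idea of having a runtime estimate at all are hypothetical:

```python
# Hypothetical throughput-aware scaling: networks with long estimated Task
# runtimes get proportionally less claim weight, so fast Protocols
# (e.g. SFEs) are not starved behind slow ones (e.g. ABFEs).
def throughput_weight(base_weight: float, avg_runtime_hours: float) -> float:
    """Down-weight network selection in proportion to estimated mean
    Task runtime."""
    return base_weight / avg_runtime_hours

sfe = throughput_weight(1.0, avg_runtime_hours=1.0)    # fast network
abfe = throughput_weight(1.0, avg_runtime_hours=24.0)  # slow network
```

Dividing by runtime roughly equalizes compute-hours rather than `Task` counts between fast and slow networks.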
From discussion, I believe we could proceed with the following:
- Propose and perform simulated experiments with
AlchemicalNetworkselection based on<network weight> x f(<transformation count>) x g(<actioned tasks), where we may wantAlchemicalNetworks with moreTransformations and fewer actionedTasks to get more attention on average to combat the issues noted above. - Scaling the likelihood of
Taskselection based on the number of currently runningTasks created by the same user. This would function as a memoryless form of fair-share scheduling.
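For the simulated experiments, the proposed weight could be sketched as below. The specific choices of `f` and `g` here (square root and reciprocal) are placeholders for illustration only; picking good forms for them is exactly what the experiments would explore:

```python
import math

# Sketch of the proposed selection weight
#   <network weight> x f(<transformation count>) x g(<actioned tasks>)
# with hypothetical f = sqrt (grows with network size) and
# g = 1 / (1 + n) (shrinks as more Tasks are already actioned).
def selection_weight(network_weight: float,
                     n_transformations: int,
                     n_actioned_tasks: int) -> float:
    f = math.sqrt(n_transformations)
    g = 1.0 / (1.0 + n_actioned_tasks)
    return network_weight * f * g

# a large, under-served network outweighs a small, heavily actioned one
big_idle = selection_weight(0.5, n_transformations=100, n_actioned_tasks=0)
small_busy = selection_weight(0.5, n_transformations=9, n_actioned_tasks=50)
```

A sublinear `f` like `sqrt` is one way to boost large networks without letting them dominate outright; a linear `f` would reproduce the pure `Transformation`-count scaling proposed at the top of the thread.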