Weight selection of `AlchemicalNetwork`s for `Task` claiming by `Transformation` count?
It is currently the case that when multiple users make heavy use of an alchemiscale instance with very limited compute, one or both of the following will occur:
- users will compete with each other for compute, likely by bumping up the weights of their `AlchemicalNetwork`s to the max values until all are set to `1` and all of them get equal attention.
- users will have to communicate with each other to coordinate priority via their `AlchemicalNetwork` weights, requiring human time and attention.
This is partially caused by alchemiscale's model for sharing compute among multiple `AlchemicalNetwork`s: if user A has 10 networks with 10 `Transformation`s each, and user B has 1 network with 100 `Transformation`s, user A's `Transformation`s will on average receive about 10x the amount of compute attention assuming equal weights on all networks. We don't consider this to be necessarily fair, even if the model itself makes a lot of practical sense.
To improve the fairness of compute allocation based on network size, does it make sense to include `Transformation` count for each `AlchemicalNetwork` as a factor in their weighted selection for `Task` claiming? In other words, the larger the `AlchemicalNetwork`, the more likely its `Task`s are to receive attention?
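The arithmetic above can be made concrete with a small simulation. This is a hypothetical illustration, not alchemiscale's actual claiming code: it assumes claiming amounts to a weighted random draw over networks, and compares per-`Transformation` attention with and without the proposed `Transformation`-count scaling.

```python
import random

# Hypothetical setup (assumption, not from alchemiscale's implementation):
# user A has 10 networks of 10 Transformations each; user B has 1 network of 100.
networks = [{"user": "A", "weight": 0.5, "n_transformations": 10} for _ in range(10)]
networks.append({"user": "B", "weight": 0.5, "n_transformations": 100})

def per_transformation_attention(networks, scale_by_size=False,
                                 n_claims=100_000, seed=0):
    """Simulate weighted network selection for Task claims; return the mean
    attention each user's Transformations receive."""
    rng = random.Random(seed)
    weights = [
        nw["weight"] * (nw["n_transformations"] if scale_by_size else 1)
        for nw in networks
    ]
    claims = {"A": 0, "B": 0}
    for nw in rng.choices(networks, weights=weights, k=n_claims):
        claims[nw["user"]] += 1
    # each user owns 100 Transformations in total
    return {user: n / 100 for user, n in claims.items()}
```

With equal weights, user A's `Transformation`s receive roughly 10x user B's attention (A owns 10 of the 11 equally weighted networks); turning on `scale_by_size` equalizes per-`Transformation` attention.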
From discussion today:
- would be helpful to write clearly how `Task` claiming works currently, and what the selection probability currently looks like
- from there, we can debate alternative formulations that perhaps align better with user expectations, such as:
  - the proposal above, scaling `AlchemicalNetwork` selection weight by `Transformation` count (i.e. size)
  - scaling selection based on completed `Task` counts or fraction relative to incomplete
  - etc.
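As a starting point for writing down the current selection probability, here is a hedged sketch assuming `Task` claiming does a simple normalized weighted draw over actioned `AlchemicalNetwork`s (an assumption for discussion, not taken from alchemiscale's code):

```python
# Assumed model: each network's claim probability is its weight divided by
# the sum of all actioned networks' weights.
def claim_probabilities(weights):
    """Normalize per-network weights into selection probabilities."""
    total = sum(weights)
    return [w / total for w in weights]

# eleven networks all left at weight 0.5 -> each selected with probability 1/11,
# regardless of how many Transformations each contains
probs = claim_probabilities([0.5] * 11)
```

Under this model the probability depends only on relative weights, which is why network size never enters the picture today.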
@JenkeScheen do you have thoughts on this one based on your usage over the last couple years?
@IAlibay do you have thoughts as well?
> To improve the fairness of compute allocation based on network size, does it make sense to include `Transformation` count for each `AlchemicalNetwork` as a factor in their weighted selection for `Task` claiming? In other words, the larger the `AlchemicalNetwork`, the more likely its `Task`s are to receive attention?
So this would be an improvement over the status quo, but in my opinion there are still things we should try to address.
I'll add some rough thoughts here:
**One high-priority `Task` should get more attention than lots of lower-priority `Task`s**
In practice, the idea of "I have one or two really urgent things I need to get through" is very common, particularly when you're always trying to catch up ahead of the next design meeting with your med chemists.
Having the ability to move up the queue and get that job through ahead of the sea of other `Task`s is very important. From what I understand, neither the current `Task` priority method nor the one proposed here allows you to do that?
**Priority should be scaled by the amount of compute you are using**
Someone that has 200 `Task`s running is in a completely different situation than someone that has 2 `Task`s running. Having a kind of "decreasing returns" scaling where folks with fewer currently running `Task`s can get some jobs through would be very helpful. It would effectively prevent folks from getting stuck behind a wave of high-priority `Task`s. It also means that if everyone has lots of equivalent-priority `Task`s, then the optimal state is that the compute pool is shared equally between each user/project (maybe with some kind of scaling based on project contribution to compute).
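One possible form of the "decreasing returns" scaling described above (a hypothetical formula, not something alchemiscale implements) is to divide a `Task`'s effective claim weight by the owner's current running-`Task` count:

```python
# Hypothetical "decreasing returns" fair-share scaling: a user's marginal
# claim weight shrinks as their number of currently running Tasks grows.
def fair_share_weight(priority: float, running_tasks: int) -> float:
    """Effective claim weight for a Task, down-weighted by how many Tasks
    the same user already has running."""
    return priority / (1 + running_tasks)

# a user with 200 running Tasks gets far less marginal weight than one with 2,
# so a small submission can still get through behind a large one
heavy_user = fair_share_weight(priority=1.0, running_tasks=200)
light_user = fair_share_weight(priority=1.0, running_tasks=2)
```

The `1 + running_tasks` denominator is one arbitrary choice; any monotonically decreasing function of running-`Task` count would give the same qualitative behavior.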
**Faster `Task`s should have priority over slower ones**
This is something that might not be very easy to do right now, but as we develop new `Protocol`s we are going to be in a situation where we will have some really fast jobs (e.g. SFEs) and much slower ones (e.g. ABFEs). When that happens, we need to avoid a case where someone comes in with lots of very slow jobs and clogs up the entire resource pool. If we can estimate the average `Task` execution time for a network, we might be able to avoid this issue by deprioritizing `Task`s from slow networks to allow higher-throughput `Task`s to clear through.
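The runtime-based deprioritization above could be as simple as dividing a network's base weight by its estimated mean `Task` runtime. This is a sketch under that assumption; the function name and the idea of having a runtime estimate at all are hypothetical:

```python
# Hypothetical throughput-aware scaling: networks with long estimated Task
# runtimes get proportionally less claim weight, so fast Protocols
# (e.g. SFEs) are not starved behind slow ones (e.g. ABFEs).
def throughput_weight(base_weight: float, avg_runtime_hours: float) -> float:
    """Down-weight network selection in proportion to estimated mean
    Task runtime."""
    return base_weight / avg_runtime_hours

sfe = throughput_weight(1.0, avg_runtime_hours=1.0)    # fast network
abfe = throughput_weight(1.0, avg_runtime_hours=24.0)  # slow network
```

Dividing by runtime roughly equalizes compute-hours rather than `Task` counts between fast and slow networks.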
From discussion, I believe we could proceed with the following:
- Propose and perform simulated experiments with
AlchemicalNetworkselection based on<network weight> x f(<transformation count>) x g(<actioned tasks), where we may wantAlchemicalNetworks with moreTransformations and fewer actionedTasks to get more attention on average to combat the issues noted above. - Scaling the likelihood of
Taskselection based on the number of currently runningTasks created by the same user. This would function as a memoryless form of fair-share scheduling.
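For the simulated experiments, the proposed weight could be sketched as below. The specific choices of `f` and `g` here (square root and reciprocal) are placeholders for illustration only; picking good forms for them is exactly what the experiments would explore:

```python
import math

# Sketch of the proposed selection weight
#   <network weight> x f(<transformation count>) x g(<actioned tasks>)
# with hypothetical f = sqrt (grows with network size) and
# g = 1 / (1 + n) (shrinks as more Tasks are already actioned).
def selection_weight(network_weight: float,
                     n_transformations: int,
                     n_actioned_tasks: int) -> float:
    f = math.sqrt(n_transformations)
    g = 1.0 / (1.0 + n_actioned_tasks)
    return network_weight * f * g

# a large, under-served network outweighs a small, heavily actioned one
big_idle = selection_weight(0.5, n_transformations=100, n_actioned_tasks=0)
small_busy = selection_weight(0.5, n_transformations=9, n_actioned_tasks=50)
```

A sublinear `f` like `sqrt` is one way to boost large networks without letting them dominate outright; a linear `f` would reproduce the pure `Transformation`-count scaling proposed at the top of the thread.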