Airavata 3348 final cut

Open vivekshresta opened this issue 3 years ago • 2 comments

This PR includes changes to validate the storage limit for a user. Given an experiment, Airavata retrieves the amount of storage the user is consuming and validates it against the UserStorageQuota in the StoragePreference.
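As a rough illustration of the check described above (not the code in this PR), the validation amounts to measuring usage and comparing it to a quota. The sketch below walks a local directory as a stand-in for the remote storage resource; `directory_size_bytes` and `is_within_quota` are hypothetical names, and the real check would query the storage resource rather than the local filesystem.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root.

    Local stand-in (assumption) for asking the storage resource how much
    space a user's data directory occupies.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def is_within_quota(used_bytes: int, quota_bytes: int) -> bool:
    """Return True when usage does not exceed the quota (hypothetical helper)."""
    return used_bytes <= quota_bytes
```

A caller would compute `directory_size_bytes(user_dir)` once and pass the result, together with the quota taken from the storage preference, to `is_within_quota`.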

vivekshresta avatar Aug 21 '20 09:08 vivekshresta

Hi @machristie ,

Thanks for reviewing the code.

  • From our last conversation on the dev mailing list, I assumed we had agreed that it makes sense for Airavata to handle the storage limit since, in the future, Airavata can choose between multiple StoragePreferences, or fall back to a storage preference in the GatewayResourceProfile or UserStoragePreference (which is about to be deprecated), when the storage preference id given by the gateway is invalid. I guess the gateway could easily achieve this functionality too.

I did want the validation to happen internally, but the problem I faced was that, during the experiment creation phase in Airavata, the experiment model does not have any data about the StoragePreference in which the experiment is being created. Changing the createExperiment() method to accept another parameter would mean changes across all the gateways. In my previous discussions with the team, I also learned that, similar to choosing compute preferences for an experiment during the experiment creation phase, we're going to develop new functionality where the user gets to choose the StoragePreference in which they create the experiment. With that in mind, I created a new API that any gateway can invoke if it chooses to use this feature. But I just verified that, by calling '_set_storage_id_and_data_dir(experiment)' before creating an experiment, I can set the storageId and experiment data directory, removing the need to pass the storageId explicitly.

Basically, the public API can be changed to an internal API now. Will make those changes soon.

  • I did consider this. The problems with this approach are:
    1. We will check the size limit only after the experiment creation is done.
    2. When we know the size limit is exceeded, Helix needs to communicate back to the APIServer to delete the created experiment entries and, if needed, the experiment directory. Considering these and after discussing with Dimuthu, I thought this might be the better approach when we use 'StorageResourceAdaptor', but this does seem to complicate things.

Even if I remove the new public API and integrate it with createExperiment(), this approach would still consume the APIServer's resources (though we're using pooled resources instead of creating a new SSH connection every time). Does it make sense to just stick with the original approach, with the gateway worrying about the storage quotas? Also, can you please elaborate a little on transient network failure in Helix?

vivekshresta avatar Aug 28 '20 02:08 vivekshresta

@vivekshresta,

Regarding validating internally, I would add the check to launchExperiment instead of createExperiment.

> We will check the size limit only after the experiment creation is done.

That's true, but the current approach only checks before the experiment runs and so doesn't account for experiment output files.

> When we know the size limit is exceeded, helix needs to communicate back to APIServer for deleting the created experiment entries and if needed, deleting the experiment directory. Considering these and after discussing with Dimuthu, I thought this might be the better approach when we use 'StorageResourceAdaptor', but this does seem to complicate things.

Well, my two cents, but I don't think the experiment needs to be deleted. We just need to set a flag in the database that the user is over quota on that storage resource and then prevent further file uploads/experiments on that storage resource.
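To make the flag idea concrete, here is a minimal sketch assuming an in-memory set in place of the database. `QuotaFlags`, `QuotaExceededError`, `record_usage`, and `check_can_launch` are hypothetical names for illustration only; none of them exist in Airavata.

```python
from dataclasses import dataclass, field


class QuotaExceededError(Exception):
    """Raised when a user is flagged as over quota on a storage resource."""


@dataclass
class QuotaFlags:
    # (user_id, storage_id) pairs currently flagged as over quota.
    # An in-memory set stands in for the database column (assumption).
    over_quota: set = field(default_factory=set)

    def record_usage(self, user_id: str, storage_id: str,
                     used_bytes: int, quota_bytes: int) -> None:
        """Set or clear the flag after usage is measured, e.g. once outputs are staged."""
        key = (user_id, storage_id)
        if used_bytes > quota_bytes:
            self.over_quota.add(key)
        else:
            self.over_quota.discard(key)

    def check_can_launch(self, user_id: str, storage_id: str) -> None:
        """Refuse new uploads or launches while the flag is set."""
        if (user_id, storage_id) in self.over_quota:
            raise QuotaExceededError(
                f"{user_id} is over quota on storage resource {storage_id}"
            )
```

With this shape nothing is deleted: an experiment that pushes the user over quota still completes, and the flag only blocks later uploads or launches until usage drops back under the limit.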

> Also can you please elaborate a little on transient network failure in helix.

Sure, Helix's task framework has built-in fault tolerance support, for example retrying in the case of failure: https://helix.apache.org/0.8.0-docs/tutorial_task_framework.html. By transient network failure I mean a temporary network problem between Airavata and the SSH host that prevents the SSH connection from being established or completing successfully.
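For illustration, the general pattern of retrying after a transient failure looks roughly like the sketch below. This is not Helix's API (Helix handles retries itself, per the linked tutorial); `TransientNetworkError` and `retry_transient` are hypothetical names, and the exponential backoff is an assumption rather than something Helix is known to do.

```python
import time


class TransientNetworkError(Exception):
    """Stand-in for a temporary failure, e.g. an SSH connection that drops."""


def retry_transient(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientNetworkError with exponential backoff.

    The last failure is re-raised once max_attempts is reached. Other
    exception types are not retried.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise
            # Wait longer after each failed attempt: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Running the size check inside a task framework that retries like this means a brief network problem delays the check instead of failing the request the user is waiting on.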

machristie avatar Aug 31 '20 19:08 machristie