
LSF scheduler integration

Open robertsawko opened this issue 6 years ago • 25 comments

Hi,

My colleague and I are working on LSF integration. We have created and adapted the relevant files, lsf.py and lsf.sh, and LSF now shows up under "Scheduler" in the cluster settings. Unfortunately, we still get "Unsupported scheduler" when we press the "Save" button.

Looking at the code, we can see that queue in cluster.py still doesn't recognise the type, even though we added the relevant entry to the type_to_adapter dictionary in queue/__init__.py.

Any advice would be welcome, thanks.
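
For context, the registration we added looks roughly like this (a simplified sketch; the adapter class name and import path are ours and may not match the upstream layout):

from cumulus.constants import QueueType
from cumulus.queue.lsf import LsfQueueAdapter  # our new adapter from lsf.py

# type_to_adapter maps each scheduler type string to its adapter class;
# the LSF key must match the string the client sends.
type_to_adapter = {
    # ... existing entries for QueueType.SGE, PBS and SLURM ...
    QueueType.LSF: LsfQueueAdapter
}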

robertsawko avatar Aug 22 '18 13:08 robertsawko

I'll only add that we also modified the constants.py file for the cumulus plugin, and the hpccloud files (LSF.js, index.js and RunCluster.js), so that the LSF scheduler shows up.

carpemonf-zz avatar Aug 22 '18 13:08 carpemonf-zz

The message is produced when this check returns False. Are you using the same string to identify the queue as the one registered?
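
For reference, the check boils down to a dictionary membership test (paraphrased from the cumulus source):

def is_valid_type(type):
    # True only when the string is a key registered in type_to_adapter
    return type in type_to_adapter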

cjh1 avatar Aug 22 '18 13:08 cjh1

Can we just double-check with you: where do we put the string that identifies the queue, and where do we register it?

robertsawko avatar Aug 22 '18 13:08 robertsawko

It's the key used when you add the adapter to the type_to_adapter dict.

cjh1 avatar Aug 22 '18 14:08 cjh1

I think we have the same string for the LSF queue. We have the following definition for /opt/hpccloud/cumulus/cumulus/constants.py:

class QueueType:
    SGE = 'sge'
    PBS = 'pbs'
    SLURM = 'slurm'
    LSF = 'lsf'
    NEWT = 'newt'

and for /opt/hpccloud/hpccloud/src/panels/SchedulerConfig/index.js:

            <option value="sge">Sun Grid Engin</option>
            <option value="pbs">PBS</option>
            <option value="slurm">SLURM</option>
            <option value="lsf">LSF</option>

carpemonf-zz avatar Aug 22 '18 14:08 carpemonf-zz

@carpemonf In that case, are you sure you are running the updated code? Maybe add a quick print statement to ensure your changes are being loaded on the server.

cjh1 avatar Aug 22 '18 14:08 cjh1

@cjh1, sorry, we are a bit new to Girder and the Python-to-web-UI plumbing. Could you tell us how to print out that data, and where to find the standard output or log file? Thanks.

robertsawko avatar Aug 23 '18 10:08 robertsawko

Thanks. I have just added a modified message to the ValidationException in /opt/hpccloud/cumulus/girder/cumulus/server/models/cluster.py. For testing, it also raises a message when the queue is supported:

        if not queue.is_valid_type(scheduler_type):
            raise ValidationException('Unsupported scheduler: %s.' % scheduler_type, 'type')
        else:
            raise ValidationException('Supported scheduler: %s.' % scheduler_type, 'type')

The new messages appear, so at least this piece of code is updated.

Next, I modified the is_valid_type(type) function in /opt/hpccloud/cumulus/cumulus/queue/__init__.py to force the return value to False, to see whether the function was checking the LSF queue properly, but it always gives me True for the default queues:

    valid = False
    return valid  # forces every scheduler type to be reported as unsupported
    #return type in type_to_adapter

carpemonf-zz avatar Aug 23 '18 11:08 carpemonf-zz

Sorry for bumping this thread, but we are still struggling with this! What we did just now was comment out the exception raising, and we actually managed to move forward. This suggests the code is being picked up, but maybe not all of it? Any advice would be welcome.

robertsawko avatar Aug 28 '18 12:08 robertsawko

@robertsawko Sorry, I have been away at a conference. Will try to take a look today.

cjh1 avatar Aug 28 '18 14:08 cjh1

@robertsawko Are your code changes pushed somewhere so I could take a look?

cjh1 avatar Aug 28 '18 14:08 cjh1

@cjh1 We had some problems with the cumulus dependencies when building HPCCloud for development, so in the meantime we are playing with the prebuilt VMs in HPCCloud-deploy/prebuilt/hpccloud-server/. I have attached a patch: lsf.txt. It should work for this VM.

Please let me know if it works for you.

carpemonf-zz avatar Aug 28 '18 17:08 carpemonf-zz

So I was able to take your server changes (the new adapter), apply them locally, and then create an LSF cluster using the following POST (outside the web app):

{
  "config": {
    "scheduler": {"type": "lsf"},
    "ssh": {"user": "test"},
"host": "test"
  },
  "name": "test3",
  "type": "trad"
}
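
In case it helps to reproduce, the request was something along these lines (a sketch; the API root and token are placeholders for your Girder instance):

import requests

payload = {
    'config': {
        'scheduler': {'type': 'lsf'},
        'ssh': {'user': 'test'},
        'host': 'test'
    },
    'name': 'test3',
    'type': 'trad'
}

# POST the cluster definition to the clusters endpoint
resp = requests.post(
    'http://localhost:8080/api/v1/clusters',  # placeholder API root
    headers={'Girder-Token': '<token>'},      # placeholder auth token
    json=payload)
print(resp.status_code, resp.json())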

@jourdain Can you check that the client side is constructing the appropriate JSON?

cjh1 avatar Aug 28 '18 20:08 cjh1

@cjh1 Could you please confirm that, after applying the following to cumulus/cumulus/queue/__init__.py, a cluster with PBS, SGE, SLURM or LSF cannot be created? I can still create them with these changes:

-    return type in type_to_adapter
+    valid = False
+    return valid
+    #return type in type_to_adapter

carpemonf-zz avatar Aug 30 '18 11:08 carpemonf-zz

@carpemonf If I return False from is_valid_type, I am unable to create a cluster, I get "Unsupported scheduler." as expected.

cjh1 avatar Aug 31 '18 14:08 cjh1

Thank you @cjh1, I probably did something wrong then.

carpemonf-zz avatar Sep 03 '18 18:09 carpemonf-zz

We're back working on this topic and now have our own Ansible deployment with the LSF integration added as a patch. Once we confirm it's working as expected I will make sure we can share it with you, but for now I would like to ask a question without showing the full patch.

It seems that in the lsf.sh script, which we modelled on slurm.sh, some variables are not passed at all, e.g. the queue name and the number of slots. Whatever value we set in the web UI, the script cumulus generates contains no queue line and fixes #BSUB -n (number of slots) to 1.

Could you perhaps point us to the right place in the code where we may have missed something? Which file/component is responsible for extracting these variables from the web forms?

robertsawko avatar Jan 22 '19 08:01 robertsawko

@robertsawko I am not super familiar with the frontend code, but I think the place you need to look is here. @jourdain can correct me if I am wrong :smile:

cjh1 avatar Jan 23 '19 16:01 cjh1

Yes, you're right, and make sure you register it in index.js.

jourdain avatar Jan 23 '19 19:01 jourdain

Hi @jourdain @cjh1

Sorry for bringing this back up again, but we are still having issues passing variables from the web interface to the queue, for example "Number of slots" or "Max runtime": these variables are empty when the corresponding lsf.sh is executed.

Since it was failing with our custom LSF implementation, I decided to test the prebuilt compute-node VMs provided in this repo, configured with an SGE scheduler. I modified /opt/hpccloud/cumulus/cumulus/templates/schedulers/sge.sh to print the following variables:

  • job.name: properly displayed
  • job._id: properly displayed
  • maxWallTime: empty
  • numberOfSlots: always shows 1, regardless of the "Number of slots" specified in the front end.

Taking a deeper look into the cumulus taskflows, I realised that in /opt/hpccloud/hpccloud/server/taskflows/hpccloud/taskflow/openfoam/tutorial.py the numberOfSlots value seems to be hardcoded (the same applies to windtunnel.py):

    ## slots
    job['params']['numberOfSlots'] = 1

However, commenting out this line results in an empty numberOfSlots variable. Can you please provide any help with this?

carpemonf-zz avatar Apr 24 '19 10:04 carpemonf-zz

Can you check that all the info is properly sent to the server here?

If that's the case, I'm wondering what might be trimming out some of the information.

jourdain avatar Apr 24 '19 13:04 jourdain

Thanks @jourdain. If I check the payload in that function, all the parameters are correct.

Coming back to the taskflow scripts tutorial.py and windtunnel.py: job['params'] = {} is initialised, but numberOfSlots is never linked to the value specified by the user in the web UI. I can see that for PyFR there's something like this:

number_of_procs = kwargs.get('numberOfSlots')

Doing the same for tutorial.py or windtunnel.py worked for me and makes numberOfSlots visible to the queue script:

-job['params']['numberOfSlots'] = 1
+job['params']['numberOfSlots'] = kwargs.get('numberOfSlots')

I still have the problem with the wall time. Does the same apply to maxWallTime? Should it be in job['params']?

carpemonf-zz avatar Apr 24 '19 15:04 carpemonf-zz

On the Python side, what else do you have in the kwargs?

jourdain avatar Apr 24 '19 16:04 jourdain

I'm using the default prebuilt VMs for HPCCloud server and compute-node. Iterating over kwargs, I get:

[17:26:37.973] INFO: numberOfGpusPerNode
[17:26:37.988] INFO: 0
[17:26:37.994] INFO: numberOfSlots
[17:26:38.000] INFO: 1
[17:26:38.008] INFO: image_spec
[17:26:38.014] INFO: {
  "owner": "695977956746",
  "tags": {
    "openfoam": "1612"
  }
}
[17:26:38.021] INFO: next
[17:26:38.027] INFO: {
  "args": [],
  "chord_size": null,
  "immutable": false,
  "kwargs": {},
  "options": {},
  "subtask_type": null,
  "task": "hpccloud.taskflow.openfoam.tutorial.create_openfoam_job"
}
[17:26:38.035] INFO: queue
[17:26:38.042] INFO: 
[17:26:38.049] INFO: maxWallTime
[17:26:38.055] INFO: {
  "hours": "1",
  "minutes": 0,
  "seconds": 0
}
[17:26:38.062] INFO: input

[17:26:38.068] INFO: {
  "folder": {
    "id": "5cc08e170640fd00e5065cd7"
  },
  "shFile": {
    "id": "5cc08e320640fd00e5065cf4"
  }
}
[17:26:38.075] INFO: output
[17:26:38.083] INFO: {
  "folder": {
    "id": "5cc08e170640fd00e5065cd6"
  }
}

So, does this need something like job['params']['maxWallTime'] = kwargs.get('maxWallTime')?
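
If so, I imagine the dict also needs flattening before a scheduler template can use it; a hypothetical helper (not cumulus code), e.g.:

def format_wall_time(wall_time):
    # Flatten the UI's maxWallTime dict, e.g.
    # {'hours': '1', 'minutes': 0, 'seconds': 0}, into the
    # HH:MM:SS form schedulers usually expect.
    return '%02d:%02d:%02d' % (int(wall_time.get('hours', 0)),
                               int(wall_time.get('minutes', 0)),
                               int(wall_time.get('seconds', 0)))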

carpemonf-zz avatar Apr 24 '19 16:04 carpemonf-zz

Yes, that was my thought. All the info gets passed to the kwargs, and it's up to the job to pick it up and attach it to the job params.
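
For example, in tutorial.py / windtunnel.py, something along these lines (a sketch, not the exact code; the keys match your kwargs dump above and the default is illustrative):

## slots: take the value from the web UI instead of hardcoding it
job['params']['numberOfSlots'] = kwargs.get('numberOfSlots', 1)

## wall time: same pattern, when present in the kwargs
wall_time = kwargs.get('maxWallTime')
if wall_time is not None:
    job['params']['maxWallTime'] = wall_time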

jourdain avatar Apr 24 '19 16:04 jourdain