
Document common cluster-specific (but not job-scheduler-specific) quirks, with work-arounds if available

Open lesteve opened this issue 6 years ago • 3 comments

I wish there were a list of cluster configuration quirks (ones that are not job scheduler specific) and possible work-arounds (when there are some) somewhere in the doc. For example, I was not aware of the TCP/IP connection limitations between login and compute nodes in some clusters until a few days ago. Here is a rough list off the top of my head:

  • submit_command not available on the compute nodes, e.g. #333. Possible work-around: https://github.com/dask/dask-jobqueue/issues/333#issuecomment-530263090 (I never tried it myself). This is the case for all the OAR clusters I know about, i.e. the submit command is never available on the compute nodes, so in principle I could test this idea.
  • TCP/IP restrictions between login and compute nodes, e.g. #354 and #355. Possible work-around: start the main script / notebook in an interactive node, with all the additional pain and limitations this entails; see https://github.com/dask/dask-jobqueue/issues/354#issuecomment-542879534 for the one I know about.
  • Non-uniform network interfaces on login and compute nodes. I guess the same work-around as for the TCP/IP restrictions would work, but it is not a great one.
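
To find out which of these quirks applies on a given cluster, something like the following could be run once on a login node and once on a compute node. This is only a rough sketch using the Python standard library; the function names are made up for illustration and are not part of dask-jobqueue, and the command, host and port you pass in depend on your cluster.

```python
import shutil
import socket


def submit_command_available(command):
    """Return True if `command` (e.g. "oarsub", "qsub", "sbatch") is on PATH."""
    return shutil.which(command) is not None


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def interface_names():
    """Return the names of the network interfaces visible on this node."""
    return [name for _index, name in socket.if_nameindex()]
```

Comparing the output of `interface_names()` on both kinds of node shows whether the interfaces are uniform, and calling `can_connect(...)` from a compute node against the host and port where the scheduler listens shows whether the TCP/IP restriction is in play.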

Please add more if you know more off the top of your head.

cc @mrocklin @guillaumeeb @jhamman

lesteve avatar Oct 17 '19 03:10 lesteve

There are a few here: https://blog.dask.org/2019/08/28/dask-on-summit


mrocklin avatar Oct 17 '19 12:10 mrocklin

~I reread your blog post to be sure, but I don't think the quirks in your blog post include the one about `bsub < job_script` vs `bsub job_script` (aka `use_stdin=True` vs `use_stdin=False`), unless I missed something.~

Sorry wrong issue, ignore this.

lesteve avatar Oct 17 '19 14:10 lesteve

There are a few here: blog.dask.org/2019/08/28/dask-on-summit

Good point!

lesteve avatar Oct 17 '19 14:10 lesteve