dask-jobqueue
Document common cluster specific (but not job scheduler specific) quirks with work-arounds if available
I wish the docs had a list of cluster configuration quirks (that are not job scheduler specific) along with possible work-arounds, when there are some. For example, I was not aware until a few days ago that some clusters restrict TCP/IP connections between login and compute nodes. Here is a rough list off the top of my head:
- submit_command not available on the compute nodes, e.g. #333. Possible work-around: https://github.com/dask/dask-jobqueue/issues/333#issuecomment-530263090 (I never tried it myself). This is the case for all the OAR clusters I know about, i.e. the submit command is never available on the compute nodes so in principle I could test this idea.
- TCP/IP restrictions between login and compute nodes e.g. #354 and #355. Possible work-around: start the main script / notebook in an interactive node with all the additional pain and limitations this entails, see https://github.com/dask/dask-jobqueue/issues/354#issuecomment-542879534 for the one I know about.
- non-uniform network interfaces on login and compute nodes. I guess the same work-around as for the TCP/IP restrictions would work, but it is not a great work-around.
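To find out whether a cluster has the TCP/IP or network interface quirks above, something like the following sketch can be run on both a login node and a compute node. It uses only the Python standard library; `list_interfaces` and `can_connect` are hypothetical helper names, not part of dask-jobqueue, and the host and port to test are assumptions that depend on where the scheduler runs.

```python
import socket


def list_interfaces():
    """Return the network interface names visible on the current node."""
    # socket.if_nameindex() yields (index, name) pairs, e.g. (1, "lo").
    return [name for _, name in socket.if_nameindex()]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False
```

Comparing the output of `list_interfaces()` on a login node and on a compute node shows whether the interfaces are uniform, and the `interface` keyword of the cluster class may need to be set accordingly. Calling `can_connect` from a compute node with the login node's host name and the scheduler port shows whether workers would be able to reach a scheduler started on the login node.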
Please add more if you know more off the top of your head.
cc @mrocklin @guillaumeeb @jhamman
There are a few here: https://blog.dask.org/2019/08/28/dask-on-summit
On Wed, Oct 16, 2019 at 10:22 PM Loïc Estève wrote: [...]
~I reread your blog post to be sure, but unless I missed something, I don't think the quirks in your blog post include the one about bsub < job_script vs bsub job_script (aka use_stdin=True vs use_stdin=False)~
Sorry wrong issue, ignore this.
There are a few here: blog.dask.org/2019/08/28/dask-on-summit
Good point!