Issue with SGEPlugin

Open cancan101 opened this issue 10 years ago • 10 comments

This happened when adding new nodes:

!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge_sc -x -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - [H[2J

I did not see it again when I added more nodes.

cancan101 avatar Dec 22 '14 22:12 cancan101

I actually ran into the same issue yesterday. I removed all the nodes and deleted /opt/sge on the master, then reran the SGE plugin to get a clean SGE install. No issues since then, but I am still wondering why it happened in the first place...

meissnert avatar Dec 22 '14 22:12 meissnert

It does happen from time to time indeed.

FinchPowers avatar Dec 22 '14 22:12 FinchPowers

Is this some sort of race where the step above finishes too quickly? Perhaps some issue with NFS?

cancan101 avatar Dec 22 '14 22:12 cancan101

I tried to dig into the issue but never managed to truly fix it. In the end, I simply run my repair commands and it eventually works.

FinchPowers avatar Dec 23 '14 13:12 FinchPowers

This looks like at least one race: the SGE plugin is run before the rest of the plugins. This means that the node is added to the cluster and can receive work before the other setup plugins have finished running.

To avoid this, set disable_queue and then explicitly add the SGE plugin to the end of the plugin list.

This race does not explain the issue that I reported, though. I wonder if starting NFS is asynchronous, so that under some circumstances it is not yet running when SGE starts.
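For readers wondering what the disable_queue suggestion might look like in practice, here is a minimal sketch. It assumes the usual StarCluster config layout, with a `[cluster ...]` section carrying `disable_queue` and `plugins` and one `[plugin ...]` section per plugin. The cluster name `mycluster`, the plugin names `mysetup` and `sge`, and the class `mypackage.mysetup.MySetup` are placeholders that do not come from this thread. Only the SGE class path is taken from the error message above. The helper simply checks that the queue is disabled and that the SGE plugin is listed last.

```python
import configparser

# Hypothetical config text. "mycluster", "mysetup", "sge" and
# "mypackage.mysetup.MySetup" are placeholders, not names from the thread.
EXAMPLE_CONFIG = """
[cluster mycluster]
disable_queue = True
plugins = mysetup, sge

[plugin mysetup]
setup_class = mypackage.mysetup.MySetup

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
"""

# Class path as it appears in the error message in this issue.
SGE_CLASS = "starcluster.plugins.sge.SGEPlugin"


def sge_runs_last(config_text, cluster):
    """Return True if the built-in queue setup is disabled and the last
    plugin listed for the cluster points at the SGE plugin class."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["cluster " + cluster]

    # The automatic SGE setup must be switched off first.
    if not section.getboolean("disable_queue", fallback=False):
        return False

    # Plugins are assumed to run in the order they are listed.
    names = [p.strip() for p in section.get("plugins", "").split(",") if p.strip()]
    if not names:
        return False

    last = "plugin " + names[-1]
    return (
        parser.has_section(last)
        and parser.get(last, "setup_class", fallback="") == SGE_CLASS
    )


print(sge_runs_last(EXAMPLE_CONFIG, "mycluster"))
```

Whether this ordering fully removes the race was not confirmed in the thread, so treat it as an illustration of the suggestion rather than a verified fix.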

cancan101 avatar Jan 14 '15 18:01 cancan101

I just ran into this again the other day. The issue was that my /tmp folder was full, leaving no space for SGE to write its temp files. I cleaned out /tmp and it worked again.
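Since a full /tmp produced this error, a quick free-space check on the master before adding nodes may save some debugging. The following is a minimal sketch using only the Python standard library. The function name `has_free_space` and the 100 MiB threshold are assumptions for illustration. The thread does not say how much space the SGE installer actually needs.

```python
import shutil


def has_free_space(path="/tmp", min_free_bytes=100 * 1024 * 1024):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` free. The default threshold is an arbitrary guess."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    if has_free_space("/tmp"):
        print("/tmp has room; safe to try adding nodes")
    else:
        print("/tmp is nearly full; clean it out before adding nodes")
```

This only covers the disk-space cause reported here. It would not catch the plugin-ordering or NFS timing problems discussed earlier.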

meissnert avatar Feb 27 '15 17:02 meissnert

I ran into the same error yesterday, and cleaning out /tmp totally solved the problem. My StarCluster controller is also the master node of one of my clusters. A webapp cache seems to have filled /tmp, so I got the error when I tried to add some nodes to my primary cluster. Cleaning out /tmp restored normal functionality. Thanks for the help!

ghost avatar Jun 02 '15 19:06 ghost

This may help: https://github.com/cancan101/StarCluster/commit/c2965063ea8b118761c42ba7c2b311d7df56937a. You might also want to add noatime.

cancan101 avatar Jul 06 '15 17:07 cancan101

Hi @cancan101, can you explain how to disable_queue in cluster.py? Also, did you mean to move the plugin to the very bottom of the starcluster/config file?

gtang31 avatar May 04 '16 00:05 gtang31

I am running into this issue quite frequently. Is there a code fix for this?

avnermay avatar Sep 26 '18 05:09 avnermay