StarCluster
Issue with SGEPlugin
This happened when adding new nodes:
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge_sc -x -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - [H[2J
I did not see it again when I added more nodes.
I actually ran into the same issue yesterday. I removed all the nodes and deleted /opt/sge on the master, then reran the SGE plugin to get a clean SGE install. No issues since then, but I'm still wondering why it happened in the first place...
It does happen from time to time indeed.
Is this some sort of race where the step above finishes too quickly? Perhaps some issue with NFS?
I tried to dig into the issue but never managed to truly fix it. In the end, I simply run my repair commands and it eventually works.
This looks like at least one race: the SGE plugin runs before the rest of the plugins. This means the node is added to the cluster and can receive work before the other setup plugins have finished running.
To avoid this, you want to set disable_queue and then explicitly add the SGE plugin to the end of the plugin list.
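A rough sketch of what that could look like in the StarCluster config file, offered as an assumption rather than a confirmed fix. The cluster name mycluster and the plugin names my_setup_plugin and my_other_plugin are hypothetical placeholders. The point is only that the built-in queue setup is turned off and the SGE plugin is listed last, so it runs after the other plugins:

```ini
[cluster mycluster]
# Skip the automatic SGE install that normally runs before user plugins.
disable_queue = True
# Plugins run in the order listed; sge goes last so nodes only join
# the queue after the other setup plugins have finished.
plugins = my_setup_plugin, my_other_plugin, sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
```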
That race does not explain the issue I reported, though. I wonder whether starting NFS is asynchronous, so that under some circumstances it is not yet running when SGE starts.
I just ran into this again the other day, and the issue was that my /tmp folder was full, leaving no space for SGE to write its temp files. Cleaned out /tmp and it worked again.
I ran into the same error yesterday and cleaning out /tmp totally solved the problem. My StarCluster controller is also the master node of one of my clusters. A webapp cache seems to have filled /tmp, so I got the error when I tried to add some nodes to my primary cluster. Cleaning out /tmp restored normal functionality. Thanks for the help!
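Since a full /tmp came up twice here, a quick check before adding nodes might save some head-scratching. This is a minimal sketch, not part of StarCluster: the helper name has_free_space and the 100 MB threshold are assumptions for illustration (the thread does not say how much space SGE actually needs).

```python
import shutil


def has_free_space(path="/tmp", min_free_bytes=100 * 1024 * 1024):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` free. The 100 MB default is an arbitrary guess."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    if has_free_space("/tmp"):
        print("/tmp has room; a full disk is probably not the cause")
    else:
        print("/tmp is nearly full; clean it out before rerunning the SGE plugin")
```

Run it on the master (or wherever the installer writes its temp files) before adding nodes.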
This may help: https://github.com/cancan101/StarCluster/commit/c2965063ea8b118761c42ba7c2b311d7df56937a. Also might want to add noatime.
Hi @cancan101, can you explain how to disable_queue in cluster.py? Also, did you mean to move the plugin to the very bottom of the starcluster/config file?
I am running into this issue quite frequently. Is there a code fix for this?