Error stitching large sample
Bug report
Description of the problem
I was trying to stitch a large sample of 20 tiles, with each tile having [1920, 1920, ~2800] pixels. I kept getting a Spark session timeout error at different stages of the stitching pipeline.
For example, below is a case where the error came from the run_retile stage. For the same data, it would sometimes get past this stage but hit the same session timeout error at a later stage, run_stitching.
The error only occurs with this large sample; I have no problem running a sample that is about 10x smaller.
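For scale, a rough back-of-the-envelope estimate (assuming 16-bit voxels, which is an assumption since the bit depth is not stated above): each tile is about 1920 x 1920 x 2800 x 2 bytes, roughly 19 GiB uncompressed, so the 20-tile sample is on the order of 385 GiB per channel round, which is where worker memory and shared-filesystem latency start to matter.
# Rough size estimate for the failing dataset (assumes 16-bit voxels; adjust if your acquisition differs)
TILE_BYTES=$(( 1920 * 1920 * 2800 * 2 ))
TOTAL_BYTES=$(( TILE_BYTES * 20 ))
echo "per tile: $(( TILE_BYTES / 1024**3 )) GiB, total: $(( TOTAL_BYTES / 1024**3 )) GiB"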
Log file(s)
Jun-28 10:26:01.924 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'stitching:stitch:run_retile:spark_start_app (1)'
Caused by:
Process `stitching:stitch:run_retile:spark_start_app (1)` terminated with an error exit status (1)
Command executed:
echo "Starting the spark driver"
SESSION_FILE="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId"
echo "Checking for $SESSION_FILE"
SLEEP_SECS=10
MAX_WAIT_SECS=7200
SECONDS=0
while ! test -e "$SESSION_FILE"; do
sleep ${SLEEP_SECS}
if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then
echo "Waiting for $SESSION_FILE"
SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} ))
else
echo "-------------------------------------------------------------------------------"
echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster "
echo "-------------------------------------------------------------------------------"
exit 1
fi
done
if ! grep -F -x -q "dcfcb7c0-01b8-4119-90ec-8b3f63ab2c0e" $SESSION_FILE
then
echo "------------------------------------------------------------------------------"
echo "ERROR: session id in $SESSION_FILE does not match current session "
echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster"
echo "and that you are not running multiple pipelines with the same --spark_work_dir"
echo "------------------------------------------------------------------------------"
exit 1
fi
export SPARK_ENV_LOADED=
export SPARK_HOME=/spark
export PYSPARK_PYTHONPATH_SET=
export PYTHONPATH="/spark/python"
export SPARK_LOG_DIR="/u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1"
. "/spark/sbin/spark-config.sh"
. "/spark/bin/load-spark-env.sh"
SPARK_LOCAL_IP=`hostname -i | rev | cut -d' ' -f1 | rev`
echo "Use Spark IP: $SPARK_LOCAL_IP"
echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 "
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64 &> /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log
Command exit status:
1
Command output:
Starting the spark driver
Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
Use Spark IP: 172.16.129.70
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=172.16.129.70 --conf spark.driver.bindAddress=172.16.129.70 --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
Command error:
INFO: Could not find any nv files on this host!
INFO: Converting SIF file to temporary sandbox...
Starting the spark driver
Checking for /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/.sessionId
Use Spark IP: 172.16.129.70
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/spark-defaults.conf --conf spark.driver.host=172.16.129.70 --conf spark.driver.bindAddress=172.16.129.70 --master spark://172.16.129.70:7077 --class org.janelia.stitching.ResaveAsSmallerTilesSpark --conf spark.executor.cores=16 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=16 --executor-memory 96g --conf spark.driver.cores=1 --driver-memory 12g /app/app.jar -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c0-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c2-n5.json -i /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/outputs/r1/stitching/c3-n5.json --size 64
INFO: Cleaning up image..
Work dir:
/u/home/f/f7xiesnm/try_multifish/multifish/work/b3/b86ba5188c95fab8d05a827c510a56
Environment
- EASI-FISH Pipeline version: latest
- Nextflow version: 22.10.7
- Container runtime: Singularity
- Platform: Local cluster
- Operating system: Linux
Additional context
Have you tried using more workers or giving more memory to a Spark worker?
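For reference, a minimal sketch of what relaunching with more Spark resources might look like; the parameter names below are assumptions, so verify the exact names in the pipeline's nextflow.config for your version:
# Hypothetical example only -- check nextflow.config for the real parameter names
./main.nf [your existing options] \
    --spark_workers 8 \
    --spark_worker_cores 16 \
    --spark_gb_per_core 8 \
    --spark_driver_memory 12g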
Thanks @cgoina! Yes, I am now trying those slowly, as each trial takes ~12 hrs to turn around. Which option do you think might be more useful: more memory per worker, or more workers?
@FangmingXie Either could work, but only if the process is actually running out of memory. Your exit code is 1, which usually does not indicate a memory issue. Can you attach the contents of retileImages.log so we can see the actual error? You'll see the path to retileImages.log in your output above.
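Something like the following should surface the relevant lines (standard shell tools; the path is taken from the output above):
# Show the end of the Spark driver log and any obvious errors
tail -n 100 /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log
grep -iE 'error|exception' /u/home/f/f7xiesnm/project-zipursky/easifish/lt185_stitch/spark/r1/retileImages.log | head -n 40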