If a GSProcessing job fails at the re-partition step, follow-up jobs fail without a clear reason
When users set --do-repartition True for a GSProcessing job, there's a chance that the re-partition step will fail.
Within GSProcessing we don't fail the entire job when that happens, so as not to waste the processing work that has already been done; instead we log an error/warning.
Without looking at the logs, an end user cannot know that they will need to run the follow-up re-partition job independently.
If they try to run the DistPart job, they could get an error about a reshape operation failing, without a clear reason why.
We should provide persistent indicators for failed re-partition jobs that:
- Allow users to know something is wrong when looking at the file output.
- Allow us to fail the DistPart job early with a more descriptive error message.
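One possible shape for such an indicator is a small status file written next to the job output. The sketch below is only an illustration under stated assumptions: the file name `repartition_status.json`, both function names, and the use of a local output directory are hypothetical and are not existing GSProcessing or DistPart behavior.

```python
import json
import os
import tempfile

# Hypothetical marker file name; not an existing GSProcessing convention.
STATUS_FILE = "repartition_status.json"


def write_repartition_status(output_dir, success, message=""):
    """Record the outcome of the re-partition step next to the job output."""
    status = {"repartition_succeeded": success, "message": message}
    with open(os.path.join(output_dir, STATUS_FILE), "w", encoding="utf-8") as f:
        json.dump(status, f)


def check_repartition_status(output_dir):
    """Raise early if a previous re-partition step was recorded as failed."""
    path = os.path.join(output_dir, STATUS_FILE)
    if not os.path.exists(path):
        # No marker: re-partitioning was not requested, or the output
        # was produced before the marker existed.
        return
    with open(path, encoding="utf-8") as f:
        status = json.load(f)
    if not status.get("repartition_succeeded", False):
        raise RuntimeError(
            f"The re-partition step failed for the data in {output_dir}: "
            f"{status.get('message', 'no details recorded')}. "
            "Run the re-partition job separately before starting DistPart."
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        # Simulate a GSProcessing run whose re-partition step failed.
        write_repartition_status(out_dir, False, "example failure")
        try:
            check_repartition_status(out_dir)
        except RuntimeError as err:
            print(err)
```

A file like this covers both bullets: users see it when listing the output, and the DistPart entry point can read it before doing any work. Output stored on S3 would need the equivalent read and write calls for that storage.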
Also, we should include checks at the beginning of the DistPart job that verify the expected shape of label/mask vectors and provide descriptive errors.
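A minimal sketch of such a shape check follows. It assumes the expected number of rows for each label/mask vector equals the number of nodes or edges of its type, and that the row counts have already been read from the processed files. The function name and the dictionary layout are hypothetical, not part of DistPart.

```python
def check_label_mask_rows(expected_rows, actual_rows):
    """Compare row counts of label/mask vectors against the expected counts.

    expected_rows: {type_name: expected number of rows}  (assumed layout)
    actual_rows:   {type_name: {vector_name: rows found}} (assumed layout)
    """
    problems = []
    for type_name, vectors in actual_rows.items():
        expected = expected_rows.get(type_name)
        if expected is None:
            problems.append(f"type '{type_name}' has no expected row count")
            continue
        for vector_name, rows in vectors.items():
            if rows != expected:
                problems.append(
                    f"'{vector_name}' for type '{type_name}' has {rows} rows, "
                    f"expected {expected}"
                )
    if problems:
        raise ValueError(
            "Label/mask vectors do not match the expected shape; the "
            "re-partition step may have failed or not been run: "
            + "; ".join(problems)
        )


if __name__ == "__main__":
    expected = {"paper": 100}
    found = {"paper": {"label": 100, "train_mask": 90}}
    try:
        check_label_mask_rows(expected, found)
    except ValueError as err:
        print(err)
```

Collecting every mismatch before raising lets a single error list all affected vectors, rather than stopping at the first one.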