[SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure
What changes were proposed in this pull request?
Problem: If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail. The error looks something like this:
WARNING - The process cannot access the file because it is being used by another process.
WARNING - The system cannot find the file C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
WARNING - The process cannot access the file because it is being used by another process.
Windows' %RANDOM% is seeded with 1-second granularity. We routinely start ~20 instances at the same time each day on Windows and hit this bug on a weekly basis.
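A rough demonstration of that seeding behavior, as a sketch run from a .cmd script (the output file names are placeholders):

```cmd
rem Demonstration sketch (placeholder file names): two fresh cmd.exe processes
rem started within the same second are seeded identically, so their first
rem RANDOM values usually collide, which is exactly the launcher-output collision.
start "" cmd /c "echo %%RANDOM%% > %temp%\rand-a.txt"
start "" cmd /c "echo %%RANDOM%% > %temp%\rand-b.txt"
rem rand-a.txt and rand-b.txt will very often contain the same number.
```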
Proposed fix: Instead of relying on %RANDOM%, which has poor granularity, use PowerShell to generate a GUID and append it to the temp file name (see the sketch below). We have been running this in production for around 2-3 months and have not encountered the bug since.
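A minimal sketch of the approach in spark-class2.cmd (illustrative, not the verbatim diff; LAUNCHER_OUTPUT is the variable the script already uses):

```cmd
rem Sketch of the proposed change (illustrative, not the verbatim diff):
rem ask PowerShell for a GUID and build the launcher-output file name from it,
rem instead of relying on the second-granularity RANDOM value.
for /f "usebackq tokens=*" %%g in (`powershell -NoProfile -Command "[guid]::NewGuid().ToString()"`) do (
  set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%%g.txt
)
```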
Why are the changes needed?
My team runs Spark on Windows, and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and took steps to mitigate it without changing the Spark source code, such as adding a random sleep of 1-300 seconds before starting Spark. Even with a random sleep, some of the 20+ instances are likely to sleep a similar amount of time and start at the same moment. Relying on a random sleep is also clunky, unreliable, and not a deterministic way to avoid the issue.
Eventually our team went ahead and edited the code in this .cmd file with this fix. I figured I should make a pull request for this as well.
Does this PR introduce any user-facing change?
no
How was this patch tested?
You can reliably reproduce this bug by submitting ~30 Spark jobs on Windows at roughly the same time using spark-submit; eventually one launcher invocation will pick the same output file name as another and fail. A sketch of such a reproduction follows.
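For reference, a hypothetical reproduction sketch (the example class, jar path, and SPARK_HOME usage are placeholders, not part of this PR):

```cmd
rem Hypothetical reproduction sketch: launch many jobs at once so their
rem launcher-output temp file names (built from the RANDOM value) collide.
rem The class and jar path below are placeholders; substitute any real job.
for /L %%i in (1,1,30) do (
  start "" cmd /c "%SPARK_HOME%\bin\spark-submit --class org.apache.spark.examples.SparkPi %SPARK_HOME%\examples\jars\spark-examples.jar 100"
)
```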
With my fixed spark-class2.cmd, running the same test again should produce no occurrences of this bug.
Was this patch authored or co-authored using generative AI tooling?
no
I enabled GitHub Actions on my forked repo, but it's not clear to me how to re-trigger the failed build now that I've done that. Perhaps a maintainer will know how to do this. I haven't used GitHub's CI/CD before.
I also want to note that someone attempted to fix this issue in 2019 with https://github.com/apache/spark/pull/25076. That PR was merged, but it does not appear to fix the issue: we are using Spark 3.2.1, whose spark-class2.cmd includes that change, and we encounter this bug all the same.
I wanted to chime in again and note that my team has been using this patched version of spark-class2.cmd to launch our Spark instances for a couple of weeks now and has not encountered the aforementioned bug anymore, or any other launch issues.
Obviously this change doesn't truly *fix* the underlying issue: you could still launch two Spark instances within the same 10 ms window and hit the bug, but that is far less likely than launching two instances within the same second. My team launches around 20 instances on Windows within a couple of seconds of each other every day, and we were seeing this issue almost daily until this patch.
Let me ask the dev mailing list and see if we can have others test this patch
Sent https://lists.apache.org/thread/9y0xrxo08cv3nt27wt40jrpncgzz1kt1
Friendly ping @panbingkun: if convenient (for example through offline communication), please help verify this patch on Windows. Thanks ~
Okay, I'll verify it a little later.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
I'm very sorry, I lost track of this. I will verify it on a Windows machine this weekend.
remove the Stale tag
Hello all, I'm going to make one final edit on this PR. We found that `set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt` still resulted in the occasional race-condition error, since we launch tens of Spark instances at the same time each day on Windows; `%RANDOM%%TIME::=0%` still has a small chance of colliding. Our team found that appending a GUID to the temp file name completely fixed this bug and removed the need for %RANDOM% and %TIME% entirely.
We've been using this new approach for months now without a single issue. Apologies for not updating the PR sooner.
Alright, that should be my final edit. Like I said, we've had no issues since applying this fix a few months ago, and we have since increased the number of concurrent Spark instances that we launch on Windows.
My Windows version is Windows 10. This does generate a unique output file name (I added some print logs to confirm), and spark-shell starts normally.
Bumping this PR - any other concerns?
I think there is no problem with the overall logic, but I am a little worried about the UUID generation.
I suggest using python to generate the UUID, because our existing logic already uses python to determine SPARK_HOME (see the link below).
That would avoid a dependency on an external tool. WDYT? @HyukjinKwon @LuciferYang
https://github.com/apache/spark/blob/da92293f9ce0be1ac283c4a5d769af550abf7031/bin/find-spark-home.cmd#L21-L59
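Purely for illustration, a rough sketch of that python-based alternative (not what this PR implements):

```cmd
rem Rough sketch of the python-based alternative (illustrative only), reusing the
rem python interpreter that find-spark-home.cmd already locates:
for /f "usebackq tokens=*" %%u in (`python -c "import uuid; print(uuid.uuid4())"`) do (
  set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%%u.txt
)
```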
The problem is that python is not bundled with Windows, IIRC ..
Thanks folks, and sorry for the delay; I got sidetracked and kept forgetting to check this issue. As mentioned, my company has been using this exact code in our production process for some months now; that process starts tens of Spark instances in parallel to run a host of jobs. As far as I'm concerned it has been tested thoroughly, but I'm not sure whether you have other standard tests that you run. Given that this is a launch script that doesn't get edited often, I imagine it doesn't have a standard testing procedure.
Also, I don't have permission to merge this, so if one of the reviewers could merge it, that would be great. Thanks!
Merged to master.