[SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure

Rafnel opened this pull request • 16 comments

What changes were proposed in this pull request?

Problem: If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output temp file will get the same name for multiple instances, causing the Spark launcher to fail. The error looks something like this:

WARNING - The process cannot access the file because it is being used by another process.
WARNING - The system cannot find the file C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
WARNING - The process cannot access the file because it is being used by another process.

Windows' %RANDOM% is seeded with 1-second granularity, so launcher scripts started within the same second draw the same value. We often start ~20 instances at the same time daily on Windows and encounter this bug on a weekly basis.
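
A minimal demo of the root cause (illustrative only, not part of the patch). Saved as a .cmd script, the two child cmd.exe processes below typically print the same %RANDOM% value, because each process seeds its generator from the clock at one-second resolution:

rem Hypothetical demo, not part of the patch: two cmd.exe processes started in the
rem same second typically report the same first %RANDOM% value.
start "" cmd /c "echo %%RANDOM%% > out1.txt"
start "" cmd /c "echo %%RANDOM%% > out2.txt"
rem Wait for both children to finish, then compare the two values.
timeout /t 2 > nul
type out1.txt out2.txt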

Proposed fix: Instead of relying on %RANDOM%, which has poor granularity, use PowerShell to generate a GUID and append it to the end of the temp file name. We have been using this in production for around 2-3 months and have not encountered this bug since. A minimal sketch of the approach is shown below.
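
A minimal sketch of the relevant lines in spark-class2.cmd (illustrative only; LAUNCHER_UID is a placeholder variable name and the merged diff may differ):

rem Ask PowerShell for a GUID and use it to build a unique launcher output file name.
rem LAUNCHER_UID is a placeholder name, not necessarily the one used in the actual patch.
for /f "delims=" %%i in ('powershell -NoProfile -Command "[guid]::NewGuid().ToString()"') do set "LAUNCHER_UID=%%i"
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%LAUNCHER_UID%.txt

Because GUID collisions are effectively impossible, two launchers started in the same second (or even the same millisecond) no longer race for the same temp file.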

Why are the changes needed?

My team runs Spark on Windows and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and tried to mitigate it without changing the Spark source code, for example by adding a random sleep of 1-300 seconds before starting Spark. Even with a random sleep, 20+ instances can still end up sleeping for a similar amount of time and starting simultaneously. Relying on a random sleep before starting Spark is also clunky, unreliable, and not a deterministic way to avoid this issue.

Eventually our team went ahead and edited the code in this .cmd file with this fix. I figured I should make a pull request for this as well.

Does this PR introduce any user-facing change?

no

How was this patch tested?

You can pretty reliably recreate this bug by submitting around 30 Spark jobs concurrently on Windows using spark-submit. Eventually one Spark launcher's output file will overlap with another's and the launch will fail. A sketch of such a reproduction loop is shown below.
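
A hypothetical reproduction script, saved as a .cmd file (the master, class, and application jar below are placeholders, not taken from this PR):

rem Launch 30 spark-submit processes back to back so that several start within the same second.
rem The application is a placeholder; any Spark job available on the machine will do.
for /l %%i in (1,1,30) do (
  start "" cmd /c spark-submit --master local[2] --class org.example.YourApp C:\jobs\your-app.jar
)

With the unpatched script, some of these launches should fail with the "process cannot access the file" warnings shown above.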

You can then pull my fixed spark-class2.cmd, try the same thing again, and there should be no occurrences of this bug.

Was this patch authored or co-authored using generative AI tooling?

no

Rafnel avatar Nov 07 '23 21:11 Rafnel

I enabled GitHub Actions on my forked repo, but it's not clear to me how to re-kick the failed build now that I've done that. Perhaps a maintainer will know how to do this; I haven't used GitHub's CI/CD before.

Rafnel avatar Nov 07 '23 21:11 Rafnel

I also wanted to note that someone attempted to fix this issue in 2019 with https://github.com/apache/spark/pull/25076, which was merged but does not appear to fix the problem. We are using Spark 3.2.1, whose spark-class2.cmd already includes that change, yet we encounter this bug all the same.

Rafnel avatar Nov 07 '23 22:11 Rafnel

Wanted to chime in again and note that my team has been using this patched version of spark-class2.cmd to launch our Spark instances for a couple of weeks now and has not encountered the aforementioned bug again, or any other launch issues.

Obviously this change doesn't truly fix the underlying issue, as you could still launch two Spark instances within the same 10 ms window and hit this bug, but that is much less likely than launching two instances within the same second. My team launches around 20 instances on Windows within a couple of seconds of each other each day, and we were seeing this issue almost daily until this patch.

Rafnel avatar Nov 20 '23 16:11 Rafnel

Let me ask the dev mailing list and see if we can have others test this patch.

HyukjinKwon avatar Nov 22 '23 00:11 HyukjinKwon

Sent https://lists.apache.org/thread/9y0xrxo08cv3nt27wt40jrpncgzz1kt1

HyukjinKwon avatar Nov 22 '23 01:11 HyukjinKwon

Friendly ping @panbingkun: if convenient (for example through offline communication), could you please help verify this patch on Windows? Thanks ~

LuciferYang avatar Nov 29 '23 11:11 LuciferYang

> Friendly ping @panbingkun: if convenient (for example through offline communication), could you please help verify this patch on Windows? Thanks ~

Okay, I'll verify it a little later.

panbingkun avatar Dec 04 '23 01:12 panbingkun

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Mar 14 '24 00:03 github-actions[bot]

I'm very sorry, I lost track of this. I will verify it on a Windows machine this weekend.

panbingkun avatar Mar 14 '24 00:03 panbingkun

remove the Stale tag

LuciferYang avatar Mar 14 '24 16:03 LuciferYang

Hello all, I'm going to make one final edit to this PR. We found that set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt still resulted in occasional race-condition errors, since we launch tens of Spark instances at the same time each day on Windows and %RANDOM%%TIME::=0% still has a small chance of overlapping. Our team found that using a GUID at the end of the temp file name completely fixed this bug and removed the need for %RANDOM% and %TIME% entirely.

We've been using this new approach for months now without a single issue. Apologies for not updating the PR sooner.

Rafnel avatar Mar 14 '24 17:03 Rafnel

Alright, that should be my final edit. Like I said, we've had no issues since applying this fix a few months ago and we have since increased the amount of concurrent Spark instances that we launch on Windows.

Rafnel avatar Mar 14 '24 17:03 Rafnel

My Windows version is Windows 10. This can indeed generate a unique output file name (I added some print logs; see the attached screenshots).

And spark-shell starts normally.

panbingkun avatar Mar 16 '24 14:03 panbingkun

Bumping this PR - any other concerns?

Rafnel avatar Apr 29 '24 20:04 Rafnel

I think there is no problem with the overall logic, but I am a little worried about the UUID generation. I suggest using Python to generate the UUID (a rough sketch is below), because our existing logic already uses Python to resolve SPARK_HOME, so this avoids depending on additional external tools. WDYT? @HyukjinKwon @LuciferYang https://github.com/apache/spark/blob/da92293f9ce0be1ac283c4a5d769af550abf7031/bin/find-spark-home.cmd#L21-L59
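
A rough sketch of that Python-based alternative (illustrative only; this is not what was merged, and LAUNCHER_UID is a placeholder name):

rem Ask Python for a UUID instead of PowerShell; assumes python is available on PATH.
for /f "delims=" %%i in ('python -c "import uuid; print(uuid.uuid4())"') do set "LAUNCHER_UID=%%i"
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%LAUNCHER_UID%.txt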

panbingkun avatar Apr 30 '24 02:04 panbingkun

The problem is that Python is not bundled with Windows, IIRC.

HyukjinKwon avatar Apr 30 '24 02:04 HyukjinKwon

Thanks folks, sorry for the delay; I got sidetracked and kept forgetting to check this issue again. As mentioned, my company has been using this exact code in our production process, which starts up tens of Spark instances in parallel to run a host of parallel jobs, for some months now. As far as I'm concerned it has been tested thoroughly, but I'm not sure whether you have other standard tests that you run. Given that this is a launch script that doesn't get edited often, I imagine it doesn't have a standard testing procedure.

Also, I don't have permissions to merge this, so if one of the reviewers can merge it, that would be great. Thanks!

Rafnel avatar May 28 '24 15:05 Rafnel

Merged to master.

HyukjinKwon avatar May 29 '24 02:05 HyukjinKwon