PSReadLine icon indicating copy to clipboard operation
PSReadLine copied to clipboard

Simultaneous pwsh instance start Mutex.CreateMutexCore IOException race condition

Open darthwalsh opened this issue 3 years ago • 20 comments

Environment

PSReadLine: 2.1.0
PowerShell: 7.1.3
OS: Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
BufferWidth: 176
BufferHeight: 91

Exception report

iTerm session 3:

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
   at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
   at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)

iTerm session 4:

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
   at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
   at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)

Steps to reproduce

(Steps that shouldn't matter: My macbook was asleep in clamshell mode, I unplugged the display connections, and the OS crashed. I logged in and submitted the bug report to Apple, again.)

I saw this message on two tabs when iTerm2 was trying to restore a terminal window with four tab. The first two tabs did not have any error.

iTerm session restore caused 4 pwsh instances to launch at the same time, and got these two crashes from different pwsh instances.

Expected behavior

Launching multiple pwsh instances at the exact same time should not show error.

Actual behavior

Tabs 3 and 4 each showed the exception message above.


Running some debug commands, there are a lot of PSReadLineHistoryFile_2886463743 files created at nearly the exact same millisecond:

➜ ~ gci -r /tmp/.dotnet/shm

    Directory: /tmp/.dotnet/shm

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----            7/8/2021 10:59 AM                session1999
d----            7/8/2021 12:55 PM                session5595
d----            7/8/2021 10:15 AM                session705
d----            7/8/2021 10:15 AM                session710
d----            7/8/2021 10:15 AM                session715
d----            7/8/2021 10:15 AM                session721

    Directory: /tmp/.dotnet/shm/session715

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session705

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session5595

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 12:44 PM           4096 PSReadLineHistoryFile_2886463743
-----            7/8/2021 12:55 PM           4096 PSReadLineHistoryFile_3946503603

    Directory: /tmp/.dotnet/shm/session1999

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:59 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session721

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session710

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743
➜ ~ gci -r -file /tmp/.dotnet/shm | % { $_.LastWriteTime.ToString("MM/dd/yyyy hh:mm:ss.ffffff tt") } | sort-object
07/08/2021 10:15:57.403067 AM
07/08/2021 10:15:57.403165 AM
07/08/2021 10:15:57.425071 AM
07/08/2021 10:15:57.425950 AM
07/08/2021 10:59:23.698350 AM
07/08/2021 12:44:43.666927 PM
07/08/2021 12:55:35.784985 PM

My 2 cents:

  • This is not a duplicate of #1464. My macOS permissions on /private/tmp are fine, and manually opening a pwsh instances always works.
    • Changing the exception message in dotnet/runtime#36823 won't fix the problem of not showing exception message.
  • Whatever code that is in charge of picking mutex names is using the same mutex name.
  • On Windows, I've seen System.Threading.Mutex throw occasionally if the mutex name is contended.
    • The .ctor() documentation says IOException can happen when some other error happens.
  • Maybe pwsh could make the mutex name more unique across instances?
  • Maybe pwsh could use a retry loop around mutex code that sometimes throws?

darthwalsh avatar Jul 08 '21 20:07 darthwalsh

(Issue happened again, after upgrading macOS to 11.5. Had 4 terminal windows that started pwsh simultaneously, and two threw the same way IOException on PSReadLineHistoryFile_2886463743.)

I tried to reproduce the failure by spawning several pwsh from zsh

for i in {1..39}; do; pwsh -nop -c '([datetime]::now).TimeOfDay.TotalMilliseconds' >> b.txt &; done

but even though a few pairs of processes started within the same millisecond, I don't get any IOException. I'm not sure that test loads PSReadLine though. Or, maybe the problem is related to trying to acquire a mutex soon after system startup.

darthwalsh avatar Jul 22 '21 15:07 darthwalsh

It is dup https://github.com/PowerShell/PSReadLine/issues/1464#issuecomment-631941511

iSazonov avatar Jul 22 '21 15:07 iSazonov

@iSazonov sorry I don't understand why you think it is a duplicate. Here's why I said it is not a duplicate:

  • This is not a duplicate of #1464. My macOS permissions on /private/tmp are fine, and manually opening a pwsh instances always works.
    • Changing the exception message in dotnet/runtime#36823 won't fix the problem of not showing exception message.

darthwalsh avatar Jul 26 '21 18:07 darthwalsh

My bad.

Perhaps it is "by design" in .Net. https://github.com/dotnet/runtime/blob/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.Unix.cs#L169-L174 The comment says they addresses only thread race conditions but not inter process ones.

I believe it makes sense to open new issue in .Net Runtime repo and ask there.

iSazonov avatar Jul 27 '21 04:07 iSazonov

Oh, good find. Do you have a sense of whether dotnet runtime might fix that bug? Agree it seems likely "by design," but I could open the same bug there.

(I was having some trouble understanding which resource the mutex is guarding. Before I was thinking the resource is i.e. the file /tmp/.dotnet/shm/session5595/PSReadLineHistoryFile_2886463743 -- then the mutex name hash could include the session ID in the name of the mutex, and that should further decrease contention and avoid this error. But I don't see any PSReadLine code interacting with this file, I'm guessing that's literally the dotnet implementation of the named mutex library. Instead 2886463743 looks to be the hash of /Users/walshca/.local/share/powershell/PSReadLine/ConsoleHost_history.txt so my suggestion about making the mutex use a more unique name is wrong.)

I see the other code interacting with the mutex has retry loops. If we understand mutex creation has by-design some chance of throwing, should PSReadLine also use a retry loop when creating the mutex?

darthwalsh avatar Jul 27 '21 15:07 darthwalsh

As for PSRL @daxian-dbw can add more info and make a conclusion about using a loop in the code. I think it makes sense to open request in .Net Runtime repository. They could add a loop too.

iSazonov avatar Jul 27 '21 16:07 iSazonov

@darthwalsh and @iSazonov Thank you both for the investigation so far. I will need to get a macOS to try reproducing the issue. A couple clarification:

  1. the mutex is to guard the read/write access of the history file, which is shared by all pwsh instances for a given user.
  2. the re-try we have is for abandoned mutex, which we understand why a retry is needed in that case -- the thread owns the mutex, however, it was abandoned by the previous owner thread, so releasing and then retry is needed. But for this IOException, we have no idea what causes it and thus I don't want to blindly add the retry logic.

I'd love to understand the root cause before making any changes in the code.

daxian-dbw avatar Sep 20 '21 18:09 daxian-dbw

This issue is quite annoying for dev testing powershell builds on macOS. Workaround is to use -noninteractive so that PSReadLine doesn't automatically get loaded.

It seems the problem is the way a mutex is implemented is different across Windows, Linux, and macOS. On macOS it's a file lock: https://github.com/dotnet/runtime/issues/5211#issuecomment-620113032. So if access to that file can't be retrieved, you get the IOException.

SteveL-MSFT avatar Jul 12 '22 03:07 SteveL-MSFT

I think an alternative solution is if the mutex can't be created, just have that instance not save to history and put out a warning message.

SteveL-MSFT avatar Jul 12 '22 04:07 SteveL-MSFT

I have a potential fix, but no longer able to repro the issue. Probably because it's a race condition and my changes have changed the timing, but until I can verify it improves it, I'll hold off on a PR

SteveL-MSFT avatar Jul 27 '22 16:07 SteveL-MSFT

@SteveL-MSFT Printing a warning message that this powershell instance will not save history isn't much of an improvement (in my opinion) -- if I saw that, I would always manually close and reopen that pwsh instance -- which is what I did after seeing the IOException.

But yeah, being able to reproduce the problem on demand is something I wish was easier. What the steps seem to be:

  • having iTerm2 set to reopen tabs after system restart, and 4 tabs open (it could help to open dozens of tabs)
  • either crash the operating system, OR do an operating system upgrade
  • sometimes this triggers the IOException

(I retried my earlier "tight loop" but with a C# app just calling new Mutex( -- launched 4000 times at once, but didn't have any errors.)

darthwalsh avatar Jul 27 '22 23:07 darthwalsh

@darthwalsh PSReadLine currently already has some retry logic. So if that fails, I'm not sure what we can do other than increasing the number of retries, but that's not guaranteed to succeed. Currently if the file lock cannot be obtained, PSReadLine just repeatedly informs you and is not usable. I thought I was more easily repro'ing it with a nested process, but it no longer repros for me. I don't believe I've seen the issue with opening a extra tab in iTerm2 either. If we can figure out a consistent repro, the proper fix would need to be in .NET.

SteveL-MSFT avatar Jul 28 '22 03:07 SteveL-MSFT

@SteveL-MSFT I agree a proper fix in .NET seems much better, but to be sure I want to know what happens in a retry loop. Something like: if retrying after 1 ms brings the failure chance from 1% to 0.01%, then to me it makes sense to put the retry in Mutex. But if macOS just refuses to allow new mutex creation for a full 2 second period after system upgrade, I wouldn't want my framework code to block in the Mutex ctor waiting for that.

daxian-dbw pointed out the existing PSReadLine retry logic is for abandoned mutex, not for IOException.

darthwalsh avatar Jul 29 '22 13:07 darthwalsh

@darthwalsh that is true, perhaps we could try adding to the retry logic with an IOException, but I can't verify such a change right now since it's no longer repro'ing on my system.

SteveL-MSFT avatar Aug 01 '22 04:08 SteveL-MSFT

After I upgraded macOS to 12.6, only one out of 6 of my iTerm pwsh instances failed. Basically identical:

OS: Darwin 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'

I think the best way to get a consistent repro would be to snapshot an older macOS in a VM, upgrade the guest OS, with iTerm reopening dozens of pwsh tabs. But I'm not familiar the tools to do this.

darthwalsh avatar Oct 19 '22 17:10 darthwalsh

Ok, I ran another experiment with the upgrade to macOS 12.6.1 and I was able to repro this failure without pwsh or PSReadLine; if we ask the owner of dotnet BCL what the Muetx exception policy should be I hope this added context would help.


Repo Program.cs with a net6.0 console app:

try {
  using (var m = new Mutex(false, "darthwalsh_PSReadLine_issues_2658")) {
    Console.Write("created but didn't take Mutex darthwalsh_PSReadLine_issues_2658   ...");
    var line = Console.ReadLine();
    Console.WriteLine("QUITTING: " + line);
  }
} catch (Exception e) {
  Console.WriteLine(e.ToString());

  var line = Console.ReadLine();
  Console.WriteLine("QUITTING: " + line);
}

Built with dotnet publish -c Release --self-contained -a x64

I created a new iTerm profile that just runs command /Users/walshca/code/temp/MutexThrow/bin/Release/net6.0/osx-x64/publish/MutexThrow and opened 30 tabs with this. Then I upgraded the macOS version to 12.6.1, and after reboot only 1 tab had the exception:

System.IO.IOException: The system cannot open the device or file specified. : 'darthwalsh_PSReadLine_issues_2658'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Program.<Main>$(String[] args) in /Users/walshca/code/temp/MutexThrow/Program.cs:line 2

(I expected about a quarter, so something about this minimal program doesn't match what pwsh is doing).

I'll open some more tabs of this, and leave it running for a few more OS reboots to see if I can repro it without OS upgrade.

darthwalsh avatar Oct 31 '22 16:10 darthwalsh

@darthwalsh did you already open an issue in https://github.com/dotnet/runtime?

SteveL-MSFT avatar Dec 06 '22 05:12 SteveL-MSFT

@SteveL-MSFT thanks for the nudge, I had been putting off rebooting but I just updated parallels today...

Created dotnet/runtime#79375

darthwalsh avatar Dec 08 '22 01:12 darthwalsh

I was going to raise a new bug but I think it's the same issue. In which case I have an easier way to reproduce it (on linux).

Log on to system as a user su to root Run pwsh as root su to another user run pwsh again

Example given below:

[fastdruid@localhost ~]$ sudo su - root [sudo] password for fastdruid: [root@localhost ~]# pwsh PowerShell 7.3.4 PS /root> su - fastdruid [fastdruid@localhost ~]$ pwsh PowerShell 7.3.4 PS /home/fastdruid> Oops, something went wrong. Please report this bug with ALL the details below, including both the 'Environment' and 'Exception' sections. Please report on GitHub: https://github.com/PowerShell/PSReadLine/issues/new?template=Bug_Report.yaml Thank you!

Environment

PSReadLine: 2.2.6 PowerShell: 7.3.4 OS: Linux 4.18.0-425.3.1.el8.x86_64 #1 SMP Wed Nov 9 20:13:27 UTC 2022 BufferWidth: 153 BufferHeight: 46

Last 0 Keys:

Exception

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_1202693972' at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew) at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name) at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize() at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics) at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken, Nullable`1 lastRunStatus)

Fastdruid avatar Apr 25 '23 09:04 Fastdruid

@Fastdruid that looks like it's a different issue: #3692

darthwalsh avatar Dec 22 '23 22:12 darthwalsh