PSReadLine
PSReadLine copied to clipboard
Simultaneous pwsh instance start Mutex.CreateMutexCore IOException race condition
Environment
PSReadLine: 2.1.0
PowerShell: 7.1.3
OS: Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
BufferWidth: 176
BufferHeight: 91
Exception report
iTerm session 3:
System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)
iTerm session 4:
System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)
Steps to reproduce
(Steps that shouldn't matter: My macbook was asleep in clamshell mode, I unplugged the display connections, and the OS crashed. I logged in and submitted the bug report to Apple, again.)
I saw this message on two tabs when iTerm2 was trying to restore a terminal window with four tab. The first two tabs did not have any error.
iTerm session restore caused 4 pwsh instances to launch at the same time, and got these two crashes from different pwsh instances.
Expected behavior
Launching multiple pwsh instances at the exact same time should not show error.
Actual behavior
Tabs 3 and 4 each showed the exception message above.
Running some debug commands, there are a lot of PSReadLineHistoryFile_2886463743
files created at nearly the exact same millisecond:
➜ ~ gci -r /tmp/.dotnet/shm
Directory: /tmp/.dotnet/shm
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 7/8/2021 10:59 AM session1999
d---- 7/8/2021 12:55 PM session5595
d---- 7/8/2021 10:15 AM session705
d---- 7/8/2021 10:15 AM session710
d---- 7/8/2021 10:15 AM session715
d---- 7/8/2021 10:15 AM session721
Directory: /tmp/.dotnet/shm/session715
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 10:15 AM 4096 PSReadLineHistoryFile_2886463743
Directory: /tmp/.dotnet/shm/session705
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 10:15 AM 4096 PSReadLineHistoryFile_2886463743
Directory: /tmp/.dotnet/shm/session5595
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 12:44 PM 4096 PSReadLineHistoryFile_2886463743
----- 7/8/2021 12:55 PM 4096 PSReadLineHistoryFile_3946503603
Directory: /tmp/.dotnet/shm/session1999
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 10:59 AM 4096 PSReadLineHistoryFile_2886463743
Directory: /tmp/.dotnet/shm/session721
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 10:15 AM 4096 PSReadLineHistoryFile_2886463743
Directory: /tmp/.dotnet/shm/session710
Mode LastWriteTime Length Name
---- ------------- ------ ----
----- 7/8/2021 10:15 AM 4096 PSReadLineHistoryFile_2886463743
➜ ~ gci -r -file /tmp/.dotnet/shm | % { $_.LastWriteTime.ToString("MM/dd/yyyy hh:mm:ss.ffffff tt") } | sort-object
07/08/2021 10:15:57.403067 AM
07/08/2021 10:15:57.403165 AM
07/08/2021 10:15:57.425071 AM
07/08/2021 10:15:57.425950 AM
07/08/2021 10:59:23.698350 AM
07/08/2021 12:44:43.666927 PM
07/08/2021 12:55:35.784985 PM
My 2 cents:
- This is not a duplicate of #1464. My macOS permissions on /private/tmp are fine, and manually opening a pwsh instances always works.
- Changing the exception message in dotnet/runtime#36823 won't fix the problem of not showing exception message.
- Whatever code that is in charge of picking mutex names is using the same mutex name.
- On Windows, I've seen
System.Threading.Mutex
throw occasionally if the mutex name is contended.- The .ctor() documentation says
IOException
can happen when some other error happens.
- The .ctor() documentation says
- Maybe pwsh could make the mutex name more unique across instances?
- Maybe pwsh could use a retry loop around mutex code that sometimes throws?
(Issue happened again, after upgrading macOS to 11.5. Had 4 terminal windows that started pwsh simultaneously, and two threw the same way IOException
on PSReadLineHistoryFile_2886463743
.)
I tried to reproduce the failure by spawning several pwsh
from zsh
for i in {1..39}; do; pwsh -nop -c '([datetime]::now).TimeOfDay.TotalMilliseconds' >> b.txt &; done
but even though a few pairs of processes started within the same millisecond, I don't get any IOException
. I'm not sure that test loads PSReadLine though. Or, maybe the problem is related to trying to acquire a mutex soon after system startup.
It is dup https://github.com/PowerShell/PSReadLine/issues/1464#issuecomment-631941511
@iSazonov sorry I don't understand why you think it is a duplicate. Here's why I said it is not a duplicate:
- This is not a duplicate of #1464. My macOS permissions on /private/tmp are fine, and manually opening a pwsh instances always works.
- Changing the exception message in dotnet/runtime#36823 won't fix the problem of not showing exception message.
My bad.
Perhaps it is "by design" in .Net. https://github.com/dotnet/runtime/blob/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.Unix.cs#L169-L174 The comment says they addresses only thread race conditions but not inter process ones.
I believe it makes sense to open new issue in .Net Runtime repo and ask there.
Oh, good find. Do you have a sense of whether dotnet runtime might fix that bug? Agree it seems likely "by design," but I could open the same bug there.
(I was having some trouble understanding which resource the mutex is guarding. Before I was thinking the resource is i.e. the file /tmp/.dotnet/shm/session5595/PSReadLineHistoryFile_2886463743
-- then the mutex name hash could include the session ID in the name of the mutex, and that should further decrease contention and avoid this error. But I don't see any PSReadLine code interacting with this file, I'm guessing that's literally the dotnet implementation of the named mutex library. Instead 2886463743
looks to be the hash of /Users/walshca/.local/share/powershell/PSReadLine/ConsoleHost_history.txt
so my suggestion about making the mutex use a more unique name is wrong.)
I see the other code interacting with the mutex has retry loops. If we understand mutex creation has by-design some chance of throwing, should PSReadLine also use a retry loop when creating the mutex?
As for PSRL @daxian-dbw can add more info and make a conclusion about using a loop in the code. I think it makes sense to open request in .Net Runtime repository. They could add a loop too.
@darthwalsh and @iSazonov Thank you both for the investigation so far. I will need to get a macOS to try reproducing the issue. A couple clarification:
- the mutex is to guard the read/write access of the history file, which is shared by all pwsh instances for a given user.
- the re-try we have is for abandoned mutex, which we understand why a retry is needed in that case -- the thread owns the mutex, however, it was abandoned by the previous owner thread, so releasing and then retry is needed. But for this
IOException
, we have no idea what causes it and thus I don't want to blindly add the retry logic.
I'd love to understand the root cause before making any changes in the code.
This issue is quite annoying for dev testing powershell builds on macOS. Workaround is to use -noninteractive
so that PSReadLine doesn't automatically get loaded.
It seems the problem is the way a mutex is implemented is different across Windows, Linux, and macOS. On macOS it's a file lock: https://github.com/dotnet/runtime/issues/5211#issuecomment-620113032. So if access to that file can't be retrieved, you get the IOException.
I think an alternative solution is if the mutex can't be created, just have that instance not save to history and put out a warning message.
I have a potential fix, but no longer able to repro the issue. Probably because it's a race condition and my changes have changed the timing, but until I can verify it improves it, I'll hold off on a PR
@SteveL-MSFT Printing a warning message that this powershell instance will not save history isn't much of an improvement (in my opinion) -- if I saw that, I would always manually close and reopen that pwsh instance -- which is what I did after seeing the IOException.
But yeah, being able to reproduce the problem on demand is something I wish was easier. What the steps seem to be:
- having iTerm2 set to reopen tabs after system restart, and 4 tabs open (it could help to open dozens of tabs)
- either crash the operating system, OR do an operating system upgrade
- sometimes this triggers the IOException
(I retried my earlier "tight loop" but with a C# app just calling new Mutex(
-- launched 4000 times at once, but didn't have any errors.)
@darthwalsh PSReadLine currently already has some retry logic. So if that fails, I'm not sure what we can do other than increasing the number of retries, but that's not guaranteed to succeed. Currently if the file lock cannot be obtained, PSReadLine just repeatedly informs you and is not usable. I thought I was more easily repro'ing it with a nested process, but it no longer repros for me. I don't believe I've seen the issue with opening a extra tab in iTerm2 either. If we can figure out a consistent repro, the proper fix would need to be in .NET.
@SteveL-MSFT I agree a proper fix in .NET seems much better, but to be sure I want to know what happens in a retry loop. Something like: if retrying after 1 ms brings the failure chance from 1% to 0.01%, then to me it makes sense to put the retry in Mutex
. But if macOS just refuses to allow new mutex creation for a full 2 second period after system upgrade, I wouldn't want my framework code to block in the Mutex ctor waiting for that.
daxian-dbw pointed out the existing PSReadLine retry logic is for abandoned mutex, not for IOException
.
@darthwalsh that is true, perhaps we could try adding to the retry logic with an IOException
, but I can't verify such a change right now since it's no longer repro'ing on my system.
After I upgraded macOS to 12.6, only one out of 6 of my iTerm pwsh
instances failed. Basically identical:
OS: Darwin 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
I think the best way to get a consistent repro would be to snapshot an older macOS in a VM, upgrade the guest OS, with iTerm reopening dozens of pwsh tabs. But I'm not familiar the tools to do this.
Ok, I ran another experiment with the upgrade to macOS 12.6.1 and I was able to repro this failure without pwsh or PSReadLine; if we ask the owner of dotnet BCL what the Muetx exception policy should be I hope this added context would help.
Repo Program.cs
with a net6.0
console app:
try {
using (var m = new Mutex(false, "darthwalsh_PSReadLine_issues_2658")) {
Console.Write("created but didn't take Mutex darthwalsh_PSReadLine_issues_2658 ...");
var line = Console.ReadLine();
Console.WriteLine("QUITTING: " + line);
}
} catch (Exception e) {
Console.WriteLine(e.ToString());
var line = Console.ReadLine();
Console.WriteLine("QUITTING: " + line);
}
Built with dotnet publish -c Release --self-contained -a x64
I created a new iTerm profile that just runs command /Users/walshca/code/temp/MutexThrow/bin/Release/net6.0/osx-x64/publish/MutexThrow
and opened 30 tabs with this. Then I upgraded the macOS version to 12.6.1, and after reboot only 1 tab had the exception:
System.IO.IOException: The system cannot open the device or file specified. : 'darthwalsh_PSReadLine_issues_2658'
at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
at Program.<Main>$(String[] args) in /Users/walshca/code/temp/MutexThrow/Program.cs:line 2
(I expected about a quarter, so something about this minimal program doesn't match what pwsh is doing).
I'll open some more tabs of this, and leave it running for a few more OS reboots to see if I can repro it without OS upgrade.
@darthwalsh did you already open an issue in https://github.com/dotnet/runtime?
@SteveL-MSFT thanks for the nudge, I had been putting off rebooting but I just updated parallels today...
Created dotnet/runtime#79375
I was going to raise a new bug but I think it's the same issue. In which case I have an easier way to reproduce it (on linux).
Log on to system as a user su to root Run pwsh as root su to another user run pwsh again
Example given below:
[fastdruid@localhost ~]$ sudo su - root [sudo] password for fastdruid: [root@localhost ~]# pwsh PowerShell 7.3.4 PS /root> su - fastdruid [fastdruid@localhost ~]$ pwsh PowerShell 7.3.4 PS /home/fastdruid> Oops, something went wrong. Please report this bug with ALL the details below, including both the 'Environment' and 'Exception' sections. Please report on GitHub: https://github.com/PowerShell/PSReadLine/issues/new?template=Bug_Report.yaml Thank you!
Environment
PSReadLine: 2.2.6 PowerShell: 7.3.4 OS: Linux 4.18.0-425.3.1.el8.x86_64 #1 SMP Wed Nov 9 20:13:27 UTC 2022 BufferWidth: 153 BufferHeight: 46
Last 0 Keys:
Exception
System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_1202693972' at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew) at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name) at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize() at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics) at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken, Nullable`1 lastRunStatus)
@Fastdruid that looks like it's a different issue: #3692