Add Get-FileEncoding cmdlet or function.

Open thezim opened this issue 9 years ago • 21 comments

This is a common task I see across many PowerShell modules, and I think it would add value for cross-platform tasks.

thezim avatar Sep 17 '16 16:09 thezim

Do you mean this? http://poshcode.org/2059 https://gist.github.com/jpoehls/2406504

This suggests that we also need the following cmdlets: Convert-FileEncoding and Convert-StringEncoding

An RFC is also required.

iSazonov avatar Sep 28 '16 13:09 iSazonov

@iSazonov Yes. The additional cmdlets are nice-to-haves as well.

thezim avatar Sep 28 '16 23:09 thezim

This is common task I see across many PowerShell modules

@thezim Could you give examples of such modules?

iSazonov avatar Sep 29 '16 11:09 iSazonov

I looked into this area. The approach is questionable: we need a reference algorithm from experts in the field. For example: http://gnuwin32.sourceforge.net/packages/file.htm

iSazonov avatar Oct 06 '16 10:10 iSazonov

For compatibility, we need to use a port of the file utility. Can we rewrite it in C# and include it in the repo as a cmdlet?

iSazonov avatar Dec 07 '16 19:12 iSazonov

Posted by @sdwheeler in our Community Call, this is a version from Lee: http://poshcode.org/2153

joeyaiello avatar Dec 08 '16 17:12 joeyaiello

@PowerShell/powershell-committee discussed this, and the recommendation is to have a cmdlet that supports this capability instead of adding it to FileInfo. Usage will be more common now that we are cross-platform, and the cmdlet should be part of the Utility module. Get-FileEncoding and Convert-FileEncoding make sense from a discovery standpoint. It seems we can just review the parameters at PR time rather than requiring an RFC for this one.

SteveL-MSFT avatar Dec 08 '16 17:12 SteveL-MSFT

@joeyaiello If we use a different algorithm than file, it may mislead Unix users.

@SteveL-MSFT Could you please clarify the possibility of porting the file utility?

iSazonov avatar Dec 08 '16 18:12 iSazonov

@iSazonov Porting file as a cmdlet makes sense (assuming appropriate licensing). Alternatively, since I see that file is already ported to Windows, perhaps it's not worth the effort to port file to C# and we could instead just wrap it in a cmdlet?

SteveL-MSFT avatar Dec 08 '16 20:12 SteveL-MSFT

Our conclusion on this issue was specifically about wanting better support for encodings, nothing more.

I think we also questioned the value in porting file to PowerShell because extensions are the primary way of understanding file types on Windows.

lzybkr avatar Dec 08 '16 23:12 lzybkr

@SteveL-MSFT We cannot expect the file utility to be present on every Unix system, especially on OS X.

Today I researched more deeply how the file utility works. Encoding detection is very simple (yes, file type detection is overkill for us) and can easily be ported to C#. Thus we can easily achieve compliance with the de facto Unix standard. The bad news is that the code is very old and should be brought in line with modern standards (from FSS-UTF (1992) / UTF-8 (1993) to UTF-8 (2003)).
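To illustrate how small the BOM part of such detection can be, here is a minimal PowerShell sketch. It is not the algorithm used by the file utility, and Get-BomEncoding is a hypothetical name, not an existing cmdlet. It only inspects the first bytes of a file and makes no attempt at heuristics or code page detection.

```powershell
# Hypothetical helper (not an existing cmdlet): classify a file by its leading BOM bytes only.
function Get-BomEncoding {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # Read at most the first four bytes; no BOM checked here is longer than that.
    $stream = [System.IO.File]::OpenRead((Convert-Path -LiteralPath $Path))
    try {
        $buffer = New-Object byte[] 4
        $count  = $stream.Read($buffer, 0, 4)
    }
    finally {
        $stream.Dispose()
    }

    # Render the bytes as hex so the patterns below are easy to compare.
    $hex = ''
    for ($i = 0; $i -lt $count; $i++) { $hex += $buffer[$i].ToString('X2') }

    # Order matters: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
    switch -Regex ($hex) {
        '^EFBBBF'   { return 'utf-8 (BOM)' }
        '^FFFE0000' { return 'utf-32 LE' }
        '^0000FEFF' { return 'utf-32 BE' }
        '^FFFE'     { return 'utf-16 LE' }
        '^FEFF'     { return 'utf-16 BE' }
        default     { return 'unknown (no BOM)' }
    }
}
```

A file without a BOM is reported as unknown; telling BOM-less UTF-8 apart from a legacy code page would need the kind of content analysis discussed in this thread.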

More bad news is that this utility does not detect code pages. Do we want to detect code pages? If so, do we want high-speed heuristics (sample), or should we use simpler but slower methods?

Now about the conversion. Simple test:

[text.encoding]::GetEncodings().count

This returns 140 code pages in PowerShell 5.1 and 8 code pages in PowerShell 6.0 (alpha 13). For comparison, Unix iconv supports ~300 code pages.

Should we rely completely on .NET Core in the expectation that it will support multiple charsets? Or should we write our own implementation?

iSazonov avatar Dec 09 '16 18:12 iSazonov

@SteveL-MSFT For me, I was just looking for detection of the encodings that existing cmdlets such as Out-File currently accept. No code page usage. I do see the value in a full set of encoding cmdlets though.

thezim avatar Dec 09 '16 23:12 thezim

Opened - Initial discussion about encoding cmdlets https://github.com/PowerShell/PowerShell-RFC/issues/67

iSazonov avatar Feb 06 '17 10:02 iSazonov

@iSazonov: As an aside re:

We cannot expect that there is the file utility on each Unix system especially on OsX.

file is a POSIX-mandated utility and therefore available on most (all?) modern Unix platforms, including macOS (OS X).

That said, the focus of the POSIX file utility spec is on classifying files by content - encodings aren't even mentioned.

In practice, however, both the GNU and the BSD/macOS implementations do report a text file's encoding, including the presence/absence of the UTF-8 pseudo-BOM.

mklement0 avatar Mar 02 '17 13:03 mklement0

@mklement0 Thank you for pointing out that this utility is part of POSIX. In most cases, however, it is installed as part of a separate package. This would force us to require the installation of this utility when installing PowerShell Core, which I believe is unacceptable for us. I recently did a short review of the GNU file utility and found that its code is too out of date. I suppose we should not rely on it. Perhaps there is a more modern version, but I don't know of one.

And welcome to discussion https://github.com/PowerShell/PowerShell-RFC/issues/67

iSazonov avatar Mar 02 '17 14:03 iSazonov

I'm not (nearly) as advanced a PowerShell user as you guys, and I have a weak understanding of file encoding (I don't have a clue what the point of a BOM is honestly) but once every year or two, I get stung by file encoding, and the last time (a few days ago), cost us a Production migration as we were scratching our heads why our automation tool could not run batch scripts (the reason was that the batch scripts were generated by PowerShell which defaults to UTF-8 which made the batch scripts broken, but the errors made us think that it was the automation tool that was failing in some way). Such a scenario might all be very trivial/obvious to you guys, but it is not to most users (a "text file" has no deeper complexity than "text file" to most people, most of the time).

Both required tools (Get-FileEncoding and Convert-FileEncoding in https://github.com/PowerShell/PowerShell-RFC/issues/67) are long-overdue as core components of PowerShell. Get- would greatly enhance appreciation of file encoding issues (and the more information the better in my mind, codepages etc), while Convert- becomes more and more important in making PowerShell a useful cross-platform tool. Would really appreciate if this two-years-since-last-comment thread was un-mothballed?

roysubs avatar Nov 21 '20 07:11 roysubs

Would really appreciate if this two-years-since-last-comment thread was un-mothballed?

@roysubs This was approved and you can grab the work.

iSazonov avatar Nov 21 '20 12:11 iSazonov

I really wish that I had the ability to do that @iSazonov !

I know that @mklement0 has a very deep understanding of file encoding, I'm hoping that he might have the time to build this... 🙂

roysubs avatar Nov 21 '20 12:11 roysubs

@mklement0 is a great analyst but not a fan of coding :-)

The implementation is simple using StreamReader.CurrentEncoding. Of course, later we could make the cmdlet smarter in a more "PowerShell-y" way with heuristics.
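As a rough sketch of what the StreamReader.CurrentEncoding approach could look like from PowerShell (assuming PowerShell 5 or later for the ::new syntax; Get-StreamReaderEncoding is a hypothetical name, not the approved cmdlet):

```powershell
# Hypothetical helper: let StreamReader's BOM detection pick the encoding.
function Get-StreamReaderEncoding {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # The second argument turns on detectEncodingFromByteOrderMarks.
    $fullPath = Convert-Path -LiteralPath $Path
    $reader = [System.IO.StreamReader]::new($fullPath, $true)
    try {
        # CurrentEncoding is only updated once the reader has read from the stream.
        $null = $reader.Peek()
        return $reader.CurrentEncoding
    }
    finally {
        $reader.Dispose()
    }
}
```

Note that for a file without a BOM this returns the reader's default (UTF-8), which is a fallback rather than a detection result, so a real cmdlet would presumably need to distinguish the two cases.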

iSazonov avatar Nov 21 '20 13:11 iSazonov

Sounds great, and I'll help if I can, but presumably you'd have to do this in C# (I'm more of just a SysAdmin / DevOps type scripter, I just use PowerShell and Python to manage some tasks on my work environments). I want to see PowerShell take over on Linux though, it's just a much better language imo 🙂.

roysubs avatar Nov 21 '20 18:11 roysubs
