Generic/LowercasedFilename: sniff doesn't handle non-ASCII characters properly

Open rodrigoprimo opened this issue 1 year ago • 1 comments

Describe the bug

While working on improving code coverage for the Generic.Files.LowercasedFilename sniff (#681), I noticed that it fails to properly handle file names that contain uppercase non-ASCII characters. The sniff uses strtolower() to check whether the filename is all lowercase, and strtolower() leaves non-ASCII characters unchanged.

https://github.com/PHPCSStandards/PHP_CodeSniffer/blob/26ddb35f4684760b27ad48d3c420afb2c636cc1b/src/Standards/Generic/Sniffs/Files/LowercasedFilenameSniff.php#L51
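The difference can be seen in isolation with a minimal sketch. It assumes the script is saved as UTF-8, runs on PHP 8.2 or later (where strtolower() only converts ASCII letters), and has the mbstring extension loaded. The use of mb_strtolower() here is only an illustration of one possible comparison, not what the sniff currently does.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$fileName = 'tÉst.php';

// strtolower() only maps the ASCII letters A-Z, so the multibyte "É"
// is left untouched and the name looks as if it were already lowercase.
$asciiLower = strtolower($fileName);

// mb_strtolower() applies Unicode case mapping when given the encoding.
$unicodeLower = mb_strtolower($fileName, 'UTF-8');

var_dump($asciiLower === $fileName);   // true: no difference detected
var_dump($unicodeLower === $fileName); // false: "É" was mapped to "é"
```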

Code sample

<?php

To reproduce

Steps to reproduce the behavior:

1. Create a file called tÉst.php with the code sample above.
2. Run phpcs tÉst.php --standard=Generic --sniffs=Generic.Files.LowercasedFilename
3. No error message is displayed.

Expected behavior

PHPCS should display the following error message:

----------------------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
----------------------------------------------------------------------------------
 1 | ERROR | Filename "tÉst.php" doesn't match the expected filename "tést.php"
----------------------------------------------------------------------------------

Versions (please complete the following information)


Operating System	Ubuntu 24.04
PHP version	8.3
PHP_CodeSniffer version	master
Standard	Generic
Install type	git clone

Please confirm

[x] I have searched the issue list and am not opening a duplicate issue.
[x] I have read the Contribution Guidelines and this is not a support question.
[x] I confirm that this bug is a bug in PHP_CodeSniffer and not in one of the external standards.
[x] I have verified the issue still exists in the master branch of PHP_CodeSniffer.

Nov 13 '24 21:11 rodrigoprimo

@rodrigoprimo Thanks for finding and reporting this issue.

While this is interesting from a technical perspective, I consider it low priority unless and until end-users of PHPCS report that they are running into it.

I wonder how common it is to have non-ASCII characters in file names? I also have a gut feeling that such files may not always be portable across operating systems, but this would need to be researched and confirmed or debunked first. If that gut feeling turns out to be correct, I can imagine non-ASCII characters in file names deserving their own sniff (to forbid them).
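For illustration only, the core check of such a sniff could be as small as the sketch below. The function name hasNonAsciiBytes() is hypothetical and not part of PHP_CodeSniffer. It works on raw bytes, so it does not need to know the encoding of the name.

```php
<?php
/**
 * Hypothetical helper (not part of PHP_CodeSniffer): reports whether a
 * file name contains any byte outside the 7-bit ASCII range.
 */
function hasNonAsciiBytes(string $fileName): bool
{
    // No "u" modifier: the pattern is matched byte by byte, so any byte
    // above 0x7F counts, whatever encoding the name happens to use.
    return preg_match('/[^\x00-\x7F]/', $fileName) === 1;
}

var_dump(hasNonAsciiBytes('test.php')); // false
var_dump(hasNonAsciiBytes('tÉst.php')); // true
```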

I also wonder how we could detect this reliably: while the file contents have an encoding, I don't know how we could figure out the encoding of the file name. I imagine the encoding might depend on the OS? File name encoding is a curiosity I've never dug into, so I'd be very interested to hear from someone who has and who can shed more light on this.
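One possible heuristic, sketched here under the assumption that most affected names are UTF-8, is to treat the name as UTF-8 only when it validates as such and to keep the byte-wise strtolower() behaviour otherwise. This validates rather than detects the encoding, so it can still guess wrong. The function name expectedLowercaseName() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the lowercase form a file name would be
 * compared against. Assumes UTF-8 when the bytes validate as UTF-8.
 */
function expectedLowercaseName(string $fileName): string
{
    // Without mbstring, or when the bytes are not valid UTF-8, fall back
    // to the ASCII-only behaviour of strtolower().
    if (function_exists('mb_check_encoding') === false
        || mb_check_encoding($fileName, 'UTF-8') === false
    ) {
        return strtolower($fileName);
    }

    return mb_strtolower($fileName, 'UTF-8');
}

// Valid UTF-8: the Unicode case mapping applies.
var_dump(expectedLowercaseName('tÉst.php') === 'tést.php');

// "\xC9" is "É" in ISO-8859-1 but not valid UTF-8 here, so only the
// ASCII letters are lowercased and the byte itself is left alone.
var_dump(expectedLowercaseName("T\xC9st.php") === "t\xC9st.php");
```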

Nov 24 '24 18:11 jrfnl