Generic/LowercasedFilename: sniff doesn't handle non-ASCII characters properly
Describe the bug
While working on improving code coverage for the Generic.Files.LowercasedFilename sniff (#681), I noticed that it fails to properly handle file names that contain uppercase non-ASCII characters, because it uses strtolower() to check whether the file name is all lowercase. strtolower() only converts ASCII characters and leaves non-ASCII characters unchanged.
https://github.com/PHPCSStandards/PHP_CodeSniffer/blob/26ddb35f4684760b27ad48d3c420afb2c636cc1b/src/Standards/Generic/Sniffs/Files/LowercasedFilenameSniff.php#L51
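To illustrate the behavior described above, here is a minimal sketch. It is not the sniff's actual code. It assumes PHP 8.2 or later (where strtolower() is locale-insensitive), the mbstring extension, and a UTF-8 encoded file name. mb_strtolower() is used only for comparison and is not put forward as the fix.

```php
<?php
// Assumption: the file name is UTF-8 encoded; "\u{00C9}" is "É".
$fileName = "t\u{00C9}st.php";

// strtolower() only touches the ASCII bytes A-Z, so the two bytes
// that encode "É" pass through unchanged.
$asciiLower = strtolower($fileName);
var_dump($asciiLower === $fileName); // bool(true): no difference is detected

// mb_strtolower() with an explicit encoding lowercases "É" to "é".
$unicodeLower = mb_strtolower($fileName, 'UTF-8');
var_dump($unicodeLower === $fileName); // bool(false): the uppercase character is detected
```

Because the first comparison reports no difference, a check built on strtolower() treats tÉst.php as already lowercase.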
Code sample
<?php
To reproduce
Steps to reproduce the behavior:
- Create a file called `tÉst.php` with the code sample above.
- Run `phpcs tÉst.php --standard=Generic --sniffs=Generic.Files.LowercasedFilename`
- No error message is displayed.
Expected behavior
PHPCS should display the following error message:
----------------------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
----------------------------------------------------------------------------------
1 | ERROR | Filename "tÉst.php" doesn't match the expected filename "tést.php"
----------------------------------------------------------------------------------
Versions (please complete the following information)
| Operating System | Ubuntu 24.04 |
| PHP version | 8.3 |
| PHP_CodeSniffer version | master |
| Standard | Generic |
| Install type | git clone |
Please confirm
- [x] I have searched the issue list and am not opening a duplicate issue.
- [x] I have read the Contribution Guidelines and this is not a support question.
- [x] I confirm that this bug is a bug in PHP_CodeSniffer and not in one of the external standards.
- [x] I have verified the issue still exists in the `master` branch of PHP_CodeSniffer.
@rodrigoprimo Thanks for finding and reporting this issue.
While this is an interesting issue from a technical perspective, I consider it low priority unless and until end-users of PHPCS report that they are running into it.
I wonder how common it is to have non-ASCII characters in file names? I also have a gut feeling that files like that may not always be portable across operating systems, but this would need to be researched and confirmed/debunked first. If my gut feeling turns out to be correct, I can imagine non-ASCII characters in file names might deserve their own sniff (to forbid this).
I also wonder how we could detect this reliably: while the file contents have an encoding, I don't know how we could figure out the encoding of the file name. I imagine the encoding might depend on the OS? File name vs. encoding is a curiosity which I've never dug into, so I'd be very interested to hear from someone who has and who can shed more light on this.
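As a starting point for that discussion, here is a rough sketch of one possible heuristic. It does not determine the encoding of a file name. It only checks whether the name is pure ASCII, or at least valid UTF-8. The helper `classifyFileName()` is hypothetical and not part of PHP_CodeSniffer; the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper for illustration only; not part of PHP_CodeSniffer.
function classifyFileName(string $baseName): string
{
    // No byte outside 0x00-0x7F: the name is plain ASCII.
    if (preg_match('/[^\x00-\x7F]/', $baseName) === 0) {
        return 'ascii';
    }

    // Non-ASCII bytes that form valid UTF-8 sequences.
    // Assumption: a name that validates as UTF-8 is treated as UTF-8.
    if (mb_check_encoding($baseName, 'UTF-8') === true) {
        return 'utf-8';
    }

    // Anything else (for example a single-byte legacy encoding) cannot
    // be identified from the bytes alone.
    return 'unknown';
}

var_dump(classifyFileName('test.php'));        // string(5) "ascii"
var_dump(classifyFileName("t\u{00C9}st.php")); // string(5) "utf-8"
var_dump(classifyFileName("t\xC9st.php"));     // string(7) "unknown"
```

A check like this could let a sniff decide between lowercasing the name with a multibyte-aware function, flagging non-ASCII names, or bailing out when the bytes cannot be interpreted.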