How does pyline handle filenames containing \n or \0?
From https://github.com/westurner/pyline/issues/24#issuecomment-226976408 :
Note that this can be dangerous because pyline (like other POSIX CLI tools) splits on
\nand filenames can contain\nand\0
- Pyline splits on
\n.
From https://github.com/westurner/pyline/commit/6c3f658c428def4644f10dc819e8d65adaf1b014#comments :
This bugfix causes a regression for what are probably the most frequent uses of the path options and may require multiple additional commandline options to fix.
- A filename may consist of only spaces (
\s); which .rstrip() strips - A filename may contain newlines (
\n,\r\n) and carriage returns (\r); which .rstrip() strips
For example:
$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l.rstrip()).exists())'
True
$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l).exists())'
False
$ echo -e 'README.rst\n' | pyline -m path 'str(p.exists())'
False
The problem fixed here is/was that line.rstrip() strips all trailing '\r', '\n', ' ';
which reduces files named, for example, \n\n\n, (\s\s\s), and \r\n\r\n (which are valid file names, by the way) to the empty string.
In most cases, it seemed safe to assume that -- for the --pathpy and --pathlib options which populate path = p = Path(line OR line.rstrip()) -- stripping the trailing newline from a line is a safe thing to do. It's the most convenient thing to do; and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.
So, at first, the simplest solution seems to be to write an rstrip_one() which strips only one \n. And then, for DOS line endings instead strip only one \r\n two-character line-ending combination. However, because pyline is a stream-processing tool, auto-detecting the line ending from the first line from a list of files which may or may not contain newlines \n and may or may not end in \n (Unix) or \r\n (DOS) may be dangerous because valid filenames would cause line-ending detection to fail unless explicitly specified:
# This would be autodetected as a Unix line ending
one\n1\r\ntwo\n2\r\n
# This would be autodetected as a DOS line ending
one 1\r\ntwo 2\r\n
I think the correct solution is:
- to add a
--dos-eoloption which explicitly sets the EOL (end-of-line) to\r\n; - default to
\n: specify that as--unix-eol(default:True); - and write
rstrip_one()so that files consisting of all spaces, carriage returns, or newlines aren't reduced to empty string for the path options.
and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.
This is the trouble:
- EOL sequences within a quoted string MAY NOT be considered field or record delimiters.
- EOL sequences within a quoted string SHOULD NOT be considered field or record delimiters.
- EOL sequences within a quoted string MUST NOT be considered field or record delimiters.
Pyline already has a --shlex option which does words = w = shlex.split(line). This may be dangerous if the input data contains newlines within the quoted fields because pyline (and many/most? other line-based tools tools) don't differentiate between quoted and non-quoted EOL sequences which serve as record delimiters.
https://en.wikipedia.org/wiki/Newline
This is the actual trouble: newlines are control characters which are also presentational characters. Naieve line-based parsers (such as Pyline v0.3.x) do not differentiate between EOL line feed control characters within quoted strings and EOL record delimiters.
Test cases:
"one\n"\n"two"
"thre\ne"\n"four\n"
"fiv\r\ne"\r\n"six\n"
"sev\nen"\r\n"eight\r\n"\r\n
The pyline function processes an iterable (an object with __iter__() and next()).
The pyline commandline script passes either a sys.stdin or a codecs.EncodedFile as the iterable iterable.
In order to fix this in Pyline, there would need to be a different method of processing "lines" from sys.stdin or a file: one which splits on the EOL record-delimiter only if it's not in a quoted field/record. These delimiters would need to be explicitly specified as CLI args.
Is this a bug / vulnerability / weakness which should be reported?
IDK how often people do things like this without expecting a path to contain \0 or \n or \r\n at the beginning, middle, or end:
# These are wrong/bad
ls | xargs stat
find . | xargs stat
# This is a safer way
find . -print0 | xargs -0 stat
# Pyline doesn't support -0 (split files on \0 as the record delimiter)
# Python 2 doesn't support -0 (split files on \0 as the record delimiter)
# Python 3 open() supports a newline= parameter
There are CWE codes for these, so I guess it's more of an operator error awareness issue and an unsafe defaults issue:
- "CWE-144: Improper Neutralization of Line Delimiters" https://cwe.mitre.org/data/definitions/144.html
- "CWE-143: Improper Neutralization of Record Delimiters" https://cwe.mitre.org/data/definitions/143.html
- "CWE-117: Improper Output Neutralization for Logs" https://cwe.mitre.org/data/definitions/117.html
- "CWE-93: Improper Neutralization of CRLF Sequences ('CRLF Injection')" https://cwe.mitre.org/data/definitions/93.html
- https://www.owasp.org/index.php/OWASP_Periodic_Table_of_Vulnerabilities_-_Null_Byte_Injection
Maybe the record delimiter should be an optional argument to file.__iter__, codecs.EncodedFile.__iter__, and sys.stdin.__iter__ or their respective constructors? I'm not sure how
str.splitlines()returns a list, not a generatorstr.splitlines("\n")returns[]- sys.stdin doesn't have a constructor; but could take a PYTHONSTDINDELIM environ variable
- https://docs.python.org/2/library/stdtypes.html#bltin-file-objects
- https://docs.python.org/2/library/functions.html#open
- https://docs.python.org/3/library/stdtypes.html#bltin-file-objects
- https://docs.python.org/3/library/filesys.html
- https://docs.python.org/3/library/functions.html#open (
newline=was added in )- "It can be None, '', '\n', '\r', and '\r\n'"
- https://docs.python.org/3/library/functions.html#open (
- https://docs.python.org/3/library/fileinput.html
- fileinput accepts openhook=hook_encoded(encoding, errors) which doesn't yet take a
newline=kwarg
- fileinput accepts openhook=hook_encoded(encoding, errors) which doesn't yet take a
- https://docs.python.org/3/glossary.html#term-universal-newlines
- "PEP-278: Universal Newlines Support" https://www.python.org/dev/peps/pep-0278/
- "PEP 3116 -- New I/O"
https://www.python.org/dev/peps/pep-3116/#text-i-o
https://www.python.org/dev/peps/pep-3116/#the-open-built-in-function
newline='\0'is not yet supported; so would need to be implemented in basically same way that \n, \r, and \r\n are implemented for "universal newlines"
These would read the whole file into RAM (and swap):
file().read().split('\0')[:-1]
sys.stdin.read().split('\0')[:-1]