pyline How does pyline handle filenames containing \n or \0?

From https://github.com/westurner/pyline/issues/24#issuecomment-226976408 :

Note that this can be dangerous because pyline (like other POSIX CLI tools) splits on \n and filenames can contain \n and \0

Pyline splits on \n.

Sep 28 '16 00:09 westurner

From https://github.com/westurner/pyline/commit/6c3f658c428def4644f10dc819e8d65adaf1b014#comments :

This bugfix causes a regression for what are probably the most frequent uses of the path options and may require multiple additional commandline options to fix.

A filename may consist of only spaces (\s); which .rstrip() strips
A filename may contain newlines (\n, \r\n) and carriage returns (\r); which .rstrip() strips

For example:

$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l.rstrip()).exists())'
True

$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l).exists())'
False
$ echo -e 'README.rst\n' | pyline -m path 'str(p.exists())'
False

The problem fixed here is/was that line.rstrip() strips all trailing '\r', '\n', ' '; which reduces files named, for example, \n\n\n, (\s\s\s), and \r\n\r\n (which are valid file names, by the way) to the empty string.

In most cases, it seemed safe to assume that -- for the --pathpy and --pathlib options which populate path = p = Path(line OR line.rstrip()) -- stripping the trailing newline from a line is a safe thing to do. It's the most convenient thing to do; and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.

So, at first, the simplest solution seems to be to write an rstrip_one() which strips only one \n. And then, for DOS line endings instead strip only one \r\n two-character line-ending combination. However, because pyline is a stream-processing tool, auto-detecting the line ending from the first line from a list of files which may or may not contain newlines \n and may or may not end in \n (Unix) or \r\n (DOS) may be dangerous because valid filenames would cause line-ending detection to fail unless explicitly specified:

# This would be autodetected as a Unix line ending
one\n1\r\ntwo\n2\r\n
# This would be autodetected as a DOS line ending
one 1\r\ntwo 2\r\n

I think the correct solution is:

to add a --dos-eol option which explicitly sets the EOL (end-of-line) to \r\n;
default to \n: specify that as --unix-eol (default:True);
and write rstrip_one() so that files consisting of all spaces, carriage returns, or newlines aren't reduced to empty string for the path options.

and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.

This is the trouble:

EOL sequences within a quoted string MAY NOT be considered field or record delimiters.
EOL sequences within a quoted string SHOULD NOT be considered field or record delimiters.
EOL sequences within a quoted string MUST NOT be considered field or record delimiters.

Pyline already has a --shlex option which does words = w = shlex.split(line). This may be dangerous if the input data contains newlines within the quoted fields because pyline (and many/most? other line-based tools tools) don't differentiate between quoted and non-quoted EOL sequences which serve as record delimiters.

https://en.wikipedia.org/wiki/Newline

This is the actual trouble: newlines are control characters which are also presentational characters. Naieve line-based parsers (such as Pyline v0.3.x) do not differentiate between EOL line feed control characters within quoted strings and EOL record delimiters.

Feb 03 '18 04:02 westurner

Test cases:

"one\n"\n"two"

"thre\ne"\n"four\n"

"fiv\r\ne"\r\n"six\n"

"sev\nen"\r\n"eight\r\n"\r\n

Feb 03 '18 04:02 westurner

The pyline function processes an iterable (an object with __iter__() and next()). The pyline commandline script passes either a sys.stdin or a codecs.EncodedFile as the iterable iterable.

In order to fix this in Pyline, there would need to be a different method of processing "lines" from sys.stdin or a file: one which splits on the EOL record-delimiter only if it's not in a quoted field/record. These delimiters would need to be explicitly specified as CLI args.

Feb 03 '18 05:02 westurner

Is this a bug / vulnerability / weakness which should be reported? IDK how often people do things like this without expecting a path to contain \0 or \n or \r\n at the beginning, middle, or end:


# These are wrong/bad
ls | xargs stat
find . | xargs stat

# This is a safer way
find . -print0 | xargs -0 stat

# Pyline doesn't support -0 (split files on \0 as the record delimiter)
# Python 2 doesn't support -0 (split files on \0 as the record delimiter) 
# Python 3 open() supports a newline= parameter

There are CWE codes for these, so I guess it's more of an operator error awareness issue and an unsafe defaults issue:

"CWE-144: Improper Neutralization of Line Delimiters" https://cwe.mitre.org/data/definitions/144.html
"CWE-143: Improper Neutralization of Record Delimiters" https://cwe.mitre.org/data/definitions/143.html
"CWE-117: Improper Output Neutralization for Logs" https://cwe.mitre.org/data/definitions/117.html
"CWE-93: Improper Neutralization of CRLF Sequences ('CRLF Injection')" https://cwe.mitre.org/data/definitions/93.html
https://www.owasp.org/index.php/OWASP_Periodic_Table_of_Vulnerabilities_-_Null_Byte_Injection

Maybe the record delimiter should be an optional argument to file.__iter__, codecs.EncodedFile.__iter__, and sys.stdin.__iter__ or their respective constructors? I'm not sure how

str.splitlines() returns a list, not a generator
str.splitlines("\n") returns []
sys.stdin doesn't have a constructor; but could take a PYTHONSTDINDELIM environ variable
https://docs.python.org/2/library/stdtypes.html#bltin-file-objects
- https://docs.python.org/2/library/functions.html#open
https://docs.python.org/3/library/stdtypes.html#bltin-file-objects
https://docs.python.org/3/library/filesys.html
- https://docs.python.org/3/library/functions.html#open (newline= was added in )
  - "It can be None, '', '\n', '\r', and '\r\n'"
https://docs.python.org/3/library/fileinput.html
- fileinput accepts openhook=hook_encoded(encoding, errors) which doesn't yet take a newline= kwarg
https://docs.python.org/3/glossary.html#term-universal-newlines
- "PEP-278: Universal Newlines Support" https://www.python.org/dev/peps/pep-0278/
- "PEP 3116 -- New I/O" https://www.python.org/dev/peps/pep-3116/#text-i-o https://www.python.org/dev/peps/pep-3116/#the-open-built-in-function
  - newline='\0' is not yet supported; so would need to be implemented in basically same way that \n, \r, and \r\n are implemented for "universal newlines"

Feb 03 '18 05:02 westurner

These would read the whole file into RAM (and swap):

file().read().split('\0')[:-1]
sys.stdin.read().split('\0')[:-1]

Feb 03 '18 08:02 westurner