ripgrepy
Fix line splitting of ripgrep --json output

`ripgrep --json` results may contain characters that `str.splitlines` considers newlines, while ripgrep itself only treats `\n` or `\r\n` as newlines (the attached file is an example where that happens). ripgrep's output records themselves are separated by standard newlines.
Tested with SeaGOAT, a project that depends on ripgrepy.
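For illustration, the mismatch described above can be shown in a few lines: `str.splitlines` recognizes several Unicode line boundaries beyond `\n` and `\r\n` (such as U+0085), whereas ripgrep only separates its JSON records with `\n`. This is a minimal sketch, not code from ripgrepy:

```python
# str.splitlines treats U+0085 (NEL) as a line boundary,
# but ripgrep only separates its JSON records with \n.
s = "before\x85after"
print(s.splitlines())  # ['before', 'after'] -> splits a JSON record in two
print(s.split("\n"))   # ['before\x85after'] -> record stays intact
```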
@danipozo could you give an example of how you expect ripgrep's response to look with the `--json` flag for the example file you provided, versus how ripgrepy is currently outputting the result?
ripgrepy currently throws an exception with the `--json` flag for the example file:
```
>>> Ripgrepy('.', '0').json().run().as_dict
ERROR:root:
Traceback (most recent call last):
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
```
This exception is caused by splitting the output of `ripgrep --json` at a newline-like character inside a JSON object. This (fragment of a) match object returned by ripgrep:

```
{
  "type": "match",
  "data": {
    "path": {
      "text": "01666_blns_long.reference"
    },
    "lines": {
      "text": "['undefined','undef','null','NULL','(null)','nil','NIL','true','false','True','False','TRUE','FALSE','None','hasOwnProperty','then','\\\\','\\\\\\\\','0','1','1.00','$1.00','1/2','1E2','1E02','1E+02','-1','-1.00','-$1.00','-1/2','-1E2','-1E02','-1E+02','1/0','0/0','-2147483648/-1','-9223372036854775808/-1','-0','-0.0','+0','+0.0','0.00','0..0','.','0.0.0','0,00','0,,0',',','0,0,0','0.0/0','1.0/0.0','0.0/0.0','1,0/0,0','0,0/0,0','--1','-','-.','-,','999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','NaN','Infinity','-Infinity','INF','1#INF','-1#IND','1#QNAN','1#SNAN','1#IND','0x0','0xffffffff','0xffffffffffffffff','0xabad1dea','123456789012345678901234567890123456789','1,000.00','1 000.00','1\\'000.00','1,000,000.00','1 000 000.00','1\\'000\\'000.00','1.000,00','1 000,00','1\\'000,00','1.000.000,00','1 000 000,00','1\\'000\\'000,00','01000','08','09','2.2250738585072011e-308',',./;\\'[]\\\\-=','<>?:\"{}|_+','!@#$%^&*()`~','\\\\u0001\\\\u0002\\\\u0003\\\\u0004\\\\u0005\\\\u0006\\\\u0007\\b\\\\u000e\\\\u000f\\\\u0010\\\\u0011\\\\u0012\\\\u0013\\\\u0014\\\\u0015\\\\u0016\\\\u0017\\\\u0018\\\\u0019\\\\u001a\\\\u001b\\\\u001c\\\\u001d\\\\u001e\\\\u001f\u007f','<U+0080><U+0081><U+0082><U+0083>
<U+0084><U+0086><U+0087><U+0088><U+0089><U+008A><U+008B><U+008C><U+008D><U+008E><U+008F><U+0090><U+0091><U+0092><U+0093><U+0094><U+0095><U+0096><U+0097>
<U+0098><U+0099><U+009A><U+009D><U+009E><U+009F>','\\t\\\\u000b\\f <U+0085> <U+2028><U+2029> ','
```

is broken at the `<U+0085>` character, producing an invalid JSON object, which causes the exception above. The JSON objects representing matches are themselves separated only by standard newlines, which shouldn't appear inside matches because ripgrep does line-by-line processing.
I am not confident that the PR will solve this issue except on systems that use `\n` as the line separator, like Linux. It may create an issue on Windows in its current implementation. Furthermore, the sample data also seems to cause issues with other software written in Python, e.g. the kitty terminal, which hangs on `rg '0'` against the file.
I think a better PR would be to pass the `split_by` character via a function parameter for `as_dict` and `as_json`.
For example:
```python
def as_dict(self, split_by: Union[str, None] = None):
    ...
    if split_by is not None:
        out = self._output.split(split_by)
    else:
        out = self._output.splitlines()
    ...
```
You also have to help me understand `for line in out[:-1]:` in the PR: why are we ignoring the last line?
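Presumably (an assumption, not confirmed by the PR itself) the `[:-1]` is there because `str.split` produces a trailing empty chunk when the output ends with the separator, while `str.splitlines` does not:

```python
# str.split leaves a trailing empty chunk when the text ends with the
# separator; str.splitlines does not.
print("a\nb\n".split("\n"))    # ['a', 'b', '']
print("a\nb\n".splitlines())   # ['a', 'b']
```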
I believe I've encountered this problem as well, as I too am getting decoding errors. It seems like a bug if ripgrep is emitting newlines within JSON, though?
What about an approach where it splits by newline and then tries to decode each chunk as it hits a newline? If decoding fails, it stores the chunk in a running `undecoded_partial_json` variable (adding the stripped newline character back too), and then tries re-decoding `undecoded_partial_json` each time it gets a new chunk.
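A minimal sketch of that accumulate-and-retry idea (the name `parse_json_stream` is hypothetical, not part of ripgrepy). Using `splitlines(keepends=True)` preserves whichever terminator Python stripped, so partial fragments can be re-joined verbatim:

```python
import json

def parse_json_stream(output: str):
    # Sketch: split on all line boundaries, but re-join fragments that
    # fail to decode, so newline-like characters (e.g. U+0085) inside a
    # match string don't break parsing.
    results = []
    partial = ""  # running buffer of undecoded fragments
    for piece in output.splitlines(keepends=True):
        partial += piece
        try:
            results.append(json.loads(partial))
            partial = ""  # decoded successfully, reset the buffer
        except json.JSONDecodeError:
            continue  # incomplete object, keep accumulating
    return results
```

This keeps the default `splitlines` behaviour while recovering records that were split at a non-`\n` boundary; a record followed by ripgrep's real `\n` separator still decodes, since trailing `\n` is valid JSON whitespace.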