ripgrepy
Fix line splitting of ripgrep --json output

`ripgrep --json` results may contain characters that `str.splitlines` considers newlines, while ripgrep itself only treats `\n` or `\r\n` as newlines (the attached file is an example where that happens). ripgrep's output records themselves are separated by standard newlines.
Tested with SeaGOAT, a project that depends on ripgrepy.
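For illustration, the mismatch described above can be shown in a few lines: `str.splitlines` recognizes several Unicode line boundaries beyond `\n` and `\r\n` (such as U+0085), whereas ripgrep only separates its JSON records with `\n`. This is a minimal sketch, not code from ripgrepy:

```python
# str.splitlines treats U+0085 (NEL) as a line boundary,
# but ripgrep only separates its JSON records with \n.
s = "before\x85after"
print(s.splitlines())  # ['before', 'after'] -> splits a JSON record in two
print(s.split("\n"))   # ['before\x85after'] -> record stays intact
```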
@danipozo could you give an example of how you expect ripgrep's response to look with the `--json` flag for the example file you provided, versus how ripgrepy is currently outputting the result?
ripgrepy currently throws an exception with the `--json` flag for the example file:
```
>>> Ripgrepy('.', '0').json().run().as_dict
ERROR:root:
Traceback (most recent call last):
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 21, in l
    o = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/git-external/ripgrepy/ripgrepy/__init__.py", line 67, in as_dict
    data = loads(line)
           ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)
```
This exception is caused by splitting the output of `ripgrep --json` at a newline-like character inside a JSON object. This (fragment of a) match object returned by ripgrep:

```
{
  "type": "match",
  "data": {
    "path": {
      "text": "01666_blns_long.reference"
    },
    "lines": {
      "text": "['undefined','undef','null','NULL','(null)','nil','NIL','true','false','True','False','TRUE','FALSE','None','hasOwnProperty','then','\\\\','\\\\\\\\','0','1','1.00','$1.00','1/2','1E2','1E02','1E+02','-1','-1.00','-$1.00','-1/2','-1E2','-1E02','-1E+02','1/0','0/0','-2147483648/-1','-9223372036854775808/-1','-0','-0.0','+0','+0.0','0.00','0..0','.','0.0.0','0,00','0,,0',',','0,0,0','0.0/0','1.0/0.0','0.0/0.0','1,0/0,0','0,0/0,0','--1','-','-.','-,','999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','NaN','Infinity','-Infinity','INF','1#INF','-1#IND','1#QNAN','1#SNAN','1#IND','0x0','0xffffffff','0xffffffffffffffff','0xabad1dea','123456789012345678901234567890123456789','1,000.00','1 000.00','1\\'000.00','1,000,000.00','1 000 000.00','1\\'000\\'000.00','1.000,00','1 000,00','1\\'000,00','1.000.000,00','1 000 000,00','1\\'000\\'000,00','01000','08','09','2.2250738585072011e-308',',./;\\'[]\\\\-=','<>?:\"{}|_+','!@#$%^&*()`~','\\\\u0001\\\\u0002\\\\u0003\\\\u0004\\\\u0005\\\\u0006\\\\u0007\\b\\\\u000e\\\\u000f\\\\u0010\\\\u0011\\\\u0012\\\\u0013\\\\u0014\\\\u0015\\\\u0016\\\\u0017\\\\u0018\\\\u0019\\\\u001a\\\\u001b\\\\u001c\\\\u001d\\\\u001e\\\\u001f\u007f','<U+0080><U+0081><U+0082><U+0083>
<U+0084><U+0086><U+0087><U+0088><U+0089><U+008A><U+008B><U+008C><U+008D><U+008E><U+008F><U+0090><U+0091><U+0092><U+0093><U+0094><U+0095><U+0096><U+0097>
<U+0098><U+0099><U+009A><U+009D><U+009E><U+009F>','\\t\\\\u000b\\f <U+0085> <U+2028><U+2029> ','
```

is broken at the `<U+0085>` character, producing an invalid JSON object, which causes the exception above. The JSON objects representing matches are themselves separated only by standard newlines, which shouldn't appear inside matches because ripgrep does line-by-line processing.
I am not confident that the PR will solve this issue except on systems that use `\n` as the line separator, like Linux. It may create an issue on Windows in its current implementation. Furthermore, the sample data also seems to cause issues with other software written in Python, e.g. the kitty terminal, which hangs on `rg '0'` against the file.
I think a better PR would be to pass the `split_by` character via a function parameter for `as_dict` and `as_json`.
For example:
```python
def as_dict(self, split_by: Union[str, None] = None):
    ...
    if split_by is not None:
        out = self._output.split(split_by)
    else:
        out = self._output.splitlines()
    ...
```
You also have to help me understand `for line in out[:-1]:` in the PR: why are we ignoring the last line?
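Presumably (an assumption, not confirmed by the PR itself) the `[:-1]` is there because `str.split` produces a trailing empty chunk when the output ends with the separator, while `str.splitlines` does not:

```python
# str.split leaves a trailing empty chunk when the text ends with the
# separator; str.splitlines does not.
print("a\nb\n".split("\n"))    # ['a', 'b', '']
print("a\nb\n".splitlines())   # ['a', 'b']
```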
I believe I've encountered this problem as well, as I too am getting decoding errors. It seems like a bug if ripgrep is emitting newlines within JSON, though?
What about an approach where it splits by newline and then tries to decode each chunk as it hits a newline? If decoding fails, it stores the chunk in a running `undecoded_partial_json` variable (adding the stripped newline character back too), and then tries re-decoding `undecoded_partial_json` each time it gets a new chunk.
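A minimal sketch of that accumulate-and-retry idea (the name `parse_json_stream` is hypothetical, not part of ripgrepy). Using `splitlines(keepends=True)` preserves whichever terminator Python stripped, so partial fragments can be re-joined verbatim:

```python
import json

def parse_json_stream(output: str):
    # Sketch: split on all line boundaries, but re-join fragments that
    # fail to decode, so newline-like characters (e.g. U+0085) inside a
    # match string don't break parsing.
    results = []
    partial = ""  # running buffer of undecoded fragments
    for piece in output.splitlines(keepends=True):
        partial += piece
        try:
            results.append(json.loads(partial))
            partial = ""  # decoded successfully, reset the buffer
        except json.JSONDecodeError:
            continue  # incomplete object, keep accumulating
    return results
```

This keeps the default `splitlines` behaviour while recovering records that were split at a non-`\n` boundary; a record followed by ripgrep's real `\n` separator still decodes, since trailing `\n` is valid JSON whitespace.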