
set_parse_action handler receives seemingly erroneous loc argument

Open bernd-wechner opened this issue 1 year ago • 6 comments

Essentially, I am using set_parse_action() to determine the location of parse elements within the source string (if there's another way, I'm all ears). The parse action is defined in the documentation as having a signature of:

Each parse action fn is a callable method with 0-3 arguments, called as fn(s, loc, toks), fn(loc, toks), fn(toks), or just fn(), where:

    s = the original string being parsed
    loc = the location of the matching substring
    toks = a list of the matched tokens, packaged as a ParseResults object
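
To make the four calling conventions concrete, here is a minimal plain-Python sketch of how a callback could be handed only the trailing arguments it declares. The helper name call_with_arity is hypothetical, and this is not claimed to be how pyparsing does it internally; it only illustrates which of s, loc and toks each form receives.

```python
import inspect


def call_with_arity(fn, s, loc, toks):
    """Hypothetical helper: pass fn the last N of (s, loc, toks),
    where N is the number of parameters fn declares (0 to 3)."""
    n = len(inspect.signature(fn).parameters)
    args = (s, loc, toks)
    # n == 3 -> (s, loc, toks); n == 2 -> (loc, toks);
    # n == 1 -> (toks,);        n == 0 -> ()
    return fn(*args[len(args) - n:])


s, loc, toks = "abc def", 4, ["def"]

full = call_with_arity(lambda s, loc, toks: (s, loc, toks), s, loc, toks)
two = call_with_arity(lambda loc, toks: (loc, toks), s, loc, toks)
one = call_with_arity(lambda toks: toks, s, loc, toks)
none = call_with_arity(lambda: "no args", s, loc, toks)
```

Note that the arguments are dropped from the left: a two-argument handler sees (loc, toks), not (s, loc).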

And so I write me one to test this:

def test_value_location(s, loc, toks):
    pfx = "\t\t"
    print(f"{pfx}{s=}")
    print(f"{pfx}{loc=}")
    print(f"{pfx}{toks=}")
    value1 = ''.join(toks)
    value2 = toks.asDict()['value']
    value3 = s[loc:loc+len(value1)]
    print(f"{pfx}{value1=}")
    print(f"{pfx}{value2=}")
    print(f"{pfx}{value3=}")

and implement a parser. Only I find that the loc that is reported is inconsistent and, in one scenario, IMHO broken (either a bug, or surprisingly unintuitive and hard-to-comprehend behaviour that warrants clear documentation, which it may even have and I've not found it).

The issue seems to be that loc is out by 1 depending on whether the value follows a whitespace element (be that Empty(), White() or '') or a literal '='!

To confirm this, I wrote a test script (attached as: pyparsing-bug.py.zip)

In synopsis, it defines two test lines to parse and four scenarios to test. The first line is of the form "setting = value", the second of the form "setting value", and the scenarios define an assignment operator as '=', Empty(), White() or ''. In each scenario, the parse action handler prints its arguments and three versions of the value parsed (that handler is above). The first value is from toks as a list, the second from toks as a dict, the third from the source string at loc.

I expect all the values to be identical in all successfully parsed scenarios, and I expect the first line to parse only in the first scenario and the second line in the remaining three scenarios. And that is exactly what I find, except that the third value (from the source string at loc) demonstrates an inconsistency: the value at loc seems right in the empty scenarios and is out by one in the non-empty ('=') scenario.

Surely the whole point of loc is to point to the value parsed, regardless?

The full output, in which you can see value3 is wrong in Line 0, Scenario 0:

Line 0: setting_1 = value-1 # comment 1
	Scenario 0: '='
		s='setting_1 = value-1 # comment 1\n'
		loc=11
		toks=ParseResults(['value-1'], {'value': 'value-1'})
		value1='value-1'
		value2='value-1'
		value3=' value-'
		Result:
			['setting_1', '=', 'value-1', '#', ' comment 1']
			- name: 'setting_1'
			- trailing_comment: 			['#', ' comment 1']
			- value: 'value-1'

	Scenario 1: Empty
	Failed to parse: Expected {quoted string using single or double quotes | W:(-.0-9A-Za-z)}, found '='  (at char 10), (line:1, col:11)

	Scenario 2: <SP><TAB><CR><LF>
	Failed to parse: Expected {quoted string using single or double quotes | W:(-.0-9A-Za-z)}, found '='  (at char 10), (line:1, col:11)

	Scenario 3: ''
	Failed to parse: Expected {quoted string using single or double quotes | W:(-.0-9A-Za-z)}, found '='  (at char 10), (line:1, col:11)

Line 1: setting_2 value-2 # comment 2
	Scenario 0: '='
	Failed to parse: Expected '=', found 'value'  (at char 10), (line:1, col:11)

	Scenario 1: Empty
		s='setting_2 value-2 # comment 2\n'
		loc=10
		toks=ParseResults(['value-2'], {'value': 'value-2'})
		value1='value-2'
		value2='value-2'
		value3='value-2'
		Result:
			['setting_2', 'value-2', '#', ' comment 2']
			- name: 'setting_2'
			- trailing_comment: 			['#', ' comment 2']
			- value: 'value-2'

	Scenario 2: <SP><TAB><CR><LF>
		s='setting_2 value-2 # comment 2\n'
		loc=10
		toks=ParseResults(['value-2'], {'value': 'value-2'})
		value1='value-2'
		value2='value-2'
		value3='value-2'
		Result:
			['setting_2', ' ', 'value-2', '#', ' comment 2']
			- name: 'setting_2'
			- trailing_comment: 			['#', ' comment 2']
			- value: 'value-2'

	Scenario 3: ''
		s='setting_2 value-2 # comment 2\n'
		loc=10
		toks=ParseResults(['value-2'], {'value': 'value-2'})
		value1='value-2'
		value2='value-2'
		value3='value-2'
		Result:
			['setting_2', 'value-2', '#', ' comment 2']
			- name: 'setting_2'
			- trailing_comment: 			['#', ' comment 2']
			- value: 'value-2'
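
The loc values in the output above can be checked against the source strings with plain Python. This is only a sketch of a possible workaround, not a diagnosis of the cause: it assumes the only discrepancy is leading whitespace between loc and the value, and skip_whitespace is a hypothetical helper, not part of pyparsing.

```python
def skip_whitespace(s, loc, chars=" \t\r\n"):
    """Hypothetical helper: advance loc past any whitespace characters."""
    while loc < len(s) and s[loc] in chars:
        loc += 1
    return loc


line0 = "setting_1 = value-1 # comment 1\n"
line1 = "setting_2 value-2 # comment 2\n"

# (source string, loc as reported in the output above, parsed value)
reported = [(line0, 11, "value-1"), (line1, 10, "value-2")]

for s, loc, value in reported:
    at_loc = s[loc:loc + len(value)]        # what value3 shows
    start = skip_whitespace(s, loc)         # first non-whitespace offset
    adjusted = s[start:start + len(value)]  # slice taken from there
    print(f"{loc=} {at_loc=!r} {start=} {adjusted=!r}")
```

For Line 0 the reported loc of 11 is the space after '=', and the value itself starts at offset 12; for Line 1 the reported loc of 10 is already the first character of the value.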

bernd-wechner avatar May 27 '24 10:05 bernd-wechner

I haven't dug too deep into your off-by-one bits yet, but here are a couple of things to take a look at:

  • Located class to wrap an expression and return its parsed contents as a named "value" result, and start and end locations
  • trace_parse_action decorator for debugging parse actions
  • with_line_numbers in pyparsing.testing to display strings with line and column numbers

ptmcg avatar May 28 '24 16:05 ptmcg

Thanks for the tips. I will have to look at them more closely later (though I would suggest adding links when providing them). A quick review is all I have time for now, to add value to your (already helpful) pointers and tips:

  • Located might well be a workaround, thanks. This would not change the bug status of the erroneous supply of loc to a parse_action handler mind you.
  • trace_parse_action might be useful for me to drill down and see if this is a bug or indeed some (unimaginably at present) strangely justifiable behaviour. In short to contribute to triage of this issue myself. Thanks.
  • with_line_numbers could also offer a workaround and diagnostic path. Interestingly, it is documented with: "Line and column numbers are 1-based.", which provides a tantalising clue as to how this reported issue may arise (confusion in some code path between a 0-based and a 1-based loc, a notoriously common bug to encounter and to guard against when flipping between the two contexts).

bernd-wechner avatar May 29 '24 00:05 bernd-wechner

Had a quick look:

  • with_line_numbers is indeed just a debugging tool and has nothing to do with the parser per se. The doc is rather poor and it takes some drilling into the code to work out how to use it, but here is a demo:
    import pyparsing as pp
    print(pp.testing.with_line_numbers(line))
    
    which prints the line (a string) to the console, a bit like this:
                1         2         3         4         5         6         7
       1234567890123456789012345678901234567890123456789012345678901234567890
     1:existing_string_setting = 'existing_string'      #   trailing comment|
    
    Mildly handy indeed. And when comparing against zero-based string indices, it is easy enough to insert a space in the first two rows if you can edit the console output (as I can in my debugger).
  • Located exhibits exactly the same bug as in this report, has more bugs to boot, and is no joy to use. If you have a compound parser (multiple elements/tokens) and Located is used to decorate more than one of them, then as_dict() is broken. I haven't time right now to report the full details, but essentially it mis-associates names with results (it has the value of one token under the key of another) and reports only the locations of the last of the Located tokens (as the dict uses locn_start and locn_end keys, more than one set can't be supplied). The as_list() version inserts the entries for the start and end location before and after each token, so it could be read with foreknowledge of where to expect Located() annotations, but that rather pollutes the list IMHO (it ought really to be a list of tokens that have a str() that is just the token, with attributes for start and end). But Located() is clearly designed to be used on only one element/token in a compound parser (that adds many), and moreover it exhibits precisely the same locating bug that the original issue reports.
  • trace_parse_action I have yet to play with to drill down a bit and determine whether I might discover this apparent bug is a bizarre feature with an explanation ;-). Later.
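
On the 0-based versus 1-based point: the error messages in the original report pair "at char 10" with "col:11", which is consistent with a 0-based offset and a 1-based column. A minimal plain-Python sketch of that conversion follows; line_and_col is a hypothetical helper written for illustration (pyparsing ships its own lineno() and col() functions, which are not used here).

```python
def line_and_col(s, loc):
    """Hypothetical helper: convert a 0-based offset into 1-based (line, col)."""
    line = s.count("\n", 0, loc) + 1
    # rfind returns -1 when there is no earlier newline, so on the first
    # line this reduces to col = loc + 1.
    col = loc - s.rfind("\n", 0, loc)
    return line, col


s = "setting_1 = value-1 # comment 1\n"
print(line_and_col(s, 10))  # offset of '=' in the first test line
```

With the first test line this yields line 1, column 11 for offset 10, matching the "(at char 10), (line:1, col:11)" message.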

bernd-wechner avatar Jun 01 '24 23:06 bernd-wechner

Pyparsing does not automatically group tokens, so that you can assemble your parser a bit at a time and still get a simple linear collection of matched parsed results. When you use results names, pyparsing's default behavior is similar to the standard Python dict behavior: if the same key is assigned to multiple times, then the last assignment overwrites any prior one. Since Located uses results names to capture the start and end locations, multiple Locateds in the same expression will do this same thing.

The solution when using the same results name multiple times is to wrap the expression in Group. This will keep the different values separate.

Please look at the following example that shows the effects of using results names and Locateds with and without Groups:

import pyparsing as pp

name_word = pp.Word(pp.alphas).set_name("name_word")
first_name = name_word("first")
second_name = name_word("second")

for name in (
    first_name + second_name,
    pp.Located(first_name + second_name),
    pp.Group(first_name + second_name),
    pp.Group(pp.Located(first_name + second_name)),
):
    print(name)
    # parse 0 or more names (uses [...] notation, this is equivalent to 
    # ZeroOrMore, but less typing and fewer imports)
    names = name[...]

    names.run_tests("""\
        Paul McGuire
        Bernd Wechner
        Paul McGuire Bernd Wechner
        """)
    print()
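
The dict comparison can also be seen without pyparsing at all. The following is a plain-Python analogy only: it uses ordinary dicts and a hand-written list of (name, value) pairs as a stand-in for named results, so it illustrates the overwrite-versus-group idea rather than ParseResults itself.

```python
# Stand-in for the named results produced by parsing
# "Paul McGuire Bernd Wechner" (assumed data, written by hand).
pairs = [
    ("first", "Paul"), ("second", "McGuire"),
    ("first", "Bernd"), ("second", "Wechner"),
]

# Ungrouped: one flat namespace, so later assignments overwrite earlier ones.
flat = {}
for key, value in pairs:
    flat[key] = value

# Grouped: each person gets its own namespace, so nothing is overwritten.
grouped = [dict(pairs[i:i + 2]) for i in range(0, len(pairs), 2)]

print(flat)
print(grouped)
```

In the flat dict only the last person survives under "first" and "second", while the grouped list keeps one dict per person.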

I did notice that your prefix consists of 2 tabs. By default pyparsing detabs the input string before parsing it. Call parse_with_tabs() to disable this.
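
To see why detabbing matters for loc, here is a small plain-Python sketch. It assumes the detabbing behaves like Python's str.expandtabs() with the default tab size of 8, and the sample string is made up for illustration.

```python
raw = "name\tvalue"
expanded = raw.expandtabs()  # tab stops every 8 columns by default

raw_index = raw.index("value")            # offset in the original string
expanded_index = expanded.index("value")  # offset after tab expansion

print(raw_index, expanded_index)
```

An offset computed against the expanded string (8 here) no longer lines up with the original tabbed string (5 here), so slicing the original string with it would return the wrong text.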

I think I'll add an option to run_tests to display the input string using with_line_numbers. Note that most pyparsing exception results messages are 1-based, so with_line_numbers also marks using 1-based.

ptmcg avatar Jul 24 '24 18:07 ptmcg

No time right now to explore this, but up front a big thanks for getting back on this with something to chew on. I'll take a closer look when time permits.

bernd-wechner avatar Jul 24 '24 23:07 bernd-wechner