parsley icon indicating copy to clipboard operation
parsley copied to clipboard

Request: Parsed structures note their locations in the input text

Open davidsantiago opened this issue 12 years ago • 7 comments

As far as I can tell, and I'm embarrassed to admit that I'm quite over my head in your amazing sci-fi incremental parser code, there's no way you can make parsley generate the parse tree so that it notes the location in the text of each production. For example, it would be nice to be able to have each node point to the location of the first and last char that got parsed into that production. Or even just the first character's location. I don't think it's possible to do this with make-node, as that information does not get passed into the arguments.

Is this something that could be done to the parser? I'm not sure if something about its incremental parsing would prevent this.

davidsantiago avatar Jun 01 '12 18:06 davidsantiago

Hi,

Because of incrementalism a node can't know it's start/end offset in the character stream -- if it could, an insertion would require to update this piece of data in all nodes to the right of the insertion. However there's an alternative design: each node store the relative end offsets of its children. Thus given a global offset you can start from the root and recurse your way up (at each step you find the children which contains the offset, subtract its start offset to get the remaining offset relative to it and recurse).

Assuming your lreaves are just strings, something like that should work.

(defn length [node]
  (if (string? node)
    (count node)
    (peek (:offsets node))))

(defn make-node [tag content]
  {:tag tag :content content 
   :offsets (vec (reductions + (map length content)))})

(untested code)

Does it help?

cgrand avatar Jun 01 '12 19:06 cgrand

Yeah, I was kinda thinking it could do that update on everything to the right when it needed to, but I think you are also aiming to keep a logarithmic runtime on the incremental parsing, so that would ruin it. I had not thought of the scheme you are suggesting, because I thought that the content nodes would possibly throw away information during the parse, especially with the - suffix. If the parse tree can always be used to recreate the input, then something like this would definitely suit me. I'll try it out and see if it works.

davidsantiago avatar Jun 01 '12 20:06 davidsantiago

Parsley keeps all characters. The - suffix only instructs the parser to not create a node and to inline its children in its parent node -- so only inner nodes are elided, but not their children which are reparented.

cgrand avatar Jun 01 '12 20:06 cgrand

I see. Sorry for the false issue, I only took a real look at parsley this morning, and it's different from what I'm used to in a number of ways, so I haven't digested it all yet. I do think it might be nice to have that offsets scheme you mentioned built in, or more readily available to users at least, so I'll leave the issue open if you want to talk about doc updates or code tweaks related to that, or you can go ahead and close it if not.

davidsantiago avatar Jun 01 '12 20:06 davidsantiago

Well, offsets are a special case of what I call "views" (aggregated properties on nodes/leaves) which I'm at last working on at the moment (because of sjacket) -- I helped @laurentpetit to introduce them in an ad hoc manner in paredit.clj but now they are going to be an existing facility of parsley.

cgrand avatar Jun 02 '12 09:06 cgrand

Oh, excellent. I was thinking it'd be a shame if something like offsets was something that people had to reinvent/discover on their own every time someone needed it, but it sounds like you're working on a more general facility.

davidsantiago avatar Jun 02 '12 09:06 davidsantiago

finding a node by its offset: https://github.com/cgrand/parsley/blob/master/test/net/cgrand/parsley/test.clj#L91 Rignt now only functional trees support views but regular trees will support them too.

cgrand avatar Jun 04 '12 11:06 cgrand