org-parser: What should the final transformed parse structure look like?

Open schoettl opened this issue 3 years ago • 6 comments

Hi @branch14 @munen,

According to the worg spec, the (transformed) parse tree should look like this:

(document
 (section)
 (headline
  (section)
  (headline)
  (headline
   (headline))))

As far as I remember, this is different from the Organice parser:

The Organice parser keeps a flat list of headlines
The Organice parser does not allow a section (aka "content") above the first headline

I suggest that we stick to the orgmode spec, i.e. allow a section above the first headline and keep a hierarchical structure of headlines. document would then be our start symbol (S).

That could be implemented in the transformers in PR #27.

It will be more work later to integrate org-parser into Organice, but we get a general orgmode parser :-)

May 13 '21 17:05 schoettl

On the other hand, a flat list of headlines and sections seems to be very pragmatic. It makes it easier to change headline level and order.

[[:section]
 [:headline]
 [:section]
 [:headline]
 …
 ]

But how would we keep a headline and its section together here? That is not pragmatic. A section should belong to its headline.

How about this?

[[:section]
 [:headline
  [:section]]
 [:headline
  [:section]]
 …
 ]

May 13 '21 17:05 schoettl

A couple of months ago, @munen and I had a look at what organice expects as a data structure.

From what I recall, and based on the discussion in #27, I want to suggest the following (at least for depth 1 and 2):

{;; "In-buffer Settings", see https://orgmode.org/manual/In_002dbuffer-Settings.html
 :settings ...
 ;; Let's call text before the first headline the preamble. As each headline introduces a
 ;; new section the content before the first headline is a section that does not belong
 ;; to any headline.
 :preamble
 {:section {:raw ...
            :ast ...}}
 ;; a flat list of headlines with their associated sections
 :headlines
 [{:headline {:level 1
              :title "hello world"
              ...}
   :section
   {:raw "this is the first section\nthis line has *bold text*\n"
    :ast [[:text [:text-normal "this is the first section"]]
          [:text
           [:text-normal "this line has "]
           [:text-styled
            [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}
  ...]}

May 17 '21 07:05 branch14

branch14 and I just double-checked this suggestion. It looks fine to me.

As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need/want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference what org-parser actually provides.
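
To illustrate the "simple reduce" mentioned above, here is a minimal sketch. It assumes headline entries of the shape proposed earlier in this thread (`{:headline {:level n ...} :section ...}`). The names `nest-headlines` and `attach` and the `:children` key are made up for this example and are not part of org-parser.

```clojure
;; Sketch only: `nest-headlines`, `attach` and :children are hypothetical.
;; Input entries look like {:headline {:level 1 :title "..."} :section {...}}.

(defn- attach
  "Attach `entry` below the last node in `nodes` if that node has a lower
  level (recursively, so it lands at the deepest fitting position);
  otherwise append it as a sibling."
  [nodes entry]
  (let [parent (peek nodes)]
    (if (and parent
             (< (get-in parent [:headline :level])
                (get-in entry [:headline :level])))
      (conj (pop nodes) (update parent :children attach entry))
      (conj nodes entry))))

(defn nest-headlines
  "Turn a flat vector of headline entries into a tree of entries."
  [headlines]
  (reduce attach [] (map #(assoc % :children []) headlines)))

;; Example input with levels 1, 2, 2, 3, 1
(def flat
  [{:headline {:level 1 :title "a"}}
   {:headline {:level 2 :title "b"}}
   {:headline {:level 2 :title "c"}}
   {:headline {:level 3 :title "d"}}
   {:headline {:level 1 :title "e"}}])

(def tree (nest-headlines flat))
```

In this sketch "a" and "e" end up at the top level, "b" and "c" become children of "a", and "d" becomes a child of "c". Going the other way (nested to flat) would be a depth-first walk, so consumers are not locked in by either choice.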

May 17 '21 08:05 munen

> branch14 and I just double-checked this suggestion. It looks fine to me.
>
> As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need/want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference what org-parser actually provides.

I agree. A flat list for headlines is fine. And having the same structure (:section) in the preamble is good.

It's possible that, for some element transformations, we should keep the transformed "sub-ast" and a "sub-raw" form. Or maybe better a pair of indexes pointing to the position in the section raw string? Anyway, it might make sense in some cases, to allow re-export without discarding whitespace.
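
A rough sketch of the "pair of indexes" idea, to make it concrete. The `:span` key, the node shape and the `raw-of` helper are assumptions for illustration only. Nothing like this exists in org-parser yet.

```clojure
;; Sketch only: :span and `raw-of` are hypothetical names.
;; Each node stores [start end) offsets into the section's raw string
;; instead of carrying its own copy of the raw text.

(def section
  {:raw "- item one\n  - nested item  \n"
   :ast [{:type :list-item-line :text "item one"    :span [0 10]}
         {:type :list-item-line :text "nested item" :span [11 28]}]})

(defn raw-of
  "Return the exact source text of `node`, including any leading or
  trailing whitespace that the transformed AST no longer contains."
  [section node]
  (let [[start end] (:span node)]
    (subs (:raw section) start end)))

(def second-line (raw-of section (second (:ast section))))
```

Here `second-line` still has the two leading and two trailing spaces of the nested item, although `:text` does not. The cost is that every edit to `:raw` would invalidate the offsets, so this fits read-only or re-parse-after-edit use better than in-place editing.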

May 17 '21 10:05 schoettl

> It's possible that, for some element transformations, we should keep the transformed "sub-ast" and a "sub-raw" form. Or maybe better a pair of indexes pointing to the position in the section raw string? Anyway, it might make sense in some cases, to allow re-export without discarding whitespace.

In order to preserve whitespace we should either (a) include whitespace in the parsed text or (b) retain whitespace in the AST as we do with empty lines. (a) is how it is currently done.

Example input: *bold text* text

Example ast (a): [:text [:text-sty-bold "bold text"] [:text-normal " text"]]

Example ast (b): [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]]
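
To check that both variants really preserve the whitespace, here is a small sketch that serializes the two example ASTs back to Org markup. `render` is a hypothetical helper written only for this comment. It handles just the node types used in the examples and assumes bold is always written with `*`.

```clojure
;; Sketch only: `render` is hypothetical, not org-parser's API.
(defn render
  "Serialize a tiny inline AST (hiccup-style vectors) back to Org markup."
  [[tag & children]]
  (case tag
    :text                      (apply str (map render children))
    :text-sty-bold             (str "*" (first children) "*")
    (:text-normal :whitespace) (first children)))

(def input "*bold text* text")

(def ast-a [:text [:text-sty-bold "bold text"] [:text-normal " text"]])
(def ast-b [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]])

;; Both variants should reproduce the input exactly.
(def round-trips? (= input (render ast-a) (render ast-b)))
```

So for inline text the two options are equivalent for re-export. The difference is only whether consumers have to handle an extra `:whitespace` node.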

@schoettl Do you have examples for "discarding" whitespace?

Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.

May 27 '21 07:05 branch14

If you search the EBNF for the regex \bs\b, you will find the places where whitespace is parsed but not stored in the AST. It's mostly leading or trailing whitespace. Leading whitespace can often be discarded and recomputed for export/rendering. Trailing whitespace can often be omitted.

list-item-line currently has problems because leading whitespace is discarded, but it is significant in nested lists.
block-begin-line discards whitespace between the begin marker and its arguments.
headline discards whitespace between its components.
clock lines discard whitespace between their components.

I've gone through them and I think that only list-item-line must be fixed. For the rest, we can discard the whitespace or just recompute it for export. Verbatim blocks are already parsed verbatim, i.e. without discarding trailing whitespace.

> Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.

Maybe the instaparse meta information about position/span can still be used in the resulting transformed structure? Then we don't need any additional raw values and can still provide all the original information.

May 27 '21 12:05 schoettl
