Support \n etc. more easily

Open erikrose opened this issue 11 years ago • 10 comments

It's awkward to express LFs, CRs, etc. in grammars, because in a non-raw string Python replaces the escapes with the actual characters, which are no-ops in the grammar. It works in the grammar DSL's grammar because they're wrapped in regexes, but that shouldn't be required. Ford's original PEG grammar supports the escapes \n, \r, \t, \', \", \[, \], and \\, plus some octal numerics. We should probably go that way.
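To make the awkwardness concrete, here is a minimal sketch using plain Python strings only (no parsimonious calls; the rule name `newline` is made up for illustration). It shows that a non-raw string hands the grammar an actual line feed, while a raw string preserves the two-character escape, which only helps if something downstream, such as a regex, interprets it.

```python
# Hypothetical one-rule grammar text, written two ways.
plain = 'newline = "\n"'   # Python converts \n to a real line feed here
raw = r'newline = "\n"'    # raw string: the backslash and the 'n' survive

# repr() makes the difference visible.
print(repr(plain))
print(repr(raw))

# The raw version is one character longer: backslash + n instead of LF.
print(len(plain), len(raw))
```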

erikrose, Aug 06 '14 18:08

You mean just go with Ford's grammar?

But come on, you will end up reinventing it anyway. Just like it was with / precedence.

keleshev, Aug 17 '14 20:08

Yep, I want to have Ford's, or at least a superset of it.

erikrose, Aug 18 '14 01:08

:+1:

keleshev, Aug 19 '14 11:08

Is there a workaround for parsing newlines that is better than just escaping the newline character?

JamesPHoughton, Feb 11 '15 00:02

There might be some escaping dance you can do to get it into a Literal, or you can do what I do in grammar.py and stick it in a regex:

comment = ~r"#[^\r\n]*"
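As a rough illustration of why the regex route works: inside a raw string the \r and \n reach the regex engine untouched, and the engine interprets them itself. The sketch below uses only Python's standard re module rather than parsimonious, on the assumption that the pattern behaves the same there; the sample input is invented.

```python
import re

# Same pattern text as in the grammar rule above, compiled directly.
COMMENT = re.compile(r"#[^\r\n]*")

# Invented sample with LF and CRLF line endings.
source = "x = 1  # trailing comment\ny = 2\r\n# another\n"

# Each match runs from '#' up to, but not including, the line ending.
comments = COMMENT.findall(source)
print(comments)
```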

erikrose, Feb 12 '15 06:02

What is the current recommended way to match \n?

timlyo, May 05 '16 10:05

After much fooling around I was able to get C-style multiline comments working with the following:

        comment = ws* ~r"/\*.*?\*/"s ws*
        ws = ~r"\s*"i 

Is there an easier way?

Michael-F-Ellis, Oct 25 '17 19:10

That looks correct and concise. You could probably make it faster by using inverted character classes. In general, non-greedy quantifiers like *? are slow because they create a lot of backtracking. Instead you could try something like this (which matches double-quoted strings with backslash escapes) for speed:

~"u?r?\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""is

Sorry about all the backslashes. Anyway, notice how I scan quickly ahead for anything that couldn't possibly be an ending quote or a backslash, using [^\"\\\\]*, then go looking for actual special things with the (?:\\\\.[^\"\\\\]*)*. Of course, it's not nearly as readable as your spelling.
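The same scan-ahead idea can be applied to the C-style comment rule from earlier in the thread. The following is a sketch with the standard re module, not something taken from parsimonious: UNROLLED is one common way to spell the pattern with inverted character classes, and STRING is a simplified form of the pattern above (without the u?r? prefix), written as a raw string so the backslashes are not doubled. Whether either is actually faster on a given input is worth measuring.

```python
import re

# Non-greedy spelling from the thread; re.S lets '.' cross newlines.
LAZY = re.compile(r"/\*.*?\*/", re.S)

# Inverted-class spelling: skip non-stars in bulk, then decide at each run
# of stars whether the comment ends. Negated classes already match newlines.
UNROLLED = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

source = "a /* one */ b /* two\nlines */ c /**/ d /* stars ** inside **/"
found_lazy = LAZY.findall(source)
found_unrolled = UNROLLED.findall(source)
print(found_lazy == found_unrolled)

# Double-quoted string with backslash escapes, same scan-ahead structure.
STRING = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.S)
quoted = STRING.match(r'"say \"hi\"" rest').group()
print(quoted)
```

If the patterns carry over unchanged, the comment rule could presumably be written as comment = ws* ~r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/" ws* in a raw-string grammar, but that is an untested assumption.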

erikrose, Oct 27 '17 00:10

Thanks, that's definitely worth knowing. I did some benchmarking to see how much comments are costing in processing time.

I started with an 85-measure bass part I'd recently transcribed, in which comments made up 38% of the total characters in the file. I made it into two larger benchmark files, one with and one without comments, by replicating the original 20 times. So that's 1700 measures of music, more or less equivalent to a full score, in all parts, for a small orchestral movement.

$ wc benchmark.tbn nocommentbenchmark.tbn
    1342   13132   49229 benchmark.tbn
     880    8760   30400 nocommentbenchmark.tbn

The processing time, including midi file creation, on my 2012 Mac Mini was ~6.5 seconds in either case. That's about 4 ms per measure. The processing overhead for the comments was just over 2%. I think I can live with that :-)

$ time tbon -q nocommentbenchmark.tbn
Processing nocommentbenchmark.tbn
Created nocommentbenchmark.mid

real	0m6.572s
user	0m6.405s
sys	0m0.163s

$ time tbon -q benchmark.tbn
Processing benchmark.tbn
Created benchmark.mid

real	0m6.717s
user	0m6.547s
sys	0m0.166s
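The figures quoted above can be re-derived from the wc and time output. The short calculation below uses only numbers shown in this comment; the variable names are just for illustration.

```python
# Wall-clock ("real") times from the two runs, in seconds.
with_comments_s = 6.717
without_comments_s = 6.572

# Character counts from wc, and the measure count (85 measures x 20 copies).
chars_with = 49229
chars_without = 30400
measures = 85 * 20

# Share of the file taken up by comments.
comment_share_pct = (chars_with - chars_without) / chars_with * 100

# Extra time spent when comments are present, relative to the run without.
overhead_pct = (with_comments_s - without_comments_s) / without_comments_s * 100

# Average processing time per measure for the file with comments.
ms_per_measure = with_comments_s / measures * 1000

print(f"comments: {comment_share_pct:.1f}% of characters")
print(f"overhead: {overhead_pct:.1f}%")
print(f"per measure: {ms_per_measure:.2f} ms")
```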

Michael-F-Ellis, Oct 27 '17 15:10

Great! Benchmarking is always the best answer. :-)

erikrose, Nov 04 '17 15:11