code-maat

Cannot work with newlines with latest parsing update

Open LogicalChaos opened this issue 10 years ago • 7 comments

I forked to implement a Perforce parser. I have it completed, but when I went to merge with your latest which splits the parsing into chunks, I'm unable to get it functioning again. I've found if I remove all newlines except those between change sets, I can get it parsing again. But, that involves quite a bit of data massaging to clean up the perforce log. Do you have any suggestions? If you look on my perforce branch, you can see my grammar.

LogicalChaos avatar Jan 22 '15 15:01 LogicalChaos

A Perforce parser sounds cool and would definitely be a good addition. First some background on my latest change: Instaparse is quite memory hungry. When I parsed the complete log in one pass, Code Maat ran out of memory on larger logfiles. That's why I chose to split the log into smaller parts and feed those to Instaparse one by one (you'd run into the same problem with your current Perforce parser).

I see that you re-used the hiccup-based-parser. As you probably noticed, that's the one that does the chunking. In the current version, I split the log on each blank line (see function extend-when-complete). That works fine for both Git and Mercurial, which don't have any blank lines within their entries. But it won't work for Perforce, which includes several blank lines in each entry. I'd suggest that you identify a different criterion that's capable of identifying the end of a Perforce entry. Then you have to parameterize the hiccup-based-parser with that criterion (end-of-log-entry? perhaps). Does that sound reasonable?
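To make the suggestion concrete, here is a minimal sketch of chunking driven by a pluggable criterion instead of blank lines. It is written in Python for illustration only and is not Code Maat's Clojure implementation. The names chunk_entries and is_change_header are hypothetical, and the sketch detects the start of the next entry rather than the end of the current one. It also assumes that each Perforce entry begins with a "Change <number> by ..." header line.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Assumption: every Perforce entry starts with a header such as
# "Change 101 by alice@ws on 2015/01/22 15:01:00".
CHANGE_HEADER = re.compile(r"^Change \d+ by ")


def is_change_header(line: str) -> bool:
    """Hypothetical criterion: True when a line opens a new Perforce entry."""
    return CHANGE_HEADER.match(line) is not None


def chunk_entries(lines: Iterable[str],
                  starts_entry: Callable[[str], bool]) -> Iterator[List[str]]:
    """Yield one list of lines per log entry.

    A new chunk begins at every line accepted by starts_entry, so blank
    lines inside an entry stay within that entry. Only one entry is held
    in memory at a time, which is the point of chunking.
    """
    current: List[str] = []
    for line in lines:
        if starts_entry(line) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


sample = [
    "Change 101 by alice@ws on 2015/01/22 15:01:00",
    "",
    "\tFix parser",
    "",
    "Affected files ...",
    "",
    "... //depot/project/Command.cpp#9 edit",
    "",
    "Change 102 by bob@ws on 2015/01/23 10:00:00",
    "",
    "\tAdd tests",
]
entries = list(chunk_entries(sample, is_change_header))
for entry in entries:
    print(len(entry), entry[0])
```

Passing the criterion in as a function keeps the chunker itself independent of the version-control system: a Git or Mercurial caller could supply a blank-line based predicate instead.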

adamtornhill avatar Jan 22 '15 21:01 adamtornhill

What you said makes perfect sense, but is beyond me :-) I changed the log generation to make it consistent with the others regarding blank lines: ... | xargs -I commitid -n1 sh -c 'p4 describe -s commitid | grep -v "^\s*$" && echo ""'. If you're up for it (I'd need major help), I'd like to add churn capabilities. Perforce's output adds the following to each change set described.

Differences ...
==== //depot/project/Command.cpp#9 (text) ====
add 1 chunks 10 lines
deleted 0 chunks 0 lines
changed 0 chunks 0 / 0 lines

Thoughts?
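As a starting point for discussion, here is a rough sketch of how the "Differences ..." section shown above could be turned into per-file churn numbers. It is in Python for illustration rather than an Instaparse grammar, and the function name parse_churn and the result keys are hypothetical. The line formats are taken only from the sample above. Reading the two numbers in "changed ... 0 / 0 lines" as line counts before and after the change is an assumption.

```python
import re
from typing import Dict, Iterable

# Patterns derived only from the sample output quoted above.
FILE_RE = re.compile(r"^==== (?P<path>.+)#(?P<rev>\d+) \((?P<type>[^)]*)\) ====$")
ADD_DEL_RE = re.compile(r"^(?P<kind>add|deleted) \d+ chunks (?P<lines>\d+) lines$")
CHANGED_RE = re.compile(
    r"^changed \d+ chunks (?P<before>\d+) / (?P<after>\d+) lines$")


def parse_churn(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Collect line counts per depot path from a 'Differences ...' section."""
    churn: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        match = FILE_RE.match(line)
        if match:
            # A "==== path#rev (type) ====" line starts the stats for one file.
            current = churn.setdefault(
                match.group("path"),
                {"added": 0, "deleted": 0,
                 "changed_before": 0, "changed_after": 0})
            continue
        if current is None:
            continue  # skip everything before the first file header
        match = ADD_DEL_RE.match(line)
        if match:
            key = "added" if match.group("kind") == "add" else "deleted"
            current[key] += int(match.group("lines"))
            continue
        match = CHANGED_RE.match(line)
        if match:
            # Assumption: "N / M" means N lines before and M lines after.
            current["changed_before"] += int(match.group("before"))
            current["changed_after"] += int(match.group("after"))
    return churn


sample = [
    "Differences ...",
    "==== //depot/project/Command.cpp#9 (text) ====",
    "add 1 chunks 10 lines",
    "deleted 0 chunks 0 lines",
    "changed 0 chunks 0 / 0 lines",
]
result = parse_churn(sample)
print(result)
```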

LogicalChaos avatar Jan 23 '15 19:01 LogicalChaos

Hmmm... With the new parser, I'm getting Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, which is obviously different from the previous OOM problems. This is with change logs that failed previously. I'm looking to see if I can pinpoint what in the data is causing this. I suspect it's a change set with ~26000 files associated with it from a copy/merge operation.

LogicalChaos avatar Jan 23 '15 20:01 LogicalChaos

I've seen the GC overhead limit exception as well on the earlier version of Code Maat before the memory optimization. Did you manage to get the chunking working now? That should solve this issue as well. I'll have a look at your pull request during next week - thanks for the contribution!

adamtornhill avatar Jan 24 '15 16:01 adamtornhill

Yes, I got the chunking working. The problem occurs with a log of ~1400 change lists (35k lines) when any individual change list goes over ~50 files. I can privately send you a problem file if you want.

LogicalChaos avatar Jan 24 '15 18:01 LogicalChaos

Yes, please do that and I'll have a look. You can contact me at adam at adamtornhill dot com

adamtornhill avatar Jan 25 '15 16:01 adamtornhill

I've sent two files, ~1MB compressed total.

LogicalChaos avatar Jan 25 '15 23:01 LogicalChaos