To do this, I reached for the sensible first tool--Damian Conway's oft-celebrated Parse::RecDescent module. I'd considered lex and yacc, but that would require either doing the entire compiler in C (a language I really loathe for text processing) or write a two-pass thing with the yacc compiler emitting some intermediate tree and then write a perl program to transform it to perl code. Yacc would be faster, but since this is a one-off thing for each source module it doesn't (at least right now) seem worth it. Besides, whatever grammar I come up with for P::RD will more or less be usable for yacc, so there's not much to lose going this way to start.
Anyway, I'm not a parser guy--parsing languages involves syntax, and the long-standing joke is that I just don't do syntax. Still, you do what you need to, so I sat down with O'Reilly's "lex & yacc" book and the (quite excellent) docs for P::RD a few evenings to get the feel for grammars, and started in.
Now, P::RD, and recursive descent parsers in general, have a repuatation as being slow, or at least slower than their alternatives. I knew that going in, and my first succesful run against the largest source file I had handy (250K of source text) seemed to bear that out. I wrote a test driver program that takes the grammar filename and source filenames as parameters, does a little preprocessing (throwing out blank lines and combining lines with continuation characters), parses the file building a tree automatically using P::RD's autotree directive, and dumps that tree to disk with Data::Dumper. A first pass at parsing this file took 7 minutes 44 seconds of user time. Not good. (The dump, FWIW, takes about 4 seconds of user time) Time to optimize, and herein lies the tale.
The base grammar for this language was very simple. I went through the manual and did a straightforward translation from what was valid syntax to a P::RD grammar. Mostly constant text, and a few alternatives for some of the commands and statements. In those cases where there are alternatives, since P::RD does first-match not longest-match, I went with the:
rule: foo bar baz |
foo bar
rather than the more terse
rule: foo bar maybebaz
maybebaz: baz | #empty thing
for clarity. Sections are in the order they're in the reference manual
for the language, and commands for each section are in alphabetically,
as that's the way they're in the manual.
Since nearly 8 minutes of user time was far too long, it was time to optimize the grammar to see what I could see. Since not running a rule is faster than running it, I was shooting to reduce the number of failed rules that had to be tried before we hit the successful one. The assumption through this is that this monster program, the second-largest in the set, would have the closest to average frequency distribution of constructs and statements, so tweaking for it would be OK for everything else. (Something that's not been tested yet, but seems a good assumption)
First off, I changed the comment rule. It was /\..*$/, so
I changed it to '.' /.*$/ as that was a simple thing, and
I figured that comments didn't really warrant much time. That, alas,
took the runtime up, to 7:56. Bad, definitely bad.
The next thing to try was some reordering of the section rules, to see if putting the most commonly seen section first would help. That helped, though less than you might expect, taking the runtime down to 7:55. Not much help.
Next a pass through the source to look for more reordering, at which point a bug was found--I was using the same rule name twice, for different things. I hadn't caught that since I'm not doing anything with the output yet. (The current assumption is that if it parses properly then the output is sufficiently OK to make it to the compilation part, which isn't written) This may or may not have been generating bogus output, but fixing it took the runtime down to 7:36. Better, but still horrid.
The single most common command in the source is assignment, so I moved the assignment rule to the head of the command list. That took the runtime down to 6:54. Woohoo!
In this language, there's no "assignment" statement as such, there's
really the let statement--let foo = bar except that the
let is optional. The rule was an alternation of let variable '='
expr and variable '=' expr. Splitting that into
two rules, assignment and let assignment, cut the runtime down to
6:14. (I told you there's a lot of assignment going on...)
Still, that's slow. Bleah. Since this language is pretty simple, it was possible to do a quick transform of the source, whip it through sort, cut, grep, and perl, to come up with a frequency count of all the keywords in the program. (Yes, it could've all been done in perl, but pipelines took less time to throw together for this) The frequency distribution was not alphabetic, so the alpha listing of the commands was suboptimal. Moving the 10 most frequently done things (assignment was #1, follwed by if, goto, and gosub) to the head fo the command list took the runtime down to 5:17. Almost a full minute off the previous time--not bad at all.
But not good enough. I like fast, and 5 minutes is just not fast. So, taking the optimizing advice of the P::RD docs and FAQ, I turned some of my nice clear rules to less nice rules that failed less. (See those two rules at the beginning of this post? I went from the clear to terse version) A good pass on this took the runtime down to 4:40. Terse and less maintainable for a 10% speedup--works for me, especially since the language grammar's been fixed for a decade or so.
In that simplification, I didn't address the 'ifelse' rule. Like most other languages, you can do either "if expr then statement else statement" or "if expr then statement" and lose the else. The rule was:
ifelse: 'if' expr 'then' statement 'else 'statement' |
'if' expr 'then' statement
changing that to
ifelse: 'if' expr 'then' statement elseclausecut the parse time down to 2:55. We do a lot of else-less ifs, apparently. Yay, much faster.elseclause: 'else' statement | # Or nothing, really
Can we go faster still? Yes, why yes we can. The one last big alternation rule that was in the grammar was the multistatement rule. It was
statement: command ':' statement |
command
That made sense, but changin it to:
statement: command colonstatementinstead, skipping the expensive alternation, cut runtime down to 1:23. So we ultimately went from 463 seconds to 83 seconds, cutting 82% off the time to parse with some relatively simple fixups. (And, for the record, putting the rules back in alphabetical order, sections as they appear in the manual, raises the runtime to 1:49, which is a pretty big jump)colonstatement: ':' statement | # or nothing
What does this tell us?
Well, first, failure is expensive. I know, I know, Duh!, but it bears repeating. Failure shows up both in rules that actually fail, and rules that succeed but after one or more alternate versions fail first. Scrutinize those rules! Make them as cheap as possible, since odds are (Unless you have a really skewed frequency distribution) that there will be far more failures than successes. Order the rules so you've got the best chance of actually hitting the correct rule first, up to a point. (If you have a rule that succeeds, say, 10% of the time but is horribly expensive to run, you may find it faster to put more likely to fail, but far cheaper, rules before it) Factor out common prefixes (like the "if expr then command" part of the ifelse rule, which is valid for both the if/then and if/then/else forms)
Don't accept lousy performance without a fight! I spent two days fiddling, and cut what was potentially 4.5 hours of user time per pass over the tree down to 48 minutes of user time. Or, if you prefer wall time, 9 hours to 1.5 hours, figuring about 50% CPU utilization. That only needs the equivalent of one and a half full passes over the source tree (and, as I don't have the actual compiler part written, I expect far more than that in run time during development) to make up for the time I took to fix the grammar. (Like it or not, 8 hours of time in office does not equal 8 hours of full-on work, alas)
Not to mention the reduction in complaints from folks when a machine
they use is bogged down for days on end, as this is running on a
shared machine, and it's nice to be stingy polite with shared
resources. (Besides, the compiler phase will, no doubt, eat up quite a
lot of time itself...)
Posted by Dan at August 25, 2003 12:04 PM
| TrackBack (0)
Wow--that's so cool, I wish I could think of a way I could make use of it right now! I'll have to file it away for a rainy day :-).
Posted by: Buzz Andersen at August 25, 2003 12:33 PMCool. Another couple things you might try are:
(1) pushing more things into regexps. The regex engine is much faster than P::RD, so e.g. alternative keywords can be turned into regex alternation instead of a P::RD one.
(2) doing as little work in actions as possible, then constructing the parse tree afterwards. Any work done in actions is usually wasted, since you'll throw it away when backtracking.
(3) taking a look at the hacked P::RD in the Parrot distribution (including the $::RD_NO_HITEM option). Quite a bit of tweaking went into improving the generated code.
Posted by: Sean O'Rourke at August 25, 2003 01:01 PMI wonder if you could write a P::RD grammar to parse a P::RD grammar and factor out alternations into common prefixes and generate new rules? It seems as if a lot of what you did is, automatable.
Assuming you got that working (to the point where it was trustworthy) you could continue to maintain the 'easy' grammar and pass it through the preprocessing stage before it got used. I'm sure people would be inordinately grateful for such a tool.
Posted by: Piers Cawley at August 25, 2003 01:06 PMInterestingly, switcing from P::RD alternation to regex alternation makes things run more slowly. Strange, but true. I presume that it has something to do with the otherwise trivial nature of my rules, such that the overhead of setting up /o(n|ff)/ swamps the speed win over P::RD's alternation of 'on' | 'off'. Go figure.
I should go see where we sped things up in Parrot, and if they're generally applicable (it's been ages, and I don't recall) harass Damian until they're in the actual distribution.
The grammar itself doesn't actually do any sort of production outside of what autotree gives--I'd much rather process the tree from outside, rather than inside, even if the tree P::RD provides isn't as nice as I'd like it to be. (All hash-based, though I may be missing a thump to it somewhere)
Posted by: Dan at August 25, 2003 01:45 PMHi Dan,
Probably a great deal of poor performance originates in the fact that your grammar is not LR(1). That is, it has rules of the following form:
stuff: foo bar shazam
foo bar baz
Parsing with such grammars has extremely high performance costs, since recursive backtracking grows exponentially as the number of look-ahead tokens grows linearly (i.e. if your grammar is LR(2), it is toughly 4 times slower than an equivalent LR(1) grammar.
Posted by: Laurent Sabarthez at August 26, 2003 02:05 AMI can see how multitoken lookahead could be a speed issu, something I'd not much contemplated before. Pity, as I really like multitoken lookahead for grammars. (But, as I've said, I'm not a parser guy. This would be one reason why... :)
Posted by: Dan at August 26, 2003 08:42 AMOkay, here's the summary from the mail I sent to Damian way back. Most of these don't make a lot of sense without looking at the P::RD code, but it's mostly a matter of delaying computation of debug and error info.
- Expectation is gone. Instead, expectation state is kept by suitably
localized variables in the Parse::RecDescent namespace.
expectation_message() looks at these, and expectation_failed() is
completely inlined.
- $expectation->{defexpected} is computed only when creating an error
message, instead of up front.
- if @itempos is not needed, $thisline is not tied. Instead the
necessary bits and pieces are kept around, and tied together when an
error message is created (see Parse::RecDescent::linenum()).
- $::RD_NO_HITEM removes support for %item. This is only a minor
win but, not completely insignificant.
Instead of using yacc to generate a C parser, you could have used byacc-perl which can output the parser directly in perl.
I looked at several methods of generating my Convert::ASN1 module and found this to be the fastest, without resorting to C
You can get it from http://www.cpan.org/src/misc/perl-byacc1.8.2.tar.gz
Posted by: Graham Barr at September 7, 2003 06:06 AMDan, you could also check out ANTLR. I tinkered with it once and seem to remember claims that it does LR(k) grammars (k > 1) pretty well.
Posted by: Melvin at October 17, 2003 12:47 AM