To do this, I reached for the sensible first tool: Damian Conway's oft-celebrated Parse::RecDescent module. I'd considered lex and yacc, but that would have required either doing the entire compiler in C (a language I really loathe for text processing) or writing a two-pass system, with the yacc-built compiler emitting some intermediate tree and a perl program transforming that into perl code. Yacc would be faster, but since this is a one-off pass for each source module it doesn't (at least right now) seem worth it. Besides, whatever grammar I come up with for P::RD will be more or less usable for yacc, so there's not much to lose starting this way.
Anyway, I'm not a parser guy--parsing languages involves syntax, and the long-standing joke is that I just don't do syntax. Still, you do what you need to, so I sat down for a few evenings with O'Reilly's "lex & yacc" book and the (quite excellent) docs for P::RD to get a feel for grammars, and started in.
Now, P::RD, and recursive descent parsers in general, have a reputation for being slow, or at least slower than the alternatives. I knew that going in, and my first successful run against the largest source file I had handy (250K of source text) seemed to bear it out. I wrote a test driver program that takes the grammar filename and source filenames as parameters, does a little preprocessing (throwing out blank lines and combining lines with continuation characters), parses the file, building a tree automatically using P::RD's autotree directive, and dumps that tree to disk with Data::Dumper. A first pass at parsing this file took 7 minutes 44 seconds of user time. Not good. (The dump, FWIW, takes about 4 seconds of user time.) Time to optimize, and herein lies the tale.
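The driver itself isn't shown here; a minimal sketch of that sort of driver might look like the following. The filenames, the top-level rule name (`program`), and the continuation-character convention (a trailing backslash) are my assumptions, not details from the actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

my ($grammar_file, $source_file) = @ARGV;

# Slurp the grammar; the <autotree> directive makes P::RD build
# the parse tree for us without per-rule actions
my $grammar = do {
    local $/;
    open my $gfh, '<', $grammar_file or die "Can't open grammar: $!";
    <$gfh>;
};
my $parser = Parse::RecDescent->new("<autotree>\n$grammar")
    or die "Bad grammar\n";

# Preprocess: drop blank lines, join continuation lines
open my $sfh, '<', $source_file or die "Can't open source: $!";
my @lines;
while (my $line = <$sfh>) {
    chomp $line;
    next if $line =~ /^\s*$/;
    if (@lines and $lines[-1] =~ s/\\$//) {    # previous line continued
        $lines[-1] .= $line;
    }
    else {
        push @lines, $line;
    }
}

# 'program' is a guess at the grammar's start rule
my $tree = $parser->program(join "\n", @lines)
    or die "Parse failed\n";

# Dump the tree to disk for the (as yet unwritten) compiler pass
open my $out, '>', "$source_file.tree" or die "Can't write tree: $!";
print $out Dumper($tree);
```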
The base grammar for this language was very simple. I went through the manual and did a straightforward translation from what was valid syntax to a P::RD grammar. Mostly constant text, and a few alternatives for some of the commands and statements. In those cases where there are alternatives, since P::RD does first-match, not longest-match, I went with

    rule: foo bar baz | foo bar

rather than the more terse

    rule: foo bar maybebaz
    maybebaz: baz | # empty thing

for clarity. Sections are in the order they appear in the reference manual for the language, and the commands for each section are alphabetical, as that's the way they're listed in the manual.
Since nearly 8 minutes of user time was far too long, it was time to optimize the grammar and see what I could see. Since not running a rule is faster than running it, I was shooting to reduce the number of failed rules that had to be tried before hitting the successful one. The assumption throughout is that this monster program, the second-largest in the set, has the closest-to-average frequency distribution of constructs and statements, so tweaking for it would be OK for everything else. (Something that's not been tested yet, but it seems a good assumption.)
First off, I changed the comment rule to

    '.' /.*$/

as that was a simple thing, and I figured that comments didn't really warrant much time. That, alas, took the runtime up to 7:56. Bad, definitely bad.
The next thing to try was some reordering of the section rules, to see if putting the most commonly seen section first would help. It did, though far less than you might expect, taking the runtime down to 7:55. Not much help.
Next, a pass through the source to look for more reordering, at which point I found a bug: I was using the same rule name twice, for different things. I hadn't caught it since I'm not doing anything with the output yet. (The current assumption is that if the source parses properly, the output is sufficiently OK to make it to the compilation part, which isn't written.) This may or may not have been generating bogus output, but fixing it took the runtime down to 7:36. Better, but still horrid.
The single most common command in the source is assignment, so I moved the assignment rule to the head of the command list. That took the runtime down to 6:54. Woohoo!
In this language, there's no "assignment" statement as such; there's really the let statement--

    let foo = bar

--except that the let is optional. The rule was an alternation:

    let variable '=' expr | variable '=' expr

Splitting that into two rules, assignment and let-assignment, cut the runtime down to 6:14. (I told you there's a lot of assignment going on...)
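A sketch of what that split might look like, with rule names that are my guesses rather than the actual grammar:

    command: assignment | letassignment | ...

    assignment:    variable '=' expr
    letassignment: 'let' variable '=' expr

Each rule now has a single production, so a plain assignment no longer pays for a failed match against the 'let' form inside the rule itself.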
Still, that's slow. Bleah. Since this language is pretty simple, it was possible to do a quick transform of the source, whip it through sort, cut, grep, and perl, and come up with a frequency count of all the keywords in the program. (Yes, it could've all been done in perl, but pipelines took less time to throw together for this.) The frequency distribution was not alphabetic, so the alphabetical listing of the commands was suboptimal. Moving the 10 most frequent constructs (assignment was #1, followed by if, goto, and gosub) to the head of the command list took the runtime down to 5:17. Almost a full minute off the previous time--not bad at all.
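The post doesn't show the actual pipeline; a minimal sketch of that kind of keyword count, assuming one statement per line with the keyword first (the sample file and its layout are invented here):

```shell
# Tiny sample source so the sketch is self-contained; the real source
# files and their exact layout are assumptions.
printf 'let a = 1\nif a then goto top\nlet b = 2\ngosub sub1\n' > prog.src

# First word of each line, counted, most frequent first
cut -d' ' -f1 prog.src | sort | uniq -c | sort -rn
```

The output is the command list's real frequency distribution, ready to be compared against the grammar's rule ordering.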
But not good enough. I like fast, and 5 minutes is just not fast. So, taking the optimizing advice of the P::RD docs and FAQ, I turned some of my nice, clear rules into less nice rules that failed less. (See those two rules at the beginning of this post? I went from the clear version to the terse one.) A good pass at this took the runtime down to 4:40. Terse and less maintainable for a 10% speedup--works for me, especially since the language grammar's been fixed for a decade or so.
In that simplification, I didn't address the ifelse rule. Like most other languages, this one lets you write either "if expr then statement else statement" or "if expr then statement", dropping the else. The rule was:

    ifelse: 'if' expr 'then' statement 'else' statement
          | 'if' expr 'then' statement

Changing that to

    ifelse: 'if' expr 'then' statement elseclause
    elseclause: 'else' statement | # Or nothing, really

cut the parse time down to 2:55. We do a lot of else-less ifs, apparently. Yay, much faster.
Can we go faster still? Why yes, yes we can. The one last big alternation rule in the grammar was the multistatement rule. It was

    statement: command ':' statement | command

That made sense, but changing it to

    statement: command colonstatement
    colonstatement: ':' statement | # or nothing

instead, skipping the expensive alternation, cut the runtime down to 1:23. So we ultimately went from 463 seconds to 83 seconds, cutting 82% off the parse time with some relatively simple fixups. (And, for the record, putting the rules back in alphabetical order, with sections as they appear in the manual, raises the runtime to 1:49, which is a pretty big jump.)
What does this tell us?
Well, first: failure is expensive. I know, I know--Duh!--but it bears repeating. Failure shows up both in rules that actually fail, and in rules that succeed only after one or more alternatives fail first. Scrutinize those rules! Make them as cheap as possible, since the odds are (unless you have a really skewed frequency distribution) that there will be far more failures than successes. Order the rules so you've got the best chance of hitting the correct rule first, up to a point. (If you have a rule that succeeds, say, 10% of the time but is horribly expensive to run, you may find it faster to put more-likely-to-fail, but far cheaper, rules before it.) And factor out common prefixes, like the "'if' expr 'then' statement" part of the ifelse rule, which is valid for both the if/then and if/then/else forms.
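As a generic illustration of that last point, left-factoring a common prefix in P::RD-style notation might look like this (the rule and token names are invented for the example):

    # Before: 'cmd' expr is matched, discarded, and rematched
    # every time the first alternative fails partway through.
    rule: 'cmd' expr 'with' option | 'cmd' expr

    # After: the common prefix is matched exactly once.
    rule: 'cmd' expr tail
    tail: 'with' option | # or nothing

The same transformation is what took the ifelse and multistatement rules above from alternation to a single production plus an optional tail.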
Don't accept lousy performance without a fight! I spent two days fiddling and cut what was potentially 4.5 hours of user time per pass over the source tree down to 48 minutes. Or, if you prefer wall time, 9 hours down to 1.5 hours, figuring about 50% CPU utilization. It only takes the equivalent of one and a half full passes over the source tree (and, as I don't have the actual compiler part written, I expect far more than that during development) to make up for the time I spent fixing the grammar. (Like it or not, 8 hours of time in the office does not equal 8 hours of full-on work, alas.)
Not to mention the reduction in complaints from folks when a machine they use is bogged down for days on end--this is running on a shared machine, and it's nice to be polite with shared resources. (Besides, the compiler phase will, no doubt, eat up quite a lot of time itself...)
Posted by Dan at August 25, 2003 12:04 PM