DOC HOME SITE MAP MAN PAGES GNU INFO SEARCH
 

(flex.info.gz) Are certain equivalent patterns faster than others?

Info Catalog (flex.info.gz) deleteme00 (flex.info.gz) FAQ (flex.info.gz) Is backing up a big deal?
 
 Are certain equivalent patterns faster than others?
 ===================================================
 
 
      To: Adoram Rogel <adoram@orna.hybridge.com>
      Subject: Re: Flex 2.5.2 performance questions
      In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
      Date: Wed, 18 Sep 96 10:51:02 PDT
      From: Vern Paxson <vern>
 
      [Note, the most recent flex release is 2.5.4, which you can get from
      ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]
 
      > 1. Using the pattern
      >    ([Ff](oot)?)?[Nn](ote)?(\.)?
      >    instead of
      >    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
      >    (in a very complicated flex program) caused the program to slow from
      >    300K+/min to 100K/min (no other changes were done).
 
      These two are not equivalent.  For example, the first can match "footnote."
      but the second can only match "footnote".  This is almost certainly the
      cause in the discrepancy - the slower scanner run is matching more tokens,
      and/or having to do more backing up.
 
      > 2. Which of these two are better: [Ff]oot or (F|f)oot ?
 
      From a performance point of view, they're equivalent (modulo presumably
      minor effects such as memory cache hit rates; and the presence of trailing
      context, see below).  From a space point of view, the first is slightly
      preferable.
 
      > 3. I have a pattern that look like this:
      >    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
      >
      >    running yet another complicated program that includes the following rule:
      >    <snext>{and}/{no4}{bb}{pats}
      >
      >    gets me to "too complicated - over 32,000 states"...
 
      I can't tell from this example whether the trailing context is variable-length
      or fixed-length (it could be the latter if {and} is fixed-length).  If it's
      variable length, which flex -p will tell you, then this reflects a basic
      performance problem, and if you can eliminate it by restructuring your
      scanner, you will see significant improvement.
 
      >    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
      >    10 patterns and changed the rule to be 5 rules.
      >    This did compile, but what is the rule of thumb here ?
 
      The rule is to avoid trailing context other than fixed-length, in which for
      a/b, either the 'a' pattern or the 'b' pattern have a fixed length.  Use
      of the '|' operator automatically makes the pattern variable length, so in
      this case '[Ff]oot' is preferred to '(F|f)oot'.
 
      > 4. I changed a rule that looked like this:
      >    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
      >
      >    to the next 2 rules:
      >    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
      >    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
      >
      >    Again, I understand the using [^...] will cause a great performance loss
 
      Actually, it doesn't cause any sort of performance loss.  It's a surprising
      fact about regular expressions that they always match in linear time
      regardless of how complex they are.
 
      >    but are there any specific rules about it ?
 
      See the "Performance Considerations" section of the man page, and also
      the example in MISC/fastwc/.
 
      		Vern
 
Info Catalog (flex.info.gz) deleteme00 (flex.info.gz) FAQ (flex.info.gz) Is backing up a big deal?
automatically generated byinfo2html