NAME

       PCRE - Perl-compatible regular expressions


INTRODUCTION


       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a  few  differences.  The current implementation of PCRE (release
       6.x) corresponds approximately with Perl  5.8,  including  support  for
       UTF-8 encoded strings and Unicode general category properties. However,
       this support has to be explicitly enabled; it is not the default.

       In addition to the Perl-compatible matching function,  PCRE  also  con-
       tains  an  alternative matching function that matches the same compiled
       patterns in a different way. In certain circumstances, the  alternative
       function  has  some  advantages.  For  a discussion of the two matching
       algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A  number  of  people
       have  written  wrappers and interfaces of various kinds. In particular,
       Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now
       included as part of the PCRE distribution. The pcrecpp page has details
       of this interface. Other people's contributions can  be  found  in  the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details  of  exactly which Perl regular expression features are and are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcrebuild page. Documentation about  build-
       ing  PCRE for various operating systems can be found in the README file
       in the source distribution.

       The library contains a number of undocumented  internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


USER DOCUMENTATION


       The user documentation for PCRE comprises a number  of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

         pcre              this document
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the sample program
         pcretest          description of the pcretest testing command

       In  addition,  in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.


LIMITATIONS


       There are some size limitations in PCRE but it is hoped that they  will
       never in practice be relevant.

       The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process  regular  expressions  that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
       the  source  distribution and the pcrebuild documentation for details).
       In these cases the limit is substantially larger.  However,  the  speed
       of execution will be slower.

       All values in repeating quantifiers must be less than 65536.  The maxi-
       mum number of capturing subpatterns is 65535.

       There is no limit to the number of non-capturing subpatterns,  but  the
       maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,
       including capturing subpatterns, assertions, and other types of subpat-
       tern, is 200.
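
       A caller that builds patterns mechanically may  want  to  estimate
       the  nesting  depth  before  handing  a pattern to pcre_compile().
       The following is a rough sketch only. The function names and  the
       constant SKETCH_MAX_NESTING are invented for this example, and the
       scanner is deliberately simplified: it  skips  backslash-escaped
       characters  and  the contents of character classes, but it knows
       nothing about \Q...\E quoting, POSIX class names, or comments.
       pcre_compile() remains the authority on what is accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Limit taken from the text above; the name is hypothetical. */
#define SKETCH_MAX_NESTING 200

/* Return the deepest level of parenthesis nesting found in `pattern`.
   Simplified: handles backslash escapes and [...] classes only. */
static int max_paren_depth(const char *pattern)
{
    int depth = 0, max_depth = 0, in_class = 0;
    size_t i;

    assert(pattern != NULL);
    for (i = 0; pattern[i] != '\0'; i++) {
        char c = pattern[i];

        if (c == '\\') {                       /* skip the escaped character */
            if (pattern[i + 1] != '\0') i++;
            continue;
        }
        if (in_class) {                        /* parentheses are literal here */
            if (c == ']') in_class = 0;
            continue;
        }
        if (c == '[') {
            in_class = 1;
            if (pattern[i + 1] == '^') i++;    /* negated class */
            if (pattern[i + 1] == ']') i++;    /* a leading ] is a literal */
        } else if (c == '(') {
            if (++depth > max_depth) max_depth = depth;
        } else if (c == ')') {
            if (depth > 0) depth--;
        }
    }
    return max_depth;
}

/* Non-zero if the estimated depth does not exceed the documented limit. */
static int nesting_within_limit(const char *pattern)
{
    return max_paren_depth(pattern) <= SKETCH_MAX_NESTING;
}
```

       All kinds of opening parenthesis are counted alike, which matches
       the  statement above that the limit applies to capturing subpat-
       terns, assertions, and other types of subpattern together.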

       The  maximum  length of a subject string is the largest positive number
       that an integer variable can hold. However, when using the  traditional
       matching function, PCRE uses recursion to handle subpatterns and indef-
       inite repetition.  This means that the available stack space may  limit
       the size of a subject string that can be processed by certain patterns.


UTF-8 AND UNICODE PROPERTY SUPPORT


       From release 3.3, PCRE has  had  some  support  for  character  strings
       encoded  in the UTF-8 format. For release 4.0 this was greatly extended
       to cover most common requirements, and in release 5.0  additional  sup-
       port for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support in the code, and, in addition,  you  must  call  pcre_compile()
       with  the PCRE_UTF8 option flag. When you do this, both the pattern and
       any subject strings that are matched against it are  treated  as  UTF-8
       strings instead of just strings of bytes.

       If  you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time  overhead
       is  limited  to testing the PCRE_UTF8 flag in several places, so should
       not be very large.

       If PCRE is built with Unicode character property support (which implies
       UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
       ported.  The available properties that can be tested are limited to the
       general  category  properties such as Lu for an upper case letter or Nd
       for a decimal number. A full list is given in the pcrepattern  documen-
       tation. The PCRE library is increased in size by about 90K when Unicode
       property support is included.

       The following comments apply when PCRE is running in UTF-8 mode:

       1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
       subjects  are  checked for validity on entry to the relevant functions.
       If an invalid UTF-8 string is passed, an error return is given. In some
       situations,  you  may  already  know  that  your strings are valid, and
       therefore want to skip these checks in order to improve performance. If
       you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
       PCRE assumes that the pattern or subject  it  is  given  (respectively)
       contains  only valid UTF-8 codes. In this case, it does not diagnose an
       invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
       PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
       crash.

       2. In a pattern, the escape sequence \x{...}, where the contents of the
       braces  are  a string of hexadecimal digits, is interpreted as a UTF-8
       character whose code number is the given hexadecimal number, for  exam-
       ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,
       the item is not recognized.  This escape sequence can be used either as
       a literal, or within a character class.

       3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte
       UTF-8 character if the value is greater than 127.

       4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
       vidual bytes, for example: \x{100}{3}.

       5.  The dot metacharacter matches one UTF-8 character instead of a sin-
       gle byte.

       6. The escape sequence \C can be used to match a single byte  in  UTF-8
       mode,  but  its  use can lead to some strange effects. This facility is
       not available in the alternative matching function, pcre_dfa_exec().

       7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
       test  characters of any code value, but the characters that PCRE recog-
       nizes as digits, spaces, or word characters  remain  the  same  set  as
       before, all with values less than 256. This remains true even when PCRE
       includes Unicode property support, because to do otherwise  would  slow
       down  PCRE in many common cases. If you really want to test for a wider
       sense of, say, "digit", you must use Unicode  property  tests  such  as
       \p{Nd}.

       8.  Similarly,  characters that match the POSIX named character classes
       are all low-valued characters.

       9. Case-insensitive matching applies only to  characters  whose  values
       are  less than 128, unless PCRE is built with Unicode property support.
       Even when Unicode property support is available, PCRE  still  uses  its
       own  character  tables when checking the case of low-valued characters,
       so as not to degrade performance.  The Unicode property information  is
       used only for characters with higher values.
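
       Comments 2 to 5 depend on the distinction between a character  and
       the  bytes  that encode it. The following self-contained sketch is
       not PCRE code; the function names are invented for  this  example.
       It  assumes  the  code  point  range  of current UTF-8 (U+0000 to
       U+10FFFF, excluding surrogates), which is not necessarily the same
       range that a given PCRE release accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point `cp` as UTF-8 into `out` (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if `cp` is a surrogate or
   lies above U+10FFFF. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* includes 128..255: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count the characters in a VALID UTF-8 buffer by counting every byte
   that is not a continuation byte (10xxxxxx). */
static size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t i, n = 0;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

       With these helpers, the value 0xE9 (greater than 127)  encodes  as
       the two bytes C3 A9, as comment 3 describes, and a subject matched
       by \x{100}{3} holds three characters in six bytes (C4 80 repeated
       three times), which is the distinction drawn in comments 4 and 5.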


AUTHOR


       Philip Hazel
       University Computing Service,
       Cambridge CB2 3QG, England.

       Putting  an actual email address here seems to have been a spam magnet,
       so I've taken it away. If you want to email me, use my initial and sur-
       name, separated by a dot, at the domain ucs.cam.ac.uk.

Last updated: 07 March 2005
Copyright (c) 1997-2005 University of Cambridge.

                                                                       PCRE(3)
