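Pattern Syntax -- describes the syntax of Perl-compatible regular expressions
Description
The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with a few differences (see below). The current implementation of PCRE corresponds to Perl 5.005.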
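Differences From Perl
The differences described here are with respect to Perl 5.005.
By default, a whitespace character is any character that the C library function isspace() recognizes, though it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never in fact recognized. However, the character itself was treated as whitespace at least up to 5.002. In 5.004 and 5.005 it does not match \s.
PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they do not mean what you might think. For example, (?!a){3} does not assert that the next three characters are not "a". It just asserts three times that the next character is not "a".
Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match something, but only if the negative lookahead assertion contains just one branch.
Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by a binary zero. The escape sequence "\\x00" can be used in the pattern to represent a binary zero.
The following Perl escape sequences are not supported: \l, \u, \L, \U. In fact these are implemented by Perl's general string handling and are not part of its pattern matching engine.
The Perl \G assertion is not supported, as it is not relevant to single pattern matches. (This describes the PCRE version compared with Perl 5.005; the backslash section below notes that \G is available since PHP 4.3.3.)
Fairly obviously, PCRE does not support the (?{code}) construction.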
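There are some oddities in Perl 5.005_02 concerning the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If Perl changes in the future, PCRE may change to follow.
Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE, matching /^(a)?a/ against "a" leaves $1 unset.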
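PCRE provides some extensions to the Perl regular expression facilities:
Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to have the same length.
If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ metacharacter matches only at the very end of the string.
If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted.
If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.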
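Regular Expression Details
Introduction
The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself.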
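Meta-characters
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:
- \
general escape character with several uses
- ^
assert start of subject (or start of line in multiline mode, that is, immediately after a newline)
- $
assert end of subject (or end of line in multiline mode, that is, immediately before a newline)
- .
match any character except newline (by default)
- [
start character class definition
- ]
end character class definition
- |
start of alternative branch
- (
start subpattern
- )
end subpattern
- ?
extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
- *
0 or more quantifier
- +
1 or more quantifier
- {
start min/max quantifier
- }
end min/max quantifier
Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:
- \
general escape character
- ^
negate the class, but only if it is the first character
- -
indicates character range
- ]
terminates the character class
The following sections describe the use of each of the meta-characters.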
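Backslash (\)
The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:
- \a
alarm, that is, the BEL character (hex 07)
- \cx
"control-x", where x is any character
- \e
escape (hex 1B)
- \f
formfeed (hex 0C)
- \n
newline (hex 0A)
- \r
carriage return (hex 0D)
- \t
tab (hex 09)
- \xhh
character with hex code hh
- \ddd
character with octal code ddd, or back reference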
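The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.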
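The rule above can be checked by hand. The following Python sketch reproduces the three values quoted in the text; the helper name control_char is hypothetical and is not part of PCRE or PHP.

```python
def control_char(x: str) -> int:
    # Hypothetical helper: computes the byte that a control escape stands for,
    # following the rule described in the text.
    if "a" <= x <= "z":
        x = x.upper()        # lower case letters are first converted to upper case
    return ord(x) ^ 0x40     # then bit 6 (hex 40) is inverted


print(hex(control_char("z")))  # 0x1a
print(hex(control_char("{")))  # 0x3b
print(hex(control_char(";")))  # 0x7b
```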
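After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.
After "\0" up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.
Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example:
- \040
another way of writing a space
- \40
the same, provided there are fewer than 40 previous capturing subpatterns
- \7
always a back reference
- \11
might be a back reference, or another way of writing a tab
- \011
always a tab
- \0113
a tab followed by the character "3"
- \113
the character with octal code 113 (since there can be no more than 99 back references)
- \377
a byte consisting entirely of 1 bits
- \81
either a back reference, or a binary zero followed by the two characters "8" and "1"
Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.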
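The decision procedure for digits after a backslash can be written out as a short function. This is only a sketch of the rules as stated above, not PCRE's actual code; the name interpret_digit_escape and its return format are hypothetical.

```python
def interpret_digit_escape(digits: str, groups_so_far: int, in_class: bool = False):
    """Classify the digits that follow a backslash (hypothetical helper).

    Returns ("backref", n, rest) or ("byte", value, rest), where rest holds
    the leftover digits that stand for themselves.
    """
    # A leading zero never starts a back reference.
    if not in_class and digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    # Keep only the least significant 8 bits; no octal digit at all means binary zero.
    value = (int(digits[:n], 8) & 0xFF) if n else 0
    return ("byte", value, digits[n:])


print(interpret_digit_escape("40", groups_so_far=0))    # ('byte', 32, '')
print(interpret_digit_escape("11", groups_so_far=11))   # ('backref', 11, '')
print(interpret_digit_escape("0113", groups_so_far=0))  # ('byte', 9, '3')
print(interpret_digit_escape("81", groups_so_far=0))    # ('byte', 0, '81')
```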
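All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).
The third use of backslash is for specifying generic character types:
- \d
any decimal digit
- \D
any character that is not a decimal digit
- \s
any whitespace character
- \S
any character that is not a whitespace character
- \w
any "word" character
- \W
any "non-word" character
Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place (see "Locale support" above). For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.
The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:
- \b
word boundary
- \B
not a word boundary
- \A
start of subject (independent of multiline mode)
- \Z
end of subject or newline at end (independent of multiline mode)
- \z
end of subject (independent of multiline mode)
- \G
first matching position in subject
These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).
A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (that is, one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.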
The \G assertion is true only when the current
matching position is at the start point of the match, as specified by
the offset argument of
preg_match(). It differs from \A
when the value of offset is non-zero.
It is available since PHP 4.3.3.
\Q and \E can be used to ignore
regexp metacharacters in the pattern since PHP 4.3.3. For example:
\w+\Q.$.\E$ will match one or more word characters,
followed by literals .$. and anchored at the end of
the string.
Unicode character properties
Since PHP 4.4.0 and 5.1.0, three
additional escape sequences to match generic character types are available
when UTF-8 mode is selected. They are:
- \p{xx}
a character with the xx property
- \P{xx}
a character without the xx property
- \X
an extended Unicode sequence
The property names represented by xx above are limited to the Unicode
general category properties. Each character has exactly one such
property, specified by a two-letter abbreviation. For compatibility with
Perl, negation can be specified by including a circumflex between the
opening brace and the property name. For example, \p{^Lu} is the same
as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the
properties that start with that letter. In this case, in the absence of
negation, the curly brackets in the escape sequence are optional; these
two examples have the same effect:
\p{L}
\pL
Table 1. Supported property codes
C | Other
Cc | Control
Cf | Format
Cn | Unassigned
Co | Private use
Cs | Surrogate
L | Letter
Ll | Lower case letter
Lm | Modifier letter
Lo | Other letter
Lt | Title case letter
Lu | Upper case letter
M | Mark
Mc | Spacing mark
Me | Enclosing mark
Mn | Non-spacing mark
N | Number
Nd | Decimal number
Nl | Letter number
No | Other number
P | Punctuation
Pc | Connector punctuation
Pd | Dash punctuation
Pe | Close punctuation
Pf | Final punctuation
Pi | Initial punctuation
Po | Other punctuation
Ps | Open punctuation
S | Symbol
Sc | Currency symbol
Sk | Modifier symbol
Sm | Mathematical symbol
So | Other symbol
Z | Separator
Zl | Line separator
Zp | Paragraph separator
Zs | Space separator
Extended properties such as "Greek" or "InMusicalSymbols" are not
supported by PCRE.
Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*).
That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.
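Circumflex (^) and dollar ($)
Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).
Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)
A dollar character is an assertion which is TRUE only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. When this is the case, they match immediately after and immediately before an internal "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether PCRE_MULTILINE is set or not.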
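The effect of multiline mode on ^ and $ can be tried out with a short script. The sketch below uses Python's re module as a stand-in, on the assumption that its re.MULTILINE flag behaves like PCRE_MULTILINE for these cases; it is not PCRE itself, so results should be re-checked with preg_match() where it matters.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
default_match = re.search(r"^abc$", subject)
print(default_match)                      # None

# Multiline mode: ^ and $ also match after and before an internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)
print(multiline_match.group())            # abc

# \A stays anchored to the start of the subject even in multiline mode.
anchored_match = re.search(r"\Aabc", subject, re.MULTILINE)
print(anchored_match)                     # None

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")
print(before_final_newline is not None)   # True
```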
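Full stop (.)
Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the PCRE_DOTALL option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.
\C can be used to match a single byte. This makes sense in UTF-8 mode, where the full stop matches a whole character, which can consist of multiple bytes.
Square brackets ([])
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.
A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current position is at the end of the string.
When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.
The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class such as [^a] will always match a newline.
The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.
It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of a range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.
Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.
The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.
All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped.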
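Several of the character class rules above can be verified directly. The sketch below again uses Python's re module as a stand-in (an assumption: for these particular classes it behaves like PCRE); re.ASCII is passed so that \d and \w stay limited to ASCII characters.

```python
import re


def matches(pattern: str, text: str, flags: int = 0) -> bool:
    # Hypothetical convenience wrapper: True if the whole text matches the pattern.
    return re.fullmatch(pattern, text, flags | re.ASCII) is not None


print(matches(r"[^aeiou]", "A"))                 # True  (caseful: "A" is not a lower case vowel)
print(matches(r"[^aeiou]", "A", re.IGNORECASE))  # False (caseless: "A" counts as a vowel)
print(matches(r"[^a]", "\n"))                    # True  (newline is not special in a class)
print(matches(r"[W-]46]", "-46]"))               # True  (class of "W" and "-", then literal "46]")
print(matches(r"[W-\]46]", "Z"))                 # True  ("Z" lies in the range W to ])
print(matches(r"[\dABCDEF]", "C"))               # True  (hexadecimal digit)
print(matches(r"[^\W_]", "_"))                   # False (letters and digits only)
```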
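Vertical bar (|)
Vertical bar characters are used to separate alternative patterns. For example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.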
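Internal option setting
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, PCRE_EXTRA, and PCRE_EXTENDED can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are:
- i
for PCRE_CASELESS
- m
for PCRE_MULTILINE
- s
for PCRE_DOTALL
- x
for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen. A combined setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset.
When an option change occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows. So /ab(?i)c/ matches only "abc" and "abC". This behaviour was changed in PCRE 4.0, which is bundled since PHP 4.3.3. Before those versions, /ab(?i)c/ would perform as /abc/i (for example, matching "ABC" and "aBc").
If an option change occurs inside a subpattern, the effect is different. This is a change of behaviour in Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so (a(?i)b)c matches only "abc" and "aBc" (assuming PCRE_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by using the characters U and X respectively. The (?X) flag setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on. It is best put at the start.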
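Subpatterns
Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.
2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.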
The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping subpattern
is required without a capturing requirement. If an
opening parenthesis is followed by "?:", the subpattern does
not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns. For example,
if the string "the white queen" is matched against the
pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and
are numbered 1 and 2. The maximum number of captured substrings
is 99, and the maximum number of all subpatterns,
both capturing and non-capturing, is 200.
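The numbering of capturing and non-capturing subpatterns can be seen in a small script. This sketch uses Python's re module as a stand-in for PCRE (an assumption; both number groups by opening parenthesis, left to right).

```python
import re

# Three capturing subpatterns, numbered 1, 2 and 3.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())        # ('red king', 'red', 'king')

# "?:" makes the first inner group non-capturing, so only two are numbered.
grouping_only = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(grouping_only.groups())    # ('white queen', 'queen')
```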
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the "?" and the ":". Thus
the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
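The shorthand form with option letters between "?" and ":" can be tried as follows. The sketch uses Python's re module as a stand-in; note that Python accepts the (?i:...) form but, in recent versions, not an option change in the middle of a pattern, so only the shorthand is shown.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
print(days.fullmatch("Saturday") is not None)  # True
print(days.fullmatch("SUNDAY") is not None)    # True

# The option applies only inside the subpattern that sets it.
scoped = re.compile(r"a(?i:b)c")
print(scoped.fullmatch("aBc") is not None)     # True
print(scoped.fullmatch("ABc") is not None)     # False
```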
It is possible to name the subpattern with
(?P<name>pattern) since PHP 4.3.3. The array of matches will then
contain the match indexed by its name alongside the match indexed by
its number.
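Named subpatterns remain numbered as well, so a match is reachable both ways. The sketch below illustrates this with Python's re module, which uses the same (?P<name>...) syntax; the subject string and group names are made up for the example.

```python
import re

match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2004-07")

print(match.group("year"))   # 2004  (indexed by name)
print(match.group(1))        # 2004  (the same subpattern, indexed by number)
print(match.groupdict())     # {'year': '2004', 'month': '07'}
```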
Repetition
Repetition is specified by quantifiers, which can follow any
of the following items:
- a single character, possibly escaped
- the . metacharacter
- a character class
- a back reference (see next section)
- a parenthesized subpattern (unless it is an assertion - see below)
The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second. For example:
z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own
is not a special character. If the second number is omitted,
but the comma is present, there is no upper limit; if the
second number and the comma are both omitted, the quantifier
specifies an exact number of required matches. Thus
[aeiou]{3,}
matches at least 3 successive vowels, but may match many
more, while
\d{8}
matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, {,6} is not a quantifier,
but a literal string of four characters.
The quantifier {0} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.
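The quantifier forms described so far can be exercised with a few test strings. The sketch uses Python's re module as a stand-in for PCRE, on the assumption that the two agree for these quantifiers.

```python
import re

# {2,4}: between two and four repetitions.
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
accepted = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]
print(accepted)                                           # ['zz', 'zzz', 'zzzz']

# {3,}: no upper limit.
print(re.fullmatch(r"[aeiou]{3,}", "aeiouae") is not None)  # True

# {8}: exactly eight.
print(re.fullmatch(r"\d{8}", "20040712") is not None)       # True
print(re.fullmatch(r"\d{8}", "2004071") is not None)        # False

# {0}: behaves as if the item and its quantifier were absent.
print(re.fullmatch(r"ab{0}c", "ac") is not None)            # True
```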
For convenience (and historical compatibility) the three
most common quantifiers have single-character abbreviations:
Table 3. Single-character quantifiers
* | equivalent to {0,}
+ | equivalent to {1,}
? | equivalent to {0,1}
It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at
compile time for such patterns. However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.
By default, the quantifiers are "greedy", that is, they
match as much as possible (up to the maximum number of permitted
times), without causing the rest of the pattern to
fail. The classic example of where this gives problems is in
trying to match comments in C programs. These appear between
the sequences /* and */ and within the sequence, individual
* and / characters may appear. An attempt to match C comments
by applying the pattern
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the entire string due to the
greediness of the .* item.
However, if a quantifier is followed by a question mark,
then it ceases to be greedy, and instead matches the minimum
number of times possible, so the pattern
/\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in
\d??\d
which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
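The difference between greedy and minimizing quantifiers is easy to observe on the C comment example. The sketch below uses Python's re module as a stand-in for PCRE (an assumption; both treat a trailing question mark the same way).

```python
import re

source = "/* first comment */ not comment /* second comment */"

greedy = re.findall(r"/\*.*\*/", source)
print(len(greedy))   # 1 (the greedy .* runs to the last "*/", so the whole string matches)

minimal = re.findall(r"/\*.*?\*/", source)
print(minimal)       # ['/* first comment */', '/* second comment */']

# A doubled question mark: an optional digit that prefers to be absent.
print(re.match(r"\d??\d", "12").group())      # 1
print(re.fullmatch(r"\d??\d", "12").group())  # 12 (two digits when that is the only way to match)
```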
If the PCRE_UNGREEDY option is set (an option which is not
available in Perl) then the quantifiers are not greedy by
default, but individual ones can be made greedy by following
them with a question mark. In other words, it inverts the
default behaviour.
Quantifiers followed by + are "possessive". They eat
as many characters as possible and don't return to match the rest of the
pattern. Thus .*abc matches "aabc" but
.*+abc doesn't because .*+ eats the
whole string. Possessive quantifiers can be used to speed up processing since PHP 4.3.3.
When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more store is required for the compiled pattern, in
proportion to the size of the minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL
option (equivalent to Perl's /s) is set, thus allowing the .
to match newlines, then the pattern is implicitly anchored,
because whatever follows will be tried against every character
position in the subject string, so there is no point in
retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by \A.
In cases where it is known that the subject string contains
no newlines, it is worth setting PCRE_DOTALL when the pattern begins with .* in order to
obtain this optimization, or
alternatively using ^ to indicate anchoring explicitly.
When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For example, after
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured
substring is "tweedledee". However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations. For example,
after
/(a|(b))+/
matches "aba" the value of the second captured substring is
"b".
Back references
Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.
However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10. See the section
entitled "Backslash" above for further details of the handling
of digits following a backslash.
A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself. So the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility",
but not "sense and responsibility". If caseful
matching is in force at the time of the back reference, then
the case of letters is relevant. For example,
((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even
though the original capturing subpattern is matched caselessly.
There may be more than one back reference to the same subpattern.
If a subpattern has not actually been used in a
particular match, then any back references to it always
fail. For example, the pattern
(a|(bc))\2
always fails if it starts to match "a" rather than "bc".
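The two back reference behaviours just described, matching the captured text rather than the subpattern and failing when the subpattern was not used, can be checked as follows. The sketch uses Python's re module as a stand-in for PCRE, assuming the same behaviour for these patterns.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
print(pattern.fullmatch("sense and sensibility") is not None)        # True
print(pattern.fullmatch("response and responsibility") is not None)  # True
print(pattern.fullmatch("sense and responsibility") is not None)     # False

# \2 refers to (bc); if the "a" alternative was taken, group 2 is unset
# and the back reference cannot match.
unused = re.compile(r"(a|(bc))\2")
print(unused.match("a") is not None)      # False
print(unused.match("bcbc") is not None)   # True
```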
Because there may be up to 99 back references, all digits
following the backslash are taken as part of a potential
back reference number. If the pattern continues with a digit
character, then some delimiter must be used to terminate the
back reference. If the PCRE_EXTENDED option is set, this can
be whitespace. Otherwise an empty comment can be used.
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
be useful inside repeated subpatterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches
the character string corresponding to the previous iteration.
In order for this to work, the pattern must be such
that the first iteration does not need to match the back
reference. This can be done using alternation, as in the
example above, or by a quantifier with a minimum of zero.
Assertions
An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as \b,
\B, \A, \Z, \z, ^ and $ are described above. More complicated
assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the
subject string, and those that look behind it.
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with (?= for positive
assertions and (?! for negative assertions. For example,
\w+(?=;)
matches a word followed by a semicolon, but does not include
the semicolon in the match, and
foo(?!bar)
matches any occurrence of "foo" that is not followed by
"bar". Note that the apparently similar pattern
(?!foo)bar
does not find an occurrence of "bar" that is preceded by
something other than "foo"; it finds any occurrence of "bar"
whatsoever, because the assertion (?!foo) is always TRUE
when the next three characters are "bar". A lookbehind
assertion is needed to achieve this effect.
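The three lookahead patterns above behave as follows on some sample subjects. The sketch uses Python's re module as a stand-in for PCRE (an assumption; lookahead syntax and semantics are the same for these cases), and the subject strings are made up for the example.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")
print(words)                         # ['alpha', 'gamma']

# "foo" not followed by "bar": the first "foo" is skipped.
not_followed = re.search(r"foo(?!bar)", "foobar foobaz")
print(not_followed.start())          # 7

# (?!foo)bar does not look behind: it still finds "bar" right after "foo".
misleading = re.search(r"(?!foo)bar", "foobar")
print(misleading.start())            # 3
```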
Lookbehind assertions start with (?<= for positive assertions
and (?<! for negative assertions. For example,
(?<!foo)bar
does find an occurrence of "bar" that is not preceded by
"foo". The contents of a lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length. However, if there are several alternatives, they do
not all have to have the same fixed length. Thus
(?<=bullock|donkey)
is permitted, but
(?<!dogs?|cats?)
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion. This is an extension compared with
Perl 5.005, which requires all branches to match the same
length of string. An assertion such as
(?<=ab(c|de))
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:
(?<=abc|abde)
The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back
by the fixed width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds in conjunction with
once-only subpatterns can be particularly useful for matching
at the ends of strings; an example is given at the end
of the section on once-only subpatterns.
Several assertions (of any sort) may occur in succession.
For example,
(?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999".
Notice that each of the assertions is applied independently
at the same point in the subject string. First there is a
check that the previous three characters are all digits,
then there is a check that the same three characters are not
"999". This pattern does not match "foo" preceded by six
characters, the first of which are digits and the last three
of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not "999".
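The two combined lookbehind patterns can be compared on the subjects mentioned in the text. The sketch uses Python's re module as a stand-in for PCRE; both patterns use fixed-width lookbehinds, which Python also accepts.

```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
print(three_back.search("123foo") is not None)     # True
print(three_back.search("999foo") is not None)     # False
print(three_back.search("123abcfoo") is not None)  # False (the preceding three characters are not digits)

six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
print(six_back.search("123abcfoo") is not None)    # True
print(six_back.search("123999foo") is not None)    # False (the preceding three characters are "999")
```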
Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar"
which in turn is not preceded by "foo", while
(?<=\d{3}...(?<!999))foo
is another pattern which matches "foo" preceded by three
digits and any three characters that are not "999".
Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times. If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.
Assertions count towards the maximum of 200 parenthesized
subpatterns.
Once-only subpatterns
With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
Consider, for example, the pattern \d+foo when applied to
the subject line
123456bar
After matching all 6 digits and then failing to match "foo",
the normal action of the matcher is to try again with only 5
digits matching the \d+ item, and then with 4, and so on,
before ultimately failing. Once-only subpatterns provide the
means for specifying that once a portion of the pattern has
matched, it is not to be re-evaluated in this way, so the
matcher would give up immediately on failing to match "foo"
the first time. The notation is another kind of special
parenthesis, starting with (?> as in this example:
(?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point
in the subject string.
Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, (?>\d+) can only match an entire sequence of digits.
This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.
Once-only subpatterns can be used in conjunction with
look-behind assertions to specify efficient matching at the end
of the subject string. Consider a simple pattern such as
abcd$
when applied to a long string which does not match. Because
matching proceeds from left to right, PCRE will look for
each "a" in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as
^.*abcd$
then the initial .* matches the entire string at first, but
when this fails (because there is no following "a"), it
backtracks to match all but the last character, then all but
the last two characters, and so on. Once again the search
for "a" covers the entire string, from right to left, so we
are no better off. However, if the pattern is written as
^(?>.*)(?<=abcd)
then there can be no backtracking for the .* item; it can
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the processing time.
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time indeed.
The pattern
(\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in <>, followed by
either ! or ?. When it matches, it runs quickly. However, if
it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried. (The example
used [!?] rather than a single character at the end,
because both PCRE and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.)
If the pattern is changed to
((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens quickly.
Conditional subpatterns
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are
more than two alternatives in the subpattern, a compile-time
error occurs.
There are three kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
number has previously matched. Consider the following pattern,
which contains non-significant white space to make it
more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses. The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if subject started
with an opening parenthesis, the condition is TRUE, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.
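The parenthesis example can be run as written. The sketch uses Python's re module as a stand-in for PCRE, with re.VERBOSE playing the role of PCRE_EXTENDED; Python supports conditions that test a group number, which is all this pattern needs.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

print(optional_parens.fullmatch("(abc)") is not None)  # True
print(optional_parens.fullmatch("abc") is not None)    # True
print(optional_parens.fullmatch("(abc") is not None)   # False (an opening parenthesis needs its partner)
print(optional_parens.fullmatch("abc)") is not None)   # False (no opening parenthesis was captured)
```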
If the condition is the string (R), it is satisfied if
a recursive call to the pattern or subpattern has been made. At "top
level", the condition is false.
If the condition is not a sequence of digits or (R), it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on
the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that matches
an optional sequence of non-letters followed by a letter. In
other words, it tests for the presence of at least one
letter in the subject. If a letter is found, the subject is
matched against the first alternative; otherwise it is
matched against the second. This pattern matches strings in
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
Comments
The sequence (?# marks the start of a comment which
continues up to the next closing parenthesis. Nested
parentheses are not permitted. The characters that make up a
comment play no part in the pattern matching at all.
If the PCRE_EXTENDED option is set, an unescaped # character
outside a character class introduces a comment that
continues up to the next newline character in the pattern.
Recursive patterns
Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth. Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (among other things). The special
item (?R) is provided for the specific case of recursion.
This PCRE pattern solves the parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( (?>[^()]+) | (?R) )* \)
First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring). Finally there is
a closing parenthesis.
This particular example pattern contains nested unlimited
repeats, and so the use of a once-only subpattern for matching
strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a once-only subpattern
is not used, the match runs for a very long time
indeed because there are so many different ways the + and *
repeats can carve up the subject, and all have to be tested
before failure can be reported.
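As a rough sketch, the recursive pattern can be tried from PHP like this. The helper isBalancedGroup() is hypothetical (it is not part of PCRE or PHP); it compares the matched text with the whole subject because the pattern itself is unanchored and would otherwise be satisfied by an inner group such as "()".

```php
<?php
// Recursive pattern from the text; the x modifier lets the white space be ignored.
$pattern = '/\( ( (?>[^()]+) | (?R) )* \)/x';

// Hypothetical helper: true only if the whole subject is one balanced group.
function isBalancedGroup(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject, $m) === 1 && $m[0] === $subject;
}

$balanced   = isBalancedGroup($pattern, '(ab(cd)ef)');
$unbalanced = isBalancedGroup($pattern, '(ab(cd)ef');
// The long subject from the text: an opening parenthesis that is never closed.
$longOpen   = isBalancedGroup($pattern, '(' . str_repeat('a', 53) . '()');

var_dump($balanced, $unbalanced, $longOpen);
```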
The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against
(ab(cd)ef)
the value for the capturing parentheses is "ef", which is
the last value taken on at the top level. If additional
parentheses are added, giving
\( ( ( (?>[^()]+) | (?R) )* ) \)
then the string they capture
is "ab(cd)ef", the contents of the top level parentheses. If
there are more than 15 capturing parentheses in a pattern,
PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it
via pcre_free afterwards. If no memory can be obtained, it
saves data for the first 15 capturing parentheses only, as
there is no way to give an out-of-memory error from within a
recursion.
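The captured values described above can be inspected with a short PHP sketch. The variable names are illustrative, and the result is expected to follow the description in the text on PCRE versions that behave as documented here.

```php
<?php
$subject = '(ab(cd)ef)';

// One pair of capturing parentheses around each repeated item.
preg_match('/\( ( (?>[^()]+) | (?R) )* \)/x', $subject, $inner);

// An additional pair around the whole repeated part.
preg_match('/\( ( ( (?>[^()]+) | (?R) )* ) \)/x', $subject, $outer);

echo $inner[1], "\n"; // last value taken on at the top level
echo $outer[1], "\n"; // contents of the top level parentheses
```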
Since PHP 4.3.3, (?1), (?2) and so on can be used
for recursive subpatterns too. It is also possible to use named
subpatterns: (?P>foo).
If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. An earlier example
pointed out that the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
is used, it does match "sense and responsibility" as well as the other
two strings. Such references must, however, follow the subpattern to
which they refer.
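The difference between a back reference and a subroutine-style reference can be sketched in PHP as below. The ^ and $ anchors and the variable names are additions for illustration.

```php
<?php
// \1 must repeat the text that group 1 actually matched;
// (?1) re-runs the subpattern of group 1, so it may match different text.
$backref    = '/^(sens|respons)e and \1ibility$/';
$subroutine = '/^(sens|respons)e and (?1)ibility$/';

$mixed = 'sense and responsibility';

$backrefMixed    = preg_match($backref, $mixed);
$subroutineMixed = preg_match($subroutine, $mixed);
$subroutineSame  = preg_match($subroutine, 'sense and sensibility');

var_dump($backrefMixed, $subroutineMixed, $subroutineSame);
```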
Performance
Certain items that may appear in patterns are more efficient
than others. It is more efficient to use a character class
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey
Friedl's book contains a lot of discussion about optimizing
regular expressions for efficient performance.
When a pattern begins with .* and the PCRE_DOTALL option is
set, the pattern is implicitly anchored by PCRE, since it
can match only at the start of a subject string. However, if
PCRE_DOTALL is not set, PCRE cannot make this optimization,
because the . metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may
match from the character immediately following one of them
instead of from the very start. For example, the pattern
(.*) second
matches the subject "first\nand second" (where \n stands for
a newline character) with the first captured substring being
"and". In order to do this, PCRE has to retry the match
starting after every newline in the subject.
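A small PHP sketch of this behaviour, assuming the s modifier corresponds to PCRE_DOTALL. The variable names are illustrative.

```php
<?php
$subject = "first\nand second";

// Without s, the dot stops at the newline, so the match starts after it.
preg_match('/(.*) second/', $subject, $default);

// With s, the dot also matches the newline, so the match starts at offset 0.
preg_match('/(.*) second/s', $subject, $dotall);

var_dump($default[1]); // the text after the newline
var_dump($dotall[1]);  // everything before " second", newline included
```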
If you are using such a pattern with subject strings that do
not contain newlines, the best performance is obtained by
setting PCRE_DOTALL, or starting the pattern with ^.* to
indicate explicit anchoring. That saves PCRE from having to
scan along the subject looking for a newline to restart at.
Beware of patterns that contain nested indefinite repeats.
These can take a long time to run when applied to a string
that does not match. Consider the pattern fragment
(a+)*
This can match "aaaa" in 16 different ways, and this number
increases very rapidly as the string gets longer. (The *
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0, the + repeats can match different
numbers of times.) When the remainder of the pattern is such
that the entire match is going to fail, PCRE has in principle
to try every possible variation, and this can take an
extremely long time.
An optimization catches some of the more simple cases such
as
(a+)*b
where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
later in the subject string, and if there is not, it fails
the match immediately. However, when there is no following
literal this optimization cannot be used. You can see the
difference by comparing the behaviour of
(a+)*\d
with the pattern above. The pattern above, (a+)*b, gives a failure
almost instantly when applied to a whole line of "a" characters,
whereas (a+)*\d takes an appreciable time with strings
longer than about 20 characters.
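One possible way to limit this kind of backtracking, suggested by the once-only subpattern used in the recursion example earlier, is to make the inner repeat atomic. The sketch below only checks that the two forms agree on short subjects; it does not measure running time, and the atomic variant no longer captures the repeated text.

```php
<?php
// Nested unlimited repeat from the text, and a variant whose inner repeat is
// wrapped in a once-only (atomic) subpattern so it cannot be split up again.
$nested = '/(a+)*\d/';
$atomic = '/(?>a+)*\d/';

// Short subjects only: this compares results, not timings.
foreach (['aaaa1', 'aaaa', '7'] as $subject) {
    printf(
        "%-6s nested=%d atomic=%d\n",
        $subject,
        preg_match($nested, $subject),
        preg_match($atomic, $subject)
    );
}
```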
ufotds at yahoo dot com
29-May-2006 11:25
Ok, I will try to explain the regex by davout. Most of this is explained above, but if you are new to regex and have just read all of the above, you probably deserve a practical one, nicely dissected. I dissect it part by part...
%&quot;(.+)&quot;[\s]+((http|https|ftp|file)://[^\s]+?)(.*)%sU
% //this is a delimiter, basically telling the parser that the expression stops where it is found again.
&quot; //this is an html entity encoding; in html, quotes are like metacharacters, so they are escaped this way. see here for a table http://www.cookwood.com/entities/
( //this is the beginning of the first group, later referenced by $1, or (if referenced within the expression) \1, which doesn't happen here
. //matches any character... (normally except newline, unless you set the s modifier at the end, which happens here, so it is anything including newline)
+ //... one or more times
) //end of first group
&quot; //same as above
[\s] //[] denotes a class, meaning that this place can hold any character from within the class; in this case it is a metacharacter signifying any whitespace character
+ //...again one or more of the preceding item
( //open the second group $2
(http|https|ftp|file) //the third group ($3) is an alternation, and can be any of these to fulfill a match.
:// //they just represent themselves literally
[^\s]+? //this is one or more non-whitespace characters (^ negates the class), where the quantifier is greedy this time because of the U modifier; see the section on repetition above for how it works. basically we want the group to include the whole url, and because the next item (.*) matches anything, an ungreedy quantifier here would only match one character, because the .* would take over immediately.
) //close the second group
(.*) //$4 this basically consumes the rest of the string to put it back after we are done. As far as I can see, this last group is just redundant, as we don't replace this, we just put it back. Personally I would leave it out.
% //end of expression
s //make a . fit even newline
U //Reverses default to ungreedy, and +? or ?? or *? to greedy.
The groups are then used to replace the pattern and put these pieces back... So, to make a long story short, you can write your link like &quot;name&quot; http://my.link bla bla bla, and it will be replaced by a real one.
So, that's how easy it is... Have fun
bobbykjack at yahoo dot co dot uk
05-May-2006 03:53
If anyone else out there is having a problem with the 'general repetition quantifier', please take heed of the note ", separated by a comma." (which should be rewritten ",separated ..." !) Specifically,
<?php
$pattern = '/z{2, 4}/';
?>
is NEVER what you want. Ensure there is no space between the min and max values:
<?php
$pattern = '/z{2,4}/';
?>
spook at op.pl
05-Apr-2006 02:37
A useful note for beginners: note the difference between mathematical and PHP regular expressions. The _mathematical_ regex:
(a+b+c)*
which written in PHP syntax will look like:
[abc]*
will match any string built of a, b or c letters, but will not match a string such as:
abcd
However, the _PHP_ regular expression will match the above string, because the regex means "accept all strings which contain 0 or more occurrences of the letters a, b or c".
To convert the regexp from the mathematical to the PHP convention, use the ^ and $ characters, which indicate the start and end of the tested string. So the regexp:
^[abc]*$
means "match all strings which, between their beginning and end, have 0 or more occurrences of the letters a, b or c", which is what we searched for.
Nasty habit, especially after two tests on "theoretical basics of computer science" :)
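The point of the note above can be checked with a short sketch; the subjects and variable names are chosen for illustration.

```php
<?php
// Unanchored: succeeds because [abc]* can match part of the subject
// (or even the empty string) somewhere in it.
$unanchored = preg_match('/[abc]*/', 'abcd');

// Anchored: the whole subject must consist of a, b or c.
$anchoredBad  = preg_match('/^[abc]*$/', 'abcd');
$anchoredGood = preg_match('/^[abc]*$/', 'abcabc');

var_dump($unanchored, $anchoredBad, $anchoredGood);
```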
michael(at)webstaa(dot)com
14-Dec-2005 10:21
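If you want to match any character other than reverse solidus (backslash), e.g. to replace two consecutive single quotes in a string, but not if they are preceded by a backslash, try:
<?php
// [^\x5c] matches one character that is not a backslash. It is captured and
// put back through ${1}, otherwise the replacement would swallow it.
$sql = preg_replace('/([^\x5c])\'\'/', '${1}null', $sql);
?>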
Daniel Vandersluis
24-Nov-2005 02:50
Concerning note #6 in "Differences From Perl", the \G token *is* supported as the last match position anchor. This has been confirmed to work at least in preg_replace(), though I'd assume it'd work in preg_match_all(), and other functions that can make more than one match, as well.
roland dot illig at gmx dot de
08-Nov-2005 05:02
<quote>
9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.
</quote>
The last sentence does not indicate a bug. If the string "a" should match against the regular expression /^(a)?a/, the last "a" in the regex must be matched by any literal "a" in the string. The rest of the string is "", which obviously does not match the first /^(a)/.
Toni
23-Sep-2005 03:47
While Andrew is right about using a parser when a parser is called for, I have to point out that the "regex killer" example he gives is consistent with the general GIGO (garbage in -> garbage out) principle. That said, the regex mentioned below will do exactly what it's supposed to do on it: choke.
That type of code is not recommended by W3C. Although an xhtml document will validate, it will yield warnings. It is much safer to use "&lt;" in such a case. That would be <img class="match1" alt="&lt;">. The lesson here is "rfm". And for advanced users "RTFM".
Andrew
06-Sep-2005 01:03
wfinn, your example falls into one of the common traps, which is trying to parse a complicated language with a simple regex. Just try feeding it input containing a tag like <img class="match1" alt="<">. It'll die a horrible death. The main lesson here is "never parse HTML with a regex". The lesson for the advanced reader is "never use a regex where a parser is called for."
davout_69 at yahoo dot com
23-Jul-2005 03:36
Found a great expression that replaces a text url - http:/.... with <a href ="http://www...">
Although it works, I still don't understand how. If someone could explain this, it would be great to learn what amounts, to me, to a foreign language :-)
Here is the expression:
'%"(.+)"[\s]+((http|https|ftp|file)://[^\s]+?)(.*)%sU'
$pcre = '%"(.+)"[\s]+((http|https|ftp|file)://[^\s]+?)(.*)%sU';
if ( preg_match( $pcre, $lines ) ){
$lines = preg_replace( $pcre, '<a href="$2" target="_top">$1</a>$4', $lines );
} else {
$pcre = '%\b((http|https|ftp|file)://[^\s]+?)(.*)%U';
$lines = preg_replace( $pcre, '<a href="$1" target="_top">$1</a>$3', $lines );
}
Ned Baldessin
16-Jul-2005 07:14
Although \w and \W do include as "word characters" locale-specific characters (like "é" if you are using the "fr" locale), \b and \B do not work the same way.
For example :
"foo était bar" => /\W(était)\W/ => This captures correctly "était".
"foo était bar" => /\b(était)\b/ => This fails to capture it.
This is confusing, because the manual talks in both cases about "word characters", but fails to mention the difference in behaviour.
wfinn at yakasha dot net
16-Jul-2005 03:22
Some simple assertions so you only match inside (or outside) an html tag:
<?php
$string1 = "<b><span class=\"match1\">match2</span><span class=\"match3\">match4</span></b>match5";
$string2 = preg_replace( "/match(?![^<]*?>)/", "replacement", $string1 ); // Matches outside the html tag
$string3 = preg_replace( "/match(?=[^<]*?>)/", "replacement", $string1 ); // Matches inside the html tag
echo "1: " . str_replace(array("<",">"), array("&lt;","&gt;"), $string1) . "<br>\n";
echo "2: " . str_replace(array("<",">"), array("&lt;","&gt;"), $string2) . "<br>\n";
echo "3: " . str_replace(array("<",">"), array("&lt;","&gt;"), $string3) . "<br>\n";
?>
This outputs:
1: <b><span class="match1">match2</span><span class="match3">match4</span></b>match5
2: <b><span class="match1">replacement2</span><span class="match3">replacement4</span></b>replacement5
3: <b><span class="replacement1">match2</span><span class="replacement3">match4</span></b>match5
june05 at tilo-hauke dot de
05-Jun-2005 02:24
// To match quoted strings in code while skipping any escaped quotes inside them, use this expression.
// The x modifier makes PCRE ignore the line breaks used here to lay out the pattern.
$matchQuotings='/(((?<!\\\)")((?<=\\\)"
|[^"])*((?<!\\\)"))
|(((?<!\\\)\')((?<=\\\)\'
|[^\'])*((?<!\\\)\'))
/x';
// for example: matchQuotings finds:
// "javascript:func('string-param');"
// in strings like this:
// ... <a href="javascript:func('string-param');"> ...
// and finds:
// 'javascript:func("string-param");'
// in strings like this:
// ... <a href='javascript:func("string-param");'> ...
// and finds:
// 'a "b" c'
// "d 'e' f"
// in strings like this:
// ... x 'a "b" c' y "d 'e' f" z...
xlex0x835 at rambler dot ru
16-May-2005 02:38
Attention (it may save you lots of time): this doc says nothing about Unicode handling in the PCRE (preg_*) functions, although it is very important.
Just read the missing part (from http://www.pcre.org/pcre.txt):
==================
In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when Uni-
code character property support is available.
==================
What to do:
==================
Unicode character properties
When PCRE is built with Unicode character property support, three addi-
tional escape sequences to match generic character types are available
when UTF-8 mode is selected. They are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence
The property names represented by xx above are limited to the Unicode
general category properties. Each character has exactly one such prop-
erty, specified by a two-letter abbreviation. For compatibility with
Perl, negation can be specified by including a circumflex between the
opening brace and the property name. For example, \p{^Lu} is the same
as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the
properties that start with that letter. In this case, in the absence of
negation, the curly brackets in the escape sequence are optional; these
two examples have the same effect:
\p{L}
\pL
The following property codes are supported:
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
ported by PCRE.
Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.
==================
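A brief PHP sketch of the property escapes quoted above. It assumes the u modifier selects UTF-8 mode and that the PCRE library in use was built with Unicode property support; the "\u{...}" string escapes need PHP 7 or later, and the sample characters are chosen for illustration.

```php
<?php
$upper = "\u{00C9}"; // LATIN CAPITAL LETTER E WITH ACUTE
$lower = "\u{00E9}"; // LATIN SMALL LETTER E WITH ACUTE

$isUpper      = preg_match('/^\p{Lu}$/u', $upper);          // has the Lu property
$lowerIsUpper = preg_match('/^\p{Lu}$/u', $lower);          // does not have it
$negatedP     = preg_match('/^\P{Lu}$/u', $lower);          // \P negates
$negatedCaret = preg_match('/^\p{^Lu}$/u', $lower);         // same as \P{Lu}
$anyLetters   = preg_match('/^\pL+$/u', $upper . $lower);   // braces optional for one letter

// \X treats a base character plus its combining marks as one unit.
$units = preg_match_all('/\X/u', "e\u{0301}a");

var_dump($isUpper, $lowerIsUpper, $negatedP, $negatedCaret, $anyLetters, $units);
```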
onerob at gmail dot com
02-Apr-2005 08:51
If, like me, you tend to use the /U pattern modifier, then you will need to remember that using ? or * to test for optional characters will match zero characters if the rest of the pattern can still go on to match, even if the optional characters exist.
For instance, if we have this string:
a___bcde
and apply this pattern:
'/a(_*).*e/U'
The whole pattern is matched but none of the _ characters are placed in the sub-pattern. The way around this (if you still wish to use /U) is to use the ? greediness inverter. eg,
'/a(_*?).*e/U'
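The two patterns from this note can be compared with a short sketch; the variable names are illustrative.

```php
<?php
$subject = 'a___bcde';

// Under /U, _* is ungreedy: it matches nothing and .* picks up the underscores.
preg_match('/a(_*).*e/U', $subject, $ungreedy);

// Under /U, _*? is greedy again and takes all of the underscores.
preg_match('/a(_*?).*e/U', $subject, $greedy);

var_dump($ungreedy[0], $ungreedy[1]);
var_dump($greedy[0], $greedy[1]);
```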
W W W
07-Mar-2005 11:22
Back references are a great way to achieve exact matching when it would have been impossible any other way. Take these three strings.
1) "www.www.com"
2) 'www.www.com'
3) "www.www.com'
The regex /^("|').+?("|')$/ would match all three strings but what if you needed the 3rd string above to be illegal because the quotes are not the same? You could write four different regexes to check for every possible case OR you could use back references.
/^("|').+?\1$/ will match strings 1 and 2 but not string 3. Try this code for further proof:
$str_test="'www.www.com\"";
$int_count=preg_match("/^(\"|').+?\\1$/", $str_test, $matches, PREG_OFFSET_CAPTURE);
The preg_match function will not match against $str_test because the quotes are mismatched. If you change $str_test to
$str_test = "'www.www.com'";
the preg_match will work.
SG_01[-at-]sg01[-dot-]net
11-Feb-2005 01:24
Re: info at atjeff dot co dot nz
It's very simple actually, the question mark acts as a quantity minimizer. Which means it won't take up as much space as it can for one "tag". The same can be achieved by putting a U after the ending slash.
Like this:
<?PHP
function Simple_Tag_Decode($Data, $Tag) {
return preg_replace("/>$Tag<(.*)>\/$Tag</iU", "<$Tag>\\1</$Tag>", $Data);
}
?>
info at atjeff dot co dot nz
08-Feb-2005 08:46
I've never used regular expressions until now and had loads of difficulty trying to convert a [url]link here[/url] into an href for use with posting messages on a forum. Here's what I managed to come up with:
$patterns = array(
"/\[link\](.*?)\[\/link\]/",
"/\[url\](.*?)\[\/url\]/",
"/\[img\](.*?)\[\/img\]/",
"/\[b\](.*?)\[\/b\]/",
"/\[u\](.*?)\[\/u\]/",
"/\[i\](.*?)\[\/i\]/"
);
$replacements = array(
"<a href=\"\\1\">\\1</a>",
"<a href=\"\\1\">\\1</a>",
"<img src=\"\\1\">",
"<b>\\1</b>",
"<u>\\1</u>",
"<i>\\1</i>"
);
$newText = preg_replace($patterns,$replacements, $text);
At first it would collect ALL the tags into one link/bold/whatever, until I added the "?". I still don't fully understand it... but it works :)
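What the "?" changes can be seen in a small sketch with two [b] tags in one string. The sample text and variable names are made up for illustration.

```php
<?php
$text = '[b]one[/b] and [b]two[/b]';

// Greedy: (.*) runs to the LAST [/b], so both tags collapse into one.
$greedy = preg_replace('/\[b\](.*)\[\/b\]/', '<b>$1</b>', $text);

// Ungreedy: (.*?) stops at the FIRST [/b], so each tag is handled on its own.
$lazy = preg_replace('/\[b\](.*?)\[\/b\]/', '<b>$1</b>', $text);

echo $greedy, "\n";
echo $lazy, "\n";
```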
pilot doofy at gmail d0t k0m
05-Jan-2005 12:33
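Keep in mind that a counted repeat of a character class does not skip over whitespace: [[:alpha:]]{n} needs n letters in a row. I made this mistake quite often. Remember the [[:space:]] character class. For example (eregi() was removed in PHP 7, so this uses preg_match() with the i modifier):
$string = "Wow php is really cool";
preg_match("/[[:alpha:]]{6}/i", $string); // returns 1: "really" is six letters in a row
preg_match("/[[:alpha:]]{18}/i", $string); // returns 0: the 18 letters are separated by whitespace
preg_match("/[[:alpha:][:space:]]{22}/i", $string); // returns 1: letters and whitespace are both allowed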
J Daugherty
10-Dec-2004 01:06
In the character class meta-character documentation above, the circumflex (^) is described:
"^ negate the class, but only if the first character"
It should be a little more verbose to fully express the meaning of ^:
^ Negate the character class. If used, this must be the first character of the class (e.g. "[^012]").
29-May-2004 05:15
In addition to the meta-characters mentioned above, there can be another special character in a regular expression: the delimiter you use to start and end your expression. Often people use the / character for this.
For example, if you wanted to search for text surrounded by opening and closing tags like '<TD>SELL</TD>' and replace it with nothing (erase it), you might be tempted to use a regex like this:
<?php
$myNewText = preg_replace('/<TD>SELL</TD>/', "", $myText);
?>
This does not work properly. As mentioned in the Introduction at the top of http://www.php.net/manual/en/ref.pcre.php, if the delimiter appears in the middle of your regular expression, then you must put a \ character before it. So this DOES work:
<?php
$myNewText = preg_replace('/<TD>SELL<\/TD>/', "", $myText);
?>
That same Introduction also mentions that you can start and end your expression with characters other than the usual /. Because there are no % characters in the middle of my expression above, I might prefer to use the following:
<?php
$myNewText = preg_replace('%<TD>SELL</TD>%', "", $myText);
?>
That also works correctly, and I did not need a \ before the /.
napalm at spiderfish dot net
18-Mar-2004 12:14
Pay attention that some pcre features such as once-only or recursive patterns are not implemented in php versions prior to 5.00
Napalm