Brief description of regular expressions: POSIX and PCRE
Part 1: Regular expressions
To begin with, PHP supports two standards of regular expressions: POSIX and, since version 4, Perl-compatible. The first standard is used by the Apache server in mod_rewrite and by MySQL in its queries (search the MySQL manual for "regexp" and you will see it at once). The second, as its name makes clear, comes from Perl. The two standards do not differ in principle. The Perl-compatible one adds special symbols that stand for the most common character classes (for example, \d for digits and \w for letters, digits and the underscore) and pattern modifiers that control case sensitivity, the treatment of line ends and so on. In the POSIX functions case sensitivity is handled simply by the choice of function: ereg and ereg_replace are case-sensitive, eregi and eregi_replace are not. In other respects the two standards are compatible, and patterns are written the same way.
If you have worked with Norton/Windows Commander or Far, you know about wildcards. For example, delete c:\windows\*.* removes all the files from the given directory. File names are simple things, so the wildcard system is simple too: * stands for any run of characters, including an empty one (*.txt), ? stands for any single character or none at all (document?.txt), and there are a few symbols for letters and digits.
Regular expressions take a different approach. First of all, the system is universal and has to find matches for any request, even a very complicated one. I will now introduce a few terms that I use below, to avoid long-winded definitions.
The purpose of the system is to let you search not only for exactly specified characters but also for a given number of characters of a given kind ("John (.*) Bull"). In this example any number of any characters may stand between the two words. To find six digits we write "[0-9]{6}" (or, for six to eight digits, "[0-9]{6,8}"). Why do we need all this? Because, unlike the wildcards of the operating system, regular expressions separate the character set from the count: «character set»«quantifier». The character set can be the wildcard symbol, a dot, or an explicit set of characters (ranges such as "0-9" are supported). A set can also be negated: "any character except these".
The official documentation calls the count a "quantifier". The term is convenient and causes no confusion. A quantifier can be concrete, either a single fixed number ("{6}") or a numeric interval, or abstract: "any number, including 0" ("*"), "any natural number", from 1 to infinity ("+", as in "document[0-9]+\.txt"), or "either 0 or 1" ("?"). By default the quantifier for a character set is 1 ("document[0-9]\.txt").
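The quantifier forms above can be tried out with a short sketch. It assumes PHP 7 or later, where the ereg functions no longer exist, so preg_match stands in for them; the sample strings are made up for illustration, and "^" and "$" simply pin each pattern to the whole string.

```php
<?php
// Each pattern pairs a character set with a quantifier.
$patterns = [
    'exactly six digits'   => '/^[0-9]{6}$/',
    'six to eight digits'  => '/^[0-9]{6,8}$/',
    'one or more digits'   => '/^document[0-9]+\.txt$/',
    'any number of digits' => '/^document[0-9]*\.txt$/',
    'at most one digit'    => '/^document[0-9]?\.txt$/',
    'exactly one digit'    => '/^document[0-9]\.txt$/',
];

// Hypothetical sample strings.
$samples = ['123456', '1234567890', 'document.txt', 'document7.txt', 'document42.txt'];

$results = [];
foreach ($patterns as $label => $pattern) {
    foreach ($samples as $sample) {
        // preg_match returns 1 on a match and 0 otherwise.
        $results[$label][$sample] = preg_match($pattern, $sample) === 1;
    }
}

// Print, for every pattern, the samples it accepted.
foreach ($results as $label => $row) {
    echo $label, ': ', implode(', ', array_keys(array_filter($row))), "\n";
}
```

Note how "document.txt" passes only the patterns whose quantifier allows zero digits.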
Of course, these "character set plus quantifier" pairs can be combined into larger structures for more flexible searches.
Like any tool, regular expressions are flexible only up to a point: their field of application is limited. For example, if you need to replace one fixed string in a text with another fixed string, use str_replace. The PHP developers beg you not to use the heavier ereg_replace or preg_replace for this, because every call has to interpret the pattern, and that consumes system resources. Unfortunately, it is a favorite habit of beginning PHP programmers.
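A minimal sketch of the point about fixed strings. The text and the NAME placeholder are invented for the example; both calls give the same result, but only the second has to parse a pattern first.

```php
<?php
// Hypothetical template text with a fixed placeholder.
$text = 'Dear NAME, your order is ready. Goodbye, NAME.';

// Fixed string replaced by a fixed string: no pattern is involved.
$plain = str_replace('NAME', 'John', $text);

// The same job through the regular expression engine.
$viaRegex = preg_replace('/NAME/', 'John', $text);

echo $plain, "\n";
```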
Use the regular expression functions only when you do not know exactly which string is "there". One example is a site search, where special characters and short words are cut from the search string and extra spaces are removed (or rather, all runs of spaces are compressed: " +" is replaced with a single space). I also use these functions to check the e-mail address of a user who leaves a comment. You can do many useful things, but remember that regular expressions are not almighty. For example, you had better not use them for a complicated replacement in a big text. In terms of the program, the combination "(.*)" means trying every character of the text. If the pattern is not anchored to the beginning or the end of the string, the program also slides the pattern across the whole text, so we get a double loop, that is, work on the order of the text length squared. It is easy to see that one more such combination makes it cubed, and so on. Cube 5 kilobytes of text and you get 125 000 000 000 operations. Strictly speaking there will not be quite so many, perhaps 4 or 8 times fewer, but it is the order of magnitude that matters.
So, the principles, advantages and disadvantages have been described, and it is time to move on to concrete examples. The next two issues are devoted to the two standards of regular expressions, POSIX and PCRE.
Part 2: POSIX
Let's continue our conversation. The previous issue was introductory and theoretical. Today's main topic is the POSIX standard. In the next issue I'll describe the differences or, to put it more precisely, what the Perl-compatible standard adds on top. So, step by step.
Character sets
.           dot              any character
[<chars>]   square brackets  character class ("any of")
[^<chars>]                   negated character class ("any except")
-           dash             range inside a character class ("[0-9]" means the digits)
No special explanation is needed, apart from two remarks. Don't use a character class to denote a single character (" +" does the same as "[ ]+"). And remember that a dot inside a class is just a literal dot, not "any character"; to match everything except certain characters, use a negated class.
Quantifier
As I have already written, this specifies how many characters from the set are expected. A quantifier can give an exact number or a range. If the number of characters found falls within the quantifier's limits, that fragment of the expression is considered to match the string being analyzed. Syntax: either {«quantity»} or {«minimum»,«maximum»}.
If you need only a minimum and no maximum, put the comma and leave out the second number: "{5,}" ("at least 5"). The most common quantifiers have special symbols:
*   asterisk        {0,}
+   plus            {1,}
?   question mark   {0,1}
In practice these symbols are used more often than braces.
Anchors
^   anchor to the beginning of the line
$   anchor to the end of the line
These symbols must stand at the very beginning and the very end of the pattern respectively. In a double-quoted PHP string it is better to put a backslash before the $ at the end, so that the interpreter does not take it for the start of a variable name: ereg("foo\$", $bar).
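The anchors can be checked with a small sketch. It assumes PHP 7 or later and therefore uses preg_match instead of ereg; the sample string is made up.

```php
<?php
// Hypothetical sample string.
$bar = 'some food, then foo';

// In double quotes "\$" yields a literal dollar sign, so PHP does not
// look for a variable. In single quotes no backslash is needed.
$endsWithFoo   = preg_match("/foo\$/", $bar) === 1;
$sameInSingle  = preg_match('/foo$/', $bar) === 1;

// "^" ties the pattern to the beginning of the string.
$startsWithFoo = preg_match('/^foo/', $bar) === 1;

// Without anchors the pattern may match anywhere.
$containsFoo   = preg_match('/foo/', $bar) === 1;
```

Here only the "^foo" check fails, because the string begins with "some".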
Structure
This is needed for complicated requests. For example, you need a run of only lowercase letters, or only uppercase letters, or only digits. The character class "[a-zA-Z0-9]" does not fit, because it lets all three kinds mix. So we write:
<?php
if (ereg("[a-z]+|[A-Z]+|[0-9]+", $text))
{
...
}
?>
The vertical bar means "or" in regular expressions (there is no "and" sign, of course: that is the regular expression itself). The official documentation calls patterns separated by a vertical bar alternative branches (a branch may in turn contain alternative sub-branches). The program compares the branches with the string one after another, from left to right, up to the first match (this is important to keep in mind if you have an expression with sub-branches). To separate the levels and to set the tree of alternatives apart from the rest of the pattern, use parentheses. Suppose you need to find the same lowercase letters, uppercase letters or digits inside a tag container:
<?php
if (ereg("<tag>([a-z]+|[A-Z]+|[0-9]+)</tag>", $text))
{
...
}
?>
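Why the parentheses matter can be seen in a short sketch. It assumes PHP 7 or later (preg_match in place of ereg); both function names are hypothetical, and "#" is used as the pattern delimiter only because the pattern itself contains a slash.

```php
<?php
// With parentheses the alternatives are confined to the tag contents.
function tag_is_uniform(string $text): bool
{
    return preg_match('#<tag>([a-z]+|[A-Z]+|[0-9]+)</tag>#', $text) === 1;
}

// Without parentheses the bar splits the WHOLE pattern into three branches:
// "<tag>[a-z]+", "[A-Z]+" and "[0-9]+</tag>".
function ungrouped_match(string $text): bool
{
    return preg_match('#<tag>[a-z]+|[A-Z]+|[0-9]+</tag>#', $text) === 1;
}

// Mixed case inside the tag: rejected by the grouped pattern, but the
// ungrouped one is satisfied by the single capital letter "H".
$grouped   = tag_is_uniform('<tag>Hello</tag>');
$ungrouped = ungrouped_match('<tag>Hello</tag>');
```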
That is all for the complicated part. Now for simpler things. The proper term for what stands inside parentheses is "subpattern". Subpatterns serve not only for alternatives but also for flexibly substituting fragments of text or capturing them into a variable. For example, for the printed version of a text we repeat each link's address as text in square brackets:
<?php
$text = ereg_replace("<a href=([^>]+)>[^<]+</a>", "\\0 [\\1]", $text);
?>
The contents of the first pair of parentheses, the first subpattern, are inserted at the end through the "\N" notation, where N is the subpattern's number (since PHP, like many other languages, uses the backslash for special characters, you have to put one more backslash before it so that it is taken literally). Number zero stands for the whole matched string. In my own printed version I don't mark the links in the text straight away; I build a list of them at the end, like this:
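<?php
// preg_match_all (PCRE, see Part 3) is used here because ereg stops at the first match.
if (preg_match_all("#<a href=([^>]+)>([^<]+)</a>#", $text, $match)) {
    // $match[0] holds the full matches, $match[1] the addresses.
    for ($a = 0; $a < count($match[0]); $a++) {
        $b = $a + 1;
        // Put the reference number after the link in the text...
        $text = str_replace($match[0][$a], $match[0][$a]." [$b]", $text);
        // ...and prefix the address with the same number for the list.
        $match[1][$a] = "$b) ".$match[1][$a];
    }
    $text .= "<br><h2>References used in issue:</h2>".implode("<br>", $match[1]);
}
?>
The function ereg (and eregi), if you pass a variable as the third parameter, fills it with a one-dimensional array: element 0 is the whole matched string and the following elements are the subpatterns of the first match only. To collect every link, as here, a two-dimensional array is needed, and that is what preg_match_all from the PCRE family provides.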
That's all there is to it. The rest is knowing how to compose patterns, so here are some examples.
Rewriting all addresses by the Apache server (as I have already mentioned, Apache works with the POSIX standard). Searching a database: the SQL query is built from the user's search request. Leaving aside the collection of search statistics, this turns out to need only 6 or 7 lines of code. Highlighting the words in the search results is described there as well. By the way, an important note: before cutting short words from the string, I replace the single spaces between words with double ones. Why? Because strings that match the pattern must not overlap.
I'll explain in detail. If the pattern has no anchors, the system scans the text from left to right; when it finds a match, it stores it in variables and then jumps to the first character after the matched fragment. Suppose we search for the pattern "space, short word, space" and the spaces are single. The program finds "space, short word, space", replaces it with a single space and jumps to the first letter of the next word. That is not a space, so even if the next word is short too, it does not fit the pattern. That is why we have to replace all the spaces with double ones.
How to store news in files without looping over dates:
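The trick with doubled spaces can be sketched as follows. This is not the code from the search issue, only an illustration under my own assumptions: the function name strip_short_words and the limit of two characters are hypothetical, and preg_replace is used because ereg_replace is gone in PHP 7.

```php
<?php
// Hypothetical helper: removes words of at most $maxShort characters.
function strip_short_words(string $query, int $maxShort = 2): string
{
    // Double every space and pad both ends, so that each word has
    // its own leading and trailing space and matches cannot overlap.
    $padded = ' ' . str_replace(' ', '  ', trim($query)) . ' ';

    // "space, short word, space" becomes a single space.
    $stripped = preg_replace('/ [^ ]{1,' . $maxShort . '} /', ' ', $padded);

    // Compress the remaining runs of spaces and trim the padding.
    return trim(preg_replace('/ +/', ' ', $stripped));
}

$withDoubling = strip_short_words('a to the big cat is on');

// The same pattern without doubling: a short word that directly follows
// another short word survives, because its leading space is already used up.
$withoutDoubling = trim(preg_replace('/ [^ ]{1,2} /', ' ', ' a to the big cat is on '));
```

With doubling the result is "the big cat"; without it, "to" and "on" stay in the string.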
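<?php
$handle = opendir($newsdir);
// Compare with false explicitly so that a file named "0" does not end the loop.
while (($file = readdir($handle)) !== false) {
    // readdir returns bare names, so prepend the directory.
    $path = $newsdir."/".$file;
    if (is_file($path) && ereg("^[0-9]{6}\.txt\$", $file))
        // The three pairs of digits in the file name become a date ending in 20xx.
        print ("<p align=justify><b>".
            ereg_replace("^([0-9]{2})([0-9]{2})([0-9]{2})\.txt\$", "\\1.\\2.20\\3", $file).
            "</b> ".implode("", file($path)). "</p>");
}
closedir($handle);
?>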
Checking that an e-mail address is spelled correctly:
<?php
// Inside square brackets the dot is literal, so it needs no backslash.
if (!eregi("^[a-z0-9._-]+@[a-z0-9._-]+\.[a-z]{2,4}\$", $email))
{
print("Bad email: \"$email\"");
}
?>
That's all. The next issue covers the PCRE standard and the additional possibilities it offers.
Part 3: PCRE
The series of issues about regular expressions is coming to an end. Let's talk about Perl-compatible regular expressions (PCRE).
Their main advantage over POSIX is the possibility of a non-greedy search. In PCRE the question mark also serves as a quantifier minimizer: .*? finds the shortest suitable string. Nothing special, you think? You are wrong, it is something quite special. Remember the example about the printed version of a text that I showed in the last issue?
<?php
$text = ereg_replace("<a +href=([^>]+)>[^<]+</a>", "\\0 [\\1]", $text);
?>
This means that there must not be any tags inside the link (for example "<a href=...><b>...</b></a>"). If we write this instead:
<?php
$text = ereg_replace("<a +href=([^>]+)>.*</a>", "\\0 [\\1]", $text);
?>
we get... You are right: the whole text between the start of the first link and the end of the last one. The non-greedy search solves the problem.
<?php
// The slash inside </a> is escaped because the slash is also the pattern delimiter.
$text = preg_replace("/<a\s+href=(.*?)>.*?<\/a>/", "\\0 [\\1]", $text);
?>
The program finds the shortest suitable string for every link, that is, only up to the nearest "</a>" tag. There is no point in spelling out the importance of this PCRE feature: it is enormous. Let's go on.
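The difference between the two searches is easy to see on a text with two links. A minimal sketch: the sample text is made up, "#" serves as the delimiter so that the slash in "</a>" needs no escaping, and in single-quoted strings '\\0' reaches preg_replace as \0.

```php
<?php
// Hypothetical text with two links, the first one containing a nested tag.
$text = 'See <a href=one.html><b>one</b></a> and <a href=two.html>two</a>.';

// Greedy: ".*" runs to the LAST "</a>", so both links fall into one match.
$greedy = preg_replace('#<a\s+href=([^>]+)>.*</a>#', '\\0 [\\1]', $text);

// Non-greedy: ".*?" stops at the FIRST "</a>", so each link is handled separately.
$lazy = preg_replace('#<a\s+href=([^>]+)>.*?</a>#', '\\0 [\\1]', $text);

echo $greedy, "\n", $lazy, "\n";
```

The greedy variant appends a single "[one.html]" after the second link; the non-greedy one appends the right address after each link.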
Now we can write digits not as "[0-9]" but simply as "\d", and non-digits ("[^0-9]") as "\D". It is rather convenient. Here are the rest of the shorthands:
\w   [a-zA-Z0-9_]    letters, digits and the underscore
\W   [^a-zA-Z0-9_]   anything else
\s   any whitespace character (space, tab, line break)
\S   any character that is not whitespace
I recommend looking through the issues about search: these symbols are used there.
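A short sketch of these shorthands at work. The sample line is made up; preg_split is a standard PCRE function that the text does not discuss, used here only to show \W.

```php
<?php
// Hypothetical sample line; "\t" in double quotes is a tab character.
$line = "Order 66 shipped_to:\tJohn Bull";

// \D: drop every non-digit.
$digitsOnly = preg_replace('/\D/', '', $line);

// \W: split on runs of non-word characters (the underscore is a word character).
$words = preg_split('/\W+/', $line);

// \s: tabs and spaces alike are squeezed into one space.
$squeezed = preg_replace('/\s+/', ' ', $line);
```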
The pattern string, as you have already noticed, opens and closes with slashes. They are delimiters: the first marks where the pattern begins, and the last separates the pattern from the modifiers. The modifiers I have figured out are the following:
i   case-insensitive search
m   multi-line mode. By default PCRE treats the text as a single line, and "^" and "$" match only the beginning and the end of the whole text. With this modifier set, "^" and "$" also match the beginning and the end of each line.
s   the "." (dot) symbol also matches a line break (by default it does not)
A   anchors the pattern to the beginning of the text
D   makes "$" match only at the very end of the text. Ignored if the "m" modifier is set.
U   inverts the greediness of every quantifier: quantifiers become non-greedy by default and turn greedy only when followed by "?"
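A few of these modifiers can be compared side by side. This is only a sketch with made-up sample strings; the variable names are mine.

```php
<?php
$text = "First line\nsecond LINE\nthird";

// i: letter case is ignored.
$caseSensitive   = preg_match('/second line/', $text);   // 0
$caseInsensitive = preg_match('/second line/i', $text);  // 1

// m: "^" matches at the start of every line, not only of the whole text.
$lineStarts  = preg_match_all('/^\w+/', $text);           // 1
$lineStartsM = preg_match_all('/^\w+/m', $text);          // 3

// s: the dot may cross line breaks.
$dotPlain = preg_match('/First.*third/', $text);          // 0
$dotAll   = preg_match('/First.*third/s', $text);         // 1

// U: quantifiers become non-greedy by default.
preg_match('/<.+>/', '<a><b>', $greedy);                  // $greedy[0] is "<a><b>"
preg_match('/<.+>/U', '<a><b>', $ungreedy);               // $ungreedy[0] is "<a>"

// D: "$" no longer matches before a trailing line break.
$dollar        = preg_match('/third$/', "third\n");       // 1
$dollarEndOnly = preg_match('/third$/D', "third\n");      // 0
```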
Of course, the case of the modifier letters matters. The rest you can look up in the PHP manual. Now about the PCRE functions.
Like ereg, preg_match finds only the first match. If you need to find all the matches and process the results somehow (rather than directly through preg_replace), use preg_match_all. Its parameters are the same.
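The shape of the results is the main practical difference between the two functions. A minimal sketch with a made-up string:

```php
<?php
// Hypothetical sample text with two phone-like numbers.
$text = 'Call 555-1234 or 555-9876.';

// preg_match stops at the first match and fills a one-dimensional array:
// index 0 is the whole match, 1 and 2 are the subpatterns.
preg_match('/(\d{3})-(\d{4})/', $text, $first);

// preg_match_all returns the number of matches and fills a two-dimensional
// array: $all[0] lists the whole matches, $all[1] and $all[2] the subpatterns.
$count = preg_match_all('/(\d{3})-(\d{4})/', $text, $all);
```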
A useful function worth mentioning is preg_quote, which puts a backslash before every special character (parentheses, square brackets and so on) so that they are taken literally. If some information comes from the user and you use it in a PCRE pattern, you had better escape the special characters in the incoming variable first.
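A sketch of how user input might be escaped before it goes into a pattern. The function name contains_literal is hypothetical; the second argument of preg_quote names the delimiter, so that it is escaped too.

```php
<?php
// Hypothetical helper: checks whether $needle occurs in $haystack,
// treating every character of $needle literally.
function contains_literal(string $haystack, string $needle): bool
{
    // Escape the special characters and the "/" delimiter used below.
    $pattern = '/' . preg_quote($needle, '/') . '/';
    return preg_match($pattern, $haystack) === 1;
}

// Without escaping, the dot in "a.b" would match any character.
$escaped = preg_quote('a.b');
```

For a plain substring check strpos would of course be enough; the helper only shows how preg_quote fits into a pattern.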
That is all I can say about regular expressions. All you need from here on is the art of combining patterns and writing algorithms.
In one of the previous issues I described a class-based mail exploder. I have now added saving addresses in files and subscription confirmation. Of course, the various address checks, fetching the list of active subscribers and all the rest also run on PCRE. Unfortunately, there was no time for testing and tweaking, so the mail exploder is still rough.



