Regular expressions, Part II
Positional assertion usage
The material in this article may seem useless to many of you, since most tasks that involve regular expressions can be solved with the methods I described in the previous article.
First, a short digression. How would you describe something unremarkable or hard to describe? I think you would describe something remarkable or easy to describe, and then say where the ‘unremarkable’ thing is located relative to it.
Regular expression syntax lets us describe what stands before and after the part of a string we want to find. It is like describing the location of a bank without giving its exact address: if you manage to describe the places between which the bank stands, and make clear that it is a bank and not a cafe, there is no need for the precise address.
Every regular expression search looks for matches: the search runs over a string, and a substring is sought in it. Now you will learn how to set conditions on what must match before the substring and after it. A condition on what precedes the substring is called a lookbehind assertion; a condition on what follows it is called an anticipatory assertion (more commonly known as a lookahead assertion). Put another way, a lookbehind assertion matches to the left of the substring and an anticipatory assertion matches to the right.
Positive and negative assertions
It is good when you can state what comes before the required substring and what comes after it. But what if you need to describe what does NOT come before or after it? For this purpose both lookbehind and anticipatory assertions come in two kinds, giving four in total: positive anticipatory, negative anticipatory, positive lookbehind and negative lookbehind.
A positive anticipatory assertion describes something that must stand immediately after the required substring. A negative anticipatory assertion describes something that must not stand after the required substring under any circumstances. Lookbehind assertions work the same way, except that the condition is checked before the required substring.
Lookbehind assertion
A positive lookbehind assertion, or assertion from the left, is written with the character sequence (?<=). The parentheses group the condition; the question mark tells the engine to treat the contents of the parentheses not as a group to be captured but as a group on which some other action is performed. Recall that plain parentheses define a group and capture its contents into a variable, whereas parentheses that open with a question mark and a colon, (?:), denote a special action: grouping without capturing. The character or characters after the question mark say exactly which action is to be performed. So the ‘less than’ and ‘equals’ signs that follow the question mark immediately after the opening parenthesis indicate that what follows is a positive lookbehind assertion, that is, a condition that must be satisfied immediately before the required substring.
A negative lookbehind assertion is written with a similar character sequence: (?<!). Note that the ‘equals’ sign has been replaced by an exclamation mark, and the assertion has changed its ‘polarity’ at once: it has become negative instead of positive. The logic is easy to remember, because many programming languages use the sequence != to test for ‘not equal’, the negation of ‘equal’.
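A minimal sketch of both lookbehind forms in PHP. The sample string and the variable names are invented for illustration and are not taken from the examples in this article.

```php
<?php
// Hypothetical sample string, used only for illustration.
$text = 'price: $100, weight: 250, discount: $15';

// Positive lookbehind: digits that ARE preceded by a dollar sign.
preg_match_all('/(?<=\$)\d+/', $text, $withDollar);

// Negative lookbehind: digits that are NOT preceded by a dollar sign.
// The \d in the class also rejects positions inside a number, otherwise
// "00" out of "$100" would be reported as a match.
preg_match_all('/(?<![\$\d])\d+/', $text, $withoutDollar);

print_r($withDollar[0]);    // 100 and 15
print_r($withoutDollar[0]); // 250
```

Note that the dollar sign itself never appears in the results: the assertion checks the position but does not consume any characters.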
Anticipatory assertion
If you have understood the description of lookbehind assertions, the main points of anticipatory assertions will be easy to grasp.
A positive anticipatory assertion is written with the character sequence (?=). As before, the parentheses indicate that the characters are grouped for some action, and the question mark shows that the next character specifies the action to perform on the group, or how the group is to be interpreted. The ‘=’ that follows the question mark indicates that the expression in parentheses is an anticipatory test: a check of the characters that must match to the right of the required substring.
A negative anticipatory assertion is written with the character sequence (?!). It is the same construct, except that the ‘equals’ sign is replaced by a negation, written as an exclamation mark.
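A minimal sketch of both anticipatory (lookahead) forms in PHP. The sample string and the variable names are assumptions made for illustration only.

```php
<?php
// Hypothetical sample string, used only for illustration.
$text = '100px 200em 300px 40';

// Positive anticipatory assertion: numbers that ARE followed by "px".
preg_match_all('/\d+(?=px)/', $text, $pixels);

// Negative anticipatory assertion: numbers NOT followed by "px".
// The \d alternative stops the engine from giving back a digit and
// reporting "10" out of "100px" as a match.
preg_match_all('/\d+(?!px|\d)/', $text, $others);

print_r($pixels[0]); // 100 and 300
print_r($others[0]); // 200 and 40
```

The extra \d in the negative form is a common precaution: without it, backtracking lets a shorter run of digits satisfy the assertion.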
Example #1
We often need to extract the data we are interested in from HTML that is not always of good quality. That would be bearable if fragments like the following were not inserted by JavaScript:
<TD>20.02<BR>05:30
<TD class=l>good 1<BR>good 2
<TD><B>35</B>
<TD><A href="http://reference/" id=sfsd32dfs
onclick="return m(this)">26.92</A><BR><A href="http://reference/"
id=r3_3143svsfd onclick="return m(this)">27.05</A>
<TD><B>270.5</B>
</TR>
The numbers written with a decimal point are prices. The task is to collect all the prices that are placed between <A>...</A> tags.
We can see that, besides the prices enclosed in <A> tags, there are also prices that follow immediately after a <TD> tag and prices that stand between <B>...</B> tags. Describing the attributes of the <A> tag exactly is clearly not easy, so we simplify the task. Every tag ends with the character ‘>’, so we require that this character comes directly before the price. However, the <B> and <TD> tags also end with ‘>’, and we do not need the prices that follow them. How do we tell that a price stands between <A>...</A> tags? By the tag that follows the price (if it is not </B>, it must be either </A> or <BR>) and by the tag before the price: if that tag is <TD>, the price is not one we want.
Reasoning this way, we arrive at what must stand to the left and to the right of the required string, which itself is written as digits separated by a point: \d*\.\d*
The character that must match on the left is ‘>’, so we write (?<=>). It looks a little strange, but a match on the left is written as (?<=), and inside it, after ?<=, comes the character ‘>’.
Next we describe what must not stand before a price: (?<!<TD>). The <TD> tag must not precede a price. This is a negative lookbehind assertion.
With a negative anticipatory assertion we describe what must not stand to the right of a price: (?!<\/B>). The </B> tag must not follow a price.
Finally, the positive anticipatory assertion (?=<\/A>) requires that the closing </A> tag follows the price. The resulting regular expression, which combines all the conditions listed above, looks like this:
preg_match_all("/(?<!<TD>)(?<=>)\d*\.\d*(?!<\/B>)(?=<\/A>)/", $string, $matches);
print_r($matches);
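For convenience, here is the same call as a self-contained script. The variable $string is filled with the HTML fragment shown earlier in this example; everything else is unchanged apart from single quotes around the pattern.

```php
<?php
// The HTML fragment from this example, stored as a nowdoc string.
$string = <<<'EOT'
<TD>20.02<BR>05:30
<TD class=l>good 1<BR>good 2
<TD><B>35</B>
<TD><A href="http://reference/" id=sfsd32dfs
onclick="return m(this)">26.92</A><BR><A href="http://reference/"
id=r3_3143svsfd onclick="return m(this)">27.05</A>
<TD><B>270.5</B>
</TR>
EOT;

// Not preceded by <TD>, preceded by '>', not followed by </B>, followed by </A>.
preg_match_all('/(?<!<TD>)(?<=>)\d*\.\d*(?!<\/B>)(?=<\/A>)/', $string, $matches);

print_r($matches[0]); // 26.92 and 27.05
```

The prices 20.02 and 270.5 are skipped: the first is preceded by <TD>, the second is followed by </B>.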
After working through the first example, a few remarks and explanations about the use of positional assertions are in order.
- Assertions written one after another are applied independently of each other at the same position, without moving it. A match is found only if all of the assertions succeed. In our example the positions were the points just before and just after the price. Logically it makes no difference whether the test for the <TD> tag stands before or after the assertion for the ‘>’ character. From the point of view of optimization, however, the first assertion should be the one most likely to fail.
- The text matched by an anticipatory assertion is not saved. So if the anticipatory assertion that requires </A> after the price succeeds, the </A> tag inside the (?=) construct is not stored in the special variables \1, \2 and so on. This is because a positional assertion matches not a piece of the string but a place within the string: it describes where the match happened rather than which characters matched.
- Note that PCRE does not allow lookbehind assertions that match text of variable length. For example, this assertion is not allowed: /(?<=\d+)/
The matching mechanism for lookbehind assertions needs a string of fixed length, so that after a failed test it can step back a known number of characters and go on to check the other positional assertions. This is hard to grasp at once, so picture how the part (?<!<TD>)(?<=>) of the regular expression above is matched. We take the string being searched and count off from its beginning as many characters as the assertion would match; in our case that is 4: <, T, D, >. From this place the ‘look behind’ happens: the 4 preceding characters are compared with the string <TD>. If the mechanism has not found a match, it steps back 4 characters and does the same for the assertion (?<=>): it counts off one character, ‘looks’ behind and compares the preceding character with ‘>’.
Now imagine that the condition may match a string of variable length, for example (?<!(<TD>)?). Such an entry would mean that the price must not be preceded by at most one copy of the <TD> tag (or by no tag at all). After counting off 4 characters from the beginning, the mechanism compares them with <TD>; but the condition says that the tag may be absent altogether. The question then is how many characters to step back before checking the other assertions: 4 characters, or none at all?
A question arises at once: why move forward only to ‘look behind’ afterwards? This is done so that, if all the assertions succeed, matching of the characters that follow the positional assertions can begin immediately.
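The three remarks above can be checked with a short script. This is only a sketch: the shortened HTML string is made up for the test, and the exact text of the compilation warning (as well as whether lookbehinds of bounded variable length are accepted) depends on the PCRE version bundled with your PHP.

```php
<?php
// Shortened, hypothetical fragment in the style of Example #1.
$html = '<TD>20.02<BR><A href="#">26.92</A>';

// 1. The order of adjacent assertions does not change the result:
//    both lookbehinds are tested at the same position.
preg_match_all('/(?<!<TD>)(?<=>)\d*\.\d*(?=<\/A>)/', $html, $a);
preg_match_all('/(?<=>)(?<!<TD>)\d*\.\d*(?=<\/A>)/', $html, $b);
var_dump($a[0] === $b[0]); // bool(true)

// 2. Text matched inside an assertion is neither part of the match
//    nor captured: only the whole-match entry exists in $a.
print_r($a[0]);       // 26.92
var_dump(count($a));  // int(1)

// 3. A lookbehind of unlimited length does not compile. PHP raises a
//    warning (silenced here with @) and preg_match() returns false.
$result = @preg_match('/(?<=\d+)px/', '100px');
var_dump($result);    // bool(false)
```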
Example #2
I once had to collect all the images used on a site. What does one do? Right: click ‘Save as’ in the browser and choose where to save the page. A file with the page source and a folder with the images appear. But images referenced in the styles of elements will never be saved into that folder, at least not in Explorer:
style="background-image:url(/editor/em/Unlink.gif);"
To carry out this operation you need to:
- Ask the site owner for permission to use the content placed on the site.
- Find all the strings in the text that look like the one above and extract the relative file path from each.
- Generate a file that displays the images by means of <img src=complete_path_to_the_image>
We proceed as follows: load the page source into the variable $content, then use regular expressions to find the relative paths referenced in the styles. Whenever I explain how I built an example, I first describe carefully what I am searching for and in what context. Analysis of the page source showed that relative paths to images are used nowhere except in the style descriptions. To the left of the relative path stands the character sequence url(
To the right of the relative path stands a closing parenthesis.
Between them there may be Latin letters, digits and slashes, and also a point before the file extension.
We begin with the simple part. Latin letters, the point and the slash are described by the character class [a-z.\/]. Note that the class as written does not include digits; add \d to it if the paths contain any. There can be any number of these characters. In fact there are at least three (a file name of at least one character, a point, and an extension of at least one character), but in this context that is not critical, so we use the quantifier *: [a-z.\/]*
On the left must come ‘url(’, which we describe with a lookbehind assertion: (?<=url()
But note that a parenthesis is a grouping metacharacter in regular expressions, so to turn it into a literal character you must put another metacharacter, the backslash, before it: (?<=url\()
To the right of the relative path must stand a closing parenthesis. This condition is described with a positive anticipatory assertion: (?=\))
As you can see, a backslash stands before one of the parentheses; this means it is interpreted not as a metacharacter but as a literal.
The complete PHP code, which performs every step except requesting permission to use the content, follows:
preg_match_all("/(?<=url\()[a-z.\/]*(?=\))/i", $content, $matches);
foreach($matches[0] as $item) {
echo "<img src = http://www.a???s.com".$item.">";
}
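A self-contained variant of the same code, as a sketch. The page fragment in $content and the host in $baseUrl are placeholders assumed for illustration; substitute the real page source and site address. The src value is also quoted and escaped here, which the original snippet does not do.

```php
<?php
// Hypothetical page source; only the style format comes from the example.
$content = '<div style="background-image:url(/editor/em/Unlink.gif);"></div>'
         . '<span style="background-image:url(/editor/em/Bold.gif);"></span>';

// Placeholder host, assumed for illustration only.
$baseUrl = 'http://example.com';

// Letters, points and slashes between "url(" and ")".
preg_match_all('/(?<=url\()[a-z.\/]*(?=\))/i', $content, $matches);

$html = '';
foreach ($matches[0] as $path) {
    // Quote and escape the attribute so the generated markup stays valid.
    $html .= '<img src="' . htmlspecialchars($baseUrl . $path) . '">' . "\n";
}
echo $html;
```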
When and why you should avoid regular expressions
This article was written after discussions on one of the forums. It is meant to warn programmers against believing that regular expressions are all-powerful when working with string data. I will try to show in which cases a different approach solves your task better than regular expressions do.
The problem described on the forum was the following. A page is generated by a template engine and contains elements such as <input type=checkbox>. An authorized user works with this page and ticks some of the checkboxes, giving <input type=checkbox checked>. After the form is sent to the server, the page is generated from the template again, and the programmer arranges for it to be saved in the cache with the checkboxes already ticked.
The programmer’s task is to remember the ticked checkboxes from the submitted form and, on the user’s subsequent visits to this page, to show it with the chosen elements already ticked.
What he does is this: instead of taking the template, fetching the data from the database and marking the checkboxes while the template is processed, he takes the page saved in the cache and parses it with regular expressions. Depending on the data he has fetched from the database for the given user, he either removes ‘checked’ from the page or adds new ‘checked’ attributes in the necessary places in the text of the page.
His argument: this works faster than processing the original template.
Here are my arguments against. Is there any sense in caching if the cached data must afterwards be changed with regular expressions? We need a cache precisely to skip resource-intensive work such as running regular expressions over the whole text several times.
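A rough sketch of the alternative argued for here: build the checkbox markup directly from the stored selection instead of patching cached HTML with regular expressions. The function name renderCheckboxes, the field names and the data layout are all hypothetical; the forum discussion did not include any code.

```php
<?php
/**
 * Hypothetical helper: builds checkbox markup from a list of field names
 * and the names the user has ticked (for example, loaded from the database).
 */
function renderCheckboxes(array $names, array $checked): string
{
    $html = '';
    foreach ($names as $name) {
        // Add the attribute only for names present in the stored selection.
        $attr = in_array($name, $checked, true) ? ' checked' : '';
        $html .= '<input type=checkbox name="' . htmlspecialchars($name) . '"' . $attr . ">\n";
    }
    return $html;
}

// Assumed sample data: three fields, one of them ticked.
echo renderCheckboxes(['news', 'offers', 'events'], ['offers']);
```

Each element is written once, with its final state, so no second pass over the finished page is needed.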
Let us imagine a text of n characters and a regular expression with the following properties:
- It has three alternatives: part1|part2|part3
- Each of the parts consists of character classes and fixed literal sequences.
- The maximum length of a match is m characters.
- The programmer has used the preg_replace function; we know how the regular expression mechanism works, so we can retrace its steps.
I will show these steps together with the arguments they raise against the approach the programmer intended to use:
- We take the character at index 1 and try to apply the regular expression to the text that begins with it. We advance character by character from this starting point, comparing each one with the pattern for as long as it fits.
- Following each alternative in the regular expression, we may try up to three alternatives and advance up to m characters for EACH starting character i in the text.
- It is easy to estimate how many characters are examined in vain if no match is found: in the worst case roughly 3*m for each of the n starting positions, that is, on the order of 3*m*n comparisons.
- The cached document is already the result of regular expression work (many template engines are built on regular expressions). So the regular expression mechanism is invoked an extra time to parse a half-finished document instead of producing the finished one in a single pass.
- If we compare the regular expressions that underlie the template engine with the one our programmer is going to write, we see at once that the former are simpler, and for that reason they find matches faster.
- The regular expressions in a template engine often process small pieces of text (it is no secret that pages may be assembled from small pieces), not the whole generated page at once.
I hope I have managed to explain why regular expressions, powerful as they are for working with strings, are not suited to solving broad tasks like this one. Learn to set tasks for yourself and to solve them by different methods, so that you can choose the better solution. Regular expressions are only one of many tools, not a panacea.
The case described above is one where a person knew how to use both regular expressions and templates, so he had a choice, but could not make the right one. There is also a category of beginner programmers who see a mass of symbols with which many string problems are solved, and have no idea that their task of replacing a word in a string is handled by the simple function str_replace(). Without understanding how regular expressions work, they come to a forum and ask how to replace abcd with asdf by means of preg_replace, and feel very hurt when they are pointed to str_replace in the manual at http://www.php.net/. This seems to be the most widespread misuse of regular expressions: using them where ordinary string functions would do.
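The abcd to asdf replacement mentioned above, written both ways as a small sketch. The sample string is an assumption; only the two function names come from the text.

```php
<?php
// Hypothetical input string.
$text = 'xx abcd yy abcd';

// A literal replacement needs no regular expression at all.
$plain = str_replace('abcd', 'asdf', $text);

// The same result with preg_replace: more machinery for no gain.
$regex = preg_replace('/abcd/', 'asdf', $text);

var_dump($plain === $regex); // bool(true)
echo $plain, "\n";           // xx asdf yy asdf
```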



