Regular expressions, Part I
Chapter 1. The basics of regular expressions.
Introduction.
Every web programmer has at some point needed to search data according to some rule, verify data that came from a user, or modify found data in a complicated way. You may either invent something of your own or use the same resources as programmers all over the world. Sometimes professionals seem to use methods and tools that are available only to them. I’ll disappoint the reader: professionals use the same methods and instruments as you do; the difference is that they know how to use them and how to choose the right tool for the case at hand.
This article is meant to help programmers solve their daily problems by means of regular expressions. I’ll try to describe the basics of using this tool so that you no longer treat a combination like /^(?:http:\/\/)?[-0-9a-z._]*.\w{2,4}[:0-9]*$/ as something unknown.
The common purpose of the regular expressions mechanism is to find, or fail to find, a coincidence between a line (or part of it) and a sample. Let’s analyze that sentence in order to find the confusing and scary words:
‘The regular expressions mechanism’ and ‘sample’: these two terms made me feel depressed when I understood that I couldn’t do without regular expressions. We have some mechanism which searches and either finds or doesn’t find, and ideas such as ‘line’ and ‘sample’ are closely connected with this ‘something’. We’ll start and finish with them, because once we have realized how this mechanism acts on lines and samples we won’t have to dig through maths manuals for the meaning of the words ‘regular expressions’.
Part 1.
Where have we seen a sample? Let’s go to a secretary and ask her. The right answer is: in the Microsoft Word samples (templates)! What is the difference between the samples ‘Calendar’ and ‘Sophisticated resume’? The differences are the data themselves and the way they are presented. Even if you’ve seen both of them only once, you will recognize them easily. Then why are programmers so afraid of regular expressions? It’s nearly the same! What goes through the mind of a person who sees a calendar and knows what it is; how does he recognize it? A calendar is a document divided into blocks; each block consists of numbers which correspond to the days of a month. Each month corresponds to exactly one block; there are no more than 31 days in a month, and no more than 28 days in February (except in a leap year); days which fall on Sundays or state holidays are marked in red. We could go on systemizing the data by indicating which months have exactly 30 and which exactly 31 days. What have we done? We’ve created a description of a calendar or, in other words, we have described data such that, on finding it in an arbitrary text, we can definitely say we are dealing with a calendar. When speaking about regular expressions I call such a description a sample; most documentation calls it a pattern.
The day will come when the programmer tells the computer which program he wants and the computer creates it, but until then the programmer has to do the work himself. That means that if I say I’m searching for a text fragment which complies with a calendar description, the computer won’t understand me yet. So our purpose today is to learn how to describe the data we want to find and to put this description in a form the computer understands. You can’t do it? You should be ashamed! Every secretary can. I hope there’s no need to explain the command line to a programmer. Start=>Run=>cmd. Type ‘dir’ at the command line.
That’s what I got:
C:\Documents and Settings\Administrator>dir
Volume in drive C has no label.
Volume Serial Number is 3CC6-6445
Directory of C:\Documents and Settings\Administrator
13.10.2003 18:03 <DIR> .
13.10.2003 18:03 <DIR> ..
18.07.2003 21:55 <DIR> .java
18.07.2003 21:54 <DIR> .javaws
18.07.2003 21:55 <DIR> .jpi_cache
15.10.2003 16:33 694 .plugin141.trace
05.10.2003 11:40 <DIR> Desktop
16.10.2003 13:08 <DIR> Favorites
08.10.2003 16:42 <DIR> My Documents
18.08.2003 20:51 <DIR> Start Menu
04.07.2003 21:24 <DIR> WINDOWS
1 File(s) 694 bytes
10 Dir(s) 2 162 040 832 bytes free
Is it familiar to you? Sure! So you can already use regular expressions; all that is left is to improve your skills. What will you do if you’ve got a lot of files and directories and need to look at only the few that interest you at the moment? You’ll try to reduce the amount of output by stating the conditions of the search, that is, by describing the data you’d like to get. Note that you need to describe the data, and so you are creating a sample. Suppose we are interested in all files and directories whose names contain the word ‘java’. I’m sure your thoughts are the same as mine, so you’ll get this result:
C:\Documents and Settings\Administrator>dir *java*
Volume in drive C has no label.
Volume Serial Number is 3CC6-6445
Directory of C:\Documents and Settings\Administrator
18.07.2003 21:55 <DIR> .java
18.07.2003 21:54 <DIR> .javaws
0 File(s) 0 bytes
2 Dir(s) 2 161 618 944 bytes free
Let’s try to analyze the line dir *java*: it means find and show all files and directories in the current directory whose names contain the word ‘java’.
It seems to be all right, but it isn’t! Understanding regular expressions means, first of all, describing the coincidence rule correctly, and only then knowing the tools by means of which you can ‘explain’ this description to the computer. Most articles on the net are busy with the second problem and ignore the first one completely; they tell the programmer only about the tools. Why did I say the description is wrong? Because dir java* and dir *java* differ by one symbol, yet the description ‘contains the word java’ fits both of them.
Make another effort to describe the line dir *java*:
We are to find and show all files and directories in the current directory (so far everything’s all right) whose names may begin with any symbols, in any number (or none at all), followed by the symbols "j", "a", "v", "a" in a row. These symbols may in turn be followed by any other symbols, in any number (again, possibly none at all).
Is it different? Sure! Let’s continue learning how to describe the data we want to find. We keep searching with the dir command, working from the command line, but we complicate the task. I’ll give you a description of the data I want to find and you’ll try to write the command; all the tools are already familiar to you.
Find and show all files and directories in the current directory whose names begin with any symbols, in any number (possibly none at all), which must be followed by the symbol "j", after which any number of any symbols follows again, with the symbols "w" and "s" standing in a row at the end of the name.
If you’ve read attentively, it won’t take you much effort to write dir *j*ws, and you’ll get the answer:
C:\Documents and Settings\Administrator>dir *j*ws
Volume in drive C has no label.
Volume Serial Number is 3CC6-6445
Directory of C:\Documents and Settings\Administrator
18.07.2003 21:54 <DIR> .javaws
0 File(s) 0 bytes
1 Dir(s) 2 161 504 256 bytes free
Is it simple? Of course! Now imagine that I want to find not any number but a definite number of arbitrary symbols. Use the symbol "?" (question mark) for this. Here’s the task:
Find and show all files and directories in the current directory whose names begin with any symbols, in any number, which must be followed by the symbol "j", then by exactly three arbitrary symbols, and then by the symbols "w" and "s" in a row. The solution is dir *j???ws
C:\Documents and Settings\Administrator>dir *j???ws
Volume in drive C has no label.
Volume Serial Number is 3CC6-6445
Directory of C:\Documents and Settings\Administrator
18.07.2003 21:54 <DIR> .javaws
0 File(s) 0 bytes
1 Dir(s) 2 161 504 256 bytes free
The first part is over. Brief summary:
In our case the data for the search are passed to the dir command in a uniform form which I called a data description, or sample; you control the output of dir by changing the sample; dir outputs data only if they comply with the given sample; a sample includes ordinary letters as well as special symbols.
Part 2.
A sample includes ordinary letters as well as special symbols. That was the last line of the previous part, and it is the first line of this one. You haven’t noticed it yourself, but you think abstractly when you describe the data you want to find. At first you are nudged towards it, but soon you’ll learn to do it yourself and won’t need my help any more. Ordinary symbols and special symbols turn out to have names. Every ordinary symbol is called a literal; every special symbol is called a metacharacter. We won’t pay attention to the literals; they don’t interest us at the moment. We’d better examine the metacharacters. We know only two of them so far, but that’s enough.
Look at what the symbols "*" and "?" mean. The first one indicates any number of any symbols; the second one indicates exactly one symbol. If everything about the number is clear, then what does ‘any symbol’ mean? You ask, we answer! In the context of our work with the dir command, ‘any symbol’ means any of the literals. So the asterisk signifies ‘any number of literals’ and ‘?’ signifies exactly one literal. Understood? It seems so, but the definition isn’t exact enough. Why? Because when I described the data for the dir search I wrote it like this: any symbols, in any number (or none at all).
The asterisk "*" signifies any number of literals, including their absence! What does that mean? It means there can be any number of literals, but not a single one of them is required for the coincidence! You are to understand and remember it.
The question mark "?" means exactly one literal! Only one literal can stand in the given position, no more and no less!
Let’s look at the idea of a literal in more detail, from a slightly different side. Which of these two symbols is a literal and which is a metacharacter? "j" "*"
It’s elementary, "j" is a literal and "*" is a metacharacter.
You’ve got quite a logical question: what is the difference?
There is a logical answer to this logical question. The symbols "*" and "?" are known to have some magic effect; at least the dir command interprets them differently from the symbols "j", "a", "v", "a". So we have divided all symbols into two classes: one class is formed by the literals, the other by the metacharacters. We know for sure that the class of literals includes all Latin letters a, b, c, d and so on up to z; it also contains the digits 0, 1, 2 and so on up to 9. In this way we’ve divided the class of literals into subclasses. Don’t you think we know too much to keep dealing with the command line? It’s time to put your knowledge to the test.
Brief summary of the second part:
- ordinary symbols are called ‘literals’;
- special symbols are called ‘metacharacters’;
- literals signify themselves;
- metacharacters describe ranges of literals, rules applicable to literals, and the literals’ properties and quantity;
- literals may be classified by grouping them on some basis.
Part 3.
We’ve finally finished with the command line and come to the purpose of this article: the use of regular expressions in web programming. Let’s stop and check that you’ve properly understood the previous two parts. If not, read them once more, carry out all the examples, experiment! I don’t recommend reading the third part without understanding the previous ones; it’s just a waste of time!
From the whole PHP language we need only one function, preg_match().
Its general format is as follows:
preg_match(search_pattern, line_to_search, array_for_search_results)
This function calls the regular expression processing mechanism, searches a line for coincidences, and returns the coincidences in an array. What do ‘calling the mechanism’, ‘searching’ and ‘returning’ mean? We are used to slightly different notions. Let’s return to the dir command. It contains some mechanism which reads your sample and searches the files in the current directory according to it, doesn’t it? To keep things simple we’ll call this mechanism, too, the regular expression processing mechanism. Imagine that the names of the files and directories are written in one line, the search pattern is written in another, and the results are shown not on the black screen but in an array; and that to search for information about files and directories we use the PHP function preg_match() instead of the dir command. Everything falls into place. The function preg_match() passes the search pattern and the line in which we’re going to search to the coincidence search mechanism (the regular expression processing mechanism) and places the search results in the array. It’s the dir command itself! It’s clear now why I said at first that you didn’t need to learn how to use regular expressions; you already knew how, you simply didn’t realize it. Now you do. This could be the end if the search capabilities of PHP were no different from those of the dir command. But they are, so we have to keep learning and read chapter 2.
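To make the comparison with dir concrete, here is a minimal sketch. It assumes the names from the listing above are simply typed into a PHP array (the variable names are made up for this illustration) and uses the plainest possible sample, the literals ‘java’:

```php
<?php
// Names copied by hand from the dir listing earlier in the article.
$names = ['.java', '.javaws', '.jpi_cache', '.plugin141.trace', 'Desktop',
          'Favorites', 'My Documents', 'Start Menu', 'WINDOWS'];

$found = [];
foreach ($names as $name) {
    // preg_match() returns 1 when the sample is found in the string and 0 when it is not;
    // the third argument receives the text that coincided with the sample.
    if (preg_match('/java/', $name, $matches) === 1) {
        $found[] = $name;
    }
}

print_r($found); // the names that contain "java"
```

The loop plays the role of dir walking through the directory, and $found plays the role of the black screen.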
Chapter 2. The basis.
Another introduction
Regular expressions are a language of samples (in fact it’s a mathematical term; if you are interested, read about deterministic and nondeterministic finite automata). To perform some action you have to indicate exactly which action you intend; usually the action is indicated by a function.
In PHP it looks like this:
- preg_replace – to replace
- preg_match – to match
- preg_split – to split
It’s the same with the POSIX-standard functions of the ereg family; the only difference is that, unlike the preg functions, they use a different coincidence search mechanism.
A sample is processed symbol by symbol. So how would you search for the letter ‘d’ in the word ‘stadium’? Right, the first solution is to go through all the letters of the word ‘stadium’ one by one and compare each with the letter we’re searching for: a simple exhaustive search. So you’ve learnt how regular expressions work, and all that is left is to learn how to set a sample of what you want to find in a line.
A sample is a specific indication of what we are to find in a line.
You can search for digits, letters, and invisible symbols (like space or tab).
How would you search for the letter ‘d’ or the letter ‘m’ in the word ‘stadium’, stopping at the first coincidence with either symbol? Right, again by exhaustive search; you take each letter of the word and compare it with what you are looking for, with ‘d’ and ‘m’ in turn. But now each letter of the word (the line) has to be compared with two letters from the search condition. In this way you’ve created your first symbol class, which you can write down in the regular expressions language like this: [dm]. It means you’re searching either for ‘d’ or for ‘m’.
What do you need to write to indicate that you’re searching the line for any letter of the alphabet? Either enumerate them all, [abc....xyz], or simply write an interval, [a-z].
Attention! This works for small letters [a-z] and for digits [0-9], but we also have capital letters [A-Z], so to get a symbol class with all the letters of the Latin alphabet you are to write [a-zA-Z] in the sample.
But such a symbol class describes only one symbol, and we’ve got a lot of them; we can manage that by means of quantifiers.
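The symbol classes just described can be tried out directly. This is only a small sketch using preg_match(); the variable names are invented for the illustration:

```php
<?php
// [dm] coincides with one symbol that is either "d" or "m"; the search stops at the first hit.
preg_match('/[dm]/', 'stadium', $first);
echo $first[0], "\n"; // "d", because it stands before "m" in the word

// [a-z] covers small letters only, so a capital letter does not coincide with it.
$smallOnly = preg_match('/[a-z]/', 'Q');    // 0
$anyLetter = preg_match('/[a-zA-Z]/', 'Q'); // 1
```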
Quantifiers
So you can describe the search condition by means of symbol classes.
One symbol class can coincide with one symbol only! You are to understand it!
How can we set a search for two symbols in the condition?
I’ll give a simple example without symbol classes, of the kind we have already treated:
We’re searching for the sequence ‘iu’ in the word ‘stadium’. How will you do this? By exhaustive search, of course:
1. We take the first symbol of the word and compare it with the first symbol of the condition: ‘s’ is not equal to ‘i’.
2. We take the second symbol of the word and compare it with the first symbol of the condition: ‘t’ is not equal to ‘i’.
3. We keep doing this until some symbol of the word coincides with the first symbol of the search condition. At that moment ‘i’ is equal to ‘i’ (we are at a definite position in the word ‘stadium’ and at a definite position in the search condition ‘iu’). As soon as this happens, take the next symbol from the word and from the condition (but first memorize where the first coincidence happened; this step is very important!).
4. We compare the next symbol of the word with the second symbol of the condition: ‘u’ is equal to ‘u’. My congratulations, the sought combination is found!
5. What should you do if step 4 fails, if you’ve got the misspelled word ‘stadiem’ instead of ‘stadium’?
You return to step 3, remembering what you have already done. You memorized where you found the first coincidence. The following symbol of the word doesn’t coincide with the following symbol of the condition, so you take the symbol of the word right after the memorized position (the ‘e’ in our misspelled ‘stadiem’) and compare it with the first symbol of the condition again! Then you go on with step 3 up to the end of the word. It’s clear that in ‘stadiem’ you won’t find any coincidence.
We’ve already learnt to search for two definite symbols; now we are to learn how to use symbol classes.
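The steps above can be sketched in plain PHP without any regular expressions at all. The function name naiveFind is made up for this illustration, and the code only mirrors the description in this article; it does not claim to show how the real mechanism inside preg_match() is implemented:

```php
<?php
// Hypothetical helper: returns the position where $needle begins in $word, or -1.
function naiveFind(string $word, string $needle): int
{
    $wordLen = strlen($word);
    $needleLen = strlen($needle);
    // $start is the memorized position where a coincidence might begin.
    for ($start = 0; $start + $needleLen <= $wordLen; $start++) {
        $matched = 0;
        // Compare symbol by symbol while the word and the condition agree.
        while ($matched < $needleLen && $word[$start + $matched] === $needle[$matched]) {
            $matched++;
        }
        if ($matched === $needleLen) {
            return $start; // every symbol of the condition coincided
        }
        // Otherwise try again from the symbol after the memorized position.
    }
    return -1;
}

var_dump(naiveFind('stadium', 'iu')); // position of "i" counted from 0
var_dump(naiveFind('stadiem', 'iu')); // -1: "i" is found, but "e" is not "u"
```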
For example, we have lines:
abcd12345efg
fghi56789qwe
Condition: find in these lines the parts which consist of any four Latin letters followed by any five digits.
I’ve told you before how to describe a symbol which coincides with any Latin letter: remember, we use the symbol class [a-z] for this (we leave capital letters aside for now); a symbol which coincides with any digit is described by the symbol class [0-9].
Let’s return to the initial task. We are to find lines with four letters at the beginning which are immediately followed by five digits.
From what we already know we can write down: [a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9]
Check it yourself: this form of the search condition works, because each symbol class describes one symbol of the search condition. Don’t you think this search condition is too bulky? Here quantifiers can help. A quantifier is something that expresses the quantity of something else; in our case, the number of symbols in the search condition. We simplify the search condition by means of quantifiers: [a-z]{4}[0-9]{5}.
That’s all! In case you haven’t guessed, the braces contain the number of symbols described by the symbol class which must follow in a row in the line we are searching.
The result: any four Latin letters followed by any five digits. The search proceeds in the same way as I’ve described before, except that the symbols in the search condition are given by symbol classes instead of explicitly.
Each symbol class describes only one symbol; the number of such symbols following in a row is described by quantifiers.
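If you’d rather not take my word for it, a short check like the following should show that the bulky form and the form with quantifiers find the same part of each line. It is just a sketch; the variable names are invented:

```php
<?php
$lines = ['abcd12345efg', 'fghi56789qwe'];
$long  = '/[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9]/';
$short = '/[a-z]{4}[0-9]{5}/';

$results = [];
foreach ($lines as $line) {
    preg_match($long, $line, $a);  // $a[0] holds the part found by the bulky form
    preg_match($short, $line, $b); // $b[0] holds the part found by the short form
    $results[$line] = [$a[0], $b[0]];
}

print_r($results); // both forms give the same part for each line
```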
Of course, this is not the only way to set the number of symbols in a search condition. Quantifiers can be different! [a-z]{1,3} means that from 1 to 3 Latin letters follow in a row. [a-z]{2,} means that at least two Latin letters follow in a row.
But a quantifier in braces isn’t the only way to set the number of symbols described by a symbol class that follow in a row. [a-z]* signifies that any number of Latin letters can follow in a row, possibly none at all; it is identical to [a-z]{0,}. [a-z]+ means that at least one Latin letter must follow, with no maximum; it is identical to [a-z]{1,}. [a-z]? means that there is at most one Latin letter, and it may be absent altogether; it is identical to [a-z]{0,1}.
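Here is a small sketch of how these quantifiers behave in practice. The helper firstMatch is hypothetical, written only for this illustration; it returns the text that coincided, or null when nothing did:

```php
<?php
// Hypothetical helper: the matched text, or null when there is no coincidence.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/[a-z]{1,3}/', 'abcdef')); // no more than three letters are taken
var_dump(firstMatch('/[a-z]{2,}/', 'a1'));      // fewer than two letters in a row: no coincidence
var_dump(firstMatch('/[a-z]*/', '123'));        // zero letters is still a coincidence (empty text)
var_dump(firstMatch('/[a-z]+/', '123'));        // at least one letter is required: no coincidence
var_dump(firstMatch('/[a-z]?/', 'abc'));        // one letter at most
```

Note the third call: because ‘*’ allows zero symbols, the search succeeds even on a line without a single letter.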
Applying quantifiers to the literals
Let’s return to the very beginning of the chapter, where we searched for the letter ‘d’ in the word ‘stadium’ by exhaustive search. A search condition which is described by one non-special symbol is called a literal.
There are lines:
abcdefg
abcddefg
abcdddefg
abcddddefg
Task: write a search condition that coincides with all the lines.
Answer: abcd{1,4}efg
You’ve just seen me apply a quantifier to a literal. You may apply any of the foregoing quantifiers to literals.
It’s clear that applying the search condition abcd{1,4}efg to the line abcefg won’t find a coincidence, as the quantifier {1,4} implies that between ‘abc’ and ‘efg’ there is at least one and at most four letters ‘d’.
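The answer can be checked against all four lines, plus the line without any ‘d’, with a sketch like this (variable names are invented):

```php
<?php
$pattern = '/abcd{1,4}efg/';
$lines = ['abcdefg', 'abcddefg', 'abcdddefg', 'abcddddefg', 'abcefg'];

$results = [];
foreach ($lines as $line) {
    // 1 means the line coincides with the condition, 0 means it does not.
    $results[$line] = preg_match($pattern, $line);
}

print_r($results); // only the last line, which has no "d", gives 0
```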
Symbol classes for ‘advanced’
You’ve already seen how much symbol classes simplify the description of a coincidence condition. Let’s examine them thoroughly.
What can a symbol class include? It can include any literals as well as intervals of literals. To describe an interval, the symbol ‘-’ is placed between the first and the last symbols of the interval. Here are some examples of setting different intervals in one symbol class: [1-5] means digits in the range from 1 to 5; [a-f] means Latin letters from ‘a’ to ‘f’; [a-fq-x] means Latin letters from ‘a’ to ‘f’ and from ‘q’ to ‘x’. As you’ve noticed, I use two ranges in the last symbol class.
Imagine you need to describe the condition that one of the symbols ‘a’, ‘g’, ‘7’ or ‘4’ can stand at a definite place in a line. What should you do? You should write the symbol class [ag47].
The explanation is simple: the literals which are allowed in the search condition may be enumerated in the symbol class. You can combine enumeration of literals with intervals: [14a-kz] means that the symbol in the line may coincide with 1, 4, the Latin letters from ‘a’ to ‘k’, and also the letter ‘z’. Naturally, literals can be not only letters and digits but punctuation marks and mathematical signs as well, for example ‘,’ (comma), ‘!’ (exclamation mark), ‘+’ (plus), ‘-’ (minus). You’ll say the minus is used to describe intervals. Right: if you put a minus between ‘a’ and ‘z’ it makes an interval, but if you put a minus immediately after the opening square bracket it is just a minus! Here’s an example: [-,a-z] means that the symbol class includes minus, comma and the Latin letters from ‘a’ to ‘z’.
And how do you write a symbol class which includes all symbols except the listed ones? For example, all symbols except ‘a’, ‘b’, ‘c’? There is a special negation symbol for this: ^ (caret). We write [^abc], which means all symbols (not only letters) except the Latin letters ‘a’, ‘b’, ‘c’.
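A few quick checks of these symbol classes, as a sketch with made-up variable names:

```php
<?php
$hasSeven = preg_match('/[ag47]/', '7');   // 1: "7" is enumerated in the class
$inRange  = preg_match('/[14a-kz]/', 'm'); // 0: "m" lies outside a-k and is not listed
$hasMinus = preg_match('/[-,a-z]/', '-');  // 1: a minus right after "[" is a literal minus

// [^abc] coincides with the first symbol that is NOT a, b or c.
preg_match('/[^abc]/', 'abcd', $m);
echo $m[0], "\n"; // "d"
```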
Memorize all!
Many of you will ask: what should you do if you need to check the whole line against the condition but return only a part of this line by means of language functions?
How can language functions operate not on the whole line which complies with the condition but only on a part of it?
Round brackets, which both group and save, are meant for such purposes.
Example:
A line consists of 5 Latin letters followed by four digits from 1 to 8.
Task:
1. Check the lines against the condition
2. Return the four digits to the program; at this stage that only means the four digits have to be marked out somehow in the search condition
If we just want to check the line against the search condition, the condition will be as follows: [a-z]{5}[1-8]{4}. But we are to memorize the digits, so we add the memorization instruction: [a-z]{5}([1-8]{4})
As you can see, the things to memorize are marked out with round brackets. Within the function which operates on the line by means of this condition, the coincidence is memorized in special variables; it can be referred to as \1 in PHP and as $1 in Perl. There may be several memorization instructions in one search condition: ([a-z]{5})([1-8]{4}) checks the line against the condition and, on a successful coincidence, memorizes the five letters in \1 ($1) and the four digits in \2 ($2). If we refer to \0, it turns out to hold the whole part of the line which coincided with the condition.
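In PHP the memorized parts show up in two places, and the sketch below tries both. With preg_match() they land in the result array under the indexes 0, 1, 2; the \1 and \2 notation is what you write in the replacement string of preg_replace(). The test line and the variable names are only an illustration:

```php
<?php
$pattern = '/([a-z]{5})([1-8]{4})/';

preg_match($pattern, 'asbvc1234', $m);
// $m[0] is the whole coincidence, $m[1] the five letters, $m[2] the four digits.
print_r($m);

// In a replacement string \1 and \2 refer to the memorized parts.
$swapped = preg_replace($pattern, '\2-\1', 'asbvc1234');
echo $swapped, "\n"; // digits first, then a minus, then the letters
```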
Closer to reality
Those who had enough desire to read up to this place have already understood a lot and are trying to write their own search conditions, but still don’t manage to. Obviously this happens because you don’t know many of the peculiarities of working with regular expressions. I’ll tell you about some of them right now. A line has a head and an end. You’ll say that’s clear to everyone, but you are to change your idea of lines a bit. An invisible symbol is considered to stand at the head and at the end of each line. In a real program you take this into account by putting the symbol ‘^’ (head of a line) at the beginning of the search condition and the symbol ‘$’ (end of a line) at its end. So we are to rewrite the example with five letters and four digits (and, likewise, the examples that preceded it): ^[a-z]{5}[1-8]{4}$. Now by means of this condition we can actually check the lines
asbvc1234
sdwtsv1234
against correspondence with the condition.
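A sketch of that check, with the condition without ‘^’ and ‘$’ added for comparison (variable names are invented):

```php
<?php
$anchored = '/^[a-z]{5}[1-8]{4}$/';

$ok  = preg_match($anchored, 'asbvc1234');  // 1: exactly five letters, then four digits
$bad = preg_match($anchored, 'sdwtsv1234'); // 0: six letters before the digits

// Without the head and end symbols the same line passes,
// because the part "dwtsv1234" inside it fits the condition.
$loose = preg_match('/[a-z]{5}[1-8]{4}/', 'sdwtsv1234'); // 1
```

The last call shows why the head and end symbols matter when you want to check a whole line rather than find a part of it.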
Don’t confuse this ‘^’ with the ‘^’ (caret) in a symbol class. A caret at the beginning of the search condition corresponds to the head of a line; a caret right after the square bracket that opens a symbol class is the special negation symbol (everything except what is listed in the symbol class); a caret in any other position within a symbol class is an ordinary literal.
Your search condition may fail because you haven’t accounted for the number and kind of whitespace symbols in the line. Spaces are described either by a literal (you simply put a space in the search condition) or by means of a combination of visible symbols: \s
\s corresponds to the literal ‘space’, the literal ‘line feed’, or the literal ‘tab’.
In programs, delimiters are used to indicate where the search condition (regular expression) begins and ends.
Example:
preg_match("/^[a-z0-9]/", $string, $matches);
Look at what is inside the quotation marks of the first function argument: first comes a slash, then the symbol of the line’s head, then a symbol class, then another slash. It is the first and the last slashes that show a regular expression is enclosed between them. This comes from Perl, where you can put modifiers (more about them later) immediately after the closing delimiter. At this stage you need to know only that a search condition must be written between delimiters (two identical symbols, not necessarily slashes; many use the tilde).
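One practical reason to pick another delimiter, shown as a sketch: when the search condition itself contains slashes, each of them has to be preceded by a backslash if the slash is also the delimiter, while with tildes they can be written as they are. The address below is only a placeholder for the illustration:

```php
<?php
$url = 'http://example.com'; // placeholder address, assumed for the example

// Slash as the delimiter: every slash inside the condition needs a backslash.
$withSlashes = preg_match('/^http:\/\//', $url);

// Tilde as the delimiter: the slashes are ordinary literals.
$withTildes = preg_match('~^http://~', $url);

var_dump($withSlashes, $withTildes); // the two conditions are equivalent
```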
Examples to the chapter
All the examples come from questions asked on various forums. I’ll examine them in the language I write in myself: PHP.
1. Very often we need to write a registration form for a resource we support. For various reasons, requirements are placed on the input. For example, people often ask how to check that the user has entered as a nick (nickname, login) a word which consists only of Latin letters and digits.
I’ll show the check right inside the code to make it clear what you should write and where:
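<?php
// Read the submitted name; fall back to an empty string if the field is missing.
$user = $_POST["username"] ?? "";
// The name is valid only if it consists entirely of Latin letters and digits.
if (!preg_match("/^[a-zA-Z0-9]+$/", $user)) {
    echo "The user name has an invalid format";
} else {
    echo "The user name has a valid format";
}
?>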
And now let’s examine the regular expression itself.
As the user’s registration name should consist of Latin letters and digits, you are to write a symbol class which meets this condition: [a-zA-Z0-9]. This symbol class includes three intervals: the first interval ‘a-z’ (all symbols from small ‘a’ to small ‘z’), the second interval ‘A-Z’ (the same but with capital letters), and the third interval ‘0-9’ (digits from 0 to 9). We’ve described only one of the symbols of which a registration name can consist, but there may be… how many such symbols may there be? You’ll say any number, and I’ll say you are wrong.
A registration name should consist of at least one symbol! I think this condition has to be compulsory at registration, so we are to describe this fact. Remember the quantifiers: [a-zA-Z0-9]+
The plus ‘+’ is exactly the quantifier which says that the string variable must contain at least one symbol complying with the condition.
Now we are to tell the regular expression that the whole line, from beginning to end, has to fit the condition, so we add the line-beginning symbol ‘^’ at the beginning of the regular expression and the end-of-line symbol ‘$’ at its end: ^[a-zA-Z0-9]+$
Now you are to explain to the preg_match function that the line ^[a-zA-Z0-9]+$ is a regular expression; you need to put delimiters, and I use the slash '/':
preg_match("/^[a-zA-Z0-9]+$/",$user)
2. There are people who complicate their own lives. These people are called programmers and system administrators. The thing is that they complicate not only their own lives but each other’s as well. And then the boss sees that registration on the company site works and the user name is verified somehow. In 80% of cases it is the boss who complicates the task: you are either to widen the set of symbols, which is done by simply adding symbols to the symbol class, or to toughen the conditions. For example, the condition is set that the user’s name may consist of Latin letters and digits as before, but the first symbol of the user’s name must be a Latin letter! We extend the regular expression accordingly: ^[a-zA-Z][a-zA-Z0-9]*$
As you know, a symbol class describes only one symbol. Our task is to describe one symbol (the first one): [a-zA-Z]. Here two intervals describe the first symbol and make it clear that no digit is possible as the first symbol of the tested line.
From the previous example it is clear that the user’s registration name should consist of at least one symbol. We’ve already described the first symbol of the line, and it may be the only one; it can be followed by any number of symbols or by none at all, but the regular expression coincides only when the following symbols comply with the condition, that is, when they are either Latin letters or digits: [a-zA-Z0-9]. The first, compulsory symbol is already described and the rest aren’t compulsory, so we change the quantifier from ‘+’ to '*'. Finally, put the symbols of the beginning and the end of the line.
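Wrapped into a function, the toughened check might look like the sketch below. The function name isValidUserName and the test names are invented for the illustration:

```php
<?php
// Hypothetical helper: true when the name starts with a Latin letter
// and continues with Latin letters or digits only.
function isValidUserName(string $name): bool
{
    return preg_match('/^[a-zA-Z][a-zA-Z0-9]*$/', $name) === 1;
}

var_dump(isValidUserName('a'));         // one letter is enough
var_dump(isValidUserName('user42'));    // a letter first, digits afterwards
var_dump(isValidUserName('4user'));     // starts with a digit
var_dump(isValidUserName(''));          // the first symbol is compulsory
var_dump(isValidUserName('user name')); // a space is not in the symbol class
```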
Chapter 3.
Having studied the basics, it’s worth moving on to the practical use of regular expressions. But if you look at most regular expressions with the knowledge you have now, they still seem to be a jumble of symbols, though you can recognize some of them. In this part I’m going to free you from that feeling, building on what was written in the previous chapter.
Wildcard characters
How do you show the invisible? For example, how do you indicate the presence of a space in the search condition? Many of you will guess that to show something invisible we have to make it visible first; that is, we introduce some symbol or set of symbols which will be interpreted as the invisible ones. With this knowledge we start our study of wildcard characters in regular expressions.
- \s – if you put a backslash in the search condition and immediately after it the letter ‘s’, you’ll describe a space or a tab symbol. Of course you may put a space into the search condition the same way you usually type it, but the entry [a-z\s] is much easier to read and understand than [a-z ]. At first sight it’s clear that the first symbol class includes a space, whereas you have to examine the second one closely; and since for many people regular expressions are just a jumble of signs, a space written that way is easily overlooked. Use this wildcard character with care, because besides the space and the tab it also coincides with the new line symbol.
- \S – simply put, these are mostly the visible symbols, that is, everything that doesn’t coincide with \s
- \w – wildcard character which is meant to substitute the whole symbol class; it includes all the symbols that may form a word (usually these are [a-zA-Z_] although a lot depends on the set local, Unicode supporting and so on)
- \W – all that isn’t included in definition \w or [^a-zA-Z_]
- \d – all numbers or the symbol class [0-9] which is already familiar to us
- \D – everything that isn’t a number
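The classes listed above can be tried out with a short sketch. The sample line is hypothetical, and preg_split, preg_match_all and preg_replace are used here only as convenient ways to show what each class picks up.

```php
<?php
// Hypothetical sample line; "\t" in double quotes is a real tab symbol.
$text = "id_7\tJohn Smith";

// \s+ splits on any run of invisible symbols (space, tab, new line).
$parts = preg_split('/\s+/', $text);
print_r($parts);             // id_7, John, Smith

// \d+ collects runs of digits, \w+ collects runs of word symbols.
preg_match_all('/\d+/', $text, $digits);
preg_match_all('/\w+/', $text, $words);
print_r($digits[0]);         // 7
print_r($words[0]);          // id_7, John, Smith

// \D is the negation of \d: delete everything that is not a digit.
$onlyDigits = preg_replace('/\D/', '', $text);
echo $onlyDigits, "\n";      // 7
```

Note that \w+ keeps ‘id_7’ in one piece, because the underscore and the digit both belong to the word symbols.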
As you can see, wildcard characters describe frequently used sets of symbols which programmers need every day, but these sets are limited: they include either letters, digits and the underscore, or invisible symbols. But how can we describe any symbol at all? In a very simple way: put a dot. (By default the dot matches any symbol except the new line symbol.)
You’ll ask what to do if you need to describe the dot itself rather than any symbol. Put a backslash before the dot: \.
And now you are eager to know what to do if you are searching the text for a backslash which is followed by a dot. It’s simple. As you can see, almost every wildcard character contains a backslash, and the backslash is also what turns a wildcard character into a literal. This means the backslash is a wildcard character itself, and to change it into a literal you follow the same rule you used to change the dot from a wildcard character into a literal: put another wildcard character before it. To avoid inventing yet another wildcard character, the same backslash is used. Thus, to put a backslash as a literal in the search condition you have to double it like this: \\
In the same way, if you need two backslashes you double each of them like this: \\\\
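A short sketch of these rules in PHP follows; the file name and the path are hypothetical. One extra point is worth keeping in mind: PHP string literals also use the backslash for their own escaping, so the regular expression \\ has to be written with four backslashes in the source code.

```php
<?php
// Hypothetical test data. In single quotes PHP turns \\ into one backslash.
$file = 'report.txt';
$path = 'C:\\temp';   // the actual string is C:\temp

// An unescaped dot matches any symbol (except a new line by default).
var_dump(preg_match('/^report.txt$/', 'report_txt'));   // int(1)
// An escaped dot matches only a real dot.
var_dump(preg_match('/^report\.txt$/', 'report_txt'));  // int(0)
var_dump(preg_match('/^report\.txt$/', $file));         // int(1)

// The regular expression \\ matches one backslash; the PHP literal
// needs four backslashes to produce those two.
var_dump(preg_match('/^C:\\\\temp$/', $path));          // int(1)
```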
Alternatives
To be or not to be... That is the eternal alternative! After reading this part you’ll be able to write Hamlet’s alternative yourself by means of regular expressions. First let’s examine the data you are to process. Hamlet had a choice between ‘to be’ and ‘not to be’. In either case he got some solution at the output which was equal to ‘true’. To describe a choice condition by means of regular expressions you first have to decide which things you are choosing between. In Hamlet’s case we choose between two literal groups: one of them consists of two literals, ‘b’ and ‘e’, which follow each other; the other one consists of nine literals following each other: n, o, t, \s, t, o, \s, b, e
It’s clear that \s represents one space. I seem to have used a new term which hasn’t been mentioned before: ‘literal group’. A literal group is a character string whose characters are described either by symbol classes or by the literals themselves. A literal group is written in round brackets; the same brackets save the matched literal group in the special variables which I wrote about in the previous article. Here are examples of literal groups:
(be)
(not\sto\sbe)
Now return to the choice between two literal groups: (be)|(not\sto\sbe) The stick | between the two literal groups is the choice condition itself, and it is read as ‘or’. Now imagine that Hamlet’s answer is processed by a computer which knows that there are only two possible variants: either ‘be’ or ‘not to be’. Any other answer isn’t treated as an answer at all. We write a regular expression for testing Hamlet’s answers; the whole choice is enclosed in one more pair of round brackets so that the beginning and the end of the line apply to both variants and not only to the nearest one:
preg_match("/^((be)|(not\sto\sbe))$/", $alternate, $answer);
The regular expression matches in two cases:
- if $alternate is equal to "be"
- if $alternate is equal to "not to be"
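One detail is worth checking here: the vertical line has the lowest priority in a regular expression, so ^ and $ belong only to the variant next to them unless the whole choice is grouped. The sketch below compares a pattern without an enclosing group to one with it; the variable names $loose and $strict and the test strings are hypothetical.

```php
<?php
// Without an enclosing group this reads: "^(be)" OR "(not\sto\sbe)$".
$loose  = '/^(be)|(not\sto\sbe)$/';
// With an enclosing group both anchors apply to both variants.
$strict = '/^((be)|(not\sto\sbe))$/';

foreach (['be', 'not to be', 'bee', 'or not to be'] as $alternate) {
    printf(
        "%-16s loose=%d strict=%d\n",
        "'$alternate'",
        preg_match($loose, $alternate),
        preg_match($strict, $alternate)
    );
}
```

The loose pattern also accepts ‘bee’ (it starts with ‘be’) and ‘or not to be’ (it ends with ‘not to be’), while the strict one accepts only the two exact answers.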
At the start I asked which things we were going to choose between. I think it’s time to examine what it is possible to choose between in general. It’s possible to choose between literals and between literal groups. As I’ve already said, literal groups are enclosed in round brackets; if you need to choose between single literals, you enclose the two literals, separated by a vertical line, in round brackets.
Example of a choice between two literals: s(o|u)n matches both ‘son’ and ‘sun’.
Example of a choice between two literal groups: (son)|(sun) matches ‘son’ and ‘sun’ in the same way.
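Both notations find the same words, but they memorize different things. A minimal sketch with the word ‘sun’ as a hypothetical input:

```php
<?php
// Choice between two single literals inside one word.
preg_match('/s(o|u)n/', 'sun', $byLiteral);
print_r($byLiteral);   // [0] => sun, [1] => u

// Choice between two whole literal groups.
preg_match('/(son)|(sun)/', 'sun', $byGroup);
print_r($byGroup);     // [0] => sun, [1] => (empty), [2] => sun
```

In the second pattern the first group did not take part in the match, so PHP reports it as an empty string, and the matched word ends up in the second variable.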
Whether we choose between two literal groups or between single literals, they are enclosed in round brackets; but in the first article, in the part ‘Memorize all’, I said that everything enclosed in round brackets is memorized in special variables. This means the ‘memory’ resource is used. What shall we do if we need to group literals without memorizing them? We make the brackets skip memorizing! It is done by means of the sequence ?: like this: (?:be)|(?:not\sto\sbe). Now neither the literal group ‘be’ nor the literal group ‘not to be’ will be put into the variables \1, \2 (in Perl $1, $2); they will still group perfectly. It’s clear that if you need the literal ‘|’ you put a wildcard character before it to indicate that the vertical line is a literal in this case. The backslash is used for this: \|
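The difference between memorizing and non-memorizing brackets can be seen in the array that preg_match fills in. This is a minimal sketch; the input string and the variable names are hypothetical.

```php
<?php
$alternate = 'not to be';   // hypothetical input

// Ordinary brackets: the matched group is stored in element 1.
preg_match('/^(be|not\sto\sbe)$/', $alternate, $withMemory);
// Brackets with ?: group in the same way, but nothing is stored for them.
preg_match('/^(?:be|not\sto\sbe)$/', $alternate, $withoutMemory);

print_r($withMemory);      // [0] => not to be, [1] => not to be
print_r($withoutMemory);   // [0] => not to be

// An escaped vertical line is a literal, not a choice.
var_dump(preg_match('/^a\|b$/', 'a|b'));   // int(1)
var_dump(preg_match('/^a\|b$/', 'a'));     // int(0)
```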
Example for the chapter
An administrator has a list of users which was generated by a program or a utility in one of the following formats:
name surname
initials surname
initials. surname
Usually the boss needs some statistics, but he isn’t interested in the full names; he needs only the surname and initials in the form ‘initials. surname’.
I’ve done it like this (the loop over all the lines generated by the utility you can write yourselves):
preg_match("/([^\s]+)\s+([^\s.])[^\s.]*(?:\s|\.)([^\s.])[^\s.]*/",$income_str,$out_arr);
print_r($out_arr);
First we examine the incoming data, to make the regular expression work in 100 per cent of cases. You don’t have to check the data against any condition, since the data saved in the system already passed that testing on input. Name and surname may consist of any symbols, as we don’t know which symbols the user entered and which symbols the system permits; but we know exactly that the name is separated from the surname with a space, that the name may be either full or shortened to one letter, and that a name can be followed either by a space or by a dot.
If you learn to compose such descriptions of the data you are searching in, the greater part of your work is done: all that is left is to write down the foregoing sentence in the language of regular expressions:
- [^\s] – any symbol which is not a space (and not a new line symbol either). You could write \S instead, without the caret at the beginning of a symbol class, but that is a rather specific thing.
- [^\s]+ – at least one symbol which is not a space; with this we’ve already described the surname.
- \s+ – at least one space between name and surname
Several ways of writing names are in use:
- Full name and space after it
- First letter of a name and space after it
- First letter of a name and dot after it
What shall we search for? Everything after the space between name and surname up to the next space or dot; a name or its shortening consists of at least one letter.
- [^\s] – everything that isn’t a space, the first compulsory symbol of the name (or its shortening)
- [^\s.] – everything that isn’t a space or a dot; inside a symbol class the dot is a literal
- [^\s.]* – if we deal with a full name, we are to find everything that isn’t a dot or a space and follows immediately after the first symbol of the name which we described above.
So we have described a name by the construction [^\s.][^\s.]* but according to the task we need to memorize the first letter of the name (even if it is shortened): ([^\s.])[^\s.]* What follows after the name? In any case it is either a space or a dot. You have to pass over this place, so you need a condition of choice between two symbols: (\s|\.) – the choice between two literals is made by enclosing them in round brackets and writing a vertical line between them. If only one space is allowed in a line, you may leave it like this; if you aren’t sure, put the symbol ‘+’ after \s: (\s+|\.)
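To see the pieces work together, here is a minimal sketch of the loop around the pattern from this example. The input lines and the function name toInitials are hypothetical; a real list would come from the utility. The pattern itself is taken unchanged: it memorizes the first word of the line and the first letters of the two words that follow it.

```php
<?php
// Pattern from the example above, unchanged.
const NAME_PATTERN = '/([^\s]+)\s+([^\s.])[^\s.]*(?:\s|\.)([^\s.])[^\s.]*/';

// Returns the line in the form 'initials. surname', or null if it does not match.
function toInitials(string $income_str): ?string
{
    if (preg_match(NAME_PATTERN, $income_str, $out_arr) !== 1) {
        return null;
    }
    // $out_arr[1] is the first word, [2] and [3] are the memorized first letters.
    return "{$out_arr[2]}.{$out_arr[3]}. {$out_arr[1]}";
}

// Hypothetical input lines covering the full and the shortened forms.
foreach (['Ivanov Ivan Petrovich', 'Ivanov I P', 'Ivanov I.P.'] as $line) {
    echo toInitials($line) ?? "no match: $line", "\n";
}
```

All three hypothetical lines give the same result, ‘I.P. Ivanov’, which is the form the boss asked for.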



