To match or not to match with regular expressions in PHP

Published April 05, 2016

Regular expressions are among the most important tools in any programming language because they let us search for patterns in strings.

Where can regular expressions be used? Almost everywhere: form validation, browser detection, spam filtering, password-strength checks, and much more.

This tutorial will explain the subject from the perspective of the PHP programmer.

  1. How to find matching strings?
  2. The i modifier
  3. Global regular expression matching
  4. Meta what? Metacharacter!
  5. The problem of string literals
  6. Character sets
  7. Shorthand character sets
  8. Quantifiers in regular expressions
  9. Lazy and greedy expressions
  10. How to find a match to a set of expressions?
  11. How to match alternatives when the order does matter?
  12. Capturing groups and backreferences
  13. The search for strings that don't match
  14. Search and replace with preg_replace
  15. How to split strings by regular expressions?
  16. How to search for matches inside arrays?
  17. Where to go from here?

# How to find matching strings?

preg_match is a built-in PHP function that searches a string for a match to a regular expression. It returns 1 if it finds a match and 0 if it doesn't (or false if an error occurs), so it can be used directly in an if condition.

The syntax is straightforward:

preg_match($regex, $string, $match);

$regex is the regular expression, and it needs to be enclosed in delimiters, most commonly slashes (/). For example:

$regex = "/exp/";

$string stands for the string inside which we look for the matching pattern.

$match is the array that stores the first match that the function finds (preg_match stops searching as soon as it finds the first match).

For example:

<?php
// The regex here is the word 'go'.
$regex  = "/go/";

// We search for a match inside this string.
$string = "you gotta give the go kart a try";

// preg_match returns 1 on a match and 0 otherwise.
if(preg_match($regex, $string, $match)) 
{
  echo "We found a match to the expression: " . $match[0];
} 
else 
{
  echo "We found no match.";
}

And the result is:

We found a match to the expression: go

preg_match searches for a string that matches the regular expression ($regex) in the string ($string), and the first match it finds is stored as the first item of the $match array.

If we change the string to something that doesn't match the pattern, preg_match will return 0 and the else branch will run.
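
The return values are easy to check directly. The following is a minimal sketch (the second sample string is made up for illustration) that prints what preg_match gives back with and without a match:

```php
<?php
$regex = '/go/';

// A string that contains the pattern.
$found = preg_match($regex, 'you gotta give the go kart a try', $match);

// A made-up string that does not contain the pattern.
$missing = preg_match($regex, 'stay where you are', $noMatch);

var_dump($found);    // int(1)
var_dump($missing);  // int(0)
var_dump($noMatch);  // an empty array
```

Because 1 and 0 behave like true and false inside a condition, the if statement in the example above works as expected.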

# The i modifier

Regular expressions are case sensitive by default. This means that the regular expression /chr*/ is different from /Chr*/, only because the first expression starts with a lowercase 'c' while the second starts with a capital 'C'. We can change this behavior by adding the i modifier right after the closing delimiter of the regular expression.

So, the regular expression:

$regex = "/Chr*/i";

will match both 'Chrome' and 'chrome'.
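
To see the difference, here is a small sketch that runs the same expression with and without the i modifier against a lowercase string:

```php
<?php
$string = 'chrome';

// Case sensitive: the capital 'C' does not match the lowercase 'c'.
$sensitive = preg_match('/Chr*/', $string);

// Case insensitive thanks to the i modifier.
$insensitive = preg_match('/Chr*/i', $string, $match);

var_dump($sensitive);    // int(0)
var_dump($insensitive);  // int(1)
echo $match[0];          // chr
```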

# Global regular expression matching

We learned that preg_match stops searching as soon as it finds the first match. To find all the matches to the regular expression (i.e., global matching), we need another built-in function, preg_match_all.

In the following example, we perform a global regular expression matching for the expression $regex in the $string, and the matches are stored in the first item of the $matches array, $matches[0].

$regex = "/reg/";

$string = "Both regex and regexp are short for regular expression";

if(preg_match_all($regex, $string, $matches))
{
    print_r($matches[0]);
}

Result:

Array
(
 [0] => reg
 [1] => reg
 [2] => reg
)
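
As a quick comparison, this sketch runs both functions on the same string. Note that preg_match_all returns the number of matches it found, which is handy when we only need to count them:

```php
<?php
$regex  = '/reg/';
$string = 'Both regex and regexp are short for regular expression';

// preg_match stops at the first match.
$first = preg_match($regex, $string, $match);

// preg_match_all keeps going and returns the number of matches.
$count = preg_match_all($regex, $string, $matches);

echo $first, PHP_EOL;              // 1
echo count($match), PHP_EOL;       // 1
echo $count, PHP_EOL;              // 3
echo count($matches[0]), PHP_EOL;  // 3
```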

# Meta what? Metacharacter!

Metacharacters are characters that have special meaning in regular expressions. Let's learn our first 3 metacharacters:

| The metacharacter | Matches |
|---|---|
| `.` | any character except for a new line |
| `^` | the start of the string |
| `$` | the end of the string |

Let's give it a try:

// The metacharacter $ indicates that we look for a match at the end of the string.
$regex = "/3.2$/";

$string = "13.2";

if (preg_match($regex, $string, $match)) 
{
    echo "We found a match to: <i>$match[0]</i>";
}
else
{
    echo "No match!";
}

Result:

We found a match to: 3.2

The regex matches the literal string "3.2" as long as it is found at the end of the string (because the regex ends with '$').

It might be surprising, but the regex also matches the strings "3f2" and "3%2", as long as they are found at the end of the string, because the dot symbol is a metacharacter that represents any character that is not a new line.
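
We can verify this with a short sketch. The last candidate, "3.25", is an extra string added here to show that the '$' really anchors the match to the end:

```php
<?php
$regex = '/3.2$/';

$results = [];
foreach (['13.2', '3f2', '3%2', '3.25'] as $candidate) {
    // 1 means a match, 0 means no match.
    $results[$candidate] = preg_match($regex, $candidate);
}

print_r($results);
// '13.2', '3f2' and '3%2' give 1; '3.25' gives 0
```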

So, how can we match the dot literally (without its special meaning)? In fact, how can we match any metacharacter literally?

# The problem of string literals

As we saw in the previous section, the metacharacters have a special meaning (^ stands for the start of the string, $ for the end of the string, and so on), so we need a way to tell the regular expression to search for these symbols in their literal meaning.

The way to change this behavior is by escaping the metacharacters with a backslash (\).

For example, the following regex matches exactly the string "3.2":

$regex = "/3\.2/";

And this example matches exactly the string "^$":

$regex = "/\^\\$/";

$string = "^$^";

if (preg_match($regex, $string, $match)) 
{
  echo "We found a match to:<i>" . $match[0] . "</i>";
}

Result:

We found a match to:^$

Let's see some more examples of escaping the metacharacters:

| The expression | Matches |
|---|---|
| `\.` | simply a dot |
| `\$` | simply the $ character |
| `\^` | simply the caret character |
| `\\` | simply the backslash character |

In some cases, you don't need to escape a metacharacter, because it loses its special meaning in certain positions (for example, a dot inside square brackets is just a dot). On the other hand, escaping a character that is not a letter or a digit never hurts: a backslash followed by such a character always stands for the character itself. So, if you're not sure, add the backslash. Just don't put a backslash before letters or digits, because combinations like \d and \w have a special meaning of their own.
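
When the text that we want to match literally comes from a variable, we don't have to add the backslashes by hand. PHP's built-in preg_quote function does it for us. In this sketch, $input is a hypothetical value chosen for illustration, and the pattern is anchored with ^ and $ so that it has to match the whole string:

```php
<?php
// Hypothetical input that we want to match literally.
$input = '3.2';

// preg_quote() escapes the metacharacters; the second argument
// also escapes the delimiter that we use (a slash here).
$regex = '/^' . preg_quote($input, '/') . '$/';

echo $regex, PHP_EOL;                 // /^3\.2$/
var_dump(preg_match($regex, '3.2'));  // int(1)
var_dump(preg_match($regex, '3f2'));  // int(0)
```

The patterns in this sketch are written in single quotes, so PHP does not try to interpret the '$' or most backslashes before the regex engine sees them.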

# Character sets

There's no need to specify each and every character that we want to match. Instead, we can use character sets: a list or a range of characters between square brackets, any one of which counts as a match.

Let's see some examples of character sets:

| The character set | Matches |
|---|---|
| `[ab]` | a or b |
| `[abc]` | a, b or c |
| `[A-Z]` | the uppercase letters A-Z |
| `[a-z]` | the lowercase letters a-z |
| `[A-Za-z]` | lowercase or uppercase letters |
| `[a-d]` | the range of letters from a to d |
| `[a-dA-D]` | both uppercase and lowercase letters a-d |
| `[a-dm-p]` | the range of letters from a to d and from m to p |
| `[0-9]` | all the digits |
| `[1-4]` | the range of digits from 1 to 4 |
| `[a-zA-Z0-9]` | all the letters and all the digits |

We have already learned that the caret (^) symbol represents the start of the string, but when it is used inside the square brackets it indicates the negation of the character set.

For example, if [a-z] means the range of letters from a to z, then [^a-z], with the caret symbol inside the brackets, means every character that is not in the set of lowercase letters.

Let's see some more examples:

| The character set | Matches |
|---|---|
| `[^A-Z]` | every character that is not an uppercase letter |
| `[^a-z]` | everything that is not a lowercase letter |
| `[^A-Za-z]` | every character other than English letters |
| `[^0-9]` | everything that is not a digit |
| `[^A-Za-z0-9]` | everything that is not a digit or a letter |
| `[^a-d]` | everything that is not in the range of a-d |
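
Let's give the character sets a try. The sample strings in this short sketch are made up for illustration:

```php
<?php
// Count the characters in the range a-d.
$count = preg_match_all('/[a-d]/', 'abcdefg', $matches);
echo $count, PHP_EOL;  // 4

// A negated set finds the first character that is NOT a digit.
$onlyDigits = preg_match('/[^0-9]/', '2016');
$hasLetter  = preg_match('/[^0-9]/', '20a16', $match);

var_dump($onlyDigits);  // int(0), every character is a digit
var_dump($hasLetter);   // int(1)
echo $match[0];         // a
```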

# Shorthand character sets

We have shortcuts for some of the most commonly used character sets. Let's learn some of these shorthands.

| The shorthand | Matches |
|---|---|
| `\s` | white space characters like space, tab and new line |
| `\d` | any digit (0-9) |
| `\w` | word characters, including the English letters (a-zA-Z), digits (0-9), and underscore (_) |

For example, the regular expression:

$regex = "/\s\w\s\d\d\d\d$/";

searches for a pattern that matches white space, a word character, another white space, and then 4 digits at the end of the string.

Let's test the expression with the following code:

$regex = "/\s\w\s\d\d\d\d$/";
    
$string = "bat a 1000";

if (preg_match($regex, $string, $match)) 
{
  echo "We found a match: <i>$match[0]</i>";
}
else
{
  echo "No match!";
}

Result:

We found a match: a 1000

We found the match because the string ends with the pattern of a space, followed by a word character, another space, and then 4 digits.
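
Here is one more sketch that uses each of the three shorthands on its own. The sample strings are made up, and the first one contains a tab (\t) to show that \s is not limited to the space character:

```php
<?php
$string = "Meet at 09:30\tsharp";

// Two digits, a colon, and two more digits.
preg_match('/\d\d:\d\d/', $string, $time);
echo $time[0], PHP_EOL;    // 09:30

// Two spaces and one tab.
$spaces = preg_match_all('/\s/', $string, $found);
echo $spaces, PHP_EOL;     // 3

// 'a', '_' and '1' are word characters; the space and '!' are not.
$wordChars = preg_match_all('/\w/', 'a_1 !', $chars);
echo $wordChars, PHP_EOL;  // 3
```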

# Quantifiers in regular expressions

We use quantifiers in order to specify the number of times that a group of characters or a character can be repeated in a regular expression.

For example, in order to find a match to the string "Mississippi", we can use the following expression:

$regex = "/Mis{2}is{2}ip{2}i/";

The {2} in the expression means exactly 2 times.

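The following table gives examples of the use of quantifiers:

| The quantifier | Searches for |
|---|---|
| `n{x}` | the letter 'n' exactly x times |
| `n{2}` | the letter 'n' exactly 2 times |
| `n{x,y}` | 'n' between x and y times |
| `n{2,3}` | 'n' between 2 and 3 times |
| `n{x,}` | 'n' at least x times |
| `n{2,}` | 'n' at least 2 times |
| `n{0,y}` | 'n' not more than y times |
| `n{0,3}` | 'n' not more than 3 times |

Note that the shorter form n{,y} is not safe to rely on: older versions of the regex engine treat it as literal text, and PHP recognizes it as a quantifier only from version 8.4. Writing n{0,y} works everywhere.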

For example, we can find a match to both "color" and "colour" by using the following regular expression:

$regex = "/colou{0,1}r/";

Here, the use of the quantifier makes the 'u' optional.

Other regex quantifiers are less specific:

| The quantifier | Searches for |
|---|---|
| `*` | zero times or more |
| `+` | at least 1 time |
| `?` | zero or 1 time |

For example, the following expression can match both "color" and "colour".

$regex = "/colou?r/";

Since the "?" metacharacter makes the one character that it follows optional, the regular expression finds a match with or without the "u".
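
Let's test the quantifiers with a short sketch. The expressions are anchored with ^ and $ so that they have to match the whole string, and the misspelled "colouur" is added only to show a case that fails:

```php
<?php
// The 'u' is optional, but it cannot appear twice.
$colorResults = [];
foreach (['color', 'colour', 'colouur'] as $word) {
    $colorResults[$word] = preg_match('/^colou?r$/', $word);
}
print_r($colorResults);
// color => 1, colour => 1, colouur => 0

// 'n' repeated 2 or 3 times, and nothing else.
$rangeResults = [];
foreach (['n', 'nn', 'nnn', 'nnnn'] as $word) {
    $rangeResults[$word] = preg_match('/^n{2,3}$/', $word);
}
print_r($rangeResults);
// n => 0, nn => 1, nnn => 1, nnnn => 0
```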

# Lazy and greedy expressions

Quantifiers are greedy. They are greedy because they try to match the longest string possible. This may have unforeseen outcomes, since we might get a much longer match than we anticipated.

For example, I would like to replace the colors of both cats (which are found inside span elements) with the string 'M@#!' in the following sentence.

$string = "Said the <span>striped</span> cat to the <span>orange</span> cat.";

// ordinary greedy expression.
$regex = "/<span>.+<\/span>/";

echo preg_replace($regex, 'M@#!', $string);

And the result is:

Said the M@#! cat.

Not quite what we expected.

Instead of replacing each span separately, the expression started replacing from the first opening span tag and ended with the last closing span tag. This behavior is caused by the greedy nature of the regular expressions.

To get what we want, we need to make the expression lazy. We can make the expression lazy by adding the '?' symbol right after the quantifier.

Let's add the '?' symbol right after the '+' quantifier to make the expression lazy.

$string = "Said the <span>striped</span> cat to the <span>orange</span> cat.";

// Lazy expression with the '?' symbol.
$regex = "/<span>.+?<\/span>/";

echo preg_replace($regex, 'M@#!', $string);

And the result is:

Said the M@#! cat to the M@#! cat.

Every span is separately replaced.
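
Another way to see the difference is to collect the matches with preg_match_all instead of replacing them. This minimal sketch runs the greedy and the lazy expressions on the same sentence:

```php
<?php
$string = 'Said the <span>striped</span> cat to the <span>orange</span> cat.';

// Greedy: one long match from the first <span> to the last </span>.
preg_match_all('/<span>.+<\/span>/', $string, $greedy);

// Lazy: each span is matched on its own.
preg_match_all('/<span>.+?<\/span>/', $string, $lazy);

echo count($greedy[0]), PHP_EOL;  // 1
echo count($lazy[0]), PHP_EOL;    // 2
print_r($lazy[0]);
// <span>striped</span> and <span>orange</span>
```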

# How to find a match to a set of expressions?

In order to choose between several alternatives we need to put the pipe (|) symbol between the different alternatives. For example,

  • To find a match to one of the strings 'png' or 'jpeg' we use the expression '/png|jpeg/'.
  • We can have more than two alternatives, '/png|jpeg|gif|bmp/'.

If we want to match both 'jpg' and 'jpeg', we can add the expression 'jpeg' to the set: '/png|jpg|gif|bmp|jpeg/'.

We can also make the 'e' in 'jpeg' optional, by adding the '?' symbol right after it: '/png|jpe?g|gif|bmp/'.

The following expression matches image filenames:

$regex = "/^([A-Za-z0-9-_.])+\.(png|jpe?g|gif|bmp)$/";

Pay attention to the parentheses around the set of file extensions. The parentheses separate the extensions from the rest of the expression, so the first alternative in the set is 'png' and not '([A-Za-z0-9-_.])+\.png'.
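
Let's test the idea on a few made-up filenames. The regex in this sketch is a slight variation on the one above, not the exact same expression: the hyphen is moved to the end of the character set, where it cannot be mistaken for part of a range, and the set is not wrapped in parentheses.

```php
<?php
// Variation on the expression above (see the note before this block).
$regex = '/^[A-Za-z0-9_.-]+\.(png|jpe?g|gif|bmp)$/';

// Made-up filenames for illustration.
$files = ['photo.jpg', 'photo.jpeg', 'my-cat_01.gif', 'notes.txt', 'logo.PNG'];

$results = [];
foreach ($files as $file) {
    $results[$file] = preg_match($regex, $file);
}

print_r($results);
// photo.jpg => 1, photo.jpeg => 1, my-cat_01.gif => 1,
// notes.txt => 0, logo.PNG => 0
```

'logo.PNG' fails because the expression is case sensitive. Adding the i modifier after the closing delimiter would let it match.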

# How to match alternatives when the order does matter?

In the previous section, we saw how to write a set of options when the order does not matter, but how can we search for matches when the order does matter?

Let's take, for example, the following string:
'The sum of 2 and 3 is 5'
which is equivalent to:
'The sum of 3 and 2 is 5'.

The first regex that comes to mind may be the following:

$regex = "/^The sum of (2|3) and (3|2) is 5$/";

The problem is that the regex can also match the following strings:
"The sum of 2 and 2 is 5"
and
"The sum of 3 and 3 is 5"
which are obviously wrong.

To find a perfect match, we need to modify the regex a little bit, so it can only match the right options.

$regex = "/^The sum of (2( and 3)|3( and 2)) is 5$/";

Let's test the regex:

$regex = "/^The sum of (2( and 3)|3( and 2)) is 5$/";

$string = "The sum of 2 and 3 is 5";

if(preg_match($regex, $string, $match))
{
  echo "We found a match to the expression: " . $match[0];
}
else
{
  echo "We found no match.";
}

And the result is:

We found a match to the expression: The sum of 2 and 3 is 5
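
To make sure that the wrong sentences are rejected as well, this short sketch runs the same regex against all four combinations:

```php
<?php
$regex = '/^The sum of (2( and 3)|3( and 2)) is 5$/';

$sentences = [
    'The sum of 2 and 3 is 5',
    'The sum of 3 and 2 is 5',
    'The sum of 2 and 2 is 5',
    'The sum of 3 and 3 is 5',
];

$results = [];
foreach ($sentences as $sentence) {
    $results[$sentence] = preg_match($regex, $sentence);
}

print_r($results);
// The first two sentences give 1, the last two give 0.
```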

# Capturing groups and backreferences

When we wrap part of an expression in parentheses, we capture the text that it matches, and we can later refer back to it in the replacement string with the '$' symbol followed by the number of the group. The first group that we captured will be referenced by '$1', the second group by '$2', the third group by '$3', etc.

In the following example, we take the dates in the European date format and re-format them into the American date format.

$string = "16-04-2016";

$regex = "/([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})/";
// The first group captures the day.
// The second group captures the month.
// The third group captures the year.

$replace = "$2-$1-$3";
// Swap the first and second groups (day and month).

echo preg_replace($regex ,$replace ,$string );

And the result is:

04-16-2016

If we want to avoid capturing one of the groups, we add '?:' right after its opening parenthesis to turn it into a non-capturing group. The group still takes part in the match, but its text is not stored and it does not get a number.

For example, in order to avoid capturing the year, we can add the non-capturing group at the beginning of the expression that matches the year:

$string = "16-04-2016";
$regex = "/([0-9]{1,2})-([0-9]{1,2})-(?:[0-9]{4})/";
$replace = "$2-$1-$3";

echo preg_replace($regex ,$replace ,$string );

Accordingly, the result does not contain the year, because '$3' no longer refers to any captured group and is replaced with an empty string:

04-16-
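
The captured groups are also available when we only search, without replacing. In this sketch, preg_match fills its third argument with the full match at index 0, followed by one item for each captured group:

```php
<?php
$string = '16-04-2016';

// Three capturing groups: day, month and year.
preg_match('/([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})/', $string, $all);

// The year is matched but not captured.
preg_match('/([0-9]{1,2})-([0-9]{1,2})-(?:[0-9]{4})/', $string, $noYear);

print_r($all);
// [0] => 16-04-2016, [1] => 16, [2] => 04, [3] => 2016

print_r($noYear);
// [0] => 16-04-2016, [1] => 16, [2] => 04
```

In both cases the full match at index 0 still includes the year. Only the list of captured groups changes.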

 

## How to improve the performance of capturing groups?

It is advisable to use capturing groups only when they are really needed, because storing each captured group adds a small overhead to the regex.
Of course, there are cases in which we have no choice but to use parentheses for grouping. A neat way to solve the problem is to turn them into non-capturing groups by adding '?:' right after the opening parenthesis.

For example, we can improve the performance of the regex that searches for images by adding the non-capturing group:

$regex = "/^(?:[A-Za-z0-9-_.])+\.(?:png|jpe?g|gif|bmp)$/";

# The search for strings that don't match

Sometimes we may be interested only in those strings that do not match the regular expression. In these cases, we precede the PHP function with the "not operator" (!) in order to reverse the result of the boolean expression.

Better done than said. In the following example, we check that a string does not contain the word "fowl":

$regex = "/fowl/";

$string = "Birds of a feather flock together.";

// In order to search for strings that don't match
// we precede the PHP function with the not operator, "!"
if(!preg_match($regex, $string, $match))
{
    echo "No match";
} 
else 
{ 
    echo "There is a match";
}

And the result is:

No match

# Search and replace with preg_replace

In order to replace strings, we use the preg_replace() function, with the following syntax:

preg_replace($regex, $replace, $string);
  • $regex - the expression that we search for.
  • $replace - what we want the match to be replaced with.
  • $string - the string in which we look for the expression.

In the following example, we replace all the wrong forms of the word 'misspelled' with the correct form.

$regex = "/miss?pp?ell?e?d/";
$replace = "misspelled";
$string  = "He mispeled the word in all of his emails.";

echo preg_replace($regex, $replace, $string);

And the result is:

He misspelled the word in all of his emails.
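
preg_replace replaces every match that it finds, not just the first one. It also accepts two optional arguments: a limit on the number of replacements (-1 means no limit) and a variable that receives the number of replacements made. The sentence in this sketch is made up so that it contains two different misspellings:

```php
<?php
$regex   = '/miss?pp?ell?e?d/';
$replace = 'misspelled';
$string  = 'He mispeled it once and mispelled it twice.';

// -1 means "no limit"; $count receives the number of replacements.
$fixed = preg_replace($regex, $replace, $string, -1, $count);

echo $fixed, PHP_EOL;  // He misspelled it once and misspelled it twice.
echo $count, PHP_EOL;  // 2
```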

# How to split strings by regular expressions?

preg_split is the built-in PHP function that we use when we want to split a string by a regular expression. It has the following syntax:

preg_split($regex, $string);
  • $regex - the expression that we search for, and want to split at.
  • $string - the string in which we search for the expression.

In the following example, we want to split at each comma that is followed by one or more white space characters.

$regex = "/,\s+/";

$string = "html, css,     javascript,           php";

$languages = preg_split($regex, $string);

print_r($languages);

And the result is:

Array
(
 [0] => html
 [1] => css
 [2] => javascript
 [3] => php
)
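
Real-world lists are often less tidy. The following sketch assumes a made-up input with a space before a comma, a comma with no space after it, and an empty item. It allows optional white space on both sides of the comma, and uses the PREG_SPLIT_NO_EMPTY flag to drop the empty pieces:

```php
<?php
// A made-up, untidy list.
$string = 'html ,css,  javascript,,php';

// \s* allows zero or more white space characters around the comma.
// -1 means "no limit" on the number of pieces.
$languages = preg_split('/\s*,\s*/', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($languages);
// html, css, javascript, php
```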

# How to search for matches inside arrays?

preg_grep searches for matches inside arrays, and returns an array that consists only of the matching items. The keys of the original array are preserved.

The syntax:

preg_grep($pattern, $array);

$array stands for the array in which we search for matching items.

In the following example, we search inside the $models array for items that start with 't' (lower or upper case):

$models = array("Bentley", "Tesla", "Maserati", "toyota", "Subaru", "Alpha");
    
$output = preg_grep('/^t[a-z]+/i', $models);

print_r( $output );

And the result is:

Array
(
 [1] => Tesla
 [3] => toyota
)
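
Notice that the result keeps the original keys (1 and 3). The following sketch shows two common follow-ups: re-indexing the result with array_values, and passing the PREG_GREP_INVERT flag to get the items that do not match:

```php
<?php
$models = array("Bentley", "Tesla", "Maserati", "toyota", "Subaru", "Alpha");

// Re-index the matching items so that the keys start at 0.
$matching = array_values(preg_grep('/^t[a-z]+/i', $models));

// PREG_GREP_INVERT returns the items that do NOT match.
$others = preg_grep('/^t[a-z]+/i', $models, PREG_GREP_INVERT);

print_r($matching);
// [0] => Tesla, [1] => toyota

print_r($others);
// [0] => Bentley, [2] => Maserati, [4] => Subaru, [5] => Alpha
```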

# Where to go from here?

It is hard to overstate the importance of regular expressions in any programming language, so it is advisable to keep learning as much as you can about them.

These are the resources that I recommend the most:
