Finally Understand Regular Expressions: Regex isn't as hard as it looks

by Alex Hyett | 6 min read

There’s nothing like a regular expression to strike fear in the heart of a developer.

Regular expressions (regex) are used for a lot of things, such as validating that a string is in the right format and grabbing certain parts of a string.

You can do simple string searches with regex, but obviously that’s not what makes it powerful.

If you want to follow along with these examples, there is a great website called Regexr that I always use when testing regular expressions.

Special characters

There are a few different special characters you can use to help you with your searches.

  • \w - will match every alphanumeric character as well as underscores.
  • \d - will match any digit (0-9).
  • \s - matches spaces, tabs and new lines.

We can turn these into the negative versions by using a capital letter:

  • \W - will match everything that is not alphanumeric or an underscore.
  • \D - will match everything that is not a digit.
  • \S - will match everything that is not a space, tab or new line.
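As a rough sketch of how these classes behave in code, here is a small example using Python's built-in re module. The choice of Python and the sample string are assumptions made for illustration, as the article itself is not tied to a language. Note that Python's \w and \d are Unicode-aware by default, which makes no difference for plain ASCII text like this.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Room 42_b: ok!"

# Runs of word characters (\w): letters, digits and underscores.
words = re.findall(r"\w+", text)        # ['Room', '42_b', 'ok']

# Individual digits (\d) and whitespace characters (\s).
digits = re.findall(r"\d", text)        # ['4', '2']
spaces = re.findall(r"\s", text)        # [' ', ' ']

# The capitalised versions match the opposite, so replacing them
# with an empty string keeps only the lowercase class.
only_word_chars = re.sub(r"\W", "", text)   # 'Room42_bok'
only_digits = re.sub(r"\D", "", text)       # '42'
only_spaces = re.sub(r"\S", "", text)       # '  ' (two spaces)
```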

There is also another special character that is used to match any character in your string (except, by default, a new line), and that is the dot (.).

If you were to search for .at in the following sentence:

The cat sat on the mat at home.

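It will match on cat, sat, mat, and the standalone at. In that last case the match includes the space in front of at, because the . has to match some character.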
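If you want to check this in code rather than in Regexr, a minimal sketch with Python's re module (Python is an assumption here, not something the article prescribes) looks like this:

```python
import re

sentence = "The cat sat on the mat at home."

# The dot stands for any single character, so each match is three
# characters long: one arbitrary character followed by "at".
dot_matches = re.findall(r".at", sentence)
print(dot_matches)  # ['cat', 'sat', 'mat', ' at']

# To match a literal full stop, escape the dot with a backslash.
literal_dots = re.findall(r"\.", sentence)
print(literal_dots)  # ['.']
```

The last item in the first list is " at" with a leading space, which is easy to miss when the matches are only highlighted on screen.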

Quantifiers

There are a few quantifiers you can use in regex to match multiple occurrences of a character or pattern.

  • * - match 0 or more of the preceding pattern.
  • + - match at least 1 of the preceding pattern.
  • ? - match 0 or 1 of the preceding pattern (it is basically optional).
  • {3} - matches exactly 3 occurrences of the preceding pattern.
  • {3,5} - match between 3 and 5 occurrences (3,4,5) of the preceding pattern.

Let’s say we have the following text:

a aa aaa aaaa aaaaa aaaaaa

This is what we get with the following patterns:

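  • a* matches a, aa, aaa, aaaa, aaaaa, aaaaaa. Because zero occurrences are allowed, most engines will also report empty matches at the spaces.
  • a+ matches all of them as well, as each one has at least 1 a.
  • a? matches a 21 times, once for each individual a in the text.
  • a{3} matches aaa 5 times: once each in aaa, aaaa and aaaaa, and twice in aaaaaa.
  • a{4,5} matches 3 times: aaaa once, and aaaaa twice (in aaaaa and in the first five characters of aaaaaa).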
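The counts above can be reproduced with a short script. This is only a sketch using Python's re module (an assumed language choice). Patterns that allow zero occurrences, such as a* and a?, also produce empty matches, so the example filters those out before counting.

```python
import re

text = "a aa aaa aaaa aaaaa aaaaaa"

# + needs at least one "a", so we get each run of letters.
plus_matches = re.findall(r"a+", text)

# * and ? can match nothing at all, which yields empty strings.
# Dropping the empty strings leaves only the visible matches.
star_matches = [m for m in re.findall(r"a*", text) if m]
optional_matches = [m for m in re.findall(r"a?", text) if m]

# Exact and ranged counts. Matching is greedy, so {4,5} takes
# five characters whenever five are available.
exactly_three = re.findall(r"a{3}", text)
four_to_five = re.findall(r"a{4,5}", text)

print(plus_matches)           # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']
print(len(optional_matches))  # 21
print(len(exactly_three))     # 5
print(four_to_five)           # ['aaaa', 'aaaaa', 'aaaaa']
```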

Character Sets

In some cases, we want to match one of several different characters. For this, we have character sets. To use a character set, we put the characters we want to allow inside square brackets [].

Let’s take our simple sentence again and look at an example:

The cat sat on the mat at home.

If we search for the pattern [cs]at we are going to match on cat and sat but not mat.

You can also use ranges of characters. If we search for the pattern [a-p]at then we are going to match on cat and mat but not sat, because s falls outside the range a to p.

As with the special characters, it is also possible to look at the negative version of this by putting a ^ symbol at the start inside the brackets.

So [^a-p]at will match on sat, and also on the standalone at together with the space in front of it, because a space is a character too and it is not in the range a to p.
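Here is a small sketch of the three character set patterns from this section, again assuming Python's re module purely for illustration:

```python
import re

sentence = "The cat sat on the mat at home."

# A set of individual characters: "c" or "s", followed by "at".
either_c_or_s = re.findall(r"[cs]at", sentence)
print(either_c_or_s)  # ['cat', 'sat']

# A range: any lowercase letter from "a" to "p", followed by "at".
a_to_p = re.findall(r"[a-p]at", sentence)
print(a_to_p)  # ['cat', 'mat']

# A negated range: any character that is NOT "a" to "p".
# The space before the standalone "at" qualifies.
not_a_to_p = re.findall(r"[^a-p]at", sentence)
print(not_a_to_p)  # ['sat', ' at']
```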

Capture Groups

One of the main reasons for using regular expressions is to extract a string from a bit of text.

For example, if you wanted to extract the domain from the following email address:

cat@alexhyett.com

We can use the following regular expression to match on this email address:

[\w-\.]+@([\w-]+\.+[\w]{2,63})

Let’s break this down, so we can see what it is doing:

  • [\w-\.]+ - The first part is matching on any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen - and a dot. The dot here has been escaped with a backslash to make it clear that it is a literal dot rather than the . special character (inside square brackets the escape is optional). These characters are matched one or more times, denoted by the +.
  • @ - is just matching the @ character.
  • [\w-]+ - is matching any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen -. These characters are matched one or more times, denoted by the +.
  • \.+ - is matching the . character, one or more times because of the +.
  • [\w]{2,63} matches any alphanumeric character and underscore. I don’t think you can have underscores in top level domains so this should probably just be [a-zA-Z] but \w is simpler. This is then matched 2 to 63 times to allow for extensions such as uk, com, technology.

We have then wrapped everything after the @ up to the end of the pattern in round brackets () to create a capture group.

When you use this regex in code you will be able to look at the groups and extract the domain e.g. alexhyett.com.
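As a sketch of what that looks like in practice, here is the same idea using Python's re module (an assumed language choice). One adaptation is needed: Python rejects [\w-\.] because it reads the hyphen between \w and \. as a malformed range, so in this version the hyphen is moved to the end of the first set, where it is always treated as a literal. The sample sentence is hypothetical.

```python
import re

# Adapted from the article's pattern: the first set is written as
# [\w.-] instead of [\w-\.] so that Python accepts it.
email_pattern = re.compile(r"[\w.-]+@([\w-]+\.+\w{2,63})")

# Hypothetical text containing the address from the example.
text = "You can reach me at cat@alexhyett.com if you like."

match = email_pattern.search(text)
if match:
    whole_address = match.group(0)  # the full match
    domain = match.group(1)         # the first capture group
    print(whole_address)  # cat@alexhyett.com
    print(domain)         # alexhyett.com
```

Group 0 is always the whole match, and the numbered groups start at 1 in the order their opening brackets appear in the pattern.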

Lookahead and Lookbehind

This is where people start to switch off when it comes to regular expressions.

Positive and Negative Lookahead and Lookbehinds sound complicated, but they are not actually that hard.

A lookahead, or lookbehind, just looks for a particular pattern ahead or behind what you are looking for, without including it as part of the match.

Positive Lookahead

Let’s go back to our simple string and see how we can use a positive lookahead.

The cat sat on the mat at home.

Say we want to match on the letter o but only if it has the letter m after it.

To do this, we use the pattern o(?=m), which will match the o in home but not the o in on.

Negative Lookahead

We can also do the negative of this. If we wanted to find all occurrences of the letter o that do not have the letter m after them, we would use o(?!m), basically replacing the = with an !.

This would then match the o in on but not the o in home.
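To see that a lookahead checks the next character without including it in the match, here is a short sketch using Python's re module (the language is an assumption). It records where each o was found and what text was actually matched.

```python
import re

sentence = "The cat sat on the mat at home."

# Positive lookahead: an "o" that is followed by "m".
followed_by_m = [(m.start(), m.group()) for m in re.finditer(r"o(?=m)", sentence)]

# Negative lookahead: an "o" that is NOT followed by "m".
not_followed_by_m = [(m.start(), m.group()) for m in re.finditer(r"o(?!m)", sentence)]

print(followed_by_m)      # [(27, 'o')]  the o in "home"
print(not_followed_by_m)  # [(12, 'o')]  the o in "on"
```

In both cases the matched text is just o. The m is only inspected, never consumed.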

Positive Lookbehind

You can probably see where this is going now. A lookbehind, as the name suggests, looks backwards instead of forwards.

If we want to find all occurrences of the word at that are preceded by the letter c we can use the following pattern (?<=c)at. This will only match the at in cat but not any of the other occurrences.

Negative Lookbehind

Similarly, we can find the negative version of the positive lookbehind by changing the = to a !.

If we now search for (?<!c)at it will match on the at in sat, mat and at.
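One way to visualise both lookbehinds is to replace whatever they match with an uppercase AT. This is a minimal sketch assuming Python's re module:

```python
import re

sentence = "The cat sat on the mat at home."

# Positive lookbehind: "at" only when preceded by "c".
after_c = re.sub(r"(?<=c)at", "AT", sentence)
print(after_c)      # The cAT sat on the mat at home.

# Negative lookbehind: "at" only when NOT preceded by "c".
not_after_c = re.sub(r"(?<!c)at", "AT", sentence)
print(not_after_c)  # The cat sAT on the mAT AT home.
```

The preceding letter is left untouched in both results, which shows that the lookbehind is not part of the match.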

Extra Tip

It is also possible to combine multiple patterns in one regular expression.

Let’s say we want to find all the a characters that aren’t preceded by an s as well as all the t characters.

We can use the OR symbol | and have a pattern that looks like this:

(?<!s)a|t

You can do this, but it can get quite complicated if you’re going to be chaining on lots of different expressions.

If you’re doing these regular expressions in code, then I recommend that you split these out, just to make the regular expressions that much clearer.
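As a sketch of what splitting the pattern out might look like, here is the combined expression next to two separate, named patterns, using Python's re module (an assumed language choice, with hypothetical variable names). Because | has the lowest precedence, the combined pattern reads as "(?<!s)a" OR "t".

```python
import re

sentence = "The cat sat on the mat at home."

# Combined: an "a" not preceded by "s", or any "t".
combined = re.compile(r"(?<!s)a|t")
combined_positions = [m.start() for m in combined.finditer(sentence)]

# Split out: one clearly named pattern per rule.
a_not_after_s = re.compile(r"(?<!s)a")
any_t = re.compile(r"t")

split_positions = sorted(
    [m.start() for m in a_not_after_s.finditer(sentence)]
    + [m.start() for m in any_t.finditer(sentence)]
)

print(combined_positions)  # [5, 6, 10, 15, 20, 21, 23, 24]
print(split_positions == combined_positions)  # True
```

The two approaches find the same characters here. The split version is a little longer, but each pattern can be read and tested on its own.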

I hope that demystifies regular expressions for you. If you like this post, you can also follow me on Twitter and Medium.

