Home    


Regex Tutorial - A Cheatsheet with Examples

Regular expressions or commonly called as Regex or Regexp is technically a string (a combination of alphabets, numbers and special characters) of text which helps in extracting information from text by matching, searching and sorting.

It can also be used to replace text, regex define a search pattern which is used for find and find and replace in string operations.

If you want to learn Regex with Simple & Practical Examples, I will suggest you to see this simple and to the point Complete Regex Course with step by step approach & exercises. This video course teaches you the Logic and Philosophy of Regular Expressions from scratch to advanced level.


This is an online Regex tutorial for learning Regular expressions effectively and efficiently with examples and exercises. Regular expressions are very useful tool and they are not that much difficult to learn however on internet their are not that much ample resources to learn regex online. This is an attempt to explain regex and make them simple. So if you want to learn regex QUICK and EFFICIENT you are at the right place. Our methodology in this regex course is easy and step by step to understand the philosophy and logic. You can learn regex in just one day or a weekend. Just give it a try.


Regex are that much important that most of the programming languages like Python, Java, Javascript, PERL, PHP, Golang, C and C++ etc have regex engines to process regex. Regular expressions is a skill that is must for all programmers, network engineers, network administrators and all those who deal with data, who manage process store search and sort data. Regular expressions will make your work Fast, tidy and save you hundreds and thousands of hours of work. So if you are an IT related individual, not knowing Regex means you are missing a lot. Make a decision and get a skill for better job prospects and career advancement.


A simple regex can be /tree/ or even /a/ i.e the word tree or the letter a. Complex regex can take many forms. You don't have to worry as you will study from simple alphabets and numbers and move up the ladder step by step easy way learning complex forms of regular expressions.

A regex for matching the words    regular expression    regular expressions    and    regex    will be

reg(ex|ular\sexpressions?)


Lets look here in regex engine the full code with global multiline mode represented by gm


regexp pattern


I will again say, don't worry you will learn all this in just a day or two. You can see it is matching the specified three words only and leaving similar words. Beautiful, isn't it?


Why Learn Regular Expressions?


  • For parsing and extracting data from text. Like checking if user inputs a valid email id or card no.
  • Regex are fast and they also save your precious time.
  • They help you write effective code.
  • If you deal with any data, they help you to manage data efficiently making searching and sorting easier and picking up the right information from huge volumes of data.
  • They are every where.
  • They make you feel if you are a wizard or magician, doing work of hours and hours in minutes. 
  • Career Advancement and better job prospects
  • Regular Expressions are Fun! Learn REGEX and you will love it

Regex Engines:

A Regular expression engine is infact a computer program, which might run as indenpendent standalone application on your PC or Mac or run inside a web browser from a server or as offline browser extension and processes regular expressions for you. It will match data as suggested in regex. Mostly an engine work for one flavor of regex or a couple of flavors. These engines are not fully compatible and hence you need to know the subtle differences in your specific regex engine.

Here is a quick tutorial for you to learn regular expressions, you can see the details about each individual topic also.

Regex Tools and Environment:

You are going to use an online Regex tool which is available for free and is quite easy to use and very helpful in sorting out how the regex is working. Open your internet browser and type  regex101.com or simply write regex101 in google or any other search engine, the results of search will have this site. Simply click it and now you are there, ready to work. The working area will look like this

Regex tester

The first textfield is for writing regular expression and after that there is an area to write the Test string from which you are going to search for a match. For the time being that introduction is good enough. Later, I will tell you more about the regex101 engine and its characteristics.


Regular Expressions Syntax:

The first important thing is regular expression syntax. How to write a regex? what are the syntactical rules and how to follow them. Well there aren't many, in most regex engines the regex starts with a forward slash and ends with a forward slash, like javascript, Php Regex engine.

/regex/

In some regex engines like Python Re module the regex is encapsulated with inverted commas. 

r"regex"

However in most cases forward slash at start and end are used and this pattern is followed here.

Usually there are one or more modes applied to the regex. The syntax of mode is that after the second forward slash mode is written. like 

/regex/ mode

A more practical application is 

/tree/ gm

where g is global mode and m is multiline mode and tree is the text to match. 

And here is the explanation as shown in  regex engine. We will discuss modes later.

modes-explanation

Literal Characters

The simplest match is a literal character match. Literal means what it appears to be. 

a means alphabet  a

c means single character  c

tree means the word tree

A single literal character matches the first occurence of that character in the string in standard mode.

If the test string is I like driving car fast. The regex is going to match first occurence of a from left to right and that is a in car. This a is in the center of word car but it doesn't matter to regex engine. It will simply match any first occurence of a in the test string. However, if you don't want to match a in the middle of a word you can tell regex by setting boundaries. Then instead of      / a /   the regex will be   / \ba\b /    at this point this is just an illustration that you can also match single characters. Anchoring and boundaries are discussed in detail later and also as a seperate topic on this site.

The regex engine can find all the occurences of a instead of the first a. You can do so by changing the mode to global i.e adding g after the second forward slash like     / a / g

Metacharacters


Mostly English alphabets upper and lower case, numbers, underscore show literal behavior when used alone or as a combination of words meaningful or meaningless.


ABCDEFGHIJKLMNOPQRSTUVWXYZ

abcdefghijklmnopqrstuvwxyz

0123456789

But not all characters behave like a or b or tree etc. If you try to match a question mark ? or backslash \ or square bracket [ or plus sign + etc you will not be able to match them literally. Writing these symbols in regex field is going to raise a pattern error like this

Regex metacharacter

These symbols which don't have a literal match are called metacharacters as they have special meaning in regular expressions. Most of them raise a pattern error when used alone. Metacharacter means special characters. These special characters have special meanings in regex and without knowing their specific behavior you can not use them in your regex as it will generate error or will not give the expected results. Like period or dot is metacharacter, but if you use it as a single character in regex with global mode, it will not raise a pattern error but it will match everything, yes literally everything in your test string instead of matching a dot or period, though it will also match the dot or period but it will not leave anything matching all characters in your test string. This is an example of unexpected results. 

Regular Expressions Modes:


Modes in Regex play an important role. There are many modes however the important ones are discussed below.

1. First is standard mode and it means no mode. It has no symbol the syntax is 

/regex/

and it will match the first occurence of regex pattern.  

2. If you want to match all occurences of a single character or word you may use global mode represented by g. The syntax of global mode is 

/ a / g

using g for global mode after the second or closing forward slash results in matching all occurences of a instead of first occurence match.

3. Another important mode is case insensitive mode represented by i. Its syntax is 

/ Tree / i

As expected by its name, it results in a match which is case insensitive, i.e. it will match Tree as well as tree

4. Next are single line mode represented by s and multiline mode represented by m. Single line mode treats your string as one line instead of multiple lines and matches for your search pattern. In single line mode dot or period matches all including newline. Its syntax is 

/ tree / s

5. While a multiline mode will treat a string consisting of multiline and hence if you want to have matches at individual lines you may use multiline mode. A multiline mode anchors regex with ^ and $. Its syntax is 

/ tree / m

Escaping Metacharacters:


As discussed earlier metacharacters are special characters with special meanings and behavior. Some of the metacharacters are  ?    [    /    \    +    *. But question is in case you want to match a metacharacter then what to do, as writing a single metacharacter is usually going to raise a pattern error. Well the answer is very simple. We match a metacharacter by simply placing a backslash just before it. Using a backslash before these metacharacters make them literal characters and then you can simply match them in test string. Also by using a backslash before metacharacter makes the special behavior of that single character metacharacter go away.


At this point, I should tell another group of metacharacters which are called complex metacharacters or metacharacters of more than one character. Its kind of funny that these metacharacters are metacharacters when we use backslash before them. Some of the examples are \d    \w    \s    \D    \W    \S


Now these are metacharacters with special meanings like \d is used to match digits from 0 to 9. Now if we drop backslash before these metacharacters, they become literal characters. Like \d is a metacharacter dropping backslash it will be d a literal character. 


As discussed above \d is used to match numbers 0 to 9. \D is used to match anything but numbers 0 to 9.

\w matches word characters which include a to z upper and lower case, 0 to 9 and underscore.

\W matches anything but \w

\s matches whitespace like space, tab etc.

\S matches all but \s

Period or dot

First see how dot or period works in regex.

dot or period in regex tutorial

As you can see dot matches all characters, therefore it may be called as wildcard character as it matches all. 

Dot is the most commonly used metacharacter in regular expressions and it is also the most commonly misused metacharacter. Dot matches a single character at one time and if standard mode is used it will match only one character without looking what that character is as dot will match the first occurence from left to right and it will match the first character whatever it is. 

In global mode, it is going to match every character leaving line break. Yes, dot does not match line break by default. It is the only thing it does not match. If you want dot or period to match line breaks change mode to single line or s. In this mode dot is going to match line break also. Javascript and vbscript do not have this facility of dot matching all. However, in most of regex flavors line break can also be matched with dot in single line mode.

Non Printable characters:

There are certain characters which are non printable. Like tab, new line or line break, carriage return etc. For matching these non printable characters, you have to specify in regex about these characters. 

For matching a tab usually \t metacharacter is used. 

\f is used to match form feed

\n is for new line character

\r is for carriage return

\v is used in many flavors for a vertical tab. 

\a is used for bell

\e is to escape.

Moreover, unicode can also be used for non printable characters. In case the regex engine you are using supports unicode you may use \uFFFF for a unicode character. If your engine doesn't support unicode, you may use \xFF to match a specific character by using its base sixteen hexadecimal index in the character set. Every character which is non printable can be matched in regex or in or part of a set.

Character Class

With the help of a character class or character set you may direct regex to match only one character out of two or more characters. The syntax of character class is to put two or more characters inside square brackets

[class]

You may put any number of characters in a character class however your regex engine is going to match only one out of a given set of characters. 

Like you want to match     affect and effect

/ [ae]ffect / g

The regex will match both affect as well as effect. 

Similarly words with two different acceptable spellings can be matched by this

if you want to match  gray     or      grey

/ gr[ae]y / g

Some words are misspelled for example the word separate four possible forms are

separate

seperete

separete

seperate

to match any variant of separate the regex should be

/ sep[ae]r[ae]te / g

regex character class

Character ranges:

Instead of writing a large number of characters in a character class you may define ranges. Like instead of writing

[abcdefghij]ar

you can simply write 

[a-j]ar

and instead of writing 

[ABCDEFGHIJKLMNOPQRSTUVWXYZ] you may use  [A-Z] similarly with lower case letters from a to z you may use the character class [a-z].

For numbers you may use [0-9] instead of [0123456789]

Writing a series makes work a bit easy and code a bit tidy and enhances clarity. 

Negated Character Class:

After character class series you may use negated character class. A negated character class means don't match any character given in the set, or in other words match any character that is not in the character class. 

The syntax of a negated character class is like 

[^class]

using a caret just after the opening square bracket and then writing the character class. 

[^abc]ip 

will not match aip bip cip but it will match any character followed by ip

hence dip eip fip gip etc etc all will be matches

a[^b] does not mean a not followed by b, it means a followed by a character that is not b.

Metacharacters inside a character class:

Metacharacters have special meaning in regex and most raise a pattern error if used alone. In a character class there are certain metacharacters ] \ ^ - these four are metacharacters inside a character class with some special meanings, all other metacharacters are literal characters inside a character class. 

Quantifiers:

Quantifiers are used for repetition. If you want to repeat a certain character there are quantifiers available. For in detail topic of quantifiers and repetition please see detailed topic. 

If you want to match    flavor and flavour you may use 

/ flavou?r /

? as a metacharacter here means zero or 1 repetition. It simply looks either that particular character is present or not. It makes the character as optional, the regex will select if the character is there, and it will also match if the character is not in the test string. 

For zero or more repetition * is used. This metacharacter will match in case a character has no existence to any number of occurence. 

.* is going to select any number of characters. 

For one or more repetition + is used. + metacharacter will match only if the character or group before + occurs atleast once, however it will match any additional occurences.

\d+ will match one digit up to any digit number. \d is the shorthand character class for 0-9 and + means one or more repetition. \d+ is going to match all numbers in the text string. 

Another form of repetition is used with braces. There are three cases.

1. {fixed}

2. {min,max}

3. {min,}

The first case with fixed number of repetition. In this scenario use braces and with in braces write in digits the total number of repetitions required to match. For example if you want to match all four digit numbers simply write

/ \d{4} /

This regex will match 1234    4567    2589    5478

but it will not match 12    258    357    8 etc

In the second case if you want to match numbers with minimum of two digits and maximum of five digits then the regex will be

/ \d{2,5} /

This regex will match all numbers which are two digit, three digit, four digit and five digit and it will not match one digit or six digit numbers.

The last case is when you have minimum number of digits required in a number but you don't have information about the maximum number of digits. In that case {min,} is used.

This is a brief overview of repetition.

Anchors:

If you want to match a position instead of any character, use anchors. Some of the commonly used anchors are \b ^ and $. ^ matches the start of a string while $ matches the end of a string. If you have multiple lines then you may opt for multiline mode i.e m after the second forward slash.  / regex / m

Some of the examples

/ ^regex$ / mg

/ ^\d{4}$ / mg

If you want to match a word bounday, use \b. A word boundary is a character other than \w. \w is [a-zA-Z0-9_] now either one or more occurence of word character in a series will be matched until a non word character comes which is usually a space and that will be the boundary. \B matches every position where \b does not match. The word boundary is not always a space it can be any other character not included in \w shorthand character class.

To match with word boundary 

/ \bregex\b /mg

Alternation:

Alternation is used for or operation. It results in selection one option out of a number of available options. These options are seperated by | the pipe symbol. Any number of options can be written with in the alternation group. However an alternation group should contain atleast two options otherwise there is no used of alternation. 

/ I\slike\s(java|python) /g

It will match two strings

I like java

I like python

if I like is a complete match then after that if it is java or python in both cases it will be a match. 

some other examples can be 

/ (regex|regular expression|regular expressions) / g

/ (car|truck|bus|airplane|rocket) / g

The first expression will match either regex or regular expression or regular expressions while second example will match any word out of car truck bus airplane rocket.

Groups:

Groups help in applying different operations on a collection instead of a single character. Groups are created by putting parenthesis around a collection of tokens or symbols and later you may apply conditions on these groups. For example you may apply a quantifier to a whole group like + or * or even ? 

The whole match is referred to as group 0, if you create a single group it is called group 1. In case of more than one group in regex the regex engine names them as 1,2,3 starting from left to right.

/ ^(\d{3}) - (\d{2}) - (\d{5})$ / g

Here, 

group 0    (\d{3}) - (\d{2}) - (\d{5})

group 1    (\d{3})

group 2    (\d{2})

group 3      (\d{5})  

If you have to repeat a group instead of writing the regex for that simply write group number and a backslash before it.

/ ^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$ / g

This is also known as backreferencing. Backreferencing is quite helpful in many situations, it also makes your regex look less clumsy and enhances clarity. 

If you don't want to capture any group simply place ?: at the start of group regex engine will not number that group i.e. it will not capture that group and it is a non capturing group. An example is like 

(?:non-captured group)

Assertions & Lookaround:

There are certain instances when a certain word before or after match is required. This is called lookaround assertions. They are extremely helpful. Lookaround is of two types lookahead and lookbehind. Lookahead can be positive or negative. Similarly lookbehind can also be positive or negative. Lookaround is technically a group. The regex inside the parenthesis is matched as usuall, however after matching the regex engine looks if a certain word follows it or precedes it or follows and precedes it. In case of a positive reply the match is declared as successful.

First the syntax:

Positive lookahead           ?=

Negative lookahead          ?!

Positive lookbehind          ?<=

Negative lookbehind         ?<!

Positive lookahead is to check the presence of an element or character after the given character or group.

a(?=b) will match a in abc but will not match a in acb or bac

Negative look ahead is to see if a certain element does not follow the match

y(?!z) will match y in xyz but will not match y in zyx

Similarly for Positive and negative lookbehind

(?<=y)z will match z in xyz but will not match z in zyx

(?<!y)z will match z in yxz but will not match z in xyz




Features

  • Regex made Easy for all
  • High School and College
  • University level
  • Professionals doctors, engineers, scientists
  • Data Analysts