In this article you will learn about lookbehind assertions in regular expressions: their syntax and their positive and negative forms. Lookbehind assertions are sometimes thought to be difficult to understand and construct; however, if a few basic rules are followed, they are as simple as any other regular expression element or group. Lookaround is divided into lookbehind and lookahead assertions. Lookbehind checks what comes before your regex match, while lookahead checks what comes after it. The presence or absence of an element before or after the matched item determines whether a match is declared.
If you want to learn regex with plenty of examples and logic, I strongly suggest this simple, to-the-point Regex Course, which takes a step-by-step approach with practical examples and exercises. The video course teaches the logic and philosophy of regular expressions from beginner to advanced level.
Lookbehind, as the name suggests, checks what comes before the match. It tests a character, several characters, or a group immediately before the actual match and, based on the result, declares the match a success or a failure. Just like lookahead assertions, lookbehind assertions do not consume any characters: the text they examine does not become part of the match, and they return only a yes or a no. On this basis a decision is made. There are two types of lookbehind assertions:
i. Positive lookbehind
ii. Negative lookbehind
In positive lookbehind the regex engine looks for an element (a character, several characters, or a group) immediately before the matched item. If it finds that element before the match, it declares a successful match; otherwise it declares a failure.
The syntax for positive lookbehind is
/ (?<=element)match /
Here match is the item to match and element is the item or token that must lie immediately before it. The whole lookbehind expression is a group enclosed in parentheses. The structure starts with an opening parenthesis, immediately followed by a question mark, a less-than symbol, and an equals sign. Next comes the element that must exist before the actual match, then the closing parenthesis, and finally the item to match.
Now suppose you want to match an x that immediately follows a y. In other words, you want to match an x if and only if there is a y directly before it. The regex for this is
/ (?<=y)x /
This expression will match the x in calyx but not the x in caltex. Likewise, it will match the x in abyx, cyx, and dyx, but not the x in yzx, byzx, or ykx.
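As a quick check of the behaviour described above, here is a minimal sketch using Python's re module. The word list is taken from the examples in the text, and the variable names are only illustrative.

```python
import re

# Positive lookbehind: match "x" only when "y" sits immediately before it.
pattern = re.compile(r"(?<=y)x")

words = ["calyx", "abyx", "caltex", "yzx", "ykx"]

# Keep only the words in which the pattern finds a match.
matched = [w for w in words if pattern.search(w)]
print(matched)

# The lookbehind is zero-width: the matched text is just "x", not "yx".
print(pattern.search("calyx").group())
```

Note that the y is tested but never becomes part of the match, which is what "does not consume characters" means in practice.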
Now let's see how a regex engine handles a positive lookbehind.
Test string: Here is an example.
Regex: / (?<=r)e /
Keep in mind that the item to match is e. Because the pattern begins with a lookbehind, every candidate e must also pass the lookbehind test before it counts as a match. The engine scans the string from left to right. The first character is H, so there is no match. The next character is e, which is a candidate. From the opening parenthesis and the ?<= syntax the engine knows this is a lookbehind assertion, so it looks back at the character that precedes this e and tests whether it is r. It is H, so the answer is no; the engine rejects this e and moves on. The next character is r, which is not an e. The character after that is again an e, and this time the preceding character is r, so the lookbehind succeeds. The engine declares this e a match and exits. Strictly speaking, most backtracking engines evaluate the lookbehind at each position before they try to match e, but the outcome is the same.
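The walk-through above can be reproduced with a short sketch in Python. It lists every e in the test string together with the character before it, then lets the regex pick the one that passes the lookbehind. The helper list candidates is only an illustration and is not part of how the engine itself works.

```python
import re

text = "Here is an example."
pattern = re.compile(r"(?<=r)e")

# Every "e" in the text, paired with the character just before it.
candidates = [(i, text[i - 1]) for i, ch in enumerate(text) if ch == "e" and i > 0]
print(candidates)

# Only an "e" preceded by "r" survives the lookbehind.
match = pattern.search(text)
print(match.start(), repr(match.group()))

# Starting positions of all matches in the string.
all_starts = [m.start() for m in pattern.finditer(text)]
print(all_starts)
```

Of the four e characters in the string, only the one at index 3 (the second e of "Here") has an r directly before it.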
One more example. Let's suppose you have data about different currencies and you want to add up only the US dollar amounts (the digits, of course) and present the sum. The test string is
Now you want to match only those amounts that are in USD. The regex for that is
/ (?<=USD)\d+?,?\d+ /
Here the regex engine will search for a number with an optional comma. When it finds one, it checks whether USD immediately precedes it; if so, it declares that number a match. Many things can then be done with the matches: for example, all the USD amounts can be added up, and the other currencies can be summed in the same way.
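To make the idea of summing the USD amounts concrete, here is a small sketch in Python. The sample string is hypothetical (it is not the article's test data) and assumes the currency code sits directly against the number with no space in between.

```python
import re

# Hypothetical sample data, assumed for illustration only.
text = "USD1,200 EUR300 USD45 JPY9,000 USD2,500"

# Same pattern as in the article: digits with one optional comma,
# but only when "USD" comes immediately before them.
usd_pattern = re.compile(r"(?<=USD)\d+?,?\d+")

usd_amounts = usd_pattern.findall(text)

# Strip the thousands separator before converting to integers.
usd_total = sum(int(amount.replace(",", "")) for amount in usd_amounts)
print(usd_amounts, usd_total)
```

Two limits of this pattern are worth knowing: it needs at least two digits, so a single-digit amount such as USD5 would not match, and it allows at most one comma, so an amount like 1,000,000 would be matched only partially.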
In negative lookbehind the regex engine finds a candidate for the item and then looks back to test whether a given element sits immediately before it. If the element is found, the match fails; otherwise it succeeds. Put simply, if a particular item is found before a candidate match, it is not a match; in all other cases it is.
The syntax of a negative lookbehind is
/ (?<!element)match /
Here match is the item to match and element is the character, characters, or group that must not precede the match for it to be declared successful. So if you want to avoid matching a token when a certain other token precedes it, you can use negative lookbehind.
For example / (?<!x)y / will match the y in ay and by, but not the y in xy. In other words, it matches every y that does not have an x immediately before it.
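The same example can be checked with a brief Python sketch. The extra sample "y" is an addition for illustration; it shows that a y at the very start of the string also matches, because there is no x before it.

```python
import re

# Negative lookbehind: match "y" unless "x" sits immediately before it.
pattern = re.compile(r"(?<!x)y")

samples = ["ay", "by", "xy", "y"]

# Record for each sample whether the pattern finds a match.
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
```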
Let's see a more practical example of negative lookbehind. Say you want to match the amounts in every currency except Japanese yen; after matching, you may add up the amounts or do something else. The list is here.
Now you want to match all currencies but, for some reason, not the Japanese yen. The regex will be / (?<!JPY) \d+?,?\d+ /
This regex will match the amounts of all currencies except Japanese yen. In this way you can match an element while excluding a given condition.
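Here is a hedged sketch of the yen example in Python. The sample data is hypothetical and assumes a single space between the currency code and the amount, mirroring the space in the pattern above. The second half illustrates a pitfall that appears if that space is dropped; the extra lookbehind used to work around it is one possible fix, not the article's own pattern.

```python
import re

# Hypothetical sample data, assumed for illustration only.
text = "USD 1,200 EUR 300 JPY 9,000 GBP 75"

# The article's pattern: a space plus an amount, unless "JPY" precedes the space.
spaced = re.compile(r"(?<!JPY) \d+?,?\d+")
non_jpy = [m.strip() for m in spaced.findall(text)]
print(non_jpy)

# Pitfall: without the literal space, the engine can start in the middle
# of a yen amount, where the preceding characters are no longer "JPY".
unspaced = re.compile(r"(?<!JPY)\d+?,?\d+")
partial = unspaced.findall("JPY9,000")
print(partial)

# One possible fix (an assumption, not from the article): also forbid
# a digit or comma immediately before the match.
guarded = re.compile(r"(?<!JPY)(?<![\d,])\d+?,?\d+")
safe = guarded.findall("USD1,200 JPY9,000")
print(safe)
```

With the space in place, the match can only begin right after a currency code, so the negative lookbehind is always tested at the correct spot.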