Regular Expressions are also known simply as RegEx.
Regular expressions are one of those topics that scare most developers who haven’t taken the time to understand how they work.
Having a strong notion of RegEx is like having an interpreter inside your own head.
Sometimes it might take you a good amount of time to come up with the RegEx you want and the solution will be literally a single line of code.
Why and When RegEx?
The find() method of a string is good when you know exactly what you are looking for.
If you want to find the number 9 you simply do:
>>> text = "1 2 3 4 5 HEY 9"
>>> text.find("9")
14
So you know 9 is at index 14 (positions are counted from 0).
But what if you want to filter out everything that is not a number from a string?
Using find() to look for each number one at a time is impractical; it would be a lot of work.
So the task is: given "1 2 3 4 5 HEY 9", how do I return "1 2 3 4 5 9", excluding HEY or any other character that is not a number?
These kinds of tasks are very common in Data Science.
You see, raw data is usually very messy, and you have to clean it to make it usable.
Let’s see how Regular Expressions solve these kinds of tasks.
RegEx and the numbers problem
To use regular expressions you have to import the re module:
>>> import re
Let’s see how to solve the numbers task to get a feeling of RegEx.
>>> import re
>>>
>>> text = "1 2 3 4 5 HEY 9"
>>>
>>> only_numbers = re.findall(r"\d", text)
>>>
>>> print(only_numbers)
['1', '2', '3', '4', '5', '9']
YES! We did it!
Let’s understand how this magic happens.
We use the findall() function from the re module.
It takes two arguments: the first is what we are looking for, and the second is the text we are searching.
In regular expressions, we do not look for actual values; instead, we look for patterns.
The special sequence \d tells findall() to look only for digits, hence the ‘d’.
As a result, it returns only the digits contained in the string, filtering out the rest.
It is good practice to write patterns as raw strings, such as r"\d", so that Python passes the backslash to the re module unchanged.
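The task stated earlier asked for the string "1 2 3 4 5 9", while findall() gives back a list. One possible way to close that gap, as a small sketch, is to join the list with spaces (the variable names here are only illustrative):

```python
import re

text = "1 2 3 4 5 HEY 9"

# findall() returns a list with each matched digit...
digits = re.findall(r"\d", text)

# ...and str.join() glues the items back together with single spaces.
cleaned = " ".join(digits)

print(cleaned)  # 1 2 3 4 5 9
```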
The re module
The re module offers several functions; we will work with four of them:
- findall: returns a list with the actual values that matched your search
- split: splits the string at each match and returns the pieces as a list
- sub: replaces each match for your search with another value you provide
- search: returns a Match object if your search matched something in the string
findall
We already saw how findall() works.
It returns a list of the values that match your search.
This is the numbers example again with a wider variety of characters.
Notice that it doesn’t matter where the non-number characters are: findall() will filter them out and return only the numbers.
>>> import re
>>>
>>> text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
>>>
>>> only_numbers = re.findall(r"\d", text)
>>>
>>> print(only_numbers)
['1', '2', '3', '4', '5', '9']
If there is no match, it will return an empty list.
In this case, there are no numbers in "New York".
>>> import re
>>>
>>> text = "New York"
>>>
>>> only_numbers = re.findall(r"\d", text)
>>>
>>> print(only_numbers)
[]
split
The split() function finds every occurrence that matches your search and splits the string into pieces at those matches.
In the example below, where we match only digits, the result is a list with everything but the digits.
We have ‘;? / ‘ and then split() finds the number 1 and makes a split.
Between 1 and 2 there is only a single space, so the second piece is just that whitespace.
Then there is ‘% ’ before the number 3, which makes another split, and so on.
>>> import re
>>>
>>> text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
>>>
>>> split_string_every_number = re.split(r"\d", text)
>>>
>>> print(split_string_every_number)
[';? / ', ' ', '% ', ' & ', ' ', ' HEY ', ' ! $ Renan']
When there is no match, there is nothing to split, so split() simply returns the whole string as the only item of a list.
>>> import re
>>>
>>> text = "New York"
>>>
>>> split_string_every_number = re.split(r"\d", text)
>>>
>>> print(split_string_every_number)
['New York']
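The pieces returned by split() often carry leftover whitespace, and some contain nothing else. If that gets in the way, one option is to strip each piece and drop the empty ones. This is only a sketch of a common clean-up step, not something split() does for you:

```python
import re

text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"

pieces = re.split(r"\d", text)

# Strip the surrounding whitespace from each piece and
# keep only the pieces that still contain something.
non_empty = [piece.strip() for piece in pieces if piece.strip()]

print(non_empty)  # [';? /', '%', '&', 'HEY', '! $ Renan']
```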
sub
The sub() function looks for matches for your search, then replaces them with a value you provide.
Notice we have to pass three arguments: the regular expression, in this case \d, to tell sub() to match only digits as we already know; *, the value we are choosing to replace the matching digits with; and finally text, the variable containing the string being searched.
>>> import re
>>>
>>> text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
>>>
>>> text_with_subs = re.sub(r"\d", "*", text)
>>>
>>> print(text_with_subs)
;? / * *% * & * * HEY * ! $ Renan
Notice that every single number was replaced with an asterisk (*).
You can also limit how many matches are replaced with the count argument, the fourth parameter of sub().
Here we are telling it to replace only the first three matches.
Passing count by keyword keeps the call readable, and recent Python versions deprecate passing it positionally.
>>> import re
>>>
>>> text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
>>>
>>> text_with_subs = re.sub(r"\d", "*", text, count=3)
>>>
>>> print(text_with_subs)
;? / * *% * & 4 5 HEY 9 ! $ Renan
Notice how the numbers 1, 2 and 3 were replaced by the *, but not the other numbers, since we specified that only the first three matches were to be replaced.
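The replacement does not have to be a fixed string. sub() also accepts a function that receives each match and returns the text to put in its place. The sketch below doubles every digit; the helper name double_digit and the sample text are made up for illustration, and match.group() (covered in the search section) returns the matched text:

```python
import re

text = "1 2 3 & 9"

def double_digit(match):
    # match.group() is the matched text, for example '3'.
    return str(int(match.group()) * 2)

# sub() calls double_digit once per match and inserts its return value.
doubled = re.sub(r"\d", double_digit, text)

print(doubled)  # 2 4 6 & 18
```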
search
The search() function returns a Match object if your search matched something in the string; otherwise, it returns None.
There are no numbers in "New York", so it returns None.
>>> import re
>>>
>>> text = "New York"
>>>
>>> match_object = re.search(r"\d", text)
>>>
>>> print(match_object)
None
When there is a match, a Match object is returned.
If there are several matches, only the first one is returned, in this case the number 1.
>>> import re
>>>
>>> text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
>>>
>>> match_object = re.search(r"\d", text)
>>>
>>> print(match_object)
<re.Match object; span=(5, 6), match='1'>
But what do you do with a Match object?
The Match object has a few methods to work with.
Considering we already have the result of the previous match in the variable match_object, let’s start from there.
The span() method gives you the position of the match: here, 1 starts at position 5 and ends before position 6.
>>> match_object.span()
(5, 6)
The group() method returns the match itself.
>>> match_object.group()
'1'
The string and re attributes give you the original string and the regular expression you used.
>>> match_object.re
re.compile('\\d')
>>> match_object.string
';? / 1 2% 3 & 4 5 HEY 9 ! $ Renan'
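Because span() follows the same start/end convention as slicing, the two numbers can be used directly to cut the original string around the match. A short sketch, assuming the same text as above (start() and end() are the Match methods that return each half of the span):

```python
import re

text = ";? / 1 2% 3 & 4 5 HEY 9 ! $ Renan"
match_object = re.search(r"\d", text)

start, end = match_object.span()

# start() and end() return the same two numbers individually.
same_numbers = (match_object.start(), match_object.end()) == (start, end)

before = text[:start]      # everything before the match
matched = text[start:end]  # the match itself, same as group()
after = text[end:]         # everything after the match

print(repr(before), repr(matched), repr(after))
```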
So, why would you use search() instead of findall() if the latter gives you all the matches and not just the first one?
The answer is performance.
In many situations you don’t need every occurrence that matches your search; sometimes you just need to know that there is at least one match.
search() is perfect for that, and it also gives you the position of the match, not just the value that was matched.
Since findall() has to scan the whole string and build a list of every match, it generally costs more time and memory, so use it only if you really need every occurrence.
Our strings are short, so the difference is negligible in these examples, but consider when to use findall() and when to use search() when coding something in a real project.
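The "is there at least one match?" case can be sketched as a tiny helper. The name has_digit is hypothetical; the only thing it relies on is that search() returns None when nothing matches:

```python
import re

def has_digit(text):
    """Return True if text contains at least one digit (illustrative helper)."""
    return re.search(r"\d", text) is not None

print(has_digit("New York"))  # False
print(has_digit("Route 66"))  # True
```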
In the next section, we are going to use only findall() to demonstrate the examples, simply because it is visually easier to understand.
Other special characters and sequences
There are a number of special characters and sequences to work with, just like the \d we have used so far.
Let’s see a list of them, what they mean, and then apply the most important ones in some examples in this section.
\D is the opposite of \d: it matches everything that is not a digit.
>>> import re
>>>
>>> text = "1 2 3 & 4 5 HEY 9 Renan"
>>>
>>> matches = re.findall(r"\D", text)
>>>
>>> print(matches)
[' ', ' ', ' ', '&', ' ', ' ', ' ', 'H', 'E', 'Y', ' ', ' ', 'R', 'e', 'n', 'a', 'n']
\w matches word characters, that is, letters from a to z and A to Z, digits from 0 to 9, and the underscore _.
Symbols like ‘@’ and ‘%’ won’t be matched.
>>> import re
>>>
>>> text = "1@ 2! 3% & 4 5 *HEY 9 Renan-+"
>>>
>>> matches = re.findall(r"\w", text)
>>>
>>> print(matches)
['1', '2', '3', '4', '5', 'H', 'E', 'Y', '9', 'R', 'e', 'n', 'a', 'n']
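The example above has no underscore in it, so here is a small additional sketch, with a made-up string, showing that \w keeps the underscore while the hyphen and the space are skipped:

```python
import re

text = "a_b-c 1"

# '_' counts as a word character; '-' and ' ' do not.
matches = re.findall(r"\w", text)

print(matches)  # ['a', '_', 'b', 'c', '1']
```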
You can use [] to look for a certain range of characters.
Here we want only the lowercase letters from ‘a’ to ‘q’.
>>> import re
>>>
>>> text = "New York"
>>>
>>> matches = re.findall("[a-q]", text)
>>>
>>> print(matches)
['e', 'o', 'k']
You can also match only uppercase.
>>> import re
>>>
>>> text = "New York"
>>>
>>> matches = re.findall("[A-Z]", text)
>>>
>>> print(matches)
['N', 'Y']
Now we want only the digits from 2 to 6.
>>> import re
>>>
>>> text = "102040424532191000232323"
>>>
>>> matches = re.findall("[2-6]", text)
>>>
>>> print(matches)
['2', '4', '4', '2', '4', '5', '3', '2', '2', '3', '2', '3', '2', '3']
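Several ranges can share a single pair of brackets, in which case a character matches if it falls in any of them. A brief sketch with an illustrative string:

```python
import re

text = "New York 2024"

# Lowercase and uppercase letters in one character class.
letters = re.findall(r"[a-zA-Z]", text)

# Adding a third range also lets the digits through.
letters_and_digits = re.findall(r"[a-zA-Z0-9]", text)

print(letters)             # ['N', 'e', 'w', 'Y', 'o', 'r', 'k']
print(letters_and_digits)  # ['N', 'e', 'w', 'Y', 'o', 'r', 'k', '2', '0', '2', '4']
```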
You can use the ^ symbol to match the beginning of a string.
When you match "^xbox", you are saying "check if the string begins with xbox".
If it does, findall() returns the word you are looking for; if it doesn’t, it returns an empty list.
>>> import re
>>>
>>> text = "xbox is the best console"
>>>
>>> matches = re.findall("^xbox", text)
>>>
>>> print(matches)
['xbox']
Similarly, you can use the $ symbol to match the end of a string.
With "station$" you are saying "check if the string ends with ‘station’".
>>> import re
>>>
>>> text = "I prefer playstation"
>>>
>>> matches = re.findall("station$", text)
>>>
>>> print(matches)
['station']
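Both examples above show the successful case. For contrast, here is a sketch of what happens when the text is present but not at the anchored position, reusing the same sentence:

```python
import re

text = "I prefer playstation"

# 'prefer' is in the string, but not at the beginning, so ^ rejects it.
at_start = re.findall(r"^prefer", text)

# 'prefer' is not at the end either, so $ rejects it too.
at_end = re.findall(r"prefer$", text)

# Without an anchor, the same word is found.
anywhere = re.findall(r"prefer", text)

print(at_start, at_end, anywhere)  # [] [] ['prefer']
```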
The * symbol matches zero or more occurrences of the character before it.
When matching "go*", you are saying "match a ‘g’ followed by any number of ‘o’ characters, including none".
>>> import re
>>>
>>> text = "hey ho, let's gooooo"
>>>
>>> matches = re.findall("go*", text)
>>>
>>> print(matches)
['gooooo']
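The "zero" in "zero or more" is easy to overlook: a ‘g’ with no ‘o’ after it still matches. A short sketch with a made-up sentence shows both cases:

```python
import re

text = "good game, gg"

# 'goo' comes from "good"; every other 'g' matches with zero 'o' characters.
matches = re.findall(r"go*", text)

print(matches)  # ['goo', 'g', 'g', 'g']
```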
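You can combine these to make more complex matches.
Say you want every pair of digits whose first digit is between 2 and 5 and whose second digit is between 3 and 9.
Simply use [] twice: the first pair of brackets asks for a digit from 2 to 5, and the second pair for a digit from 3 to 9.
Keep in mind that this is a digit-by-digit match, not a numeric range: ‘49’ matches, ‘32’ does not, and ‘23’ is found inside 323.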
>>> import re
>>>
>>> text = "10 21 32 1000 100 323 34 22 49 27 28"
>>>
>>> matches = re.findall("[2-5][3-9]", text)
>>>
>>> print(matches)
['23', '34', '49', '27', '28']
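If what you actually need is whole numbers within a numeric range, say 25 to 39, the pattern has to spell out the range digit by digit and make sure it is not matching part of a longer number. The sketch below is one possible way to do it. It uses three constructs not covered in this article: | for alternatives, (?:...) for grouping, and \b for a word boundary.

```python
import re

text = "10 21 32 1000 100 323 34 22 49 27 28"

# 25-29 is "2 followed by 5-9"; 30-39 is "3 followed by any digit".
# \b on both sides keeps the pattern from matching inside 323 or 1000.
pattern = r"\b(?:2[5-9]|3[0-9])\b"

matches = re.findall(pattern, text)

print(matches)  # ['32', '34', '27', '28']
```

For wider or less regular ranges, it may be simpler to extract the numbers first and then compare them as integers.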
Regular expressions can go far and allow for a practically endless number of combinations.
You don’t need to remember all these symbols or combinations: whenever you need to do something with regex, just search for "regex cheat sheet" and you will find many resources with combinations for you to try.
The purpose of this article was to introduce you to the topic, so you know how to use it if you ever need it and what to look for.
Removing ALL the whitespaces
To finish this article, I want to suggest a nice trick you can achieve with regular expressions.
Check the article How to remove all white spaces in a string in Python.
Conclusion
Regular expressions are no easy topic.
You can find whole books dedicated solely to them, but I hope this was enough to demystify what they are and what they do, in a way that lets you explore further for your specific needs.