Introduction

A regular expression (RegEx) is a sequence of characters that forms a search pattern. In Python, regular expressions are used to find a pattern in a text string, which makes it possible to find, replace and reformat data. Python provides the built-in re module for working with regular expressions.

Regular expressions are useful in areas such as:

1. Data analytics 

2. Web scraping 

3. User input validation 

4. Email verification

5. Text editors and many more 

 

Implementation

Before regular expressions, pattern matching and searching had to be coded by hand, which was tedious and error-prone. Regular expressions have made these tasks much simpler.

To see how, let’s take an example that checks whether a phone number is valid.

Without regular expressions, the check could be written like this:

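def isValidNumber(text):
    # Expected format: '+', 3-digit country code, a space, 10 digits (15 characters)
    if len(text) != 15:
        return False

    if text[0] != '+':
        return False

    # Characters 1-3 must be the digits of the country code
    for i in range(1, 4):
        if not text[i].isdecimal():
            return False

    if text[4] != ' ':
        return False

    # Characters 5-14 must be the 10 digits of the phone number
    for i in range(5, 15):
        if not text[i].isdecimal():
            return False

    return True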

text1 = '+977 9800000000'

if isValidNumber(text1):
    print('Valid number')

Output

Valid number

We have used a mobile number from Nepal for this check. The country code for Nepal is +977, and the phone number that follows has 10 digits. We assume that the country code and the 10-digit phone number are separated by a single space.

Explanation of the verification steps

First, the code checks whether the supplied text (the mobile number, as a string) is exactly 15 characters long; if not, it returns False. Then it checks whether the text begins with “+”; if not, it returns False. Next, it checks whether the three characters after “+” are decimal digits; if not, it returns False. Then it checks whether the following character is a space; if not, it returns False. Finally, it checks whether all the remaining characters are decimal digits; if not, it returns False. If every check passes, it returns True.

Now, let’s see how the same phone number verification can be done with a regular expression:

import re

text = '+977 9800000000'
phReg = re.compile(r'\+\d\d\d \d\d\d\d\d\d\d\d\d\d')

if phReg.search(text) is None:
    print('Invalid number')
else:
    print('Valid number')

With just a few lines of code, the result is the same. We created a search pattern using the compile() function of the re module. Here, \d represents a decimal digit (0-9). The pattern looks for a “+” followed by three digits, a space, and ten digits. Note that search() looks for the pattern anywhere in the string, so unlike the manual check it would also accept a longer string that merely contains a valid number.
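
The pattern above can also be written more compactly with the {m} repetition syntax covered later in this tutorial. The sketch below additionally uses fullmatch(), so that the whole string has to fit the pattern, as in the manual check. It is only an illustration; the names PHONE_RE and is_valid_number are not part of the original example.

```python
import re

# '+', three digits, a space, then ten digits
PHONE_RE = re.compile(r'\+\d{3} \d{10}')

def is_valid_number(text):
    # fullmatch() succeeds only if the entire string fits the pattern
    return PHONE_RE.fullmatch(text) is not None

print(is_valid_number('+977 9800000000'))       # True
print(is_valid_number('+977 9800000000123'))    # False: extra digits at the end
print(is_valid_number('call +977 9800000000'))  # False: extra text at the start
```

With search() instead of fullmatch(), the last two strings would count as valid, because they contain a valid number somewhere inside them.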

 

RegEx functions

Some of the functions that help us search for a pattern in text are as follows:

1) findall()

It returns a list containing all the matches.

import re

text = 'my mobile number is 977-980000000.My telephone number is 076-678567'

phReg = re.compile(r'\d\d\d-\d+')
contact_no = phReg.findall(text)

print(contact_no)

Output

['977-980000000', '076-678567']

 

2) search()

Returns a match object for the first match in the string, or None if there is no match.

import re

text = 'My mobile number is 977-9800000000 and telephone number is 076-000000'

reg = re.compile(r'\d\d\d-\d+')
contact_no = reg.search(text)

print(contact_no.group())

Output

977-9800000000

Here we got only the mobile number and not the telephone number, because search() returns only the first match. If no match is found, it returns None.
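
Because search() can return None, calling group() directly on the result raises an AttributeError when nothing matches. A minimal sketch of a safer way to use it follows; the helper name first_number is made up for this illustration.

```python
import re

reg = re.compile(r'\d\d\d-\d+')

def first_number(text):
    mo = reg.search(text)
    # search() returns None when nothing matches, so check before calling group()
    if mo is None:
        return None
    return mo.group()

print(first_number('Call 076-000000 today'))  # 076-000000
print(first_number('No number here'))         # None
```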

 

3) split()

The split() function returns a list in which the string has been split at each match.

import re 

text = 'My mobile number is 977-9800000000'

reg = re.split(r'\s', text)

print(reg)

Output

['My', 'mobile', 'number', 'is', '977-9800000000']
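
One detail worth knowing: the pattern \s matches a single whitespace character, so two spaces in a row produce an empty string in the result. Using \s+ (one or more whitespace characters, see the Plus metacharacter below) treats a run of spaces as one separator. A small sketch with a made-up input string:

```python
import re

# Note the two spaces between 'My' and 'mobile'
text = 'My  mobile number'

single = re.split(r'\s', text)
run = re.split(r'\s+', text)

print(single)  # ['My', '', 'mobile', 'number']
print(run)     # ['My', 'mobile', 'number']
```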

 

4) sub()

The sub() function replaces every match of the pattern with the given replacement string.

import re 

text = 'My mobile number is 977-9800000000'

reg = re.sub(r'\s', '-', text)

print(reg)

Output

My-mobile-number-is-977-9800000000

From the above output, we can see that the spaces in the text were replaced by "-".

We can also pass the count parameter to limit the number of replacements.

import re 

text = 'My mobile number is 977-9800000000'

reg = re.sub(r'\s', '-', text, count=1)

print(reg)

Output

My-mobile number is 977-9800000000

Only the first space is replaced by "-", while the other spaces remain the same.

 

Metacharacters 

Metacharacters are characters with a special meaning. Some of the metacharacters used in regular expressions are:

1) Dot(.)

Matches any character except a newline

import re

text = 'I am from Nepal'

match = re.findall('Ne..l', text)

print(match)

Output

['Nepal']
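
The "except a newline" part can be seen with a string that contains a line break. In the sketch below (the input string is made up), the dot fails to match the newline unless the re.DOTALL flag is passed:

```python
import re

text = 'Ne\nal'

without_flag = re.findall(r'Ne.al', text)
with_flag = re.findall(r'Ne.al', text, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['Ne\nal']
```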

 

2) Caret(^)

Matches the start of the string 

import re

text = 'I am from Nepal'

match = re.search('^I', text)

print(match)
print(match.group())

Output

<re.Match object; span=(0, 1), match='I'>
I

Here, we got “I” as output because the string starts with “I”.

Let's see another example:

import re 

text = 'I am from Nepal' 

match = re.search('^am', text) 

print(match)

Output

None

Here, we got None as output because the string doesn’t start with "am".

 

3) Dollar($)

Matches the end of the string

import re 

text = 'I am from Nepal' 

match = re.search('Nepal$', text) 

print(match) 

Output

<re.Match object; span=(10, 15), match='Nepal'>

Here, the output shows that the string ends with “Nepal”.

Another example:

import re 

text = 'I am from Nepal' 

match = re.search('Nep$', text) 

print(match) 

Output

None

Here, the None output shows that the string doesn’t end with “Nep”.

 

4) Question mark(?)

It makes the preceding regular expression match 0 or 1 times

import re

text = "A superwoman is there"

reg = re.compile(r'super(wo)?man')
match = reg.search(text)

print(match)
print(match.group())

Output

<re.Match object; span=(2, 12), match='superwoman'>
superwoman

The (wo)? part means that the (wo) group can appear zero times or once in the string. Here, the (wo) group appears once in ‘text’, so we got a match. Also,

import re

text = "a superman is there"

reg = re.compile(r'super(wo)?man')
match = reg.search(text)

print(match)
print(match.group())

Output

<re.Match object; span=(2, 10), match='superman'>
superman

Here, the (wo) group is completely absent. ‘?’ allows zero or one repetition, which is why we got the match.

 

import re

text = "there is a superwowowoman"

reg = re.compile(r'super(wo)?man')
mo = reg.search(text)

print(mo)

Output

None

Since the (wo) group appears three times, which is more than ‘?’ allows, there is no match. So we got None as output.

 

5) Asterisk(*)

Makes the preceding regular expression match 0 or more times

import re 

reg = re.compile(r'super(wo)*man')
mo = reg.search('There is a superman') 

print(mo) 
print(mo.group()) 

Output

<re.Match object; span=(11, 19), match='superman'>
superman

The (wo)* part means that the (wo) group can appear 0 or more times. The (wo) group is absent here, and we still got the match.

import re

reg = re.compile(r'super(wo)*man')
mo = reg.search('There is a superwowoman')

print(mo)
print(mo.group())

Output

<re.Match object; span=(11, 23), match='superwowoman'>
superwowoman

Since (wo)* allows any number of repetitions, the two (wo) groups here still give a match.

 

6) Plus(+)

Makes the preceding regular expression match 1 or more times

import re 

reg = re.compile(r'super(wo)+man') 
mo = reg.search('There is a superman') 

print(mo) 

Output

None

The (wo)+ part means that the (wo) group must appear 1 or more times in the string. Since the string has no (wo) group, we didn’t get a match.

 

import re 

reg = re.compile(r'super(wo)+man') 
mo = reg.search('There is a superwowowoman') 

print(mo) 

Output

<re.Match object; span=(11, 25), match='superwowowoman'>

The (wo) group is repeated three times, which satisfies “1 or more”, so we got a match.

 

7) Curly braces ({m})

Specifies that exactly m copies of the preceding regular expression should be matched

import re 

reg = re.compile(r'super(wo){3}man') 
mo = reg.search('There is a superwowowoman') 

print(mo)

Output

<re.Match object; span=(11, 25), match='superwowowoman'>

The (wo) group is repeated exactly 3 times in the string, so we got the match.

 

import re 

reg = re.compile(r'super(wo){2}man') 
mo = reg.search('There is a superwowowoman') 

print(mo) 

Output

None

The (wo){2} part means that the (wo) group must be repeated exactly 2 times before “man”, but the string has 3 repetitions, so no match could be found.

 

8) Curly braces ({m,n})

Makes the preceding regular expression match from m to n times

import re 

reg = re.compile(r'(wo){2,4}') 
mo = reg.search('wowowowowowo') 

print(mo) 

Output

<re.Match object; span=(0, 8), match='wowowowo'>

Here, the match contains four (wo) groups. The pattern could have matched two, three or four (wo) groups, but it took four. This is called greedy matching: by default, regular expressions in Python match as many repetitions as possible.

For non-greedy matching, add a ? after the closing curly brace:

import re 

reg = re.compile(r'(wo){2,4}?') 
mo = reg.search('wowowowowo') 

print(mo) 

Output

<re.Match object; span=(0, 4), match='wowo'>

Here, the match contains only two (wo) groups.

 

9) Pipe(|)

Matches either the expression on its left or the expression on its right

import re

text = 'Nepal is a beautiful country and people in this country speaks Nepali'

reg = re.compile(r'Nep(al|ali)')
mo = reg.search(text)

print(mo.group())

Output

Nepal

The regular expression matches the (al) alternative, giving “Nepal” as output. Alternatives are tried from left to right, so (al) is tried before (ali).
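
The order of the alternatives matters. With Nep(al|ali), the (ali) branch can never win, because (al) already succeeds on the same text. The sketch below, which uses a shortened made-up sentence, shows that putting the longer alternative first lets the pattern match “Nepali”:

```python
import re

text = 'People in this country speak Nepali'

# 'al' is tried first and succeeds, so 'ali' is never used
short_first = re.search(r'Nep(al|ali)', text).group()

# 'ali' is tried first, so the longer word is matched
long_first = re.search(r'Nep(ali|al)', text).group()

print(short_first)  # Nepal
print(long_first)   # Nepali
```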

 

10) Square bracket ([ ])

It is used to define a set of characters

import re

text = 'my name is Arnold and i\'m 108'

match = re.findall(r'[0-9]', text)

print(match)

Output

['1', '0', '8']

Here, the set [0-9] matches any single digit from 0 to 9, so each digit is returned as a separate match. Also,

import re

text = 'my name is Arnold and i\'m 108'

match = re.findall(r'[^0-9]', text) 

print(match)

Output 

['m', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'A', 'r', 'n', 'o', 'l', 'd', ' ', 'a', 'n', 'd', ' ', 'i', "'", 'm', ' ']

Placing the ^ symbol at the start of the square brackets matches every character that is not in the set.

 

Character classes

Character classes are shorthand notations that shorten a regular expression. For example, [0-9] matches any digit from 0 to 9, and \d matches those same digits.

\d – any digit from 0 to 9

\D – any character except a digit

\w – any letter, digit or underscore

\W – any character that is not a letter, digit or underscore

\s – any whitespace character, such as a space, tab or newline

\S – any character other than whitespace
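
As a quick illustration of the classes listed above, the following sketch runs a few of them over a made-up string containing letters, digits, an underscore, spaces and a tab:

```python
import re

text = 'user_1 paid 250 on\tMonday'

digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
spaces = re.findall(r'\s', text)
non_word = re.findall(r'\W', text)

print(digits)    # ['1', '2', '5', '0']
print(words)     # ['user_1', 'paid', '250', 'on', 'Monday']
print(spaces)    # [' ', ' ', ' ', '\t']
print(non_word)  # [' ', ' ', ' ', '\t']
```

In this string the only non-word characters are whitespace, which is why \s and \W give the same result.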

Now, using these character classes, let’s extract the phone numbers from a text in the example below.

 

import re

text = 'My mobile number is 977-9867000009. My another mobile number is 977-9800000000. Also you can try my telephone: 076-000000'

reg = re.compile(r'\d\d\d-\d+')
mo = reg.findall(text)

print(mo)

Output

['977-9867000009', '977-9800000000', '076-000000']

First of all, note that the mobile numbers follow a pattern of 3 digits, a dash (-) and 10 digits, while the telephone number follows a pattern of 3 digits, a dash (-) and 6 digits.

We then create a regex object using the compile() function. It searches for three digits (each denoted by \d), then a dash (-), then \d+. Here, \d+ matches one or more digits, as explained in the previous section.

Let’s say there is a text “Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09”. If we want to extract each name and birth date, we can do it easily using a regular expression.

import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'

reg = re.compile(r'\w+\s\d{4}-\d{2}-\d{2}')
mo = reg.findall(text)

print(mo)

Output

['Ramesh 2056-08-07', 'Suresh 2054-06-09']

Here, we created a pattern that looks for one or more word characters (\w+), followed by a space (\s), four digits for the year (\d{4}), a dash (-), two digits for the month (\d{2}), another dash (-) and two digits for the day (\d{2}).

\d{4} and \d{2} match a digit repeated exactly 4 times and exactly 2 times respectively.
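
If the name and the date are needed as separate values, each part of the pattern can be wrapped in parentheses. When a pattern contains more than one group, findall() returns a list of tuples with one item per group. A minimal sketch based on the example above:

```python
import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'

# Group 1 captures the name, group 2 captures the date
reg = re.compile(r'(\w+)\s(\d{4}-\d{2}-\d{2})')
pairs = reg.findall(text)

print(pairs)  # [('Ramesh', '2056-08-07'), ('Suresh', '2054-06-09')]

for name, date in pairs:
    print(name, 'was born on', date)
```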

 

Strong password detection with regular expression 

Let’s assume that a strong password is one that is at least 8 characters long and contains at least one uppercase letter, at least one lowercase letter and at least one digit.

import re 

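def isStrongPassword(text):
    # At least 8 characters long
    if len(text) < 8:
        return False

    # At least one uppercase letter
    if re.search(r'[A-Z]', text) is None:
        return False

    # At least one lowercase letter
    if re.search(r'[a-z]', text) is None:
        return False

    # At least one digit
    if re.search(r'[0-9]', text) is None:
        return False

    return True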

pw = input('Enter password : ')

if isStrongPassword(pw): 
    print('Password is strong') 
else: 
    print('Password is not strong') 

Output

Enter password : AbGc6 

Password is not strong 

 
Enter password : abcdefgh1234 

Password is not strong 


Enter password : ABCdejfkdmdksisjsmmzmzmz 

Password is not strong 


Enter password : Abcedfghxyz1290 

Password is strong 
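
The same rules can also be kept in a list, so that adding a rule only means adding a pattern. The sketch below is one possible variation rather than a replacement for the code above; the names REQUIRED_PATTERNS and is_strong_password are invented for it, and it checks the four sample passwords directly instead of reading input.

```python
import re

# Each pattern is one rule the password must satisfy
REQUIRED_PATTERNS = [r'[A-Z]', r'[a-z]', r'[0-9]']

def is_strong_password(text, min_length=8):
    if len(text) < min_length:
        return False
    # Every rule must find at least one match in the password
    return all(re.search(p, text) for p in REQUIRED_PATTERNS)

samples = ['AbGc6', 'abcdefgh1234', 'ABCdejfkdmdksisjsmmzmzmz', 'Abcedfghxyz1290']
for pw in samples:
    print(pw, '->', 'strong' if is_strong_password(pw) else 'not strong')
```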

 

Conclusion

Regular expressions have simplified the searching and matching tasks that come up in everyday programming. Without them, pattern matching and searching for strings in text files is tedious work.

With regular expressions, these tasks take little effort and save time. The re module provides functions, metacharacters and character classes that make matching and searching simple, so regular expressions are well worth learning.

If you have any questions about this tutorial, please leave a comment below and I will respond as soon as I can. Mastering regular expressions will put you well ahead as a programmer.

 

Happy Coding :)