Python RegEx

Python provides full support for regular expressions (RegEx), a specialized syntax for describing search patterns. With a pattern you can check whether a string matches, or find other strings or sets of strings inside it. Regular expressions are widely used in the UNIX world, and Python supports them fully through the re module.

By the end of this page, you’ll be familiar with the important functions for handling Python regular expressions.

You’ll be learning the following:

  • What is Python RegEx?
  • Metacharacters
  • Special Sequences
  • Python RegEx Module
  • Match object
  • Getting the Index of matched object

After that, you’ll be well grounded in the common operations with Python RegEx.

What is Python RegEx

Simply stated, a Python RegEx is a sequence of characters that forms a search pattern. A RegEx can be used to check whether a string contains the specified pattern. Python supports regular expressions through the built-in re module, so to use RegEx in Python you first have to import re.

The table below shows which strings the pattern ^a...s$ matches. The pattern describes a five-character string that starts with a and ends with s.

Expression    String       Matched?
^a...s$       abs          No match
^a...s$       alias        Match
^a...s$       abyss        Match
^a...s$       Alias        No match
^a...s$       An abacus    No match
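To check the table by hand, the sketch below runs the same pattern against each string with re.search(). It assumes the pattern is written with three plain dots (^a...s$); the names pattern, candidates, and results are made up for this illustration.

```python
import re

# Assumed pattern: "a", any three characters, then "s", anchored at both ends.
pattern = r"^a...s$"
candidates = ["abs", "alias", "abyss", "Alias", "An abacus"]

# re.search() returns a match object on success and None otherwise.
results = {text: re.search(pattern, text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

Only alias and abyss should be reported as matches, since they are the only five-character strings here that start with a lowercase a and end with s.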

Example:

Python Input

import re
s = 'Simon: Phenomenal Academy'
# Search for the substring 'Academy' in s
match = re.search(r'Academy', s)
print('Start Index:', match.start())
print('End Index:', match.end())

Output

Start Index: 18
End Index: 25

The code above prints the starting index and the ending index of the substring Academy.

Metacharacters

To understand regular expressions, you first need to understand metacharacters. The table below lists the metacharacters and their descriptions.

Metacharacter    Description
\                Drops the special meaning of the character that follows it
[]               Represents a character class
^                Matches the beginning of the string
$                Matches the end of the string
.                Matches any character except the newline
|                Means OR (matches either of the patterns separated by it)
?                Matches zero or one occurrence
*                Matches any number of occurrences (including zero)
+                Matches one or more occurrences
{}               Indicates the number of occurrences of the preceding regex to match
()               Encloses a group of regex

\ – Backslash

The backslash (\) escapes a metacharacter so that it is treated as an ordinary character. For example, the dot (.) is a metacharacter, so searching for . matches any character rather than a literal dot. To search for an actual dot, put a backslash just before it (\.), which makes it lose its special meaning.

Consider the example below:

Python Input

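import re
s = 'Mr.phenomenal'
# without using \ : the dot matches any character, here 'M'
match = re.search(r'.', s)
print(match)
# using \ : only the literal dot matches
match = re.search(r'\.', s)
print(match)

Output:

<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(2, 3), match='.'>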

[] – Square Brackets

Square brackets ([]) represent a character class, a set of characters of which any one may match. For example, the character class [123] matches any single character 1, 2, or 3.

You can also specify a range of characters inside the square brackets using a hyphen (-).

[0-3] is the same as [0123]

[a-c] is the same as [abc]

You can also invert the character class with the caret (^) symbol placed right after the opening bracket.

For example,

[^0-3] matches any character except 0, 1, 2, or 3

[^a-c] matches any character except a, b, or c
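As a small sketch of these character classes, the snippet below uses re.findall() to collect every character that each class matches. The sample string and variable names are invented for the illustration.

```python
import re

text = "a1b2c3d4"  # made-up sample string

digits_0_to_3 = re.findall(r"[0-3]", text)   # any one of 0, 1, 2, 3
letters_a_to_c = re.findall(r"[a-c]", text)  # any one of a, b, c
not_0_to_3 = re.findall(r"[^0-3]", text)     # any character except 0-3

print(digits_0_to_3)   # ['1', '2', '3']
print(letters_a_to_c)  # ['a', 'b', 'c']
print(not_0_to_3)      # ['a', 'b', 'c', 'd', '4']
```

Note that the inverted class [^0-3] also matches letters, not only the remaining digits.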

^ – Caret

The caret (^) symbol checks whether the string starts with the given characters.

For example –

^s checks whether the string starts with s, as in sheep, simon, etc.

^ph checks whether the string starts with ph, as in phenomenal, phone, etc.

Matching is case-sensitive by default, so ^s does not match Simon.

$ – Dollar

After the check at the beginning comes the check at the end. The dollar ($) symbol checks whether the string ends with the given characters. It is written after the characters to be checked.

For example –

n$ checks whether the string ends with n, as in Simon, fun, etc.

ph$ checks whether the string ends with ph, as in graph, triumph, etc.
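The two anchors can be tried side by side. This is only a sketch with an invented word list; it also shows the re.IGNORECASE flag as one way to ignore letter case.

```python
import re

words = ["sheep", "Simon", "graph", "fun"]  # made-up sample words

starts_with_s = [w for w in words if re.search(r"^s", w)]
starts_with_s_any_case = [w for w in words if re.search(r"^s", w, re.IGNORECASE)]
ends_with_n = [w for w in words if re.search(r"n$", w)]
ends_with_ph = [w for w in words if re.search(r"ph$", w)]

print(starts_with_s)           # ['sheep']
print(starts_with_s_any_case)  # ['sheep', 'Simon']
print(ends_with_n)             # ['Simon', 'fun']
print(ends_with_ph)            # ['graph']
```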

. – Period

The dot (.) symbol matches any single character except the newline character (\n). For example –

a.b matches a, followed by any one character, followed by b, as in acb, amb, a1b, etc.

A double dot (..) matches any two characters.
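A quick way to see the dot at work is shown below. The test strings and variable names are assumptions made for this sketch.

```python
import re

one_char = re.search(r"a.b", "amb")    # "." stands for the single character "m"
no_char = re.search(r"a.b", "ab")      # nothing between a and b, so no match
newline = re.search(r"a.b", "a\nb")    # "." does not match a newline by default
two_chars = re.search(r"a..b", "axyb") # two dots, two characters

print(one_char.group())   # amb
print(no_char)            # None
print(newline)            # None
print(two_chars.group())  # axyb
```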

| – Alternation

The | symbol works as the OR operator. It checks whether the pattern before or after it is present in the string.

For example –

a|b matches any string that contains a or b, such as adb, uacjb, etc.
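The following sketch lists every place where one of the alternatives matches. The second pattern and its sample sentence are invented for the illustration.

```python
import re

# Each match is either "a" or "b"; other characters are skipped.
single_letters = re.findall(r"a|b", "uacjb")

# Alternation also works with longer alternatives.
animals = re.findall(r"cat|dog", "cat, dog, bird, catalog")

print(single_letters)  # ['a', 'b']
print(animals)         # ['cat', 'dog', 'cat']
```

The last cat comes from the start of catalog, which is a reminder that alternation alone does not respect word boundaries.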

? – Question Mark

The question mark (?) matches zero or one occurrence of the regex preceding it, which makes that part optional.

For example –

ab?c matches the strings ac and abc, but not abbc, because there are two b’s. Similarly, it does not match abdc, because b is not followed by c.
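The example strings can be checked with a few lines of code. This is a minimal sketch; the names pattern, samples, and optional_b are made up here.

```python
import re

pattern = re.compile(r"ab?c")  # "a", an optional "b", then "c"
samples = ["ac", "abc", "abbc", "abdc"]

# True where the pattern is found somewhere in the string.
optional_b = {s: bool(pattern.search(s)) for s in samples}

print(optional_b)
# {'ac': True, 'abc': True, 'abbc': False, 'abdc': False}
```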

* – Star

The star (*) symbol matches zero or more occurrences of the regex preceding it. For example –

ab*c matches the strings ac, abc, abbbc, dabc, etc., but not abdc, because b is not followed by c.

+ – Plus

The plus (+) symbol matches one or more occurrences of the regex preceding it. For example –

ab+c matches the strings abc, abbc, dabc, but not ac or abdc, because there is no b in ac and b is not followed by c in abdc.
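The difference between * and + shows up only when the repeated part is absent. The sketch below compares the two on the same made-up list of strings.

```python
import re

star = re.compile(r"ab*c")  # zero or more b's
plus = re.compile(r"ab+c")  # at least one b
samples = ["ac", "abc", "abbbc", "abdc"]

star_hits = [s for s in samples if star.search(s)]
plus_hits = [s for s in samples if plus.search(s)]

print(star_hits)  # ['ac', 'abc', 'abbbc']
print(plus_hits)  # ['abc', 'abbbc']
```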

{m,n} – Braces

Braces match from m to n repetitions (both inclusive) of the preceding regex. Write the bounds without a space: Python treats {m, n} with a space as literal text. For example –

a{2,4} matches the strings aaab, baaaac, gaad, but not strings like abc or bc, because they contain only one a or no a at all.
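Here is a short sketch of a bounded repetition, using an invented test string. It also shows what happens in Python's re module when a space is placed inside the braces: as far as I know, the braces then stop being a quantifier and are matched as literal characters.

```python
import re

# Runs of two to four a's; a single "a" is too short to match.
runs = re.findall(r"a{2,4}", "a aa aaa aaaaaa")
print(runs)  # ['aa', 'aaa', 'aaaa', 'aa']

# With a space inside the braces the quantifier is not recognized,
# so "aaa" no longer matches.
with_space = re.search(r"a{2, 4}", "aaa")
print(with_space)  # None
```

The run of six a's is split into a match of four followed by a match of two, because the quantifier is greedy and takes as many as it is allowed.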

( ) – Group

Parentheses group sub-patterns. For example –

(a|b|c|d)yz matches a, b, c, or d followed by yz, as in ayz, bcyz, xdyz, etc.
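Besides grouping, parentheses also capture what they matched. The sketch below, with a made-up input string, reads the captured letter back with group(1).

```python
import re

m = re.search(r"(a|b|c|d)yz", "xcyz")

whole = m.group(0)   # the full match
letter = m.group(1)  # what the first pair of parentheses captured

print(whole)   # cyz
print(letter)  # c
```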

Special Sequences

Special sequences make commonly used patterns easier to write. Some match a class of characters (such as digits or whitespace), and others match a position in the string (such as the start or a word boundary). Here’s a list of special sequences with examples.

Sequence    Description                                                                 Example pattern    Example string
\A          Matches if the string begins with the given characters                      \Aon               on top
\b          Matches if the given characters begin or end a word                         \bthe              the plan
\B          The opposite of \b: the given characters must not begin or end a word       \Btu               congratulation
\d          Matches any decimal digit; equivalent to [0-9]                              \d                 a1b2c3
\D          Matches any non-digit character; equivalent to [^0-9]                       \D                 a1b2c3
\s          Matches any whitespace character; equivalent to [ \t\n\r\f\v]               \s                 e bo k
\S          The opposite of \s: matches any non-whitespace character                    \S                 e bo k
\w          Matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]    \w             3yj385
\W          The opposite of \w: matches any non-alphanumeric character                  \W                 $<>
\Z          Matches if the string ends with the given characters                        ab\Z               baghsab
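The examples in the table can be tried out as follows. This is only a sketch: the variable names are invented, and the extra string bathe is an assumption added to show a case where \b does not match.

```python
import re

digits = re.findall(r"\d", "a1b2c3")        # ['1', '2', '3']
non_digits = re.findall(r"\D", "a1b2c3")    # ['a', 'b', 'c']
spaces = re.findall(r"\s", "e bo k")        # [' ', ' ']
non_spaces = re.findall(r"\S", "e bo k")    # ['e', 'b', 'o', 'k']

at_start = re.search(r"\Aon", "on top")          # matches "on" at the start
at_end = re.search(r"ab\Z", "baghsab")           # matches "ab" at the end
word_start = re.search(r"\bthe", "the plan")     # "the" begins a word
inside_word = re.search(r"\bthe", "bathe")       # None: "the" is inside a word
not_boundary = re.search(r"\Btu", "congratulation")  # "tu" is inside the word

print(digits, non_digits, spaces, non_spaces)
print(at_start.group(), at_end.group(), word_start.group())
print(inside_word, not_boundary.group())
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the re module sees them.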

Python RegEx Module

As noted earlier, Python supports regular expressions through the built-in re module, so you have to import re before using them.

The re module provides a range of functions. Let’s look at some of them.

re.findall()

This method returns a list of strings containing all matches. The string is scanned left to right, and matches are given in the order found.

Example: Finding all matches of a pattern

Python Input

# A Python program to demonstrate working of
# findall()
import re
# A sample text string where regular expression
# is searched.
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
# A sample regular expression to find digits.
regex = r'\d+'
match = re.findall(regex, string)
print(match)

Output:


['123456789', '987654321']

re.compile()

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

Example

Python Input

# The regular expression module is imported
# with a normal import statement.
import re
# compile() creates regular expression
# character class [a-e],
# which is equivalent to [abcde].
# class [abcde] will match with string with
# 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[a-e]')
# findall() searches for the Regular Expression
# and return a list upon finding
print(p.findall("Aye, said Mr. Gideon Stark"))

Output:


['e', 'a', 'd', 'd', 'e', 'a']

re.split()

The re.split() method splits the string wherever the pattern matches and returns a list of the resulting substrings. If the pattern is not found, re.split() returns a list containing the original string. You can pass the maxsplit argument to re.split() to set the maximum number of splits that will occur. The default value of maxsplit is 0, which means all possible splits are made.

Syntax : re.split(pattern, string, maxsplit=0, flags=0)

Python Input

from re import split
# '\W+' denotes one or more non-alphanumeric
# characters. Upon finding ',' or whitespace ' ',
# split() splits the string at that point
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))
# Here ':', ' ' and ',' are not alphanumeric, so
# they are the points where splitting occurs
print(split(r'\W+', 'On 12th June 2022, at 11:02 AM'))
# '\d+' denotes one or more digits.
# Splitting occurs at '12', '2022',
# '11', '02' only
print(split(r'\d+', 'On 12th June 2022, at 11:02 AM'))

Output:

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'June', '2022', 'at', '11', '02', 'AM']
['On ', 'th June ', ', at ', ':', ' AM']
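The examples above all use the default maxsplit. The sketch below, with made-up variable names, shows the effect of maxsplit=1 and what re.split() returns when the pattern is not found at all.

```python
import re

text = "On 12th June 2022, at 11:02 AM"

# Only the first match ("12") is used as a split point.
first_only = re.split(r"\d+", text, maxsplit=1)

# No comma in the string, so the result is a one-item list.
no_match = re.split(r",", "no commas here")

print(first_only)  # ['On ', 'th June 2022, at 11:02 AM']
print(no_match)    # ['no commas here']
```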

re.subn()

subn() is similar to sub(), which replaces every match of the pattern with repl, except in its output. It returns a tuple of two items: the new string and the number of substitutions made.

Syntax:

 re.subn(pattern, repl, string, count=0, flags=0)

Python Input

import re
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('ub', '~*', 'Subject has Uber booked already',
            flags=re.IGNORECASE)
print(t)
print(len(t))
# This will give same output as sub() would have
print(t[0])

Output:


('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already
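For comparison, here is a minimal sketch of re.sub() on the same sentence. It returns only the new string, and its count argument limits how many matches are replaced. The variable names are chosen for the illustration.

```python
import re

text = "Subject has Uber booked already"

# Replace every "ub", ignoring case.
replaced_all = re.sub("ub", "~*", text, flags=re.IGNORECASE)

# Replace only the first match.
replaced_one = re.sub("ub", "~*", text, count=1, flags=re.IGNORECASE)

print(replaced_all)  # S~*ject has ~*er booked already
print(replaced_one)  # S~*ject has Uber booked already
```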

re.search()

The re.search() method is better suited to testing a regular expression than to collecting data, because it stops after the first match. It takes a pattern and a string as arguments and searches for the first location where the RegEx pattern matches the string. It returns None if the pattern doesn’t match, or a match object if it does.

Syntax: match = re.search(pattern, string)

Python Input

# A Python program to demonstrate working of re.search().
import re
# Let's use a regular expression to match a date string
# in the form of month name followed by day number
regex = r"([a-zA-Z]+) (\d+)"
match = re.search(regex, "I was born on June 24")
if match is not None:
    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.
    # This will print 14, 21, since the match starts at index 14
    # and ends just before index 21.
    print("Match at index %s, %s" % (match.start(), match.end()))
    # We use the group() method to get the full match and
    # the captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... return the capture
    # groups in order from left to right in the input string
    # match.group() is equivalent to match.group(0)
    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))
    # So this will print "June"
    print("Month: %s" % (match.group(1)))
    # So this will print "24"
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")

Output:

Match at index 14, 21
Full match: June 24
Month: June
Day: 24

Match Object

A match object contains all the information about the search and its result. If the search isn’t successful, re.search() returns None; otherwise it returns a match object. You can use the dir() function to list the methods and attributes of a match object.
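Because a failed search gives None rather than a match object, it is safest to test the result before calling methods on it. The sketch below uses made-up input strings and also lists the public names that dir() reports.

```python
import re

hit = re.search(r"\d+", "room 42")
miss = re.search(r"\d+", "no digits")

# Always check for None before using the result.
if hit is not None:
    print("Found:", hit.group())  # Found: 42
if miss is None:
    print("No match")

# Public methods and attributes of a match object.
public_names = [name for name in dir(hit) if not name.startswith("_")]
print(public_names)
```

The printed list includes names such as group, start, end, span, re, and string, which are covered in the rest of this section.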

Some commonly used attributes and methods are shown below.

Getting RegEx and String

match.re returns the regular expression object used, and match.string returns the string that was passed. Both are attributes of the match object.

Example: Getting the string and the regex of the matched object

Python Input

import re
s = "Welcome to My Programming School"
# here res is the match object
res = re.search(r"\bP", s)
print(res.re)
print(res.string)

Output:

re.compile('\\bP')
Welcome to My Programming School

Getting the index of matched object

The start() method returns the starting index of the matched substring, and the end() method returns its ending index. Similarly, the span() method returns a tuple containing the starting and ending indexes of the matched substring.

Example:

Python Input

import re
s = "Welcome to Phenomenal Academy"
# here res is the match object
res = re.search(r"\bPhe", s)
print(res.start())
print(res.end())
print(res.span())

Output:


11
14
(11, 14)

Getting matched substring

The group() method returns the part of the string that the pattern matched.

Example

Python Input

import re
s = "Welcome to Phenomenal Academy"
# here res is the match object
res = re.search(r"\D{2} t", s)
print(res.group())

Output:


me t

The pattern matches two non-digit characters, a space, and then the letter t, so group() returns me t.

Summary

Simply stated, a Python RegEx is a sequence of characters that forms a search pattern. A RegEx can be used to check whether a string contains the specified pattern.

To understand regular expressions, you first need to understand metacharacters. Refer to the metacharacters table above.

On this page you’ve learned the following:

  • What is Python RegEx?
  • Metacharacters
  • Special Sequences
  • Python RegEx Module
  • Match object
  • Getting the Index of matched object

Pramod Kumar Yadav is from Janakpur Dham, Nepal. He was born on December 23, 1994, and has one elder brother and two elder sisters. He completed his education at various schools and colleges in Nepal and completed a degree in Computer Science Engineering from MITS in Andhra Pradesh, India. Pramod has worked as the owner of RC Educational Foundation Pvt Ltd, a teacher, and an Educational Consultant, and is currently working as an Engineer and Digital Marketer.


