Python is a well-organized programming language, and one of the things it lets you do is search text. A search can match or find sets of strings by using a specialized syntax held in a pattern. Regular expressions (RegEx) are widely used in UNIX tools, and Python provides full support for them through the re module.
By the end of this page, you’ll be familiar with the important functions for handling Python regular expressions.
You’ll be learning the following:
- What is Python RegEx?
- Metacharacters
- Special Sequences
- Python RegEx Module
- Match object
- Getting the Index of matched object
By then you should be well grounded in Python RegEx and ready for most common operations.
What is Python RegEx
Simply stated, a Python RegEx is a sequence of characters that forms a search pattern. A RegEx can be used to check whether a string contains the specified pattern. Python supports regular expressions through the re module, so to use RegEx in Python you first have to import re.
The table below shows a RegEx and whether each string matches it. The pattern ^a...s$ describes a five-character string that starts with a and ends with s.
Expression | String | Matched/unmatched |
^a...s$ | abs | Unmatched |
^a...s$ | alias | Matched |
^a...s$ | abyss | Matched |
^a...s$ | Alias | Unmatched |
^a...s$ | An abacus | Unmatched |
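One way to check the rows of such a table is to run the pattern against each string. The sketch below assumes the expression is written with three plain dots (^a...s$) and uses re.search(); the variable names are only illustrative.

```python
import re

# Starts with 'a', then any three characters, then ends with 's'.
pattern = r"^a...s$"

# Sample strings taken from the table.
samples = ["abs", "alias", "abyss", "Alias", "An abacus"]

# Map each string to True (matched) or False (unmatched).
results = {text: re.search(pattern, text) is not None for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {'Matched' if matched else 'Unmatched'}")
```

Only the five-character strings that begin with a lowercase a and end with s are reported as matched.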
Example:
Python Input
import re
s = 'Simon: Phenomenal Academy'
match = re.search(r'Academy', s)
print('Start Index:', match.start())
print('End Index:', match.end())
Output
Start Index: 18
End Index: 25
The code above gives the starting index and the ending index of the string Academy.
Metacharacters
To understand regular expressions, you need to understand metacharacters. The table below lists the metacharacters and their descriptions.
Metacharacters | Description |
\ | Drops the special meaning of the character that follows it |
[] | Represents a character class |
^ | Matches the beginning of the string |
$ | Matches the end of the string |
. | Matches any character except the newline |
| | Means OR (matches any of the patterns separated by it) |
? | Matches zero or one occurrence |
* | Matches any number of occurrences (including 0 occurrences) |
+ | Matches one or more occurrences |
{} | Indicates the number of occurrences of the preceding regex to match |
() | Encloses a group of regex |
\ – Backslash
The backslash (\) escapes a metacharacter so that it is treated as an ordinary character. For example, the dot (.) is treated as a special character when you search for it, and it matches any character. To search for a literal dot, put a backslash (\) just before the dot (.) so that it loses its special meaning.
Consider the example below:
Python Input
import re
s = 'Mr.phenomenal'
# without using \
match = re.search(r'.', s)
print(match)
# using \
match = re.search(r'\.', s)
print(match)
Output:
<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(2, 3), match='.'>
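To see why escaping matters, the following sketch compares an unescaped dot with an escaped one. It also uses re.escape(), a standard re helper that adds the backslashes for you; the sample text and variable names are made up for illustration.

```python
import re

text = "115 and 1.5"

# Unescaped: the dot matches any character, so "115" is found first.
unescaped = re.search(r"1.5", text).group()

# Escaped by hand: only a literal dot is accepted.
escaped_by_hand = re.search(r"1\.5", text).group()

# re.escape() builds the escaped pattern from a plain string.
escaped_pattern = re.escape("1.5")
escaped_by_helper = re.search(escaped_pattern, text).group()

print(unescaped, escaped_by_hand, escaped_by_helper)
```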
[] – Square Brackets
Square brackets ([]) define a character class, a set of characters any one of which may match. For example, the character class [123] will match any single character 1, 2, or 3.
You can also use a hyphen inside the square brackets to specify a range of characters.
[0-3] is the same as [0123]
[a-c] is the same as [abc]
You can also use the caret (^) symbol as the first character to invert the character class.
Example,
[^0-3] any character except 0, 1, 2, or 3
[^a-c] any character except a, b, or c.
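The character classes above can be tried out with re.findall(). This is a minimal sketch; the sample strings are invented for the demonstration.

```python
import re

text = "cab 2031"

# [a-c] matches any single character a, b, or c.
letters = re.findall(r"[a-c]", text)

# [0-3] matches any single digit from 0 to 3.
low_digits = re.findall(r"[0-3]", text)

# [^0-3] inverts the class: any character that is NOT 0, 1, 2, or 3.
not_low_digits = re.findall(r"[^0-3]", "a1b5c9")

print(letters, low_digits, not_low_digits)
```

Note that the inverted class also matches letters, not just the remaining digits.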
^ – Caret
The caret (^) symbol checks whether the string starts with the given characters.
Example –
^s checks whether the string starts with s, as in sheep, simon, etc.
^ph checks whether the string starts with ph, as in phenomenal, phone, etc.
Matching is case-sensitive by default, so ^s does not match Simon unless the re.IGNORECASE flag is used.
$ – Dollar
After the beginning check comes the end check. The dollar ($) symbol checks whether the string ends with the given characters. It is written after the characters to be checked.
For example –
n$ checks whether the string ends with the letter n, as in Simon, fun, etc.
ph$ checks whether the string ends with the letters ph, as in graph, triumph, etc.
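The two anchors can be checked side by side. The sketch below filters an illustrative word list with ^ and $, and shows how the re.IGNORECASE flag changes the result for a capitalized word.

```python
import re

words = ["sheep", "Simon", "graph", "fun"]

# ^ anchors the pattern to the start of the string.
starts_with_s = [w for w in words if re.search(r"^s", w)]

# $ anchors the pattern to the end of the string.
ends_with_n = [w for w in words if re.search(r"n$", w)]
ends_with_ph = [w for w in words if re.search(r"ph$", w)]

# Matching is case-sensitive unless re.IGNORECASE is passed.
starts_with_s_any_case = [
    w for w in words if re.search(r"^s", w, flags=re.IGNORECASE)
]

print(starts_with_s, ends_with_n, ends_with_ph, starts_with_s_any_case)
```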
. – Period
The dot (.) symbol matches any single character except the newline character (\n). For example –
a.b matches an a and a b with any one character between them, as in acb, amb, a2b, etc.
A double dot (..) matches any two characters in the string.
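A short sketch of the dot in action follows. The test strings are invented; re.DOTALL is a standard flag that lets the dot match a newline as well.

```python
import re

text = "acb a2b ab a\nb"

# One dot: exactly one character (not a newline) between a and b.
single = re.findall(r"a.b", text)

# With re.DOTALL the dot also matches the newline character.
single_dotall = re.findall(r"a.b", text, flags=re.DOTALL)

# Two dots: exactly two characters between a and b.
double = re.findall(r"a..b", "axyb ab axb")

print(single, single_dotall, double)
```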
| – Alternation
The | symbol works as the OR operator. It checks whether the pattern before or after the | is present in the string.
For example –
a|b matches any string that contains a or b, as in adb, uacjb, etc.
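Alternation works on whole sub-patterns as well as single characters. The sketch below uses re.findall() on illustrative strings to show both cases.

```python
import re

# Single characters: every a or b in the string is reported.
char_hits = re.findall(r"a|b", "uacjb")

# Whole words: either alternative may match.
word_hits = re.findall(r"cat|dog", "cat, dog, bird")

# Neither alternative is present, so the result is an empty list.
no_hits = re.findall(r"a|b", "xyz")

print(char_hits, word_hits, no_hits)
```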
? – Question Mark
The question mark (?) matches zero or one occurrence of the regex preceding it, so the preceding item is optional.
For example –
ab?c matches the strings ac and abc but does not match abbc because there are two b. Similarly, it does not match abdc because b is not followed by c.
* – Star
The star (*) symbol matches zero or more occurrences of the regex preceding the * symbol. For example –
ab*c matches the strings ac, abc, abbbc, dabc, etc. but does not match abdc because b is not followed by c.
+ – Plus
The plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. For example –
ab+c matches the strings abc, abbc, and dabc, but does not match ac or abdc because there is no b in ac and b is not followed by c in abdc.
{m,n} – Braces
Braces match the preceding regex repeated from m to n times, both inclusive. Python does not allow a space inside the braces: written as a{2, 4}, the pattern is treated as literal text rather than as a repetition. For example –
a{2,4} matches strings such as aaab, baaaac, and gaad, but does not match strings like abc or bc because there is only one a or no a in those cases.
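The four quantifiers (?, *, +, and braces) can be compared with one small helper. This is only a sketch: matches() is a hypothetical helper written for this page, not part of the re module, and it uses re.search() so the pattern may occur anywhere in the string.

```python
import re

def matches(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]

# ? : zero or one b between a and c.
optional_b = matches(r"ab?c", ["ac", "abc", "abbc", "abdc"])

# * : zero or more b between a and c.
any_b = matches(r"ab*c", ["ac", "abc", "abbbc", "dabc", "abdc"])

# + : one or more b between a and c.
some_b = matches(r"ab+c", ["abc", "abbc", "dabc", "ac", "abdc"])

# {2,4} : between two and four a in a row (no space after the comma).
two_to_four_a = matches(r"a{2,4}", ["aaab", "baaaac", "gaad", "abc", "bc"])

# With a space inside the braces the text is literal, so nothing matches here.
spaced = re.search(r"a{2, 4}", "aaa")

print(optional_b, any_b, some_b, two_to_four_a, spaced)
```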
( ) – Group
Parentheses are used to group sub-patterns. For example –
(a|b|c|d)yz matches any string in which a, b, c, or d is followed by yz, as in ayz, byz, xcyz, etc.
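A group also captures the text it matched, which can be read back from the match object. The sketch below assumes the sample strings "xcyz" and "eyz", chosen only for illustration.

```python
import re

pattern = r"(a|b|c|d)yz"

match = re.search(pattern, "xcyz")

# group(0) is the whole match; group(1) is what the parentheses captured.
whole = match.group(0)
letter = match.group(1)

# 'e' is not one of the alternatives, so there is no match.
no_match = re.search(pattern, "eyz")

print(whole, letter, no_match)
```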
Special Sequences:
Special sequences make commonly used patterns easier to write. Some of them (\A, \b, \B, \Z) do not match characters but specify a location where the match must occur, while the others (\d, \D, \s, \S, \w, \W) match a whole class of characters. Here’s a list of special sequences and examples.
Special Sequence | Description | Example pattern | Example string |
\A | Matches if the string begins with the given characters | \Aon | on top |
\b | Matches if the given characters begin or end a word | \bthe | the plan |
\B | The opposite of \b: the given characters must not be at the start or end of a word | \Btu | congratulation |
\d | Matches any decimal digit; for ASCII text this equals [0-9] | \d | a1b2c3 |
\D | Matches any non-digit character; for ASCII text this equals [^0-9] | \D | a1b2c3 |
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v] | \s | e bo k |
\S | The opposite of \s: matches any non-whitespace character | \S | e bo k |
\w | Matches any alphanumeric character or underscore; for ASCII text this equals [a-zA-Z0-9_] | \w | 3yj385 |
\W | The opposite of \w: matches any non-alphanumeric character | \W | $<> |
\Z | Matches if the string ends with the given characters | ab\Z | baghsab |
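A few of these sequences are exercised in the sketch below. The strings mostly follow the table; the longer sentence used for \b and \B is made up so that both cases appear in one string.

```python
import re

# \d and \D split a string into digits and non-digits.
digits = re.findall(r"\d", "a1b2c3")
non_digits = re.findall(r"\D", "a1b2c3")

# \s finds each whitespace character.
spaces = re.findall(r"\s", "e bo k")

# \b requires a word boundary before "the"; \B forbids one.
sentence = "the plan, then bathe"
at_word_start = re.findall(r"\bthe", sentence)   # from "the" and "then"
inside_word = re.findall(r"\Bthe", sentence)     # from "bathe"

# \A and \Z anchor to the very start and very end of the string.
starts_with_on = re.search(r"\Aon", "on top") is not None
ends_with_ab = re.search(r"ab\Z", "baghsab") is not None

print(digits, non_digits, spaces, at_word_start, inside_word)
```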
Python RegEx Module
As noted earlier, regular expressions in Python are provided by the re module, which you have to import before use.
The re module provides a range of functions; let’s look at some of them.
re.findall()
This function returns a list of strings containing all non-overlapping matches. The string is scanned left to right, and matches are given in the order found.
Example: Finding all matches of a pattern
Python Input
# A Python program to demonstrate working of
# findall()
import re
# A sample text string where regular expression
# is searched.
string = """Hello my Number is 123456789 and my friend's number is 987654321"""
# A sample regular expression to find digits.
regex = r'\d+'
match = re.findall(regex, string)
print(match)
Output: ['123456789', '987654321']
re.compile()
re.compile() compiles a regular expression into a pattern object, which provides methods for operations such as searching for pattern matches or performing string substitutions.
Example
Python Input
# The regular expression module is imported.
import re
# compile() creates a pattern object for the
# character class [a-e],
# which is equivalent to [abcde].
# The class [abcde] matches any of the characters
# 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[a-e]')
# findall() searches for the regular expression
# and returns a list of all matches found
print(p.findall("Aye, said Mr. Gideon Stark"))
Output: ['e', 'a', 'd', 'd', 'e', 'a']
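A compiled pattern is mainly useful when the same expression is applied several times. The sketch below reuses one pattern object for findall(), search(), and sub(); the extra strings "xyz" and "bead" are invented for the demonstration.

```python
import re

# Compile once, then reuse the pattern object.
p = re.compile(r"[a-e]")

# The pattern object has the same operations as the module-level functions.
found = p.findall("Aye, said Mr. Gideon Stark")
missing = p.search("xyz")          # no a-e characters, so None
masked = p.sub("*", "bead")        # every matching character replaced

# The original expression stays available as an attribute.
source = p.pattern

print(found, missing, masked, source)
```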
re.split()
The re.split() function splits the string wherever the pattern matches and returns a list of the resulting substrings. If the pattern is not found, re.split() returns a list containing the original string. You can pass the maxsplit argument to re.split() to set the maximum number of splits that will occur. The default value for maxsplit is 0, which means all possible splits are made.
Syntax : re.split(pattern, string, maxsplit=0, flags=0)
Python Input
from re import split
# '\W+' denotes non-alphanumeric characters
# or a group of such characters. Upon finding ','
# or whitespace ' ', split() splits the
# string from that point
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))
# Here ':', ' ', ',' are not alphanumeric, thus
# they are the points where splitting occurs
print(split(r'\W+', 'On 12th June 2022, at 11:02 AM'))
# '\d+' denotes numeric characters or a group of
# such characters. Splitting occurs at '12', '2022',
# '11', '02' only
print(split(r'\d+', 'On 12th June 2022, at 11:02 AM'))
Output:
['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'June', '2022', 'at', '11', '02', 'AM']
['On ', 'th June ', ', at ', ':', ' AM']
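The maxsplit argument and the no-match case described above are not covered by the example, so here is a small sketch of both. The date string is reused from the example; the ";" separator is an arbitrary pattern that does not occur in it.

```python
import re

text = "On 12th June 2022, at 11:02 AM"

# maxsplit=2 stops after two splits; the rest of the string is kept whole.
first_two = re.split(r"\W+", text, maxsplit=2)

# The pattern never matches, so the list holds the original string.
unchanged = re.split(r";", text)

print(first_two)
print(unchanged)
```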
re.subn()
re.sub() replaces every match of the pattern with the replacement string and returns the new string. subn() is similar to sub() except in its output: it returns a tuple of 2 items, the new string and the number of substitutions made.
Syntax:
re.subn(pattern, repl, string, count=0, flags=0)
Python Input
import re
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)
print(t)
print(len(t))
# This will give the same output as sub() would have
print(t[0])
Output:
('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already
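For comparison, the sketch below runs sub() and subn() on the same sentence and also shows the count parameter from the syntax line, which limits how many replacements are made.

```python
import re

text = "Subject has Uber booked already"

# sub() returns only the new string.
replaced = re.sub(r"ub", "~*", text, flags=re.IGNORECASE)

# subn() returns the new string together with the number of replacements.
replaced_n, n = re.subn(r"ub", "~*", text, flags=re.IGNORECASE)

# count=1 stops after the first replacement.
limited, limited_n = re.subn(r"ub", "~*", text, count=1, flags=re.IGNORECASE)

print(replaced, n, limited, limited_n)
```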
re.search()
The re.search() function is better suited to testing whether a pattern occurs than to collecting data, because it stops after the first match. It takes a pattern and a string as arguments and searches for the first location where the RegEx pattern produces a match in the string. It returns None if the pattern doesn’t match, or a match object if it does.
Syntax: match = re.search(pattern, str)
Python Input
# A Python program to demonstrate working of re.search().
import re
# Let's use a regular expression to match a date string
# in the form of month name followed by day number
regex = r"([a-zA-Z]+) (\d+)"
match = re.search(regex, "I was born on June 24")
if match is not None:
    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.
    # This will print 14, 21, since the match starts at index 14
    # and ends at 21.
    print("Match at index %s, %s" % (match.start(), match.end()))
    # We use the group() method to get the full match and
    # the captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... return the capture
    # groups in order from left to right in the input string
    # match.group() is equivalent to match.group(0)
    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))
    # So this will print "June"
    print("Month: %s" % (match.group(1)))
    # So this will print "24"
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")
Output:
Match at index 14, 21
Full match: June 24
Month: June
Day: 24
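Because re.search() returns None when nothing matches, the result should be checked before calling methods on it. The sketch below wraps that check in a hypothetical helper, find_date(), and contrasts re.search() with re.match(), which only looks at the start of the string.

```python
import re

DATE_REGEX = r"([a-zA-Z]+) (\d+)"

def find_date(text):
    """Return (month, day) for the first date found, or None if there is none."""
    match = re.search(DATE_REGEX, text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

found = find_date("I was born on June 24")
not_found = find_date("no date here")

# re.match() anchors at the start, so the date in mid-sentence is missed.
anchored = re.match(DATE_REGEX, "I was born on June 24")

print(found, not_found, anchored)
```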
Match Object
A Match object contains all the information about the search and its result. If the search isn’t successful, re.search() returns None; otherwise it returns a Match object. You can use the dir() function to list the methods and attributes of a Match object.
Some commonly used attributes and methods are shown below.
Getting RegEx and String
match.re gives the regular expression used and match.string returns the string that was searched. These two are attributes of the match object.
Example: Getting the string and the regex of the matched object
Python Input
import re
s = "Welcome to My Programming School"
# here res is the match object
res = re.search(r"\bP", s)
print(res.re)
print(res.string)
Output:
re.compile('\\bP')
Welcome to My Programming School
Getting the index of matched object
The start() method returns the starting index of the matched substring and the end() method returns the ending index of the matched substring. Similarly, the span() method returns a tuple containing the starting and the ending index of the matched substring.
Example:
Python Input
import re
s = "Welcome to Phenomenal Academy"
# here res is the match object
res = re.search(r"\bPhe", s)
print(res.start())
print(res.end())
print(res.span())
Output:
11
14
(11, 14)
Getting matched substring
The group() method returns the part of the string that the pattern matched.
Example
Python Input
import re
s = "Welcome to Phenomenal Academy"
# here res is the match object
res = re.search(r"\D{2} t", s)
print(res.group())
Output: me t
The returned substring consists of two non-digit characters, a space, and the letter t that follows the space.
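The index methods and group() describe the same piece of text, which the sketch below makes explicit by slicing the original string with the values from span(). The variable names are illustrative only.

```python
import re

s = "Welcome to Phenomenal Academy"
res = re.search(r"\bPhe", s)

# span() gives (start, end); end is one past the last matched character.
start, end = res.span()

# Slicing the searched string with those indexes reproduces group().
sliced = res.string[start:end]
grouped = res.group()

print(start, end, sliced, grouped)
```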
Summary
Simply stated, a Python RegEx is a sequence of characters that forms a search pattern. A RegEx can be used to check whether a string contains the specified pattern.
To work with regular expressions, you need to understand metacharacters; refer to the Metacharacters table above.
On this page you’ve learned the following:
- What is Python RegEx?
- Metacharacters
- Special Sequences
- Python RegEx Module
- Match object
- Getting the Index of matched object