Python Friday #212: Regular Expressions With Metacharacters

Last week we worked with the main methods of the re module. However, as long as we have to spell out the exact text we search for, we miss much of the flexibility that regular expressions offer. In this post we look at metacharacters and how they make regular expressions so powerful.

This post is part of my journey to learn Python. You can find the code for this post in my PythonFriday repository on GitHub.

Metacharacters

With metacharacters we can describe what kind of data we are looking for without spelling out the exact text. This gives us a lot more flexibility, but at the price of higher complexity.

Helper Characters
We usually use these characters in combination with other metacharacters:

`.`	Matches any single character except a newline.
`|`	Alternation / the “or” operator. If A and B are regular expressions, A|B will match anything that matches either A or B.
`\`	Escapes a metacharacter, removing its special meaning so that it matches literally.
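
As a small illustration of the three helper characters, here is a minimal sketch. The sample strings and variable names are made up for this example:

```python
import re

# "." stands for any single character except a newline,
# so "h\nt" (h, newline, t) is not matched.
dot_matches = re.findall(r"h.t", "hat hot hit h\nt")
# ['hat', 'hot', 'hit']

# "|" matches either the expression on its left or the one on its right.
pet_matches = re.findall(r"cat|dog", "a cat, a dog and a bird")
# ['cat', 'dog']

# An unescaped "." accepts any character between 3 and 14 ...
any_char = re.findall(r"3.14", "3.14 3x14")
# ['3.14', '3x14']

# ... while "\." only accepts a literal dot.
literal_dot = re.findall(r"3\.14", "3.14 3x14")
# ['3.14']
```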

Custom character classes / sets
We can specify a set of characters that we want to match:

`[]`	Specifies a character class.
`[abc]`	Matches either a, b, or c, but it does not match abc.
`[a-z]`	Matches any character from a to z.
`[a\-z]`	Matches a, -, or z (the `-` is escaped and matches only the `-` character).
`[a-]` or `[-a]`	Matches a or -.
`[a-z0-9]`	Matches characters from a to z and from 0 to 9.
`[^abc]`	The ^ inverts the meaning of the class (matches every character that is not in the class).
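
To see how these classes behave, we can run them against a short sample string. This is only a sketch; the text and the variable names are invented for the example:

```python
import re

text = "a-z: cab 42"  # made-up sample text

# Each character of the class is matched on its own, never "abc" as a whole.
either = re.findall(r"[abc]", text)
# ['a', 'c', 'a', 'b']

# A range: every lowercase letter from a to z.
lower = re.findall(r"[a-z]", text)
# ['a', 'z', 'c', 'a', 'b']

# The escaped "-" is a literal dash, so only a, - and z match.
with_dash = re.findall(r"[a\-z]", text)
# ['a', '-', 'z', 'a']

# "^" as the first character negates the class.
negated = re.findall(r"[^a-z]", text)
# ['-', ':', ' ', ' ', '4', '2']
```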

Pre-defined character classes
We can use these shortcuts to match specific classes of characters:

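`\d`	Matches any decimal digit; with the re.ASCII flag this is exactly [0-9].
`\D`	Matches any non-digit character; with the re.ASCII flag this is exactly [^0-9].
`\s`	Matches any whitespace character; with the re.ASCII flag this is exactly [ \t\n\r\f\v].
`\S`	Matches any non-whitespace character; with the re.ASCII flag this is exactly [^ \t\n\r\f\v].
`\w`	Matches any word character (alphanumeric or underscore); with the re.ASCII flag this is exactly [a-zA-Z0-9_].
`\W`	Matches any non-word character; with the re.ASCII flag this is exactly [^a-zA-Z0-9_].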
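
The following sketch shows the pre-defined classes on a made-up string. The last two lines illustrate that for str patterns Python matches Unicode characters by default, so the bracket equivalents in the table hold exactly only with the re.ASCII flag:

```python
import re

text = "Order 66: 3 items\tpaid"  # made-up sample text

digits = re.findall(r"\d", text)
# ['6', '6', '3']

words = re.findall(r"\w+", text)
# ['Order', '66', '3', 'items', 'paid']

whitespace = re.findall(r"\s", text)
# [' ', ' ', ' ', '\t']

non_word = re.findall(r"\W", text)
# [' ', ':', ' ', ' ', '\t']

# "\u0663" is the Arabic-Indic digit three, a decimal digit outside 0-9.
unicode_digits = re.findall(r"\d", "5\u0663")
# ['5', '\u0663']
ascii_digits = re.findall(r"\d", "5\u0663", flags=re.ASCII)
# ['5']
```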

Anchors
With anchors we can specify where our match must occur:

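`^`	Matches at the beginning of the string; with the re.MULTILINE flag also at the beginning of each line.
`\A`	Matches only at the start of the string.
`$`	Matches at the end of the string (or just before a newline at the end of the string); with the re.MULTILINE flag also at the end of each line.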
`\Z`	Matches only at the end of the string.
`\b`	Matches the word boundary (a zero-width assertion at the beginning or end of a word).
`\B`	The opposite of \b, only matching when the current position is not at a word boundary.
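
A short sketch of the anchors in action. The sample texts are made up; re.MULTILINE is the flag that lets ^ and $ work per line:

```python
import re

text = "first line\nsecond line"  # made-up two-line sample

# Without a flag, ^ and $ only anchor at the start and end of the string.
start_default = re.findall(r"^\w+", text)
# ['first']
end_default = re.findall(r"\w+$", text)
# ['line']

# With re.MULTILINE they anchor at the start and end of every line.
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
# ['first', 'second']
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)
# ['line', 'line']

# \A ignores re.MULTILINE and still matches only at the very start.
start_of_string = re.findall(r"\A\w+", text, flags=re.MULTILINE)
# ['first']

# \b requires a word boundary, \B forbids one.
sentence = "line up the outline, line by line"
whole_words = re.findall(r"\bline\b", sentence)
# ['line', 'line', 'line']
word_endings = re.findall(r"\Bline\b", sentence)
# ['line']  (the one inside "outline")
```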

Quantifiers
With quantifiers we can specify how many times the part must occur:

`*`	Matches zero or more repetitions.
`+`	Matches one or more repetitions.
`?`	Matches zero or one repetition.
`{}`	Matches an explicitly specified number of repetitions.
`{m}`	Matches exactly m repetitions.
`{m,n}`	Matches between m and n repetitions (both inclusive).
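
The quantifiers always apply to the element directly in front of them. A minimal sketch with invented sample strings:

```python
import re

# "*" allows zero or more b's after the a.
zero_or_more = re.findall(r"ab*", "a ab abbb")
# ['a', 'ab', 'abbb']

# "+" needs at least one b, so the lone "a" no longer matches.
one_or_more = re.findall(r"ab+", "a ab abbb")
# ['ab', 'abbb']

# "?" makes the u optional, but does not allow two of them.
optional = re.findall(r"colou?r", "color colour colouur")
# ['color', 'colour']

text = "a aa aaa aaaa"
# {2} matches exactly two repetitions; \b keeps longer runs from matching.
exactly_two = re.findall(r"\ba{2}\b", text)
# ['aa']

# {2,3} matches two or three repetitions.
two_to_three = re.findall(r"\ba{2,3}\b", text)
# ['aa', 'aaa']
```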

Regular expressions use a lot of backslashes, and in a regular Python string each of them has to be escaped with another backslash. Our code gets messy in no time. We can address this problem by telling Python that we have a raw string in which it should not process escape sequences. All we need to do is add an r in front of our string:

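# Regular string: every backslash must be doubled.
regular = "\\w+\\s+\\1"
regular
# '\\w+\\s+\\1'

# Raw string: the backslashes are taken literally.
raw = r"\w+\s+\1"
raw
# '\\w+\\s+\\1'  (same content; the REPL shows the doubled form)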
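
The raw string is more than cosmetics. The following sketch (all names and sample values are made up) shows that both spellings produce the same pattern, and what goes wrong when we forget the r in front of a pattern with \b:

```python
import re

# Both spellings produce exactly the same eight characters.
doubled = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_pattern = doubled == raw
# True

# In a regular string "\b" is the backspace character, not a word boundary,
# so this pattern looks for backspace + "cat" + backspace.
without_r = re.findall("\bcat\b", "the cat sat")
# []

# In a raw string the backslash reaches the re module untouched.
with_r = re.findall(r"\bcat\b", "the cat sat")
# ['cat']
```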

Find text parts with metacharacters

The re.findall() function without metacharacters was not that helpful. It neither told us where the matches are, nor did it return anything beyond the text we searched for. With metacharacters, the result is of much more use:

import re

# Raw strings keep Python from interpreting "\d" as a string escape.
res = re.findall(r"\d", "123abc123")
res
# ['1', '2', '3', '1', '2', '3']

res = re.findall(r"\d+", "123abc123")
res
# ['123', '123']

When we ask for a digit (\d) we get all digits back. But when we ask for a number formed from 1 or more digits (\d+), we get the two groups of 123 back.

The metacharacters allow us to extract data in a generic way, without knowing exactly what data is in our input. This is much more useful and helps us with many use cases.
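
If we also want to know where the matches are, one option is re.finditer(), which yields match objects instead of plain strings. This is a small sketch with a made-up input string:

```python
import re

text = "123abc456"  # made-up sample input

# Each match object knows its text and its start/end position.
numbers = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
# [('123', (0, 3)), ('456', (6, 9))]

# The same input, but this time we extract the letters in between.
letters = re.findall(r"[a-z]+", text)
# ['abc']
```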

Find groups

We can combine multiple metacharacters and form groups by wrapping parts of the expression in parentheses. To get the word in capital letters and a price from a string, we can use a regular expression like this one:

# "\." matches the literal decimal point (a bare "." would match any character).
res = re.search(r"(\b[A-Z]+\b).+(\b\d+\.\d+)", "The price of AAPL is 192.34")
res.groups()
# ('AAPL', '192.34')
res.group(1)
# 'AAPL'
res.group(2)
# '192.34'

To access the first group, we can use res.group(1), while the second group can be accessed with res.group(2).
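
Keep in mind that re.search() returns None when nothing matches, so calling group() on the result would fail. A minimal sketch of a guard around the groups; the function name parse_quote and the conversion to float are assumptions made for this example:

```python
import re

def parse_quote(text):
    """Return (symbol, price) from text, or None if the pattern does not match.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"(\b[A-Z]+\b).+(\b\d+\.\d+)", text)
    if match is None:
        return None
    # groups() returns all captured groups as a tuple of strings.
    symbol, price = match.groups()
    return symbol, float(price)

quote = parse_quote("The price of AAPL is 192.34")
# ('AAPL', 192.34)

nothing = parse_quote("no quote in here")
# None
```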

The only problem with those groups is that they do not communicate their intentions well. We can change that by giving the groups a name. To do that, we add ?P<name> right after the opening parenthesis of each group:

res = re.search(r"(?P<stock>\b[A-Z]+\b).+(?P<price>\b\d+\.\d+)", "The price of AAPL is 192.34")
res.groups()
# ('AAPL', '192.34')
res.group("stock")
# 'AAPL'
res.group("price")
# '192.34'

We can now access the stock name with res.group("stock") and get the price with res.group("price").
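
Named groups can also be fetched all at once. The sketch below uses groupdict() and shows that the names are an addition to the numbered access, not a replacement; the variable names are made up for this example:

```python
import re

pattern = re.compile(r"(?P<stock>\b[A-Z]+\b).+(?P<price>\b\d+\.\d+)")
match = pattern.search("The price of AAPL is 192.34")

# groupdict() maps every group name to the text it captured.
fields = match.groupdict()
# {'stock': 'AAPL', 'price': '192.34'}

# Named groups keep their number, so both ways of access give the same text.
same_group = match.group(1) == match.group("stock")
# True

# span() accepts a group name and returns the (start, end) position.
price_position = match.span("price")
# (21, 27)
```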

Conclusion

Regular expressions allow us to search for exactly the part of a text that we are interested in. The more specific the part we search for, the more specific (and often complicated) our expressions need to be. While this is a complicated topic, the expressions and metacharacters have changed little over the years, and if you spend the time to understand them, you can profit for years to come.
