REGEX in action.
Regular expressions, or “regex”, are a way of processing text data: they match and filter strings of text such as particular characters, words, or patterns of characters. In other words, regular expressions let us match and extract any text pattern from text data.
Updated: 2024-12-29 by Andrey BRATUS, Senior Data Analyst.
Meta characters:
Extracting emails from string:
Extracting URLs from text file:
Extracting IP addresses from text file:
Extracting filenames according to pattern:
A common use of regular expressions is form validation, such as email validation, password validation, phone number extraction, and many other common form fields. Regex use cases range from very simple to extremely complex, and building complex regular expressions is a skill learned only through practice. The Python module re provides full support for regular expression tasks.
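As a minimal sketch of the form-validation use case mentioned above, the snippet below checks a phone-number field with re.fullmatch. The phone format and the names PHONE_PATTERN and is_valid_phone are assumptions made for illustration, not a general-purpose validator.

```python
import re

# Hypothetical format for illustration: optional "+", then 10 to 15 digits.
PHONE_PATTERN = re.compile(r"\+?[0-9]{10,15}")

def is_valid_phone(value: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    # fullmatch requires the pattern to cover the entire string,
    # so extra characters before or after the number are rejected.
    return PHONE_PATTERN.fullmatch(value) is not None

print(is_valid_phone("+12025550123"))  # True
print(is_valid_phone("2025550123x"))   # False
```

For validation, fullmatch is usually a better fit than findall or search, because those two accept a match anywhere inside the input.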
. Matches any single character
\ Escapes one of the meta characters to treat it as a regular character
[...] Matches a single character or a range that is contained within brackets
[_-] [-_] Inside brackets the order of the characters does not matter; outside brackets the order does matter
+ Matches the preceding element one or more times
? Matches the preceding element zero or one time
* Matches the preceding element zero or more times
{m,n} Matches the preceding element at least m and not more than n times
^ Matches the beginning of a line or string
$ Matches the end of a line or string
[^...] Matches a single character or a range that is not contained within the brackets
(?:...|...) "Or" operator inside a non-capturing group
() Groups part of a pattern and captures what it matches; a ? after the group makes it optional
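The meta characters listed above can be tried out on a short string. This is only a sketch: the sample text and the variable names are made up for illustration.

```python
import re

sample = "cat cut c.t coat"  # made-up sample text, not from the article

any_char = re.findall(r"c.t", sample)        # "." is any single character
literal_dot = re.findall(r"c\.t", sample)    # "\." is a literal dot
in_set = re.findall(r"c[au]t", sample)       # one character from the set
not_in_set = re.findall(r"c[^au]t", sample)  # one character outside the set
zero_or_more = re.findall(r"co*at", sample)  # "o" zero or more times
one_or_more = re.findall(r"co+at", sample)   # "o" one or more times
either = re.findall(r"c(?:a|u)t", sample)    # "a" or "u", without capturing

print(any_char)      # ['cat', 'cut', 'c.t']
print(literal_dot)   # ['c.t']
print(in_set)        # ['cat', 'cut']
print(not_in_set)    # ['c.t']
print(zero_or_more)  # ['cat', 'coat']
print(one_or_more)   # ['coat']
print(either)        # ['cat', 'cut']
```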
import re
text = 'To contact my wonderful jokes site please use andrey@python-code.pro instead of example@python-code.pro email address'
# Raw string with an escaped dot, so "\." matches a literal "." only.
pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]+")
matches = pattern.findall(text)
matches
OUT: ['andrey@python-code.pro', 'example@python-code.pro']
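When the part before "@" and the domain are needed separately, capturing groups can split each match. The following is a sketch under assumptions: the shortened sample sentence and the name pairs are illustrative only.

```python
import re

# Shortened sample sentence, made up for illustration.
text = "write to andrey@python-code.pro or example@python-code.pro"

# Two capturing groups: the part before "@" and the domain after it.
pattern = re.compile(r"([^ @]+)@([^ @]+\.[a-z]+)")

# With groups in the pattern, findall returns one tuple per match.
pairs = pattern.findall(text)
print(pairs)
# [('andrey', 'python-code.pro'), ('example', 'python-code.pro')]
```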
The task here is to extract only .net URLs.
import re
with open('urlsintext.txt', 'r') as file:
    content = file.read()
# Optional "www.", then anything up to a literal ".net".
pattern = re.compile(r"https?://(?:www\.)?[^ \n]+\.net")
matches = pattern.findall(content)
matches
OUT: ['https://python-code.pro',
'http://www.python-code.pro',
'http://stupidname.net']
An additional condition here: keep only the addresses whose third part begins with 33.
import re
# Read the whole file into one string.
with open('ipaddresses.txt', 'r') as file:
    content = file.read()
# Three digits per part; the third part must start with 33.
pattern = re.compile(r"[0-9]{3}\.[0-9]{3}\.33[0-9]\.[0-9]{3}")
matches = pattern.findall(content)
matches
OUT: ['912.121.330.123', '912.121.339.123']
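The pattern above accepts any three-digit part, including values above 255, and skips shorter parts. If only well-formed IPv4 addresses are wanted, one possible approach is to capture each part and check its numeric range in Python. This is a sketch: the sample string and the name valid are assumptions for illustration.

```python
import re

# Made-up sample data for illustration.
content = "10.0.33.7 912.121.330.123 192.168.33.250 1.2.335.4"

# 1 to 3 digits per part; the third part must start with 33.
pattern = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(33\d?)\.(\d{1,3})\b")

# Keep a candidate only if every captured part is at most 255.
valid = [
    ".".join(parts)
    for parts in pattern.findall(content)
    if all(int(part) <= 255 for part in parts)
]
print(valid)  # ['10.0.33.7', '192.168.33.250']
```

Doing the range check with int() keeps the pattern short; expressing 0-255 purely in regex is possible but much harder to read.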
An additional condition here: keep only the bills for January 1-20.
import re
from pathlib import Path
root_dir = Path('files')
filenames_str = [filename.name for filename in root_dir.iterdir()]
# "jan" plus optional letters, a dash, a day from 1 to 20, then ".txt".
pattern = re.compile(r"jan[a-z]*-(?:[1-9]|1[0-9]|20)\.txt", re.IGNORECASE)
matches = [filename for filename in filenames_str if pattern.search(filename)]
matches
OUT: ['Jan-12.txt', 'bill_Jan-13.txt', 'january-14.txt']
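An alternative to encoding the day range in the pattern is to capture the day and compare it as an integer. The sketch below uses a made-up list of file names in place of the directory listing; unlike the pattern above, it also accepts zero-padded days such as 01.

```python
import re

# Made-up file names standing in for the directory listing.
filenames_str = ["Jan-12.txt", "bill_Jan-13.txt", "january-14.txt",
                 "jan-25.txt", "feb-03.txt"]

# Capture the day as a group instead of listing the allowed values.
pattern = re.compile(r"jan[a-z]*-(\d{1,2})\.txt", re.IGNORECASE)

matches = []
for name in filenames_str:
    found = pattern.search(name)
    # Keep the file only if a day was found and it falls in 1..20.
    if found and 1 <= int(found.group(1)) <= 20:
        matches.append(name)

print(matches)  # ['Jan-12.txt', 'bill_Jan-13.txt', 'january-14.txt']
```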