Rock With Regex

\b(?!\b(?:ADV|AICE|MYP|PYP|[A-z]2)\b)([A-z])([A-z]*?)\b

The above gobbledygook is a real regex string I used in one of my projects, MyGrades, to identify and format course names. By the end of this article, you'll not only be able to read and understand what it means, but also create regex patterns of your own!

I presented a workshop on this topic in 2021 at a Google DSC Event.

What is Regex?

Regex, or regular expressions, are patterns used to match strings. Regex is commonly used for searching/filtering strings for information, input validation, and web scraping. "Real-world" examples include everything from validating email addresses to formatting class names in a grades app.

Regex is incredibly powerful, but due to its seemingly unintelligible nature, it's also often intimidating to learn and difficult to remember.

But today you're gonna learn it!

Outline

In this article, we'll

Learn the "Balderdash" Basics (of Regex)
Learn Regex Syntax:

"Flapdoodle" Flags
"Gibberish" Characters
"Bafflegab" Special Characters
"Rigmarole" Ranges
"Jargon" Quantifiers
"Gobbledygook" Groups
"Malarkey" Anchors

Discuss Next Steps and Additional Practice

"Balderdash" Basics (of Regex)

How does Regex work?

(Besides potentially making your code unintelligible)

Regex, or regular expressions, are based on logic. Regex follows three primary rules:

Regex engines move from left to right.
Regular expressions start and end with "delimiters." For example, JavaScript regex literals are wrapped in "slash" characters /, and Python regex is usually written as a raw string that begins with r" and ends with ". (While Python doesn't have regex literals per se, raw strings make regex easier to write because you don't have to worry about string escapes.)
Patterns return the first case-sensitive match they find by default.

Therefore: given the sample string I scream, you scream, we all SCREAM for ice cream, /scream/ matches the first instance of "scream."

Another example:

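regex string: /mon/

test string: the mopey monkey stole my money

match: the mon in monkey

The engine passes over mopey (there is no n after mo), matches mon in monkey, and stops there, so the mon in money is never reached. This behavior can be modified with flags.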
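To see the "first case-sensitive match" rule in action, here is a minimal sketch using Python's built-in `re` module. The variable names are just for illustration, and the patterns are written as raw strings rather than slash-delimited literals.

```python
import re

text = "I scream, you scream, we all SCREAM for ice cream"

# re.search scans left to right and stops at the first match it finds.
first = re.search(r"scream", text)
first_text = first.group()   # 'scream'
first_span = first.span()    # (2, 8)

# Matching is case-sensitive by default, so "SCREAM" is skipped.
lowercase_only = re.findall(r"scream", text)  # ['scream', 'scream']

# The second example: "mopey" has no "mon", so the match lands in "monkey".
monkey = re.search(r"mon", "the mopey monkey stole my money")
monkey_span = monkey.span()  # (10, 13)

print(first_text, first_span, lowercase_only, monkey_span)
```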

Regex Syntax

AKA: How to parse gibberish

🚩 "Flapdoodle" Flags

Regex includes several flags that are appended to the end of the expression to change behavior. Using the string I scream, you scream, we all SCREAM for ice cream, the updated regex /scream/gi will now return scream scream SCREAM.

Syntax	Flag	Behavior	Example
`g`	global	Returns additional matches	`/foo/g`
`i`	insensitive	Allows case-insensitive matches	`/foo/i`
`x`	verbose	Ignore whitespace & allow comments	`/foo/x`
`u`	unicode	Expressions are treated as Unicode (UTF-16)	`/foo/u`
`s`	singleline	Treats entire string as one line (allows `.` to match newline)	`/foo/s`
`m`	multiline	Start & end anchors now trigger on each line	`/foo/m`
`n`	no auto-capture	Only named groups capture; plain `(...)` groups do not (supported in .NET and Perl, among others)	`/foo/n`
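Flag spelling varies by language. As a rough sketch of how some of these flags look in Python's `re` module: there is no `g` flag there, so `re.findall` stands in for "global," and the other flags are passed as constants. The sample strings below are made up for illustration.

```python
import re

text = "I scream, you scream, we all SCREAM for ice cream"

# `g` + `i`: findall returns every match, re.IGNORECASE ignores case.
matches = re.findall(r"scream", text, flags=re.IGNORECASE)
# ['scream', 'scream', 'SCREAM']

# `m`: with re.MULTILINE, ^ and $ also match at each line break.
line_starts = re.findall(r"^foo", "foo\nbar\nfoo", flags=re.MULTILINE)
# ['foo', 'foo']

# `s`: with re.DOTALL, . also matches a newline.
dotall = re.search(r"foo.bar", "foo\nbar", flags=re.DOTALL)

# `x`: with re.VERBOSE, whitespace and # comments in the pattern are ignored.
verbose = re.compile(
    r"""
    foo   # the literal text "foo"
    \d+   # followed by one or more digits
    """,
    flags=re.VERBOSE,
)
verbose_match = verbose.search("foo42").group()  # 'foo42'

print(matches, line_starts, dotall is not None, verbose_match)
```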

✏️ "Gibberish" Characters

Now we're on to the meat of regular expressions: selecting characters. In regex, a character can be a letter, digit, or symbol. If you're looking to use regex, chances are you'll include some of these in your string:

Syntax	Character	Matches	Example String	Example Expression	Example Match
`.`	any	Literally any character (except line break)	`a-c1-3`	`a.c`	`a-c`
`\w`	word	ASCII letter, digit, or underscore (Or Unicode word character in Python & C#)	`a-c1-3`	`\w-\w`	`a-c`
`\d`	digit	Digit 0-9 (Or Unicode digit in Python & C#)	`a-c1-3`	`\d-\d`	`1-3`
`\s`	whitespace	Space, tab, vertical tab, newline, carriage return (Or Unicode separator in Python, C#, & JS)	`a b`	`a\sb`	`a b`
`\W`	NOT word	Anything `\w` does not match	`a-c1-3`	`\W`	`-`
`\D`	NOT digit	Anything `\d` does not match	`a-c1-3`	`\D-\D`	`a-c`
`\S`	NOT whitespace	Anything `\s` does not match	`a-c1-3`	`\S-\S`	`a-c`
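If you'd like to check these shorthand classes yourself, the sketch below runs a few of them through Python's `re.search` and `re.findall` against the same `a-c1-3` sample string. The variable names are only for illustration.

```python
import re

sample = "a-c1-3"

any_char = re.search(r"a.c", sample).group()          # 'a-c'
word_pair = re.search(r"\w-\w", sample).group()       # 'a-c'
digit_pair = re.search(r"\d-\d", sample).group()      # '1-3'
non_digit_pair = re.search(r"\D-\D", sample).group()  # 'a-c'

# \W matches anything \w does not, which here is only the two hyphens.
non_word = re.findall(r"\W", sample)                  # ['-', '-']

# \s matches the single space between the letters.
spaced = re.search(r"a\sb", "a b").group()            # 'a b'

print(any_char, word_pair, digit_pair, non_digit_pair, non_word, spaced)
```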

🖋️ "Bafflegab" Special Characters

Regex also allows you to select special characters like tabs or newlines.

Syntax	Special Character	Matches	Example String	Example Expression	Example Match
`\`	escape	Makes the next character literal when it is one of: `[{()}].*+?$^/\`	`)$[]*{`	`\[\]`	`[]`

Syntax	Special Character	Matches
`\n`	newline	A newline character
`\t`	tab	A tab character
`\r`	carriage return	A carriage return character
`\f`	form-feed	A form feed character
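Here is a small Python sketch of escaping in practice. It assumes you sometimes need to match text that is full of special characters; `re.escape` is a standard-library helper that adds the backslashes for you. The `price` and `fields` values are made-up sample data.

```python
import re

# Escaping [ and ] makes them literal instead of opening a range.
brackets = re.search(r"\[\]", ")$[]*{").group()  # '[]'

# re.escape builds a safely escaped pattern from arbitrary text.
price = "$5.00 (approx.)"
escaped = re.escape(price)
found_price = re.search(escaped, "Total: $5.00 (approx.)") is not None  # True

# Without escaping, $ . ( ) keep their special meaning and the search fails.
found_unescaped = re.search(price, "Total: $5.00 (approx.)") is not None  # False

# \t in a pattern matches a tab character.
fields = re.split(r"\t", "name\tgrade\tteacher")  # ['name', 'grade', 'teacher']

print(brackets, found_price, found_unescaped, fields)
```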

🖌️ "Rigmarole" Ranges

Ranges allow you to support several potential matches:

Syntax	Range	Matches	Example String	Example Expression	Example Match
`[pog]`	word list	Either `p`, `o`, or `g`	`awesomePOSSUM123`	`[awesum]+`	`awes`
`[^pog]`	NOT word list	Any character except `p`, `o`, or `g`	`awesomePOSSUM123`	`[^awesum]+`	`o`
`[a-z]`	word range	Any character between `a` and `z`, inclusive	`awesomePOSSUM123`	`[a-z]+`	`awesome`
`[^a-z]`	NOT word range	Any character not between `a` and `z`, inclusive	`awesomePOSSUM123`	`[^a-z]+`	`POSSUM123`
`[0-9]`	digit range	Any character between `0` and `9`, inclusive	`awesomePOSSUM123`	`[0-9]+`	`123`
`[^0-9]`	NOT digit range	Any character not between `0` and `9`, inclusive	`awesomePOSSUM123`	`[^0-9]+`	`awesomePOSSUM`
`[a-zA-Z]`	word range	Any character between `a` and `z` or `A` and `Z`, inclusive	`awesomePOSSUM123`	`[a-zA-Z]+`	`awesomePOSSUM`
`[^a-zA-Z]`	NOT word range	Any character not between `a` and `z` or `A` and `Z`, inclusive	`awesomePOSSUM123`	`[^a-zA-Z]+`	`123`
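The ranges above can be sanity-checked with a short Python loop. This is just a sketch: it collects the first match of each pattern against the same `awesomePOSSUM123` sample into a dictionary (the name `first_matches` is made up for illustration).

```python
import re

sample = "awesomePOSSUM123"
patterns = [
    r"[awesum]+",
    r"[^awesum]+",
    r"[a-z]+",
    r"[^a-z]+",
    r"[0-9]+",
    r"[^0-9]+",
    r"[a-zA-Z]+",
]

# Map each pattern to the first substring it matches.
first_matches = {p: re.search(p, sample).group() for p in patterns}

for pattern, match in first_matches.items():
    print(f"{pattern:12} -> {match}")
```

Note that `[^a-z]+` grabs `POSSUM123`, not just `123`, because uppercase letters also fall outside `a` to `z`.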

There are also a few (mostly) semantically identical patterns in Golang and PHP. These do not appear to be supported in JS or Python:

Syntax	Range	Matches	Example String	Example Expression	Example Match
`[[:alpha:]]`	alpha class	Any character between `a` and `z`, inclusive, not case sensitive	`Woodchuck could chuck 33 wood logs.`	`[[:alpha:]]+`	`Woodchuck`
`[[:digit:]]`	digit class	Any digit 0-9	`Woodchuck could chuck 33 wood logs.`	`[[:digit:]]+`	`33`
`[[:alnum:]]`	alphanumeric class	Any character between `a` and `z`, inclusive, not case sensitive, and any digit 0-9	`Woodchuck could chuck 33 wood logs.`	`[[:alnum:]]+`	`Woodchuck`
`[[:punct:]]`	punctuation class	Any ASCII punctuation character, such as `?!.,:;`	`Woodchuck could chuck 33 wood logs.`	`[[:punct:]]+`	`.`

In some flavors of regex, the above are also called "Character Classes."

🖊️ "Jargon" Quantifiers

Syntax	Quantifier	Matches	Example String	Example Expression	Example Match
`?`	optional	0 or 1 of the preceding expression	`ccc`	`c?`	`c`
`{X}`	X	X of the preceding expression	`ccc`	`c{2}`	`cc`
`{X,}`	X+	X or more of the preceding expression	`ccc`	`c{2,}`	`ccc`
`{X,Y}`	range	Between X and Y of the preceding expression	`ccc`	`c{1,3}`	`ccc`
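A quick way to get a feel for counted quantifiers is to try them on the same `ccc` string in Python. The last line is an extra illustration, assuming a made-up `NNN-NNNN` format, of how `{X}` is typically used for fixed-width input.

```python
import re

optional = re.search(r"c?", "ccc").group()        # 'c'
exactly_two = re.search(r"c{2}", "ccc").group()   # 'cc'
two_or_more = re.search(r"c{2,}", "ccc").group()  # 'ccc'
one_to_three = re.search(r"c{1,3}", "ccc").group()  # 'ccc'

# fullmatch only succeeds if the entire string fits the pattern.
fixed_width = re.fullmatch(r"\d{3}-\d{4}", "555-0199") is not None  # True
too_short = re.fullmatch(r"\d{3}-\d{4}", "55-0199") is not None     # False

print(optional, exactly_two, two_or_more, one_to_three, fixed_width, too_short)
```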

Beyond standard quantifiers, there are a few additional modifiers: greedy, lazy, and possessive.

Syntax	Quantifier	Matches	Example String	Example Expression	Example Match
`*`	0+ greedy	0 or more of the preceding expression, using as many chars as possible	`abccc`	`abc*`	`abccc`
`+`	1+ greedy	1 or more of the preceding expression, using as many chars as possible	`abccc`	`c+`	`ccc`
`*?`	0+ lazy	0 or more of the preceding expression, using as few chars as possible	`abccc`	`abc*?`	`ab`
`+?`	1+ lazy	1 or more of the preceding expression, using as few chars as possible	`abccc`	`c+?`	`c`
`*+`	0+ possessive	0 or more of the preceding expression, using as many chars as possible, without backtracking (Not supported in JS, or in Python before 3.11)	`abccc`	`abc*+`	`abccc`
`++`	1+ possessive	1 or more of the preceding expression, using as many chars as possible, without backtracking (Not supported in JS, or in Python before 3.11)	`abccc`	`c++`	`ccc`

Put simply, greedy quantifiers match as much as possible, lazy as little as possible and possessive as much as possible without backtracking.

What this means in practice is that possessive quantifiers will always return either the same match as greedy quantifiers or, if backtracking is required, no match at all. Therefore, possessive quantifiers should be used when you know backtracking is not necessary, which can improve performance.
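The difference between greedy and lazy is easiest to see on a string with more than one possible stopping point. The sketch below uses a made-up HTML snippet for that, and sticks to greedy and lazy quantifiers since possessive support depends on your Python version.

```python
import re

html = "<b>bold</b>"

# Greedy: .+ grabs as much as it can, then backs off to the LAST '>'.
greedy = re.search(r"<.+>", html).group()   # '<b>bold</b>'

# Lazy: .+? grabs as little as it can, stopping at the FIRST '>'.
lazy = re.search(r"<.+?>", html).group()    # '<b>'

# On "abccc", + needs at least one c, so the match starts at the first c.
greedy_plus = re.search(r"c+", "abccc").group()   # 'ccc'
lazy_plus = re.search(r"c+?", "abccc").group()    # 'c'

# * is happy with zero characters, so a bare c* matches the empty
# string at position 0 before it ever reaches the c's.
bare_star = re.search(r"c*", "abccc").group()     # ''

print(repr(greedy), repr(lazy), repr(greedy_plus), repr(lazy_plus), repr(bare_star))
```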

🖍️ "Gobbledygook" Groups

Groups allow you to pull out specific parts of a match. For example, given the string Peter Piper picked a peck of pickled peppers and the regex /[peck]+ of (\w+)/, the whole match is peck of pickled, and an additional "capturing group," group 1, is returned containing pickled.

By default, the whole match is group 0, and every capturing group after that is numbered n, where n is 1 + the number of the previous capturing group, counting opening parentheses from left to right.
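In Python, group numbering looks roughly like this. It's a sketch using the Peter Piper string; the second pattern adds an extra pair of parentheses (my own variation, not part of the original example) to show how numbering shifts.

```python
import re

text = "Peter Piper picked a peck of pickled peppers"

# One capturing group: group 0 is the whole match, group 1 is (\w+).
one_group = re.search(r"[peck]+ of (\w+)", text)
whole_match = one_group.group(0)  # 'peck of pickled'
group_one = one_group.group(1)    # 'pickled'

# Two capturing groups: they are numbered 1 and 2, left to right.
two_groups = re.search(r"([peck]+) of (\w+)", text)
all_groups = two_groups.groups()  # ('peck', 'pickled')

print(whole_match, group_one, all_groups)
```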

Syntax	Group	Matches	Example String	Example Expression	Example Match
`|`	alternate	Either the preceding or following expression	`truly rural`	`truly|rural`	`truly`
`(...)`	isolate	Everything enclosed; treats as separate capture group	`truly rural`	`truly (rural)`	`truly rural` (group 1: `rural`)
`(?:...)`	include	Everything enclosed, without capturing; enables using quantifiers on part of regex	`truly ruralrural`	`truly (?:rural)+`	`truly ruralrural`
`(?|...)`	combine	Everything enclosed; treats all matches as same group (Perl and PCRE; not supported in JS or Python)	`truly rural`	`(?|(rural)|(truly))`	`truly`
`(?>...)`	atomic	Longest possible string without backtracking (Not supported in JS, or in Python before 3.11)	`truly rural`	`(?>rur)`	`rur`
`(?#...)`	comment	Everything enclosed; treats as comment and ignores	`truly #rural`	`truly (?#rural)`	`truly`
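The group types that Python's `re` module supports on every recent version can be tried out like so. This sketch leaves out the combine and atomic groups, since those depend on the regex flavor or Python version.

```python
import re

# Alternation: the first alternative that matches, scanning left to right.
first_alt = re.search(r"truly|rural", "truly rural").group()  # 'truly'
every_alt = re.findall(r"truly|rural", "truly rural")         # ['truly', 'rural']

# Capturing group: group 0 is the whole match, group 1 is the parenthesized part.
captured = re.search(r"truly (rural)", "truly rural")
capture_whole = captured.group(0)  # 'truly rural'
capture_one = captured.group(1)    # 'rural'

# Non-capturing group: lets + repeat "rural" without creating a group.
non_capturing = re.search(r"truly (?:rural)+", "truly ruralrural")
nc_whole = non_capturing.group(0)  # 'truly ruralrural'
nc_groups = non_capturing.groups() # ()

print(first_alt, every_alt, capture_whole, capture_one, nc_whole, nc_groups)
```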

⚓ "Malarkey" Anchors

Syntax	Anchor	Matches	Example String	Example Expression	Example Match
`^`	start	Start of string	`she sells seashells`	`^\w+`	`she`
`$`	end	End of string	`she sells seashells`	`\w+$`	`seashells`
`\b`	word boundary	Between a character matched and not matched by `\w`	`she sells seashells`	`s\b`	`s`
`\B`	NOT word boundary	Any position that is not a word boundary, such as between two characters matched by `\w`	`she sells seashells`	`\Bells`	`ells`
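Anchors match positions rather than characters, so it can help to look at where they match. In this Python sketch, `re.finditer` is used to list the starting index of every `s` that sits at (or away from) a word boundary.

```python
import re

phrase = "she sells seashells"

first_word = re.search(r"^\w+", phrase).group()  # 'she'
last_word = re.search(r"\w+$", phrase).group()   # 'seashells'

# s\b: an "s" that ends a word (the last letters of "sells" and "seashells").
word_final_s = [m.start() for m in re.finditer(r"s\b", phrase)]      # [8, 18]

# \Bs: an "s" that does NOT start a word (so the middle "s" of "seashells" counts too).
non_initial_s = [m.start() for m in re.finditer(r"\Bs", phrase)]     # [8, 13, 18]

inside_word = re.search(r"\Bells", phrase).group()  # 'ells'

print(first_word, last_word, word_final_s, non_initial_s, inside_word)
```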

There are additional anchors available that are unaffected by multiline mode m.

Syntax	Anchor	Matches	Example String	Example Expression	Example Match
`\A`	multi-start	Start of string	`she sees cheese`	`\A\w+`	`she`
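`\Z`	multi-end	End of string (in most flavors, also just before a final trailing newline; in Python, only the absolute end)	`she sees cheese`	`\w+\Z`	`cheese`
`\z`	absolute end	Absolute end of string, after any trailing newline (not supported in JS; in Python, `\Z` already behaves this way)	`she sees cheese`	`\w+\z`	`cheese`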
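To see why these anchors exist, compare them with `^` and `$` on a two-line string in multiline mode. This sketch uses Python, where `\Z` matches only at the very end of the string; the line break in the sample text is added for illustration.

```python
import re

text = "she sees\ncheese"

# With re.MULTILINE, ^ and $ fire on every line...
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)  # ['she', 'cheese']
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)    # ['sees', 'cheese']

# ...but \A and \Z still only match at the start and end of the whole string.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)  # ['she']
string_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)    # ['cheese']

print(line_starts, line_ends, string_start, string_end)
```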

Regex in the Real World

Regular expressions are an incredibly useful tool for you to have in your programming arsenal. Beyond the regex string I opened this article with, which enabled me to parse class names in a grades app, there are many other applications for parsing strings:

Input Validation

/^.+@.+$/

Emails

/^[a-zA-Z0-9_-]{3,16}$/

Usernames

/^\+?(\d.*){3,}$/

Phone numbers

Metadata

/^(0?[1-9]|[12][0-9]|3[01])([ /-])(0?[1-9]|1[012])\2([0-9][0-9][0-9][0-9])(([ -])([0-1]?[0-9]|2[0-3]):[0-5]?[0-9]:[0-5]?[0-9])?$/

DateTimes

/^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/

Color Hexcodes

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

IPv4 addresses

Those are just a few examples of common applications for regex.
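As a rough sketch of how patterns like these might be wired into input validation in Python: the `is_valid` helper below is a hypothetical name, and the hex color and IPv4 patterns are written the way I'd assume they are meant to read, with `{6}`, `{3}`, and an escaped `\.`. Treat it as a starting point rather than production-grade validation (the email pattern in particular is very permissive).

```python
import re

# Assumed forms of the patterns, compiled once for reuse.
EMAIL = re.compile(r"^.+@.+$")
HEX_COLOR = re.compile(r"^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$")
IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)


def is_valid(pattern: re.Pattern, value: str) -> bool:
    """Return True if `value` matches `pattern` (hypothetical helper)."""
    return pattern.search(value) is not None


checks = {
    "email ok": is_valid(EMAIL, "me@example.com"),
    "email missing @": is_valid(EMAIL, "not-an-email"),
    "hex 6 digits": is_valid(HEX_COLOR, "#1a2B3c"),
    "hex 2 digits": is_valid(HEX_COLOR, "#12"),
    "ipv4 ok": is_valid(IPV4, "192.168.0.1"),
    "ipv4 octet too big": is_valid(IPV4, "256.1.1.1"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```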

Next Steps

You can bookmark a "Regex Cheat Sheet" I created for a workshop in 2021 at github.com/GoldinGuy/UltimateRegexResource.

If you're looking for more ways to practice regex, I created an app, Redoku, which lets you learn the syntax of regular expressions by playing fun and engaging randomly generated regex sudoku puzzles.