Rock With Regex – Learn Regular Expressions in 10 Minutes

July 21, 2021

\b(?!\b(?:ADV|AICE|MYP|PYP|[A-z]2)\b)([A-z])([A-z]*?)\b

The above gobbledygook is a real regex string I used in one of my projects, MyGrades, to identify and format course names. By the end of this article, you'll not only be able to read and understand what it means, but also create regex patterns of your own!

I presented a workshop on this topic in 2021 at a Google DSC Event.

What is Regex?

Regex, or regular expressions, are patterns used to match strings. Regex is commonly used for searching/filtering strings for information, input validation, and web scraping. "Real-world" examples include everything from validating email addresses to formatting class names in a grades app.

Regex is incredibly powerful, but due to its seemingly unintelligible nature, it's also often intimidating to learn and difficult to remember.

But today you're gonna learn it!

Outline

In this article, we'll

  1. Learn the "Balderdash" Basics (of Regex)
  2. Learn Regex Syntax
  3. Discuss Next Steps and Additional Practice

"Balderdash" Basics (of Regex)

How does Regex work?

(Besides potentially making your code unintelligible)

Regex, or regular expressions, are based on logic. Regex follows three primary rules:

  1. Regex engines move from left to right.
  2. Regular expressions start and end with "delimiters." For example, JavaScript regex literals are wrapped in slash characters (/), while Python regex is usually written as a raw string that begins with r" and ends with ". (Python doesn't have regex literals per se, but raw strings let you write a pattern without worrying about string escapes.)
  3. Patterns return the first case-sensitive match they find by default.

Therefore, given the sample string "I scream, you scream, we all SCREAM for ice cream", /scream/ matches only the first instance of "scream."

Another example:

regex string: /mon/

test string: the mopey monkey stole my money

match: mon (the start of monkey; the mon in money is never reached)

This behavior can be modified with flags.
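A minimal Python sketch of the first-match rule described above, using the standard re module. Python has no /.../ literals, so the patterns are raw strings, and the variable names are only illustrative:

```python
import re

text = "I scream, you scream, we all SCREAM for ice cream"

# re.search scans left to right and stops at the first case-sensitive match.
first = re.search(r"scream", text)
print(first.group(), first.start())  # scream 2

# "mopey" contains no "mon", so the first match is the start of "monkey".
mon = re.search(r"mon", "the mopey monkey stole my money")
print(mon.group(), mon.start())  # mon 10
```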

Regex Syntax

AKA: How to parse gibberish

🚩 "Flapdoodle" Flags

Regex includes several flags that are appended to the end of the expression to change behavior. Using the string I scream, you scream, we all SCREAM for ice cream, the updated regex /scream/gi will now return scream scream SCREAM.

Syntax  Flag         Behavior                                                      Example
g       global       Returns additional matches                                    /foo/g
i       insensitive  Allows case-insensitive matches                               /foo/i
x       verbose      Ignore whitespace & allow comments                            /foo/x
u       unicode      Expressions are treated as Unicode (UTF-16)                   /foo/u
s       singleline   Treats entire string as one line (allows . to match newline)  /foo/s
m       multiline    Start & end anchors now trigger on each line                  /foo/m
n       nth match    Matches text returned by nth group                            /foo/n
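In Python, flags are passed as arguments rather than appended to the pattern. Here is a rough sketch of the closest equivalents, assuming re.findall as a stand-in for g; the two-line sample string is made up for illustration:

```python
import re

text = "I scream, you scream, we all SCREAM for ice cream"

# g + i: findall returns every match; IGNORECASE makes it case-insensitive.
all_screams = re.findall(r"scream", text, flags=re.IGNORECASE)
print(all_screams)  # ['scream', 'scream', 'SCREAM']

# m: with MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)
print(line_starts)  # ['one', 'three']

# s: with DOTALL, . also matches a newline.
across_lines = re.search(r"two.three", "one two\nthree four", flags=re.DOTALL)
print(across_lines is not None)  # True

# x: with VERBOSE, whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""
    s c r e a m   # the same pattern as above, spaced out
""", text, flags=re.VERBOSE)
print(verbose.group())  # scream
```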

✏️ "Gibberish" Characters

Now we're on to the meat of regular expressions: selecting characters. In regex, a character can be a letter, digit, or symbol. If you're looking to use regex, chances are you'll include some of these in your string:

Syntax  Character       Matches                                             Example String  Example Expression  Example Match
.       any             Literally any character (except line break)         a-c1-3          a.c                 a-c
\w      word            ASCII letter, digit, or underscore                  a-c1-3          \w-\w               a-c
\d      digit           Digit 0-9                                           a-c1-3          \d-\d               1-3
\s      whitespace      Space, tab, vertical tab, newline, carriage return  a b             a\sb                a b
\W      NOT word        Anything \w does not match                          a-c1-3          \W                  -
\D      NOT digit       Anything \d does not match                          a-c1-3          \D-\D               a-c
\S      NOT whitespace  Anything \s does not match                          a-c1-3          \S-\S               a-c

In Python and C#, \w and \d also match Unicode word characters and digits; in Python, C#, and JS, \s also matches Unicode separators.
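The shorthand characters can be checked quickly in Python. This is a small sketch; first_match is a hypothetical helper that returns the text of the first match:

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Hypothetical helper: return the text of the first match."""
    return re.search(pattern, text).group()


sample = "a-c1-3"

print(first_match(r"a.c", sample))    # a-c
print(first_match(r"\w-\w", sample))  # a-c
print(first_match(r"\d-\d", sample))  # 1-3
print(first_match(r"\D-\D", sample))  # a-c
print(first_match(r"a\sb", "a b"))    # a b

# \W matches anything \w does not, so here it finds the first hyphen.
print(first_match(r"\W", sample))     # -
```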

🖋️ "Bafflegab" Special Characters

Regex also allows you to select special characters like tabs or newlines.

Syntax  Special Character  Matches                                            Example String  Example Expression  Example Match
\       escape             The following when preceding them: [{()}].*+?$^/\  )$[]*{          \[\]                []

Syntax  Substitute       Behavior
\n      newline          Insert a newline character
\t      tab              Insert a tab character
\r      carriage return  Insert a carriage return character
\f      form-feed        Insert a form feed character
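A short Python sketch of escaping in practice. re.escape is a standard-library function that adds the backslashes for you; the sample strings other than )$[]*{ are invented for illustration:

```python
import re

# A backslash turns a special character into a literal one.
brackets = re.search(r"\[\]", ")$[]*{").group()
print(brackets)  # []

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
escaped = re.escape(literal)
print(re.fullmatch(escaped, literal) is not None)  # True

# \t in a pattern matches a tab character.
fields = re.split(r"\t", "name\tgrade")
print(fields)  # ['name', 'grade']
```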

🖌️ "Rigmarole" Ranges

Ranges allow you to support several potential matches:

Syntax    Range            Matches                                              Example String    Example Expression  Example Match
[pog]     word list        Either p, o, or g                                    awesomePOSSUM123  [awesum]+           awes
[^pog]    NOT word list    Any character except p, o, or g                      awesomePOSSUM123  [^awesum]+          o
[a-z]     word range       Any character between a and z, inclusive             awesomePOSSUM123  [a-z]+              awesome
[^a-z]    NOT word range   Any character not between a and z, inclusive         awesomePOSSUM123  [^a-z]+             POSSUM123
[0-9]     digit range      Any character between 0 and 9, inclusive             awesomePOSSUM123  [0-9]+              123
[^0-9]    NOT digit range  Any character not between 0 and 9, inclusive         awesomePOSSUM123  [^0-9]+             awesomePOSSUM
[a-zA-Z]  word range       Any character between a and z or A and Z, inclusive  awesomePOSSUM123  [a-zA-Z]+           awesomePOSSUM
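To see the ranges in action, the sketch below runs each example expression against the same sample string in Python and prints the first match. The dictionary name is only illustrative:

```python
import re

sample = "awesomePOSSUM123"
patterns = [r"[awesum]+", r"[^awesum]+", r"[a-z]+", r"[^a-z]+",
            r"[0-9]+", r"[^0-9]+", r"[a-zA-Z]+"]

# Map each range pattern to the first match it finds in the sample.
first_matches = {p: re.search(p, sample).group() for p in patterns}
for pattern, match in first_matches.items():
    print(f"{pattern:<12} {match}")
```

The first match for [^a-z]+ is POSSUM123, because uppercase letters also fall outside a-z.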

There are also a few (mostly) semantically identical patterns in Golang and PHP. These do not appear to be supported in JS or Python:

Syntax       Range               Matches                                  Example String                       Example Expression  Example Match
[[:alpha:]]  alpha class         Any letter a-z or A-Z                    Woodchuck could chuck 33 wood logs.  [[:alpha:]]+        Woodchuck
[[:digit:]]  digit class         Any digit 0-9                            Woodchuck could chuck 33 wood logs.  [[:digit:]]+        33
[[:alnum:]]  alphanumeric class  Any letter a-z or A-Z, or any digit 0-9  Woodchuck could chuck 33 wood logs.  [[:alnum:]]+        Woodchuck
[[:punct:]]  punctuation class   Any punctuation character, e.g. ?!.,:;   Woodchuck could chuck 33 wood logs.  [[:punct:]]+        .

In some flavors of regex, the above are also called "Character Classes."
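Since Python's built-in re module does not accept the [[:name:]] syntax, one workaround is to spell the classes out by hand. The constants below are hypothetical, ASCII-only stand-ins, not part of any library:

```python
import re
import string

sample = "Woodchuck could chuck 33 wood logs."

# Hand-written stand-ins for the bracketed classes (ASCII only).
ALPHA = r"[A-Za-z]"
DIGIT = r"[0-9]"
ALNUM = r"[A-Za-z0-9]"
PUNCT = "[" + re.escape(string.punctuation) + "]"

alpha = re.search(ALPHA + "+", sample).group()
digit = re.search(DIGIT + "+", sample).group()
alnum = re.search(ALNUM + "+", sample).group()
punct = re.search(PUNCT + "+", sample).group()
print(alpha, digit, alnum, punct)  # Woodchuck 33 Woodchuck .
```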

🖊️ "Jargon" Quantifiers

Syntax  Quantifier  Matches                                      Example String  Example Expression  Example Match
?       optional    0 or 1 of the preceding expression           ccc             c?                  c
{X}     X           X of the preceding expression                ccc             c{2}                cc
{X,}    X+          X or more of the preceding expression        ccc             c{2,}               ccc
{X,Y}   range       Between X and Y of the preceding expression  ccc             c{1,3}              ccc
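The same quantifiers, tried out in Python as a quick sketch. The grade/grades string is an invented example showing that a quantifier only applies to the single item before it:

```python
import re

sample = "ccc"
optional = re.search(r"c?", sample).group()
exactly_two = re.search(r"c{2}", sample).group()
two_or_more = re.search(r"c{2,}", sample).group()
one_to_three = re.search(r"c{1,3}", sample).group()
print(optional, exactly_two, two_or_more, one_to_three)  # c cc ccc ccc

# Only the "s" is optional here, not the whole word.
words = re.findall(r"grades?", "grade grades")
print(words)  # ['grade', 'grades']
```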

Beyond standard quantifiers, there are a few additional modifiers: greedy, lazy, and possessive.

Syntax  Quantifier     Matches                                               Example String  Example Expression  Example Match
*       0+ greedy      0 or more, using as many chars as possible            abccc           bc*                 bccc
+       1+ greedy      1 or more, using as many chars as possible            abccc           bc+                 bccc
*?      0+ lazy        0 or more, using as few chars as possible             abccc           bc*?                b
+?      1+ lazy        1 or more, using as few chars as possible             abccc           bc+?                bc
*+      0+ possessive  0 or more, as many as possible, without backtracking  abccc           bc*+                bccc
++      1+ possessive  1 or more, as many as possible, without backtracking  abccc           bc++                bccc

Possessive quantifiers are not supported in JS; in Python they require version 3.11 or newer.

Put simply, greedy quantifiers match as much as possible, lazy as little as possible, and possessive as much as possible without backtracking.

What this means in practice is that possessive quantifiers will always return either the same match as greedy quantifiers or, if backtracking is required, no match at all. Therefore, possessive quantifiers should be used when you know backtracking is not necessary, which can improve performance.
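A small Python sketch of the three behaviors. The HTML-like string is invented for illustration, and the possessive part is guarded because Python's re module only accepts that syntax from version 3.11 onward:

```python
import re
import sys

html = "<b>bold</b>"

# Greedy runs to the last ">", lazy stops at the first one.
greedy = re.search(r"<.+>", html).group()
lazy = re.search(r"<.+?>", html).group()
print(greedy)  # <b>bold</b>
print(lazy)    # <b>

# Greedy a+ gives one "a" back so the final "a" can match.
backtracks = re.search(r"a+a", "aaa")
print(backtracks.group())  # aaa

# Possessive quantifiers only parse on Python 3.11 or newer.
if sys.version_info >= (3, 11):
    # a++ keeps all three letters and never gives one back, so the match fails.
    possessive = re.search(r"a++a", "aaa")
    print(possessive)  # None
```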

🖍️ "Gobbledygook" Groups

Groups allow you to pull out specific parts of a match. For example, given the string Peter Piper picked a peck of pickled peppers and the regex literal /[peck]+ of (\w+)/, the whole match is peck of pickled, and an additional "capturing group," group 1, is returned containing pickled.

By default, the whole match is group 0, and each capturing group after it is numbered from 1 upward, in the order the groups appear.

Syntax   Group      Matches                                                          Example String    Example Expression   Example Match
|        alternate  Either the preceding or following expression                     truly rural       truly|rural          truly
(...)    isolate    Everything enclosed; treats as separate capture group            truly rural       truly (rural)        truly rural (group 1: rural)
(?:...)  include    Everything enclosed; enables using quantifiers on part of regex  truly ruralrural  truly (?:rural)+     truly ruralrural
(?|...)  combine    Everything enclosed; treats all matches as same group            truly rural       (?|(rural)|(truly))  truly
(?>...)  atomic     Longest possible string without backtracking                     truly rural       (?>rur)              rur
(?#...)  comment    Everything enclosed; treats as comment and ignores               truly #rural      truly (?#rural)      truly
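Here is how the Peter Piper example and two of the group types look in Python, as a minimal sketch. In the re module, group(0) is the whole match and group(1) is the first capturing group:

```python
import re

text = "Peter Piper picked a peck of pickled peppers"

match = re.search(r"[peck]+ of (\w+)", text)
print(match.group(0))  # peck of pickled
print(match.group(1))  # pickled

# Alternation returns whichever branch matches first in the string.
either = re.search(r"truly|rural", "truly rural").group()
print(either)  # truly

# A non-capturing group lets + repeat "rural" without creating group 1.
repeated = re.search(r"truly (?:rural)+", "truly ruralrural")
print(repeated.group(0))  # truly ruralrural
print(repeated.groups())  # ()
```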

⚓ "Malarkey" Anchors

Syntax  Anchor             Matches                                            Example String       Example Expression  Example Match
^       start              Start of string                                    she sells seashells  ^\w+                she
$       end                End of string                                      she sells seashells  \w+$                seashells
\b      word boundary      Between a character matched and not matched by \w  she sells seashells  s\b                 s
\B      NOT word boundary  Between two characters matched by \w               she sells seashells  \Bsh\w+             shells
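A quick Python sketch of the four anchors on the same sentence. The \Bsh\w+ pattern is an extra example chosen to show \B skipping the "sh" that starts a word:

```python
import re

text = "she sells seashells"

start = re.search(r"^\w+", text).group()
end = re.search(r"\w+$", text).group()
print(start, end)  # she seashells

# s\b needs an "s" that ends a word; the first one closes "sells".
boundary = re.search(r"s\b", text)
print(boundary.group(), boundary.start())  # s 8

# \B is the opposite: "sh" must sit inside a word, so "she" is skipped.
inside = re.search(r"\Bsh\w+", text).group()
print(inside)  # shells
```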

There are additional anchors available that are unaffected by multiline mode m.

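Syntax  Anchor        Matches                                     Example String   Example Expression  Example Match
\A      multi-start   Start of string                             she sees cheese  \A\w+               she
\Z      multi-end     End of string, ignoring a trailing newline  she sees cheese  \w+\Z               cheese
\z      absolute end  Absolute end of string                      she sees cheese  \w+\z               cheese

Flavors differ here: the table follows PCRE-style engines, whereas in Python \Z is the absolute end of the string, and JS supports none of these anchors.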
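The difference from ^ and $ only shows up on multi-line input. The sketch below uses Python, where \Z means the absolute end of the string; the two-line sample string is an assumption made for the demo:

```python
import re

text = "she sees\ncheese"

# In multiline mode, ^ and $ match at every line break.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(line_starts)  # ['she', 'cheese']
print(line_ends)    # ['sees', 'cheese']

# \A and \Z ignore the flag and only match at the ends of the whole string.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)
string_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)
print(string_start)  # ['she']
print(string_end)    # ['cheese']
```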

Regex in the Real World

Regular expressions are an incredibly useful tool for you to have in your programming arsenal. Beyond the regex string I opened this article with, which enabled me to parse class names in a grades app, there are many other applications for parsing strings:
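Before moving on to other applications, the opening pattern can be unpacked with verbose mode. The sketch below copies the pattern exactly as printed at the top of the article and annotates it in Python. The title_case_course helper and the sample course names are hypothetical, since the article does not show how MyGrades applies the pattern. It also relies on a negative lookahead, (?!...), which succeeds only when the enclosed expression does not match at the current position.

```python
import re

# The opening pattern, copied as printed and spread out with re.VERBOSE.
COURSE_WORD = re.compile(r"""
    \b                                   # start at a word boundary
    (?!                                  # negative lookahead: reject the word if it is
        \b(?:ADV|AICE|MYP|PYP|[A-z]2)\b  #   a listed acronym, or a letter followed by 2
    )
    ([A-z])                              # group 1: the first letter of the word
    ([A-z]*?)                            # group 2: the remaining letters, matched lazily
    \b                                   # stop at the next word boundary
""", re.VERBOSE)


def title_case_course(name: str) -> str:
    """Hypothetical helper: capitalize each word except the skipped acronyms."""
    return COURSE_WORD.sub(lambda m: m.group(1).upper() + m.group(2).lower(), name)


print(title_case_course("ADV ALGEBRA HONORS"))     # ADV Algebra Honors
print(title_case_course("AICE english language"))  # AICE English Language
```

One detail worth knowing: [A-z] spans every ASCII code point from A to z, which also includes [, \, ], ^, _ and `. [A-Za-z] is the stricter choice if only letters should match.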

Input Validation

/^.+@.+$/

Emails

/^[a-zA-Z0-9_-]{3,16}$/

Usernames

/^\+?(\d.*){3,}$/

Phone numbers

Metadata

/^(0?[1-9]|[12][0-9]|3[01])([ /-])(0?[1-9]|1[012])\2([0-9][0-9][0-9][0-9])(([ -])([0-1]?[0-9]|2[0-3]):[0-5]?[0-9]:[0-5]?[0-9])?$/

DateTimes

/^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/

Color Hexcodes

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

IPv4 addresses

Those are just a few examples of common applications for regex.
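As a rough sketch of how such patterns might be wired into input validation in Python, the snippet below registers three of them in a dictionary. The VALIDATORS table, the is_valid helper, and the sample inputs are all hypothetical. The hex color and IPv4 patterns assume the {6}, {3}, and escaped-dot forms in which these expressions are normally written.

```python
import re

# Patterns adapted from the list above, with quantifier braces written out in full.
VALIDATORS = {
    "email": re.compile(r"^.+@.+$"),
    "hex color": re.compile(r"^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$"),
    "ipv4": re.compile(
        r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
    ),
}


def is_valid(kind: str, value: str) -> bool:
    """Return True if value matches the pattern registered for kind."""
    return VALIDATORS[kind].search(value) is not None


print(is_valid("email", "reader@example.com"))  # True
print(is_valid("hex color", "#1a2B3c"))         # True
print(is_valid("hex color", "#12"))             # False
print(is_valid("ipv4", "192.168.0.1"))          # True
print(is_valid("ipv4", "256.1.1.1"))            # False
```

The email pattern is deliberately loose: it only checks for something on each side of an @, so stricter checks would need more than this one expression.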


Next Steps

You can bookmark a "Regex Cheat Sheet" I created for a workshop in 2021 at github.com/GoldinGuy/UltimateRegexResource.

If you're looking for more ways to practice regex, I created an app, Redoku, which lets you learn the syntax of regular expressions by playing fun and engaging randomly generated regex sudoku puzzles.

Thanks for reading :)
