Sprechen Du RegEx? A Beginner’s Guide to RegEx

One of my favorite things to write about, back in the VKI days, was RegEx. It is an incredibly useful tool for everyone from people writing simple find-and-replace scripts in Notepad++ to server admins redirecting pages, and it is one of those tools that you really should be familiar with if you work in our industry.

Sprechen Du Regex?

Regex commands can vary in complexity from simple to brain-meltingly complex depending on how much “language” (and more importantly, logic) you use with them. The following is a hefty (but not complete) selection of regex terms:

. : The period is a wild card. It can represent any single character whatsoever (in most engines, anything except a line break).

+ : Repeats the previous character (or group) 1 or more times.

* : Repeats the previous character (or group) 0 or more times.

() : Parentheses represent a set of “tokens” or rule elements. For instance, (.+) would match any set of characters. This allows you to apply an operator to an entire group. So for instance, if you wanted to match the word “what” you would type “what”, but if you wanted it to also catch “whatwhat” then you could use “(what)+”.

Parentheses also create a “back reference”, which can be recalled with a special symbol in many regex engines (in Google Analytics, for instance, you would use $).

[] : Square brackets represent a “character class”, and are often used for ranges. For instance, [a-t] would match any lower case letter between a and t. You can also have multiple items within a bracket, such as [a-zA-Z0-9\s\-#"=] which would match any single letter, number, whitespace character, hyphen, number sign, quotation mark, or equals sign. (Yes, this would be better written [\w\s\-#"=], although \w also lets in the underscore, but I was making a point about ranges.)

{} : Curly brackets are odd. They define repetition. So (what){2} would only match two repetitions of what (whatwhat). Alternatively, (what){2,7} would match between two and seven repetitions of what (including 3, 4, 5, and 6 repetitions).
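The operators so far can be tried out in a few lines. This is a minimal sketch using Python's built-in `re` module, which is an assumption on my part (other engines differ in small ways); the helper name `fits` is made up for illustration.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """Return True when the WHOLE string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Wild card and simple repetition
print(fits(r"wh.t", "whit"))              # True: the period stands in for "i"
print(fits(r"ab+c", "ac"))                # False: + needs at least one "b"
print(fits(r"ab*c", "ac"))                # True: * allows zero "b"s

# Groups with repetition
print(fits(r"(what)+", "whatwhat"))       # True
print(fits(r"(what){2}", "what"))         # False: exactly two repetitions needed
print(fits(r"(what){2,7}", "what" * 5))   # True: five is between two and seven
print(fits(r"(what){2,7}", "what" * 8))   # False: eight is too many

# Character class with a range
print(fits(r"[a-t]", "k"))                # True
print(fits(r"[a-t]", "z"))                # False: z is outside a-t
```

`re.fullmatch` is used here so that each result depends on the entire string; a plain search would also report a hit when the pattern is found anywhere inside a longer string.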

\d : Represents any digit

\s : Represents any whitespace element (space, tab, etc.)

\w : Represents any alphanumeric character or underscore

\D \S \W : Negation of the above, so not a digit, not a white space, etc.
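Here is a small sketch of the digit, whitespace, and word shorthands (written with their backslashes, as in `\d`, `\s`, `\w`), again assuming Python's `re` module. The sample string is invented purely for illustration.

```python
import re

sample = "Order 66 shipped_to Bay 7"

digits = re.findall(r"\d+", sample)       # runs of digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
non_digits = re.findall(r"\D+", sample)   # runs of anything that is NOT a digit
pieces = re.split(r"\s+", sample)         # split wherever whitespace appears

print(digits)      # ['66', '7']
print(words)       # ['Order', '66', 'shipped_to', 'Bay', '7']
print(non_digits)  # ['Order ', ' shipped_to Bay ']
print(pieces)      # ['Order', '66', 'shipped_to', 'Bay', '7']
```

Note that `shipped_to` stays in one piece because the underscore counts as a word character.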

$ : Dollar sign matches the end of a string. In htaccess it can also be used to recall sets that have been previously defined by parentheses.

^ : The caret has two purposes. It can match the start of a string, but it can also negate the characters in a character set. So ^[a-z]$ will only match a string that consists of exactly one lower case letter, while [^a-z] will match any single character that is not a lower case letter. Searching with [^a-z], aaa will not match, aAa will match (on the A), and AAA will match.

- : A hyphen creates a range inside square brackets. For instance, [a-z] would match any character from a to z (though not any uppercase characters).
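The two jobs of the caret are easy to mix up, so here is a short sketch of both, assuming Python's `re` module. The variable names are made up for illustration.

```python
import re

# Caret as an anchor: the whole string must be one lower case letter.
single_lower = re.compile(r"^[a-z]$")

# Caret inside brackets: any one character that is NOT a lower case letter.
not_lower = re.compile(r"[^a-z]")

print(bool(single_lower.search("a")))    # True
print(bool(single_lower.search("ab")))   # False: two characters
print(bool(single_lower.search("A")))    # False: upper case

print(bool(not_lower.search("aaa")))     # False: nothing but lower case letters
print(bool(not_lower.search("aAa")))     # True: the "A" is not lower case
print(bool(not_lower.search("AAA")))     # True
```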

| : The bar stands for “or”. So a|b will match a or b.
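A quick sketch of the bar in action, assuming Python's `re` module. Wrapping the alternatives in parentheses limits how far the "or" reaches; the gray/grey example is my own illustration.

```python
import re

# Plain alternation: the whole string must be "a" or "b".
print(bool(re.fullmatch(r"a|b", "a")))   # True
print(bool(re.fullmatch(r"a|b", "b")))   # True
print(bool(re.fullmatch(r"a|b", "c")))   # False

# Alternation inside a group: only the middle letter varies.
spelling = re.compile(r"gr(a|e)y")
print(bool(spelling.fullmatch("gray")))  # True
print(bool(spelling.fullmatch("grey")))  # True
print(bool(spelling.fullmatch("groy")))  # False
```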

\ : The backslash means “literally”. So while “.” would match any character, “\.” would only match periods. Similarly, while “?” on its own is an operator (see below), “\?” would match a literal question mark. In certain implementations of regex (e.g. Notepad++) the backslash can also be used with numbers (\1, \2, etc.) to recall areas that have previously been defined by parentheses (same as $1, $2, etc. in htaccess).
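The sketch below shows both uses of the backslash: escaping a special character, and recalling a parenthesized group by number. It assumes Python's `re` module, which writes back references as `\1`, `\2` (the same style as Notepad++); the sample strings are invented.

```python
import re

# Unescaped period: matches every character. Escaped: only real periods.
any_char = re.findall(r".", "a.b")
only_dots = re.findall(r"\.", "a.b")
print(any_char)    # ['a', '.', 'b']
print(only_dots)   # ['.']

# An escaped question mark is just a question mark.
marks = re.findall(r"\?", "Really? Yes.")
print(marks)       # ['?']

# Back reference inside the pattern: find a word repeated twice in a row.
doubled = re.search(r"(\w+) \1", "it was was fine")
print(doubled.group(0))   # was was

# Back references in a replacement: swap the two captured numbers.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")
print(swapped)     # 20-10
```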

? : Question marks have a lot of uses. Following an expression, it makes that expression optional, matching a string that does or does not contain it. So for example “^(1080 )?Howe st” would match “1080 Howe st.” or “Howe st.” but not “64 Howe st.”, while “64?” would match “6” or “64”. The question mark also has the dual purpose of making a repetition “lazy” when placed after it, as in .*? (normally regex is greedy). Greed and laziness make my head hurt so I’ll just leave this one to LunaMetrics (good greed and bad greed).
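Both uses of the question mark can be seen in this sketch, assuming Python's `re` module. The street-address pattern is written with a parenthesized group and a leading ^ anchor, which is my reading of the intent; the HTML snippet is made up to show greedy versus lazy matching.

```python
import re

# Optional group: "1080 " may or may not be present at the start.
address = re.compile(r"^(1080 )?Howe st")
print(bool(address.search("1080 Howe st.")))  # True
print(bool(address.search("Howe st.")))       # True
print(bool(address.search("64 Howe st.")))    # False: "64 " is not allowed

# Optional single character: the 4 is optional.
print(bool(re.fullmatch(r"64?", "6")))        # True
print(bool(re.fullmatch(r"64?", "64")))       # True

# Greedy vs lazy: .* grabs as much as possible, .*? as little as possible.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```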

(?i) : I said question marks have a lot of uses. This command turns on case insensitivity. So oh (?i)my gosh will match “oh my gosh” and “oh MY GOSH”.

(?-i) : Yep, a negative sign. Reverses what (?i) does, turning case insensitivity back off (yay double negatives). Think of (?i) and (?-i) as HTML’s <> and </> and you’ll have the idea.
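Engines differ on where these switches may appear. As a hedged sketch in Python's `re` module (my choice of engine, not the article's): Python accepts (?i) at the very start of a pattern, and since version 3.11 rejects it mid-pattern, so the "on here, off there" effect is written as a scoped group, `(?i:...)`, available from Python 3.6.

```python
import re

# Whole pattern case-insensitive: flag at the very start.
whole = re.compile(r"(?i)oh my gosh")
print(bool(whole.search("OH MY GOSH")))    # True

# Only part of the pattern case-insensitive: scoped group.
# "oh" must be lower case; "my gosh" may be any case.
partial = re.compile(r"oh (?i:my gosh)")
print(bool(partial.search("oh my gosh")))  # True
print(bool(partial.search("oh MY GOSH")))  # True
print(bool(partial.search("OH MY GOSH")))  # False: "OH" is outside the group
```

The closing parenthesis of the scoped group plays the role of (?-i): case sensitivity returns as soon as the group ends.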

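(?=) : A positive lookahead. It matches the preceding expression only when it is immediately followed by the expression after the equals sign, without including that following part in the match. So in “oh my g… OH MY GOSH”, G(?=O) would match the G in “GOSH”.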
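A lookahead is easier to see than to describe. In this sketch (assuming Python's `re` module, with three plain dots standing in for the ellipsis in the sample text) the match is only the G, even though the O has to be there for the match to succeed.

```python
import re

text = "oh my g... OH MY GOSH"

# G, but only when the very next character is O.
match = re.search(r"G(?=O)", text)

print(match.group(0))                        # G  (the O is checked, not consumed)
print(match.start() == text.index("GOSH"))   # True: it is the G of "GOSH"

# No O after the G means no match at all.
print(re.search(r"G(?=O)", "GAS"))           # None
```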

Got all of that remembered? No? I doubt anyone does.

Sprechen Sie Regex?

So how can we use this? Here’s a neat trick.

Say you want to know if there is a behavioral difference between people using longer keyphrases and people using shorter ones. One might assume that longer keyphrases would convert more, since they are more specific, and there is a greater chance that the user is finding exactly what they were looking for. But why on earth would you assume when you have analytics?

Fortunately, a commenter on Avinash Kaushik’s blog has a neat trick for doing this using regular expressions.

Make a new advanced segment with ‘keyword’ ‘matching RegEx’ and input one of the following:

  • ^\s*[^\s]+\s*$ – one keyword
  • ^\s*[^\s]+(\s+[^\s]+){1}\s*$ – two keywords
  • ^\s*[^\s]+(\s+[^\s]+){2}\s*$ – three keywords
  • ^\s*[^\s]+(\s+[^\s]+){3}\s*$ – four keywords

So this reads as:

Start of line (^), then any whitespace repeated zero or more times (\s*), followed by not-a-whitespace ([^\s]) one or more times (+), followed by whitespace zero or more times (\s*), then end of line ($). If you want more than one keyword, you add a group that matches whitespace plus another word, (\s+[^\s]+), and put a repeat ({number}) on it. Repeat once for two keywords, twice for three, etc.

You can also do ranges such as:

  • ^\s*[^\s]+(\s+[^\s]+){1,4}\s*$ – two to five keywords
  • ^\s*[^\s]+(\s+[^\s]+){5,}\s*$ – six or more keywords
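Before pasting these into an advanced segment, it can help to check them against a few sample keyphrases. The sketch below does that with Python's `re` module, which is an assumption: Google Analytics runs its own regex engine, so treat this as a sanity check rather than a guarantee. The `SEGMENTS` dictionary, the `matching_segments` helper, and the sample phrases are all made up for illustration.

```python
import re

# Word-count patterns from the list above, written with their backslashes.
SEGMENTS = {
    "one keyword": r"^\s*[^\s]+\s*$",
    "two keywords": r"^\s*[^\s]+(\s+[^\s]+){1}\s*$",
    "three keywords": r"^\s*[^\s]+(\s+[^\s]+){2}\s*$",
    "two to five keywords": r"^\s*[^\s]+(\s+[^\s]+){1,4}\s*$",
    "six or more keywords": r"^\s*[^\s]+(\s+[^\s]+){5,}\s*$",
}

def matching_segments(keyphrase: str) -> list:
    """Return the name of every segment whose pattern matches the keyphrase."""
    return [name for name, pattern in SEGMENTS.items()
            if re.search(pattern, keyphrase)]

print(matching_segments("regex"))
# ['one keyword']
print(matching_segments("  beginner regex guide "))
# ['three keywords', 'two to five keywords']
print(matching_segments("how to count words in a keyphrase"))
# ['six or more keywords']
```

The second sample shows why the leading and trailing \s* matter: stray spaces around the phrase do not change the word count.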

There you go. Try those out and let us know how you find longer phrases affect site metrics.

Keep an eye on the blog over the next couple of weeks as we post more Regex tips and tricks that you can use both within Google Analytics and other Regex engines.

Kent Clark

Some have compared him to the Dalai Lama, others to Kublai Khan. When he isn't teaching third world children how to purify water with nothing more than a plastic bottle and a garden hose, he is creating mad waves for surfers off the west coast with little more than a paddle. Some say there is a boat involved, others that he walks on water. Little is known about his background. He appeared from nowhere 15 years ago and claims heritage from a land with neither want nor need. He makes little comment, stating only that it was a pretty cool place. Fire does not burn him, cold does not hurt him. Words could... but they don't. When he passes, petals fall off branches. When he speaks, hair tugs at skin, pulling just slightly in his direction. He does not sleep but he does dream. He has muscled his way into the lives of the famous and whispered his way into their hearts. And in the wee hours he plays oboe softly, as if to soothe the night to sleep.

Published by
Kent Clark
