Regex & Google Analytics: Working with Keywords

Due to the deluge of both work and extra posts, it’s been a while since my last regex post. I hope to change that this week with the next post in our regex-for-beginners series.

A few weeks ago Michael asked:

Would you be able to give examples of using RegEx to create a brand keywords only segment when the brand contains more than a single word?

Cardinal Path would be a good example to use as people might search on a variety of brand terms such as ‘cardinal path’, ‘cardinalpath’, ‘cardinalpath.com’, etc.

Is that possible using RegEx?

Absolutely. In fact, it’s pretty easy. The basic thing to keep in mind is alternation: (keyword1|keyword2|keyword3) matches any one of the listed keywords. In this post I’ll take you through the basics of filtering out multi-word keywords. This can be done using inline filters or with advanced segments.

Filtering Keywords

Sometimes you really don’t want to count certain keywords, like when you want to distinguish between branded and non-branded terms.

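Not matching a string in regex is simple enough: ^((?!cardinal).)*$

For those of you who aren’t familiar with ?!, it is a negative lookahead and essentially means “not followed by”. The inner group matches any single character, but only at a position where the text cardinal does not begin. That group is repeated 0 or more times, and the ^ and $ anchors force it to cover the whole keyword. (Without the anchors the pattern is happy to match an empty string, so it would match everything.)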
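If you want to see what that expression does before we move on, here is a minimal sketch in Python. This runs in Python’s re module, not in Google Analytics, and the keyword list is made up for illustration.

```python
import re

# Anchored "does not contain cardinal" pattern from the text above.
NOT_BRAND = re.compile(r"^((?!cardinal).)*$")

# Same pattern without anchors, to show why the anchors matter.
UNANCHORED = re.compile(r"((?!cardinal).)*")

# Hypothetical keywords, invented for this example.
keywords = [
    "cardinal path",
    "regex tutorial",
    "google analytics filters",
    "cardinalpath.com",
]

# Keep only the keywords the anchored pattern accepts.
non_brand = [kw for kw in keywords if NOT_BRAND.search(kw)]
print(non_brand)

# The unanchored version "matches" a brand keyword too, because it can
# match zero characters at the very start of the string.
unanchored_matches_brand = UNANCHORED.search("cardinal path") is not None
print(unanchored_matches_brand)
```

Only the two non-brand keywords survive the anchored pattern, while the unanchored one matches even ‘cardinal path’ by matching nothing at all.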

Of course this is not the right way to do it in Google Analytics, which doesn’t support lookaheads and will proceed to throw a hissy fit at you. I mean, Google already has an exclude rule, so why are you using a RegEx exclude? (Well, because I told you to, of course.)

That’s not a huge deal, since Google offers a far simpler way of doing negations: the “exclude” rule. You don’t even need RegEx for this; just put the keyword in and choose exclude.

But let’s say that, in some strange situation (like analyzing branded traffic), you actually want to remove more than just that one keyword. For example, your top keywords might include a misspelling that is bringing in a bunch of traffic (how odd…).

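So let’s try this: (cardinal|czrdinal). Just set the filter to exclude and regex. Because the regex only has to match part of the keyword, this one pattern covers ‘cardinal path’, ‘cardinalpath’ and ‘cardinalpath.com’ alike.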

Now just for fun, let’s say that what I am looking for is what kind of topics people are coming to our site for. I don’t want our brand, and I don’t want people looking for Michael Straker or myself. So we could extend this as (cardinal|czrdinal|straker|clark). That takes out a bunch of names, and we could add more (for instance, we would probably want to remove Booth).
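To sanity-check an exclude pattern like this outside of Google Analytics, a rough Python sketch can help. The sample keywords and the exclude_matching helper are hypothetical, and the case-insensitive flag is my assumption rather than a statement about how your particular filter is configured.

```python
import re

# The alternation from the post. re.IGNORECASE is an assumption made
# for this sketch; check how your own filter treats case.
EXCLUDE = re.compile(r"(cardinal|czrdinal|straker|clark)", re.IGNORECASE)


def exclude_matching(keywords, pattern=EXCLUDE):
    """Return the keywords that the pattern does NOT match anywhere."""
    return [kw for kw in keywords if not pattern.search(kw)]


# Hypothetical keyword report, invented for this example.
sample = [
    "cardinal path",
    "cardinalpath.com",
    "czrdinal path",
    "michael straker",
    "kent clark analytics",
    "regex google analytics",
    "advanced segments tutorial",
]

topics = exclude_matching(sample)
print(topics)
```

Everything containing a brand term, the misspelling, or one of the names drops out, leaving only the topic keywords.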

Filtering Pages

Finally, in my case I’m really only interested in what comes to the blog. If we had a normal structure (i.e. all blog posts fell under /blog/) we could create a segment that includes all pages matching the regex ^/blog/ and be done with it, but in our case our posts all fall under the root.

Our site only has the following non-blog pages, so it might be easiest just to exclude them. They are:

/
/webinars
/contact
/for-clients
/what-we-do
/who-we-are
/training

Easy enough to do straight in the advanced segments panel. However, we don’t want that; it’s way too easy. We could use the following regex in theory: /(webinars|contact|for-clients|what-we-do|who-we-are|training)?

A slash followed by any one of a series of keywords, included 1 or 0 times. Easy, no?

No, actually. That won’t work. The reason is simple: the group is optional and nothing anchors the pattern, so it will match anything that has a /. Ooops. Easy fix.

^/(webinars|contact|for-clients|what-we-do|who-we-are|training)?$

There we go. Match any page that starts with a slash, followed by at most one of those, and then ends. In theory you could leave out the caret, but then a path like /some-post/ or /some-post/contact would match too, so the added specificity works in our favour.
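Here is a small Python sketch comparing the three variants: no anchors, only the trailing $, and both anchors. The page paths are made up for illustration, and Python’s re.search stands in for the partial matching Google Analytics does.

```python
import re

GROUP = r"(webinars|contact|for-clients|what-we-do|who-we-are|training)?"

UNANCHORED = re.compile(r"/" + GROUP)        # first attempt
NO_CARET = re.compile(r"/" + GROUP + r"$")   # only the end is anchored
ANCHORED = re.compile(r"^/" + GROUP + r"$")  # the fixed version

# Hypothetical page paths, invented for this example.
pages = [
    "/",
    "/webinars",
    "/training",
    "/regex-for-beginners",
    "/contact-form-tips",
    "/regex-for-beginners/",
]

unanchored_hits = [p for p in pages if UNANCHORED.search(p)]
no_caret_hits = [p for p in pages if NO_CARET.search(p)]
anchored_hits = [p for p in pages if ANCHORED.search(p)]

# Excluding the anchored matches leaves what we treat as blog pages.
blog_pages = [p for p in pages if not ANCHORED.search(p)]

print(unanchored_hits)
print(no_caret_hits)
print(anchored_hits)
print(blog_pages)
```

The unanchored pattern hits every path, the version without the caret still catches the blog post with a trailing slash, and only the fully anchored one is limited to the non-blog pages.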

Also a note: when setting up an advanced segment like this, make sure to test it. Go through your pages and make sure there are no misspellings or odd query strings letting false positives through.

Hopefully this gives you an idea of how useful RegEx can be when dealing with Google Analytics.

edit: Just a shout out to an amazing regex tool that I use when writing these posts: regexpal.

This entry was posted in Technology, Web Analytics.
  • Stephane Hamel

    Great post Kent – regex is like the Swiss knife of analysts!

  • Lukas

    “Easy enough to do straight in the advanced segments panel. However, we don’t want that, it’s way too easy. We could use the following regex in theory:  /(webinars|contact|for-clients|what-we-do|who-we-are|training)?

    A slash followed by any of a series of keywords included 1 or 0 times. Easy, no?”

    Why would that include anything that has a slash? Is it because the ? refers only to the stuff in brackets before it?

    • Kent Clark

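      Yes, the ? only applies to the preceding element. A set within brackets counts as one element, so the ? affects only the expressions inside it. The slash stays mandatory while everything after it is optional, which is why any path containing a / matches.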

      So, for instance:
      ^/go+$ would match:
      /go
      /goooooooo

      on the other hand, ^/(go)+$ would match:
      /go
      /gogogo

      Due to some logical oddities like this, I suggest checking any regex over in regexpal (http://www.regexpal.com). 

  • Phil Pearce

    @Kent

    May I suggest a slight improvement on the above…

    ^/(webinars|contact|for-clients|what-we-do|who-we-are|training)?(/index.php|/)?($|[?].+)

    e.g. Match these variations:
    /cat
    /cat/
    /cat/index.php
    /cat/index.php?param=test

    Thanks

    Phil.

    Note 1: [?] or \? can be used. I prefer [?] as it is easier to read.
    Note 2: I am assuming that thank-you pages are /cat/thankyou, not /cat/index.php?submit=true

    • Kent Clark

      Good point! Wasn’t thinking about thank you pages.

      And this is why you go test your segment afterwards to make sure nothing is slipping through. (talk about not drinking my own koolaid)

  • Palashi

    Hiya, 

    I have been trying to build a conversion segment for MCF with the following conversion path:
     * First interaction is a non-brand organic visit
     * All other touch points are any channel, but if they are PPC/Organic, they -must- be brand terms
    However, we have not been able to find a regular expression setup that achieves this configuration. Would it be possible to do so using the information from the above post?

    Any Help – much appreciated 

    • Kclark

      So I spent a good chunk of today working on this problem and am happy to say that I think I figured it out. Look for a post on it this Friday.

Copyright © 2014, All Rights Reserved. Privacy and Copyright Policies.