
KEYWORD (NOT PROVIDED) & THE END OF SEO DATA (PART 2 OF 2)

So, how serious has the keyword (not provided) issue become? Answer: Increasingly serious, and some project the “end of data” in September 2014 (see image below).

In part 1 of this blog post, we looked at the long tail distribution of keywords, audience segmentation, and “hot tips” to help you manage keyword (not provided). Now we will:

  1. Discuss the future of SEO and what tools remain in the toolbox.
  2. Demonstrate the degree to which our “known” keywords can help us estimate the distribution of unknown keywords (not provided).
  3. Provide examples and a calculator to be used with your own organic keyword data, which will help make your keyword reporting bulletproof for your SEO clients.
  4. Examine factors that increasingly erode our organic keywords data (almost to extinction).
  5. Highlight the new and growing threat of Medium (not provided).

Keyword Not Provided Trend

Credit: http://www.notprovidedcount.com/

Ironically, when we had all of our organic keywords, we took this data for granted (partly because long tail keyword analysis is difficult). Currently most sites have about half of this keyword data available (50% of organic keywords are “not provided”), and in the future we may need to mine our older historical data to get SEO insights about our Web presence and our competitive position. In this blog we will also examine the reality of this issue, and demonstrate that even with a large proportion of organic keywords (not provided), you can still get high-quality keyword insights that are statistically valid.

After the “end of data” we will need to use 1) Webmaster Tools and 2) AdWords data to get this “visitor intent” data, and both integrate with Google Analytics, so we are all good on that front. A post by Claire Broadley also references a custom Google Analytics filter to replace the “not provided” keywords with the landing page URL (URI) – not a great idea unless you are using an alternate Google Analytics profile with this filter applied, because replacing “not provided” keywords with landing page URIs can “muddy the waters” for the remaining keywords you have (e.g. keywords you bought in PPC/CPC AdWords, and the remaining organic keywords). A better alternative is to use a secondary dimension in Google Analytics, so you can see the landing pages where the keyword is not provided (without corrupting the source data).

Keyword (Not Provided) Analysis Using Secondary Dimension of Landing Page in Google Analytics

SEARCH ENGINE OPTIMIZATION INSIGHTS & STRATEGIES

Landing-page based keyword analysis requires that you first perfect your SEO (technical and semantic) so that search engines algorithmically match visitor queries to each of your landing pages as contextually appropriately as possible (minimizing misaligned traffic). It is a good idea to use “non-bounced” visits in conjunction with the secondary dimension shown above in order to strip out the “misaligned traffic” (when search engines direct the wrong visitors to our content by mistake – there will always be a percentage of traffic like this).

At the core of SEO analysis we will need to:

  • Use landing pages for organic insights; you need URLs (URIs) that are human-readable (not pageID=meaninglessNumber), and remember to consider non-bounced traffic where the keyword is not provided, because it is likely better-aligned traffic (see the sketch after this list).
  • Build your “compelling content” around the questions your audiences/customers are asking; this targets relevant audiences with relevant answers/information.
  • Build content around the competitive landscape of keyword demand trends using tools like the AdWords keyword tool, the Bing keyword tool, SEM Rush, and Keyword Spy.
  • Look at your competitors’ websites to see what keywords and context they are SEO’ing around in their content strategy. Thanks to Bethany Bey for suggesting this.
  • Use AdWords data to identify the “Matched Search Query” (same as the organic “keyword”).
  • Finally, Webmaster Tools is the new SEO “golden goose”, as it provides organic keyword insights – even for those visitors who do not visit your website after an organic search – which allows you to identify higher-volume queries that you are not ranking for (but someone else is). Keyword “impression” and “click-through rate (CTR)” data from Webmaster Tools is powerful. Webmaster Tools provides this data for the last 90 days, which implies ongoing work and analysis around this changing data.
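
To make the non-bounced, landing-page view of (not provided) traffic concrete, here is a minimal Python sketch. It assumes a CSV export of the organic keywords report with a landing page secondary dimension; the file name and column names (Keyword, Landing Page, Visits, Bounces) are hypothetical, so adjust them to match your own export.

  import pandas as pd

  # Hypothetical export of the organic keywords report with a
  # "Landing Page" secondary dimension; adjust names to your own data.
  df = pd.read_csv("organic_keywords_export.csv")

  # Keep only (not provided) rows and strip out bounced visits, since
  # non-bounced traffic is more likely to be well-aligned traffic.
  not_provided = df[df["Keyword"] == "(not provided)"].copy()
  not_provided["Non-bounced Visits"] = not_provided["Visits"] - not_provided["Bounces"]

  # Rank landing pages by the volume of well-aligned (not provided) traffic.
  by_landing_page = (
      not_provided.groupby("Landing Page")["Non-bounced Visits"]
      .sum()
      .sort_values(ascending=False)
  )
  print(by_landing_page.head(20))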

HOW GOOD IS THE ORGANIC KEYWORD DATA THAT WE HAVE NOW?

Our historical data (before keyword “not provided”) is still a valid “mine” of insights into your business and your customers, but the value of these insights will fade away with time as your site changes, as the competitive landscape changes, and as your customer/audience needs evolve, so we need to act now on the data we have.

What about the “partial” keyword data that we have now from October 2011 through to September 2014 (the projected end of organic Keyword data)? Read on…

KEYWORD (NOT PROVIDED), THE STATISTICAL REALITY

So how can we get “a read” on all of our organic keywords that are (not provided) by leveraging the data we do have? Option 1: use the bulleted list above, or Option 2: use our remaining organic keyword data as the basis for estimating the distribution of keywords within the (not provided) set.

WARNING: THE FOLLOWING CONTAINS STATISTICS AND MATH (in case it wasn’t your favorite subject in school :)…

First, a reality check: even if we were living in a perfect world without the keywords (not provided) issue, the keyword distribution of your entire organic search would still be an “estimate”. Below is the reason why. (Note that here we assume visitors behave in the same way whether they are not logged in to Google, in which case keywords propagate in the referral data, or logged in to Google or on an Android mobile device, in which case the keywords are (not provided). This assumption simplifies the math, but we recognize that keyword patterns may vary from channel to channel – for example, mobile keywords are generally shorter or may have a different mix. Thanks to Charlotte Bourne for pressing this point home.)

For the statistical calculation of the keyword distribution we need to think in terms of a multi-Bernoulli distribution. Under this model:

  1. Each organic visit is treated as a “trial”.
  2. If a visit falls under a certain keyword, it is an “event” for that keyword.

Using the multi-Bernoulli distribution and the law of large numbers, the variance of a keyword’s percentage within the entire set of organic search visits is:

Keyword (Not Provided) Statistical Analysis

Credit: Yi Jiang (Nelson), for creating this analytical approach.
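
The exact formula in the image is not reproduced here, but the worked examples below are consistent with the standard variance of a binomial proportion, with the effective number of trials reduced by the (not provided) share – an assumed reconstruction, in LaTeX:

  \operatorname{Var}(\hat{p}) \approx \frac{\hat{p}\,(1-\hat{p})}{N\,(1-k)},
  \qquad \text{95\% CI: } \hat{p} \pm 1.96\sqrt{\operatorname{Var}(\hat{p})}

where \hat{p} is a keyword’s share of visits with known keywords, N is the total number of organic visits (trials), and k is the fraction of visits that are (not provided).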

Our objective is to minimize the variance and uncertainty above. If we examine how this formula behaves, we find that there are several ways to reduce the variance.

  1. Increase the number of visits with keywords present
  2. Increase the number of events
  3. Reduce the %KNP (Keyword not provided)

Having keywords (not provided) reduces the number of visits with known keywords, and this is fine as long as we have a sufficiently large number of visits with known keywords. Conversely, even if we know all the keywords (0% not provided) but there are only a small number of visits, we could still have a large variance and a limited ability to extract statistically valid insights from this data.

Illustration:

Example Scenario 1
Organic search visits (Trials): 10,000
Keyword (not provided): 0%
Number of visits for one particular keyword (Event): 1,000

At the 95% confidence level, this keyword as a percentage of all organic keywords is between 9.4120% and 10.587%.

Example Scenario 2
Organic search visits (Trials): 1,000,000
Keyword (not provided): 50%
Number of visits for one particular keyword (Event): 50,000

At the 95% confidence level, this keyword as a percentage of all organic keywords is between 9.9168% and 10.083%.
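
As a quick sanity check, the two intervals above can be approximately reproduced with a few lines of Python, assuming the standard normal approximation to a binomial proportion (which appears to be the method used here):

  from statistics import NormalDist

  def keyword_share_ci(trials, knp_fraction, keyword_visits, confidence=0.95):
      """Confidence interval for a keyword's share of organic search visits.

      The effective sample is the set of visits with known keywords,
      i.e. the trials reduced by the (not provided) fraction.
      """
      known_visits = trials * (1 - knp_fraction)
      share = keyword_visits / known_visits
      z = NormalDist().inv_cdf(0.5 + confidence / 2)
      margin = z * (share * (1 - share) / known_visits) ** 0.5
      return share - margin, share + margin

  # Scenario 1: 10,000 visits, 0% (not provided), 1,000 visits for the keyword.
  print(keyword_share_ci(10_000, 0.0, 1_000))       # ~ (0.0941, 0.1059)

  # Scenario 2: 1,000,000 visits, 50% (not provided), 50,000 visits for the keyword.
  print(keyword_share_ci(1_000_000, 0.5, 50_000))   # ~ (0.09917, 0.10083)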

GOOGLE KEYWORD (NOT PROVIDED) CALCULATOR TOOL

So how does this apply to your keyword data? The examples above show that even when a large proportion of keyword data is “not provided”, we may still have a data set large enough to give us estimates nearly as precise as if we had all of our keywords. The spreadsheet tool shown below will help you to understand the statistical significance of your keyword data.

Keyword (Not Provided) Calculator
Click here to download

Use the spreadsheet tool below with your own keyword data by following these seven steps.

  1. Go into your organic keywords report and export your top 500 keywords and the number of visits for each keyword for the time period you choose (arrows #1, #2, #3, #4). Export the “unsampled data” if you are using Google Analytics Premium Edition (GAPE) to reduce the effect of sampling (sampled data is directionally accurate, but introduces greater uncertainty into statistical calculations). If you are using the free version of Google Analytics you can shorten the time frame to reduce the effect of sampling.
  2. From your export, remove the (not provided) row, then copy and paste two columns of data into the calculator (columns with “Keywords” and “Visits”), overwriting the example data in the purple cells.
  3. Copy the total number of organic visits (arrow #4) into the first yellow box in the calculator.
  4. Divide the number of (not provided) visits (arrow #5) by the total number of organic visits, and multiply this value by 100 to express it as a percentage.
  5. Next paste this percentage value into the second yellow box in the calculator.
  6. Finally, select your confidence level (95% is standard, but you determine “how sure you want to be”).
  7. Finished!

You can now use the calculator “phrases” in cells L3 and L4 in conjunction with the values in columns “L” or “M” for a specific keyword row to estimate the total number of organic visits for that keyword, including those hidden in the (not provided) keyword set!
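
The spreadsheet’s internal formulas are not reproduced here, but under the assumptions above the estimate amounts to a simple proportional scale-up from the known keywords to all organic visits. A minimal Python sketch with hypothetical input values:

  # Hypothetical inputs, mirroring the yellow boxes in the calculator.
  total_organic_visits = 1_000_000    # total organic visits (arrow #4)
  knp_fraction = 0.50                 # (not provided) share from step 4, as a fraction
  keyword_visits_known = 50_000       # recorded visits for one keyword

  # The keyword's share among visits with known keywords...
  known_visits = total_organic_visits * (1 - knp_fraction)
  share = keyword_visits_known / known_visits

  # ...is assumed to hold within the (not provided) set as well, so the
  # estimated total visits for this keyword (known + hidden) is:
  estimated_total_visits = share * total_organic_visits
  print(round(estimated_total_visits))  # 100,000 in this example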

Using the Keyword (Not Provided) Calculator Tool in Google Analytics

If you want to analyze a group of keywords that match a certain pattern (e.g. all keywords that contain “cat”), then use the inline filter in Google Analytics (the box just above arrow #5 in the above image) to export only this data, as described above (up to 500 rows). Same as before, you can use the calculator cells L3 and L4 in conjunction with the values at the bottom of the calculator (cells L510 or M510) to estimate the total number of organic keywords (e.g. that contain “cat”), including those hidden in the (not provided) keyword set. Here we assume that there is no correlation between the keywords (an assumption that makes the math easier, but still important to keep in mind).
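
The same scale-up works for a filtered group of keywords. A short sketch (again with a hypothetical file name and column names), assuming the (not provided) row has already been removed from the export:

  import pandas as pd

  df = pd.read_csv("organic_keywords_export.csv")   # columns: Keyword, Visits
  knp_fraction = 0.50                               # (not provided) share, as a fraction

  # Replicate the inline filter: keep only keywords containing "cat".
  cat_visits = df[df["Keyword"].str.contains("cat", case=False)]["Visits"].sum()

  # Scale up to estimate total "cat" visits, including those hidden in the
  # (not provided) set (assuming no correlation between keywords).
  estimated_cat_visits = cat_visits / (1 - knp_fraction)
  print(round(estimated_cat_visits))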

KEYWORD (NOT PROVIDED), THE TREND CONTINUES TOWARDS EXTINCTION

Whenever people use “secure search” with Google, the keyword is (not provided). What we are seeing now is that more and more searches are defaulting to secure search, which increases the number of keyword (not provided) instances…

MEDIUM (NOT PROVIDED), A NEW THREAT EMERGES

Some search engines do not pass the keyword, and so their traffic appears under “referrals” in Google Analytics instead of under “search engines”. This is easy to fix by updating your Google Analytics page tag to include the _addOrganic function (the introduction of analytics.js will move this functionality into the Google Analytics administration interface and out of the on-page JavaScript). (https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiSearchEngines#_gat.GA_Tracker_._addIgnoredOrganic)

The big news: Apple’s iOS 6 and Android 4+ are no longer providing the referrer at all, so these organic visits now show up in Google Analytics as direct traffic… (and you thought the keyword “not provided” issue was bad!). With iOS 6 and Android 4+ sending a null value in the document.referrer field for organic searches, coupled with the growing number of these mobile devices (the mobile Internet is expected to be three times the size of the traditional PC/laptop Internet), we are moving into a dark period in analytics, where not only is the keyword (not provided), but the medium itself is no longer known for a growing number of organic visits that are now labelled “direct”…

This is ironic because Google has invested heavily in multi-channel funnels and attribution modelling for Google Analytics Premium clients. As we lose the medium data, this modelling capability becomes less accurate and less valuable. It is conceivable that a large brand might be off by several million dollars when evaluating the contribution of organic traffic to its ecommerce and goal conversions. Of course, paid search and display advertising should still track faithfully in Google Analytics, and perhaps this is the main concern for advertisers and Google.

Social media traffic also has this “Medium (not provided)” problem: non-UTM-tagged links posted to Facebook, Twitter, and other social networks are then shared via IM, email, apps, etc., and this results in a larger volume of traffic that is labelled “direct” in Google Analytics. The real lesson here is that as more of our data becomes “lost”, the importance of campaign tagging grows!
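
For anyone who has not automated campaign tagging yet, here is a minimal Python sketch of what it involves; the campaign values are placeholders, and only the standard utm_source, utm_medium, and utm_campaign parameters are added:

  from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

  def add_utm_tags(url, source, medium, campaign):
      """Append standard UTM campaign parameters to a URL before sharing it."""
      parts = urlparse(url)
      query = dict(parse_qsl(parts.query))
      query.update({"utm_source": source, "utm_medium": medium, "utm_campaign": campaign})
      return urlunparse(parts._replace(query=urlencode(query)))

  # Placeholder campaign values for illustration.
  print(add_utm_tags("http://www.example.com/blog/post", "twitter", "social", "spring-launch"))
  # http://www.example.com/blog/post?utm_source=twitter&utm_medium=social&utm_campaign=spring-launch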

Read more about the so called “dark social” phenomena here:
http://www.theatlantic.com/technology/archive/2012/10/dark-social-we-have-the-whole-history-of-the-web-wrong/263523/

Inflating the proportion of your online traffic categorized as “direct” also has secondary effects:

  1. Reducing the perceived value of organic search in the marketing mix, and the value of SEO
  2. Inflating all campaign traffic, because in the main Google Analytics reports “direct” never overwrites the existing source/medium campaign data in the utmz cookie, so future direct visits are credited to the previous visit’s source/medium.

CONCLUSION

As Web Analysts and SEOs we need to innovate in the way we do analysis as the data landscape changes. With the keyword (not provided) issue we lost some data we were used to having, but this post shows that we have other methods. It has been noted that those who lose their sense of sight, for example, get better at using their other senses to return to a high-functioning state. Adaptability is resilience.

In all our reporting it is critical to provide a clear qualification of how good the data actually is, and what assumptions we are making in our analysis. Solid statistical analysis will allow us to use confidence intervals to improve our “sensory perception” of the data environment. This approach helps decision makers improve the quality of their understanding and leads to better informed decisions that are more likely to be “on target”. This in turn allows our clients to realize greater value from the insights and recommendations we provide.

  • http://twitter.com/danbarker dan barker

    hi, Scott, how are you?

    Thanks for the thoughtful, useful post.

    Just a quick note on this:

    “A better alternative is to use a secondary dimension in Google Analytics, so you can see the landing pages where the keyword is not provided (without corrupting the source data).”

    The upside of doing it in the data itself is that it’s then visible in multi-channel funnels too.

    & I’d argue that replacing a blank with a piece of useful data isn’t ‘corrupting’.

    dan

    • Scott Shannon

      Hi Dan,

      Even in Multi-Channel Funnels you can use a primary dimension of “Keyword (Or Source/Medium) Path”, and a secondary dimension of “Landing Page URL Path”, so I still do not see any advantage to overwriting (not provided).

      You may want to get fancy and where (not provided) is detected, dynamically detect the title tag value on the page, or the H1 text, and write this in to your analytics. Then you have some data in your reports, but I would still argue this data is as likely to mislead you as it is to give you useful insights…

      ~Scott

  • Eric Baillargeon

    Another, somewhat simpler, solution to “guess” the keywords (in French): http://intercommunication.blogspot.ca/2013/04/deceler-vos-mots-cles-not-provided-dans.html

    • Scott Shannon

      Eric,

      Thanks for pointing this out. Not sure if it matters what the keywords position is if we do not know what the keyword is, unless we are using the known keywords to guess at the unknown keywords, as you might do with the “not provided” calculator on this page.

      ~Scott

  • AniLopez

    Hi Scott
    True that an isolated % (ratio of ‘not provided’ to total) does not give a decent picture of the situation, and relating that to the amount of traffic is an interesting approach, but Bernoulli is not even required. My 2 cents:

    a) When there are only a small number of visits everything sucks, not only the not provided issue, that’s obvious.

    b) “is fine as long as we have a sufficiently large number of visits with known keywords” Nope, it’s not fine to have 40% or less of the cake when you used to have it all, no matter how much traffic you get. Running half blind is never cool.

    The problem is we don’t have new ways to get more data or new data to work around the issue, maybe some more clever use of the numbers we already had but nothing more, and as you already point out, this is going to get worse because it’s really hard for Google to fight spam, and blinding SEOs by drastically reducing the data provided is a quick, dirty, cheap shot. It makes sense.

    I agree with you, right now WMT is the only way we have to connect some dots between SERPs and conversions, as no matter how many keyword tools you use, they provide just a very narrow and fictitious version of the story.

    Cheers

    • Scott Shannon

      Hi Ani,

      I agree with you on point “A”, the worst case is to have little or no data! For point “B” the point of this article is to show that even if you do not know half your organic keywords you can still predict these keywords within statistical bounds.

      I think the big trend will be collection of onsite search keywords combined with “visitor” level analysis. This helps to better understand the mass needs of audiences. So, collecting a lot more data, and relating this data to online traffic (CRM integration), can replace what we used to use keywords for (and then some!).

      ~Scott

  • http://twitter.com/AnalyticsNick Nick Iyengar

    Hi Scott – great post. Long past time for a forward-thinking approach to dealing with the “not provided” issue. Thanks!

  • http://profiles.google.com/remi.turcotte Remi Turcotte

    man they (G) suck soooo bad – this is puuuure bullshit… :-(

  • Adrian Ignatescu

    Great post, Scott! I believe this is really worrying, but now that we are aware of all this we have to find new ways of organic search tracking. Anyway, your post is full of good insights. Thank you!
