KEYWORD (NOT PROVIDED) & THE END OF SEO DATA (PART 2 OF 2)

So, how serious has the keyword (not provided) issue become? Answer: Increasingly serious, and some project the “end of data” in September 2014 (see image below).

In part 1 of this blog post, we looked at the long tail distribution of keywords, audience segmentation, and “hot tips” to help you manage keyword (not provided). Now we will:

  1. Discuss the future of SEO and what tools remain in the toolbox.
  2. Demonstrate the degree to which our “known” keywords can help us estimate the distribution of unknown keywords (not provided).
  3. Provide examples and a calculator to be used with your own organic keyword data, which will help make your keyword reporting bulletproof for your SEO clients.
  4. Examine factors that increasingly erode our organic keywords data (almost to extinction).
  5. Highlight the new and growing threat of Medium (not provided).

Keyword Not Provided Trend

Credit: http://www.notprovidedcount.com/

Ironically, when we had all of our organic keywords, we took this data for granted (partly because long tail keyword analysis is difficult). Currently, most sites have about half of this keyword data available (50% of organic keywords are “not provided”), and in the future we may need to mine our older historical data to get SEO insights about our Web presence and our competitive position. In this post we also examine the reality of this issue and demonstrate that, even with a large proportion of organic keywords (not provided), you can still get high quality keyword insights that are statistically valid.

After the “end of data” we will need to use 1) Webmaster Tools and 2) AdWords data to get this “visitor intent” data. Both integrate with Google Analytics, so we are all good on that front. A post by Claire Broadley also references a custom Google Analytics filter that replaces the “not provided” keywords with the landing page URL (URI). This is not a great idea unless you apply the filter in an alternate Google Analytics profile, because replacing “not provided” keywords with landing page URIs can “muddy the waters” for the remaining keywords you have (e.g. keywords you bought in PPC/CPC AdWords, and the remaining organic keywords). A better alternative is to use a secondary dimension in Google Analytics, so you can see the landing pages where the keyword is not provided (without corrupting the source data).

Keyword (Not Provided) Analysis Using Secondary Dimension of Landing Page in Google Analytics

SEARCH ENGINE OPTIMIZATION INSIGHTS & STRATEGIES

Landing-page based keyword analysis requires that you first perfect your SEO (technical and semantic) so that search engines algorithmically match visitor queries to each of your landing pages as contextually appropriately as possible (minimizing misaligned traffic). It is a great idea to use “non-bounced” visits in conjunction with the secondary dimension shown above in order to strip out the “misaligned traffic” (when search engines direct the wrong visitors to our content by mistake; there will always be a percentage of traffic like this).
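If you export the keyword report with landing page as the secondary dimension, the non-bounced filtering described above can be sketched in a few lines. This is a minimal illustration only: the row layout (keyword, landing page, visits, bounces) and the sample numbers are hypothetical, not a real Google Analytics export format.

```python
from collections import defaultdict


def non_bounced_not_provided_by_page(rows):
    """Sum non-bounced (not provided) visits per landing page.

    rows: iterable of (keyword, landing_page, visits, bounces) tuples.
    Returns (landing_page, non_bounced_visits) pairs, largest first.
    """
    totals = defaultdict(int)
    for keyword, landing_page, visits, bounces in rows:
        if keyword == "(not provided)":
            # Bounced visits are treated as likely "misaligned" traffic
            totals[landing_page] += visits - bounces
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical export rows (illustrative numbers only)
example_rows = [
    ("(not provided)", "/cat-food-guide", 400, 150),
    ("(not provided)", "/dog-beds", 120, 90),
    ("cat food reviews", "/cat-food-guide", 80, 20),
]
pages = non_bounced_not_provided_by_page(example_rows)
for page, visits in pages:
    print(f"{page}: {visits} non-bounced (not provided) visits")
```

The landing pages at the top of this list are the ones where readable URLs tell you the most about the hidden keywords.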

At the core of SEO analysis we will need to:

  • Use landing pages for organic insights. You need URLs (URIs) that are human readable (not pageID=meaninglessNumber), and remember to consider non-bounced traffic where the keyword is not provided, because it is likely better aligned traffic.
  • Build your “compelling content” around the questions your audiences/customers are asking. This targets relevant audiences with relevant answers/information.
  • Build content around the competitive landscape of keyword demand trends using tools like the AdWords keyword tool, the Bing keyword tool, SEM Rush, and Keyword Spy.
  • Look at your competitors’ websites to see what keywords and context they are SEO’ing around in their content strategy. Thanks to Bethany Bey for suggesting this.
  • Use AdWords data to identify the “Matched Search Query” (same as the organic “keyword”).
  • Finally, Webmaster Tools is the new SEO “golden goose” as it provides organic keyword insights, even for those searchers who do not visit your website after an organic search, which allows you to identify higher volume queries that you are not ranking for (but someone else is). Keyword “impression” and “click-through rate (CTR)” data from Webmaster Tools is powerful. Webmaster Tools provides this data for the last 90 days only, which implies ongoing work and analysis around this changing data.
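As a rough illustration of the last point, the sketch below picks out queries with many impressions but a low click-through rate from a list of (query, impressions, clicks) rows. The row layout, the sample data, and both thresholds are assumptions chosen for the example; tune them to your own Webmaster Tools export.

```python
def find_opportunity_queries(rows, min_impressions=1000, max_ctr=0.02):
    """Return queries with high impressions and a low click-through rate.

    rows: iterable of (query, impressions, clicks) tuples.
    The thresholds are arbitrary defaults, not recommended values.
    """
    opportunities = []
    for query, impressions, clicks in rows:
        ctr = clicks / impressions if impressions else 0.0
        if impressions >= min_impressions and ctr <= max_ctr:
            opportunities.append((query, impressions, ctr))
    # Largest audiences first
    return sorted(opportunities, key=lambda row: row[1], reverse=True)


# Hypothetical query data (illustrative numbers only)
example_queries = [
    ("cat food", 12000, 120),
    ("best cat toys", 5000, 400),
    ("cat beds", 800, 4),
]
opportunities = find_opportunity_queries(example_queries)
for query, impressions, ctr in opportunities:
    print(f"{query}: {impressions} impressions, CTR {ctr:.1%}")
```

Because the data window is limited, a check like this is worth re-running on each fresh export.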

HOW GOOD IS THE ORGANIC KEYWORD DATA THAT WE HAVE NOW?

Our historical data (before keyword “not provided”) is still a valid “mine” of insights into your business and your customers, but the value of these insights will fade away with time as your site changes, as the competitive landscape changes, and as your customer/audience needs evolve, so we need to act now on the data we have.

What about the “partial” keyword data that we have now from October 2011 through to September 2014 (the projected end of organic keyword data)? Read on…

KEYWORD (NOT PROVIDED), THE STATISTICAL REALITY

So how can we get “a read” on all of our organic keywords that are (not provided) by leveraging the data we do have? Option 1: use the bulleted list above. Option 2: use our remaining organic keyword data as the basis for estimating the distribution of keywords within the (not provided) set.

WARNING: THE FOLLOWING CONTAINS STATISTICS AND MATH (in case it wasn’t your favorite subject in school :)…

First, a reality check: even if we were living in a perfect world, without the keyword (not provided) issue, the keyword distribution of your entire organic search would still be an “estimate”. Below is the reason why. (Note that here we assume that visitors behave in the same way whether they are not logged in to Google (so keywords propagate in the referral data) or are logged in to Google (or are on an Android mobile device), in which case the keywords are (not provided). This assumption simplifies the math, but we recognize that keyword patterns may vary from channel to channel (e.g. mobile keywords are generally shorter or may have a different bent). Thanks to Charlotte Bourne for pressing this point home.)

For the statistical calculation of the keyword distribution we need to think in terms of a multi-Bernoulli distribution. Under this model:

  1. Each organic visit is treated as a “trial”.
  2. If a visit falls under a certain keyword, it is an “event” for that keyword.

Using the multi-Bernoulli distribution and the law of large numbers, the variance of a keyword’s percentage within the entire organic search visit keyword set is:

Keyword (Not Provided) Statistical Analysis

Credit: Yi Jiang (Nelson), for creating this analytical approach.

Our objective is to minimize the variance and uncertainty above. If we apply L’Hôpital’s rule to this formula, we find that there are several ways to reduce the variance:

  1. Increase the number of visits with keywords present
  2. Increase the number of events
  3. Reduce the %KNP (Keyword not provided)
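The formula itself was published as an image. One form that is consistent with the two example scenarios in the illustration that follows (an inference from those numbers, not a transcription of the original figure) is the usual variance of a sample proportion. Here $N$ is the total number of organic visits, $\mathrm{KNP}$ is the (not provided) share, $\hat{p}_k$ is the observed share of keyword $k$ among visits with a known keyword, and $z_{1-\alpha/2}$ is the normal critical value for the chosen confidence level:

```latex
\operatorname{Var}(\hat{p}_k) \approx \frac{\hat{p}_k\,(1 - \hat{p}_k)}{N\,(1 - \mathrm{KNP})},
\qquad
\hat{p}_k \;\pm\; z_{1-\alpha/2}\,\sqrt{\frac{\hat{p}_k\,(1 - \hat{p}_k)}{N\,(1 - \mathrm{KNP})}}
```

The denominator $N\,(1 - \mathrm{KNP})$ is simply the number of visits with a keyword present, which is why more visits or a lower %KNP both shrink the variance.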

Having keywords (not provided) reduces the number of visits with known keywords, and this is fine as long as we still have a sufficiently large number of visits with known keywords. Conversely, even if we know all the keywords (0% not provided), but there are only a small number of visits, then we could still have a large variance and a limited ability to extract statistically relevant insights from this data.

Illustration:

Example Scenario 1
Organic search visits (Trials): 10,000
Keyword (not provided): 0%
Number of visits for one particular keyword (Event): 1,000

At the 95% confidence level, this keyword as a percentage of all organic keywords is between 9.4120% and 10.587%.

Example Scenario 2
Organic search visits (Trials): 1,000,000
Keyword (not provided): 50%
Number of visits for one particular keyword (Event): 50,000

At the 95% confidence level, this keyword as a percentage of all organic keywords is between 9.9168% and 10.083%.
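The two scenarios above can be checked with a short script. This is a sketch under an assumption: it uses a normal-approximation (Wald) interval computed on the visits with known keywords, which is an inferred model rather than the original spreadsheet's formulas. The function name and parameters are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def keyword_share_interval(total_visits, knp_share, keyword_visits, confidence=0.95):
    """Return (low, high) bounds for one keyword's share of organic visits.

    Assumes a normal-approximation (Wald) interval on the visits whose
    keyword is known; this is an inferred model, not the original tool.
    """
    known_visits = total_visits * (1.0 - knp_share)   # visits with a keyword present
    share = keyword_visits / known_visits              # observed share among known visits
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided critical value
    margin = z * sqrt(share * (1.0 - share) / known_visits)
    return share - margin, share + margin


# Scenario 1: 10,000 visits, 0% (not provided), 1,000 visits for the keyword
low1, high1 = keyword_share_interval(10_000, 0.0, 1_000)
# Scenario 2: 1,000,000 visits, 50% (not provided), 50,000 visits for the keyword
low2, high2 = keyword_share_interval(1_000_000, 0.5, 50_000)

print(f"Scenario 1: {low1:.4%} to {high1:.4%}")
print(f"Scenario 2: {low2:.4%} to {high2:.4%}")
```

Under this assumption the script agrees with the published intervals to within rounding of the last digit, and it makes the main point visible: Scenario 2 has half its keywords hidden yet yields the narrower interval, because it still has 500,000 visits with known keywords.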

GOOGLE KEYWORD (NOT PROVIDED) CALCULATOR TOOL

So how does this apply to your keyword data? The above example shows that even when a large volume of keyword data is “not provided”, we may still have a statistically significant data set that gives us insights to nearly the same degree as having all of our keywords. The spreadsheet tool shown below will help you to understand the statistical significance of your keyword data.

Keyword (Not Provided) Calculator

Use the spreadsheet tool below with your own keyword data by following these seven steps.

  1. Go into your organic keywords report and export your top 500 keywords and the number of visits for each keyword for the time period you choose (arrows #1, #2, #3, #4). Export the “unsampled data” if you are using Google Analytics Premium Edition (GAPE) to reduce the effect of sampling (sampled data is directionally accurate, but introduces greater uncertainty into statistical calculations). If you are using the free version of Google Analytics you can shorten the time frame to reduce the effect of sampling.
  2. From your export, remove the (not provided) row, then copy and paste two columns of data into the calculator (columns with “Keywords” and “Visits”), overwriting the example data in the purple cells.
  3. Copy the total number of organic visits (arrow #4) into the first yellow box in the calculator.
  4. Divide the number of (not provided) visits (arrow #5) by the total number of organic visits, and multiply this value by 100% to get a percentage format.
  5. Next paste this percentage value into the second yellow box in the calculator.
  6. Finally, select your confidence level (95% is standard, but you determine “how sure you want to be”).
  7. Finished!

You can now use the calculator “phrases” in cells L3 and L4 in conjunction with the values in columns “L” or “M” for a specific keyword row to estimate the total number of organic visits for that keyword, including those hidden in the (not provided) keyword set!
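For readers who prefer a script to a spreadsheet, the same kind of estimate can be approximated as follows. This is a minimal sketch, assuming the calculation scales each keyword's share among known-keyword visits up to the full organic total and applies a normal-approximation interval; the spreadsheet's actual formulas are not shown in this post, and the keyword rows below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def estimate_total_keyword_visits(rows, total_organic_visits, not_provided_visits,
                                  confidence=0.95):
    """Scale known keyword visits up to the full organic visit total.

    rows: list of (keyword, visits) pairs with the (not provided) row removed.
    Returns {keyword: (low, estimate, high)} measured in visits.
    """
    known_visits = total_organic_visits - not_provided_visits
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    results = {}
    for keyword, visits in rows:
        share = visits / known_visits
        margin = z * sqrt(share * (1.0 - share) / known_visits)
        low = max(share - margin, 0.0) * total_organic_visits
        high = min(share + margin, 1.0) * total_organic_visits
        results[keyword] = (low, share * total_organic_visits, high)
    return results


# Hypothetical export: 20,000 organic visits, 10,000 of them (not provided)
example_rows = [("cat food", 1_200), ("cat toys", 300), ("dog beds", 90)]
estimates = estimate_total_keyword_visits(example_rows, 20_000, 10_000)
for keyword, (low, mid, high) in estimates.items():
    print(f"{keyword}: about {mid:.0f} visits ({low:.0f} to {high:.0f})")
```

The width of each range, relative to its estimate, grows as the keyword gets rarer, which is worth stating when you report long tail keywords to clients.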

Using the Keyword (Not Provided) Calculator Tool in Google Analytics

If you want to analyze a group of keywords that match a certain pattern (e.g. all keywords that contain “cat”), then use the inline filter in Google Analytics (the box just above arrow #5 in the above image) to export only this data, as described above (up to 500 rows). As before, you can use the calculator cells L3 and L4 in conjunction with the values at the bottom of the calculator (cells L510 or M510) to estimate the total number of organic visits for these keywords (e.g. those that contain “cat”), including those hidden in the (not provided) keyword set. Here we assume that there is no correlation between the keywords (an assumption that makes the math easier, but one that is important to keep in mind).

KEYWORD (NOT PROVIDED), THE TREND CONTINUES TOWARDS EXTINCTION

Whenever people use “secure search” with Google, the keyword is (not provided). What we are seeing now is that more and more searches are defaulting to secure search, which increases the number of keyword (not provided) instances…

MEDIUM (NOT PROVIDED), A NEW THREAT EMERGES

Some search engines do not pass the keyword, and so their traffic appears under “referrals” in Google Analytics instead of under “search engines”. This is easy to fix by updating your Google Analytics page tag to include the _addOrganic function (the introduction of analytics.js will move this functionality out of the “on page” JavaScript and into the Google Analytics administration interface). (https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiSearchEngines#_gat.GA_Tracker_._addIgnoredOrganic)

The big news: Apple’s iOS 6 and Android 4+ no longer provide the referrer at all, so these organic visits now show up in Google Analytics as direct traffic (and you thought the keyword “not provided” issue was bad!). With iOS 6 and Android 4+ sending a null value in the document.referrer field for organic searches, coupled with the growing number of these mobile devices (the mobile Internet is expected to be three times the size of the traditional PC/laptop Internet), we are moving into a dark period in analytics, where not only is the keyword (not provided), but the medium itself is no longer known for a growing number of organic visits that are now labelled “direct”…

This is ironic because Google has invested heavily in multi-channel funnels and attribution modelling for Google Analytics Premium clients. As we lose the medium data, this modelling capability becomes less accurate and less valuable. It is conceivable that a large brand might be several million dollars off in terms of evaluating the contribution of organic traffic to its ecommerce and goal conversions. Of course, paid search and display advertising should still track faithfully in Google Analytics, and perhaps this is the main concern for advertisers and Google.

Social media traffic also has this “Medium (not provided)” problem: links without utm tags that are posted to Facebook, Twitter, and other social networks are then shared via IM, email, apps, etc., and this results in a larger volume of traffic that is labelled “direct” in Google Analytics. The real lesson here is that as more of our data becomes “lost”, the importance of campaign tagging grows!

Read more about the so-called “dark social” phenomenon here:
http://www.theatlantic.com/technology/archive/2012/10/dark-social-we-have-the-whole-history-of-the-web-wrong/263523/
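As a small illustration of campaign tagging, the sketch below appends the standard utm_source, utm_medium, and utm_campaign parameters to a link before it is posted. The helper name, the example URL, and the campaign values are hypothetical; note that this simple version keeps only the last value if the original URL repeats a query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_campaign_tags(url, source, medium, campaign):
    """Return the URL with utm_source, utm_medium and utm_campaign added.

    Existing query parameters are preserved; existing utm_* values are
    overwritten by the ones passed in.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical link and campaign values
tagged = add_campaign_tags(
    "http://www.example.com/blog/post?ref=1", "twitter", "social", "spring_launch"
)
print(tagged)
```

A link tagged this way keeps its source and medium even when it is copied into an email or chat window and the referrer is lost.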

Inflating the proportion of your online traffic categorized as “direct” also has secondary effects:

  1. Reducing the perceived value of organic search in the marketing mix, and the value of SEO
  2. Inflating all campaign traffic, because in the main Google Analytics reports “direct” never overwrites the existing source/medium campaign data in the __utmz cookie, so future direct visits are credited to the previous visit’s source/medium.
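The second effect is easier to see with a toy model. The sketch below is a simplification that assumes one visitor, ignores cookie expiry, and treats ("(direct)", "(none)") as the direct label; it is not Google Analytics' actual processing code. It shows how a visit that arrives with no referrer inherits the previous campaign.

```python
DIRECT = ("(direct)", "(none)")


def credited_sources(visits):
    """Apply the 'direct never overwrites' rule to one visitor's visits.

    visits: list of (source, medium) tuples in chronological order.
    Returns the (source, medium) each visit would be credited to.
    """
    credited = []
    last_campaign = None
    for visit in visits:
        if visit == DIRECT:
            # A direct visit keeps whatever campaign data is already stored
            credited.append(last_campaign or DIRECT)
        else:
            last_campaign = visit
            credited.append(visit)
    return credited


# The second visit was really an organic search whose referrer was stripped
example_visits = [("newsletter", "email"), DIRECT, ("google", "cpc"), DIRECT]
result = credited_sources(example_visits)
for seen, credit in zip(example_visits, result):
    print(f"arrived as {seen} -> credited to {credit}")
```

In this example the stripped organic visit is credited to the email newsletter, so organic search is undercounted and the campaign is overcounted at the same time.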

CONCLUSION

As Web Analysts and SEOs we need to innovate in the way we do analysis as the data landscape changes. With the keyword (not provided) issue we lost some data we were used to having, but this post shows that we have other methods. It has been noted that those who lose their sense of sight, for example, get better at using their other senses to return to a high functioning state. Adaptability is resilience.

In all our reporting it is critical to provide a clear qualification of how good the data actually is, and what assumptions we are making in our analysis. Solid statistical analysis will allow us to use confidence intervals to improve our “sensory perception” of the data environment. This approach helps decision makers improve the quality of their understanding and leads to better informed decisions that are more likely to be “on target”. This in turn allows our clients to realize greater value from the insights and recommendations we provide.

Copyright © 2015, All Rights Reserved.