When working on complex websites, we often end up with tons of URLs that are, in reality, the same page with different query-string arguments.
Why is this important?
- Filter out arguments that are creating unnecessary duplicates
- Reduce the number of distinct entries in reports and the risks of breaking the number of uniques imposed by most tools
- Increase the accuracy of Page Views
- Identify rogue query string parameters and wrong arguments (as well as those with mixed upper/lowercase which are likely also considered as distinct ones)
The algorithm looks like this:
- Fetch data from GA: only paths with URLs, sorted by descending unique page views
- Split the URL to get only the Query String part
- Further split the query string to extract value-pairs (up to 5 arguments in this case) and put them in a new column named “arg”
- Combine everything back but keep only the “arg” column
- Filter out rows where “arg” is empty (i.e. there were less than 5 arguments)
- Split out the “arg” column into “var=value” pairs and keep only the “var” column
- Combine duplicate “var” and create a new column with a total of occurences of each one
- Sort in alphabetical order
- Store into Excel
It worked on a pretty complex website which had nearly 30k URLs with query string parameters and ended up with 195 distinct arguments!
Here’s how the results looks like:
Once equipped with this information, we can adequately configure Google Analytics by filtering out unnecessary parameters and improve the overall quality of our analytics.