With recent talk of “real time indexing” in Google, we’ve noticed a lot of people asking “why is it taking so long for Google to index my site?” Truthfully, there are many factors to consider, and most of them work independently of the others. For this post, though, I am going to list three fairly common ones that you might want to take a look at.
Solving site structure
Deep pages can be a huge problem for some sites. Blogs in particular, which might have an insane directory structure such as blog.com/justin/seo/2010/03/13/why-im-not-getting-indexed.html, can create trouble for unwary webmasters, with issues such as:
Poor page rank
Driving through multiple pages to reach each level of the directory structure can reduce page rank to a trickle. Think of it like a champagne tower, each glass sending a little more down the way. Of course in this case, each page actually sends 0.85 x (linking page’s page rank) / (# of links on the linking page) through each of its links. Pages with poorer page rank are deemed “less relevant” and thus indexed less frequently.
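To put rough numbers on the champagne tower, here is a minimal sketch of that formula applied level by level. It is a simplification: it ignores rank arriving from any other inbound links, and the names `passed_rank` and `rank_after_chain` are made up for this illustration.

```python
from typing import Sequence

DAMPING = 0.85  # the 0.85 factor from the formula above


def passed_rank(linking_page_rank: float, links_on_page: int) -> float:
    """Rank sent through one link: 0.85 * (linking page's rank) / (# of links on it)."""
    if links_on_page <= 0:
        raise ValueError("a linking page needs at least one link")
    return DAMPING * linking_page_rank / links_on_page


def rank_after_chain(start_rank: float, links_per_level: Sequence[int]) -> float:
    """Follow a single path down the directory tree, one link per level."""
    rank = start_rank
    for links_on_page in links_per_level:
        rank = passed_rank(rank, links_on_page)
    return rank


if __name__ == "__main__":
    # Hypothetical blog: five directory levels, ten links on every page.
    for depth in range(1, 6):
        print(depth, rank_after_chain(1.0, [10] * depth))
```

Even with only ten links per page, each extra level divides what arrives by more than ten, which is why a post buried five folders deep sees so little.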
Broken site structure
With lots of internal linking it’s easy to let links to internal content drop down your site structure, or even worse: break entirely. This can make it impossible for a crawler to reach, and so to index, your inner content.
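If you can export your internal links as a simple mapping of page to linked pages, a short script can show how many clicks each page sits from the front page and which pages nothing links to any more. This is only a sketch: the `site` mapping below is hypothetical, and a real crawl would need to build it for you.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first walk: fewest clicks needed to reach each page from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable_pages(links, start):
    """Pages that exist in the mapping but cannot be reached by following links."""
    known = set(links) | {t for targets in links.values() for t in targets}
    return known - set(click_depths(links, start))


if __name__ == "__main__":
    # Hypothetical site: one deep chain plus an orphaned old post.
    site = {
        "/": ["/justin/"],
        "/justin/": ["/justin/seo/"],
        "/justin/seo/": ["/justin/seo/2010/"],
        "/justin/seo/2010/": [],
        "/old-post.html": [],
    }
    print(click_depths(site, "/"))
    print(unreachable_pages(site, "/"))
```

Pages with a large depth are candidates for a flatter structure, and anything in the unreachable set cannot be found by a bot that only follows links.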
Recently a lot of WordPress blogs have found themselves having to wait 7 days before new posts get indexed. Why 7 days? The affected sites had their front page in Google; only the subpages were taking time.
Google (and most search engines) still have trouble with duplicate content. Seeing the same content on the front page as on a sub page, Google may have been ignoring the sub pages, believing them to be duplicates.
Further, a lot of these sites were scraper sites, with little of their own content. Google has been cracking down on these. Hard. Amid a bunch of restrictions on scraper sites, it is also quite possible that Google would be punishing blogs that had similar key indicators, whether or not they were scrapers.
What are the key indicators? Only Google would know.
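While Google’s own signals are unknown, you can at least measure how much of a post’s text is repeated on your front page. The sketch below compares overlapping three-word runs (shingles) between two pieces of text. It is one common way to estimate duplication, not a description of what Google does, and the function names are invented for this example.

```python
def shingles(text, size=3):
    """Set of every run of `size` consecutive words in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets: 0.0 is disjoint, 1.0 is identical."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0  # too little text to compare
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Hypothetical texts: the full post, and what the front page shows of it.
    post = "why is it taking so long for google to index my site"
    front_page_full = post
    front_page_snippet = "why is it taking so long"
    print(overlap(post, front_page_full))
    print(overlap(post, front_page_snippet))
```

A score near 1.0 means the front page carries essentially the whole post; showing only a snippet pulls the score down.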
Google doesn’t like it when you cloak your pages. What is cloaking? That’s when you read the user agent string of an incoming visitor and serve different content depending on who is asking, for example one page to search engine bots and another to human visitors.
This isn’t always a bad thing. In fact, YouTube does it. But suffice it to say that Google cautions people to be very wary of cloaking (see Matt Cutts in the comments).
How could I be cloaking if I don’t mean to be?
Large organizations tend to have a lot of cooks in the kitchen, so to speak. While one team might be responsible for development, another might be behind design, and another marketing. In fact, we once found that a client of ours was cloaking the content going to Google without even realizing it. Needless to say, it was a forehead-smacking moment when we realized what was going on.
What you can do:
Rethink your site architecture: It might be time to think about how your site handles its masses of pages. While it is true that more pages give you more page rank to play with, without the use of nofollow to modify your inner site structure, you may be better off rethinking the way your pages interlink.
Get more links pointing at your inner pages: Sending more link juice to the pages you want indexed is always a good idea. It also helps with their page rank, and overall site ranking.
Try using blog snippets: Snippets on the main page with longer articles on subsequent pages can help with your content issues.
Use the canonical tag: It lets you tell search engines which URL holds the original version of a page’s content. Not that useful when your problem is front versus inner page on blog posts, but it won’t hurt.
Browse your pages with SEO browser: SEO browser queries the URL you feed it with whatever user agent string you select, then lets you see it as the bots do: text only. Useful if you’re thinking there might be something fishy going on.
Of course, this will only detect user agent cloaking. For IP-based cloaking you would need to compare what you see to the Google cached version.
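If you would rather script the user agent check yourself, a rough sketch follows: it requests the same URL twice, once with a browser-style user agent and once with Googlebot’s, and reports whether the responses differ. The browser string and the function names are assumptions made for this example. A difference is only a hint, since pages with timestamps, ads, or session tokens change between any two requests.

```python
import urllib.request

# Illustrative browser-style string; any ordinary browser user agent would do.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# The user agent string Googlebot identifies itself with.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_as(url, user_agent):
    """Download a page while presenting the given user agent string."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def serves_different_content(url, fetch=fetch_as):
    """True if the page differs by user agent, ignoring whitespace differences."""
    as_browser = " ".join(fetch(url, BROWSER_UA).split())
    as_bot = " ".join(fetch(url, GOOGLEBOT_UA).split())
    return as_browser != as_bot


if __name__ == "__main__":
    # Replace with a page from your own site.
    print(serves_different_content("http://example.com/"))
```

The `fetch` parameter is there so the comparison can be tried against canned responses before pointing it at a live site.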