Expert publishing blog opinions are solely those of the blogger and not necessarily endorsed by DBW.
Over the next few months, I will be sharing the results of an assessment I conducted with the help of 12 independent publishers on how well their websites are optimized for getting the best rankings on Google Search Engine Results Pages (SERPs). I hope that the results of this free analysis will pinpoint challenges that most publishers face on the web and continue a dialogue on how to best optimize the online experience for customers, vendors and authors.
In this installment, I will discuss the disconnect between the number of pages crawled for a website (in this case, by our own spider; effectively the total number of site pages) and the number of pages Google has indexed (those displayed in the SERPs). The ratio of indexed to crawled pages should always be as close to 1:1 as possible. The better the ratio, the better the search ranking. The worse the ratio, the more duplicate pages are likely lurking and, well, the more room for improvement there will be.
Again, I would like to thank all of the publishers for their participation in this assessment, and I hope it will be a great learning experience for all of us.
Here are the publisher grades for Pages Crawled vs. Pages Indexed:
Crawling with SERPs
As mentioned in my previous blog post, whenever we take on a new client at Biztegra, we start with a cursory website evaluation and crawl the site ourselves, just as a search engine spider would crawl it. This gives us information about the pages, links, images, CSS, scripts and other elements of the site that we then use in the evaluation.
For example: A site crawl for Xist Publishing, which publishes children’s books on five distinct ebook formats, indexes over 8,000 site pages. A similar evaluation of Oldcastle Books, a UK publisher with several innovative imprints in its catalog, finds 134 site pages. Now, let’s see what Google has to say.
A Google search for Xist Publishing brings back 885 search results; a search for Oldcastle Books returns 129.1 The indexed-to-crawled ratio for Oldcastle Books is pretty close—remember, we want as near to 1:1 as possible. But for Xist, Google is returning fewer than 1,000 results for the more than 8,000 pages we crawled. And if you click through to the very last results page, you will see that Google is only displaying 169 of those results. Why? What happened to the rest of them?
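The gap is easier to see as a quick calculation. A minimal sketch, using the figures from the assessment above:

```python
# Index-to-crawl ratio: the closer to 1.0 (a 1:1 ratio), the better.
def index_ratio(indexed_pages: int, crawled_pages: int) -> float:
    """Share of crawled pages that the search engine actually indexes."""
    return indexed_pages / crawled_pages

# Figures from the assessment above.
print(round(index_ratio(885, 8000), 2))  # Xist Publishing -> 0.11
print(round(index_ratio(129, 134), 2))   # Oldcastle Books -> 0.96
```

Oldcastle Books sits near the ideal 1:1; Xist's ratio shows that roughly nine out of ten crawled pages never make it into Google's results.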
Doubling Your Content
When Google uses its spider to crawl your site for SERPs, it checks for unique page information, content that includes title and meta tags, page descriptions and page content. It also checks whether content within your website is duplicated across multiple URLs.
If duplicates are found, Google omits them from the search results. It does this to prevent a long list of near-identical results, which would make it harder for the user to find the link they are looking for. The more duplicate pages you have, the further the ratio of crawled to indexed pages drifts from 1:1, which means a lower search ranking on Google.
Duplicates can include the following:
• Multiple pages that have the same titles, meta tags, page descriptions or content.
• A book page that is indexed as three duplicate pages with the only difference being the format for purchase (print, ebook, printer-friendly).
• Capitalization errors in the URLs that turn one site page into multiple pages that display the same content.
• Having both the www.url and the non-www indexed, instead of just one or the other.
And it should be noted that pages do not have to be 100 percent identical to be treated as duplicates. Pages that differ by only 15 percent or so may still be categorized as duplicates by Google’s algorithm.
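Google’s actual duplicate-detection algorithm is not public, but the idea can be sketched with a simple text-similarity check. The 85 percent threshold (i.e., less than a 15 percent difference) and the sample book-page titles below are assumptions for illustration only:

```python
# Illustrative sketch: flag near-duplicate pages by text similarity.
# The 0.85 threshold and sample titles are assumptions for demonstration;
# Google's real algorithm is far more sophisticated and not public.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two page texts."""
    return SequenceMatcher(None, a, b).ratio()

def is_near_duplicate(page_a: str, page_b: str, threshold: float = 0.85) -> bool:
    """Treat two pages as duplicates when they are at least `threshold` similar."""
    return similarity(page_a, page_b) >= threshold

# Two hypothetical book pages that differ only in format and price.
print_page = "Clifford the Big Red Dog - Print Edition - $9.99"
ebook_page = "Clifford the Big Red Dog - Ebook Edition - $4.99"

print(is_near_duplicate(print_page, ebook_page))  # True
```

This is exactly the book-page scenario from the list above: two URLs that differ only in purchase format would clear almost any reasonable similarity threshold.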
Now, following non-Internet logic, two pages should be better than one. Two similar advertisements in a magazine should double your chances of being noticed, right? Wrong.
Duplicate pages can be harmful to your website’s health:
• The more duplicate pages you have, the greater chance that Google will see your pages as separate entities: different sites. That means that the metrics used to determine your SERPs ranking may be going to a different version of your site.
• Duplicates can confuse Google about which of your pages is the best one to rank in the SERPs against your competitors. If the page Google ranks is weaker than another version in your index, a user might choose a competitor’s page over yours.
• Duplicate pages can split your “link juice,” the sum of all the SEO factors that go into your website’s rankings. The more link juice you accumulate from SEO best practices, such as having a mobile-friendly site or using unique header tags, the better your chance for higher search rankings. Duplicate pages create a leak in your link juice, diluting your SEO results across each instance of a repeated page or piece of content.
• If Google thinks you are deliberately filling the web with multiple pages to manipulate their algorithm, they will view it as spam and lower your ranking accordingly. In extreme cases, they may remove your website from their index.
Your website URL is another place where crawl vs. index can upset your SERPs ranking. Choosing your website’s domain name is a part of your identity as an independent publisher, but there is a catch: while you can choose between the standard www address and one without the www, it is important to make sure that only one of them is used to lead your customers to your website.
According to Google’s guidelines for setting your preferred domain name as www or non-www, “If you don’t specify a preferred domain, we may treat the www and non-www versions of the domain as separate references to separate pages.” Basically, Google will view them as two separate sites. And when Google does this, it is essentially splitting the link juice between the two separate sites. Remember, don’t spill your link juice.
For example: Greystone Books, a trade book publisher from Vancouver, essentially has two separate URLs: www.greystonebooks.com and greystonebooks.com. This means that Google is indexing both URLs, splitting the link juice between both and lowering the site ranking. Conversely, Xist Publishing has a main URL of www.xistpublishing.com. Anyone typing in xistpublishing.com will be redirected to the www version of the URL—keeping the link juice from being split.
I would encourage Xist to go even further than the redirect by ensuring that non-www pages are not being indexed. This could be accomplished by setting the preferred domain name in Google Search Console or by adding a noindex robots meta element to keep those pages out of the index.
There are a few ways to specify a domain name and keep your link juice intact:
• Go to the Google Search Console homepage and set your preferred domain name, www or non-www.
• Set up a 301 Redirect, notifying the users that the page has moved and directing them to the proper page with your preferred URL.
• Use your preferred URL in your XML sitemap and submit it to Google Search Console.
• Use the robots meta element to mark pages as “noindex” so they are left out of the index. However, be sure those pages are not blocked from crawling in robots.txt; the search engine has to be able to crawl a page to see the noindex directive.
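To make the options above concrete, here is what the redirect and the meta element might look like. The domain is a placeholder, and the Apache rewrite rules are just one common way to implement a 301 redirect; your server setup may differ:

```apacheconf
# .htaccess: permanently (301) redirect non-www requests to the www version,
# so only one URL variant accumulates link juice.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

And the robots meta element, placed in the head of any page you want crawled but kept out of the index:

```html
<!-- Tells compliant search engines to leave this page out of their index. -->
<meta name="robots" content="noindex">
```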
Most organizations will need to get an SEO expert, developer or web server admin to do any of these, and to ensure they are done correctly. You are actually going to have to do a mini SEO migration to ensure you don’t lose any of your rankings during this process.
Some Final Thoughts on Crawl vs. Index
Resolving the ratio between the number of pages crawled vs. those that are indexed is a great first step to restoring your page ranking mojo and reevaluating your own website to find pages that are important to search engines (and customers) and to your publishing business. Think of it as a smart “spring cleaning” that streamlines your website for all parties.
Important as it is, though, it is just one step. Revising duplicate content and URLs may require more assistance than Google support can provide, especially when it comes to recoding pages. Also, items such as page descriptions and meta tags may require both creative and pragmatic solutions for writing and organizing to prevent further repetition.
In the next installment, I will be discussing URL structure and site architecture. Until then, “May your link juice cup always be full and never overflow.”
Is your site set to only show either the www or non-www version? Let me know in the comments below.
1 This number may have changed since the time of the assessment.