Monday, August 30, 2010

How to help search engines crawl your site faster and better

The web is enormous, and new content is being created all the time. Google's own resources are finite, so when faced with a nearly infinite amount of web content, Googlebot can only find and crawl a percentage of it, and of the content we crawl, we can only index a portion. URLs are like bridges between your website and a search engine's crawler: to reach your site's content, the crawler needs to be able to find and cross those bridges (that is, find and crawl your URLs). If your URLs are complicated or redundant, crawlers will waste time tracing and retracing their steps; if your URLs are well organized and lead directly to distinct content, crawlers can spend their time on your content instead of crawling empty pages or crawling the same content over and over through different URLs.


In the slide above you can see some counterexamples to avoid. These are real URLs (although their names have been changed for privacy reasons), and they include hacked and encoded URLs, redundant parameters disguised as part of the URL path, infinite crawl spaces, and more. You can also find a number of recommendations for straightening out the maze of your website and helping crawlers find your content faster and better, including:

1) Remove user-specific parameters from the URL
Parameters that have no impact on a page's content, such as session IDs or sort parameters, can be removed from the URL and recorded in a cookie instead. By putting this information in a cookie and 301 redirecting to a "clean" URL, you keep the original information while reducing the number of URLs that point to the same content.
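As a rough sketch of what this can look like server-side (assuming a Flask application; the "sessionid" parameter name and the /products route are illustrative, not from the original post), the handler below moves the session ID into a cookie and 301 redirects to the clean URL:

```python
# A minimal sketch, assuming a Flask app; the "sessionid"
# parameter name is illustrative, not from the original post.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

@app.route("/products")
def products():
    session_id = request.args.get("sessionid")
    if session_id is not None:
        # Record the crawl-irrelevant parameter in a cookie, then
        # 301 redirect to the clean URL so that a single URL
        # serves this content.
        response = redirect(url_for("products"), code=301)
        response.set_cookie("sessionid", session_id)
        return response
    return "product listing"
```

After this redirect, crawlers only ever see one URL for the product listing, while returning visitors still carry their session in the cookie.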

2) Rein in infinite spaces
Does your site have a calendar with links to countless past and future dates, each with its own unique URL? Does a URL on your site still return a 200 code when you add a &page=3563 parameter, even though there are nowhere near that many pages? If so, your site has what is known as an "infinite crawl space", and this situation wastes both the crawler's bandwidth and your site's. One simple safeguard is sketched below.
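One way to close off such a space is to stop returning 200 for pages that don't exist. This sketch again assumes a Flask app with a hypothetical paginated /archive route; the names and page count are illustrative:

```python
# A minimal sketch, assuming a Flask app and a paginated archive
# whose real page count is known; the names are illustrative.
from flask import Flask, abort, request

app = Flask(__name__)

TOTAL_PAGES = 25  # however many pages of content actually exist

@app.route("/archive")
def archive():
    page = request.args.get("page", default=1, type=int)
    if page < 1 or page > TOTAL_PAGES:
        # Answer nonexistent pages with 404 instead of 200 so
        # crawlers don't wander into an infinite space.
        abort(404)
    return f"archive page {page} of {TOTAL_PAGES}"
```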

3) Prevent Google's crawler from crawling pages it cannot handle
Using your robots.txt file, you can keep crawlers away from your login pages, contact forms, shopping carts, and other pages they cannot do anything with. (Crawlers are famously cheap and shy, so they generally do not "add items to the cart" or "contact us" on their own.) This way, crawlers can spend more of their time crawling the content on your site that they can actually handle.
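For example, a robots.txt file at the root of the site might disallow those pages like this (the paths are illustrative; substitute your site's actual login, contact, and cart URLs):

```
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart
```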

4) One man, one vote. One URL, one piece of content
In an ideal world there is a one-to-one correspondence between URLs and content: each URL leads to a unique piece of content, and each piece of content can be reached via only one URL. The closer your site gets to this ideal, the easier it will be to crawl and index. If your content management system or current site setup makes this difficult to achieve, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content.
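For example, if the same page is reachable under several URLs, each duplicate can declare the preferred URL in its <head> (the example.com address is, of course, a placeholder):

```html
<link rel="canonical" href="http://www.example.com/products/item123">
```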
