How does Google Crawler actually work?

I was wondering: how does a web crawler [wikipedia.org] actually work?

Did a search and found out…



How Google Works

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

* Googlebot, a web crawler that finds and fetches web pages.
* The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
* The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let’s take a closer look at each part.
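To make that three-part division concrete, here is a toy sketch in Python (my own illustration, not Google's actual code) of a crawler, an inverted index, and a query processor:

```python
# A toy three-stage search engine: crawler -> indexer -> query processor.
# Illustrative only; the real systems are vastly more complex.

from collections import defaultdict

def crawl(pages):
    """'Fetch' pages. Here `pages` stands in for the web: a dict of URL -> text."""
    for url, text in pages.items():
        yield url, text

def build_index(fetched):
    """Invert the documents: map every word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in fetched:
        for word in text.lower().split():
            index[word].add(url)
    return index

def query(index, terms):
    """Return URLs containing ALL query terms (a simple AND query)."""
    results = None
    for term in terms.lower().split():
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()

web = {
    "a.com": "google crawls the web",
    "b.com": "the web is vast",
}
idx = build_index(crawl(web))
print(query(idx, "the web"))   # both pages contain "the" and "web"
```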
1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

[Screenshot: Google's web page for adding a URL.]

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
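A minimal sketch of such a "visit soon" queue with duplicate elimination might look like this (illustrative only; the names and structure are my own):

```python
# Sketch of a crawl frontier with duplicate elimination (illustrative only).
from collections import deque

class Frontier:
    def __init__(self, seeds):
        self.seen = set()        # URLs already queued or fetched
        self.queue = deque()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Skip URLs we've already queued or fetched, so the same
        # page is never downloaded twice in one crawl.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

f = Frontier(["http://a.com/"])
f.add("http://b.com/")
f.add("http://a.com/")   # duplicate: silently dropped
print(len(f.queue))      # 2
```

A real frontier would also check the queue against the pages already in the index, as the text describes, not just against this crawl's own history.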

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google both to make efficient use of its resources and to keep its index reasonably current.
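As a rough illustration of "recrawl rate proportional to change rate", here is a toy revisit scheduler (the formula and thresholds are invented for the example; this is not Google's actual policy):

```python
# Toy revisit scheduler: pages that change more often get shorter
# recrawl intervals. The heuristic is made up for illustration.

def revisit_interval_days(changes_observed, days_observed,
                          min_days=1, max_days=30):
    """Interval roughly inversely proportional to the observed change rate."""
    if changes_observed == 0:
        return max_days                    # apparently static: deep-crawl cadence
    interval = days_observed / changes_observed
    return max(min_days, min(max_days, interval))

print(revisit_interval_days(30, 30))   # changes daily: recrawl about every day
print(revisit_interval_days(1, 30))    # changed once a month: monthly recrawl
print(revisit_interval_days(0, 30))    # never changed: capped at max_days
```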

Hmm… the next question was:

How often does the crawler actually come and do its work?

Someone had been monitoring it and concluded that…
Below you will find the 10 latest Google Web Crawler events on the Google Dance Tool after August 20, 2003 (the date the script was installed). This will help you develop ideas of your own concerning how Google operates and its crawling patterns, and maybe help figure out the specific function of each Google crawler.

As time goes on more information will be collected from Google’s crawlers and more information on Google’s crawlers will be posted… 🙂 so check back frequently for updates.

Date Crawled          IP Address       Crawler
May 15, 2009 6:53pm   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 3:16pm   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 2:15pm   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 11:20am  66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 10:04am  66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 9:19am   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 6:04am   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 5:02am   66.249.72.52     crawl-66-249-72-52.googlebot.com
May 15, 2009 4:38am   213.199.128.149  tide75.microsoft.com
May 15, 2009 3:02am   88.54.127.81     host81-127-static.54-88-b.business.telecomitalia.it

Third question: what does AdSense have to do with the crawler?

How do your crawler and site diagnostic reports work?

Here are some facts about our crawler and Site Diagnostic reports that you may find helpful:

* The crawler report is updated weekly.
The crawl is performed automatically and we’re not able to accommodate requests for more frequent crawling.
* The AdSense crawler is different from the Google crawler
The two crawlers are separate, but they do share a cache. We do this to avoid both crawlers requesting the same pages, thereby helping publishers conserve their bandwidth. Similarly, the Sitemaps crawler is separate.
* Resolving AdSense crawl issues will not resolve issues with the Google crawl.
Resolving the issues listed on your Site Diagnostics page will have no impact on your placement within Google search results. For more information on your site’s ranking on Google, review our entry on getting included in Google search results.
* The crawler indexes by URL.
Our crawler will access site.com and www.site.com separately. However, our crawler will not count site.com and site.com/#anchor separately.
* The crawler won’t access pages or directories prohibited by a robots.txt file.
Both the Google and AdSense Mediapartners crawlers honor your robots.txt file. In case it prohibits access to certain pages or directories, they will not be crawled.
* The crawler will attempt to access only URLs that request Google ads.
Only pages displaying Google ads should be sending requests to our systems and being crawled.
* The crawler will attempt to access pages that redirect.
When you have “original pages” that redirect to other pages, our crawler must access the original pages to determine that a redirect is in place. Therefore, our crawler’s visit to the original pages will appear in your access logs.
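Python's standard library ships a robots.txt parser, so you can check what a rule file permits the same way a well-behaved crawler would (the sample rules below are made up for the example):

```python
# Checking robots.txt rules with Python's standard library: the same kind
# of check a well-behaved crawler performs before fetching a URL.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally you'd call rp.set_url(".../robots.txt") and rp.read() to fetch it.
# Here we parse a sample file directly to keep the example offline.
rp.parse("""
User-agent: Mediapartners-Google
Disallow: /private/

User-agent: *
Disallow: /tmp/
""".splitlines())

print(rp.can_fetch("Mediapartners-Google", "http://example.com/private/a.html"))  # False
print(rp.can_fetch("Mediapartners-Google", "http://example.com/page.html"))       # True
```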

Fourth: how do I properly monitor Google crawler stats and all that?

I use the following:
1. For real time, I just watch the "Users Online" widget and see if the Google Bot is there.
2. Out of curiosity, I sometimes examine the access.log of the httpd server itself (I wonder if other webadmins do this too).
3. Nice graphs from AWStats.
4. A tool provided by Google (Google Webmaster Tools) at http://www.google.com/webmasters/ — I submitted a sitemap and it generates all the nice graphs.
5. Manually query the expected keyword in Google search and check the result. (I sometimes ask other users to search for a keyword from their own location and compare; interestingly, the results vary.)
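For item 2 (examining the httpd access.log yourself), a small script can count Googlebot hits. This is a sketch assuming the standard Apache "combined" log format, with a made-up sample log:

```python
# Counting Googlebot hits in an Apache "combined" access.log (sketch;
# assumes the standard combined format with the user-agent as the last field).
import re
from collections import Counter

sample_log = [
    '66.249.72.52 - - [15/May/2009:18:53:00 +0800] "GET / HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [15/May/2009:18:54:00 +0800] "GET /about HTTP/1.1" 200 1200 "-" '
    '"Mozilla/5.0 (Windows NT 6.1)"',
]

def googlebot_hits(lines):
    hits = Counter()
    for line in lines:
        if "Googlebot" in line:
            ip = line.split()[0]          # first field is the client IP
            path = re.search(r'"GET (\S+)', line)
            hits[(ip, path.group(1) if path else "?")] += 1
    return hits

print(googlebot_hits(sample_log))   # one hit from 66.249.72.52 on "/"
```

Note that the user-agent string can be spoofed; a stricter check would also verify that the IP reverse-resolves to a crawl-*.googlebot.com host, like the names in the table earlier.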

Fifth: does the way I update my blog/site affect the Google crawler?

1. From ahmadazwan.com: he said the crawler will take a longer period before crawling my blog if it hasn't been updated for a long time. More frequent updates invite more frequent crawler visits.

2. I think some engine in WordPress notifies the crawler whenever a new post is created and published. Avoid editing immediately after publishing; it will take longer for the post to be fully cached and indexed. Make use of drafts: if the content is not ready to be accessed by your readers, do not publish it (though I tend to publish immediately).

3. The WordPress scheduler sends some sort of early notification, but the post is not visible prior to its publishing date.

4. My visitors sometimes report the time a post got indexed in Google, so I can guess: about 2 hours is normal.

From my observation, the earliest indexing I ever got was 5 minutes after an update, for a post based on something from news.google.com. It was midnight and the topic was quite hot. The normal is about 2 hours; with editing after publishing and all sorts of nonsense, it can take 8 hours. Hahaha.

5. If you put the word Google somewhere in the title or text, it tends to get indexed faster. Hahahaa. This is just an assumption.
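The WordPress notification mentioned in item 2 is the weblogUpdates.ping XML-RPC call that blogging software sends to ping services on publish. A minimal sketch (the service URL in the comment is just an example, and the call itself is left commented out because it would hit the network):

```python
# Sketch of the weblogUpdates.ping XML-RPC call that blog software sends
# to ping services when a new post is published. Illustrative only.
import xmlrpc.client

def send_update_ping(service_url, blog_name, blog_url):
    server = xmlrpc.client.ServerProxy(service_url)
    # Standard weblogUpdates.ping signature: (site name, site URL).
    return server.weblogUpdates.ping(blog_name, blog_url)

# Example (would perform a real network call to the ping service):
# send_update_ping("http://rpc.pingomatic.com/", "My Blog", "http://example.com/")
```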

P/S: if the right search term brings a needy user to the exact page where the information lies, it tends to invite nice comments afterward… and also spambots (especially if you put "twitter" as a keyword). 😎

References :

1. http://www.googleguide.com/google_works.html
2. http://www.google-dance-tool.com/google_crawler_history.html
3. http://en.wikipedia.org/wiki/Web_crawler
4. http://www.ahmadazwan.com


3 Responses

  1. HawkEYE says:

    Increase Your Google Adsense Earnings – Competitive Ad Filter

    By Mrvoodoo

    Possibly the most important thing you can do to increase your Adsense earnings:

    Now there are many things you can do in order to fully optimise your web content in order to reap maximum Adsense revenue from your online work. Write quality content, good SEO strategies, link building, getting involved in social media, etc etc. all of which are extremely important in increasing your online earnings.

    However, just as important as writing quality content and spending days/weeks/years finding ways in which to bring in traffic to your site is to take full advantage of that traffic once it’s reached your content.

    Making the most of what you’ve got:

    There are a number of techniques through which you can increase your click-through rate and Adsense earnings, the most important being making your Adsense ads more appealing to visitors. This can be done by optimising Adsense unit placement: e.g. visitors are far more likely to click on an ad unit placed above the fold of the screen (i.e. before a user needs to scroll down to read more) than an ad unit at, say, the bottom right of the screen. Another method of increasing the Adsense click-through rate (CTR) and Adsense earnings is ad unit colour palette optimisation. Successful Adsense units usually follow one of two approaches: the units are blended into the site using a similar colour palette to the site itself, or they use a completely contrasting colour palette, to make the Adsense ad units stand out from the rest of the content.

    But possibly the most important method of increasing your Adsense CTR and Adsense earnings is taking full advantage of the Adsense competitive ad filter.

    Designed to enable Adsense publishers to block adverts advertising their competitors, the Adsense competitive ad filter allows any Adsense publisher to block any unwanted ads from appearing within their content simply by typing in the URL or web address of the offending ad.

    Why is this useful?

    Let’s say that I’d created a website or written a Hubpage, etc. about PC file extensions, lots of people were searching for the file extension information I’d published, and with a few links built I had a fairly decent amount of traffic coming in. However despite all this traffic I had an extremely low click through rate and wasn’t making much money at all, why?

    After a few visits to my site I noticed that most of the ads being served to my content were for things like ‘loft extensions’ and ‘ring-binder files’, things that were totally unrelated to the needs of my visitors. Google Adsense works contextually, which means that it searches your content for the most common words or themes, compares that to its database of advertisers, and serves up ads accordingly.

    It’s possible that an ad for a loft extension might have a higher click-through value than an ad for a file extension utility, but the chances of anyone currently searching for file extension information clicking on it are slim. By making the ads that appear within your content as relevant as possible, you will be surprised at how much this can increase your CTR and Adsense earnings. By spending just a few days visiting one page with relatively good traffic and blocking the irrelevant ads, I managed to increase Adsense earnings from that page ten-fold.

    So how do I go about doing this?

    In order to increase your Adsense CTR and Adsense earnings by using the competitive ad filter, you first need to take note of which ads you would like to block. Visit your site/page/hub/etc. and see what’s being displayed within your content.

    Biting the hand that feeds you:

    ‘IMPORTANT’ – Never ever click on your own ads. This is the number one guaranteed way to get booted from the Adsense program: you are not smarter than Google, you will be caught, and once you’re out there is very little you can do to ever get back in.

    So jot down the URL or web address of the adverts publisher if you can see it or if you can’t:

    – Right-click on the ad title and select either ‘copy shortcut’ (using Internet explorer) or ‘copy link location’ (using Netscape).

    · Open a text editor, such as Notepad, and paste the selection into the text editor. To paste, select Paste from the text editor’s Edit menu.

    · The destination URL of the ad is the text between ‘adurl=’ and ‘&’. As an example, the copied URL will appear as follows (the example has been shortened with ellipses):

    http://pagead2.googlesyndication.com/pagead/adclick?sa=l&(…)&adurl=http://www.blogger.com/signup.g&client=

    or

    http://www.googleadservices.com/pagead/adclick?adurl=http://www.blogger.com/signup.g&sa=

    The destination URL in the examples above is http://www.blogger.com/signup.g
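    The "text between adurl= and &" step can also be done with Python's standard library instead of by eye (a sketch, using the example click URL from above):

```python
# Extracting the destination URL ("adurl") from an ad click link with
# the standard library instead of hunting for it manually.
from urllib.parse import urlparse, parse_qs

click = ("http://www.googleadservices.com/pagead/adclick"
         "?adurl=http://www.blogger.com/signup.g&sa=")

params = parse_qs(urlparse(click).query)
print(params["adurl"][0])   # http://www.blogger.com/signup.g
```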

    How to use the Adsense competitive ad filter:

    Now that you know the URLs of the Adsense publishers you’ve decided to block, you need to visit your Adsense account and log in. Next click on the tab at the top named ‘Adsense setup’; just below the main tabs you will find a link titled ‘competitive ad filter’, which is what you’re looking for, so click it. You will now find yourself faced with some brief instructions and a large textbox within which you can enter the URLs of those unwanted ads. You do not need to include the ‘www.’ part of the URL, i.e. instead of entering ‘www.liftextensions.com’ you would simply enter ‘liftextensions.com’.

    Enter the URLs of the ads you want to block and then click ‘save changes’. It will usually take a few hours or so before the unwanted ads begin to vanish from your site, but once they’ve gone, in theory and usually in practise, you should be left with far more relevant, targeted ads, which will in turn increase your Adsense click-through rate and Adsense earnings.

    To fully optimise on this strategy to increase your Adsense earnings it is worth revisiting your site/content every couple of days or so just to see what’s being displayed, and if necessary block ads that aren’t relevant.

    · It is important to consider your entire portfolio of online content when utilising the competitive filter method to increase your Adsense earnings. Do not block ads for say ‘cheap holiday flights’ if for some reason they mistakenly appear on a low traffic page, if you have a high traffic page elsewhere about bargain holidays abroad. Blocking an ad using the Adsense competitive ad filter will block that ad from appearing across your whole online content portfolio.

    http://hubpages.com/hub/Most-Important-Thing-You-Can-Do-To-Increase-Your-Google-Adsense-Earnings-competitive-ad-filter

  2. HawkEYE says:

    AdsBlackList.com is a unique project designed to enable you to dramatically reduce the amount of MFA (made for ads) and LCPC (low cost per click) sites which appear through the use of PPC systems such as Google Adsense™, Yahoo Publisher Network™ and Chitika eMiniMalls™

    http://www.adsblacklist.com/

  3. green coffee extract weight loss says:

    Oh my goodness! Amazing article dude! Thanks, However I am
    encountering problems with your RSS. I don’t understand the reason why I cannot join it. Is there anybody else having similar RSS issues? Anyone that knows the answer will you kindly respond? Thanks!!

