How does Google Crawler actually work?

I was wondering how a web crawler [wikipedia.org] actually works.

So I did a search and found this:

How Google Works

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

* Googlebot, a web crawler that finds and fetches web pages.
* The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
* The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
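
To make the indexer and query processor parts concrete, here is a toy sketch in Python. It assumes an in-memory inverted index (word to set of URLs) and a plain "all words must match" query with no ranking; the function names and sample pages are made up for illustration and are not how Google actually implements it.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (no ranking)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages standing in for what a crawler would hand over.
pages = {
    "http://example.com/a": "Googlebot fetches web pages",
    "http://example.com/b": "The indexer sorts every word on every page",
}
index = build_index(pages)
print(sorted(search(index, "web pages")))
```

A real query processor would also rank the matches; this sketch only shows the lookup step.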

Let’s take a closer look at each part.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
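
The "deliberately slower per server" idea can be sketched as a small per-host gate. This is only an illustration under assumptions: the class name `PolitenessGate`, the fixed delay, and the injectable clock are all hypothetical, and the real Googlebot policy is not described in the text above.

```python
import time
from urllib.parse import urlsplit

class PolitenessGate:
    """Tracks when each host may next be contacted (hypothetical helper)."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Seconds to wait before requesting this URL (0.0 if none)."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Note that a request to this URL's host was just sent."""
        host = urlsplit(url).netloc.lower()
        self.next_allowed[host] = self.clock() + self.min_delay

# A fake clock keeps the example deterministic.
fake_now = [100.0]
gate = PolitenessGate(min_delay_seconds=2.0, clock=lambda: fake_now[0])
gate.record_request("http://example.com/a")
print(gate.wait_time("http://example.com/b"))  # same host: has to wait
print(gate.wait_time("http://example.org/"))   # other host: no wait
```

Requests to different hosts can still go out in parallel; only repeated requests to the same host are spaced out.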

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Screen shot of web page for adding a URL to Google.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
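
Culling the links from a fetched page and queueing them could look roughly like this. It is a minimal sketch using only the Python standard library; the page URL and HTML are invented, and skipping non-http(s) links is my own assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    # Keep only links a crawler could actually fetch over HTTP(S).
    return [u for u in parser.links if urlsplit(u).scheme in ("http", "https")]

# Hypothetical page content.
page_url = "http://example.com/news/index.html"
html = """
<a href="story1.html">Story</a>
<a href="/about">About</a>
<a href="http://example.org/">Elsewhere</a>
<a href="mailto:someone@example.com">Mail</a>
"""
crawl_queue = deque()
crawl_queue.extend(extract_links(page_url, html))
print(list(crawl_queue))
```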

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
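
The duplicate check on the "visit soon" queue can be sketched with a queue plus a set of known URLs. The `Frontier` class below is a hypothetical name and a simplification: it treats anything already indexed or already queued as known, and leaves the "when to revisit" question aside.

```python
from collections import deque

class Frontier:
    """Queue of URLs to visit that never holds the same URL twice."""

    def __init__(self, already_indexed=()):
        self.queue = deque()
        self.known = set(already_indexed)  # indexed or already queued

    def add(self, url):
        """Queue the URL unless it is already known. Returns True if queued."""
        if url in self.known:
            return False
        self.known.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None

frontier = Frontier(already_indexed=["http://example.com/"])
for url in [
    "http://example.com/",   # already in the index: skipped
    "http://example.com/a",
    "http://example.com/a",  # duplicate in the queue: skipped
    "http://example.com/b",
]:
    frontier.add(url)
print(list(frontier.queue))
```

The set lookup keeps the duplicate check cheap even when thousands of links arrive at once.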

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
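
One simple way to get a revisit rate that follows how often a page changes is to shorten the interval when the page changed since the last visit and lengthen it when it did not. The halve/double rule, the limits, and the function names below are my own assumptions for illustration, not Google's documented method.

```python
import hashlib

MIN_INTERVAL_HOURS = 1        # assumed lower bound
MAX_INTERVAL_HOURS = 24 * 30  # assumed upper bound, about a month

def content_hash(html):
    """Fingerprint of the page body, used to detect changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def next_interval(current_hours, old_hash, new_hash):
    """Halve the revisit interval if the page changed, double it otherwise."""
    if new_hash != old_hash:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))

before = content_hash("<p>old headline</p>")
after = content_hash("<p>new headline</p>")
print(next_interval(24, before, after))   # page changed: come back sooner
print(next_interval(24, before, before))  # unchanged: come back later
```

With this rule a page that changes on every visit drifts toward the minimum interval, while a static page drifts toward the monthly one.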

Hmm, the next question was:

How often does the crawler actually come by and do its work?
Someone has been monitoring it and reported the following:
Below you will find the 10 latest Google Web Crawler events on Google Dance Tool after August 20, 2003 (the date the script was installed). This will help you develop ideas of your own concerning how Google operates, their crawling patterns, and maybe help figure out the specific function of each google crawler.

As time goes on more information will be collected from Google’s crawlers and more information on Google’s crawlers will be posted… 🙂 so check back frequently for updates.

Date Crawled           IP Address        Crawler
May 15, 2009 6:53pm    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 3:16pm    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 2:15pm    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 11:20am   66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 10:04am   66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 9:19am    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 6:04am    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 5:02am    66.249.72.52      crawl-66-249-72-52.googlebot.com
May 15, 2009 4:38am    213.199.128.149   tide75.microsoft.com
May 15, 2009 3:02am    88.54.127.81      host81-127-static.54-88-b.business.telecomitalia.it
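
To turn the ten events above into a "how often" number, the gaps between consecutive Googlebot visits can be computed. This is a small sketch: the rows are copied from the log above, while the helper names are mine and the date parsing assumes English month names.

```python
from datetime import datetime

# (date, time, crawler host) copied from the events listed above.
LOG = [
    ("May 15, 2009", "6:53pm", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "3:16pm", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "2:15pm", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "11:20am", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "10:04am", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "9:19am", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "6:04am", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "5:02am", "crawl-66-249-72-52.googlebot.com"),
    ("May 15, 2009", "4:38am", "tide75.microsoft.com"),
    ("May 15, 2009", "3:02am",
     "host81-127-static.54-88-b.business.telecomitalia.it"),
]

def parse(date_text, time_text):
    """Parse e.g. 'May 15, 2009' + '6:53pm' (English month names assumed)."""
    return datetime.strptime(f"{date_text} {time_text}", "%B %d, %Y %I:%M%p")

def googlebot_gaps_minutes(rows):
    """Minutes between consecutive visits from *.googlebot.com hosts."""
    visits = sorted(
        parse(d, t) for d, t, host in rows if host.endswith(".googlebot.com")
    )
    return [
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(visits, visits[1:])
    ]

gaps = googlebot_gaps_minutes(LOG)
print(gaps)
print(sum(gaps) / len(gaps))  # average gap in minutes
```

For these rows the gaps range from 45 to 217 minutes, which works out to roughly one Googlebot visit every two hours on that day.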

Third question: what does AdSense have to do with the crawler?

How do your crawler and site diagnostic reports work?

Here are some facts about our crawler and Site Diagnostic reports that you may find helpful:

* The crawler report is updated weekly.
The crawl is performed automatically and we’re not able to accommodate requests for more frequent crawling.
* The AdSense crawler is different from the Google crawler
The two crawlers are separate, but they do share a cache. We do this to avoid both crawlers requesting the same pages, thereby helping publishers conserve their bandwidth. Similarly, the Sitemaps crawler is separate.
* Resolving AdSense crawl issues will not resolve issues with the Google crawl.
Resolving the issues listed on your Site Diagnostics page will have no impact on your placement within Google search results. For more information on your site’s ranking on Google, review our entry on getting included in Google search results.
* The crawler indexes by URL.
Our crawler will access site.com and www.site.com separately. However, our crawler will not count site.com and site.com/#anchor separately.
* The crawler won’t access pages or directories prohibited by a robots.txt file.
Both the Google and AdSense Mediapartners crawlers honor your robots.txt file. In case it prohibits access to certain pages or directories, they will not be crawled.
* The crawler will attempt to access only URLs that request Google ads.
Only pages displaying Google ads should be sending requests to our systems and being crawled.
* The crawler will attempt to access pages that redirect.
When you have “original pages” that redirect to other pages, our crawler must access the original pages to determine that a redirect is in place. Therefore, our crawler’s visit to the original pages will appear in your access logs.
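
Two of the points above, indexing by URL and honoring robots.txt, can be tried out with Python's standard library. This is a sketch under assumptions: the robots.txt content is invented, `crawl_key` is a hypothetical helper, and "Mediapartners-Google" is used as the AdSense crawler's user-agent name.

```python
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

def crawl_key(url):
    """Drop the #fragment; differences in host name (www.) are kept."""
    return urldefrag(url).url

def make_robots(lines):
    """Build a robots.txt checker from the file's lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

# Hypothetical robots.txt for site.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()

robots = make_robots(ROBOTS_TXT)

# site.com/#anchor and site.com/ count as the same URL ...
print(crawl_key("http://site.com/#anchor") == crawl_key("http://site.com/"))
# ... but www.site.com and site.com do not.
print(crawl_key("http://www.site.com/") == crawl_key("http://site.com/"))
# Pages under a disallowed directory are not crawled.
print(robots.can_fetch("Mediapartners-Google", "http://site.com/private/page.html"))
print(robots.can_fetch("Mediapartners-Google", "http://site.com/public.html"))
```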

Fourth: how do I properly monitor Google crawler stats?

I use the following:
1. For real time, I just look at the “User Online” list and check whether Googlebot is there.
2. Out of curiosity, I sometimes examine the access.log of the httpd server itself (I wonder whether other web admins do the same).
3. The nice graphs from AWStats.
4. Google Webmaster Tools, a tool provided by Google at http://www.google.com/webmasters/
I submitted a sitemap and it generates nice graphs.
5. Manually searching Google for the expected keyword and checking the results. (Sometimes I ask other users to search for a keyword from their own location and compare the results; interestingly, they vary.)
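
For the access.log option, a few lines of Python can count Googlebot hits per day. This is a rough sketch that assumes the Apache "combined" log format; the sample lines are invented, and since a user-agent string can be faked, a reverse DNS lookup on the IP is the more reliable check.

```python
import re
from collections import Counter

# Apache "combined" format: ip - - [date:time zone] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def googlebot_hits_per_day(lines):
    """Count requests per day whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("day")] += 1
    return hits

# Invented sample lines; in practice iterate over open("access.log").
SAMPLE = [
    '66.249.72.52 - - [15/May/2009:05:02:11 +0800] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.72.52 - - [15/May/2009:06:04:40 +0800] "GET /about HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [15/May/2009:06:10:00 +0800] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (X11; Linux i686)"',
]
print(googlebot_hits_per_day(SAMPLE))
```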

Fifth: does how often I update my blog/site affect the Google crawler?

1. From ahmadazwan.com: he said the crawler takes longer to come back to a blog that hasn’t been updated for a long time.
More frequent updates invite more frequent crawler visits.

2. I think some engine in WordPress notifies the crawler whenever a new post is created and published.
Avoid editing right after publishing; the post then takes longer to be fully cached and indexed.
Make use of drafts: if the content is not ready for your readers, do not publish it (though I tend to publish immediately).

3. The WordPress scheduler seems to send some sort of early notification, but the post is not visible before the publishing date.

4. My visitors sometimes report when a post got indexed by Google, so my guess is that about 2 hours is normal.

From my observation, the earliest I ever got indexed was 5 minutes after an update, for a post about something from news.google.com.
It was midnight and the topic was quite hot.
The normal time was about 2 hours, and with editing after publishing and all sorts of nonsense it took 8 hours, hahaha.

5. Putting the word Google somewhere in the title or text seems to get a post indexed faster, hahaha. This is just an assumption.

P.S.: if the right search words bring a user in need to the exact page where the information lies, it tends to invite nice comments afterward, and also spambots (especially if you put Twitter as a keyword). 😎

References :

1. http://www.googleguide.com/google_works.html
2. http://www.google-dance-tool.com/google_crawler_history.html
3. http://en.wikipedia.org/wiki/Web_crawler
4. http://www.ahmadazwan.com

24 Responses

  1. ahstod says:

    … how long did you manage to get away without installing this one? 😎

  2. namran says:

    approximately within 48hours.. hahaha.. (very slow-pace mode) as the file exist since..

    Mon Mar 9 15:28:31 2009 UTC (32 hours, 49 minutes ago) by ..

    quite weird when most of the item are just disappeared..
    .. at first thought some config problem.. removing the whole project directory and re-checkout.. still the same..
    then only figured one by one.. lol..

  3. Andre says:

    Worked for me. Many thanks!

    • HawkEYE says:

      @Andre : you’re welcome.. hopefully it help those in need.

  4. Spyder461 says:

    I’m still having this problem. I’ve done the yum install php-xml, but I still get the error about the missing DomDocument class. I have read some posts that talk about executing an enable lib-xml extension command, but when I run this command, I get an error that lib-xml and extension is not a builtin command. Any suggestions would be appreciated.

  5. danyal says:

    really very helpful it works

  6. sachin says:

    Good, It worked

  7. nszumowski says:

    Awesome! Everywhere else I was looking referenced libxml2 which I had already installed. This fixed my issue!

  8. nico says:

    Awesome! thanks so much.

  9. battisti says:

    Thx, after a lot of hours spent in the source, the problem was in the server! 🙁

  10. dungkal says:

    Updating the php-xml (note: it had already been installed long before the problem cropped up) on my CentOS server did the trick.

    Thanks for the help.

  11. wika says:

    Thanks, it worked !!

  12. rc says:

    Just a note, if you’re on centos and you had to do a custom install of php 5.3, yum install php53-xml will do the trick

  13. elliot says:

    @rc – thank you!

  14. Andrea says:

    Woooow, thank you!
    U save my ass…

  15. littleguy says:

    This worked great, thanks!

  16. mark says:

    Thank you , was baffled with this error, your post saved the day

  17. Smoker says:

    It worked for me like a charm. Thank you very much.

  18. Thanga says:

    ANSWER IS HANDY :)THANKS A LOT

  19. axlotl says:

    Thank you, sir.

  20. dduane says:

    I also had to add the line:
    extension=dom.so
    to my php.ini file and restart apache. I’m using Fedora 16.

  21. vb says:

    Thanks. saved a lot of time. Had that issue with owncloud

  22. khyox says:

    Thank you very much indeed! It solved my issue viewing the internal wiki syntax page with dokuwiki running over lighttpd in Scientific Linux.

  23. Andre says:

    Thank you very much, the tip to run yum install php-xml was really worth it. Excellent help, I really needed it.

