What Is a Web Crawler?
A web crawler is a program, an automated script, that browses the internet in a systematic way. The crawler looks at the keywords on each page, the kind of content the page has, and its links, before returning this information to the search engine. This process is known as web crawling.
The page you need is indexed by software known as a web crawler, or spider. A web crawler gathers pages from the web and indexes them in a methodical, automated manner to support search engine queries. Crawlers can also help validate HTML code and check links.
These web crawlers go by different names: bots, automatic indexers, robots. Long before you type a search query, these crawlers have scanned the relevant pages and turned them into a huge index.
For example, Google’s crawlers go through pages on the web and fetch them to Google’s servers, where they are indexed in its database. A web crawler follows all the hyperlinks on a website and visits other websites as well.
Web crawlers are configured to monitor the web regularly so the results they generate are updated and timely.
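The crawl-and-index loop just described can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: the three pages live in an in-memory dictionary that stands in for HTTP fetches, and all URLs are invented.

```python
import re
from collections import deque

# A tiny in-memory "web": URL -> HTML body (a stand-in for real HTTP fetches).
WEB = {
    "http://a.example": '<a href="http://b.example">B</a> crawling basics',
    "http://b.example": '<a href="http://a.example">A</a> <a href="http://c.example">C</a>',
    "http://c.example": "a page with no outgoing links",
}

def crawl(seed):
    """Breadth-first crawl: fetch a page, record its words, follow its links."""
    frontier = deque([seed])
    seen = {seed}
    index = {}  # URL -> list of words found on the page
    while frontier:
        url = frontier.popleft()
        html = WEB.get(url, "")
        text = re.sub(r"<[^>]+>", " ", html)          # strip tags
        index[url] = re.findall(r"[a-z]+", text.lower())
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:                      # schedule each URL only once
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("http://a.example")
```

Starting from a single seed URL, the loop discovers and indexes all three pages by following links, which is exactly the systematic browsing described above.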
How Google Search Works?
Google gets information from many different sources, including web pages, user-submitted content such as Google My Business and Maps submissions, book scanning, public databases on the Internet, and many other sources.
Google follows three basic steps to generate results from web pages:
1. Crawling
Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.
Google uses a huge set of computers to fetch, or crawl, billions of pages on the web. The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
Google's crawl process begins with a list of web page URLs, generated from previous crawl processes, and augmented with Sitemap data provided by webmasters. As Googlebot visits each of these websites it detects links on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
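Seeding the next crawl from the previous crawl's URL list plus webmaster-provided Sitemap data, as described above, amounts to a deduplicated merge. A small sketch with hypothetical URLs:

```python
from collections import deque

previous_crawl_urls = ["http://example.com/", "http://example.com/old-page"]
sitemap_urls = ["http://example.com/", "http://example.com/new-page"]

# Merge both sources into one crawl frontier, keeping first-seen order and
# scheduling each URL only once.
frontier = deque(dict.fromkeys(previous_crawl_urls + sitemap_urls))
```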
Google says there are more than 60 trillion individual pages on the World Wide Web. Web crawlers crawl through these pages to bring back the results users demand. Site owners can decide which of their pages they want web crawlers to index, and they can block the pages that needn’t be indexed.
How Does Google Find a Page?
Google uses many techniques to find a page, including:
a. Following links from other sites or pages
b. Reading sitemaps
How Does Google Know Which Pages Not to Crawl?
a. Pages blocked in robots.txt won't be crawled, but still might be indexed if linked to by another page. (Google can infer the content of the page by a link pointing to it, and index the page without parsing its contents.)
b. Google can't crawl any pages not accessible by an anonymous user. Thus, any login or other authorization protection will prevent a page from being crawled.
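Python's standard library can evaluate robots.txt rules like the ones described in point a. A small sketch, using a made-up robots.txt for a hypothetical site:

```python
from urllib.robotparser import RobotFileParser

# Parse a sample robots.txt that blocks /private/ for all crawlers.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rules.can_fetch("Googlebot", "http://example.com/private/page"))  # False
print(rules.can_fetch("Googlebot", "http://example.com/public/page"))   # True
```

A well-behaved crawler checks these rules before every fetch; as noted above, a disallowed page can still end up indexed if other pages link to it.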
2. Indexing
Googlebot processes each of the pages it crawls in order to compile a massive index of all the words it sees and their location on each page. Googlebot can process many, but not all, content types; for example, it cannot process the content of some rich media files.
If the page is blocked by a robots.txt file, a login page, or some other mechanism, it is still possible that the page will be indexed even if Google never visited it!
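The "index of all the words and their location on each page" described above is classically an inverted index. A toy sketch, with invented page contents:

```python
from collections import defaultdict

pages = {
    "url1": "web crawlers index the web",
    "url2": "search engines rank indexed pages",
}

# Inverted index: each word maps to the (page, position) pairs where it occurs.
index = defaultdict(list)
for url, text in pages.items():
    for position, word in enumerate(text.split()):
        index[word].append((url, position))

print(index["web"])  # [('url1', 0), ('url1', 4)]
```

Looking up a query word is then a single dictionary access rather than a scan of every page, which is what makes searching billions of documents feasible.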
3. Serving Results
When a user enters a query, Google's machines search the index for matching pages and return the results it believes are most relevant to the user. Relevancy is determined by over 200 factors, and Google is always working to improve its algorithm. Google considers the user experience in choosing and ranking results, so be sure that your page loads fast and is mobile-friendly.
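Serving can be sketched as a lookup in such an index followed by a crude ranking. The pages and term counts below are invented, and scoring by raw term counts merely stands in for Google's 200-plus relevancy factors:

```python
from collections import Counter

# Toy index: term -> pages containing it, with a per-page occurrence count.
index = {
    "fast": {"pageA": 3, "pageB": 1},
    "cars": {"pageA": 2, "pageC": 4},
}

def serve(query):
    """Return pages matching any query term, ranked by total term count."""
    scores = Counter()
    for term in query.lower().split():
        for page, count in index.get(term, {}).items():
            scores[page] += count
    return [page for page, _ in scores.most_common()]

print(serve("fast cars"))  # ['pageA', 'pageC', 'pageB']
```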
How Search Algorithms Work?
Here are some of the ways Google uses search algorithms to return useful information from the web:
a. Analyzing Your Words
Understanding the meaning of your search is crucial to returning good answers. So, to find pages with relevant information, the first step is to analyze what the words in your search query mean. Google builds language models to try to decipher which strings of words it should look up in the index.
This involves steps as seemingly simple as interpreting spelling mistakes, and extends to trying to understand the type of query you’ve entered by applying some of the latest research on natural language understanding. For example, Google's synonym system helps Search know what you mean, even if a word has multiple definitions.
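As a rough illustration of these two steps, spelling correction and synonym handling can be sketched with lookup tables. The tables here are invented; real systems learn such mappings from data rather than hard-coding them.

```python
# Hypothetical correction and synonym tables (illustrative only).
SPELL = {"serch": "search", "engin": "engine"}
SYNONYMS = {"search": ["search", "lookup"], "car": ["car", "automobile"]}

def analyze(query):
    """Correct obvious misspellings, then expand each word with its synonyms."""
    corrected = [SPELL.get(word, word) for word in query.lower().split()]
    return [SYNONYMS.get(word, [word]) for word in corrected]

print(analyze("serch engin"))  # [['search', 'lookup'], ['engine']]
```

Each query word becomes a small set of alternatives, any of which may be looked up in the index.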
b. Matching Your Search
Next, Google looks for web pages with information that matches your query. When you search, at the most basic level, Google's algorithms look up your search terms in the index to find the appropriate pages. Google analyzes how often and where those keywords appear on a page, whether in titles or headings or in the body of the text.
As well as matching keywords, algorithms look for clues to measure how well potential search results give users what they are looking for. Google tries to figure out whether the page contains an answer to your query and doesn’t just repeat it. For a query like “dogs”, for instance, search algorithms analyze whether pages include relevant content beyond the keyword itself, such as pictures of dogs, videos, or even a list of breeds. Finally, Google checks whether the page is written in the same language as your question, in order to prioritize pages in your preferred language.
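Weighting keyword hits by where they appear, with titles and headings counting more than body text, can be sketched as follows; the field weights are purely illustrative assumptions:

```python
# Illustrative field weights: a hit in the title counts more than one in the body.
WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def match_score(page, terms):
    """Score a page by how often each term appears in each field, weighted."""
    score = 0.0
    for field, text in page.items():
        words = text.lower().split()
        for term in terms:
            score += WEIGHTS[field] * words.count(term)
    return score

page = {"title": "dog breeds", "body": "a list of dog breeds with dog pictures"}
print(match_score(page, ["dog"]))  # 3.0*1 (title) + 1.0*2 (body) = 5.0
```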
c. Ranking Useful Pages
To help rank the best pages first, Google also writes algorithms that evaluate how useful these web pages are.
These algorithms analyze hundreds of different factors to try to surface the best information the web can offer, from the freshness of the content, to the number of times your search terms appear and whether the page has a good user experience. In order to assess trustworthiness and authority on its subject matter, Google looks for sites that many users seem to value for similar queries. If other prominent websites on the subject link to the page, that’s a good sign the information is high quality.
There are many spammy sites on the web that try to game their way to the top of search results through techniques like repeating keywords over and over or buying links that pass PageRank. These sites provide a very poor user experience and may even harm or mislead users. So Google writes algorithms to identify spam and to remove sites that violate Google’s webmaster guidelines from its results.
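The link signal mentioned above is the idea behind PageRank: a page linked to by other well-regarded pages is itself likely to be valuable. A minimal sketch of the iterative computation on a toy three-page graph (0.85 is the damping factor from the original PageRank paper):

```python
# Toy link graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank across its outgoing links."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new = {page: (1 - damping) / n for page in links}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

rank = pagerank(links)
# C is linked to by both A and B, so it ends up with the highest rank.
```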
d. Considering Context
Information such as your location, past search history, and search settings all help Google tailor your results to what is most useful and relevant for you in that moment.
Google uses your country and location to deliver content relevant to your area. For instance, if you’re in Chicago and you search “football”, Google will most likely show you results about American football and the Chicago Bears first, whereas if you search “football” in London, Google will rank results about soccer and the Premier League higher. Search settings are also an important indicator of which results you’re likely to find useful, such as whether you have set a preferred language or opted in to SafeSearch (a tool that helps filter out explicit results).
In some instances, Google may also personalize your results using information about your recent search activity. For instance, if you search for “Barcelona” and recently searched for “Barcelona vs Arsenal”, that could be an important clue that you want information about the football club, not the city. You can control what search activity is used to improve your Search experience, including adjusting what data is saved to your Google account, at myaccount.google.com.
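A toy sketch of context-based re-ranking follows; the signals and weights are invented purely for illustration and bear no relation to Google's actual personalization:

```python
def contextualize(results, context):
    """Boost results whose topic matches the user's recent search activity,
    or whose country matches the user's location (weights are made up)."""
    def score(result):
        s = result["base_score"]
        if result["topic"] in context.get("recent_topics", []):
            s += 1.0
        if "country" in context and result.get("country") == context["country"]:
            s += 0.5
        return s
    return sorted(results, key=score, reverse=True)

results = [
    {"title": "Barcelona travel guide", "topic": "city", "base_score": 1.0},
    {"title": "FC Barcelona news", "topic": "football", "base_score": 1.0},
]
ranked = contextualize(results, {"recent_topics": ["football"]})
print(ranked[0]["title"])  # FC Barcelona news
```

With no context, the two results tie; the recent "Barcelona vs Arsenal" style activity tips the ranking toward the football club.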
e. Returning the Best Results
Before Google serves your results, it evaluates how all the relevant information fits together: is there only one topic among the search results, or many? Are there too many pages focusing on one narrow interpretation? Google strives to provide a diverse set of information, in formats that are most helpful for your type of search.
Furthermore, Google provides a number of features that make your search more effective, such as:
a. Spelling — If there is an error in a word you typed, Google suggests a number of alternatives to help you get back on track.
b. Google Instant — Instant results as you type.
c. Search Methods — Different options for searching, other than just typing out the words. This includes images and voice search.
d. Auto Complete — Anticipates what you need from what you type.
Edited by: 浪子
Bibliography
Cabot Technology Solutions. (2017). Web Crawlers — Everything You Need to Know. Retrieved from https://medium.com/@cabot_solutions/web-crawlers-everything-you-need-to-know-6dce26ee8ad8
Google. (n.d.). How Google Search Works. Retrieved from https://support.google.com/webmasters/answer/70897
Google. (n.d.). How Search Algorithms Work. Retrieved from https://www.google.com/search/howsearchworks/algorithms/
Reviewed by 浪子 on December 14, 2018