Friday 9 March 2012

What is a search engine?


A Web search engine is a tool designed to search for information on the World Wide Web. The information may consist of web pages, images and other types of files.



What is a spider?


A spider is a program that automatically fetches Web pages. Spiders are used to feed pages to search engines, and the name comes from the way they crawl over the Web. Another term for these programs is web crawler. Because most Web pages contain links to other pages, a spider can start almost anywhere: as soon as it sees a link to another page, it goes off and fetches that page too. Large search engines, like AltaVista, have many spiders working in parallel.
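To make the idea concrete, here is a minimal sketch of a spider in Python (a toy illustration only; the URL is a placeholder, and real spiders also respect robots.txt, rate limits and deduplication at far larger scale):

# Toy spider: fetch a page, pull out its links, then fetch those in turn.
from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to fetch
        parser = LinkParser()
        parser.feed(html)
        # Resolve relative links against the current page and queue them
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

crawl("http://www.example.com/")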


How Web Search Engines Work


Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags and also follow the links on the site, indexing all of the linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed; the frequency with which this happens is determined by the administrators of the search engine.
 
Human-powered search engines rely on humans to submit information, which is subsequently indexed and catalogued; only information that is submitted makes it into the index. In both cases, when you query a search engine to locate information, you're actually searching through the index that the search engine has created; you are not searching the live Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Since the results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still active even though it no longer is. It will remain that way until the index is updated.
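Conceptually, such an index maps each word to the pages that contain it, so answering a query is a lookup rather than a live scan of the Web. A minimal sketch in Python, using made-up pages:

# Build a tiny inverted index: each word maps to the set of pages containing it.
pages = {
    "http://www.example.com/a.html": "search engines crawl the web",
    "http://www.example.com/b.html": "spiders crawl pages and follow links",
}

index = {}
for url, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

# A query is answered from the index, not by re-reading the live pages.
print(index["crawl"])   # both pages contain "crawl"
print(index["links"])   # only the second page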

Major search engines

Google
Yahoo
MSN/Bing

Robots

Google: Googlebot
MSN/Bing: MSNBot
Yahoo: Yahoo! Slurp

 

Robots.txt File

Robots.txt is a file that gives instructions to search engine spiders about which pages of a website to index or follow. It is normally used to stop search engine spiders from indexing the unfinished pages of a website during its development phase, and many webmasters also use it to keep out spam bots. The creation and uses of the robots.txt file are shown below:

Robots.txt Creation:

To keep all robots out:
User-agent: *
Disallow: /

To block a page from all crawlers:
User-agent: *
Disallow: /page-name/

To block a page from a specific crawler:
User-agent: Googlebot
Disallow: /page-name/

To block images from a specific crawler:
User-agent: Googlebot-Image
Disallow: /

To allow all robots:
User-agent: *
Disallow:

Finally, some crawlers, most notably Google's, now support an additional field called "Allow:".

To disallow all crawlers from your site EXCEPT Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /


"Robots" Meta Tag

If you want a page indexed but do not want any of the links on the page to be followed, use:
<meta name="robots" content="index,nofollow" />

If you don't want a page indexed but want all the links on the page to be followed, use:
<meta name="robots" content="noindex,follow" />

If you want a page indexed and all the links on the page followed, use:
<meta name="robots" content="index,follow" />

If you want a page neither indexed nor its links followed, use:
<meta name="robots" content="noindex,nofollow" />

To invite robots to index and follow all pages:
<meta name="robots" content="all" />

To stop robots from indexing or following any pages:
<meta name="robots" content="none" />
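Wherever they are used, these tags belong in the page's <head> section. A minimal hypothetical page:

<html>
<head>
<title>Example Page</title>
<meta name="robots" content="noindex,follow" />
</head>
<body>
<p>Page content here.</p>
</body>
</html>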

Robots.txt vs. Robots Meta Tag


Robots.txt
While Google won't crawl or index the content of pages blocked by robots.txt, it may still index the URLs if it finds them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

In order to use a robots.txt file, you'll need to have access to the root of your domain (if you're not sure, check with your web host). If you don't have access to the root of a domain, you can restrict access using the robots meta tag.

Robots Meta Tag
To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index.

When Google sees the noindex meta tag on a page, it will completely drop the page from its search results, even if other pages link to it. Other search engines, however, may interpret this directive differently; as a result, a link to the page can still appear in their search results.

Note that because Googlebot has to crawl your page in order to see the noindex meta tag, there's a small chance it won't see and respect the tag. If your page is still appearing in results, it's probably because Google hasn't crawled your site since you added the tag. (Also, if you've used your robots.txt file to block the page, Googlebot won't be able to see the tag either.)
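For example, the following combination (a hypothetical illustration) defeats itself. With this in robots.txt:

User-agent: *
Disallow: /private/

a noindex tag on a page such as /private/page.html is never seen, because crawlers are never allowed to fetch the page in the first place. For the noindex to be respected, the page must remain crawlable.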

If the content is currently in Google's index, it will be removed the next time the page is crawled. To expedite removal, use the URL removal request tool in Google Webmaster Tools.

Validate Your Code


There are several ways to validate the accuracy of your website's source code. The four most important, in my opinion, are validating your search engine optimization, HTML and CSS, and ensuring that you have no broken links or images.

Start by analyzing broken links. One of the W3C's top SEO tips would be to use their Link Checker tool to validate your links. If you have a lot of links on your website, this could take a while.
Next, revisit the W3C to analyze your HTML and CSS. Here is a link to the W3C's HTML Validation Tool and to their CSS Validation Tool.

The final step in the last of my top SEO tips is to validate your search engine optimization. Without having to purchase software, the best online tool I've used is ScrubTheWeb's Analyze Your HTML tool. STW has built such an extensive online application that you'll wonder how you ever lived without it.
One of my favorite features of STW's SEO tool is its attempt to mimic a search engine. In other words, the results of the analysis will show you (theoretically) how search engine spiders may see the website.
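If you'd rather script a quick link check yourself, here is a minimal sketch in Python (the URLs are placeholders; real checkers also follow redirects and throttle their requests):

# Check each URL and report any that appear broken or unreachable.
from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError

urls = [
    "http://www.example.com/",
    "http://www.example.com/missing-page.html",
]

for url in urls:
    try:
        # A HEAD request is enough to test whether the link resolves.
        response = urlopen(Request(url, method="HEAD"), timeout=10)
        print(url, "OK", response.getcode())
    except HTTPError as err:
        print(url, "broken:", err.code)        # e.g. 404 for a dead link
    except URLError as err:
        print(url, "unreachable:", err.reason)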

Install a sitemap.xml for Google


Though you may feel like it is impossible to get listed high in Google's search engine results pages, believe it or not that isn't Google's intention. They simply want to ensure that their users get the most relevant results possible. In fact, they've even created a program just for webmasters to help ensure that your pages get cached in their index as quickly as possible. They call the program Google Sitemaps. In this tool, you'll also find a great new linking tool to help discover who is linking to your website.

For Google, these two pieces of the top SEO tips would be to read the tutorial entitled How Do I Create a Sitemap File and to create your own. To view the one for this website, simply right-click this SEO Tips Sitemap.xml file and save it to your desktop. Open the file with a text editor such as Notepad.
Effective November 2006, Google, Yahoo! and MSN use one standard for sitemaps. Below is a snippet of the standard code as listed at Sitemaps.org. The optional fields are lastmod, changefreq and priority.


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset> 
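If you would rather generate the file than hand-edit it, a small Python sketch such as the following produces the same structure (the URLs and dates are placeholders; only loc is required):

# Write a minimal sitemap.xml from a list of (URL, last-modified) pairs.
from xml.sax.saxutils import escape

entries = [
    ("http://www.example.com/", "2005-01-01"),
    ("http://www.example.com/about.html", "2005-02-15"),
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for loc, lastmod in entries:
    lines += ["   <url>",
              "      <loc>%s</loc>" % escape(loc),
              "      <lastmod>%s</lastmod>" % lastmod,
              "   </url>"]
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))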

The equivalent of the sitemap.xml file for Yahoo! is urllist.txt. Technically you can call the file whatever you want, but all it really contains is a list of every page on your website.
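Here is what a minimal urllist.txt might look like, using placeholder URLs:

http://www.example.com/
http://www.example.com/about.html
http://www.example.com/contact.html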

Include a robots.txt File


By far the easiest of the top SEO tips you will ever follow as it relates to search engine optimization is to include a robots.txt file at the root of your website. Open up a text editor such as Notepad, type the two lines "User-agent: *" and "Disallow:", then save the file as robots.txt and upload it to the root directory of your domain. These two lines tell any spider that hits your website to "please feel free to crawl every page of my website".

Here's one of my best top SEO tips: because the search engine analyzes everything it indexes to determine what your website is all about, it is a good idea to block folders and files that have nothing to do with the content you want analyzed. You can stop unrelated files from being read by adding "Disallow: /folder_name/" or "Disallow: /filename.html" lines. Here is an example of what such a robots.txt file might look like:
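# Hypothetical example -- adjust the paths to match your own site
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /print-version.html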