Yolinux.com Tutorial

YoLinux: Search Engine Review

How to add web page search and web page indexing capability to your web site.

Search Engines:

Adding search to your website:

Many search options are available for your web site:

  • Outsourced Search: Use a search service to spider and index your site. The service supplies the search box for your web pages, runs the search engine and returns search results which point back to your site. The search index and the spidering of the web site are handled by the search service. Google and others can provide this service. It is the equivalent of a Google search with:
    search-word site:your-domain.com
    The HTML form which calls the Google search engine can also embed the domain to achieve the same effect. Google and other search firms provide both free and for-fee services.

  • Your own Search: Index your site and provide your own search capabilities. Commercial and open source solutions exist. This can be a CGI program which performs a grep/search on the site contents each time it is called, or it can use a previously generated index of the web site contents for faster results.

  • Search Appliance: A separate search "appliance" or search server can spider your web site and provide the search facility for your site. This works best for sites with multiple web servers or for intranets with multiple file and web servers.
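
The outsourced option above comes down to appending a site: restriction to whatever the visitor typed. The following Python sketch builds such a URL. The function name site_search_url and the example domain are illustrative assumptions, and the URL used is the ordinary public Google search address, not a dedicated site-search product.

```python
from urllib.parse import urlencode

def site_search_url(words, domain):
    """Build a Google search URL restricted to one domain (illustrative helper)."""
    # Same query a visitor would type by hand: search-word site:your-domain.com
    query = f"{words} site:{domain}"
    # urlencode escapes spaces and the colon so the query is safe in a URL
    return "https://www.google.com/search?" + urlencode({"q": query})

if __name__ == "__main__":
    print(site_search_url("search-word", "your-domain.com"))
```

An HTML search form achieves the same effect by submitting the visitor's words together with the fixed domain restriction.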

Commercial Search Services:

Product | Vendor | Web Site
Google site search | Google | http://www.google.com/help/features.html#sitesearch or Google APIs or AdSense for search
Yahoo Searchbox | Yahoo | Yahoo small business searchbox
Bloodhound | Bloodhound | http://www.bloodhound.com/

Commercial Search Engine Software Vendors:

Vendor | Product
Focuseek | Searchbox2: Indexes HTML, PDF, MS Word, RTF and plain text documents
Folio | Folio Site Director
Google | Google enterprise solutions (based on Stanford research)
SLI Systems | Learning search (for eCommerce sites)
Lycos | Inmagic
Maxum Development Corp. | Phantom
Netscape | Compass Server
Quadralay Corp | Web Works Search
HotBot | www.hotbot.com
Opentext | Livelink

Multi-Media

Product | Vendor | Web Site | Use
UKMax | - | - | -
ICQ | - | - | -
Copernic Technologies Inc. (Quebec City) | - | www.copernic.com | Queries multiple search engines.
Clever | IBM | www.almaden.ibm.com/cs/k53/clever.html | Ranks search results, most authoritative first.
Thunderstone | - | www.thunderstone.com | -
Direct Hit | - | www.directhit.com | Uses personal info to modify search. Incorporates relevance ranking.
Islip | - | www.islip.com | Indexes video closed-captioned text.
Network Wizards | - | www.nw.com | -

List of Open Source Software Search options:

Search Engine | Notes
perl_site_search | Simplest search to implement.
SWISH | Version 1.1: Use on a low number of local pages only.
SWISH++ | The fastest SWISH. Written in C++.
Lucene | From the Apache group. Written in Java and runs on Tomcat.
WebGlimpse/Glimpse | Original U of Arizona and commercial versions. Written in Perl and C. Indexes HTML, PDF, Word and other formats.
freeWais | Can perform "And", "Or" and "Not" type searches.
freeWais-sf | One of the first available content indexing/search engines. The SF is for "Structured Fields". These fields are used for information types such as author, title, date... Can perform "And", "Or" and "Not" type searches.
DataParkSearch | HTML, plain text, MP3 audio and GIF images. Supports synonyms and fuzzy search. Multi-character support. Index and CGI. GPL.

Spider/Robot Index and Engine:
ht/Dig | Search/index a single site resident on the server or spider remote WWW servers. Supports robots.txt exclusions. HTML and plain text documents. GPL. (San Diego State U.)
       | See the YoLinux htDig Web Site Search installation and configuration tutorial (default Red Hat/Fedora/CentOS web site search).
Harvest (Robot Indexer) | Supported formats include HTML, TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources and PDF (using Xpdf). Modular. Written in Perl.
Solr and Lucene | From the Apache group. The two work together to provide an enterprise search solution:
                | Solr: full-text search, HTML administration interface, distributed search, hit highlighting, ...
                | Lucene: Indexing and search. Available in Java, C++, PHP, Python, ... Will index text from PDF, HTML, Microsoft Word and OpenDocument documents, ...
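
Spiders such as ht/Dig honor robots.txt exclusions before fetching a page. As a rough illustration of what that check involves (not ht/Dig's own code), the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the user agent name "example-spider" and the URLs are made-up assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for url in ("http://your-domain.com/index.html",
                "http://your-domain.com/private/notes.html"):
        print(url, allowed(ROBOTS_TXT, "example-spider", url))
```

A real spider would download /robots.txt from each server it visits and skip every URL for which this check fails.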

Adding Search to your web site:

Search Recommendations for your web site:
  • The simplest solution is to use outsourced search. Google and others can provide a search box for your web page and the service to index your site and provide this search capability. Let them handle it.

  • The next simplest solution is only for small, simple web sites with static web pages: perl_site_search. It can index the local pages on your hard drive and provide a simple search CGI. It cannot be used for dynamic content or with server side includes. The entire web page must reside in a single HTML file.

  • If your site is more complex and produces dynamic content, a spider must make HTTP requests to your web server to gather and index the content. I have found ht/Dig to be easy to employ as it is provided with most Linux distributions and just requires configuration.
    See the YoLinux htDig Web Site Search installation and configuration tutorial (default Red Hat/Fedora/CentOS web site search)

  • For a fully featured, high performance and very sophisticated enterprise search, look at Solr/Lucene. This will require the installation of a Java App server such as Tomcat and a fair bit of configuration and system administration.
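
To make the idea of indexing static pages concrete, here is a minimal Python sketch of a pre-built word index with an "all words must match" search. It is not how perl_site_search or ht/Dig is implemented. The names TextExtractor, extract_words, build_index and search are illustrative assumptions, and the pages are passed in as a dictionary of file name to HTML text rather than read from disk.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags; the tags themselves are discarded."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_words(html):
    """Return the lowercase words found in the text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())

def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, html in pages.items():
        for word in extract_words(html):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted page names containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

if __name__ == "__main__":
    pages = {
        "a.html": "<html><body><h1>Linux search</h1></body></html>",
        "b.html": "<p>Linux tutorial</p>",
    }
    index = build_index(pages)
    print(search(index, "linux search"))
```

Building the index once and looking words up in it is what makes this faster than running a grep over every page on each request. The trade-off is that the index must be rebuilt whenever the pages change.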

YoLinux.com Site Search Setup Tutorials:

  • ht://Dig - Comes with most Linux distributions
  • WAIS - One of the originals - Wide Area Information Server

Books:

"Lucene in Action"
by Michael McCandless, Erik Hatcher and Otis Gospodnetic
ISBN #1933988177, Manning Publications
July 2010

Instructional examples, advice and best practices.

"Solr 1.4 Enterprise Search Server"
by David Smiley and Eric Pugh
ISBN #1847195881, Packt Publishing
August 2009

Covers Solr and SolrJ (embedded Java client API)
