Review: Implementing the Google Search Appliance in an Intranet environment
Our corporate intranet is a non-framed environment with both Lotus Domino and IIS (.NET and classic ASP) applications and content. We have between 300,000 and 500,000 pages of web content and documents across more than 1,200 "sites" on approximately 30 unique domains. Our previous intranet search engine, Inktomi's UltraSeek Server 3.0 (purchased in 1998), was beginning to show its age. It did not handle attachments well (DOC, PPT, PDF, etc.), would not crawl our secured sites, and was no longer supported by the vendor. We did a cursory review of the search vendors and were immediately attracted to Google's 30-day trial offer for its Google Search Appliance (GSA). After we signed a standard agreement, Google shipped us a shiny new yellow unit, which we could test for 30 days before returning or purchasing.
Product info
The GSA is a "black box" 1U standard rack-mountable server. By "black box" I mean that Google gives you a web interface to administer the device but does not want you to access the operating system (a heavily Google-customized version of Linux). In fact, the license agreement stipulates that you will not tamper with the hardware or OS of the appliance in any way. The device has no need for a keyboard, mouse or video; all you need for normal operation is a network cable and standard power input.
The GSA comes in different flavors to fit different needs, varying by the size of the hardware and, correspondingly, the size of the license. (Licensing is based on the number of URLs crawled by the appliance.) There are three hardware configurations: the GB-1001, GB-5005, and GB-8008. They break down as follows:
- GB-1001 – 150K documents for $28K, 300K documents for $50K
- GB-5005 – 1.5M documents for $230K
- GB-8008 – 4M documents for $450K
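One rough way to compare the tiers is price per licensed document. The sketch below is only an illustration built from the list prices quoted above; the function names are invented for this example, and real pricing may of course differ.

```python
# License tiers as quoted in this review: (model, document limit, price in USD).
TIERS = [
    ("GB-1001", 150_000, 28_000),
    ("GB-1001", 300_000, 50_000),
    ("GB-5005", 1_500_000, 230_000),
    ("GB-8008", 4_000_000, 450_000),
]


def cost_per_document(documents: int, price_usd: int) -> float:
    """Return the license price divided by the document limit."""
    return price_usd / documents


def cheapest_tier(needed_documents: int):
    """Return the lowest-priced tier whose limit covers needed_documents, or None."""
    candidates = [tier for tier in TIERS if tier[1] >= needed_documents]
    return min(candidates, key=lambda tier: tier[2]) if candidates else None


if __name__ == "__main__":
    for model, docs, price in TIERS:
        print(f"{model}: {docs:,} docs at ${cost_per_document(docs, price):.3f} each")
```

Because licensing counts crawled URLs, an estimate of your page count matters more than the hardware itself when picking a tier.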
Why Google?
As advertised, the GSA met all of our needs: it indexes the large variety of file types in our environment, accesses secured content, has a documented API, and so on. The Google brand power was another big selling factor. When we told our users that they were going to get a Google-based search engine, they knew their days of troubled searching were over. Lastly, the 30-day trial experience we had with the GSA sealed the deal. The appliance is the easiest enterprise solution I've ever had to install, configure and maintain. We were literally up and running within an hour of opening the shipping box.
Installation
The appliance has two network ports on the back panel; one for normal operation and the other used exclusively for network configuration. To configure the network settings we connected a laptop to the appliance via a special (some pin-outs are non-standard) orange Ethernet cable which is included. The installation process was about as easy as one can imagine for a "black box."
First we plugged in the normal operation network cable and then the power. The power plug on the appliance IS the power switch: plug it in to turn it on and unplug it to turn it off. After plugging it in, we waited about 5 minutes for the appliance to play a tune, which is the signal to continue. Next, we hooked up our laptop (already set to DHCP mode) to the appliance and powered it up. After logging in to our laptop and making sure we had the correct IP assigned by the appliance's built-in DHCP server, we were ready to configure the network settings. Total elapsed time (excluding rack mounting): 10 minutes.
Configuration
Network configuration, like normal administration, is done entirely through a browser and is a simple five-step process. The first screens ask you for basic network information: IP address, subnet mask, default gateway, and DNS. Subsequent screens collect the SMTP server, the "From" address for GSA notification messages, the time zone, NTP (time) servers and the admin account name/password. The last step is to test a few URLs which you will be crawling to make sure you've done the setup correctly. After a final settings review screen, configuration is complete, and you can unplug your laptop and get to the good part: crawling. Total elapsed time: 10 minutes.
Crawling the site(s)
Using the URL provided, all administration of the GSA is done remotely. After logging in with the ID/password we provided in the previous step, we were presented with the Administration console. We created a new collection to hold our index, put in the "Start crawling from" URL, copied that same URL into the "Follow and Crawl only URLs with the Following patterns" box and we were done. We saved our settings and then clicked the "Start crawling" button. We then went over to the "Crawl status" screen and watched the "Crawled URLs" counter increase. Google advertises that the appliance can crawl about 4,000 URLs in roughly 15 minutes. We found that the crawl time increased significantly when those URLs linked to documents (Word, PDF, Excel, etc.).
After the crawl is done, the collection is automatically indexed and then checked against the Serving Prerequisites (any criteria you wish to use to determine whether to move an indexed collection to production). The collection is then either moved to Production (and becomes searchable) or moved to Staging. The Staging area lets you validate new crawls before letting users search against them.
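To make the crawl setup concrete, here is a minimal sketch of the two ideas in this step: limiting the crawl to URLs under the start URL, and estimating crawl time from the advertised rate. It is not the appliance's own logic. The hostname is hypothetical, the follow pattern is treated as a plain prefix (a simplifying assumption), and the estimate ignores the slowdown we saw with linked documents.

```python
# Hypothetical start URL; the same value is reused as the only follow pattern,
# mirroring how we filled in both boxes in the Administration console.
START_URL = "http://intranet.example.com/"
FOLLOW_PATTERNS = [START_URL]


def in_scope(url: str, follow_patterns=FOLLOW_PATTERNS) -> bool:
    """Return True if url begins with any follow pattern (prefix match assumed)."""
    return any(url.startswith(pattern) for pattern in follow_patterns)


def estimated_crawl_hours(url_count: int,
                          urls_per_batch: int = 4000,
                          minutes_per_batch: int = 15) -> float:
    """Estimate crawl time in hours at the advertised 4,000 URLs per 15 minutes."""
    return url_count / urls_per_batch * minutes_per_batch / 60


if __name__ == "__main__":
    print(in_scope("http://intranet.example.com/hr/index.html"))
    print(in_scope("http://www.example.org/"))
    print(estimated_crawl_hours(300_000), estimated_crawl_hours(500_000))
```

At the advertised rate, an intranet of 300,000 to 500,000 pages works out to somewhere between about 19 and 31 hours for a full crawl, before accounting for attachments.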
Crawling configuration
After your first crawl you may find the need to go back and tweak the crawling parameters. Google gives you a good amount of control over how sites are crawled, the frequency, how many threads are used, etc. For sites with security, the GSA supports Basic Authentication, and an additional security module is available which supports Forms Authentication. The most challenging configuration task for us was determining the right combination of URL patterns to exclude from the search. If you are a Domino shop looking to use the GSA, you may need to spend some time getting the crawler configuration just right to support the sometimes convoluted Domino query string parameters.
After we got the crawl parameters tuned and the first complete crawl done, we did some testing to see if the crawler had grabbed all the content. Browsing our site and testing with some strings buried deep inside the taxonomy, we always found the GSA had crawled them accurately. We also did some testing with strings inside PDF documents, PowerPoint presentations and the like. When we did come across something that hadn't been crawled, careful analysis showed that we needed to tweak the crawl settings some more.
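As an illustration of the kind of exclusion rules a Domino site tends to need, the sketch below filters out URLs that only expand or collapse a view, or that open a document in edit or delete mode. The patterns are hypothetical examples written as ordinary Python regular expressions, not the appliance's own pattern syntax and not the exact set we ended up with; the hostname and database paths are invented.

```python
import re

# Hypothetical exclusions: Domino URLs that re-render the same view in a
# different expand/collapse state, or that are edit/delete actions.
EXCLUDE_PATTERNS = [
    re.compile(r"[?&](?:ExpandView|CollapseView|Expand=|Collapse=)", re.IGNORECASE),
    re.compile(r"\?(?:EditDocument|DeleteDocument)", re.IGNORECASE),
]


def should_crawl(url: str) -> bool:
    """Return False if the URL matches any exclusion pattern."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    for url in (
        "http://intranet.example.com/hr.nsf/policies/vacation?OpenDocument",
        "http://intranet.example.com/hr.nsf/policies?OpenView&Expand=2",
        "http://intranet.example.com/hr.nsf/policies/vacation?EditDocument",
    ):
        print(should_crawl(url), url)
```

Testing candidate patterns against a list of real URLs from your own site before entering them in the console can save a few crawl cycles.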
Other notable features
Google also gives you a KeyMatch tool that allows you to specify which indexed documents should appear at the top of the results page for a given query. These manifest themselves almost identically to the Sponsored Links at the top of the results page of the Google we all use. A Synonym tool allows you to specify alternate words or phrases for search queries. For example, if someone searches for WCM, you can suggest "Web Content Management" at the top of the results page.
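Conceptually, both tools are lookup tables keyed on the query. The sketch below shows that idea only; it assumes exact matching on a normalized query, and the table contents, URL, and function name are invented for illustration rather than taken from the appliance.

```python
# Hypothetical KeyMatch table: query -> list of (title, URL) promoted results.
KEYMATCHES = {
    "benefits": [("Employee Benefits Home", "http://intranet.example.com/hr/benefits")],
}

# Hypothetical synonym table: query -> suggested alternate phrase.
SYNONYMS = {
    "wcm": "Web Content Management",
}


def decorate_results(query: str) -> dict:
    """Return the promoted links and the suggestion (if any) for a query."""
    key = query.strip().lower()  # normalize case and surrounding whitespace
    return {
        "keymatches": KEYMATCHES.get(key, []),
        "suggestion": SYNONYMS.get(key),
    }


if __name__ == "__main__":
    print(decorate_results("WCM"))
    print(decorate_results("benefits"))
```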
An output format feature lets you control (via an XSLT stylesheet) the presentation of the search results. You can use this to change the fonts, colors, logo, header, etc. of the results page. We were able to easily remove the "Cached" feature on the results page with some XSLT modifications.
The Reporting tool lets you run reports on search queries over various time ranges. It will show you the number of searches per day, per hour, the top 100 keywords and top 100 queries for the time period specified.
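The report boils down to a handful of counts over the search log. The sketch below reproduces those counts from a list of (timestamp, query) pairs; the log format, sample entries, and function name are all made up for illustration and say nothing about how the appliance stores or computes its reports.

```python
from collections import Counter
from datetime import datetime


def summarize_searches(log, top_n=100):
    """Summarize (timestamp, query) pairs into the counts the report describes."""
    per_day = Counter(ts.date().isoformat() for ts, _ in log)
    per_hour = Counter(ts.hour for ts, _ in log)
    # Queries are compared case-insensitively; keywords are whitespace-split words.
    queries = Counter(q.strip().lower() for _, q in log)
    keywords = Counter(word for _, q in log for word in q.lower().split())
    return {
        "searches_per_day": dict(per_day),
        "searches_per_hour": dict(per_hour),
        "top_queries": queries.most_common(top_n),
        "top_keywords": keywords.most_common(top_n),
    }


# Made-up sample log entries.
SAMPLE_LOG = [
    (datetime(2004, 3, 1, 9, 15), "vacation policy"),
    (datetime(2004, 3, 1, 9, 40), "Vacation Policy"),
    (datetime(2004, 3, 2, 14, 5), "expense policy"),
]

if __name__ == "__main__":
    print(summarize_searches(SAMPLE_LOG))
```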
Downsides
The GSA is not for organizations looking to index their shared network drives, as the appliance has no facility for crawling file systems. This is really too bad, as many companies struggle with the huge quantities of unstructured content stored on their networks. Of course, there is a plethora of other products out there for exactly this issue.
Direct access to databases (e.g. SQL, Oracle, etc.) is another area that is off-limits for the GSA, as is any kind of integration with content or document management systems.
Conclusion
The Google Search Appliance (GSA) is an excellent search product for HTTP-accessible content. It gives great control over administrative features such as crawler configuration and results serving, and it offers sufficient reporting capabilities as well. Those looking for a solution that integrates directly with a content/document management system or databases, or that indexes network drives, should look to another product. However, if you have an intranet or Internet site with plenty of HTML-based content, the GSA may be just what you need.
Bryan Mjaanes is the creator/editor of Intranet101.com, a community-based forum for Intranet professionals.