<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
	<title>
	  ht://Dig: htsearch
	</title>
  </head>
  <body bgcolor="#eef7ff">
	<h1>
	  htsearch
	</h1>
	<p>
	  ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
	  Please see the file <a href="COPYING">COPYING</a> for
	  license information.
	</p>
	<hr size="4" noshade>
	<h2>
	  Search Method Used
	</h2>
	<p>
	  The way htsearch performs its search and applies its ranking
	  rules is fairly complicated. This is an attempt to explain
	  in general terms what goes on when htsearch searches.
	</p>
	<p>
	  htsearch gets a list of (case-insensitive) words from the HTML
	  form that invoked it.
	  If htsearch was invoked with boolean expression parsing
	  enabled, it will do a quick syntax check on the input words.
	  If there are syntax errors, it will display the syntax error
	  file that is specified with the
	  <a href="attrs.html#syntax_error_file">syntax_error_file</a>
	  attribute.
	</p>
	<p>
	  If the boolean parser was not enabled, the list of words is
	  converted into a boolean expression by putting either "and"s
	  or "or"s between the words. (This depends on the search
	  type.)  Phrases within double quotes (") specify that the words
	  must occur sequentially within the document.
	</p>
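	<p>
	  As a rough illustration of this step, the following Python sketch
	  joins a word list with a single operator. It is not taken from the
	  htsearch source; the function name and the match_all flag are
	  assumptions made for the example, and phrase handling is left out.
	</p>
<pre>
```python
def words_to_expression(words, match_all=True):
    """Join plain search words into a boolean expression string."""
    # Hypothetical flag: "and" when every word must match, "or" otherwise.
    operator = " and " if match_all else " or "
    # The words are case insensitive, so lowercase them before joining.
    return operator.join(word.lower() for word in words)


print(words_to_expression(["Cat", "dog"]))         # cat and dog
print(words_to_expression(["Cat", "dog"], False))  # cat or dog
```
</pre>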
	<p>
	  If a word is immediately preceded by a field specifier
	  (title:, heading:, author:, keyword:, descr:, link:, url:)
	  then it will only match documents in which the word occurred
	  within that field.  For example, descr:foo only matches documents
	  containing &lt;meta name="description" content="... foo ..."&gt;.
	  The link: field refers to the text in the hyperlinks to a document,
	  rather than text within the document itself.  Similarly, url:
	  will (eventually) refer to the actual URL of the document, not any
	  of its contents.
	  The prefixes exact: and hidden: are also accepted.
	  The former (will) cause the
	  <a href="attrs.html#search_algorithm">fuzzy search algorithm</a>
	  not to be applied to this word, while the latter causes the word
	  not to be displayed in the query string of the results page.
	</p>
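	<p>
	  A minimal sketch of how such a prefix might be separated from its
	  word is shown below in Python. The prefix names come from the text
	  above; the function split_prefix and its return shape are
	  hypothetical and do not mirror the htsearch implementation.
	</p>
<pre>
```python
# Prefixes named in the text above, written without the trailing colon.
FIELD_PREFIXES = ("title", "heading", "author", "keyword", "descr", "link", "url")
MODIFIER_PREFIXES = ("exact", "hidden")


def split_prefix(term):
    """Return (prefix, word); prefix is None when no known prefix is present."""
    prefix, colon, word = term.partition(":")
    known = FIELD_PREFIXES + MODIFIER_PREFIXES
    # Only treat the text before the colon as a prefix if it is a known one
    # and a word actually follows it.
    if colon and word and prefix.lower() in known:
        return prefix.lower(), word
    return None, term


print(split_prefix("descr:foo"))  # ('descr', 'foo')
print(split_prefix("foo"))        # (None, 'foo')
```
</pre>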
	<p>
	  Each of the words in the list (but not within a phrase) is now
	  expanded using the search algorithms that were specified in the
	  <a href="attrs.html#search_algorithm">search_algorithm</a>
	  attribute. For example, the endings algorithm will convert a
	  word like "person" into "person or persons". In this fashion,
	  all the specified algorithms are used on each of the words
	  and the result is a new boolean expression.
	</p>
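	<p>
	  The expansion step can be pictured with the following Python
	  sketch. The endings function here is a toy stand-in that only adds
	  a plural "s"; the real endings algorithm is more involved, and the
	  names expand and endings are assumptions made for this example.
	</p>
<pre>
```python
def endings(word):
    # Toy stand-in for the endings algorithm: only adds a plural "s".
    return [word, word + "s"]


def expand(word, algorithms):
    """Apply each algorithm to the word and join the unique results with "or"."""
    variants = []
    for algorithm in algorithms:
        for variant in algorithm(word):
            if variant not in variants:
                variants.append(variant)
    if not variants:
        # No algorithm configured: the word stands for itself.
        variants = [word]
    return " or ".join(variants)


print(expand("person", [endings]))  # person or persons
```
</pre>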
	<p>
	  The next step is to perform database lookups on the words in
	  the expression. The results of these lookups are then passed
	  to the boolean expression parser.
	</p>
	<p>
	  The boolean expression parser is a simple recursive descent
	  parser with an operand stack. It knows how to deal with
	  "not", "and", "or" and parentheses. The result of the parser
	  will be one set of matches.<br>
	  Note that the operator "not" is used like the word "without" and
	  is binary: you cannot write "cat and not dog" or just "not
	  dog", but you can write "cat not dog".
	</p>
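	<p>
	  To make the binary "not" concrete, here is a small recursive
	  descent parser sketched in Python. It is an illustration under
	  stated assumptions, not the htsearch parser: the grammar (with
	  "and" and "not" binding tighter than "or") is assumed, the index
	  is a hypothetical mapping from words to sets of document ids, and
	  the Python call stack takes the place of an explicit operand stack.
	</p>
<pre>
```python
def tokenize(query):
    """Split a query into lowercase words, operators and parentheses."""
    return query.lower().replace("(", " ( ").replace(")", " ) ").split()


class Parser:
    def __init__(self, tokens, index):
        self.tokens = tokens
        self.pos = 0
        self.index = index  # hypothetical: word mapped to a set of document ids

    def peek(self):
        if self.pos == len(self.tokens):
            return None
        return self.tokens[self.pos]

    def advance(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of expression")
        self.pos += 1
        return token

    def expression(self):
        # expression := term ("or" term)*
        result = self.term()
        while self.peek() == "or":
            self.advance()
            result = result.union(self.term())
        return result

    def term(self):
        # term := factor (("and" | "not") factor)*
        result = self.factor()
        while self.peek() in ("and", "not"):
            if self.advance() == "and":
                result = result.intersection(self.factor())
            else:
                # Binary "not": left-hand matches without the right-hand ones.
                result = result.difference(self.factor())
        return result

    def factor(self):
        # factor := word | "(" expression ")"
        token = self.advance()
        if token == "(":
            result = self.expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return result
        if token in ("and", "or", "not", ")"):
            raise SyntaxError("expected a word, found " + repr(token))
        return set(self.index.get(token, ()))


def search(query, index):
    parser = Parser(tokenize(query), index)
    result = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError("unexpected " + repr(parser.peek()))
    return result


index = {"cat": {1, 2, 3}, "dog": {2}}
print(search("cat not dog", index))  # {1, 3}
```
</pre>
	<p>
	  With this grammar, "cat and not dog" and "not dog" both raise a
	  syntax error, because "not" always needs a left-hand operand.
	</p>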
	<p>
	  At this point, the matches are ranked. The rank of a match is
	  determined by the weight of the words that caused the match
	  and the weight of the algorithm that generated the word. Word
	  weights are generally determined by the importance of the
	  word in a document. For example, words in the title of a
	  document have a much higher weight than words at the bottom
	  of the document.
	</p>
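	<p>
	  The text above does not give the exact ranking formula, so the
	  Python sketch below assumes one plausible combination: each
	  matching word contributes its word weight multiplied by the weight
	  of the algorithm that produced it, and the contributions are
	  summed. All weights and file names are made up for the example.
	</p>
<pre>
```python
def rank(matches):
    """Sum word weight times algorithm weight over the words behind one match."""
    return sum(word_weight * algorithm_weight
               for word_weight, algorithm_weight in matches)


# Hypothetical documents: each entry is (word weight, algorithm weight).
documents = {
    "a.html": [(10.0, 1.0)],             # exact word found in the title
    "b.html": [(1.0, 1.0), (1.0, 0.5)],  # body hits, one via a fuzzy algorithm
}

# Highest rank first.
ranked = sorted(documents, key=lambda url: rank(documents[url]), reverse=True)
print(ranked)  # ['a.html', 'b.html']
```
</pre>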
	<p>
	  Finally, when the document ranks have been determined and the
	  documents sorted, the resulting matches are displayed. If
	  paged output is required, only a subset of all the matches
	  will be displayed.
	</p>
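	<p>
	  Paged output amounts to taking a slice of the sorted match list.
	  A minimal Python sketch, assuming 1-based page numbers and a
	  hypothetical per_page parameter:
	</p>
<pre>
```python
def page_of(matches, page, per_page):
    """Return the matches shown on a 1-based page number."""
    start = (page - 1) * per_page
    return matches[start:start + per_page]


matches = ["doc%d" % n for n in range(1, 26)]  # 25 sorted matches
print(page_of(matches, 3, 10))  # ['doc21', 'doc22', 'doc23', 'doc24', 'doc25']
```
</pre>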
	<hr size="4" noshade>

	Last modified: $Date: 2004/05/28 13:15:18 $

  </body>
</html>