<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
ht://Dig: htsearch
</title>
</head>
<body bgcolor="#eef7ff">
<h1>
htsearch
</h1>
<p>
ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
<hr size="4" noshade>
<h2>
Search Method Used
</h2>
<p>
The way htsearch performs its search and applies its ranking
rules is fairly complicated. This is an attempt to explain
in general terms what happens when htsearch runs a search.
</p>
<p>
htsearch gets a list of (case-insensitive) words from the HTML
form that invoked it. If htsearch was invoked with boolean
expression parsing enabled, it will do a quick syntax check on
the input words.
If there are syntax errors, it will display the syntax error
file that is specified with the
<a href="attrs.html#syntax_error_file">syntax_error_file</a>
attribute.
</p>
<p>
If the boolean parser was not enabled, the list of words is
converted into a boolean expression by putting either "and"s
or "or"s between the words. (This depends on the search
type.) Phrases within double quotes (") specify that the words
must occur sequentially within the document.
</p>
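<p>
The conversion described above can be sketched in a few lines of
Python. This is only an illustration, not htsearch's actual code:
the function name <code>to_boolean_expression</code> and the
<code>match_all</code> flag are hypothetical stand-ins for the
search type.
</p>

```python
import re


def to_boolean_expression(query, match_all=True):
    """Join query terms with "and" or "or", keeping quoted phrases whole."""
    # A token is either a double-quoted phrase or a run of non-space
    # characters. Lower-casing mirrors the case-insensitive word list.
    terms = re.findall(r'"[^"]*"|\S+', query.lower())
    # Hypothetical flag: True stands for an "all words" search type,
    # False for an "any word" search type.
    operator = " and " if match_all else " or "
    return operator.join(terms)


print(to_boolean_expression('Cat "black dog" fish'))
print(to_boolean_expression('Cat "black dog" fish', match_all=False))
```

<p>
The quoted phrase stays a single operand, so the operator is only
placed between top-level terms.
</p>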
<p>
If a word is immediately preceded by a field specifier
(title:, heading:, author:, keyword:, descr:, link:, url:)
then it will only match documents in which the word occurred
within that field. For example, descr:foo only matches documents
containing &lt;meta name="description" content="... foo ..."&gt;.
The link: field refers to the text in the hyperlinks to a document,
rather than text within the document itself. Similarly url:
(will eventually) refer to the actual URL of the document, not any
of its contents.
The prefixes exact: and hidden: are also accepted.
The former (will) cause the
<a href="attrs.html#search_algorithm">fuzzy search algorithm</a>
not to be applied to this word, while the latter causes the word
not to be displayed in the query string of the results page.
</p>
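<p>
As a rough illustration of how such a prefix might be separated
from its word, the Python sketch below splits a term at the first
colon. The function name <code>split_prefix</code> is hypothetical,
and the sketch assumes at most one prefix per word; it says nothing
about how htsearch itself handles combined prefixes.
</p>

```python
# Prefixes listed in the text above.
FIELDS = ("title", "heading", "author", "keyword", "descr", "link", "url")
FLAGS = ("exact", "hidden")


def split_prefix(term):
    """Return (prefix, word); prefix is None when no known prefix is present."""
    prefix, separator, word = term.partition(":")
    if separator and prefix in FIELDS + FLAGS:
        return prefix, word
    # Unknown prefixes (for example "http" in a URL) are left in the word.
    return None, term


print(split_prefix("descr:foo"))
print(split_prefix("foo"))
```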
<p>
Each of the words in the list (but not within a phrase) is now
expanded using the search algorithms that were specified in the
<a href="attrs.html#search_algorithm">search_algorithm</a>
attribute. For example, the endings algorithm will convert a
word like "person" into "person or persons". In this fashion,
all the specified algorithms are used on each of the words
and the result is a new boolean expression.
</p>
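<p>
The expansion step can be pictured with the following Python
sketch. The <code>endings</code> function here is a toy stand-in
that only appends a plural "s"; the real endings algorithm is
considerably more involved. The function <code>expand</code> is
likewise a hypothetical name used for illustration.
</p>

```python
def endings(word):
    # Toy stand-in for the endings algorithm: only adds a plural "s".
    return [word, word + "s"]


def expand(terms, algorithms):
    """Replace each plain word by an "or" group of its variants."""
    expanded = []
    for term in terms:
        # Phrases and boolean operators are passed through unchanged.
        if term.startswith('"') or term in ("and", "or", "not"):
            expanded.append(term)
            continue
        variants = [term]
        for algorithm in algorithms:
            for variant in algorithm(term):
                if variant not in variants:
                    variants.append(variant)
        if len(variants) == 1:
            expanded.append(term)
        else:
            expanded.append("(" + " or ".join(variants) + ")")
    return " ".join(expanded)


print(expand(["person", "and", '"black dog"'], [endings]))
```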
<p>
The next step is to perform database lookups on the words in
the expression. The results of these lookups are then passed
to the boolean expression parser.
</p>
<p>
The boolean expression parser is a simple recursive descent
parser with an operand stack. It knows how to deal with
"not", "and", "or" and parentheses. The result of the parser
will be one set of matches.<br>
Note that the operator "not" is used like the word 'without' and
is binary: you cannot write "cat and not dog" or just "not
dog", but you can write "cat not dog".
</p>
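<p>
To make the binary "not" concrete, here is a small recursive
descent evaluator in Python that works on sets of document ids.
It is a sketch under stated assumptions rather than a copy of the
htsearch parser: the name <code>evaluate</code>, the shape of the
<code>index</code> dictionary, and the choice to let "and" and
"not" bind more tightly than "or" are all assumptions made for
this example.
</p>

```python
def evaluate(tokens, index):
    """Evaluate tokens against index, a dict of word to set of document ids."""
    pos = 0

    def peek():
        return tokens[pos] if pos != len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        # A factor is a search word or a parenthesised expression.
        token = peek()
        if token is None or token in ("and", "or", "not", ")"):
            raise ValueError("expected a search word")
        advance()
        if token == "(":
            result = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return result
        return set(index.get(token, ()))

    def term():
        # "and" intersects; "not" is binary and subtracts the right operand.
        result = factor()
        while peek() in ("and", "not"):
            if advance() == "and":
                result = result.intersection(factor())
            else:
                result = result.difference(factor())
        return result

    def expr():
        result = term()
        while peek() == "or":
            advance()
            result = result.union(term())
        return result

    result = expr()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result


index = {"cat": {1, 2, 3}, "dog": {2, 4}, "fish": {3}}
print(evaluate("cat not dog".split(), index))
```

<p>
In this sketch "cat and not dog" and "not dog" are both rejected,
because "not" always needs a left-hand operand.
</p>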
<p>
At this point, the matches are ranked. The rank of a match is
determined by the weight of the words that caused the match
and the weight of the algorithm that generated the word. Word
weights are generally determined by the importance of the
word in a document. For example, words in the title of a
document have a much higher weight than words at the bottom
of the document.
</p>
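<p>
One possible way to combine the two kinds of weight is shown in
the Python sketch below. The text above does not give the exact
formula, so multiplying the word weight by the algorithm weight
and summing over all hits is an assumption made for illustration,
as are the name <code>rank</code> and the example weights.
</p>

```python
def rank(matches):
    """Order document ids by score, highest first.

    matches maps a document id to a list of
    (word_weight, algorithm_weight) pairs, one pair per matching word.
    """
    scores = {}
    for doc_id, hits in matches.items():
        # Assumed formula: sum of word weight times algorithm weight.
        scores[doc_id] = sum(word * algorithm for word, algorithm in hits)
    return sorted(scores, key=scores.get, reverse=True)


matches = {
    "a.html": [(1.0, 1.0)],              # one exact hit in the body text
    "b.html": [(5.0, 1.0), (1.0, 0.5)],  # a title hit plus a fuzzy body hit
}
print(rank(matches))
```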
<p>
Finally, when the document ranks have been determined and the
documents sorted, the resulting matches are displayed. If
paged output is required, only a subset of all the matches
will be displayed.
</p>
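<p>
Selecting the subset for one page amounts to slicing the sorted
list, as in this Python sketch. The names <code>page_of</code>,
<code>page</code> and <code>per_page</code> are hypothetical and
pages are assumed to be numbered from 1.
</p>

```python
def page_of(results, page, per_page):
    """Return the slice of sorted results shown on the given page."""
    start = (page - 1) * per_page
    return results[start:start + per_page]


results = ["doc%d" % n for n in range(1, 26)]  # 25 sorted matches
print(page_of(results, 3, 10))
```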
<hr size="4" noshade>
Last modified: $Date: 2004/05/28 13:15:18 $
</body>
</html>