<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
ht://Dig: htsearch
</title>
</head>
<body bgcolor="#eef7ff">
<h1>
htsearch
</h1>
<p>
ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
<hr size="4" noshade>
<h2>
Search Method Used
</h2>
<p>
The way htsearch performs its search and applies its ranking
rules is fairly complicated. This page explains in general
terms what happens when htsearch runs a search.
</p>
<p>
htsearch gets a list of (case-insensitive) words from the HTML
form that invoked it. If htsearch was invoked with boolean
expression parsing enabled, it does a quick syntax check on the
input words. If there are syntax errors, it displays the syntax
error file specified by the
<a href="attrs.html#syntax_error_file">syntax_error_file</a>
attribute.
</p>
<p>
If the boolean parser is not enabled, the list of words is
converted into a boolean expression by putting either "and" or
"or" between the words, depending on the search type. A phrase
within double quotes (") specifies that its words must occur
in sequence within the document.
</p>
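<p>
As a rough illustration of the conversion described above, the
following Python sketch joins bare words and quoted phrases with a
single operator. The function name words_to_expression and the
match_all flag (standing in for the search type) are hypothetical
and are not part of htsearch.
</p>

```python
import re


def words_to_expression(query, match_all=True):
    """Join bare words and quoted phrases with "and" or "or".

    match_all is a hypothetical flag standing in for the search type.
    """
    # A token is either a double-quoted phrase or a run of non-space characters.
    # Lowercasing mirrors the case-insensitive handling of query words.
    tokens = re.findall(r'"[^"]*"|\S+', query.lower())
    joiner = " and " if match_all else " or "
    return joiner.join(tokens)


print(words_to_expression('Cat "big dog" fish'))
print(words_to_expression("cat dog", match_all=False))
```

<p>
A quoted phrase is kept as one token, so no operator is ever placed
between the words of a phrase.
</p>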
<p>
If a word is immediately preceded by a field specifier
(title:, heading:, author:, keyword:, descr:, link:, url:),
it will only match documents in which the word occurs
within that field. For example, descr:foo only matches documents
containing &lt;meta name="description" content="... foo ..."&gt;.
The link: field refers to the text in the hyperlinks to a document,
rather than text within the document itself. Similarly, url:
will eventually refer to the actual URL of the document, not any
of its contents.
The prefixes exact: and hidden: are also accepted.
The former will eventually cause the
<a href="attrs.html#search_algorithm">fuzzy search algorithm</a>
not to be applied to the word, while the latter keeps the word
from being displayed in the query string of the results page.
</p>
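<p>
The prefix handling above could be sketched in Python roughly as
follows. The function split_prefix is hypothetical, and the sketch
assumes at most one prefix per word; whether htsearch allows
prefixes to be combined is not stated here.
</p>

```python
FIELD_PREFIXES = ("title", "heading", "author", "keyword", "descr", "link", "url")
FLAG_PREFIXES = ("exact", "hidden")


def split_prefix(term):
    """Split "prefix:word" into (prefix, word).

    Returns (None, term) when the term has no recognised prefix.
    Assumption: a term carries at most one prefix.
    """
    prefix, sep, word = term.partition(":")
    if sep and word and prefix in FIELD_PREFIXES + FLAG_PREFIXES:
        return prefix, word
    return None, term


print(split_prefix("descr:foo"))
print(split_prefix("foo"))
```

<p>
Only the listed prefixes are recognised, so a word that merely
contains a colon is left untouched.
</p>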
<p>
Each of the words in the list (but not within a phrase) is now
expanded using the search algorithms that were specified in the
<a href="attrs.html#search_algorithm">search_algorithm</a>
attribute. For example, the endings algorithm will convert a
word like "person" into "person or persons". In this fashion,
all the specified algorithms are used on each of the words
and the result is a new boolean expression.
</p>
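<p>
A minimal Python sketch of this expansion step is shown below.
The toy_endings function is a hypothetical stand-in that only adds
a plural form; it is not the real endings algorithm, and the token
list format is an assumption made for the example.
</p>

```python
def toy_endings(word):
    """Hypothetical stand-in for the endings algorithm: add a plural form."""
    return [word + "s"]


def expand(tokens, algorithms):
    """Expand each bare word; leave operators and quoted phrases untouched."""
    out = []
    for token in tokens:
        if token in ("and", "or", "not") or token.startswith('"'):
            out.append(token)
            continue
        variants = [token]
        # Every algorithm is applied to the original word, not to other variants.
        for algorithm in algorithms:
            for variant in algorithm(token):
                if variant not in variants:
                    variants.append(variant)
        if len(variants) == 1:
            out.append(token)
        else:
            out.append("(" + " or ".join(variants) + ")")
    return " ".join(out)


print(expand(["person", "and", '"big dog"'], [toy_endings]))
```

<p>
The parentheses keep each group of variants together when the
expanded words are combined with the surrounding operators.
</p>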
<p>
The next step is to perform database lookups on the words in
the expression. The results of these lookups are then passed
to the boolean expression parser.
</p>
<p>
The boolean expression parser is a simple recursive descent
parser with an operand stack. It knows how to deal with
"not", "and", "or" and parentheses. The result of the parser
is a single set of matches.<br>
Note that the operator "not" means "without" and is binary:
you cannot write "cat and not dog" or just "not dog", but you
can write "cat not dog".
</p>
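<p>
To illustrate how a recursive descent parser can treat "not" as a
binary operator, here is a small Python sketch that evaluates a
token list over sets of document ids. It is not the htsearch
parser: the grammar, the evaluate function and the lookup callback
are assumptions made for the example, and Python's call stack takes
the place of an explicit operand stack.
</p>

```python
def evaluate(tokens, lookup):
    """Evaluate a token list; lookup(word) returns a set of document ids.

    Assumed grammar:
        expr   := term ("or" term)*
        term   := factor (("and" | "not") factor)*
        factor := word | "(" expr ")"
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token = peek()
        if token == "(":
            advance()
            result = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return result
        if token is None or token in ("and", "or", "not", ")"):
            raise SyntaxError("expected a search word")
        return set(lookup(advance()))

    def term():
        result = factor()
        while peek() in ("and", "not"):
            if advance() == "and":
                result = result.intersection(factor())
            else:
                # Binary "not": left operand without the right operand.
                result = result.difference(factor())
        return result

    def expr():
        result = term()
        while peek() == "or":
            advance()
            result = result.union(term())
        return result

    result = expr()
    if peek() is not None:
        raise SyntaxError("unexpected token: " + peek())
    return result


index = {"cat": {1, 2, 3}, "dog": {2, 4}, "fish": {5}}
print(evaluate(["cat", "not", "dog"], lambda word: index.get(word, set())))
```

<p>
Because "not" is only accepted between two operands, both
"cat and not dog" and a leading "not dog" raise a syntax error in
this sketch, while "cat not dog" is accepted.
</p>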
<p>
At this point, the matches are ranked. The rank of a match is
determined by the weight of the words that caused the match
and the weight of the algorithm that generated the word. Word
weights are generally determined by the importance of the
word in a document. For example, words in the title of a
document have a much higher weight than words at the bottom
of the document.
</p>
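<p>
One way to picture this ranking is the Python sketch below. The
exact formula htsearch uses is not given here, so the sketch simply
assumes that each hit contributes its word weight multiplied by the
weight of the algorithm that produced the word, and that a
document's rank is the sum of its hits. The rank function and all
weight values are hypothetical.
</p>

```python
def rank(matches):
    """Sum word_weight * algorithm_weight over every hit, per document.

    matches is an iterable of (doc_id, word_weight, algorithm_weight).
    The multiply-and-sum rule is an assumption made for illustration.
    """
    scores = {}
    for doc_id, word_weight, algorithm_weight in matches:
        scores[doc_id] = scores.get(doc_id, 0.0) + word_weight * algorithm_weight
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


hits = [
    ("a.html", 100, 1.0),  # exact word found in the title (hypothetical weight)
    ("b.html", 1, 1.0),    # exact word found in body text
    ("b.html", 1, 0.5),    # variant produced by a fuzzy algorithm
]
print(rank(hits))
```

<p>
With these made-up weights, a single title hit outranks two body
hits, which matches the idea that title words weigh much more than
words at the bottom of a document.
</p>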
<p>
Finally, when the document ranks have been determined and the
documents sorted, the resulting matches are displayed. If
paged output is required, only a subset of all the matches
will be displayed.
</p>
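<p>
Paged output amounts to taking one slice of the sorted match list.
A minimal Python sketch, assuming 1-based page numbers and a
hypothetical page_of helper:
</p>

```python
def page_of(results, page, per_page):
    """Return the matches shown on a 1-based page number."""
    if not (page >= 1 and per_page >= 1):
        raise ValueError("page and per_page must be at least 1")
    start = (page - 1) * per_page
    return results[start:start + per_page]


matches = list(range(1, 26))  # 25 sorted matches, numbered for illustration
print(page_of(matches, 3, 10))
```

<p>
The last page may hold fewer matches than the page size, and a page
past the end of the list is simply empty.
</p>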
<hr size="4" noshade>
Last modified: $Date: 2004/05/28 13:15:18 $
</body>
</html>