kpilot/conduits/docconduit/bmkSpecification.txt

KPilot PalmDoc Conduit bookmark Specification
=============================================

(c) 2003 Reinhold Kainhofer, reinhold@kainhofer.com

This document is licensed under the FDL (Free Documentation License) 
as published by the FSF. Any version of the FDL can be applied 
at your convenience.


The PalmDoc conduit has three ways to indicate bookmarks for a text:
  -) Inline tags of the form <* bookmarkname *>
	-) Endtags of the form <bookmarkname> at the end of the document
	-) Regular expressions in a separate textname.bmk file 
	   (textname.bmk ist the filename of the text with the .txt replaced by .bmk)
		 

In the design of the .bmk file, I tried to stay close to the 
syntac of MakeDocJ bookmark files, but it turned out that I 
needed to extend the syntax a little. Also, MakeDocJ uses Java 
RegExps, while the PalmDoc conduit uses the QRegExp, which have
some slight differences (especially concerning the ^ and $ 
patterns as well as backreferences). So if you used MakeDocJ, 
the .bmk file syntax will be quite familiar, but you will still
have to adapt your bookmark files for TQt regular expressions 
instead of Java regular expressions

 
1) INLINE TAGS

Whenever a tag of the form <* someText *> appears in the text, 
this sequence is removed from the text, and a bookmark is set 
there with the bookmark name "someText" (the part between the 
<* and the *>).


2) ENDTAGS

If the text ends with tags of the form <someText>, the string 
in braces is used as bookmark name, and wherever it appears in 
the text, a bookmark is set. 
After the > any number of whitespace is allowed, but no other 
characters like letters, numbers, or punctuation. Also, inside 
the braces no line break must occur. The conduit searches the 
text from the end and if it finds a line break inside a <...> 
sequence, the tag and everything before it is assumed to belong 
to the text and doesn't form a bookmark tag. 
Between endtags any number of whitespace (spaces, tabs, line 
feeds etc.) is allowed.

As an example, assume you have a text ending in:
... the bad guy was punished, and they lived happily 
ever after! 
<Tag with
line feed>
       <bad guy> <princess>
<married>

The conduit starts at the end, ignores all whitespace between 
the tags, so it finds the tags "married", "princess", and "bad guy". 
The "Tag with line feed" has a line feed, so it is assumed to belong 
to the text. 
Assume now you have a text ending in:
... the bad guy was punished, and they lived happily 
ever after! 
<bad guy> The End <princess>
<married>

Here, only "married" and "princess" are found as bookmarks. Because 
of the letters before the "princess" tags, the search for the 
bookmarks ends at the letter "d" of "The End" (the conduit starts 
from the end and moves backward until it finds some text which 
cannot be seen as a endtag.


3) REGULAR EXPRESSIONS IN A SEPARATE FILE

This is by far the most complex way to specify bookmarks, but 
it is also the mose powerful. 
If you have a text with filename "My fairy tale.txt", the 
bookmarks will be specified in a file called "My fairy tale.bmk" 
(just the text filename with the .txt replaced by .bmk). This 
file contains the bookmark definitions, one in each line. Lines 
starting with a # are seen as comments, and empty lines are also 
ignored.


In the .bmk file, each bookmark line has one of the following syntaces 
(I will explain all fields later on). Fields in [..] are optional:

bmkName
bmkPosition, bmkName
+, bmkPatternRegExp[, bmkNameAsString[, firstIncludedBmk[, lastIncludedBmk]]]
+, bmkPatternRegExp[, bmkNameIndexOfSubexpression[, firstIncludedBmk[, lastIncludedBmk]]]
-, bmkPatternRegExp[, bmkNameAsString]
-, bmkPatternRegExp[, bmkNameIndexOfSubexpression]

  If the first field is a string, it is used as the bookmark name 
and pattern to search for. 
  If the first field is a number, it means the position of the 
bookmark, and the second field is the name of the bookmark.
  If the first field is either + or -, the second field gives 
a regular expression that is used to find the position of the 
bookmark. If the first field is a -, the search is done only 
once and only the first match will be added as bookmark. If 
the first field is a +, the search is done until the regular 
expression can no longer be found (the fourth and fifth fields 
can be used to include only a certain range of hits). If there 
is a third field, and it is a string, it gives the name of the 
bookmark as a regular expression (i.e. \1 are replaced by the 
first subexpression of the search, where subexpressions are 
specified by round brackets in the regexp of the second field). 
If there is a third field, and it is a number, it gives the index 
of the subexpression of bmkPatternRegExp that is used as the 
bookmark name.
If there is no third field, the whole matched text will be used 
as bookmark name.
The optional fourth and fifth fields can be used to set bookmarks 
only after the first few ocurrences of the regexp in the text, and 
to stop the search after the expression has been found a certain 
number of times.


If the PDB->PC sync is set up to store the bookmarks in a bookmark file, 
it will create a file "My fairy tale.bm" (no "k") with entries of the form
position,bmkName
The .bmk file will be used if it exists, but if no .bmk file exists, the .bm file
will be used. This way you can override the bookmark settings, while 
at the same time the PDB->TXT sync does not destroy your possibly 
existing .bmk file.


Examples:

1) Imagine you have a line like:
frog princess
In this case, the text is searched for "frog princess", and a 
bookmark is set whenever "frog princess" occurs in the text. 
The name of each of these bookmarks will be "frog princess".

2) A bookmark line:
55, Bookmark at offset 55
Here, a bookmark will be set at offset 55 (55th character of 
the text), and it will have the name "Bookmark at offs" (truncated 
to 16 characters)

3) A bookmark line
-,Chapter \d+
causes a bookmark to be set at the first ocurrence of "Chapter XXX", 
where XXX denotes one or more digits. The bookmark name will be 
"Chapter XXX" (XXX replaced by the actual digits).

4) A bookmark line
+,Chapter \d+
causes bookmarks to be set wherever "Chapter XXX" (XXX being one 
or more digits) appears in the text. The bookmark name will again 
be "Chapter XXX", but the search does not stop after the first hit.

5) A bookmark line
+,\n\s*(Chapter \d+)\D+, 1
causes a bookmark to be set whenever a new line starts with 
"Chapter XXX" (whitespace is allowed before the "Chapter"), and 
uses the first subexpression in (..) as the bookmark name. If you 
have a passage 
     Chapter 15: here it starts
The regular expression will match, so a bookmark will be set there 
and the subexpression "Chapter 15" (which matches the (Chapter \d+) ) 
will be used as bookmark text.

6) A bookmark line
+,\n\s*Part (\d+),\1\. part
sets a bookmark whenever a line starts with "Part XXX". The XXX 
will be stored as the first matched subexpression. The third field 
"\1\. part" is the regular expression for the bookmark name, where 
\1 is replaced by the first matched subexpression of the search (XXX 
in this case). So if a line starts with "   Part 17: ", the bookmark 
name will be "17. part".

7) A bookmark line
+,Table (\d+): ,\1\. Tabelle,5,25
will match whenever "Table XXX: " appears in the text, and the bookmark 
name will be "XXX. Tabelle". However, the fourth field means that the 
first four hits are ignored (the 5th hit is the first hit to be included 
as a bookmark), and the fifth field means that all further hits after the 
25th will be ignored, too.

8) In law texts, I use a regular expression
+,\n *(<28>\.? *\d+[a-z]?\.?) +, 1
to search for all paragraphs starting like "<22>. 15. " or "  <20>23 ", and set
a bookmark there using only the part from the <20> to the last digit or the 
full stop after the last digit (the pattern between the (), in our two 
cases the bookmark names will be "<22>. 15." and "<22>23" ).