INTRODUCTION
============
This DETAILS file accompanies doc2html version 3.0.1.
Read this file for instructions on the installation and use of the
doc2html scripts.
The set of files is:
DETAILS - this file
doc2html.pl - the main Perl script
doc2html.cfg - configuration file for use with wp2html
doc2html.sty - style file for use with wp2html
pdf2html.pl - Perl script for converting PDF files to HTML
swf2html.pl - Perl script for extracting links from Shockwave flash files.
README - brief description
doc2html.pl is a Perl5 script for use as an external converter with
htdig 3.1.4 or later. It takes as input the name of a file containing a
document in a number of possible formats and its MIME type. It uses
the appropriate conversion utility to convert it to HTML on standard
output.
doc2html.pl was designed to be easily adapted to use whatever conversion
utilities are available, and although it has been written around the
"wp2html" utility, it does not require wp2html to function.
NOTE: version 3.0.1 has only been tested on Unix.
pdf2html.pl is a Perl script which uses a pair of utilities (pdfinfo and
pdftotext) to extract information and text from an Adobe PDF file and
write HTML output. It can be called directly from htdig, but you are
recommended to call it via doc2html.pl.
swf2html.pl is a Perl script which calls a utility (swfparse) and
outputs HTML containing links to the URLs found in a Shockwave flash
file. It can be called directly from htdig, but you are recommended to
call it via doc2html.pl.
ABOUT DOC2HTML.PL
=================
doc2html.pl is essentially a wrapper script, and is itself only capable
of reading plain text files. It requires the utility programs described
below to work properly.
doc2html.pl was written by David Adams <d.j.adams@soton.ac.uk>. It is
based on conv_doc.pl written by Gilles Detillieux <grdetil@scrc.umanitoba.ca>.
This in turn was based on the parse_word_doc.pl script, written by
Jesse op den Brouw <MSQL_User@st.hhs.nl>.
doc2html.pl makes up to three attempts to read a file. It first tries
utilities which convert directly into HTML. If one is not found, or no
output is produced, it then tries utilities which output plain text. If
none is found, and the file is not of a type known to be unconvertible,
then doc2html.pl attempts to read the file itself, stripping out any
control characters.
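The order of attempts described above can be sketched roughly as follows.
This is an illustration only, not the code in doc2html.pl: the names
convert_with_fallback and strip_control are hypothetical, and converters
are represented by code references that take a file name and return a
string. As a simplification, the sketch also falls through to the final
attempt when a plain text converter produces no output.

```perl
use strict;
use warnings;

# Hypothetical sketch: try each group of converters in turn and return
# the first non-empty result, tagged with the kind of output produced.
sub convert_with_fallback {
    my ($file, $html_converters, $text_converters, $convertible) = @_;

    # Attempt 1: utilities that produce HTML directly.
    for my $conv (@$html_converters) {
        my $out = $conv->($file);
        return ('html', $out) if defined $out && length $out;
    }

    # Attempt 2: utilities that produce plain text.
    for my $conv (@$text_converters) {
        my $out = $conv->($file);
        return ('text', $out) if defined $out && length $out;
    }

    # Attempt 3: read the file directly, unless its type is known
    # to be unconvertible.
    return ('none', '') unless $convertible;
    open my $fh, '<', $file or return ('none', '');
    binmode $fh;
    local $/;    # slurp mode: read the whole file at once
    my $raw = <$fh>;
    close $fh;
    return ('raw', strip_control($raw));
}

# Remove control characters, keeping tab, newline and carriage return.
sub strip_control {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F\x7F//d;
    return $text;
}
```

A caller would pass two array references of converters plus a flag saying
whether the file type may be read directly.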
doc2html.pl is written to be flexible and easy to adapt to whatever
conversion utilities are available. New conversion utilities may be
added simply by making additions to routine 'store_methods', with no
other changes being necessary. The existing lines in store_methods
should provide sufficient examples on how to add more converters. Note
that converters which produce HTML are entered differently to those that
produce plain text.
htdig provides three arguments which are read by doc2html.pl:
1) the name of a temporary file containing a copy of the
document to be converted.
2) the MIME type of the document.
3) the URL of the document (which is used in generating the
title in the output).
The test for document type uses both the MIME-type passed as second
argument and the "Magic number" of the file.
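As an illustration of combining the MIME type with a magic number test, a
minimal sketch might look like the following. The helper names guess_type
and read_magic are hypothetical and the list of signatures is not taken
from doc2html.pl; the byte patterns themselves are the well-known leading
bytes of PDF, PostScript, RTF and OLE2 (older MS Office) files.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a document type from the MIME type supplied
# by htdig and the first bytes ("magic number") of the file.
sub guess_type {
    my ($mime, $magic) = @_;
    return 'pdf'        if $magic =~ /^%PDF-/;
    return 'postscript' if $magic =~ /^%!/;
    return 'rtf'        if $magic =~ /^\{\\rtf/;
    return 'ole'        if $magic =~ /^\xD0\xCF\x11\xE0/;  # OLE2 container
    return 'text'       if $mime  =~ m{^text/plain}i;
    return 'unknown';
}

# Read the first few bytes of a file for the magic number test.
sub read_magic {
    my ($file) = @_;
    open my $fh, '<', $file or return '';
    binmode $fh;
    my $n = read($fh, my $buf, 8);
    close $fh;
    return $n ? $buf : '';
}

# htdig passes three arguments: temporary file name, MIME type, URL.
if (@ARGV >= 3) {
    my ($file, $mime, $url) = @ARGV;
    print guess_type($mime, read_magic($file)), "\n";
}
```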
INSTALLATION
============
Installation requires that you acquire, compile and install the utilities
you need to do the conversions. Those already set up in the Perl scripts are
described below.
If you don't have the Perl module Sys::AlarmCall installed, then consider
installing it; see the section "TIMEOUT" below.
You may need to change the first line of each script to the location of
Perl on your system.
Edit doc2html.pl to include the full pathname of each utility you have
installed. For example:
my $WP2HTML = '/opt/local/wp2html-3.2/bin/wp2html';
If you don't have a particular utility then leave its location as a null
string.
Then place doc2html.pl and the other scripts where htdig can access them.
If you are going to convert PDF files then you will need to edit pdf2html.pl
and include its full path name in doc2html.pl.
If you are going to extract links from Shockwave flash files then you will
need to edit swf2html.pl and include its full path name in doc2html.pl.
Edit the htdig.conf configuration file to use the script, as in this example:
external_parsers: application/rtf->text/html /usr/local/scripts/doc2html.pl \
text/rtf->text/html /usr/local/scripts/doc2html.pl \
application/pdf->text/html /usr/local/scripts/doc2html.pl \
application/postscript->text/html /usr/local/scripts/doc2html.pl \
application/msword->text/html /usr/local/scripts/doc2html.pl \
application/Wordperfect5.1->text/html /usr/local/scripts/doc2html.pl \
application/msexcel->text/html /usr/local/scripts/doc2html.pl \
application/vnd.ms-excel->text/html /usr/local/scripts/doc2html.pl \
application/vnd.ms-powerpoint->text/html /usr/local/scripts/doc2html.pl \
application/x-shockwave-flash->text/html /usr/local/scripts/doc2html.pl \
application/x-shockwave-flash2-preview->text/html /usr/local/scripts/doc2html.pl
If you are using wp2html then place the files doc2html.cfg and doc2html.sty in the
wp2html library directory.
UTILITY WP2HTML
===============
Obtain wp2html from http://www.res.bbsrc.ac.uk/wp2html/
Note that wp2html is not free; its author charges a small fee for
"registration". Various pre-compiled versions and the source code are
available, together with extensive documentation. Upgrades are
available at no further charge.
wp2html converts WordPerfect documents (5.1 and later) to HTML.
Versions 3.2 and later will also convert Word7 and Word97 documents to
HTML. A feature of wp2html which doc2html.pl exploits is that the -q
option will result in either good HTML or no output at all.
wp2html is very flexible in the output it creates. The two files,
doc2html.cfg and doc2html.sty, should be placed in the wp2html library
directory along with the .cfg and .sty files supplied with wp2html.
Edit the line in doc2html.pl:
my $WP2HTML = '';
to set $WP2HTML to the full pathname of wp2html.
wp2html will look for the title in a document, and if it is found then
output it in <TITLE>....</TITLE> markup. If a title is not found
then it defaults to the file name in square brackets.
If wp2html is unable to convert a document, or is not installed,
then doc2html.pl can use the "catdoc" or "catwpd" utilities instead.
UTILITY CATDOC
==============
Obtain catdoc from http://www.ice.ru/~vitus/catdoc/. It is available
under the terms of the GNU General Public License.
Edit the line in doc2html.pl:
my $CATDOC = '';
to set $CATDOC to the full pathname of catdoc. You might want
to use a different version of catdoc for Word2 documents or for Mac Word
files.
catdoc converts MS Word6, Word7, etc., documents to plain text. The
latest beta version is also able to convert Word2 documents. catdoc
produces a certain amount of "garbage" as well as the text of the
document. The -b option improves the likelihood that catdoc will
extract all the text from the document, but at the expense of increasing
the garbage as well. doc2html.pl removes some non-printing characters
to minimise the garbage. If a later version of catdoc than 0.91.4 is
obtained then the use of the -b option should be reviewed.
UTILITY CATWPD
==============
Obtain catwpd from the contribs section of the Ht://Dig web site where
you obtained doc2html. It extracts words from some versions of WordPerfect
files. You won't need it if you buy the superior wp2html.
If you do use it, then edit the line in doc2html.pl:
my $CATWPD = '';
to set $CATWPD to the full pathname of catwpd.
UTILITY PPTHTML
===============
Obtain ppthtml from http://www.xlhtml.org, where it is bundled in with
xlhtml.
In doc2html.pl, edit the line:
my $PPT2HTML = '';
to set $PPT2HTML to the full pathname of ppthtml.
ppthtml converts Microsoft Powerpoint files into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
UTILITY XLHTML
==============
Obtain xlhtml from http://www.xlhtml.org
In doc2html.pl, edit the line:
my $XLS2HTML = '';
to set $XLS2HTML to the full pathname of xlhtml.
xlhtml converts Microsoft Excel spreadsheets into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
The present version of xlhtml (0.4) writes HTML output, but does not
mark up hyperlinks in .xls files as links in its output.
An alternative to xlhtml is xls2csv; see "UTILITY CATXLS" below.
UTILITY RTF2HTML
================
Obtain rtf2html from http://www.ice.ru/~vitus/catdoc/
In doc2html.pl, edit the line:
my $RTF2HTML = '';
to set $RTF2HTML to the full pathname of rtf2html.
rtf2html converts Rich Text Format documents into HTML. It uses the input
filename as the title. doc2html.pl replaces this with the original
filename from the URL in square brackets.
UTILITY PS2ASCII
================
Ps2ascii is a PostScript to text converter.
In doc2html.pl, edit the line:
my $CATPS = '';
to set $CATPS to the full pathname of ps2ascii.
ps2ascii comes with the Ghostscript package (3.33 or later), which is
pre-installed on many Unix systems. Commonly, it is a Bourne-shell
script which invokes "gs", the Ghostscript binary. doc2html.pl has
provision for adding the location of gs to the search path.
UTILITY PDFTOTEXT
=================
pdftotext converts Adobe PDF files to text. pdfinfo is a tool which
displays information about the document, and is used to obtain its
title, etc. Get them from the xpdf package at
http://www.foolabs.com/xpdf/
In script pdf2html.pl, change the lines:
my $PDFTOTEXT = "/... .../pdftotext";
my $PDFINFO = "/... .../pdfinfo";
to the correct full pathnames.
Edit doc2html.pl to include the full pathname of the pdf2html.pl script.
pdftotext may fail to convert PDF documents which have been truncated
because htdig has max_doc_size set to less than the document's full
size. Some PDF documents do not allow text to be extracted.
UTILITY CATXLS
==============
The Excel to .csv converter, xls2csv, is included with recent versions of
catdoc. This is an alternative to xlhtml (see above).
Edit the line:
my $CATXLS = '';
to set $CATXLS to the full pathname of xls2csv.
xls2csv translates Excel spreadsheets into comma-separated data.
UTILITY SWFPARSE
================
swfparse (aka swfdump) extracts information from Shockwave flash files,
and can be obtained from the contribs section of the Ht://Dig web site,
where you obtained doc2html.
Perl script swf2html.pl calls swfparse and writes HTML output containing
links to the URLs found in the Shockwave file. It does NOT extract text
from the file.
In script swf2html.pl, change the line:
my $SWFPARSE = "/... .../swfdump";
to set $SWFPARSE to the full pathname of swfparse.
Edit doc2html.pl to include the full pathname of the swf2html.pl script.
LOGGING
=======
Output of logging information and error messages is controlled by the
environment variable DOC2HTML_LOG, which may be set in the rundig
script. If it is not set then only error messages output by doc2html.pl
and by the conversion utilities it calls are returned to htdig and will
appear in its STDOUT. If DOC2HTML_LOG is set to a filename, then
doc2html.pl appends logging information and any error messages to the
file. If DOC2HTML_LOG is set but blank, or the file cannot be opened
for writing, logging information and error messages are passed back to
htdig and will appear in its STDOUT.
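The three cases above can be summarised in a short sketch. The
subroutine name log_target and its return values are hypothetical and
are not taken from doc2html.pl; the sketch only decides where messages
should go and how much should be reported.

```perl
use strict;
use warnings;

# Hypothetical sketch: decide the destination ('file' or 'htdig') and the
# level of detail ('errors' or 'all') from the DOC2HTML_LOG variable.
sub log_target {
    my $name = $ENV{DOC2HTML_LOG};

    # Not set: only error messages, returned to htdig.
    return ('htdig', 'errors') unless defined $name;

    # Set to a file name that can be opened for appending: log to the file.
    # Note that testing with '>>' creates the file if it does not exist.
    if (length($name) && open(my $fh, '>>', $name)) {
        close $fh;
        return ('file', 'all');
    }

    # Set but blank, or the file cannot be opened: everything goes to htdig.
    return ('htdig', 'all');
}
```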
In doc2html.pl, the variables $Emark and $EEmark, set in subroutine init,
are used to highlight error messages.
The number of lines of STDERR output from a utility which is logged or
passed back to htdig is controlled by the variable $Maxerr set in
routine "init" of doc2html.pl. This is provided in order to curb the
large number of error messages which some utilities can produce from
processing a single file.
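A cap of this kind could be applied as in the sketch below. The
subroutine name limit_errors is hypothetical, and the trailing note
about suppressed lines is an addition for illustration; doc2html.pl may
simply discard the excess.

```perl
use strict;
use warnings;

# Hypothetical sketch: keep at most $max lines of a utility's STDERR
# output, and note how many lines were dropped.
sub limit_errors {
    my ($max, @lines) = @_;
    return @lines if @lines <= $max;
    my $dropped = @lines - $max;
    return (@lines[0 .. $max - 1],
            "... $dropped further line(s) suppressed\n");
}
```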
TIMEOUT
=======
If possible, install Perl module Sys::AlarmCall, obtainable from CPAN if
you don't already have it. This module is used by doc2html.pl to
terminate a utility if it takes too long to finish. The line in
doc2html.pl:
$Time = 60; # allow 60 seconds for external utility to complete
may be altered to suit.
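For readers without Sys::AlarmCall, the general idea of a timeout can be
shown with Perl's built-in alarm function. This is a sketch of the
technique on Unix, not the mechanism used by doc2html.pl, and the name
run_with_timeout is hypothetical.

```perl
use strict;
use warnings;

# Sketch using the built-in alarm(), not Sys::AlarmCall: run a command,
# capture its standard output, and give up after $timeout seconds.
# Returns the output, or undef on timeout or failure to start.
sub run_with_timeout {
    my ($timeout, @cmd) = @_;
    my ($pid, $pipe, $output);
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        $pid = open($pipe, '-|', @cmd)
            or die "cannot run $cmd[0]: $!\n";
        local $/;              # read all output at once
        $output = <$pipe>;
        alarm 0;
        1;
    };
    alarm 0;                   # make sure no alarm is left pending
    # Terminate the utility before closing the pipe, because close
    # waits for the child process to exit.
    kill 'TERM', $pid if !$ok && $pid;
    close $pipe if $pid;
    return $ok ? $output : undef;
}
```

For example, run_with_timeout(60, '/path/to/utility', $file) would allow
the utility 60 seconds, matching the default shown above; the path here
is a placeholder.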
LIMITING INPUT AND OUTPUT
=========================
The environment variable DOC2HTML_IP_LIMIT may be set in the rundig
script to limit the size of the file which doc2html.pl will attempt to
convert. The default value is 20000000. Doc2html.pl will return no
output to htdig if the file size is equal to or greater than this size.
You are recommended to set DOC2HTML_IP_LIMIT to the same as the
"max_doc_size" parameter in the htdig configuration file. Then no
attempt will be made to extract text from files which have been truncated
by htdig. It is not possible to extract any text from .PDF files, for
example, if they have been truncated.
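The size test amounts to a comparison like the one sketched below. The
helper name within_input_limit is hypothetical; the default of 20000000
and the "equal to or greater than" rule are as stated above.

```perl
use strict;
use warnings;

# Hypothetical sketch: return 1 if the file is small enough to convert,
# 0 if it is missing or its size is equal to or greater than the limit.
sub within_input_limit {
    my ($file) = @_;
    my $limit = $ENV{DOC2HTML_IP_LIMIT};
    $limit = 20_000_000 unless defined $limit && $limit =~ /^\d+$/;
    return 0 unless -e $file;
    my $size = (-s $file) || 0;    # -s gives a false value for empty files
    return $size < $limit ? 1 : 0;
}
```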
The environment variable DOC2HTML_OP_LIMIT may be set in the rundig
script to limit the output sent back to htdig by a single call to
doc2html.pl. The default value is 10000000. Doc2html.pl will stop
returning output to htdig once the DOC2HTML_OP_LIMIT has been reached.
This is a precaution against the unlikely event of a conversion utility
returning disproportionately large amounts of data.
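One way to enforce such an output limit is sketched below. The name
print_limited is hypothetical, and cutting the final chunk at the exact
byte count is an assumption; doc2html.pl may stop at a different
boundary.

```perl
use strict;
use warnings;

# Hypothetical sketch: write chunks of text to an output handle and stop
# once $limit characters have been sent. Returns the number sent.
sub print_limited {
    my ($out, $limit, @chunks) = @_;
    my $sent = 0;
    for my $chunk (@chunks) {
        my $room = $limit - $sent;
        last if $room <= 0;
        $chunk = substr($chunk, 0, $room) if length($chunk) > $room;
        print {$out} $chunk;
        $sent += length $chunk;
    }
    return $sent;
}
```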
CONTACT
=======
Any queries regarding doc2html are best sent to the mailing list
htdig-general@lists.sourceforge.net
The author can be emailed at D.J.Adams@soton.ac.uk
David Adams
Information Systems Services
University of Southampton
27-November-2002