%
%                   A N N O Y A N C E   F I L T E R
%
%                           by John Walker
%                      http://www.fourmilab.ch/
%
%   What's all this, you ask?  Well, this is a "literate program",
%   written in the CWEB language created by Donald E. Knuth and
%   Silvio Levy.  This file includes both the C source code for
%   the program and internal documentation in TeX.  Processing
%   this file with the CTANGLE utility produces the C source file,
%   while the CWEAVE program emits documentation in TeX.  The
%   current version of these programs may be downloaded from:
%
%       http://www-cs-faculty.stanford.edu/~knuth/cweb.html
%
%   where you will find additional information on literate
%   programming and examples of other programs written in this
%   manner.
%
%   If you don't want to wade through all these details, don't
%   worry; this distribution includes a .c file already
%   extracted and ready to compile.  If "make" complains that it
%   can't find "ctangle" or "cweave", just "touch *.cc"
%   and re-make--apparently the process of extracting the files
%   from the archive messed up the date and time, misleading
%   make into believing it needed to rebuild those files.

%   How to talk about LaTeX without actually ever using it
\def\LaTeX{L\kern-.36em\raise.40ex\hbox{\sevenrm A}\kern-.15em\TeX}

% This verbatim mode assumes that ! marks are !! in the text being copied.
% Borrowed from the CWEB manual: cwebman.tex.  Note that you may not
% use the "|" character in the text--use \vbar instead.
\def\verbatim{\begingroup
  \def\do##1{\catcode`##1=12 } \dospecials
  \parskip 0pt \parindent 0pt \let\!=!
  \catcode`\ =13 \catcode`\^^M=13
  \tt \catcode`\!=0 \verbatimdefs \verbatimgobble}
{\catcode`\^^M=13{\catcode`\ =13\gdef\verbatimdefs{\def^^M{\ \par}\let =\ }} %
  \gdef\verbatimgobble#1^^M{}}

\def\CPP/{\CPLUSPLUS/}	% Macro for C++, like \CEE/ and \UNIX/
\def\breakOK{\penalty 0}

\def\vbar{\char124} 	% Macros for characters difficult to quote in certain contexts
\def\bslash{\char92}
\def\atsign{\char64}
\def\caret{\char94}
\def\uline{\char95}
\def\realspace{\char32}
\def\tilde{\char126}

% Registered trademark symbol
\def\registered{{\ooalign{\hfil\hskip0.1em\hbox{\sc R}\hfil\crcr\mathhexbox20D}}}

\def\partitle#1{\medskip\noindent{\bf #1}\smallskip}
% The commented out definition below puts subsections in the table of
% contents with a fixed depth of 3, but it crashes when the title contains
% TeX gnarl.  Anybody know how to fix this?
%\def\subsection#1#2{\partitle{\secno.{#1}. {#2}}{\edef\zoo{\write\cont{\ZZ{{#2}}{3}{\secno}{\the\pageno}}}\zoo}}
\def\subsection#1#2{\partitle{\secno.{#1}. {#2}}}

% Macros used in the "Options." section for options and sub-options
\def\opt#1#2{\vbox{\noindent {\.{#1}}\par
    \vbox{\leftskip=8em\noindent#2}\smallskip}}
\def\aletter#1#2{\hfill\break\hbox to2em{}\hbox to4em{\.{#1} } #2}
  
@i cweb/c++lib.w
@s assert int

@** Introduction.

\vskip 15pt
\centerline{\titlefont The Annoyance Filter}
\vskip 15pt
\centerline{\pdfURL{by John Walker}{http://www.fourmilab.ch/}}

\vskip 15pt
\centerline{This program is in the public domain.}
\vskip 15pt

{\narrower\smallskip\noindent
    Business propaganda must be obtrusive and blatant.  It is its
    aim to attract the attention of slow people, to rouse latent
    wishes, to entice men to substitute innovation for inert
    clinging to traditional routine.  In order to succeed,
    advertising must be adjusted to the mentality of the
    people courted.  It must suit their tastes and speak their
    idiom.  Advertising is shrill, noisy, coarse, puffing, because
    the public does not react to dignified allusions.  It is the
    bad taste of the public that forces the advertisers to
    display bad taste in their publicity campaigns.\hfill\break
\narrower
\hbox to 42em{\hfil---Ludwig von Mises, {\it Human Action}}
\smallskip}

\vskip5ex

%\noindent
This program implements an adaptive Bayesian filter which
distinguishes junk mail from legitimate mail by scanning
archives of each and calculating, for each word which appears
a statistically significant number of times in the body of
text, the probability that the word will appear in junk mail.

\vskip1.5ex
%\noindent
After building a database of word probabilities, arriving mail
is parsed into a list of unique words which are looked up in
the probability database.  A short list of words with extremal
probability (most likely to identify a message as legitimate
or as junk) is used to compute an aggregate message probability
with Bayes' theorem.  This probability is then tested against
a threshold to decide whether the message as a whole is junk.
Mail determined to be junk or legitimate can be added to the
database to refine the probability values and adapt as
the content of mail evolves over time.  Ideally, this could
be triggered by a button in a mail reader which
dispatches a message to the appropriate category.
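
\vskip1.5ex
The aggregation step can be sketched as follows, assuming the program
uses the combining rule given in Paul Graham's essay ``A Plan for
Spam'' (the symbols here are illustrative and do not appear in the
source code).  If $p_1,\ldots,p_n$ are the junk probabilities of the
$n$ selected words, the message probability $P$ is:
$$P = {p_1 p_2 \cdots p_n \over
       p_1 p_2 \cdots p_n + (1-p_1)(1-p_2)\cdots(1-p_n)}$$
For example, two words with probabilities 0.99 and 0.2 combine to
$0.198/(0.198+0.008)\approx 0.96$, so a single strongly indicative
word outweighs a mildly reassuring one.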

\vskip1.5ex
%\noindent
The technique and algorithms used by this program are
as described in Paul Graham's
\pdfURL{``{\it A Plan for Spam\footnote{$^1$}{\rm SPAM$^{\rm\registered}$ is
a registered trademark of
\pdfURL{Hormel Foods Corporation}{http://www.spam.com/}.
Use of the word to denote unsolicited commercial E-mail
is based on the Monty Python skit in which a bunch
of Vikings sing a chorus of ``SPAM, SPAM, SPAM,'' drowning out
all civil discourse.  To avoid confusion with processed meat products,
I use the term ``junk mail'' in this document.  Besides,
if ``spam'' is strictly defined as unsolicited commercial
E-mail, the mandate of this program covers the much broader
spectrum of {\it undesired} mail regardless of provenance
and motivation.}}''}{http://www.paulgraham.com/spam.html}.
This \CPP/ program was developed based on the model Common
Lisp code in his document which, in turn, was modeled on the
original code in the ``Arc'' language he is developing.

\vskip1.5ex
The concept of an adaptive advertising filter and the name
of this program first appeared in my 1989 science
fiction story
\pdfURL{``{\it We'll Return, After This Message}''}{http://www.fourmilab.ch/documents/sftriple/gpic.html}.

\vskip1.5ex
A complete development log giving the detailed history of this
program appears at the end of this document.

\vskip 30pt

\def\PRODUCT{\.{annoyance-filter}}

% PRODUCT and VERSION are defined in configure.in
@d REVDATE "2004-08-04"
@d Xfile   string("X-Annoyance-Filter")

@** User Guide.
\bigskip
\PRODUCT\ is invoked with a command line as follows:
\medskip
\hskip 4em \PRODUCT\ {\it options}
\medskip
\noindent where {\it options} specify processing modes as defined below
and are either long names beginning with two hyphens or single
letter abbreviations introduced by a single hyphen.

@*1 Getting started.

The Annoyance Filter is organised as a toolbox which can
be used to explore content-based mail filtering.  It
includes diagnostic tools and output which will
eventually be little used once the program is tuned
and put into production.

The program is normally run in two phases.  In the
{\it training} phase, collections of legitimate and
junk mail stored in \UNIX/ mail folders are read and
used to build a dictionary in which the probability of
a word's identifying a message as junk is computed.
This dictionary is then exported to be used in subsequent
runs to classify incoming messages based on the word
probabilities determined from prior messages.

\subsection{1}{Building}

If you have a more or less standard present-day \UNIX/ system,
you should be able to build and install the program with the
commands:

\medskip
\verbatim
!   ./configure
!   make
!   make check
!   make install
!endgroup
\smallskip

\subsection{2}{Training}

Now you must {\it train} the program to discriminate
legitimate mail from junk by showing it collections of
such mail you've hand sorted into a pile of stuff you
want to receive and another of stuff you don't.
Assuming you have mail folders containing collections of
legitimate mail and junk named ``\.{m-good}'' and
``\.{m-junk}'' respectively, you can perform the
training phase and create a binary dictionary file
named ``\.{dict.bin}'' and a fast dictionary ``\.{fdict.bin}''
for classifying messages with the command:

\hskip4em{\tt \PRODUCT\ --mail m-good --junk m-junk --prune  \bslash}\hfill\break
\hbox to14em{}{\tt --write dict.bin --fwrite fdict.bin}

The arguments to the \.{--mail} and \.{--junk} options can be
either \UNIX/ ``mail folders'' consisting of one or more
E-mail messages concatenated into a single file, or the name
of a directory containing messages in individual files.  In
either case, the files may be compressed with \.{gzip}---\PRODUCT\
will automatically expand them.  You can supply as many
\.{--mail} and \.{--junk} options as you like on a
command line; their contents are added cumulatively to the
dictionary.

It is {\it absolutely essential} that the collections of legitimate and
junk mail used to train \PRODUCT\ be completely clean---no junk in
the \.{--mail} collection or vice versa.  Pollution of either collection
by messages belonging in the other is very likely to corrupt the
calculation of probabilities, resulting in messages which belong in
one category being assigned to the other.  The \.{utilities/splitmail.pl}
program can help in manually sorting mail into the required two piles, and
I hope some day I will have the time to adequately document it.

You may find it worthwhile to add an archive of mail you've sent to
the legitimate category with \.{--mail}.  In many cases, the
words you use in mail you send are an excellent predictor of
how worthy an incoming message is of your attention.  I've found
this works well with my own archives, but I haven't tested how
effective it is for a broader spectrum of users.

When you compile the collections of junk and legitimate mail
to train \PRODUCT, it's important to include {\it all} the
copies of similar or identical messages you've received in
either category.  \PRODUCT\ bases its classifications on the
frequency of indicative words in the entire set of mail you
receive.  An obscure string embedded in a mail worm spewed
onto the net may not filter it out if you train
\PRODUCT\ with only one copy, but will certainly consign
it to the junk heap if you train \PRODUCT\ with the
twenty or thirty you receive a day.

\subsection{3}{Scoring}

Dictionary in hand, you can now proceed to the {\it scoring}
phase, where the dictionary is used, along with the list of
words appearing in a message, to determine its overall
probability of being junk. If you have a mail message in a file
``\.{mail.txt}'', you can compute and display its junk
probability with:

\hskip4em{\tt \PRODUCT\ --fread fdict.bin --test mail.txt}

\noindent
The probability is written to standard output.  The closer the
probability is to 1, the more likely the mail is junk.

\subsection{4}{Plumbing}

To use \PRODUCT\ as a front-end to another mail filtering
program, specify the \.{--transcript} option before
\.{--test}---the junk probability and classification will be
appended to the message header and written to the designated
transcript destination, standard output if ``\.{-}''.  For
example, to use \PRODUCT\ as a front-end to a mail sorting
program such as \.{Procmail}, you might invoke it with
the command:

\hskip4em{\tt \PRODUCT\ --fread fdict.bin --transcript - --test -}

\noindent
which reads the message to be classified from standard input and
writes the transcript, classification included, to standard output.
Note that since the command line options are processed as
commands, not stateless mode specifications, you must request
the \.{--transcript} before designating the message
to \.{--test}.

\subsection{5}{Progressive Refinement}

Junk mail evolves, but \PRODUCT\ evolves {\it with it}.  As incoming
mail arrives and \PRODUCT\ sorts it into legitimate and junk
categories, there will doubtless be the occasional error.  The
classification defaults used by \PRODUCT\ have been chosen so that
the vast majority of such errors are in the direction of considering
junk mail legitimate rather than the reverse, whose consequences
are much more serious.

As \PRODUCT\ sorts your incoming mail, you'll amass folders of junk
and non-junk it's classified, including the occasional error.
If you take the time to go through these folders and sort out
the occasional mis-classified messages, then add them to
the \PRODUCT\ dictionary, the precision with which it classifies
incoming messages will be increasingly refined.  For example,
suppose your current dictionary is \.{dict.bin} and you have
sorted out folders of legitimate mail \.{new-good} and junk
\.{new-junk} which have arrived since you built the dictionary.
You can update the dictionary based on new messages
with the command:

\hskip4em{\tt \PRODUCT\ --read dict.bin --mail new-good --junk new-junk \bslash}\hfill\break
\hbox to14em{}{\tt --prune --write dict.bin --fwrite fdict.bin}

Perhaps some day a mail client will provide a ``Delete as junk'' button which
automatically discards the offending message and forwards it to
\PRODUCT\ to further refine its criteria for identifying junk.

@*1 Options.

Options are specified on the command line.  Options are treated as
commands---most instruct the program to perform some specific action;
consequently, the order in which they are specified is
significant; they are processed left to right. Long options
beginning with ``\.{--}'' may be abbreviated to any unambiguous
prefix; single-letter options introduced by a single ``\.{-}''
without arguments may be aggregated.

\bigskip

\opt{--annotate {\it options}}{Add the annotations requested by the
characters in {\it options} to the transcript generated
by the \.{--transcript} option.  Upper and lower case
{\it options} are treated identically.  Available annotations are:
\aletter{d}{Decoder diagnostics}
\aletter{p}{Parser warnings and error messages}
\aletter{w}{Most significant words and their probabilities}
}
    
\opt{--autoprune {\it n}}{As the dictionary is being built by appending
mail to it with the \.{--mail} and \.{--junk} options, unique words
will automatically be pruned from it whenever the dictionary
exceeds approximately {\it n} bytes.  This is particularly handy
when loading large collections of messages with \.{--phrasemax}
set greater than one, as a very large number of unique phrases may
clutter the dictionary being built and exceed the memory capacity
of your computer.  You could split the mail collection into
multiple parts and explicitly \.{--prune} after each part, but
\.{--autoprune} is much more convenient.
}

\opt{--biasmail {\it n}}{The frequency of words appearing in legitimate
mail is inflated by the floating point factor {\it n}, which defaults
to 2.  This biases the classification of messages in favour of
``false negatives''---junk mail deemed legitimate, while
reducing the probability of ``false positives'' (legitimate
mail erroneously classified as junk, which is {\it bad}).  The higher
the setting of \.{--biasmail}, the greater the bias in favour of
false negatives will be.}
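
\noindent
A sketch of how this factor might enter the per-word calculation,
assuming the program follows the formula in Paul Graham's ``A Plan
for Spam'' with his fixed factor of 2 replaced by the \.{--biasmail}
value $k$ (the symbols are illustrative, not names from the source):
if a word occurs $g$ times in $N_m$ legitimate messages and $b$ times
in $N_j$ junk messages, its junk probability is
$$p = {\min(1,\,b/N_j) \over \min(1,\,kg/N_m) + \min(1,\,b/N_j)}$$
limited to the range 0.01 to 0.99.  With $g=b=10$ and $N_m=N_j=1000$,
$k=2$ gives $p=0.01/0.03\approx 0.33$, where $k=1$ would give 0.5.
\smallskip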

\opt{--binword {\it n}}{Binary character streams (for example, attachments
of application-specific files, including the executable code of
worm and virus attachments) are scanned and contiguous sequences of
alphanumeric ASCII characters {\it n} characters or longer are
added to the list of words in the message.  The dollar sign
(``\.{\$}'') is considered an alphanumeric character for these
purposes, and words may have embedded hyphens and apostrophes, but
may not begin or end with those characters.  If \.{--binword}
is set to zero, scanning of binary attachments is disabled entirely.
The default setting is 5 characters.}

\opt{--bsdfolder}{The next \.{--mail} or \.{--junk} folder will be
parsed using ``classic BSD'' rules for identifying the start of
individual messages in the folder.  In BSD-style folders, the
text ``\.{From\ }'' as the leftmost characters of a line always
denotes the start of a new message: any appearance of this text in
any other context is always quoted, often by prefixing a
``\.{>}'' character.  In the default \UNIX/ folder syntax,
``\.{From\ }'' only marks the start of a new message if it
appears following one or more blank lines.  Note that you must
specify \.{--bsdfolder} before each folder to be read with BSD
rules; it is not a modal setting.
}

\opt{--classify {\it fname}}{Classify mail in {\it fname}.  If it
equals or exceeds the junk threshold (see
\.{--threshjunk}), ``\.{JUNK}'' is written to standard
output and the program exits with status code 3. If the
message scores less than or equal to the mail threshold
(see \.{--threshmail}), ``\.{MAIL}'' is written to standard
output and the program exits with status 0.  If the
message's score falls between the two thresholds, its
content is deemed indeterminate; ``\.{INDT}'' is written to
standard output and the program exits with a status of 4.
The output can be used to set an environment variable in
\.{Procmail} to control the disposition of the message.
If {\it fname} is ``\.{-}'' the message is read from
standard input.}
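
\noindent
As an illustration, a shell script might act on the exit status
described above as follows.  This is only a sketch: the file names
\.{fdict.bin} and \.{mail.txt} are the examples used earlier in this
guide, and the word written to standard output by \.{--classify} is
discarded so that only the messages chosen here appear.

\medskip
\verbatim
annoyance-filter --fread fdict.bin --classify mail.txt >/dev/null
case $? in
    0) echo "legitimate" ;;
    3) echo "junk" ;;
    4) echo "indeterminate" ;;
    *) echo "unexpected exit status" ;;
esac
!endgroup
\smallskip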

\opt{--clearjunk}{Clear appearances of words in junk mail from database.
Used when preparing a database of legitimate mail.}

\opt{--clearmail}{Clear appearances of words in legitimate mail from database.
Used when preparing a database of junk mail.}

\opt{--copyright}{Print copyright information.}

\opt{--csvread {\it fname}}{Import a dictionary from a
comma-separated value (CSV) file {\it fname}.  Records are
assumed to be in the format written by \.{--csvwrite} but
need not be sorted in any particular order.  Words are added
to those already in memory.}

\opt{--csvwrite {\it fname}}{Export the dictionary as a
comma-separated value (CSV) file {\it fname}.  Such
files can be loaded into spreadsheet or database programs for
further processing.  Words are sorted first in ascending order
of probability they denote junk mail, then lexically.}

\opt{--fread{\rm, }-r {\it fname}}{Load a fast dictionary (previously
created with the \.{--fwrite} option) from file {\it fname}.}

\opt{--fwrite {\it fname}}{Write a dictionary to the file {\it fname}
in fast dictionary format.  Fast dictionaries are written in a binary
format which is {\it not} portable across machines with different
byte order conventions and cannot be added incrementally to assemble
a larger dictionary, but can be loaded in a small fraction of
the time required by the format created by the \.{--write} command.
Using a fast dictionary for routine classification of incoming
mail drastically reduces the time consumed in loading the
dictionary for each message.}

\opt{--help{\rm, }-u}{Print how-to-call information including a
list of options.}
    
\opt{--junk{\rm, }-j {\it fname}}{Add the mail in folder {\it fname}
to the dictionary as junk mail.  These folders may be compressed
by a utility the host system can uncompress; specify the complete
file name including the extension denoting its form of compression.
If {\it fname} is ``\.{-}'' the mail folder is read from
standard input.}

\opt{--list}{List the dictionary on standard output.}
    
\opt{--mail{\rm, }-m {\it fname}}{Add the mail in folder {\it fname}
to the dictionary as legitimate mail.  These folders may be compressed
by a utility the host system can uncompress; specify the complete
file name including the extension denoting its form of compression.
If {\it fname} is ``\.{-}'' the mail folder is read from
standard input.}

\opt{--newword {\it n}}{The probability that a word seen in mail which
does not appear in the dictionary (or appeared too few times to
assign it a probability with acceptable confidence) is indicative of
junk is set to {\it n}.  The default is 0.2---the odds are that novel
words are more likely to appear in legitimate mail than in junk.}

\opt{--pdiag {\it fname}}{Write a diagnostic file to the specified {\it fname}
containing the actual lines the parser processed (after decoding of MIME
parts and exclusion of data deemed unparseable).  Use this option when you
suspect problems in decoding or pre-parser filtering.}

\opt{--phraselimit {\it n}}{Limit the length of phrases assembled according to the
\.{--phrasemin} and \.{--phrasemax} options to {\it n} characters.  This
permits ignoring ``phrases'' consisting of gibberish from mail headers
and un-decoded content.  In most cases these items will be discarded by
a \.{--prune} in any case, but skipping them as they are generated keeps
the dictionary from bloating in the first place.  The default value is
48 characters.}

\opt{--phrasemin {\it n}}{Calculate probabilities of phrases consisting of
a minimum of {\it n} words.  The default of 1 calculates probabilities for
single words.}

\opt{--phrasemax {\it n}}{Calculate probabilities of phrases consisting of
a maximum of {\it n} words.  The default of 1 calculates probabilities for
single words.  If you set this too large, the dictionary may grow
to an absurd size.}

\opt{--plot {\it fname}}{After loading the dictionary, create a
plot in {\it fname}\.{.png} of the histogram of words, binned
by their probability of appearance in junk mail.  In order to
generate the histogram the \.{GNUPLOT} and \.{NETPBM}
utilities must be installed on the system; if they are absent,
the \.{--plot} option will not be available.}

\opt{--pop3port {\it n}}{The POP3 proxy server activated by a subsequent
\.{--pop3server} option will listen for connections on port {\it n}.  If
no \.{--pop3port} is specified, the server will listen on the default
port of 9110.  On most systems, you'll have to run the program as
root if you wish the proxy server to listen on a port numbered
1023 or less.}

\opt{--pop3server {\it server[:port]}}{Activate a POP3
proxy server which relays requests made on the previously
specified \.{--pop3port} or the default of 9110 if no port
is specified, to the specified {\it server}, which may be
given either as an IP address in ``dotted quad'' notation
such as \.{10.89.11.131} or a fully-qualified domain name
like \.{pop.someisp.tld}.  The {\it port} on which the
{\it server} listens for POP3 connections may be specified
after the {\it server} prefixed by a colon (``\.{:}''); if no
port is specified, the IANA assigned POP3 port 110 will be
used. The POP3 proxy server will pass each message received on
behalf of a requestor through the classifier and return the
annotated transcript to the requestor, who may then filter it
based on the classification appended to the message header. You
must load a dictionary before activating the POP3 proxy server,
and the \.{--pop3server} option must be the last on the command
line.  The server continues to run and service requests until
manually terminated.}

\opt{--pop3trace}{Write a trace of POP3 proxy server operations
to standard error.  Each trace message (apart from the dump of the
body of multi-line replies to clients) is prefixed with the
label ``\.{POP3:\ }''.}

\opt{--prune}{After loading the dictionary from \.{--mail} and \.{--junk}
folders, this option discards words which appear sufficiently
infrequently that their probability cannot be reliably
estimated.  One usually \.{--prune}s the dictionary before
using \.{--write} to save it for subsequent runs.}

\opt{--ptrace}{Include a token-by-token trace in the \.{--pdiag} output
file.  This helps when adjusting the parser's criteria for recognising
tokens.  Setting this option without also specifying a \.{--pdiag}
file will have no effect other than perhaps to exercise your fingers
typing it on the command line.}

\opt{--read{\rm, }-r {\it fname}}{Load a dictionary (previously
created with the \.{--write} option) from file {\it fname}.}

\opt{--sigwords {\it n}}{The probability that a message is junk will be computed
based on the individual probabilities of the {\it n} words with extremal
probabilities; that is, probabilities most indicative of junk or mail.  The
default is 15, but there's no obvious optimal setting for this parameter; it
depends in part on the average length of messages you receive.}

\opt{--sloppyheaders}{To evade filtering programs, some junk mail is sent with
MIME part headers which violate the standard but which most mail clients
accept anyway.  This option causes such messages to be parsed as
a browser would, at the cost of standards compliance.  If
\.{--sloppyheaders} is used, it should be specified both when
building the dictionary and when testing messages.}

\opt{--statistics}{After loading the dictionary from \.{--mail} and \.{--junk}
folders, print statistics of the distribution of junk probabilities of
words in the dictionary.  The statistics are written to standard output.}

\opt{--test{\rm, }-t {\it fname}}{Test mail in {\it fname} and
write the estimated probability it is junk to standard output
unless the \.{--transcript} option is also specified with
standard output (``\.{-}'') as the destination, in which case
the inclusion of the probability and classification in the
transcript is adjudged sufficient.  If the \.{--verbose} option
is specified, the individual probabilities of the ``most
interesting'' words in the message will also be output.  If
{\it fname} is ``\.{-}'' the message is read from standard
input.}

\opt{--threshjunk {\it n}}{Set the threshold for classifying a
message as junk to the floating point probability value
{\it n}.  The default threshold is 0.9; messages scored
above \.{--threshjunk} are deemed junk.}

\opt{--threshmail {\it n}}{Set the threshold for classifying a
message as legitimate mail to the floating point probability value
{\it n}.  The default threshold is 0.9, with messages scored
below \.{--threshmail} deemed legitimate.  Note that you may
leave a gap between the \.{--threshmail} and \.{--threshjunk}
values (although it makes no sense to set \.{--threshmail} higher).
Mail scored between the two thresholds will then be judged
of uncertain status.}
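
\noindent
Taken together with the description of \.{--classify}, the two
thresholds suggest the following decision rule for a message
probability $P$, where $t_j$ and $t_m$ are illustrative names for the
\.{--threshjunk} and \.{--threshmail} settings.  The order of the
tests when the two thresholds are equal is an assumption.
$$\hbox{classification} = \cases{
    \hbox{Junk}          & if $P \ge t_j$\cr
    \hbox{Mail}          & if $P \le t_m$\cr
    \hbox{Indeterminate} & otherwise\cr}$$
\smallskip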

\opt{--transcript {\it fname}}{Write an annotated transcript of
the original message to the specified {\it fname}.  If
{\it fname} is ``\.{-}'', the transcript is written to
standard output.  At the end of the message header, the program
appends an \.{X-Annoyance-Filter-Junk-Probability} header item
giving the computed probability and an
\.{X-Annoyance-Filter-Classification}
item which gives the
classification of the message according to
the \.{--threshmail} and \.{--threshjunk} settings; the
classification is given as ``\.{Mail}'', ``\.{Junk}'',
or ``\.{Indeterminate}''.}

\opt{--verbose{\rm, }-v}{Print diagnostic information as the program
performs various operations.}
    
\opt{--version}{Print program version information.}

\opt{--write {\it fname}}{Write a dictionary to the file {\it fname}.
The dictionary is written in a binary format which may be
loaded on subsequent runs with the \.{--read} option.  Binary
dictionary files are portable among machines with different
architectures and byte order.}

@*1 Phrase-based classification.

\PRODUCT\ has the ability to classify messages based upon occurrences
of multiple-word phrases as well as individual words.  Here are results from
an empirical test comparing classification by single word frequencies
with classification by phrases of one to two words, one to three words,
and two to three words.  With this test
set (compiled by hand sorting three years of legitimate and junk mail),
adding classification by two word phrases reduces the number of false
negatives (junk mail erroneously classified as legitimate) by more than
90\%, while preserving 100\% accuracy in identifying legitimate mail.

\vskip 1ex
{\vbox{
\settabs\+\hskip3em&MMMMMMM&MMMMMMMMM&MMMMMMMMM&MMMMM&MMMM&MMMM&MMMMMMM\cr
\+&{\bf Folder} & \.{--phrasemin} & \.{--phrasemax} & {\bf Total} & \hfill{\bf Mail} & \hfill{\bf Junk} & \hfill{\bf Prob}\cr
\+&Junk & \hfill1\hfill & \hfill1\hfill & 8957 &\hfill   37 & \hfill8920 & \hfill0.9970\cr
\+&Mail & \hfill1\hfill & \hfill1\hfill & 2316 &\hfill 2316 & \hfill0 &    \hfill0.0000\cr
\+\cr
\+&Junk & \hfill1\hfill & \hfill2\hfill & 8957 &\hfill   3 & \hfill8954 &  \hfill0.9997\cr
\+&Mail & \hfill1\hfill & \hfill2\hfill & 2316 &\hfill 2316 & \hfill0 &    \hfill0.0000\cr
\+\cr
\+&Junk & \hfill1\hfill & \hfill3\hfill & 8957 &\hfill   9 & \hfill8948 &  \hfill0.9983\cr
\+&Mail & \hfill1\hfill & \hfill3\hfill & 2316 &\hfill 2316 & \hfill0 &    \hfill0.0000\cr
\+\cr
\+&Junk & \hfill2\hfill & \hfill3\hfill & 8957 &\hfill    9 & \hfill8948 & \hfill0.9981\cr
\+&Mail & \hfill2\hfill & \hfill3\hfill & 2316 &\hfill 2316 & \hfill0 &    \hfill0.0000\cr
}
}
\vskip 1ex

There's no need to overdo it, however.  Note that extending classification to
phrases of up to three words actually slightly reduced the accuracy with
which junk was recognised.  In most circumstances, classifying based on
phrases of one and two words will yield the best results.
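
For example, assuming the \.{m-good} and \.{m-junk} folders used in
the training example earlier, a dictionary based on phrases of one
and two words might be built as follows.  Because options are
processed left to right, the phrase settings are given before the
folders they are meant to affect.

\hskip4em{\tt \PRODUCT\ --phrasemin 1 --phrasemax 2 --mail m-good --junk m-junk \bslash}\hfill\break
\hbox to14em{}{\tt --prune --write dict.bin --fwrite fdict.bin}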

@*1 Integrating with {\bf Procmail}.

Many \UNIX/ users plagued by junk mail already use the
\pdfURL{{\tt Procmail}}{http://www.procmail.org/}
program to filter incoming mail.  \.{Procmail}
makes it easy to define a ``whitelist'' of senders
whose mail is always of interest and a ``blacklist''
of known perpetrators of junk mail.  Although \.{Procmail}
includes a flexible weighted scoring mechanism for
evaluating mail based on content, this has limitations
in coping with real world junk mail.  First of all,
choosing keywords and their scores is a completely manual
process which requires continual attention as the
content of junk mail evolves.  Trial and error is
the only mechanism to avoid ``false positives'' (legitimate
mail erroneously considered junk) and ``false negatives''
(junk which makes it through the filter).  Further,
\.{Procmail} looks only at the raw message received
by the mail agent, and contains no logic to decode
attachments, parse HTML, or interpret encoded character sets.
Present-day junk mail has these attributes in
profusion, and often deliberately employs them
in the interest of ``stealth''---evading keyword
based filters such as \.{Procmail}.

\PRODUCT\ has been designed to work either stand-alone or in
conjunction with a filter like \.{Procmail}. Integrating
\PRODUCT\ and \.{Procmail} provides the best of both
worlds---hand-crafted \.{Procmail} filtering of the obvious
cases (whitelists, blacklists, and routine mail filing) and
\PRODUCT\ evaluation of the unclassified residua.  Here's how
you can go about integrating \PRODUCT\ and \.{Procmail}.  In
the examples below, we'll use ``\.{blohard}'' as the user name
of the person installing \PRODUCT.

\subsection{1}{Installing \PRODUCT}

First of all, you need to build \PRODUCT\ for your system,
create a dictionary from collections of legitimate and
junk mail, and install the lot in a location where
the mail transfer agent
(\pdfURL{{\tt Sendmail}}{http://www.sendmail.org/} on
most \UNIX/ systems) can access it.  This can be any
directory owned by the user, but I recommend you use
the default of \.{.\PRODUCT} in your home
(\.{\$HOME}) directory; this is the destination used by the
\.{install} target in the \.{Makefile}.

After you've built your custom dictionary, copy it to
the \.{.\PRODUCT} directory as \.{dict.bin}, along with the fast
dictionary \.{fdict.bin} read by the \.{Procmail} recipe below.

\subsection{2}{Installing Procmail}

Obviously, if you're going to be using \.{Procmail}, it
needs to be installed on your system.  Fortunately,
many present-day Linux distributions come with \.{Procmail}
already installed, so all the user need do is place the
filtering rules (or ``recipes'') in a \.{.procmailrc} file in the
home directory.  If \.{Procmail} is not installed on
your system, please
visit \pdfURL{{\tt Procmail}}{http://www.procmail.org/}
for details on how to remedy that lacuna.  If you do
need to install \.{Procmail}, note that it can be
installed either system-wide, filtering all users'
mail (this is how the Linux distributions generally
install it), or on a per-user basis, which does not
require super-user permissions to install.  Fortunately,
the configuration file is identical regardless of
how \.{Procmail} is installed.

\vbox{
\subsection{3}{Procmail Configuration}

The next few paragraphs will look at typical components
of a \.{Procmail} configuration file which, by default,
is \.{.procmailrc} in the user's home directory.  To
make the script more generic and portable, we'll start
by defining a few environment variables which specify
where \.{Procmail} files mail and writes its log.
}

\smallskip
\verbatim
MAILDIR=$HOME/mailbox    # Be sure this directory exists
LOGFILE=$MAILDIR/logfile # Write a log of Procmail's actions
!endgroup

\vbox{
\subsection{3.1}{Filtering with \PRODUCT}

\PRODUCT\ is integrated with \.{Procmail} as a {\it filter}.
As each message arrives, \.{Procmail} feeds it through
\PRODUCT, which appends its estimation of the probability
the message is junk to the header of the message.
Subsequent \.{Procmail} recipes then test this field
and route the message accordingly.
}

Assuming you've installed \PRODUCT\ in the
\.{\$HOME/.\PRODUCT} directory, you activate the
filtering by adding the following lines to
your \.{.procmailrc} file.  If you make this
the first recipe, any subsequent recipe will
be able to test for the \PRODUCT\ header
fields.

\medskip
\vbox{
\noindent
\.{:0 fw}\hfill\break
\.{\vbar{} \$HOME/.\PRODUCT/\PRODUCT} $\backslash$\hfill\break
\hbox to3em{}\.{--fread \$HOME/.\PRODUCT/fdict.bin --trans - --test -}\hfill\break
}
\smallskip

\noindent
The action line which pipes the message to \PRODUCT\ is continued
onto a second line here in order to fit on the page.  \.{Procmail}
permits continuations of this form, but will equally accept the
command all on one line with the backslash removed.

\subsection{3.2}{Routing by \PRODUCT\ classification}

Once the message has been filtered by \PRODUCT, subsequent
rules can test for its classification and route the message
accordingly. The following rules dispatch messages it
classifies as junk to a \.{junk} folder used by the blacklist,
while messages judged to be legitimate mail and those with an
intermediate probability are sent to the user's mailbox. (With
the default settings, \PRODUCT\ will always classify a message
as mail or junk, but if the \.{--threshjunk} and
\.{--threshmail} settings are changed so as to create a gap
between them, intermediate classification can occur.) 
Actually, the latter two recipes could be omitted since any
message which fails to trigger any \.{Procmail} rule is sent to
the user's mailbox by default.  The variable \.{\$ORGMAIL} is
defined by \.{Procmail} as the user's mailbox; using it
avoids using the specific path name which is dependent on
the user name and mail system configuration.

\smallskip
\verbatim
:0 H:
* ^X-Annoyance-Filter-Classification: Junk
junk

:0 H:
* ^X-Annoyance-Filter-Classification: Mail
$ORGMAIL

:0 H:
* ^X-Annoyance-Filter-Classification: Indeterminate
$ORGMAIL
!endgroup
\smallskip

\noindent
Even if you set the mail and junk probabilities so that
messages can be classified as ``\.{Indeterminate}'', you're
unlikely to see many so categorised---as long as the
collections of mail and junk you used to train \PRODUCT\ are
sufficiently large and representative, the vast majority of
messages will usually be scored near the extremes of
probability. If you're seeing a lot of \.{Indeterminate}
messages, you should sort them manually, add them to the
appropriate collection, and re-train \PRODUCT.
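
One way to collect such messages for manual sorting, sketched here
on the assumption that a folder named \.{review} (a hypothetical
name, not one \PRODUCT\ itself knows about) suits your mail
directory, is to replace the last of the three recipes above with
one which files them separately rather than in your mailbox:

\smallskip
\verbatim
:0 H:
* ^X-Annoyance-Filter-Classification: Indeterminate
review
!endgroup
\smallskip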

If you have other \.{Procmail} recipes for handling
specific categories of mail, you would normally place
the \PRODUCT\ related recipes {\it after} them, at
the very end of the \.{.procmailrc} file.  That way
\PRODUCT's evaluation is used as the final guardian at
the gate before a message is delivered to your mailbox.

\subsection{3.3}{Other useful \.{.procmailrc} rules}

The following subsections have nothing at all to
do with \PRODUCT, really.  You can set up a
\.{.procmailrc} file based exclusively on
\PRODUCT\ classifications as described above.
Still, in many cases a few \.{Procmail} rules
are worthwhile in addition to \PRODUCT\
filtering.  Here are some frequently used
categories.  You would normally place these
rules {\it before} the \PRODUCT\ rules
discussed in section 3.2.

\subsection{3.3.1}{Whitelist}

Most people have a short list of folks with whom they
correspond regularly.  It's embarrassing if the content of a
message from one of them is mistakenly identified as junk
mail.  To prevent this, define a ``whitelist'' as the first
rule in your \.{Procmail} configuration after the filter
command; messages which match its patterns avoid further
scrutiny and are delivered directly to your mailbox.  You
should generally include your own address in the whitelist, as
well as addresses of administrative accounts on machines you're
responsible for, but be careful: junk mailers increasingly use
sender addresses such as \.{root} to exploit whitelists. 
Here's user \.{blohard}'s whitelist definition.  Multiple
conditions in a \.{Procmail} recipe are normally combined with a logical AND
($\land$) operation. Since the whitelist requires an OR
($\lor$) operation, we manufacture one by a trivial application
of \.{Procmail}'s weighted scoring facilities.  \.{Procmail}
patterns are regular expressions identical to those used by
\.{egrep}, so metacharacters such as ``\.{.}'' must be quoted
to be treated literally in patterns.

\smallskip
\verbatim
:0
* 0^0
*   1^1 ^From.*blohard@@spectre\.org
*   1^1 ^From.*auric@@spectre\.org
*   1^1 ^From.*bond@@universal-impex\.co\.uk
*   1^1 ^From.*root@@spectre\.org
$ORGMAIL
!endgroup

\subsection{3.3.2}{Blacklist}

A ``blacklist'' works precisely like the whitelist,
except that anything which matches one of its patterns
is dispatched to the \.{junk} mail folder (or, if you're
particularly confident there will be no false
positives, to oblivion at \.{/dev/null}).
Here we list some egregious spewers and unambiguous earmarks of
junk mail.  Note that in some cases it makes sense
to match on header fields other than ``\.{From}''.
By default, \.{Procmail}'s pattern matching is case-insensitive.

\smallskip
\verbatim
:0
* 0^0
*   1^1 ^From.*@@link3buy\.com
*   1^1 ^From.*@@lowspeedmediaoffers\.com
*   1^1 ^Subject:.*Let's be friends
*   1^1 ^X-Advertisement
*   1^1 ^X-Mailer.*RotMailer
*   1^1 ^To:.*Undisclosed.*Recipient
*   1^1 ^Subject:.*\[ADV\]
*   1^1 ^Subject:.*\(ADV\)
*   1^1 ^Reply-to:.*remove.*@@
*   1^1 ^To.*friend
junk
!endgroup
\smallskip

\noindent
At first glance, blacklists look like a good idea, but
junk mail senders constantly change their domain names,
and trigger words continually evolve protective colouration,
making blacklist maintenance a never-ending process.

\vbox{
\subsection{3.3.3}{Automatic Filing}

If you receive routine mail which you prefer to review
as a batch from time to time, for example, messages
from a mailing list to which you subscribe, you can
have \.{Procmail} recognise them and file them in a
folder for your eventual perusal.  Obviously, you'll
need to identify a pattern which matches all the
messages in the category you wish to file but
no others.
}

\smallskip
\verbatim
:0:
* ^From.*SUPER-VILLAINS +mailing +list
villains

:0 H:
* ^Subject.*Bacula: Backup OK
backups
!endgroup
\smallskip

\noindent
Here, the user has provided a rule which files messages
from a mailing list in a folder and notifications of
successful backup completions (but not error notifications) from
\pdfURL{Bacula}{http://www.bacula.org/} in a second folder.

@*1 Operating a POP3 proxy server.

On systems where it's inconvenient or impossible to interpose
\PRODUCT\ to filter incoming mail, you may be able to use
\PRODUCT\ as a proxy server for the ``Post Office Protocol'' used to
deliver mail from your mail host.

The program you use to read E-mail, for example, Netscape,
Mozilla, or Microsoft Outlook, normally retrieves messages
from a mail server using Post Office Protocol as defined
by Internet
\pdfURL{RFC~1939}{http://www.ietf.org/rfc/rfc1939.txt?number=1939}.
\PRODUCT\ has the ability to act as a {\it proxy} for this
protocol, running on your local machine, and filtering messages
received from your mail server to classify them as legitimate
mail or junk.  Let's assume you currently receive incoming
mail from a POP server at site \.{mail.myisp.net}.  Once
you've created a fast dictionary from your collection of legitimate
and junk mail, you can establish a proxy server directed at that
site with the command:

\hskip4em{\tt \PRODUCT\ --fread fdict.bin --pop3server mail.myisp.net}

Now you need only configure your mail program to request incoming mail
from your local machine (usually called ``\.{localhost}'') on the default
proxy port of 9110.  (You can change the proxy port with the
\.{--pop3port} option if required.)
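
For instance, to have the proxy listen on port 9111 instead (the
port number is arbitrary, and this sketch assumes the option takes
the port number as its argument and is given before
\.{--pop3server}), the command might read:

\hskip4em{\tt \PRODUCT\ --fread fdict.bin --pop3port 9111 --pop3server mail.myisp.net}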

Messages retrieved through the proxy server will be annotated with
\PRODUCT's\break
\.{X-Annoyance-Filter-Classification} header item, which
may be tested in your mail client's filtering rules to appropriately
dispose of the message.

POP3 proxy server support is primarily intended for an individual
user running on a platform which doesn't permit programmatic filtering
of incoming mail.  The proxy server is, however, completely general and
can support any number of individual mailboxes on a mail server, but
with only a single dictionary common to all mailboxes.  Since accurate
mail classification depends upon individual per-user dictionaries,
this is a capability best undeployed.

If you're installing a POP3 proxy server on a Windows machine, you
may wish to create a ``\.{.pif}'' file to launch the program from
the directory in which it resides with the correct options.  A
skeleton \.{pop3proxy.pif} file is included in the Windows
distribution archives which you can edit to specify parameters
appropriate for your configuration.  (To edit the file, right
click on it in Explorer and select the ``Properties'' item from
the pop-up menu.)

@*1 To-do list.

\medskip
\item{$\bullet$} Translation of Chinese and Japanese characters
currently decoded by the \.{GB2312} and \.{Big5} interpreters
into their Unicode representations would permit uniform
recognition of characters across the encodings.

\medskip
\item{$\bullet$} ``Chinese junk'' also sails into the harbour in the
form of HTML in which the only indication of the character set
is in a \.{charset=} declaration in the HTML itself, usually
in a \.{http-equiv="Content-Type"} declaration.  We ought to
try to spot these and invoke the appropriate interpreter.

\medskip
\item{$\bullet$} Audit the MIME parsing code against RFCs
2045--2049 and subsequent updates (2231, 2387, 2557,
2646, and 3032, plus doubtless others).  Examine various
messages in the training collections which report
MIME parsing and/or decoding errors to determine whether
the messages are, indeed, malformed or are indicative of
errors in this program.

\subsection{1}{Belling the cat}

Most of the items on the above list require expertise I
have not had the opportunity to acquire and/or research
and experimentation I've lacked the time to perform.  If
you've the requisite knowledge for one or more of
these jobs and are willing to put coding stick to
magnetic domains, please get in touch.  You can contact me by
sending E-mail to \.{bugs@@fourmilab.ch} with \.{annoyance-filter}
in the \.{Subject} line.

@*1 A Brief History of {\tt annoyance-filter}.

{
\parskip=1ex
In a real sense, this program has been twenty-five years in the
making. The seed was planted in the 1970's while thinking about Jim Warren's
concept of ``datacasting''.  He envisioned using
subcarriers of FM stations (or perhaps data encoded in the
vertical retrace interval of television signals) to transmit
digital information freely accessible to all.  Not Xanadu or
the Internet, mind you $\ldots$ this remained a
one-to-many broadcast medium, but one capable of providing
information in a form which the then-emerging personal
computers could receive, digest, and present in a customised
fashion to their users.

``But who pays?''  Well, that detail, which played a large part
in the inflation and demise of the recent \.{.com}
bubble, was central to the feasibility of datacasting as well.
Jim Warren's view was that the primarily advertiser-supported
business model adopted by most U.S. print and broadcast media
would be equally applicable to bits flung into the ether from a
radio antenna.  As I recall, he cited the experience of
suburban weekly newspapers, which discovered their profits
{\it increased} when they moved from a paid
subscription/per-copy readership to free
distribution---circulation went up, advertising rates rose
apace, and the bottom line changed from red to green.

Intriguing $\ldots$ but still I had my
doubts.  When you read a newspaper or magazine, you
can't avoid the advertising---you can flip past it,
to be sure, but you still have to look at it, at least
momentarily, so there's always the possibility a
sufficiently clever image or tag line may motivate you
to read the rest. I asked Jim why, once a document was
in an entirely digital form, folks couldn't develop
filters to remove the advertising before it ever
reached their eyes.  This would destroy the free
distribution model  and render an advertising-supported
digital broadcasting service unworkable.  Jim wasn't
too concerned about this.  In his estimation, discriminating
advertising from editorial content would require
artificial intelligence which did not exist and wasn't
remotely on the horizon.

That's when von~Mises' words on advertising came
back to me.  Advertising is
{\it advertising}---perforce, it speaks with a
{\it different vocabulary} than the sports page,
letters to the editor, police blotter, national and
international news, and commentary (aside, perhaps,
from Maureen Dowd's columns in {\it The New York
Times}).  Given a sufficiently large collection
of known editorial copy and advertising, might it not
be possible to extract a {\it signature}, in the
sense of radar signatures to discriminate warheads from
decoys in ballistic missile defence, with which a
sufficiently clever program could identify advertising
and remove it, with a high level of confidence, before
the reader ever saw it?

Fast forward---or, more precisely, {\it pause}$\ldots\,$.
By the late 1970's
I'd concluded the best strategy to make the most of
the ambient malaise was to amass a {\it huge
pile} of money.  Money may not buy happiness, but
at the very least it would mitigate many of the
irritations of that bleak, collectivist era. 
Being a nerd, I immediately turned to technology for a
quick fix, and what should I espy but an exploding
market in affordable home video cassette
recorders---VCRs---which were, in those days, becoming a
fixture in more and more households.  Many VCRs were purchased
to play rented movies,
but, being also able to automatically record programs off-the-air
on a preset schedule, they could be used for
``time-shifting''---recording broadcast programs for later
viewing.  But why, thought I, sit through all
those tedious commercials you've recorded along with the
programs you intend to watch?  Certainly, people
quickly learned to ``zip''---use the fast forward to skip
past commercials---but what if you could detect commercials
and ``zap'' them---never record them in the first place?
It occurred to me that inventing a device which
accomplished this might be lucrative indeed.


The concept couldn't have been simpler---a little box
which monitors the video and audio of the channel
you're recording and, based on real-time analysis of
the signal, pauses and resumes recording of the program
on your VCR, yielding a tape free of advertising.  It
was easy to imagine such a gizmo succeeding like the
contemporary ``Demon Dialer'' telephone speed dialer
add-on, selling in the tens of millions in a matter of
months.\footnote{$^2$}{Well
{\it of course} it occurred to me that widespread adoption
of such a device would motivate advertisers to disguise the
tags that discriminated commercials from
programs.  But hey---by the time that happened I'd
have already cashed the customers' checks
and blown the joint.  There was a bit of the
\pdfURL{Ferengi}{http://www.fourmilab.ch/documents/ferengi/palm/}
in me then.  Truth be told, there still is.}  Imagine the dismay of advertisers and my own
contented avarice as I watched the money bin fill deep enough
for high diving.  No more laps round the worry room for
me!

I must confess to some inside information in this
regard.  While working for a regrettable employer
in an odious swamp, I'd twigged to the fact that
network television advertisers tagged their commercials
with a signature in the vertical retrace interval
to permit audit bureaux to measure how many
network affiliates actually broadcast each commercial. 
This tag appeared to me the Achilles' heel of
television advertising.  As long as one could
distinguish tagged commercials from an un-tagged
program, it would be more or less straightforward to
detect when a commercial was being transmitted and
pause the VCR until the program resumed.

If only$\ldots\,$.  In reality, only
nationally broadcast commercials bore the tag, and only
some of them.  Local commercials were never tagged. 
This created a difficult marketing dilemma for my grand
scheme.  While it might have been possible to block
some of the most ubiquitous and irritating commercials
on mass-market network series, the bottom feeders who
{\it watch} those shows probably {\it enjoyed}
the commercials and wouldn't be prospects for my
gadget, while those like myself, infuriated by
incessant commercials interrupting late night movies,
would find the device ineffective since local
commercials on independent stations were never tagged.
Real-time analysis of video or even audio in the 1970's
and early 80's was technologically out of the question
for a product aimed at a mass consumer market.
So, I put the idea of an annoyance filter for television
aside and occupied myself with other endeavours.

We now arrive at the late 1980's.  I'd spent the last decade or
so \pdfURL{filling up the money bin}{http://www.fourmilab.ch/autofile/}
more or less flat out, and having reached a level
I judged more than adequate, I began to turn my attention
to matters I'd neglected during those laser-focused years.

Writing science fiction, for one thing.  There was something
about the advertising filter which had dug its way
into my brain so deeply that nothing could dislodge
it.  The year is 1989; the
\pdfURL{Berlin Wall is about to tumble}{http://www.fourmilab.ch/documents/sftriple/nwab.html}; and
I'm scribbling a story about two programmers
spending the downtime between Christmas and New Year's
Day (the period when I'd accomplished about half of my
\pdfURL{own productive work}{http://www.fourmilab.ch/autofile/www/section2\_115\_3.html}
over the previous half decade) prowling
the nascent Internet for evidence of an extraterrestrial
message already received, but not recognised as such.  In\hfill\break
\line{\hfill\pdfURL{{\it We'll Return, After this Message}}{http://www.fourmilab.ch/documents/sftriple/gpic.html},\hfill}
it is an
{\it annoyance filter} which recognises an
extraterrestrial message for what it is,
{\it advertising}, and as von~Mises
observed, distinguishable by its own strident
clamouring for attention.

A decade later, in the very years in which I set my science
fiction story, I launched
\pdfURL{my own search}{http://www.fourmilab.ch/goldberg/} for a message
from our Creator hidden in the most obvious of locations---no
results so far.  Yet still I scour the Net.

Which brings us, more or less, to the present.  The
idea of an annoyance filter continued to intermittently
occupy my thoughts,
especially as the volume of junk arriving in my mailbox
incessantly mounted despite ongoing efforts to filter it
with increasingly voluminous and clever
\.{Procmail} rules. 
Then, in August 2002, my friend and colleague
\pdfURL{Kern Sibbald}{http://www.sibbald.com/}
brought to my attention Paul Graham's
\pdfURL{brilliant design}{http://www.paulgraham.com/spam.html}
for an adaptable,
Bayesian filter to discriminate junk and legitimate
mail by word frequencies measured in actual samples of
mail pre-sorted into those categories.  Now
{\it that} sounded promising!  Here was a design
which was simple in concept, theoretically sound, and best of all,
{\it it seemed to work}.  Graham implemented his prototype
filter in the ``Arc'' Lisp dialect used in his research. 
I decided to build a deployable tool in industrial-strength \CPP/,
founded on his design, and handling all the details required
so the filter could, as much as possible, interpret
mail the same way a human would---decoding, translating,
and extracting wherever necessary to defeat the techniques
junk mailers adopt to hide their content from
na\"\i ve filtering utilities.

This is not a simple task.  Consider---you can probably
sort out a message you're interested in reading from
unsolicited junk in a fraction of a second, but that
assumes it's presented to you after all of the mail
transfer and content encodings have been peeled away to
reveal the true colours of the content.  Long gone are
days when E-mail was predominantly ASCII text.  Today,
it's more than likely to be HTML (if not a Flash
animation or some other horror), often transmitted in
\.{Quoted-Printable} or \.{Base64} encodings
largely in the interest of ``stealth''---to hide the
content from filters not equipped with the decoding
facilities of a full-fledged mail client.

\PRODUCT\ is based on Graham's
crystalline vision of Bayesian scoring of messages
by empirically determined word probabilities. It includes the
tedious but essential machinery required to parse MIME
multi-part mail attachments, decode non-plain-text
parts, and interpret character sets in languages the
user isn't accustomed to reading. This makes for great
snowdrifts of software, but fortunately few details about
which the typical user need fret.

Preliminary tests indicate \PRODUCT\ is
inordinately effective in discriminating legitimate
from junk mail.  But this entire endeavour remains very
much an active area of research and, consequently,
\PRODUCT\ has been implemented as a
toolkit intended to facilitate experiments with various
filtering strategies and measuring the characteristics
which best identify mail worth reading.  You're more
than welcome to build and install the program using the
cookbook instructions but, if you're inclined to delve
deeper, feel free to jump in---the programming's fine! 
Everyone is invited to contribute their own wisdom and
creativity toward bringing to an end this intellectual
pollution.  Remember, when nobody ever sees
junk mail, nobody will bother to send it.  Let us
commence rowing toward that happy landfall.
}

@** Dictionary Word.

A |dictionaryWord| represents a unique token found in an input stream.
The |text| field is the |string| value of the token.

@<Class definitions@>=
class dictionaryWord {
public:@/
    static const unsigned int nCategories = 2;
    enum mailCategory {Mail = 0, Junk = 1, Unknown};

    string text;    	    	    // The word itself
    unsigned int occurrences[nCategories]; // Number of occurrences in Mail and Junk
    double junkProbability; 	    // Probability this word appears in Junk
    
    dictionaryWord(string s = "") {
    	set(s);
    }
    
    void set(string s = "", unsigned int s_Mail = 0, unsigned int s_Junk = 0,
    	     double jProb = -1) {
    	text = s;
	occurrences[Mail] = s_Mail;
	occurrences[Junk] = s_Junk;
	junkProbability = jProb;
    }
    
    string get(void) const {
    	return text;
    }
    
    unsigned int n_mail(void) const {
    	return occurrences[Mail];
    }
    
    unsigned int n_junk(void) const {
    	return occurrences[Junk];
    }
    
    unsigned int n_occurrences(void) const {
    	unsigned int o = 0;
	
	for (unsigned int i = 0; i < nCategories; i++) {
	    o += occurrences[i];
	}
	return o;
    }
    
    void add(mailCategory cat, unsigned int howMany = 1) {
    	assert(cat == Mail || cat == Junk);
	occurrences[cat] += howMany;
    }
    
    @/
    /* Reset occurrences in category.
       Returns number of occurrences remaining  in
       other categories. */
    unsigned int resetCat(mailCategory cat) {
    	assert(cat == Mail || cat == Junk);
	occurrences[cat] = 0;
	return occurrences[Mail] + occurrences[Junk];
    }
    
    void computeJunkProbability(unsigned int nMailMessages, unsigned int nJunkMessages,
    	double mailBias = 2, unsigned int minOccurrences = 5);
	
    double getJunkProbability(void) const {
    	return junkProbability;
    }
    
    unsigned int length(void) const {     // Return length of word
    	return text.length();
    }
    
    unsigned int estimateMemoryRequirement(void) const {    // Estimate memory consumed by word
    	return (((length() + 3) / 4) * 4) + sizeof(string::size_type) +	// Word text
	       (sizeof(unsigned int) * nCategories) +	// Category counts
	       sizeof(double) +     	    	    	// Junk probability
	       (sizeof(int) * 8);   	    	    	// Overhead
    }
    
    void toLower(void) {    	    // Convert to lower case
    	transform(text.begin(), text.end(), text.begin(), &dictionaryWord::to_iso_lower);
    }
    
    void describe(ostream &os = cout);
    
    void exportCSV(ostream &os = cout);
    bool importCSV(istream &is = cin);
    
    static string categoryName(mailCategory c) {
    	return (c == Mail) ? "mail" : ((c == Junk) ? "junk" : "unknown");
    }
    
    void exportToBinaryFile(ostream &os);
    bool importFromBinaryFile(istream &is);
    
protected:@/    
    
    @<Transformation functions for algorithms@>;
};

@
In order to store |dictionaryWord| objects in ordered containers such
as |map|, we must define the |<| operator.  It ranks objects
by lexical comparison of their |text| fields.

@<Class implementations@>=
bool operator < (dictionaryWord a, dictionaryWord b) {
    return a.get() < b.get();
}

@
The |computeJunkProbability| procedure determines the probability a
given |dictionaryWord| appears in junk mail.  Words with a high
probability (near 1) are almost certain to be from junk, while
low probability words (near 0) are highly likely to appear in
legitimate mail.  The probability is computed based on the
following parameters:

\vskip 1ex
\settabs 5 \columns
\+\hskip5ex$m$&|occurrences[Mail]|&Occurrences of word in legitimate mail\cr
\+\hskip5ex$j$&|occurrences[Junk]|&Occurrences of word in junk mail\cr
\+\hskip5ex$n_m$&|nMailMessages|&Number of legitimate mail messages in database\cr
\+\hskip5ex$n_j$&|nJunkMessages|&Number of junk mail messages in database\cr
\+\hskip5ex$b$&|mailBias|&Bias in favour of words in legitimate messages\cr
\+\hskip5ex$s$&|minOccurrences|&Significance: discard words with $(m\times b+j)<s$\cr
\vskip 1ex

$$p=\cases{-1,&if $(m\times b+j)<s$;\cr
    	   \min(0.99,
	    \max(0.01, {\min({j/{n_j}},
	    	1)\over{\min({(m\times b)/{n_m}}, 1)+\min({j/{n_j}}, 1)})})&otherwise.\cr}$$
	   
A word which appears so few times its probability is deemed
insufficiently determined is assigned a notional probability of $-1$ and
ignored in subsequent tests.  To avoid dividing
by zero when incrementally assembling dictionaries, if no
messages in a category have been loaded, we arbitrarily set the
count to 1.
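
As a worked illustration (the figures are invented for the example,
not taken from any real dictionary), consider a word seen $m=3$ times
in legitimate mail and $j=20$ times in junk, with $n_m=n_j=1000$
messages in each collection and the default parameters $b=2$ and
$s=5$.  Since $m\times b+j=26\geq s$, the word is significant, and

$$p={\min(20/1000, 1)\over{\min((3\times 2)/1000, 1)+\min(20/1000, 1)}}
    ={0.02\over{0.006+0.02}}\approx 0.769,$$

which lies within the clamping limits of 0.01 and 0.99 and is
therefore used unchanged.  Had the word appeared only once in each
category, $m\times b+j=3<s$, and it would instead receive the
notional probability $-1$.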

@<Class implementations@>=
void dictionaryWord::computeJunkProbability(unsigned int nMailMessages, unsigned int nJunkMessages,
    	double mailBias, unsigned int minOccurrences)
{
    double nMail = occurrences[Mail] * mailBias,
    	   nJunk = occurrences[Junk];
	   
    nMailMessages = max(nMailMessages, 1u);
    nJunkMessages = max(nJunkMessages, 1u);
		 
    if ((nMail + nJunk) >= minOccurrences) {
    	assert(nMailMessages > 0);
	assert(nJunkMessages > 0);
    	junkProbability = min(0.99, max(0.01, min(nJunk / nJunkMessages, 1.0) /
	    (min(nMail / nMailMessages, 1.0) + min(nJunk / nJunkMessages, 1.0))));
    } else {
    	junkProbability = -1;
    }
}

@
The |describe| method writes a human-readable description of the
various fields in the object to the designated output stream,
which defaults to |cout|.

@<Class implementations@>=
    void dictionaryWord::describe(ostream &os) {
    	os << text <<
	      "  Mail: " << n_mail() << ", Junk: " << n_junk() <<
	      ", Probability: " << setprecision(5) << junkProbability <<  endl;
    }

@
The |exportCSV| method creates a comma-separated value
(CSV) file containing all fields from the dictionary word.
This permits verification and debugging of the
dictionary compilation process.

@<Class implementations@>=
    void dictionaryWord::exportCSV(ostream &os) {
	os << setprecision(5) << junkProbability << "," <<
	      occurrences[Mail] << "," << occurrences[Junk] << ",\"" <<
	      text << "\"" << endl;
    }
    
@
The |importCSV| method reads the next line from a comma-separated
value (CSV) dictionary dump and stores the values parsed from it
into the |dictionaryWord|.  If this is the special sentinel
pseudo-word used to store the message counts, |junkProbability|
will be set to $-1$.  If the record is not a well-formed CSV
dictionary word, |junkProbability| will be set to $-2$ and
|text| to the actual line from the CSV file; this
may be used to discard title records.  Records which
begin with ``\.{;}'' or ``\.{\#}'' are ignored as comments.
When the end of file is encountered, |false| is returned
and |junkProbability| is set to $-3$.

Note that this is {\it not} a general purpose CSV parser, but
rather one specific to the format which |exportCSV| writes.
In particular, general string quoting is ignored since none
of the difficult cases arise in the CSV we generate.
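
For illustration, a dump in this format might look like the
following.  The words and counts are made up; only the field
order (junk probability, occurrences in mail, occurrences in junk,
quoted word) is taken from |exportCSV|.

\smallskip
\verbatim
; Lines beginning with ; or # are skipped as comments
0.76923,3,20,"mortgage"
0.01,412,0,"meeting"
!endgroup
\smallskip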

@<Class implementations@>=
    bool dictionaryWord::importCSV(istream &is) {
    	while (true) {
	    string s;

	    if (getline(is, s)) {
	    	string::size_type p, p1, p2;
		
		for (p = 0; p < s.length(); p++) {
		    if (!isISOspace(s[p])) {
		    	break;
		    }
		}
		if ((p >= s.length()) || (s[p] == '#') || (s[p] == ';')) {
		    continue;	    // Blank line or comment delimiter---ignore
		}
		
		if ((s[p] == '-') || isdigit(s[p])) {		
		    p = s.find(',');
		    if (p != string::npos) {
		    	p1 = s.find(',', p + 1);
			if (p1 != string::npos) {
			    p2 = s.find(',', p1 + 1);
			    if (p2 != string::npos) {
			    	junkProbability = atof(s.substr(0, p).c_str());
				occurrences[Mail] = atoi(s.substr(p + 1, (p1 - p) - 1).c_str());
				occurrences[Junk] = atoi(s.substr(p1 + 1, (p2 - p1) - 1).c_str());
				p = s.find('"', p2 + 1);
				if (p != string::npos) {
				    p1 = s.find_last_of('"');
				    if ((p1 != string::npos) && (p1 > p)) {
				    	text = s.substr(p + 1, (p1 - p) - 1);
					return true;	// A valid record, hurrah!
				    }
				}
			    }
			}
		    }
		}
		
		junkProbability = -2;   // Ill-formed record
		text = s;
		return true;
	    }
	    junkProbability = -3;   	// End of file
	    return false;
	}
    }
        
@
This method writes a binary representation of the word to an output
stream.  This is used to create the binary word database
used to avoid rebuilding the letter and character category
counts every time.  Each entry begins with the number of
characters in the word followed by its text.  After this,
the count and probability fields are output in portable
big-endian format.  We do assume IEEE floating point compatibility
across platforms, but auto-detect floating point byte order.
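
As a sketch of the resulting layout, a hypothetical entry for the
word \.{free} with 3 occurrences in mail, 20 in junk, and a junk
probability of 0.75 would occupy 21 bytes, shown here in
hexadecimal:

\smallskip
\verbatim
04                        Length of word
66 72 65 65               Text, "free"
00 00 00 03               Occurrences in mail
00 00 00 14               Occurrences in junk
3F E8 00 00 00 00 00 00   Junk probability, 0.75 as an IEEE double
!endgroup
\smallskip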

@<Class implementations@>=
    void dictionaryWord::exportToBinaryFile(ostream &os) {
	unsigned char c;
	const unsigned char *fp;
	const double k1 = -1.0;
	
#define outCount(x) c = (x); os.put(c)
#define outNumber(x) os.put((x >> 24) & 0xFF); os.put((x >> 16) & 0xFF); \
    	    	     os.put((x >> 8) & 0xFF); os.put(x & 0xFF)

	outCount(text.length());
	os.write(text.data(), text.length());
	outNumber(n_mail());
	outNumber(n_junk());
	fp = reinterpret_cast<const unsigned char *>(&k1);
	if (fp[0] == 0) {
	    // Little-endian platform: emit bytes in reverse order
	    fp = reinterpret_cast<unsigned char *>(&junkProbability);
	    for (unsigned int i = 0; i < (sizeof junkProbability); i++) {
	    	outCount(fp[((sizeof junkProbability) - 1) - i]);
    	    }
	} else {
	    // Big-endian platform
	    os.write(reinterpret_cast<const char *>(&junkProbability),
	    	     sizeof junkProbability);
	}
		
#undef outCount
#undef outNumber
    }

@
Importing a word from a binary file is the inverse of the export
above.  Once again we figure out the byte order of |double|
on the fly by testing a constant and decode the byte
stream accordingly.
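
The byte order test can be shown in isolation.  The following sketch
assumes IEEE 754 doubles, as the text does, and uses hypothetical helper
names; it is an illustration of the idea rather than the code used here.

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// IEEE 754 represents -1.0 as BF F0 00 00 00 00 00 00 in big-endian order,
// so a zero first byte in memory indicates a little-endian host.
bool doubleIsLittleEndian() {
    const double k1 = -1.0;
    unsigned char b[sizeof k1];
    std::memcpy(b, &k1, sizeof k1);
    return b[0] == 0;
}

// Hypothetical helper: return the bytes of d in big-endian order.
std::string doubleToBigEndian(double d) {
    std::string bytes(sizeof d, '\0');
    std::memcpy(&bytes[0], &d, sizeof d);
    if (doubleIsLittleEndian()) {
        std::reverse(bytes.begin(), bytes.end());
    }
    return bytes;
}

// Hypothetical helper: the inverse of doubleToBigEndian.
// Returns 0 if the argument is not exactly sizeof(double) bytes long.
double bigEndianToDouble(std::string bytes) {
    double d = 0;
    if (bytes.length() != sizeof d) {
        return d;
    }
    if (doubleIsLittleEndian()) {
        std::reverse(bytes.begin(), bytes.end());
    }
    std::memcpy(&d, bytes.data(), sizeof d);
    return d;
}
```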

@<Class implementations@>=
    bool dictionaryWord::importFromBinaryFile(istream &is) {
	unsigned char c;
	char sval[256];
	unsigned char ibyte[4];
	unsigned char fb[8];
	unsigned char *fp;
	const double k1 = -1.0;
	const unsigned char *kp;

#define iNumber ((ibyte[0] << 24) | (ibyte[1] << 16) | (ibyte[2] << 8) | ibyte[3])
	if (is.read(reinterpret_cast<char *>(&c), 1)) {
	    if (is.read(sval, c)) {
	    	text = string(sval, c);
		is.read(reinterpret_cast<char *>(ibyte), 4);
		occurrences[Mail] = iNumber;
		is.read(reinterpret_cast<char *>(ibyte), 4);
		occurrences[Junk] = iNumber;
	    	kp = reinterpret_cast<const unsigned char *>(&k1);
		if (kp[0] == 0) {
		    is.read(reinterpret_cast<char *>(fb), 8);
		    fp = reinterpret_cast<unsigned char *>(&junkProbability);
		    for (unsigned int i = 0; i < (sizeof junkProbability); i++) {
	    		fp[((sizeof junkProbability) - 1) - i] = fb[i];
    		    }
		} else {
		    is.read(reinterpret_cast<char *>(&junkProbability),
		    	    sizeof junkProbability);
		}
		return true;
	    }
    	}
	return false;
#undef iNumber
    }
    
@
The following are simple-minded transformation functions passed
as arguments to STL algorithms for various manipulations of the
text.
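
As a sketch of how such a function is handed to an algorithm, consider
the following.  The function |ascii_lower| is a hypothetical stand-in
which handles only the ASCII letters; it is not the ISO 8859-1 aware
conversion the program actually uses.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-in for the program's case conversion:
// only the ASCII letters A to Z are affected.
static char ascii_lower(char c) {
    return ((c >= 'A') && (c <= 'Z')) ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Convert a string to lower case in place by passing the
// function to the transform algorithm.
void lowerInPlace(std::string &s) {
    std::transform(s.begin(), s.end(), s.begin(), ascii_lower);
}
```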

@<Transformation functions for algorithms@>=
    
    static char to_iso_lower(char c) {
    	return toISOlower(c);
    }
    
    static char to_iso_upper(char c) {
    	return toISOupper(c);
    }
    
@** Dictionary.

A |dictionary| is a collection of |dictionaryWord| objects,
organised for rapid look-up.  For convenience and efficiency,
we derive |dictionary| from the STL |map| container, thereby
making all of its core functionality accessible to the user.
It would be more efficient and cleaner to use a |set|, but
objects in a |set| cannot be modified; values in a |map|
can.

@<Class definitions@>=
class dictionary : public map<string, dictionaryWord> {
public:@/

    unsigned int memoryRequired;

    void add(dictionaryWord w, dictionaryWord::mailCategory category);
    
    void include(dictionaryWord &w);
    
    void exportCSV(ostream &os = cout);
    void importCSV(istream &is = cin);
    
    void computeJunkProbability(unsigned int nMailMessages, unsigned int nJunkMessages,
    	double mailBias = 2, unsigned int minOccurrences = 5);
	
    void purge(unsigned int occurrences = 0);
    
    void resetCat(dictionaryWord::mailCategory category);
    
    void printStatistics(ostream &os = cout) const;
    
#ifdef HAVE_PLOT_UTILITIES
    void plotProbabilityHistogram(string fileName, unsigned int nBins = 20) const;
#endif

    void exportToBinaryFile(ostream &os);
    void importFromBinaryFile(istream &is);
    
    unsigned int estimateMemoryRequirement(void) const {
    	return memoryRequired;
    }
    
    dictionary() : memoryRequired(0) {
    }
};

@
The |add| method looks up a |dictionaryWord| in the |dictionary|.
If the word is already present, its number of occurrences in the
given |category| is incremented.  Otherwise, the word is added
to the |dictionary| with the occurrence count for the
|category| initialised to 1.
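
The look-up-or-insert pattern can be sketched with a much simplified
word record.  The names |wordCounts|, |wordMap| and |addWord| are
hypothetical, and this version relies on |insert| returning the existing
entry when the key is already present rather than calling |find| first.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical, much simplified stand-in for dictionaryWord.
struct wordCounts {
    unsigned int mail, junk;
    wordCounts() : mail(0), junk(0) {}
};

typedef std::map<std::string, wordCounts> wordMap;

// Count one occurrence of word in the junk or mail category.
// insert() leaves an existing entry untouched and returns an iterator
// to it, so one call serves for both the old and the new word case.
void addWord(wordMap &d, const std::string &word, bool isJunk) {
    wordMap::iterator p = d.insert(std::make_pair(word, wordCounts())).first;
    if (isJunk) {
        p->second.junk++;
    } else {
        p->second.mail++;
    }
}
```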

@<Class implementations@>=
    void dictionary::add(dictionaryWord w, dictionaryWord::mailCategory category) {
    	dictionary::iterator p;

    	if ((p = find(w.get())) != end()) {
	    p->second.add(category);
	} else {
	    insert(make_pair(w.get(), w)).first->second.add(category);
	    memoryRequired += w.estimateMemoryRequirement();
	}
    }


@
The |include| method is used when merging dictionaries, for
example when performing an |importFromBinaryFile|.  It looks
up the argument word in the dictionary.  If present, its
occurrence counts are added to those of the existing word.
Otherwise, a new word is added with the occurrence counts
of the argument.
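
The additive nature of a merge may be easier to see in a reduced form.
In this sketch each word maps to a hypothetical pair of (mail, junk)
counts instead of a full |dictionaryWord|, and |mergeCounts| is an
illustrative name, not part of the program.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical simplification: each word maps to (mail count, junk count).
typedef std::map<std::string, std::pair<unsigned int, unsigned int> > countMap;

// Add every count in from to the corresponding entry of into.
void mergeCounts(countMap &into, const countMap &from) {
    for (countMap::const_iterator p = from.begin(); p != from.end(); ++p) {
        // operator[] creates a zeroed entry when the word is new
        std::pair<unsigned int, unsigned int> &c = into[p->first];
        c.first += p->second.first;
        c.second += p->second.second;
    }
}
```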

@<Class implementations@>=
    void dictionary::include(dictionaryWord &w) {
    	dictionary::iterator p;

    	if ((p = find(w.get())) != end()) {
	    p->second.occurrences[dictionaryWord::Mail] += w.occurrences[dictionaryWord::Mail];
	    p->second.occurrences[dictionaryWord::Junk] += w.occurrences[dictionaryWord::Junk];
	} else {
    	    insert(make_pair(w.get(), w));
	}
    }
    
@
The |exportCSV| method exports the dictionary in comma-separated
value (CSV) format for debugging.  To simplify analysis, the
dictionary is re-sorted by |junkProbability|.  The |byProbability|
comparison function is introduced to permit this sorting of the
dictionary.  A pseudo-word is added at the start of the
CSV file to give the number of mail and junk messages
scanned in preparing it.
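
Sorting a vector of pointers leaves the container itself undisturbed.
A minimal sketch of the idea follows, using a hypothetical |entry|
structure in place of |dictionaryWord|; ties in probability are broken
by the text so the output order is reproducible.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for dictionaryWord.
struct entry {
    std::string text;
    double probability;
};

// Order by probability, then alphabetically among equal probabilities.
bool byProb(const entry *a, const entry *b) {
    if (a->probability != b->probability) {
        return a->probability < b->probability;
    }
    return a->text < b->text;
}

// Build a sorted view of items without moving the items themselves.
std::vector<const entry *> sortedView(const std::vector<entry> &items) {
    std::vector<const entry *> v;
    for (std::vector<entry>::size_type i = 0; i < items.size(); i++) {
        v.push_back(&items[i]);
    }
    std::sort(v.begin(), v.end(), byProb);
    return v;
}
```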

@<Class implementations@>=
    bool byProbability(const dictionaryWord *w1,
    	    	       const dictionaryWord *w2) {
	double dp = w1->getJunkProbability() - w2->getJunkProbability();
	if (dp == 0) {
	    return w1->get() < w2->get();
	}
	return dp < 0;
    }

    void dictionary::exportCSV(ostream &os) {
    	if (verbose) {
	    cerr << "Exporting dictionary to CSV file." << endl;
	}
    	vector<dictionaryWord *> dv;
    	for (iterator p = begin(); p != end(); p++) {
	    dv.push_back(&(p->second));
	}
	sort(dv.begin(), dv.end(), byProbability);
	os << "; Probability,Mail,Junk,Word" << endl;
	dictionaryWord pdw;
	
	pdw.set(pseudoCountsWord,
	    	messageCount[dictionaryWord::Mail],
		messageCount[dictionaryWord::Junk], -1);
	pdw.exportCSV(os);
    	for (vector<dictionaryWord *>::iterator q = dv.begin(); q != dv.end(); q++) {
	    (*q)->exportCSV(os);
	}
    }
    
@
We import a dictionary from a CSV file by importing successive
records into a |dictionaryWord|, which is then appended to the
|dictionary|.  When the pseudo-word containing the number
of mail and junk messages used to assemble the dictionary is
encountered, those quantities are added to the running totals.
Note that the CSV input file may be in any order---it need not
be sorted in the order |exportCSV| creates, nor need the
message count pseudo-word be the first record of the file.

@<Class implementations@>=
    void dictionary::importCSV(istream &is) {
    	if (verbose) {
	    cerr << "Importing dictionary from CSV file." << endl;
	}
	
    	dictionaryWord dw;
	
	while (dw.importCSV(is)) {
	    if (dw.getJunkProbability() == -1 && (dw.get() == pseudoCountsWord)) {
	    	messageCount[dictionaryWord::Mail] += dw.n_mail();
	    	messageCount[dictionaryWord::Junk] += dw.n_junk();
	    } else if (dw.getJunkProbability() >= -1) {
		include(dw);
    	    } else {
	    	if (verbose) {
		    cerr << "Ill-formed record in CSV import: \"" << dw.get() << "\"" << endl;
		}
	    }
	}
    }

@
The |purge| method discards words in the dictionary which occur
sufficiently infrequently that no probability has been assigned them.
If the optional |occurrences| argument is nonzero, words with that
number or fewer occurrences in the dictionary will be purged
instead of words with undefined probability.

May I say a few words about how we accomplish this?
Yes, it looks absurd to move the elements we wish to preserve
to a separate |queue|, then transfer them back once we're done
emptying the |map|.  ``Why not just walk through the items
and |erase| any which don't make the cut?'', you ask.
Because you {\it can't}, I reply.  Performing an |erase|
on a |map| invalidates all iterators to it, so once you've
removed an item, you're forced to restart the scan from
the |begin()| iterator; with a large dictionary to
purge, that takes {\it forever}.

Now STL purists will observe that I ought to be using the
|remove_if| algorithm rather than iterating over the container
myself.  Well, if you can figure out how to make it work,
you're a better man than I\null.  I defined a predicate to
perform a less test on the probability of the |dictionaryWord|
in the second part of the |pair|, and this contraption
makes it past the compiler intact.  But when I attempt to pass
that predicate to |remove_if| I get half a page of gibberish from the
bowels of STL complaining about not being able to use
the default assignment operator on
|string pair<const string, dictionaryWord>::first|
or some such.  If you can figure out how to make
this work, be my guest---I'll be glad to replace my code
with yours with complete attribution.  I've left my |remove_if|
code (which doesn't make it through the compiler) below,
disabled on the tag |PURGE_USES_REMOVE_IF|.  Good luck---me,
I'm finished.

\vskip1ex
\narrower
``A man is not finished when he is defeated.  He is finished when he quits.''
\hfill\break
\narrower
\hbox to 33em{\hfil---Richard M. Nixon}
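
For comparison, here is a minimal sketch of an in-place purge.  It rests
on an assumption that should be checked against the library in use: the
\CPP/ standard specifies that |erase| on a |map| invalidates only
iterators to the erased element, in which case advancing the iterator
before the erasure takes effect permits a single pass.  The names
|probMap| and |purgeNegative| are hypothetical.  As for |remove_if|, it
works by assigning surviving elements over the discarded ones, which the
constant key of a |map| element forbids; that agrees with the compiler
complaint quoted above.

```cpp
#include <map>
#include <string>

// Hypothetical simplification: each word maps directly to its probability.
typedef std::map<std::string, double> probMap;

// Remove every entry whose probability is negative, in a single pass.
// The post-increment advances p before the element it designated is
// erased, so p never refers to a destroyed element.
void purgeNegative(probMap &m) {
    for (probMap::iterator p = m.begin(); p != m.end(); ) {
        if (p->second < 0) {
            m.erase(p++);
        } else {
            ++p;
        }
    }
}
```

If the library at hand does not honour that guarantee, the queue-based
approach used in this program remains the safe choice.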

@<Class implementations@>=
#ifdef PURGE_USES_REMOVE_IF
    class dictionaryWordProb_less : public unary_function<pair<string, dictionaryWord>, int> {
    	int p;
    public:@/
    	explicit dictionaryWordProb_less(const int pt) : p(pt) {}
	bool operator () (const pair<string, dictionaryWord> &dw) const {
	    return dw.second.getJunkProbability() < p;
	}
    };
#endif

    void dictionary::purge(unsigned int occurrences) {
    	if (verbose) {
	    cerr << "Pruning rare words from database: " << flush;
	}
	memoryRequired = 0;
	
#ifdef PURGE_USES_REMOVE_IF
    	remove_if(begin(), end(), dictionaryWordProb_less(0));
#else
    	queue <dictionaryWord> pq;
    	while (!empty()) {
	    if (((occurrences > 0) && (begin()->second.n_occurrences() > occurrences)) ||
	    	(begin()->second.getJunkProbability() >= 0)) {
	    	pq.push(begin()->second);
	    }
	    erase(begin());
	}
	while (!pq.empty()) {
	    insert(make_pair(pq.front().get(), pq.front()));
	    memoryRequired += pq.front().estimateMemoryRequirement();
	    pq.pop();
	}
#endif

    	if (verbose) {
	    cerr << size() << " words remaining." << endl;
	    cerr << "  Dictionary size " << estimateMemoryRequirement() << " bytes." << endl;
	}
    }
    
@
The |resetCat| method resets the count for all words for
the given |mailCategory|.

@<Class implementations@>=
    void dictionary::resetCat(dictionaryWord::mailCategory category) {
    	if (verbose) {
	    cerr << "Resetting counts for category " <<
	    	dictionaryWord::categoryName(category) << endl;
	}
    	for (iterator mp = begin(); mp != end(); mp++) {
	   mp->second.resetCat(category);
	}
    }

@
Compute and print statistical measures of the probability
distribution of words in the dictionary.  Words with negative
probability are ignored, so there is no need to |purge| before
computing statistics.

@<Class implementations@>=
    void dictionary::printStatistics(ostream &os) const {
    	if (verbose) {
	    cerr << "Computing dictionary statistics." << endl;
	}
    	os << "Dictionary statistics:" << endl;
	dataTable <double> dt;
	
    	for (const_iterator mp = begin(); mp != end(); mp++) {
	    if (mp->second.getJunkProbability() >= 0) {
	    	dt.push_back(mp->second.getJunkProbability());
	    }
	}
        os << "Mean = " << dt.mean() << endl;
        os << "Geometric mean = " << dt.geometricMean() << endl;
        os << "Harmonic mean = " << dt.harmonicMean() << endl;
        os << "RMS = " << dt.RMS() << endl;
        os << "Median = " << dt.median() << endl;
        os << "Mode = " << dt.mode() << endl;
        os << "Percentile(0.5) = " << dt.percentile(0.5) << endl;
        os << "Quartile(1) = " << dt.quartile(1) << endl;
        os << "Quartile(3) = " << dt.quartile(3) << endl;

        os << "Variance = " << dt.variance() << endl;
        os << "Standard deviation = " << dt.stdev() << endl;
        os << "CentralMoment(3) = " << dt.centralMoment(3) << endl;
        os << "Skewness = " << dt.skewness() << endl;
        os << "Kurtosis = " << dt.kurtosis() << endl;
    }

@
Plot a histogram of the distribution of words in the dictionary
by probability.  Words with negative probability are ignored, so
there is no need to |purge| before plotting.

@<Class implementations@>=
#ifdef HAVE_PLOT_UTILITIES
#define PLOT_DEBUG
    void dictionary::plotProbabilityHistogram(string fileName, unsigned int nBins) const {
    	if (verbose) {
	    cerr << "Plotting probability histogram to " << fileName << ".png" << endl;
	}
	ofstream gp((fileName + ".gp").c_str()),
    		 dat((fileName + ".dat").c_str());

    	@<Build histogram of word probabilities@>;
	@<Write GNUPLOT data table for probability histogram@>;

	//	Create GNUPLOT instructions to plot data

	gp << "set term pbm small color" << endl;
	gp << "set ylabel \"Number of Words\"" << endl;
	gp << "set xlabel \"Probability\"" << endl;

	gp << "plot \"" << fileName << ".dat\" using 1:2 title \"\" with boxes" << endl;

	string command("gnuplot ");
	command += fileName + ".gp | pnmtopng >" + fileName + ".png";
#ifdef PLOT_DEBUG
	cout << command << endl;
#else
	command += " 2>/dev/null";
#endif
	gp.close();
	dat.close();
	system(command.c_str());
#ifndef PLOT_DEBUG
	//	Delete the temporary files used to create the plot
	remove((fileName + ".gp").c_str());
	remove((fileName + ".dat").c_str());
#endif
    }
#endif /* |HAVE_PLOT_UTILITIES| */
    
@
Walk through the dictionary and bin the probabilities of words
into |nBins| equally sized bins and compute a histogram of
the numbers in each bin.
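
The binning arithmetic is simple enough to show on its own.  This sketch
uses a hypothetical free function |binProbabilities| over a plain vector
and assumes |nBins| is greater than zero; it places a probability of
exactly 1 in the top bin.

```cpp
#include <vector>

// Count how many non-negative probabilities fall into each of nBins
// equally sized bins covering the range 0 to 1.  Assumes nBins > 0.
std::vector<unsigned int> binProbabilities(const std::vector<double> &p,
                                           unsigned int nBins) {
    std::vector<unsigned int> hist(nBins);
    for (std::vector<double>::size_type i = 0; i < p.size(); i++) {
        if (p[i] >= 0) {
            unsigned int bin = static_cast<unsigned int>(p[i] * nBins);
            if (bin >= nBins) {
                bin = nBins - 1;    // A probability of exactly 1 lands in the top bin
            }
            hist[bin]++;
        }
    }
    return hist;
}
```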

@<Build histogram of word probabilities@>=
    vector <unsigned int> hist(nBins);
	
    for (const_iterator mp = begin(); mp != end(); mp++) {
	if (mp->second.getJunkProbability() >= 0) {
	    unsigned int bin = static_cast<unsigned int>(mp->second.getJunkProbability() * nBins);
	    
	    if (bin >= nBins) {
	    	bin = nBins - 1;    // A probability of exactly 1 belongs in the top bin
	    }
	    hist[bin]++;
	}
    }
    
@
Write the \.{GNUPLOT} data file for the probability histogram.
The first field in each line is the binned probability and the
second is the number of words which fell into that bin.

@<Write GNUPLOT data table for probability histogram@>=
    for (unsigned int j = 0; j < nBins; j++) {
    	dat << (static_cast<double>(j) / nBins) << " " << hist[j] << endl;
    }


@
When the dictionary has been modified, recompute the junk probability
of all the words it contains.  This simply applies the |computeJunkProbability|
method to all the |dictionaryWord|s in the container.

@<Class implementations@>=
    void dictionary::computeJunkProbability(unsigned int nMailMessages, unsigned int nJunkMessages,
    	double mailBias, unsigned int minOccurrences)
    {
    	for (dictionary::iterator p = begin(); p != end(); p++) {
	    p->second.computeJunkProbability(nMailMessages, nJunkMessages,
    	    	    mailBias, minOccurrences);
	}
    }

@
Exporting or importing a dictionary to or from a binary file is
more or less a matter of iterating through the dictionary and
delegating the matter to each individual word.  One detail
we must deal with, however, is adding a pseudo-word at the
head of the dictionary to record the number of mail and
junk {\it messages} which contributed the words to the
dictionary.  These counts are needed to subsequently
recompute the probability for each word.

When loading a dictionary with |importFromBinaryFile|
this pseudo-word is recognised and the values it contains
are added to the |messageCount| for each category.  Note that
importing a file is logically an {\it addition} to an
existing dictionary---you may import any number of
binary dictionary files, just as you can add mail
folders with the \.{--mail} and \.{--junk} options.

@d pseudoCountsWord " COUNTS "

@<Class implementations@>=
    void dictionary::exportToBinaryFile(ostream &os) {
    	if (verbose) {
	    cerr << "Exporting dictionary to binary file." << endl;
	}
	dictionaryWord pdw;
	
	pdw.set(pseudoCountsWord,
	    	messageCount[dictionaryWord::Mail],
		messageCount[dictionaryWord::Junk], -1);
	pdw.exportToBinaryFile(os);
	
    	for (dictionary::iterator p = begin(); p != end(); p++) {
	    p->second.exportToBinaryFile(os);
	}
    }
    
    void dictionary::importFromBinaryFile(istream &is) {
    	if (verbose) {
	    cerr << "Importing dictionary from binary file." << endl;
	}
	
    	dictionaryWord dw;
	
	if (dw.importFromBinaryFile(is)) {
	    assert(dw.get() == pseudoCountsWord);
	    messageCount[dictionaryWord::Mail] += dw.n_mail();
	    messageCount[dictionaryWord::Junk] += dw.n_junk();
	    
	    while (dw.importFromBinaryFile(is)) {
		include(dw);
    	    }
	}
    }
    
@** Fast dictionary.

A |fastDictionary| sacrifices portability and generality on the altar
of speed.  A |dictionary| exported as a |fastDictionary| can be loaded
into memory (or, even better, memory mapped if the system permits), and
accessed directly without the need to allocate or initialise any objects.
The price one pays for this is that fast dictionaries may not be
shared among platforms with different byte order or floating
point representation, but such incompatibilities are detected and
yield error messages, not Armageddon.
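
How a byte order mark exposes a foreign file can be sketched as follows.
The function |classifyByteOrderMark| and its return codes are
hypothetical; the sketch only demonstrates that a mark of |0xFEFF|
written in one byte order reads back as |0xFFFE| in the other.

```cpp
#include <cstring>
#include <stdint.h>

const uint16_t byteOrderMark = 0xFEFF;

// Classify a two byte mark exactly as it appears in a file header:
//   1  written with this machine's byte order
//  -1  written with the opposite byte order
//   0  not a byte order mark at all
int classifyByteOrderMark(const unsigned char raw[2]) {
    uint16_t s;
    std::memcpy(&s, raw, sizeof s);
    if (s == byteOrderMark) {
        return 1;
    }
    if (s == 0xFFFE) {
        return -1;
    }
    return 0;
}
```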

@d fastDictionaryVersionNumber	1
@d fastDictionaryVoidLink   	static_cast<u_int32_t>(-1)
@d fastDictionarySignature  	"AFfd"
@d fastDictionaryFloatingTest	(1.0 / 111)

@<Class definitions@>=
class fastDictionary {
private:@/
    static const u_int16_t byteOrderMark = 0xFEFF;
    static const u_int16_t doubleSize = sizeof(double);
    static const u_int16_t versionNumber = fastDictionaryVersionNumber;

    unsigned char *dblock;  	    	// Monolithic dictionary block pointer
    u_int32_t totalSize;    	    	// Total dictionary size in bytes
    u_int32_t hashTableOffset;	    	// Offset of hash table in file
    u_int32_t hashTableBuckets;     	// Number of buckets in hash table
    u_int32_t wordTableSize;	    	// Word table size in bytes
    
    u_int32_t *hashTable;   	    	// Pointer to hash table in memory
    unsigned char *wordTable;	    	// Pointer to word table in memory
    
#ifdef HAVE_MMAP
    char *dp;	    	    	    	// Pointer to memory mapped block
    int fileHandle; 	    	    	// File handle to memory mapped dictionary
    long fileLength;	    	    	// Length of memory mapped block
#endif
    
    void regen(void) const {
    	cerr << "You should re-generate the fast dictionary on this machine." << endl;
    }
    
    static unsigned int nextGreaterPrime(unsigned int a);
    
    static u_int32_t computeHashValue(const string &s);
    
    static void Vmemcpy(vector <unsigned char> &v,
    	    	    	vector <unsigned char>::size_type off,
		    	const void *buf, const unsigned int bufl) {
	const unsigned char *bp = static_cast<const unsigned char *>(buf);
	
	for (unsigned int i = 0; i < bufl; i++) {
	    v[off++] = *bp++;
	}
    }
        
public:@/
    fastDictionary() : dblock(NULL) {
#ifdef HAVE_MMAP
    	dp = NULL;
#endif
    }
    
    ~fastDictionary() {
#ifdef HAVE_MMAP
    	if (dp != NULL) {
    	    munmap(dp, fileLength);
	    close(fileHandle);
	}
#else
    	if (dblock != NULL) {
	    delete[] dblock;
    	}
#endif
    }
    
    bool load(const string fname);
    
    bool isDictionaryLoaded(void) {
    	return dblock != NULL;
    }
    
    double find(const string &target) const;
    
    void describe(ostream &os = cout) const {
    	if (dblock != NULL) {
    	    os << "Total size of fast dictionary is " << totalSize << endl;
    	    os << "Hash table offset: " << hashTableOffset << endl;
    	    os << "Hash table buckets: " << hashTableBuckets << endl;
    	    os << "Word table size: " << wordTableSize << endl;
	} else {
	    os << "No fast dictionary is loaded." << endl;
	}
    }

    static void exportDictionary(const dictionary &d, ostream &o);
    static void exportDictionary(const dictionary &d, const string fname);
};

@
The |load| method brings a |fastDictionary| into memory, either by
reading it into a dynamically allocated buffer or by memory mapping
the file containing it.  Even when we're memory mapping the dictionary,
we read the header using an |istrstream| bound to the memory mapped
block in the interest of code commonality---the real win in memory
mapping is shared access to the hash and word tables; the overhead in
reading the header fields from a memory stream is negligible.

@<Class implementations@>=
    bool fastDictionary::load(const string fname) {
#ifdef HAVE_MMAP
	fileHandle = open(fname.c_str(), O_RDONLY);
	if (fileHandle == -1) {
	    cerr << "Cannot open fast dictionary file " << fname << endl;
	    return false;
	}
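	fileLength = lseek(fileHandle, 0, SEEK_END);
	lseek(fileHandle, 0, SEEK_SET);
	dp = static_cast<char *>(mmap((caddr_t) 0, fileLength,
		PROT_READ, MAP_SHARED | MAP_NORESERVE, fileHandle, 0));
	if (dp == MAP_FAILED) {
	    cerr << "Cannot map fast dictionary file " << fname << endl;
	    close(fileHandle);
	    dp = NULL;
	    return false;
	}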
	istrstream is(dp, fileLength);
#else
    	ifstream is(fname.c_str(), ios::in | ios::binary);

    	if (!is) {
	    cerr << "Cannot open fast dictionary file " << fname << "." << endl;
	    return false;
	}
#endif
	char signature[4];
	is.read(signature, 4);
	if (memcmp(signature, fastDictionarySignature, 4) != 0) {
	    cerr << "File " << fname << " is not a fast dictionary." << endl;
fdlbail:;
#ifdef HAVE_MMAP
    	    munmap(dp, fileLength);
	    close(fileHandle);
	    dp = NULL;
#endif
	    return false;
	}
	
	u_int16_t s;
	is.read(reinterpret_cast<char *>(&s), sizeof s);
	if (s != byteOrderMark) {
	    cerr << "Fast dictionary file " << fname <<
	    	" was created on a platform with incompatible byte order." << endl;
	    regen();
	    goto fdlbail;
	}
	
	is.read(reinterpret_cast<char *>(&s), sizeof s);
	if (s != versionNumber) {
	    cerr << "Fast dictionary file " << fname <<
	    	" is version " << s << ".  Version " << versionNumber << " is required." << endl;
	    regen();
	    goto fdlbail;
	}
	
	double d;
	is.read(reinterpret_cast<char *>(&s), sizeof s);
	u_int16_t filler;	
	is.read(reinterpret_cast<char *>(&filler), sizeof filler);  // Two byte filler for alignment
	if (s == doubleSize) {
	    is.read(reinterpret_cast<char *>(&d), sizeof d);
	}
	if ((s != doubleSize) || (d != fastDictionaryFloatingTest)) {
	    cerr << "Fast dictionary file " << fname <<
	    	" was created on a machine with incompatible floating point format." << endl;
	    regen();
	    goto fdlbail;
	}
	
	is.read(reinterpret_cast<char *>(&totalSize), sizeof totalSize);
	is.read(reinterpret_cast<char *>(&hashTableOffset), sizeof hashTableOffset);
	is.read(reinterpret_cast<char *>(&hashTableBuckets), sizeof hashTableBuckets);
    	is.read(reinterpret_cast<char *>(&wordTableSize), sizeof wordTableSize);

#ifdef HAVE_MMAP
    	dblock = reinterpret_cast<unsigned char *>(dp) + is.tellg();
#else
    	u_int32_t fdsize = (hashTableBuckets * sizeof(u_int32_t)) + wordTableSize;
    	try {
    	    dblock = new unsigned char[fdsize];
	} catch (bad_alloc) {
	    cerr << "Unable to allocate memory for fast dictionary." << endl;
	    return false;
	}
	is.read(reinterpret_cast<char *>(dblock), fdsize);
	is.close();
#endif
	
	hashTable = reinterpret_cast<u_int32_t *>(dblock);
	wordTable = dblock + (hashTableBuckets * sizeof(u_int32_t));
	
	if (verbose) {
	    cerr << "Loaded fast dictionary from " << fname << "." << endl;
	}
	
	return true;
    }
    
@
The |find| method looks up the word |target| (assumed to have been
already placed in canonical form) in the dictionary.  The junk
probability of the word is returned, or $-1$ if the word is not
found in the dictionary.  The reason for all the |memcpy|
calls is that the word table is byte packed and
we don't want to worry about whatever alignment issues the
platform may have.
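
The alignment concern can be illustrated with a pair of hypothetical
accessors, |fetch16| and |fetchDouble|.  This is a sketch of the
technique only: copying through |memcpy| never dereferences a
misaligned pointer, whatever offset the field happens to occupy.

```cpp
#include <cstring>
#include <stdint.h>

// Fetch a 16 bit field from an arbitrary byte offset in a packed table.
uint16_t fetch16(const unsigned char *table, unsigned int offset) {
    uint16_t v;
    std::memcpy(&v, table + offset, sizeof v);
    return v;
}

// Fetch a double from an arbitrary byte offset in a packed table.
double fetchDouble(const unsigned char *table, unsigned int offset) {
    double v;
    std::memcpy(&v, table + offset, sizeof v);
    return v;
}
```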

@<Class implementations@>=
    double fastDictionary::find(const string &target) const {
    	assert(dblock != NULL);
    	u_int32_t bucket = computeHashValue(target) % hashTableBuckets;
	if (hashTable[bucket] != fastDictionaryVoidLink) {
	    u_int16_t wlen = target.length();
	    unsigned int sOffset = sizeof(u_int32_t) + sizeof(double);
	    unsigned char *cword = wordTable + hashTable[bucket];
	    
	    while (true) {
		u_int16_t wl;
		memcpy(&wl, cword + sOffset, sizeof wl);
		if ((wl == wlen) &&
	    	    (memcmp(target.data(), cword + sOffset + sizeof(u_int16_t), wlen) == 0)) {
		    double jp;
		    
		    memcpy(&jp, cword + sizeof(u_int32_t), sizeof(double));
		    return jp;
		}
		u_int32_t lnk;
		memcpy(&lnk, cword, sizeof lnk);
		if (lnk == fastDictionaryVoidLink) {
		    break;
    	    	}
		cword = wordTable + lnk;
	    }
	}
	return -1;
    }
    
@
The |exportDictionary| method writes a dictionary to a file in
|fastDictionary| format.  We provide implementations which
accept either an |ostream| or the name of a file to which
the |fastDictionary| is written.  If you pass an |ostream|,
make sure it's opened in binary mode on platforms where
that matters.

@<Class implementations@>=
    void fastDictionary::exportDictionary(const dictionary &d, ostream &o) {
    	u_int32_t hashSize = nextGreaterPrime(d.size());
	
    	vector <u_int32_t> hashTable(hashSize, fastDictionaryVoidLink);
	vector <unsigned char> words;
	
    	for (dictionary::const_iterator w = d.begin(); w != d.end(); w++) {
	    u_int32_t h = computeHashValue(w->first);
	    unsigned int slot = h % hashSize;
	    
	    @<Link new word to hash table chain@>;
	    @<Add new word to word table@>;
	}

    	o << fastDictionarySignature;
	
	u_int16_t b;
	b = byteOrderMark;
	o.write(reinterpret_cast<const char *>(&b), sizeof b);	    // Byte order mark
	
	b = versionNumber;
	o.write(reinterpret_cast<const char *>(&b), sizeof b);	    // File version number
	
	b = doubleSize;
	o.write(reinterpret_cast<const char *>(&b), sizeof b);	    // Size of |double| in bytes
	
	b = 0;
	o.write(reinterpret_cast<const char *>(&b), sizeof b);	    // Two byte filler for alignment
	
	double td = fastDictionaryFloatingTest;
	o.write(reinterpret_cast<const char *>(&td), sizeof td);    // |double| compatibility test: $1\over 111$
	
	u_int32_t headerSize = 4 + (4 * sizeof(u_int16_t)) + sizeof(double) +
	    (4 * sizeof(u_int32_t));
	    
	u_int32_t wordTableSize = words.size();
	    
	u_int32_t totalSize = headerSize +
	    	    	      (hashTable.size() * sizeof(u_int32_t)) +
			      wordTableSize;

	o.write(reinterpret_cast<const char *>(&totalSize), sizeof totalSize);	// Total size of file	
	o.write(reinterpret_cast<const char *>(&headerSize), sizeof headerSize);    // Hash table offset
	o.write(reinterpret_cast<const char *>(&hashSize), sizeof hashSize);	// Number of buckets in hash table
	o.write(reinterpret_cast<const char *>(&wordTableSize), sizeof wordTableSize);	// Word table size in bytes

#ifdef OLDWAY
	o.write(hashTable.begin(), hashTable.size() * sizeof(u_int32_t)); // Hash table
	
	o.write(words.begin(), words.size());	// Word table
#else
    	for (vector <u_int32_t>::const_iterator htp = hashTable.begin();
	     htp != hashTable.end(); htp++) {
	    u_int32_t hte = *htp;
	    o.write(reinterpret_cast<const char *>(&hte), sizeof hte);
	}
	
    	for (vector <unsigned char>::const_iterator wtp = words.begin();
	     wtp != words.end(); wtp++) {
	    o.put(*wtp);
	}
#endif

	if (verbose) {
	    cerr << "Exported " << d.size() << " words to fast dictionary." << endl;
	}
    }
    
    void fastDictionary::exportDictionary(const dictionary &d, const string fname) {
    	ofstream of(fname.c_str(), ios::out | ios::binary);
	
	if (of) {
	    exportDictionary(d, of);
	    of.close();
	} else {
	    cerr << "Unable to create fast dictionary file " << fname << endl;
	}
    }
    
@
Having determined which bucket in the hash table this word falls
into, we can link it to the hash table itself (if the bucket is
empty), or to the end of the chain of words already sorted
into this bucket.  All links are relative to the start of the
|words| vector.

@<Link new word to hash table chain@>=
    if (hashTable[slot] == fastDictionaryVoidLink) {
	hashTable[slot] = words.size();
    } else {
	u_int32_t p = hashTable[slot];
	u_int32_t l;
	while (true) {
	    memcpy(&l, &(words[p]), sizeof l);
	    if (l == fastDictionaryVoidLink) {
		break;
	    }
	    p = l;
	}
	l = words.size();
	memcpy(&(words[p]), &l, sizeof l);
    }
    
@
Add a new word to the |words| vector.  As this is a new word, we know
that its forward link is |fastDictionaryVoidLink|.  The balance of the fields are
transcribed from the |dictionaryWord| we're adding.
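
The record layout (link, probability, length, text) can be sketched as a
single hypothetical function, |appendRecord|, which packs one record on
to the end of a byte vector.  The name |voidLink| here is a stand-in for
|fastDictionaryVoidLink|, and the sketch writes through a raw pointer
into the vector rather than through |Vmemcpy|.

```cpp
#include <cstring>
#include <string>
#include <vector>
#include <stdint.h>

const uint32_t voidLink = static_cast<uint32_t>(-1);

// Append one packed record (link, probability, length, text) to the
// table and return the offset at which the record starts.
uint32_t appendRecord(std::vector<unsigned char> &words, double prob,
                      const std::string &text) {
    const uint32_t start = static_cast<uint32_t>(words.size());
    const uint16_t len = static_cast<uint16_t>(text.length());

    words.resize(start + sizeof voidLink + sizeof prob + sizeof len + len);
    unsigned char *p = &words[start];
    std::memcpy(p, &voidLink, sizeof voidLink);     // Forward link: end of chain
    p += sizeof voidLink;
    std::memcpy(p, &prob, sizeof prob);             // Junk probability
    p += sizeof prob;
    std::memcpy(p, &len, sizeof len);               // Length of text in bytes
    p += sizeof len;
    std::memcpy(p, text.data(), len);               // Text, with no terminator
    return start;
}
```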

@<Add new word to word table@>=
    vector<unsigned char>::size_type wl = words.size();
    words.resize(words.size() + sizeof(u_int32_t) +
	sizeof(double) + sizeof(u_int16_t) + w->second.get().length());
    u_int32_t vl = fastDictionaryVoidLink;
#ifdef OLDWAY
    memcpy(words.begin() + wl, &vl, sizeof vl);
#else
    Vmemcpy(words, wl, &vl, sizeof vl);
#endif
    wl += sizeof vl;
    double jp = w->second.getJunkProbability();
#ifdef OLDWAY
    memcpy(words.begin() + wl, &jp, sizeof jp);
#else
    Vmemcpy(words, wl, &jp, sizeof jp);
#endif
    wl += sizeof jp;
    u_int16_t wlen = w->second.get().length();
#ifdef OLDWAY
    memcpy(words.begin() + wl, &wlen, sizeof wlen);
#else
    Vmemcpy(words, wl, &wlen, sizeof wlen);
#endif
    wl += sizeof wlen;
#ifdef OLDWAY
    memcpy(words.begin() + wl, w->second.get().data(), wlen);
#else
    Vmemcpy(words, wl, w->second.get().data(), wlen);
#endif

@
This is just about\footnote{$^3$}{Why {\it just about}?  Well, we could have tested
all the {\it even} numbers and divisors, couldn't we?} the dumbest way to generate
prime numbers one can imagine.  We simply start with the next odd number greater
than the argument and try dividing it by all the odd numbers from 3 through the
square root of the candidate.  If none divides it evenly, it's prime.  Otherwise,
bump the candidate by two and try again.  In defence of this ``method'', allow
me to observe that it is called only when creating a
|fastDictionary| file (to determine the size of the hash table) and then
only once.
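
Trial division by odd numbers can be sketched as below.  Note that this
is an illustration under its own assumptions, not a transcription of the
method which follows: the hypothetical functions |isOddPrime| and
|nextOddPrimeAfter| bound the divisors by the square root of each
candidate in turn, and will return the argument plus one when that
happens to be prime.  Overflow near the top of the |unsigned int| range
is ignored.

```cpp
// Test an odd candidate by dividing by the odd numbers from 3
// through its square root; n <= c / n avoids computing the root.
bool isOddPrime(unsigned int c) {
    if ((c < 3) || ((c & 1) == 0)) {
        return false;
    }
    for (unsigned int n = 3; n <= c / n; n += 2) {
        if ((c % n) == 0) {
            return false;
        }
    }
    return true;
}

// Return the smallest odd prime strictly greater than a.
unsigned int nextOddPrimeAfter(unsigned int a) {
    unsigned int c = a + 1;
    if ((c & 1) == 0) {
        c++;
    }
    while (!isOddPrime(c)) {
        c += 2;
    }
    return c;
}
```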

@<Class implementations@>=
    unsigned int fastDictionary::nextGreaterPrime(unsigned int a) {
    	unsigned int sqlim = static_cast<unsigned int>(sqrt(static_cast<double>(a)) + 1);
	
	if ((a & 1) == 0) {
	    a++;
	}
	
	while (true) {
	    unsigned int remainder = 1;     // Nonzero, so a candidate with no trial divisors is accepted
	    
	    a += 2;
	    for (unsigned int n = 3; n <= sqlim; n += 2) {
		if ((remainder = (a % n)) == 0) {
		    break;
    	    	}
	    }
	    if (remainder != 0) {
	    	break;
	    }
	}
	return a;
    }

@
Compute a 32 bit unsigned hash value from a string.  This value is
used to determine the hash table slot into which a word is
placed.  It's simple, but it gets you there---tests with a
typical dictionary yield 62\% occupancy for a hash table whose size is
the next prime greater than the number of words in the dictionary.
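
A stand-alone sketch of the multiply-and-exclusive-or hash and the
bucket computation follows.  The names |hashWord| and |bucketFor| are
hypothetical, and the sketch treats each character as an unsigned byte,
which is an assumption of the example and may give different values for
characters above 127 on platforms where |char| is signed.

```cpp
#include <string>
#include <stdint.h>

// Multiply the running hash by 17 and fold in each character in turn.
// Arithmetic on uint32_t wraps modulo 2^32.
uint32_t hashWord(const std::string &s) {
    uint32_t hash = 1;
    for (std::string::size_type i = 0; i < s.length(); i++) {
        hash = (hash * 17) ^ static_cast<unsigned char>(s[i]);
    }
    return hash;
}

// Reduce the hash to a slot in a table of nBuckets entries (nBuckets > 0).
uint32_t bucketFor(const std::string &s, uint32_t nBuckets) {
    return hashWord(s) % nBuckets;
}
```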
    
@<Class implementations@>=
    u_int32_t fastDictionary::computeHashValue(const string &s) {
    	u_int32_t hash = 1;

	for (unsigned int i = 0; i < s.length(); i++) {
	    hash = (hash * 17) ^ s[i];
	}
	return hash;
    }
    
@** MIME decoders.

MIME decoders process parts of multi-part messages in various
MIME encodings such as \.{base64} and \.{Quoted-Printable}.  They
read encoded lines from an |istream| and return decoded
binary values with the |getDecodedChar| method.  The decoder terminates
when the current MIME |partBoundary| is encountered.

|MIMEdecoder| is the parent class of all specific decoders.
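
The shape of this hierarchy can be sketched in miniature.  Everything in
the example is hypothetical: |byteDecoder| stands in for the parent
class, and |hexEscapeDecoder| decodes only ``=XX'' hexadecimal escapes
from a string, omitting the part boundaries, soft line breaks and error
counting a real decoder needs.

```cpp
#include <string>

// Hypothetical, much reduced analogue of the decoder parent class.
class byteDecoder {
public:
    virtual ~byteDecoder() {}
    virtual int getDecodedChar() = 0;   // Next decoded byte, negative at the end
};

// Decodes "=XX" escapes (upper case hexadecimal); all else passes through.
class hexEscapeDecoder : public byteDecoder {
    std::string in;
    std::string::size_type ip;

    static int hexValue(char c) {
        if ((c >= '0') && (c <= '9')) {
            return c - '0';
        }
        if ((c >= 'A') && (c <= 'F')) {
            return (c - 'A') + 10;
        }
        return -1;
    }
public:
    explicit hexEscapeDecoder(const std::string &s) : in(s), ip(0) {}

    int getDecodedChar() {
        if (ip >= in.length()) {
            return -1;
        }
        if ((in[ip] == '=') && ((ip + 2) < in.length())) {
            int hi = hexValue(in[ip + 1]), lo = hexValue(in[ip + 2]);
            if ((hi >= 0) && (lo >= 0)) {
                ip += 3;
                return (hi * 16) + lo;
            }
        }
        return static_cast<unsigned char>(in[ip++]);
    }
};

// Works with any decoder through the parent class interface.
std::string decodeAll(byteDecoder &d) {
    std::string out;
    int c;
    while ((c = d.getDecodedChar()) >= 0) {
        out += static_cast<char>(c);
    }
    return out;
}
```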

@<Class definitions@>=
class mailFolder;

class MIMEdecoder {
public:@/
    istream *is;    	    	    // Stream from which encoded lines are read
    string partBoundary;    	    // Part boundary sentinel
    bool atEnd;     	    	    // At end of part or stream ?
    bool eofHit;    	    	    // Was decoder terminated by end of file ?
    unsigned int nDecodeErrors;     // Number of decoding errors
protected:@/
    string inputLine;	    	    // Current encoded input line
    string::size_type ip;   	    // Input line pointer
    unsigned encodedLineCount;	    // Number of encoded lines read
    bool lookAhead; 	    	    // Have we looked ahead ?
    int lookChar;   	    	    // Look-ahead character
    string endBoundary;     	    // Terminating part boundary
    list <string> *tlist;   	    // Transcript list
    mailFolder *mf; 	    	    // Parent mail folder
    
public:@/    
    MIMEdecoder(istream *i = NULL, mailFolder *m = NULL, string pb = "", list <string> *tl = NULL) {
    	set(i, m, pb, tl);
	resetDecodeErrors();
    }
    
    virtual ~MIMEdecoder() {
    };
    
    void set(istream *i = NULL, mailFolder *m = NULL,
    	     string pb = "", list <string> *tl = NULL) {
    	is = i;
	mf = m;
	partBoundary = pb;
	inputLine = "";
	ip = 0;
	encodedLineCount = 0;
	lookAhead = false;
	atEnd = false;
	eofHit = false;
	tlist = tl;
    }
    
    virtual string name(void) const = 0;
    
    virtual void resetDecodeErrors(void) {
    	nDecodeErrors = 0;
    }
    
    virtual unsigned int getDecodeErrors(void) const {
    	return nDecodeErrors;
    }
    
    virtual string getTerminatorSentinel(void) const {
    	return endBoundary;
    }
    
    virtual bool isEndOfFile(void) const {
    	return eofHit;
    }
    
    virtual unsigned int getEncodedLineCount(void) const {
    	return encodedLineCount;
    }
    
    virtual int getDecodedChar(void) = 0;   // Return next decoded character, $<0$ if EOF
    
    virtual bool getDecodedLine(string &s); // Return next decoded line, return |false| for EOF
    
    virtual void saveDecodedStream(ostream &os);  // Write decoded text to an |ostream|
    virtual void saveDecodedStream(const string fname); // Write decoded text to file |fname|
    
protected:@/
    virtual bool getNextEncodedLine(void);
};

@
The |getNextEncodedLine| method is called by specific decoders
to obtain the next line (all encodings are line-oriented, being
intended for inclusion in mail messages).  The line is
stored into |inputLine| and tested against the MIME
part boundary sentinel.  A logical end of file is reported
when the part boundary is encountered.  The method is
declared |virtual| so derived decoders may override it
if different behaviour is required.

One subtlety is that decoders may also be activated to decode
the main body of a message.  In this case, the |partBoundary|
is set to the null string and body content is decoded until the
start of the next message is encountered.

@<Class implementations@>=
    bool MIMEdecoder::getNextEncodedLine(void) {
    	if (!atEnd) {
	    if (getline(*is, inputLine)) {
	    	if (inputLine.substr(0, (sizeof messageSentinel) - 1) == messageSentinel) {
    	    	    endBoundary = inputLine;
		    if (partBoundary != "") {
    	    	    	assert(mf != NULL);
			mf->reportParserDiagnostic("Unterminated MIME sentinel at end of message.");
			mf->setNewMessageEligiblity();
    	    	    }
		    atEnd = true;
		}
	    	if ((partBoundary != "") && (inputLine.substr(0, 2) == "--") &&
		    (inputLine.substr(2, partBoundary.length()) == partBoundary)) {
    	    	    if (Annotate('d')) {
    	    		ostringstream os;

    	    		os << "Part boundary encountered: " << inputLine;
			mf->reportParserDiagnostic(os);
		    }
    	    	    endBoundary = inputLine;
		    atEnd = true;
		} else {
	    	    if (tlist != NULL) {
			tlist->push_back(inputLine);
		    }
		    ip = 0;
		    encodedLineCount++;
		}
	    } else {
	    	atEnd = true;
		eofHit = true;
	    }
	}
	if (atEnd) {
	    inputLine = "";
	    ip = 0;
	}
	return !atEnd;
    }
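
@
The part boundary test at the core of |getNextEncodedLine| might be
isolated as follows.  This is only a sketch: |isPartBoundary| is a
hypothetical helper, not a function of the program, and it leaves out
the message sentinel check, diagnostics, and transcript handling.

```cpp
#include <cassert>
#include <string>

// True if line starts with "--" followed by the boundary string.
// An empty boundary never matches, mirroring the case in which the
// decoder is processing a message body rather than a MIME part.
bool isPartBoundary(const std::string &line, const std::string &boundary) {
    if (boundary.empty()) {
        return false;
    }
    return line.compare(0, 2, "--") == 0 &&
           line.compare(2, boundary.length(), boundary) == 0;
}
```

Because only a prefix is compared, the closing delimiter of a
multi-part message (the boundary followed by two more hyphens) is
recognised as well.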
    
@
We provide a default implementation of |getDecodedLine|
for derived classes.  This forms lines from calls on
|getDecodedChar|, accepting (and discarding) end of
line sequences.

@<Class implementations@>=
    bool MIMEdecoder::getDecodedLine(string &s) {
    	int ch;
	
    	s = "";
	while (true) {
	    if (lookAhead) {
	    	ch = lookChar;
		lookAhead = false;
	    } else {
	    	ch = getDecodedChar();
	    }
	    if (ch < 0) {
	    	break;
	    }
	    @<Check for and process end of line sequence@>;
	    s += ch;
	}
	return s.length() > 0;
    }

@
In order to support all plausible end of line sequences, we
need to look ahead one character at end of line; if the caller
intends to intermix calls on |getDecodedLine| and
|getDecodedChar| (a pretty dopey thing to do, it must be said),
the |getDecodedChar| implementation in the derived class must
be aware that look ahead may have happened and properly
interact with the |lookAhead| flag.

@<Check for and process end of line sequence@>=
    if (ch == '\r' || ch == '\n') {
	int cht = getDecodedChar();

	if (!(((ch == '\r') && (cht == '\n')) ||
	     ((ch == '\n') && (cht == '\r')))) {
	    lookAhead = true;
	    lookChar = cht;
	}
	return true;
    }
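
@
The one character look ahead can be seen in isolation in the
following sketch, which splits a string already held in memory
rather than pulling characters from a decoder.  The function
|splitLines| is hypothetical and serves only to illustrate the
treatment of the four end of line sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split text into lines, accepting CR, LF, CR LF, and LF CR as
// end of line sequences.  One character of look-ahead decides whether
// the character after a CR or LF belongs to the same sequence.
std::vector<std::string> splitLines(const std::string &text) {
    std::vector<std::string> lines;
    std::string current;
    std::size_t i = 0;

    while (i < text.length()) {
        char ch = text[i++];
        if (ch == '\r' || ch == '\n') {
            // Consume the partner of a two character sequence, if present
            if (i < text.length()) {
                char next = text[i];
                if ((ch == '\r' && next == '\n') || (ch == '\n' && next == '\r')) {
                    i++;
                }
            }
            lines.push_back(current);
            current.clear();
        } else {
            current += ch;
        }
    }
    if (!current.empty()) {
        lines.push_back(current);       // Final line without terminator
    }
    return lines;
}
```

Two identical terminators in a row are not a pair, so they delimit
an empty line between them.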

@
We may want to export a decoded part to a file or, perhaps, save
it as a string stream for further examination.  This method
writes decoded bytes to its |ostream| argument.

@<Class implementations@>=
    void MIMEdecoder::saveDecodedStream(ostream &os) {
	int ch;
	
	while ((ch = getDecodedChar()) >= 0) {
	    os.put(ch);
	}
    }

@
We also provide a flavour of |saveDecodedStream| which
exports the decoded stream to a named file.

@<Class implementations@>=
    void MIMEdecoder::saveDecodedStream(const string fname) {
    	ofstream of(fname.c_str());
	
	if (!of) {
	    if (verbose) {
	    	cerr << "Cannot create MIMEdecoder dump file: " << fname << endl;
	    }
	} else {
	    saveDecodedStream(of);
	    of.close();
	}
    }

@*1 Identity MIME decoder.

The |identityMIMEdecoder| is a trivial MIME decoder which simply
passes through text in the part without transformation.  It is
provided as a test case and template for genuinely useful
decoders.  It may also come in handy should the need arise for
the interposition of an obligatory decoder even for MIME parts
which can be read directly as text.

@<Class definitions@>=
class identityMIMEdecoder : public MIMEdecoder {
public:@/
    string name(void) const {
    	return "Identity";
    }

    int getDecodedChar(void) {
    	while (!atEnd) {
    	    if (ip < inputLine.length()) {
		return inputLine[ip++] & 0xFF;
	    }
	    if (getNextEncodedLine()) {
		continue;
	    }
	}
	return -1;
    }
    
    bool getDecodedLine(string &s) {
    	if (ip < inputLine.length()) {
	    s = inputLine.substr(ip);
	    ip = inputLine.length();
	    return true;
	}
	if (getNextEncodedLine()) {
	    s = inputLine;
	    ip = inputLine.length();
	    return true;
    	}
	return false;
    }
};

@*1 Sink MIME decoder.

The |sinkMIMEdecoder| discards all remaining lines of the MIME part
the first time |getDecodedChar| or |getDecodedLine| is
called.  It is used for skipping parts in which we aren't
interested.

@<Class definitions@>=
class sinkMIMEdecoder : public MIMEdecoder {
public:@/
    string name(void) const {
    	return "Sink";
    }
    
    int getDecodedChar(void) {
    	if (!atEnd) {
	    while (getNextEncodedLine()) ;
	    assert(atEnd);
	}
	return -1;
    }
};

@*1 Base64 MIME decoder.

The |base64MIMEdecoder| decodes an input stream encoded as
MIME \.{base64} per RFC~1341.  This is based on my
stand-alone 
\pdfURL{\.{base64} decoder}{http://www.fourmilab.ch/webtools/base64/}.

@<Class definitions@>=
class base64MIMEdecoder : public MIMEdecoder {
private:@/
    unsigned char dtable[256];	    	// Decoding table
    void initialiseDecodingTable(void);	// Initialise decoding table
    deque<unsigned char> decodedBytes;	// Decoded bytes queue
    
public:@/
    base64MIMEdecoder() {
    	initialiseDecodingTable();
    }
    
    string name(void) const {
    	return "Base64";
    }
    
    int getDecodedChar(void);
    
    static string decodeEscapedText(const string s, mailFolder *m = NULL);

};

@
The |getDecodedChar| method returns decoded characters from the
|decodedBytes| queue, refilling it with triples of
bytes decoded from the input stream as required.  When
the end of the stream is encountered, $-1$ is returned.

@<Class implementations@>=
    int base64MIMEdecoder::getDecodedChar(void) {
    	@<Check for look ahead character@>;
    	if (decodedBytes.size() == 0) {
	    @<Refill decoded bytes queue from input stream@>;
	}
    	if (decodedBytes.size() > 0) {
	    unsigned char v = decodedBytes[0];
	    
	    decodedBytes.pop_front();
	    return v;
	}
	return -1;
    }

@
This is the heart of the \.{base64} decoder.  It reads the next
four significant (non-white space) characters from the input
stream, extracts the 6 bits encoded by each, and assembles
the bits into three 8 bit bytes which are added to the
|decodedBytes| queue.  Although the current decoder always
immediately empties the queue, in principle any sequence of
the encoded content up to its entire length may be decoded
by repeated invocations of this code.

@<Refill decoded bytes queue from input stream@>=
    unsigned char a[4], b[4], o[3];
    int j, k;
    
    @<Decode next four characters from input stream@>;
    @<Assemble the decoded bits into bytes and place on decoded queue@>;

@
Read the next four non-blank bytes from the input stream,
checking for end of file, and place their decoded 6 bit
values into the array |b|.  We save the original encoded
characters in array |a| to permit testing them for the
special ``\.{=}'' sentinel which denotes short sequences
at the end of file.

@<Decode next four characters from input stream@>=
    for (int i = 0; i < 4; i++) {
    	int c;
	
	@<Get next significant character from input stream@>;
	@<Check for end of file in base64 stream@>;
        if (dtable[c] & 0x80) {
	    nDecodeErrors++;
	    ostringstream os;
    	    os << "Illegal character '" << static_cast<char>(c) << "' in Base64 input stream.";
            mf->reportParserDiagnostic(os.str());
	    
            /* Ignoring errors: discard invalid character. */
            i--;
            continue;
        }
        a[i] = (unsigned char) c;
        b[i] = dtable[c];
    }

@
Read the encoded input stream and return the next non-white
space character.  This code does not verify whether characters
it returns are valid within a \.{base64} stream---that's up
to the caller to determine once the character is returned.

@<Get next significant character from input stream@>=
    while (true) {
   	c = -1;
	while (ip < inputLine.length()) {
    	    if (inputLine[ip] > ' ') {
		c = inputLine[ip++];
		break;
	    }
	    ip++;
	}
	if (c >= 0) {
	    break;
	}
	if (!getNextEncodedLine()) {
	    break;
	}
    }

@
An end of file indication (due to encountering the MIME
part separator sentinel) is valid only after a whole number
of four character encoded sequences.  Validate this and report
any errors accordingly.  If an unexpected end of file is
encountered, any incomplete encoded sequence is discarded.

@<Check for end of file in base64 stream@>=
    if (c < 0) {
	if (i > 0) {
	    nDecodeErrors++;
	    mf->reportParserDiagnostic("Unexpected end of file in Base64 decoding.");
	}
	return -1;
    }

@
Once we've decoded four characters from the input stream, we
have four six-bit fields in the |b| array.  Now we extract,
shift, and $\lor$ these fields together to form three
8 bit bytes.  One subtlety arises at the end of file.
The last one or two characters of an encoded four character
field may be replaced by equal signs to indicate that the
final field encodes only one or two source bytes.  If this
is the case, the number of bytes placed onto the
|decodedBytes| queue is reduced to the correct value.

@<Assemble the decoded bits into bytes and place on decoded queue@>=
    o[0] = (b[0] << 2) | (b[1] >> 4);
    o[1] = (b[1] << 4) | (b[2] >> 2);
    o[2] = (b[2] << 6) | b[3];
    j = a[2] == '=' ? 1 : (a[3] == '=' ? 2 : 3);
    
    for (k = 0; k < j; k++) {
    	decodedBytes.push_back(o[k]);
    }
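
@
The bit assembly can be checked by hand with a small stand-alone
sketch.  Here |assembleBytes| is a hypothetical function which takes
the four 6 bit values directly, together with the number of bytes to
keep, rather than consulting the arrays |a| and |b| used above.

```cpp
#include <cassert>
#include <string>

// Pack four 6 bit values into three bytes and keep the first nBytes
// of them (3 normally, 2 or 1 when the group ended with padding).
std::string assembleBytes(const unsigned char b[4], int nBytes) {
    assert(nBytes >= 1 && nBytes <= 3);
    unsigned char o[3];

    o[0] = static_cast<unsigned char>((b[0] << 2) | (b[1] >> 4));
    o[1] = static_cast<unsigned char>((b[1] << 4) | (b[2] >> 2));
    o[2] = static_cast<unsigned char>((b[2] << 6) | b[3]);
    return std::string(reinterpret_cast<const char *>(o), nBytes);
}
```

For instance, the encoded group \.{TWFu} carries the values 19, 22,
5, and 46, which assemble into the three bytes of \.{Man}.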

@
Since we rely on the parent class default implementation of
|getDecodedLine|, if we wish to permit intermixed calls
on |getDecodedLine| and |getDecodedChar| we must cope
with the fact that the last |getDecodedLine| call may
have peeked ahead one character.  If so, clear the look
ahead flag and return the look ahead character.
    
@<Check for look ahead character@>=
    if (lookAhead) {
	lookAhead = false;
	return lookChar;
    }

@
The |initialiseDecodingTable| method fills the decoding table,
which maps each character of the \.{base64} alphabet to the 6 bit
value it encodes; every other entry is flagged invalid with the
|0x80| bit.  The curious and disparate sequences used to fill this
table permit this code to work both on ASCII and EBCDIC systems.

In EBCDIC systems character codes for letters are not
consecutive; the initialisation must be split into the runs
of letters which are consecutive in EBCDIC:

\centerline{\.{A}--\.{I} \.{J}--\.{R} \.{S}--\.{Z} \.{a}--\.{i} \.{j}--\.{r} \.{s}--\.{z}}

@<Class implementations@>=
void base64MIMEdecoder::initialiseDecodingTable(void)
{
    int i;

    for (i = 0; i < 256; i++) {
        dtable[i] = 0x80;
    }
    for (i = 'A'; i <= 'I'; i++) {
        dtable[i] = 0 + (i - 'A');
    }
    for (i = 'J'; i <= 'R'; i++) {
        dtable[i] = 9 + (i - 'J');
    }                             
    for (i = 'S'; i <= 'Z'; i++) {
        dtable[i] = 18 + (i - 'S');
    }                             
    for (i = 'a'; i <= 'i'; i++) {
        dtable[i] = 26 + (i - 'a');
    }
    for (i = 'j'; i <= 'r'; i++) {
        dtable[i] = 35 + (i - 'j');
    }
    for (i = 's'; i <= 'z'; i++) {
        dtable[i] = 44 + (i - 's');
    }
    for (i = '0'; i <= '9'; i++) {
        dtable[i] = 52 + (i - '0');
    }
#define CI(x)	static_cast<int>(x)
    dtable[CI('+')] = 62;
    dtable[CI('/')] = 63;
    dtable[CI('=')] = 0;
#undef CI
}
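
@
An alternative way to obtain the same table, shown here purely as a
sketch, is to walk the \.{base64} alphabet written out as a string
literal; this is likewise independent of the host character set.
It is not the method used by the program, and |makeDecodingTable|
is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Build a 256 entry decoding table: each valid base64 character maps
// to its 6 bit value, every other entry holds the 0x80 "invalid" flag.
std::array<unsigned char, 256> makeDecodingTable() {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/";
    std::array<unsigned char, 256> table;

    table.fill(0x80);
    for (std::size_t v = 0; v < alphabet.length(); v++) {
        table[static_cast<unsigned char>(alphabet[v])] =
            static_cast<unsigned char>(v);
    }
    table[static_cast<unsigned char>('=')] = 0;   // Padding decodes as zero bits
    return table;
}
```

Every one of the 256 entries is initialised before the alphabet is
filled in, so any byte value may safely be used as an index.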
    
@
The |static| method |decodeEscapedText| decodes the \.{base64} text
in its |string| argument, returning a string containing the decoded
characters.  Note that, notwithstanding this being a
|static| method which can be invoked without reference to a
|base64MIMEdecoder| object, we in fact instantiate
such an object within the method, supplying its input from
an |istringstream| constructed from the argument |string|.

@<Class implementations@>=
    string base64MIMEdecoder::decodeEscapedText(const string s, mailFolder *m) {
    	string r = "";
	base64MIMEdecoder dc;
	istringstream iss(s);
	int dchar;
	
	dc.set(&iss, m, "");
	
	while ((dchar = dc.getDecodedChar()) >= 0) {
	    r += static_cast<char>(dchar);
	}
	
	return r;
    }

@*1 Quoted-Printable MIME decoder.

The |quotedPrintableMIMEdecoder| decodes an input stream encoded as
MIME ``Quoted-Printable'' per RFC~1521.  This is based on my
stand-alone 
\pdfURL{Quoted-Printable decoder}{http://www.fourmilab.ch/webtools/qprint/}.

@<Class definitions@>=
class quotedPrintableMIMEdecoder : public MIMEdecoder {
public:@/
    quotedPrintableMIMEdecoder() {
    	atEndOfLine = false;
    }
    
    string name(void) const {
    	return "Quoted-Printable";
    }
    
    int getDecodedChar(void);
    
    static string decodeEscapedText(const string s, mailFolder *m = NULL);
    
protected:@/
    bool atEndOfLine;
    int getNextChar(void);
    static int hex_to_nybble(const int ch);
};

@
Get the next decoded character from the stream, expanding
``\.{=}'' escape sequences.

@<Class implementations@>=
    int quotedPrintableMIMEdecoder::getDecodedChar(void) {
    	int ch;
	
	@<Check for look ahead character@>;
	
	while (true) {
	    ch = getNextChar();
	    if (ch == '=') {
		@<Decode equal sign escape@>;
	    } else {
		return ch;
	    }
	}
    }

@
When we encounter an equal sign in the input stream there are
two possibilities: it may introduce two characters of ASCII
representing an 8-bit octet in hexadecimal or, if followed by
an end of line sequence, it's a ``soft end-of-line'' introduced
to avoid emitting a line longer than the maximum number of characters
prescribed by the RFC.

@<Decode equal sign escape@>=
    int ch1 = getNextChar();
    @<Ignore white space after soft line break@>;
    if (ch1 == '\n') {
        continue;
    } else {
        int n1 = hex_to_nybble(ch1);
        int ch2 = getNextChar();
        int n2 = hex_to_nybble(ch2);
        if (n1 == -1 || n2 == -1) {
	    ostringstream os;

	    os << "Invalid escape sequence '=" <<
		    	static_cast<char>(ch1) << static_cast<char>(ch2) <<
			"' in Quoted-Printable MIME part.";
	    mf->reportParserDiagnostic(os.str());
	    nDecodeErrors++;
        }
        ch = (n1 << 4) | n2;
    }
    return ch;
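
@
The two roles of the equal sign can be illustrated on a single line
held in a string.  The sketch below makes simplifying assumptions:
the line arrives with its end of line sequence already removed,
white space after a soft line break is not handled, and an invalid
escape is copied through unchanged instead of being reported.  The
names |hexValue| and |decodeLine| are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of a hexadecimal digit, or -1 if ch is not one
int hexValue(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return 10 + (ch - 'A');
    if (ch >= 'a' && ch <= 'f') return 10 + (ch - 'a');
    return -1;
}

// Decode one Quoted-Printable line (without its end of line sequence).
// softBreak is set if the line ends with a soft line break, in which
// case the caller should not emit a new-line after it.
std::string decodeLine(const std::string &line, bool &softBreak) {
    std::string out;
    softBreak = false;

    for (std::size_t p = 0; p < line.length(); p++) {
        if (line[p] != '=') {
            out += line[p];
        } else if (p + 1 == line.length()) {
            softBreak = true;               // "=" at end of line
        } else if (p + 2 < line.length() &&
                   hexValue(line[p + 1]) >= 0 && hexValue(line[p + 2]) >= 0) {
            out += static_cast<char>((hexValue(line[p + 1]) << 4) |
                                     hexValue(line[p + 2]));
            p += 2;                         // Skip the two hexadecimal digits
        } else {
            out += line[p];                 // Invalid escape: copy unchanged
        }
    }
    return out;
}
```

A caller would join the results of successive lines, adding a
new-line after each line for which |softBreak| is |false|.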

@
Return the next character from the encoded input stream.  Since
end of line sequences have been stripped, we append our own
new-line character to the end of each line.  This indicates
that in the absence of a soft line break (trailing equal sign),
we should emit a line break to the output stream.

@<Class implementations@>=
    int quotedPrintableMIMEdecoder::getNextChar(void) {
    	while (true) {
	    if (atEndOfLine) {
		atEndOfLine = false;
	    	return '\n';
	    }
    	    if (ip < inputLine.length()) {
	    	if (ip == (inputLine.length() - 1)) {
		    atEndOfLine = true;
		}
		return inputLine[ip++];
	    }
	    if (!getNextEncodedLine()) {
	    	break;
	    }
	    if (inputLine.length() == 0) {
	    	atEndOfLine = true;
	    }
	}
	return -1;
    }

@
There are lots of ways of defining ``ASCII white space,''
but RFC~1521 explicitly states that only ASCII space
and horizontal tab characters are deemed white space
for the purposes of Quoted-Printable encoding.  However,
we must also cope with POP3 messages where the lines are
terminated with CR/LF, so we extend the definition to allow
a carriage return before the line feed.  This is easily
accomplished by broadening the definition of white space
to include carriage return.

@<Character is white space@>=
    ((ch1 == ' ') || (ch1 == '\t') || (ch1 == '\r'))

@
Some systems pad text lines with white space (ASCII blank
or horizontal tab characters).  This may result in a line
encoded with a ``soft line break'' at the end appearing, when
decoded, with white space between the supposedly-trailing
equal sign and the end of line sequence.  If white space
follows an equal sign escape, we ignore it up to the
beginning of an end of line sequence.  Non-white space
appearing before we sense the end of line is an error;
these erroneous characters are ignored.

@<Ignore white space after soft line break@>=
    while (@<Character is white space@>) {
        ch1 = getNextChar();
        if ((ch1 == '\n') || (ch1 < 0)) {
            break;  	    	    // End of line or end of input: stop skipping
        }
        if (!@<Character is white space@>) {
	    nDecodeErrors++;
	    ostringstream os;

	    os << "Invalid character '" << static_cast<char>(ch1) <<
		    	"' after soft line break in Quoted-Printable MIME part.";
	    mf->reportParserDiagnostic(os.str());
            ch1 = ' ';	    	    // Fake a space and soldier on
        }
    }    
    
@
The |hex_to_nybble| method converts a hexadecimal digit
in the sequence ``\.{0123456789ABCDEF}'' or the equivalent
with lower case letters to its binary value.  If an invalid
hexadecimal digit is supplied, $-1$ is returned.

@<Class implementations@>=
    int quotedPrintableMIMEdecoder::hex_to_nybble(const int ch) {
	if ((ch >= '0') && (ch <= ('0' + 9))) {
            return ch - '0';
	} else if ((ch >= 'A') && (ch <= ('A' + 5))) {
            return 10 + (ch - 'A');
	} else if ((ch >= 'a') && (ch <= ('a' + 5))) {
            return 10 + (ch - 'a');
	}
	return -1;
    }
    
@
The |static| method |decodeEscapedText| decodes text in its
|string| argument, returning a string with escape sequences
replaced by the characters they encode.

@<Class implementations@>=
    string quotedPrintableMIMEdecoder::decodeEscapedText(const string s, mailFolder *m) {
    	string r = "";
	string::size_type p;
	
	for (p = 0; p < s.length(); p++) {
	    bool decoded = false;
	    
	    if (s[p] == '=') {
	    	if ((p + 2) >= s.length()) {
		    if (verbose) {
		    	cerr << "decodeEscapedText: escape too near end of string: " << s << endl;
    	    	    }
		} else {
		    int n1 = hex_to_nybble(s[p + 1]),
		    	n2 = hex_to_nybble(s[p + 2]);
		    if ((n1 < 0) || (n2 < 0)) {
		    	if (verbose) {
			    cerr << "decodeEscapedText: invalid escape sequence \"" <<
			    	    s.substr(p, 3) << "\"" << endl;
			}
		    } else {
		    	r += static_cast<char>((n1 << 4) | n2);
			decoded = true;
			p += 2;
		    }
		}
	    }
	    if (!decoded) {
	    	r += s[p];
	    }
	}
	return r;
    }
        
@** Multiple byte character set decoders and interpreters.

To support languages with character sets too large to be encoded in
a single byte, a bewildering variety of {\it multiple byte character sets}
are employed.  In a rational world, there would be a single, universal, and
uniform encoding of every glyph used in human written communication, and a
unique way of representing this in byte-oriented messages.

Rather amazingly, there {\it is} such a representation:
ISO/IEC~10646 and its UTF-8 encoding.  Not surprisingly, hardly
anybody uses it---it's an international standard, after all. 
So, we must cope with a plethora of character sets and byte
encodings, and that's the lot in life of the |MBCSdecoder| and
|MBCSinterpreter|.  These abstract classes are the parents of
specific decoders for various encodings and interpreters for
the motley crowd of character sets.

First, let's define our terms.  A {\it decoder} is charged with
chewing through a byte stream and identifying the logical characters
within it, in all their various lengths.  Decoders must cope with
encodings such as EUC, shift-JIS, and UTF-8.  An {\it interpreter}'s
responsibility is expressing the character codes delivered by the
decoder in a form comprehensible to those not endowed with the
original language character set or knowledge of how to read it.
This usually means encoding ideographic languages where each
character more or less corresponds to a word as space-separated
tokens uniquely identifying the character code (by its hexadecimal
code, for example), and characters in word-oriented languages as
unique strings which meet the downstream rules for tokens.
For example, one might express a sequence of Chinese characters
in the ``\.{Big5}'' character set as:

\centerline{\.{big5-A2FE big5-E094 big5-F3CA}}

\noindent
or two words in a Cyrillic font as:

\centerline{\.{cyr-A0cyr-98cyr-81cyr-FE cyr-84cyr-D3cyr-EAcyr-A7}}

\noindent
(These examples were just made up off the cuff---if they represent
something heroically obscene in some representation of a
language, it's just my lucky day.)

Note that because of what we're doing here, we don't have to
remotely comprehend the character set or read the language
to be highly effective in accomplishing our mission.  Like
cryptographers who broke book codes without knowing the
language of the plaintext, we're concerned only with the
frequency with which various tokens, however defined, occur
in legitimate and junk mail.  As long as our representations
are unique and more or less correspond to tokens in the
underlying language, we don't need to understand what it
{\it means}.
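
The token forms shown above could be produced along the following
lines.  This is a hedged sketch rather than the interpreter code
itself: the functions |tokenFor| and |ideographicTokens| are
hypothetical, and the character set prefix is simply passed in as a
string.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Express one decoded character code as a token such as "big5-A2FE"
std::string tokenFor(const std::string &charset, int code) {
    assert(code >= 0);
    std::ostringstream os;
    os << charset << '-' << std::uppercase << std::hex << code;
    return os.str();
}

// Ideographic text: one space-separated token per character
std::string ideographicTokens(const std::string &charset,
                              const std::vector<int> &codes) {
    std::string out;
    for (std::size_t i = 0; i < codes.size(); i++) {
        if (i > 0) {
            out += ' ';
        }
        out += tokenFor(charset, codes[i]);
    }
    return out;
}
```

For a word-oriented language the same per-character tokens would be
concatenated without the intervening spaces, so that each word of the
original becomes one downstream token.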

@*1 Decoders.

@*2 Decoder parent class.

This is the abstract parent class of all specific decoders.
Albeit abstract in the details, we provide a variety of services
to derived classes.

@<Class definitions@>=
class MBCSdecoder {
protected:@/
    const string *src;
    string::size_type p;
    mailFolder *mf;
    
public:@/
    MBCSdecoder(mailFolder *m = NULL) : src(NULL), p(0), mf(m) {
    }
    
    virtual ~MBCSdecoder() {
    }

    virtual string name(void) = 0;  	// Name of decoder

    virtual void setSource(const string &s) {	// Set input source line
    	src = &s;
	p = 0;
    }
    
    virtual void setMailFolder(mailFolder *m = NULL) {
    	mf = m;
    }
    
    virtual void reset(void) {	    	// Reset stateful decoder to ground state
    }
    
    virtual int getNextDecodedChar(void) = 0;	// Get next decoded character
    
    virtual int getNextEncodedByte(void) {
    	if (p >= src->length()) {
	    return -1;
	}
	return ((*src)[p++]) & 0xFF;
    }
    
protected:@/
    virtual int getNextNBytes(const unsigned int n);
    
    virtual int getNext2Bytes(void) {
    	return getNextNBytes(2);
    }
    
    virtual int getNext3Bytes(void) {
    	return getNextNBytes(3);
    }
    
    virtual int getNext4Bytes(void) {
    	return getNextNBytes(4);
    }
    
    virtual void discardLine(void) {
    	p = src->length();
    }
    
    virtual void reportDecoderDiagnostic(const string s) const;
    virtual void reportDecoderDiagnostic(const ostringstream &os) const;
};

@
Return a character assembled by concatenating the next
|n| bytes in most significant byte to least significant
byte order.  If the end of input is encountered, $-1$ is
returned.  A multiple byte character equal to $-1$ triggers
an assertion failure in debug builds.

@<Class implementations@>=
    int MBCSdecoder::getNextNBytes(const unsigned int n) {
    	assert((n >= 1) && (n <= 4));
	int v = 0;
    	for (unsigned int i = 0; i < n; i++) {
	    int b = getNextEncodedByte();
	    if (b < 0) {
	    	return b;
	    }
	    v = (v << 8) | b;
	}
	assert(v != -1);
	return v;
    }
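
@
The most significant byte first assembly can be tried on a string
directly.  In this sketch |assembleBigEndian| is a hypothetical
free function; it returns |long long|, an assumption made here so
that a four byte value with its high bit set cannot be mistaken for
the $-1$ end of input indication.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assemble n bytes of src, starting at pos, most significant byte first.
// Advances pos past the bytes consumed; returns -1 if src runs out.
long long assembleBigEndian(const std::string &src, std::size_t &pos,
                            unsigned int n) {
    assert(n >= 1 && n <= 4);
    long long v = 0;

    for (unsigned int i = 0; i < n; i++) {
        if (pos >= src.length()) {
            return -1;
        }
        v = (v << 8) | static_cast<unsigned char>(src[pos++]);
    }
    return v;
}
```

Note that a partially read character still advances the position, so
a second request after a short read keeps reporting end of input.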

@
If the decoder encounters an error, we usually report it as a
parser diagnostic to the parent mail folder.  If there is no
such folder (since a decoder can be invoked stand-alone), we
report the diagnostic to standard error if the \.{--verbose}
option is specified.

@<Class implementations@>=
    void MBCSdecoder::reportDecoderDiagnostic(const string s) const {
    	if (mf != NULL) {
	    mf->reportParserDiagnostic(s);
    	} else {
	    if (verbose) {
	    	cerr << s << endl;
	    }
	}
    }
    
    void MBCSdecoder::reportDecoderDiagnostic(const ostringstream &os) const {
    	reportDecoderDiagnostic(os.str());
    }


@*2 EUC decoder.

This decoder extracts logical characters from byte streams encoded in
\.{EUC} encoding.  In \.{EUC}, if a byte in the input stream is in
the range |0xA1|--|0xFE| and the subsequent byte in the range
|0x80|--|0xFF|, then the variant fields encoded in the two
bytes define the character code.  A byte not within the range
of the first byte of a two byte character is interpreted as a single
byte character with ASCII/ISO-8859 semantics.

@<Class definitions@>=
class EUC_MBCSdecoder : public MBCSdecoder {
public:@/
    virtual string name(void) {
    	return "EUC";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded byte
};

@
Bytes are parsed from the input stream as follows.  Any bytes with values
within the range |0xA1|--|0xFE| denote the first byte of a two byte
character, whose second byte must be within the range |0x80|--|0xFF|.
Any violation of the constraints on the second byte indicates an invalid
sequence.  Characters outside the range of initial characters are
considered single byte codes.  We return $-1$ when the end of the encoded
line is encountered.

@<Class implementations@>=
    int EUC_MBCSdecoder::getNextDecodedChar(void) {
    	int c1 = getNextEncodedByte();
	
	if ((c1 >= 0xA1) && (c1 <= 0xFE)) {
	    int c2 = getNextEncodedByte();
	    
	    if ((c2 >= 0x80) && (c2 <= 0xFF)) {
	    	return (c1 << 8) | c2;
	    }
	    if (c2 == -1) {
		ostringstream os;

	    	os << name() << "_MBCSdecoder: Premature end of line in two byte character.";
		reportDecoderDiagnostic(os);
		return -1;
	    }
	    
	    /* Odds are that once we've encountered an invalid second byte,
	       the balance of the encoded line will be screwed up as well.
	       To avoid such blithering, discard the line after such an
	       error. */
	       
	    discardLine();
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Invalid second byte in two byte character: "
		"0x" << setiosflags(ios::uppercase) << hex << c1 << " " << "0x" << c2 << ".";
	    reportDecoderDiagnostic(os);
	    return c1;
	}
	return c1;
    }
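
@
The parsing rule can be followed on a whole line at once.  The sketch
below is a simplification under stated assumptions: |decodeEUC| is a
hypothetical function, it reports no diagnostics, and it simply stops
at the first truncated or malformed two byte character, whereas the
method above returns the offending first byte and discards the rest
of the line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a line of EUC text into character codes.  A first byte in
// 0xA1-0xFE followed by a second byte in 0x80-0xFF forms a two byte
// character; anything else is a single byte character.
std::vector<int> decodeEUC(const std::string &line) {
    std::vector<int> codes;
    std::size_t p = 0;

    while (p < line.length()) {
        int c1 = static_cast<unsigned char>(line[p++]);
        if (c1 >= 0xA1 && c1 <= 0xFE) {
            if (p >= line.length()) {
                break;                      // Premature end of line
            }
            int c2 = static_cast<unsigned char>(line[p++]);
            if (c2 < 0x80) {
                break;                      // Invalid second byte
            }
            codes.push_back((c1 << 8) | c2);
        } else {
            codes.push_back(c1);
        }
    }
    return codes;
}
```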

@*2 Big5 decoder.

This decoder extracts logical characters from byte streams encoded in
\.{Big5} encoding.  In \.{Big5}, bytes in the range |0x00|--|0x7F|
are single ASCII characters.  Bytes with the |0x80| bit set are
the first byte of a two byte character, the second byte of which
may have any value.

@<Class definitions@>=
class Big5_MBCSdecoder : public MBCSdecoder {
public:@/
    virtual string name(void) {
    	return "Big5";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded byte
};

@
Decode the next logical character. We return $-1$ when the end
of the encoded line is encountered.

@<Class implementations@>=
    int Big5_MBCSdecoder::getNextDecodedChar(void) {
    	int c1 = getNextEncodedByte();
	
	if ((c1 >= 0) && ((c1 & 0x80) != 0)) {
	    int c2 = getNextEncodedByte();
	    
	    if (c2 == -1) {
		ostringstream os;

		os << name() << "_MBCSdecoder: Premature end of line in two byte character.";
		reportDecoderDiagnostic(os);
		return -1;
	    }
    	    return (c1 << 8) | c2;
	}
	return c1;
    }
    
@*2 Shift-JIS decoder.

Shift-JIS is used to encode Japanese characters on MS-DOS, Windows,
and the Macintosh (which adds four additional one-byte characters
which we support here).  The encoding uses code points |0x21|--|0x7E|
for ASCII/JIS-Roman single byte characters, code points
|0xA1|--|0xDF| for single byte half width katakana,
plus two-byte characters
introduced by first bytes in the ranges |0x81|--|0x9F|,
|0xE0|--|0xEF|, and, for user-defined characters, |0xF0|--|0xFC|.
The second byte of a valid two-byte character will always be in
one of the ranges |0x40|--|0x7E| and |0x80|--|0xFC|.  

@<Class definitions@>=
class Shift_JIS_MBCSdecoder : public MBCSdecoder {
protected:@/
    string pending;
    
public:@/
    Shift_JIS_MBCSdecoder() : pending("") {
    }
    
    virtual ~Shift_JIS_MBCSdecoder() {
    }

    virtual string name(void) {
    	return "Shift_JIS";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded byte
};

@
Decode the next logical character. We return $-1$ when the end
of the encoded line is encountered.  An invalid second byte of a
two byte character terminates processing of the line, as it's likely
to be gibberish from then on.

@<Class implementations@>=
    int Shift_JIS_MBCSdecoder::getNextDecodedChar(void) {
    	@<Check for pending characters and return if so@>;
	
    	int c1 = getNextEncodedByte();
	
	if (c1 >= 0) {
	    @<Check for Shift-JIS two byte character and assemble as required@>;
	    @<Check for Macintosh-specific single byte characters and translate@>;
    	}
	return c1;
    }	
    
@
We test for the first byte we've read being in the range which
denotes a two byte character.  If so, read the second byte of
the character, validating that it is within the ranges permitted
for second bytes, and assemble the 16 bit character from the two
bytes.

@<Check for Shift-JIS two byte character and assemble as required@>=
    if (((c1 >= 0x81) && (c1 <= 0x9F)) ||
	((c1 >= 0xE0) && (c1 <= 0xEF)) ||
	((c1 >= 0xF0) && (c1 <= 0xFC))) {
	int c2 = getNextEncodedByte();

	if (c2 == -1) {
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Premature end of line in two byte character.";
	    reportDecoderDiagnostic(os);
	    return -1;
	}
	if (!(((c2 >= 0x40) && (c2 <= 0x7E)) ||
	      ((c2 >= 0x80) && (c2 <= 0xFC)))) {
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Invalid second byte in two byte character: "
		"0x" << setiosflags(ios::uppercase) << hex << c1 << " " << "0x" << c2 << ".";
	    reportDecoderDiagnostic(os);
	    return -1;
    	}
    	return (c1 << 8) | c2;
    }
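
@
The range tests can be stated compactly as predicates.  This is a
sketch with hypothetical names (|isLeadByte|, |isTrailByte|,
|assembleShiftJIS|); it folds the adjacent first byte ranges
|0xE0|--|0xEF| and |0xF0|--|0xFC| into one and omits diagnostics.

```cpp
#include <cassert>

// First byte of a Shift-JIS two byte character?
bool isLeadByte(int c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Valid second byte of a Shift-JIS two byte character?
bool isTrailByte(int c) {
    return (c >= 0x40 && c <= 0x7E) || (c >= 0x80 && c <= 0xFC);
}

// Combine the two bytes, or return -1 if they do not form a valid character
int assembleShiftJIS(int c1, int c2) {
    if (!isLeadByte(c1) || !isTrailByte(c2)) {
        return -1;
    }
    return (c1 << 8) | c2;
}
```

Bytes in the half width katakana range |0xA1|--|0xDF| fail the first
byte test and are therefore treated as single byte characters.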

@
To permit expansion of Macintosh-specific characters to multiple
character replacements, we have the ability to store the balance
of a multiple character sequence in the |pending| string.  If there
are any characters there, return them before obtaining another
character from the input stream.

@<Check for pending characters and return if so@>=
    if (!pending.empty()) {
    	int pc = pending[0];
	pending = pending.substr(1);
	return pc;
    }

@
The four additional characters added by the Macintosh are
|0x80| (backslash),
|0xFD| (copyright symbol),
|0xFE| (trademark symbol),
and |0xFF| (ellipsis).
We check for them and translate them into plausible
ISO 8859 replacements, expanding as necessary into
multiple character sequences via the |pending| string
mechanism.

@<Check for Macintosh-specific single byte characters and translate@>=
    switch (c1) {
    	case 0x80:@/
	    c1 = '\\';	    	// Macintosh backslash
	    break;
	    
	case 0xFD:@/
	    c1 = 0xA9;	    	// ISO 8859 \copyright\ symbol
	    break;
	    
	case 0xFE:
	    c1 = 'T';	    	// Trademark ($^{\rm TM}$) symbol
	    pending = "M";
	    break;
	    
	case 0xFF:  	    	// Ellipsis (``$\ldots$'')
	    c1 = '.';
	    pending = "..";
	    break;
    }
    
@*2 Unicode decoders.

The \pdfURL{Unicode}{http://www.unicode.org/} character set (itself
a subset of the 32 bit ISO~10646 character set) uses a variety of
encoding schemes.  The |Unicode_MBCSdecoder| is the parent class
for all specific Unicode decoders and provides common services
for them.

@<Class definitions@>=
class Unicode_MBCSdecoder : public MBCSdecoder {
public:@/
    virtual string name(void) {
    	return "Unicode";
    }
    
    virtual int getNextDecodedChar(void) = 0;	// Get next decoded character
};

@*3 UCS-2 Unicode decoder.

UCS-2 encoding of Unicode is simply a sequence of 16 bit quantities,
which may be stored in either little-endian or big-endian order.  The
order is usually identified by a Unicode Byte Order Mark at the start
of the file.  Here we do not attempt to auto-sense byte order; it must
be set by the constructor or the |setBigEndian| method before the
decoder is used.

@<Class definitions@>=
class UCS_2_Unicode_MBCSdecoder : public Unicode_MBCSdecoder {
protected:@/
    bool bigEndian;
    
public:@/
    UCS_2_Unicode_MBCSdecoder(bool isBigEndian = true) {
    	setBigEndian(isBigEndian);
    }
    
    void setBigEndian(bool isBigEndian = true) {
    	bigEndian = isBigEndian;
    }
    
    virtual string name(void) {
    	return "UCS_2_Unicode";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded character
};

@
Decode the next logical character. We return $-1$ when the end
of the encoded line is encountered.

@<Class implementations@>=
    int UCS_2_Unicode_MBCSdecoder::getNextDecodedChar(void) {
    	int c1 = getNextEncodedByte();
	if (c1 < 0) {
	    return c1;	    // End of input stream
	}
	int c2 = getNextEncodedByte();

	if (c2 == -1) {
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Premature end of line in two byte character.";
	    reportDecoderDiagnostic(os);
	    return -1;
	}
	if (bigEndian) {
	    c1 = (c1 << 8) | c2;
    	} else {
	    c1 |= (c2 << 8);
	}
	return c1;
    }	

@*3 UTF-8 Unicode decoder.

The UTF-8 encoding of Unicode is an ASCII-transparent encoding into
a stream of 8 bit bytes.  The length of an encoded character is
variable and forward-parseable.

@<Class definitions@>=
class UTF_8_Unicode_MBCSdecoder : public Unicode_MBCSdecoder {
public:@/    
    virtual string name(void) {
    	return "UTF_8_Unicode";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded character
};

@
Decode the next logical character. We return $-1$ when the end
of the encoded line is encountered.

@<Class implementations@>=
    int UTF_8_Unicode_MBCSdecoder::getNextDecodedChar(void) {
    	int c1 = getNextEncodedByte();
	
	if (c1 < 0) {
	    return c1;	    // End of input stream
	}
        string::size_type nbytes = 0;
        unsigned int result;
        
        if (c1 <= 0x7F) {   // Fast track special case for ASCII 7 bit codes
            result = c1;
            nbytes = 1;
        } else {
            unsigned char chn = c1;
            
            @#@,
            /* N.b.  You can dramatically speed up the determination of how many
               bytes follow the first byte code by looking it up in a 256 byte
               table of lengths (with duplicate values as needed due to value
               bits in the low order positions).  Once the length is determined, you
               can use a table look-up to obtain the mask for the first byte
               rather than developing the mask with a shift.  The code which
               assembles the rest of the value could also be unrolled into
               individual cases to avoid loop overhead.  Of course none of this
               is worth the bother unless you're going to be doing this a lot. */
            while ((chn & 0x80) != 0) {
                nbytes++;
                chn <<= 1;
            }
            if ((nbytes < 2) || (nbytes > 6)) {     // Lone continuation byte or invalid first byte
		ostringstream os;

		os << name() << "_MBCSdecoder: Invalid first byte " <<
		"0x" << setiosflags(ios::uppercase) << hex << c1 << " in UTF-8 encoded string";
		reportDecoderDiagnostic(os);
		return -1;
            }
            result = c1 & (0xFF >> (nbytes + 1));   // Extract bits from first byte
            for (string::size_type i = 1; i < nbytes; i++) {
                c1 = getNextEncodedByte();
		if (c1 < 0) {
		    ostringstream os;

	    	    os << name() << "_MBCSdecoder: Premature end of line in UTF-8 character.";
		    reportDecoderDiagnostic(os);
		    return -1;
		}
                if ((c1 & 0xC0) != 0x80) {
		    ostringstream os;

	    	    os << name() << "_MBCSdecoder: Bad byte 1--n signature in UTF-8 encoded sequence.";
		    reportDecoderDiagnostic(os);
                }
                result = (result << 6) | (c1 & 0x3F);
            }
        }
        return result;
    }	
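
@
The comment in the decoder above suggests replacing the bit counting
loop with table look-ups.  The following stand-alone sketch, which is
not part of the program, shows one way this might be done.  All of the
names in it (|utf8Length|, |utf8FirstMask|, |buildUtf8Tables| and
|decodeUtf8|) are hypothetical, and it works on a whole string rather
than on the decoder's byte stream.  Unlike the decoder above, it treats
a bad continuation byte as a hard error.

```cpp
#include <cassert>
#include <string>

// Length of the sequence introduced by each possible first byte;
// 0 marks bytes which cannot start a sequence.
unsigned char utf8Length[256];

// Mask selecting the value bits of the first byte, indexed by length
const unsigned char utf8FirstMask[7] = {
    0x00, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01
};

// Fill in the length table once, before any decoding is done
void buildUtf8Tables() {
    for (int b = 0; b < 256; b++) {
        int n = 0;
        for (int bit = 0x80; (b & bit) != 0; bit >>= 1) {
            n++;                        // Count leading one bits
        }
        if (n == 0) {
            utf8Length[b] = 1;          // 0xxxxxxx: ASCII
        } else if ((n >= 2) && (n <= 6)) {
            utf8Length[b] = n;          // First byte of an n byte sequence
        } else {
            utf8Length[b] = 0;          // Continuation byte or 0xFE/0xFF
        }
    }
}

// Decode the character starting at pos, advancing pos past it.
// Returns -1 at end of string or on a malformed sequence.
long decodeUtf8(const std::string &s, std::string::size_type &pos) {
    if (pos >= s.size()) {
        return -1;
    }
    unsigned char c = static_cast<unsigned char>(s[pos]);
    std::string::size_type n = utf8Length[c];
    if ((n == 0) || (pos + n > s.size())) {
        return -1;
    }
    unsigned long result = c & utf8FirstMask[n];
    for (std::string::size_type i = 1; i < n; i++) {
        unsigned char cc = static_cast<unsigned char>(s[pos + i]);
        if ((cc & 0xC0) != 0x80) {
            return -1;                  // Not a 10xxxxxx continuation byte
        }
        result = (result << 6) | (cc & 0x3F);
    }
    pos += n;
    return static_cast<long>(result);
}
```

Call |buildUtf8Tables| once at start-up; after that each first byte
costs two table look-ups instead of a loop.  As the original comment
notes, this is only worth the bother if a great deal of text is being
decoded.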

@*3 UTF-16 Unicode decoder.

The UTF-16 encoding of Unicode encodes logical characters as a sequence
of 16 bit codes.  Most Unicode characters are encoded in a single
16 bit quantity, but character codes greater than 65535 are
encoded in a pair of 16 bit values in the {\it surrogate} range.
Naturally, this encoding can be either big- or little-endian in
byte sequence; we handle either, as set by the |setBigEndian|
method or the constructor.

@<Class definitions@>=
class UTF_16_Unicode_MBCSdecoder : public Unicode_MBCSdecoder {
protected:@/
    bool bigEndian;
    
    int getNextUTF_16Word(void) {
    	int c1 = getNextEncodedByte();
	if (c1 < 0) {
	    return c1;
	}
	int c2 = getNextEncodedByte();
	if (c2 < 0) {
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Premature end of line in UTF-16 character.";
	    reportDecoderDiagnostic(os);
	    return -1;
	}
	if (bigEndian) {
	    c1 = (c1 << 8) | c2;
    	} else {
	    c1 |= (c2 << 8);
	}
	return c1;
    }
    
public:@/
    UTF_16_Unicode_MBCSdecoder(bool isBigEndian = true) {
    	setBigEndian(isBigEndian);
    }
    
    void setBigEndian(bool isBigEndian = true) {
    	bigEndian = isBigEndian;
    }

    virtual string name(void) {
    	return "UTF_16_Unicode";
    }
    
    virtual int getNextDecodedChar(void);	// Get next decoded character
};

@
Decode the next logical character. We return $-1$ when the end
of the encoded line is encountered.

@<Class implementations@>=
    int UTF_16_Unicode_MBCSdecoder::getNextDecodedChar(void) {
        string::size_type nwydes = 0;
        int w1, w2, result;
        
	w1 = getNextUTF_16Word();
	if (w1 < 0) {
	    return w1;
	}
	
        if ((w1 < 0xD800) || (w1 > 0xDFFF)) {
            result = w1;
            nwydes = 1;
        } else if ((w1 >= 0xD800) && (w1 <= 0xDBFF)) {
	    w2 = getNextUTF_16Word();
            if (w2 < 0) {
		ostringstream os;

		os << name() << "_MBCSdecoder: Premature end of line in UTF-16 two word character.";
		reportDecoderDiagnostic(os);
	    	return -1;
            }
            nwydes = 2;
            if ((w2 < 0xDC00) || (w2 > 0xDFFF)) {
		ostringstream os;

		os << name() << "_MBCSdecoder: Invalid second word surrogate " <<
		    "0x" << setiosflags(ios::uppercase) << hex << w2 << " in UTF-16 encoded string.";
		reportDecoderDiagnostic(os);
	    	return -1;
            }
            result = (((w1 & 0x3FF) << 10) | (w2 & 0x3FF)) + 0x10000;
        } else {
	    ostringstream os;

	    os << name() << "_MBCSdecoder: Invalid first word surrogate " <<
	    	"0x" << setiosflags(ios::uppercase) << hex << w1 << " in UTF-16 encoded string.";
	    reportDecoderDiagnostic(os);
	    return -1;
        }
        return result;
    }
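
@
The surrogate arithmetic is easiest to follow with actual numbers.
The stand-alone sketch below is not part of the program, and the name
|combineSurrogates| is hypothetical.  It applies the same range tests
and the same formula as the decoder above to a pair of 16 bit words
supplied by the caller.

```cpp
#include <cassert>

// Combine a high (first) and low (second) surrogate into a code point.
// Returns -1 if either word lies outside its surrogate range.
long combineSurrogates(int w1, int w2) {
    if ((w1 < 0xD800) || (w1 > 0xDBFF)) {
        return -1;                      // Not a first word surrogate
    }
    if ((w2 < 0xDC00) || (w2 > 0xDFFF)) {
        return -1;                      // Not a second word surrogate
    }
    // Each surrogate contributes its low 10 bits; the 20 bit result
    // is offset by 0x10000, the first code beyond the 16 bit range.
    return ((static_cast<long>(w1 & 0x3FF) << 10) | (w2 & 0x3FF)) + 0x10000L;
}
```

For the pair |0xD83D|, |0xDE00| the low 10 bits are |0x03D| and
|0x200|, which combine to |0xF600|; adding |0x10000| yields |0x1F600|.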
    
@*1 Interpreters.

@*2 Interpreter parent class.

This is the abstract parent class of all concrete interpreters.  We
provide the services common to most interpreters, while permitting them
to be overridden by derived classes.

@<Class definitions@>=
class MBCSinterpreter {
protected:@/
    const string *src;
    MBCSdecoder *dp;
    string prefix, suffix;

public:@/

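    MBCSinterpreter() : src(NULL), dp(NULL) {	// Start with no decoder so the |assert|s can catch misuse
    }

    virtual ~MBCSinterpreter() {
    }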

    virtual string name(void) = 0;  	// Name of interpreter

    virtual void setDecoder(MBCSdecoder &d) {
    	dp = &d;
    }

    virtual void setSource(const string &s) {	// Set input source line
    	assert(dp != NULL);
    	dp->setSource(s);
    }
    
    virtual void setPrefixSuffix(string pre = "", string suf = "") {
    	prefix = pre;
	suffix = suf;
    }
    
    virtual string getNextDecodedChar(void);
    
    virtual string decodeLine(const string &s);
};

@
We provide this default implementation of |getNextDecodedChar|
for derived classes.  They're free to override it, but this
may do the job for most.  A logical character is obtained from
the decoder.  If its character code is less than 256, it is
taken as a single byte character and returned
directly.  Otherwise, a character name is concocted by
concatenating the character set |name| and the
hexadecimal character code, with the |prefix| and
|suffix| at either end.  Character sets in which each
ideograph is logically a word will typically use
a prefix and suffix of a single blank, while sets
in which characters behave like letters will use a void
prefix and suffix.

@<Class implementations@>=
    string MBCSinterpreter::getNextDecodedChar(void) {
    	assert(dp != NULL);
    	int dc = dp->getNextDecodedChar();
	if (dc < 0) {
	    return "";	    	    // End of input stream
	}
	if (dc < 256) {
	    string r(1, static_cast<char>(dc));
	    return r;
	}
	ostringstream os;
	os.setf(ios::uppercase);
	os << prefix << name() << "-" << hex << dc << dec << suffix;
	return os.str();
    }
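
@
To make the token format concrete, here is a stand-alone sketch which
is not part of the program.  The function |tokenFor| is hypothetical;
it takes the character set name, prefix and suffix as arguments
instead of obtaining them from an interpreter object, but otherwise
follows the default implementation above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Build the token an interpreter would emit for decoded character dc
std::string tokenFor(const std::string &setName, const std::string &prefix,
                     const std::string &suffix, int dc) {
    if (dc < 0) {
        return "";                      // End of input stream
    }
    if (dc < 256) {
        return std::string(1, static_cast<char>(dc));   // Single byte character
    }
    std::ostringstream os;
    os.setf(std::ios::uppercase);       // Upper case hexadecimal digits
    os << prefix << setName << "-" << std::hex << dc << std::dec << suffix;
    return os.str();
}
```

With a blank prefix and suffix, as used for ideographic sets, the code
|0xB0A1| in a set named \.{GB2312} becomes the separate word
\.{GB2312-B0A1} with a blank on either side.  With an empty prefix
and suffix the same code in a set named \.{KR} becomes \.{KR-B0A1}
with no surrounding blanks, so adjacent characters run together.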

@
The default implementation of |decodeLine| sets the source to
the argument string, then assembles a line by concatenating
the results of successive calls to |getNextDecodedChar|.

@<Class implementations@>=
    string MBCSinterpreter::decodeLine(const string &s) {
    	string r = "", t;
	
	setSource(s);
	while ((t = getNextDecodedChar()) != "") {
	    r += t;
	}
	return r;
    }

@*2 GB2312 Interpreter class.

This interpreter class parses \.{GB2312} ideographs into
tokens which downstream parsers can comprehend.

@<Class definitions@>=
class GB2312_MBCSinterpreter : public MBCSinterpreter {
public:@/
    GB2312_MBCSinterpreter() {
    	setPrefixSuffix(" ", " ");
    }

    virtual string name(void) {
    	return "GB2312";
    }
};

@*2 Big5 Interpreter class.

This interpreter class parses \.{Big5} ideographs into
tokens which downstream parsers can comprehend.

@<Class definitions@>=
class Big5_MBCSinterpreter : public MBCSinterpreter {
public:@/
    Big5_MBCSinterpreter() {
    	setPrefixSuffix(" ", " ");
    }

    virtual string name(void) {
    	return "Big5";
    }
};

@*2 Shift-JIS Interpreter class.

This interpreter class parses Shift-JIS ideographs into
tokens which downstream parsers can comprehend.

@<Class definitions@>=
class Shift_JIS_MBCSinterpreter : public MBCSinterpreter {
public:@/
    Shift_JIS_MBCSinterpreter() {
    	setPrefixSuffix(" ", " ");
    }

    virtual string name(void) {
    	return "Shift_JIS";
    }
    
    string getNextDecodedChar(void);
};

@
Our |getNextDecodedChar| implementation is a bit more complicated
than the default provided by the parent class.  In addition to
handling ASCII and two byte character codes, we also wish to
interpret Katakana single byte characters, which are emitted
without spaces between them.

@<Class implementations@>=
    string Shift_JIS_MBCSinterpreter::getNextDecodedChar(void) {
    	assert(dp != NULL);
    	int dc = dp->getNextDecodedChar();
	if (dc < 0) {
	    return "";	    	    // End of input stream
	}
	if (dc < 0xA1) {
	    string r(1, static_cast<char>(dc)); // ASCII character
	    return r;
	}
	ostringstream os;
	os.setf(ios::uppercase);
	if ((dc >= 0xA1) && (dc <= 0xDF)) {
	    os << "SJIS-K" << hex << dc << dec; // Katakana---don't space around characters
	} else {
	    os << prefix << "SJIS-" << hex << dc << dec << suffix;  // Kanji--space on both sides
	}
	return os.str();
    }

@*2 Korean Interpreter class.

This interpreter class parses Korean characters into
tokens which downstream parsers can comprehend.  This
type (usually expressed as a \.{charset} of \.{euc-kr})
is uncommon, but we handle it to illustrate an interpreter
for an alphabetic non-Western language.

@<Class definitions@>=
class KR_MBCSinterpreter : public MBCSinterpreter {
public:@/
    virtual string name(void) {
    	return "KR";
    }
};

@*2 Unicode Interpreter class.

This interpreter class parses Unicode characters into
a form which can be comprehended by the parser.

@<Class definitions@>=
class Unicode_MBCSinterpreter : public MBCSinterpreter {
public:@/
    Unicode_MBCSinterpreter() {
    	setPrefixSuffix(" ", " ");
    }

    virtual string name(void) {
    	return "Unicode";
    }
    
    string getNextDecodedChar(void);
};

@
Our |getNextDecodedChar| implementation attempts to represent
the Unicode characters in a fashion which will best enable the
parser to classify them.  Characters in the first 256 code
positions, which are identical to ISO~8859-1, are output as
ISO characters.  Other codes are represented as
``\.{UCS-}{\it nnnn}'' where {\it nnnn} is the Unicode
code value in hexadecimal.  Codes representing ideographs are
output separated by spaces while codes for alphanumeric
characters are not space-separated.

@<Class implementations@>=
    string Unicode_MBCSinterpreter::getNextDecodedChar(void) {
    	assert(dp != NULL);
    	int dc = dp->getNextDecodedChar();
	if (dc < 0) {
	    return "";	    	    // End of input stream
	}
	if (dc <= 0xFF) {
	    string r(1, static_cast<char>(dc)); // ISO 8859-1 character
	    return r;
	}
	ostringstream os;
	os.setf(ios::uppercase);
	if (((dc >= 0x3200) && (dc < 0xD800)) ||
	    ((dc >= 0xF900) && (dc < 0xFAFF))) {
	    os << prefix << "UCS-" << hex << dc << dec << suffix;  // Ideographic--space on both sides
	} else {
	    os << "UCS-" << hex << dc << dec; // Alphabetic---don't space around characters
	}
	return os.str();
    }

@** Application string parsers.

An {\it application string parser} reads files in application-defined
formats (for example, word processor documents, spreadsheets,
page description languages, etc.) and returns strings included in
the file.  Unlike |tokenParser| in ``byte stream'' mode, there is nothing
heuristic in the operation of an application string parser---it must
understand the structure of the application data file in order to
identify and extract strings within it.

The |applicationStringParser| class is the virtual parent of all
specific application string parsers.  It provides common services
to derived classes and defines the external interface.  When initialising
an |applicationStringParser|, the caller must supply a pointer to
the |mailFolder| from which it will be invoked, through which the
folder's |nextByte| method will be called to return decoded binary bytes
of the application file.  It would be {\it much} cleaner if we could
simply supply an arbitrary function which returned the next byte of
the stream we're decoding, but that runs afoul of \CPP/'s rules for
taking the address of class members.  Consequently, we're forced
to make |applicationStringParser| co-operate with |mailFolder|
to obtain decoded bytes.

@<Class definitions@>=
class applicationStringParser {
@/
protected:@/
    bool error, eof;	    	    // Error and end of file indicators
    mailFolder *mf;
    
    virtual unsigned char get8(void);
    
    virtual void get8n(unsigned char *buf, const int n) { // Store next |n| bytes into |buf|
    	for (int i = 0; (!eof) && (i < n); i++) {
	    buf[i] = get8();
	}
    }
    
public:@/
    applicationStringParser(mailFolder *f = NULL) :
    	    error(false), eof(false), mf(NULL) {
    	setMailFolder(f);
    }
    
    virtual ~applicationStringParser() {
    }
    
    virtual string name(void) const = 0;
    
    void setMailFolder(mailFolder *f) {
    	mf = f;
    }
    
    virtual bool nextString(string &s) = 0;@;
    
    virtual void close(void) {
    	error = eof = false;
    }

    bool isError(void) const@; {
    	return error;
    }@;

    bool isEOF(void) const@; {
    	return eof;
    }@;
    
    bool isOK(void) const {
    	return (!isEOF()) && (!isError());
    }
};

@
@<Class implementations@>=
    unsigned char applicationStringParser::get8(void) {	// Get next byte, unsigned
    	assert(mf != NULL);
    	int ch = mf->nextByte();
	if (ch == EOF) {
	    eof = true;
	}
	return ch & 0xFF;
    }

@*1 Flash stream decoder.

The |flashStream| is a specialisation of |applicationStringParser|
which contains all of the logic needed to parse a Macromedia
Flash script (\.{.swf}) file.  This class remains abstract in
that it does not implement the |nextString| method; that is
left for the |flashTextExtractor| class, of which this class is
the parent.

This decoder is based on the \.{swfparse.cpp} program written by
David Michie, which is available on the \pdfURL{OpenSWF.org}{http://www.openswf.org/}
site.

@<Class definitions@>=
class flashStream : public applicationStringParser {
protected:@/
    
    @<Flash file tag values@>;
    @<Flash file action codes@>;
    @<Flash text field mode definitions@>;
    @<Flash file data structures@>;
    
    //	Header fields
    
    unsigned char sig[3];   	    // Signature: ``\.{FWS}'' in ASCII
    unsigned char version;  	    // Version number
    unsigned int fileLength;	    // Length of entire file in bytes
    rect frameSize; 	    	    // Frame size in TWIPS
    unsigned short frameRate;	    // Frames per second (8.8 bit fixed)
    unsigned short frameCount;	    // Total frames in animation

    //	Current tag information
    
    tagType tType;  	    	    // Tag type
    unsigned int tDataLen;  	    // Length of data chunk
    
    //	Bit stream decoder storage
    
    unsigned int bitBuf, bitPos;

public:@/

    flashStream(mailFolder *f = NULL) :
    	applicationStringParser(f) {
    }
    
    void readHeader(void);  	// Read header into memory
    void describe(ostream &os = cout);	    //	Describe stream
    bool nextTag(void);     //	Read next tag identifier and length of tag data
    
    //	Retrieve properties of current tag
    
    tagType getTagType(void) const {
    	return tType;
    }
    
    unsigned int getTagDataLength(void) const {
    	return tDataLen;
    }

    void ignoreTag(unsigned int lookedAhead = 0);   //	Ignore data for tag we aren't interested in
    
    virtual void close(void) {
    	applicationStringParser::close();
    }
    
protected:@/
    
    @<Read 16 and 32 bit quantities from Flash file@>;
    
    //	Skip |n| bytes of the input stream
    void skip8n(const int n) {
    	for (int i = 0; (!eof) && (i < n); i++) {
	    get8();
	}
    }

    void getString(string &s, int n = -1);
    
    //	Bit field decoding methods
    void initBits(void);
    unsigned int getBits(int n);
    int getSignedBits(const int n);
        
    void getRect(rect *r);  	    //  Read a Rectangle specification
    void getMatrix(matrix *mat);    //	Read a Matrix definition
};

@
Read the header of the Flash file into memory, validating its
signature.

@<Class implementations@>=
    void flashStream::readHeader(void) {
	sig[0] = get8();
	sig[1] = get8();
	sig[2] = get8();
	if (isEOF() || (memcmp(sig, "FWS", 3) != 0)) {
	    error = true;
	    if (verbose) {
	    	cerr << "Invalid signature in Flash animation file." << endl;
	    }
	    return;
	}
	version = get8();
	fileLength = get32();
	getRect(&frameSize);
	frameRate = get16();
	frameCount = get16();
    }

@
Write a primate-readable description of the Flash header on
the output stream argument |os|, which defaults to
|cout|.

@<Class implementations@>=
    void flashStream::describe(ostream &os) {
    	os << "Flash animation version " <<
	    static_cast<unsigned int>(version) << endl;
	os << "  File length: " << fileLength << " bytes." << endl;
	os << "  Frame size:  X: " << frameSize.xMin << " - " <<
	    	    	    	      frameSize.xMax <<
			    " Y: " << frameSize.yMin << " - " <<
			    	      frameSize.yMax << endl;
	os << "  Frame rate: " << setprecision(5) << (frameRate / 256.0) <<
	      " fps." << endl;
	os << "  Frame count: " << frameCount << endl;
    }

@
Read the header for the next tag.  Each tag begins with a 16 bit field
which contains 10 bits of tag identifier and a 6 bit field specifying the
number of argument bytes which follow.  For tags with arguments of 0 to
62 bytes, the 6 bit field is the data length.  For longer tags, the 6
bit length field is set to |0x3F| and a 32 bit quantity giving the
tag data length immediately follows.  Regardless of the format of the
tag header, we store the tag type in |tType| and the number of
data bytes in |tDataLen|.

@<Class implementations@>=
    bool flashStream::nextTag(void) {
    	unsigned short s = get16();
	unsigned long l;
	if (isOK()) {
	    tType = static_cast<tagType>(s >> 6);
	    l = s & 0x3F;
	    if (l == 0x3F) {
	    	l = get32();	    	// Long tag; read 32 bit length
	    }
	    if (isOK()) {
	    	tDataLen = l;
		return tType != stagEnd;
	    }
	}
	//  In case of error dummy up end tag for sloppy callers
	tType = stagEnd;
	tDataLen = 0;
	return false;
    }
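
@
The two forms of tag header can be illustrated with a stand-alone
sketch which is not part of the program.  The |TagHeader| structure
and |readTagHeader| function are hypothetical names, and the sketch
reads from a byte vector rather than through |get16| and |get32|.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TagHeader {
    unsigned int type;      // 10 bit tag identifier
    unsigned long length;   // Number of data bytes which follow
    bool ok;                // False if the buffer ended prematurely
};

// Read a short or long tag header starting at pos, advancing pos past it
TagHeader readTagHeader(const std::vector<unsigned char> &buf, std::size_t &pos) {
    TagHeader h = { 0, 0, false };

    if (pos + 2 > buf.size()) {
        return h;
    }
    unsigned int s = buf[pos] | (buf[pos + 1] << 8);    // Little-endian 16 bit field
    pos += 2;
    h.type = s >> 6;                    // Upper 10 bits
    h.length = s & 0x3F;                // Lower 6 bits
    if (h.length == 0x3F) {             // Long form: 32 bit length follows
        if (pos + 4 > buf.size()) {
            return h;
        }
        h.length = static_cast<unsigned long>(buf[pos]) |
                   (static_cast<unsigned long>(buf[pos + 1]) << 8) |
                   (static_cast<unsigned long>(buf[pos + 2]) << 16) |
                   (static_cast<unsigned long>(buf[pos + 3]) << 24);
        pos += 4;
    }
    h.ok = true;
    return h;
}
```

The bytes |0x43|, |0x02| form the 16 bit value |0x0243|, which splits
into tag type 9 and a length of 3.  The bytes |0xFF|, |0x02| give tag
type 11 with the length field set to |0x3F|, so the next four bytes
supply the length.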

@
Having read the tag header, if we decide we aren't interested
in the tag, we can simply skip past |tDataLen| argument bytes
to advance to the next tag header; |ignoreTag| performs this.
If you've read into the tag data before deciding you wish to
skip the tag, call |ignoreTag| with the |lookedAhead| argument
specifying how many bytes of the tag data you've already read.

@<Class implementations@>=
    void flashStream::ignoreTag(unsigned int lookedAhead) {
    	if (isOK()) {
//	    assert(lookedAhead <= tDataLen);	// (This assertion will fail if \.{--bsdfolder} is set)
	    for (unsigned int i = lookedAhead; isOK() && (i < tDataLen); i++) {
	    	get8();
	    }
	}
    }
    
@
Flash files are a little schizophrenic when it comes to the
definition of strings.  Sometimes they're stored with a leading
count byte followed by the given number of bytes of text, while
in other places they're stored \CEE/ style, with a zero terminator
byte marking the end of the string.  The |getString| method handles
both kinds.  If called with no length argument, it reads a zero
terminated string, otherwise it reads a string of |n| characters.
It's up to the caller to first read the length and pass it as
the |n| argument.

@<Class implementations@>=
    void flashStream::getString(string &s, int n) {
    	s = "";
	char ch;
	
	if (n == -1) {
	    while (((ch = get8()) != 0) && (!eof)) {	// Stop at terminator or end of file
		s += ch;
	    }
	} else {
	    while (n > 0) {
	    	ch = get8();
		s += ch;
		n--;
	    }
    	}
    }
    
@
A rectangle is stored as a 5 bit field which specifies the number
of bits in each of the four extent fields which follow.  The extents
are stored in the order |xMin|, |xMax|, |yMin|, |yMax| and are sign
extended when extracted.

@<Class implementations@>=
    void flashStream::getRect(rect *r) {
	initBits();
	int nBits = static_cast<int>(getBits(5));
	r->xMin = getSignedBits(nBits);
	r->xMax = getSignedBits(nBits);
	r->yMin = getSignedBits(nBits);
	r->yMax = getSignedBits(nBits);
    }
    
@
A transformation matrix is stored as separate scale,
rotation/skew, and translation terms, each represented
as a signed fixed-point value.  The scale and rotation/skew
terms are optional and are omitted if they are
identity---an initial bit indicates whether they are
present.

@<Class implementations@>=
    void flashStream::getMatrix(matrix *mat) {
	initBits();

	// Scale terms
	if (getBits(1)) {
            int nBits = static_cast<int>(getBits(5));
            mat->a = getSignedBits(nBits);
            mat->d = getSignedBits(nBits);
	} else {
            mat->a = mat->d = 0x00010000L;  // Identity: omitted
	}

	// Rotate/skew terms
	if (getBits(1)) {
            int nBits = static_cast<int>(getBits(5));
            mat->b = getSignedBits(nBits);
            mat->c = getSignedBits(nBits);
	} else {
            mat->b = mat->c = 0;    	    // Identity: omitted
	}

	// Translate terms
	int nBits = static_cast<int>(getBits(5));
	mat->tx = getSignedBits(nBits);
	mat->ty = getSignedBits(nBits);
    }

@
16 and 32 bit quantities are stored in little-endian byte
order.  These methods, declared within the class so they're
inlined in the interest of efficiency, use the |get8| primitive
byte input method to assemble the wider quantities.  The
|get16n| and |get32n| methods read a series of |n| consecutive
values of the corresponding type into an array.

@<Read 16 and 32 bit quantities from Flash file@>=
    unsigned short get16(void) {
    	unsigned short u16;
	
	u16 = get8();
	u16 |= get8() << 8;
	return u16;
    }
    
    unsigned int get32(void) {
    	unsigned int u32;
	
	u32 = get8();
	u32 |= get8() << 8;
	u32 |= get8() << 16;
	u32 |= static_cast<unsigned int>(get8()) << 24;
	return u32;
    }
    
    void get16n(unsigned short *buf, const int n) {
    	for (int i = 0; (!eof) && (i < n); i++) {
	    buf[i] = get16();
	}
    }
    
    void get32n(unsigned int *buf, const int n) {
    	for (int i = 0; (!eof) && (i < n); i++) {
	    buf[i] = get32();
	}
    }

@
Flash files include quantities packed into bit fields, the
widths of some of which are specified by other fields in the
file.  The following methods decode these packed fields.
Call |initBits| to initialise decoding of a bit field
which begins in the next (as yet unread) byte.  Then call
|getBits| or |getSignedBits| to return an |n| bit field
without or with sign extension respectively.

@<Class implementations@>=
    void flashStream::initBits(void) {
	// Reset the bit position and buffer.
	bitPos = 0;
	bitBuf = 0;
    }
        
    // Get n bits from the stream.
    unsigned int flashStream::getBits(int n) {
	unsigned int v = 0;

	while (true) {
            int s = n - bitPos;
            if (s > 0) {
        	// Consume the entire buffer
        	v |= bitBuf << s;
        	n -= bitPos;

        	// Get the next buffer
        	bitBuf = get8();
        	bitPos = 8;
            } else {
        	// Consume a portion of the buffer
        	v |= bitBuf >> -s;
        	bitPos -= n;
        	bitBuf &= 0xFF >> (8 - bitPos); // mask off the consumed bits

        	return v;
            }
	}
    }

    // Get n bits from the stream with sign extension.
    int flashStream::getSignedBits(const int n)  {
	signed int v = static_cast<int>(getBits(n));

	// Is the number negative?
	if ((n > 0) && (n < 32) && (v & (1 << (n - 1)))) {
            // Yes. Extend the sign.
            v |= static_cast<int>(~0U << n);
	}
	return v;
    }
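
@
Bit field extraction and sign extension can be tried out in isolation
with the following stand-alone sketch, which is not part of the
program.  The |BitReader| class is a hypothetical illustration: it
reads from a byte vector one bit at a time, most significant bit
first, which is simpler but slower than the buffered method above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class BitReader {
    const std::vector<unsigned char> &data;
    std::size_t bitIndex;               // Index of the next unread bit

public:
    explicit BitReader(const std::vector<unsigned char> &d) :
        data(d), bitIndex(0) {
    }

    // Return the next n bits as an unsigned value; bits past the
    // end of the data read as zero.
    unsigned int getBits(int n) {
        unsigned int v = 0;

        for (int i = 0; i < n; i++) {
            std::size_t byte = bitIndex >> 3;
            unsigned int bit = 0;
            if (byte < data.size()) {
                bit = (data[byte] >> (7 - (bitIndex & 7))) & 1u;
            }
            v = (v << 1) | bit;
            bitIndex++;
        }
        return v;
    }

    // Return the next n bits, replicating the top bit as the sign
    int getSignedBits(int n) {
        unsigned int v = getBits(n);

        if ((n > 0) && (n < 32) && ((v & (1u << (n - 1))) != 0)) {
            v |= ~0u << n;              // Fill the bits above the field with ones
        }
        return static_cast<int>(v);
    }
};
```

The four bytes |0x28|, |0x15|, |0xD3|, |0x80| hold a rectangle with a
field width of 5 bits and extents of 0, 10, $-3$ and 7.  The third
extent is stored as the bit pattern 11101; its top bit is set, so
sign extension turns it into $-3$.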

@
After the header, a Flash file consists of a sequence of {\it tags}, each of
which begins with a 10 bit tag type and a field specifying the number of
bytes of tag data which follow.  Since each tag specifies its length,
unknown tags may be skipped.

@<Flash file tag values@>=
    // Tag values that represent actions or data in a Flash script.
    typedef enum { 
	stagEnd                 = 0,	// End of Flash file---this is always the last tag
	stagShowFrame           = 1,@/
	stagDefineShape         = 2,@/
	stagFreeCharacter       = 3,@/
	stagPlaceObject         = 4,@/
	stagRemoveObject        = 5,@/
	stagDefineBits          = 6,@/
	stagDefineButton        = 7,@/
	stagJPEGTables          = 8,@/
	stagSetBackgroundColor  = 9,@/
	stagDefineFont          = 10,@/
	stagDefineText          = 11,@/
	stagDoAction            = 12,@/
	stagDefineFontInfo      = 13,@/
	stagDefineSound         = 14,   // Event sound tags.
	stagStartSound          = 15,@/
	stagDefineButtonSound   = 17,@/
	stagSoundStreamHead     = 18,@/
	stagSoundStreamBlock    = 19,@/
	stagDefineBitsLossless  = 20,   // A bitmap using lossless \.{zlib} compression.
	stagDefineBitsJPEG2     = 21,   // A bitmap using an internal JPEG compression table.
	stagDefineShape2        = 22,@/
	stagDefineButtonCxform  = 23,@/
	stagProtect             = 24,   // This file should not be importable for editing.

	// These are the new tags for Flash 3.
	stagPlaceObject2        = 26,   // The new style place w/ alpha color transform and name.
	stagRemoveObject2       = 28,   // A more compact remove object that omits the character tag (just depth).
	stagDefineShape3        = 32,   // A shape V3 includes alpha values.
	stagDefineText2         = 33,   // A text V2 includes alpha values.
	stagDefineButton2       = 34,   // A button V2 includes color transform, alpha and multiple actions
	stagDefineBitsJPEG3     = 35,   // A JPEG bitmap with alpha info.
	stagDefineBitsLossless2 = 36,   // A lossless bitmap with alpha info.
	stagDefineEditText      = 37,   // An editable Text Field
	stagDefineSprite        = 39,   // Define a sequence of tags that describe the behavior of a sprite.
	stagNameCharacter       = 40,   // Name a character definition, character id and a string, (used for buttons, bitmaps, sprites and sounds).
	stagFrameLabel          = 43,   // A string label for the current frame.
	stagSoundStreamHead2    = 45,   // For lossless streaming sound, should not have needed this...
	stagDefineMorphShape    = 46,   // A morph shape definition
	stagDefineFont2         = 48
    } tagType;

@
Executable actions are encoded in a Flash script as a |stagDoAction|
tag, which contains a sequence of action codes, terminated by a zero
(|sactionNone|) action.  Action codes in the range |0x00|--|0x7F| are
single byte codes with no arguments.  Action codes from |0x80| to
|0xFF| are followed by a 16 bit field specifying the number of
argument bytes which follow.  Unknown actions, like tags, may hence
be skipped.

@<Flash file action codes@>=
    typedef enum {
	sactionNone                     = 0x00,@/
	sactionNextFrame                = 0x04,@/
	sactionPrevFrame                = 0x05,@/
	sactionPlay                     = 0x06,@/
	sactionStop                     = 0x07,@/
	sactionToggleQuality            = 0x08,@/
	sactionStopSounds               = 0x09,@/
	sactionAdd                      = 0x0A,@/
	sactionSubtract                 = 0x0B,@/
	sactionMultiply                 = 0x0C,@/
	sactionDivide                   = 0x0D,@/
	sactionEqual                    = 0x0E,@/
	sactionLessThan                 = 0x0F,@/
	sactionLogicalAnd               = 0x10,@/
	sactionLogicalOr                = 0x11,@/
	sactionLogicalNot               = 0x12,@/
	sactionStringEqual              = 0x13,@/
	sactionStringLength             = 0x14,@/
	sactionSubString                = 0x15,@/
	sactionInt                      = 0x18,@/
	sactionEval                     = 0x1C,@/
	sactionSetVariable              = 0x1D,@/
	sactionSetTargetExpression      = 0x20,@/
	sactionStringConcat             = 0x21,@/
	sactionGetProperty              = 0x22,@/
	sactionSetProperty              = 0x23,@/
	sactionDuplicateClip            = 0x24,@/
	sactionRemoveClip               = 0x25,@/
	sactionTrace                    = 0x26,@/
	sactionStartDragMovie           = 0x27,@/
	sactionStopDragMovie            = 0x28,@/
	sactionStringLessThan           = 0x29,@/
	sactionRandom                   = 0x30,@/
	sactionMBLength                 = 0x31,@/
	sactionOrd                      = 0x32,@/
	sactionChr                      = 0x33,@/
	sactionGetTimer                 = 0x34,@/
	sactionMBSubString              = 0x35,@/
	sactionMBOrd                    = 0x36,@/
	sactionMBChr                    = 0x37,@/
	sactionHasLength                = 0x80,@/
	sactionGotoFrame                = 0x81, // frame num (WORD)
	sactionGetURL                   = 0x83, // url (STR), window (STR)
	sactionWaitForFrame             = 0x8A, // frame needed (WORD), 
                                        	// actions to skip (BYTE)
	sactionSetTarget                = 0x8B, // name (STR)
	sactionGotoLabel                = 0x8C, // name (STR)
	sactionWaitForFrameExpression   = 0x8D, // frame needed on stack,
                                        	// actions to skip (BYTE)
	sactionPushData                 = 0x96,@/
	sactionBranchAlways             = 0x99,@/
	sactionGetURL2                  = 0x9A,@/
	sactionBranchIfTrue             = 0x9D,@/
	sactionCallFrame                = 0x9E,@/
	sactionGotoExpression           = 0x9F
    } actionCode;    

@
Here we define the various mode bits which occur in font-
and text-related tags.  Many of these bits are irrelevant
to our mission of string parsing, but we define them all anyway.

@<Flash text field mode definitions@>=
    typedef enum {  	    	    	// Flag bits for DefineFontInfo
	fontUnicode   = 0x20,@/
	fontShiftJIS  = 0x10,@/
	fontANSI      = 0x08,@/
	fontItalic    = 0x04,@/
	fontBold      = 0x02,@/
	fontWideCodes = 0x01
    } fontFlags;

    typedef enum {  	    	    	// Flag bits for text record type 1
	isTextControl = 0x80,@/

	textHasFont   = 0x08,@/
	textHasColor  = 0x04,@/
	textHasYOffset= 0x02,@/
	textHasXOffset= 0x01
    } textFlags;
    
    typedef enum {  	    	    	// Flag bits for DefineEditText
	seditTextFlagsHasFont       = 0x0001,@/
	seditTextFlagsHasMaxLength  = 0x0002,@/
	seditTextFlagsHasTextColor  = 0x0004,@/
	seditTextFlagsReadOnly      = 0x0008,@/
	seditTextFlagsPassword      = 0x0010,@/
	seditTextFlagsMultiline     = 0x0020,@/
	seditTextFlagsWordWrap      = 0x0040,@/
	seditTextFlagsHasText       = 0x0080,@/
	seditTextFlagsUseOutlines   = 0x0100,@/
	seditTextFlagsBorder        = 0x0800,@/
	seditTextFlagsNoSelect      = 0x1000,@/
	seditTextFlagsHasLayout     = 0x2000
    } editTextFlags;

@
The following data structures are used to represent rectangles
and transformation matrices.  We don't do anything with these
quantities, but we need to understand their structure in order
to skip over them while looking for fields we are
interested in.
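
To make ``skip over them'' concrete, here is a minimal sketch of how many
bytes a rectangle occupies.  It assumes the layout given in the published
SWF file format description: a 5-bit field width followed by four signed
values of that width, padded to a byte boundary.  The function name
|rectByteLength| is hypothetical and is not part of this program, which
reads rectangles with |getRect|.

```cpp
#include <cassert>

// Number of bytes occupied by a Flash RECT whose first byte is `first`.
// Assumed layout: 5-bit field width n, then xMin, xMax, yMin, yMax of
// n bits each, with the total rounded up to a whole number of bytes.
unsigned int rectByteLength(unsigned char first) {
    unsigned int nBits = first >> 3;            // Top five bits hold the field width
    unsigned int totalBits = 5 + (4 * nBits);   // Width field plus four values
    return (totalBits + 7) / 8;                 // Round up to whole bytes
}
```

A rectangle with 15-bit fields therefore takes 65 bits, or nine bytes.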

@<Flash file data structures@>=
    typedef struct {
	int xMin, xMax, yMin, yMax;
    } rect;
    
    typedef struct {
	int a;
	int b;
	int c;
	int d;
	int tx;
	int ty;
    } matrix;

@*2 Flash text extractor.

The |flashTextExtractor| extends |flashStream| to parse tags containing
text fields and return them with the |nextString| method.  We define this
as a separate class in order to encapsulate all of the string parsing
machinery in one place, while leaving |flashStream| a general-purpose
\.{.swf} file parser adaptable to other purposes.

@<Class definitions@>=
class flashTextExtractor : public flashStream {
protected:
    map <unsigned short, vector<unsigned short> *> fontMap;
    map <unsigned short, unsigned short> fontGlyphCount;
    map <unsigned short, fontFlags> fontInfoBits;
    queue <string> strings;

    bool initialised;
    
    //	Options
    
    bool textOnly;  	    	    // Return only text (not font names, URLs, etc.)
    
public:
    flashTextExtractor(mailFolder *f = NULL) :
    	flashStream(f), initialised(false), textOnly(false) {
    }
    
    ~flashTextExtractor() {
    	close();
    }
    
    virtual string name(void) const {
    	return "Flash";
    }
     
    void setTextOnly(const bool tf) {
    	textOnly = tf;
    }
    
    bool getTextOnly(void) const {
    	return textOnly;
    }

    bool nextString(string &s);     	// Return next string from Flash file
    
    virtual void close(void) {
    	while (!fontMap.empty()) {
	    delete fontMap.begin()->second;
	    fontMap.erase(fontMap.begin());
	}
	fontGlyphCount.clear();
	fontInfoBits.clear();
	while (!strings.empty()) {
	    strings.pop();
	}
	initialised = textOnly = false;
	flashStream::close();
    }
};

@
Return the next string (which may contain any number of tokens)
from the Flash file.  If the |strings| queue contains already-parsed
strings, return and delete the item at the head of the queue.
Otherwise, we parse our way through the Flash file, adding any
strings which appear in tags to the |strings| queue.  If, after
parsing a tag, we find |strings| non-empty, we return the
first item in the queue.  The method returns |true| if a string
was stored and |false| when the end of the Flash file is encountered.

The first time this method is called, we read the Flash file
header and validate it.  If an error occurs in the process, we
treat the event as a logical end of file.

@<Class implementations@>=
    bool flashTextExtractor::nextString(string &s) {
    	if (!initialised) {
	    initialised = true;
	    readHeader();
	    if (!isOK()) {
	    	if (verbose) {
		    cerr << "Invalid header in Flash application file." << endl;
		}
		close();
		while (!isEOF()) {
		    get8();     	    	// Discard contents after error
		}
		return false;
	    }
	}
    	while (true) {
haveStrings:@/
    	    @<Check for strings in the queue and return first if queue not empty@>;
	    
    	    while ((!isEOF()) && (!isError()) && nextTag()) {
	    	unsigned int variant = 0;   	// Twiddley-puke variant type for tags
		
		switch (tType) {
	    	    case stagDefineFont:@/
		    	@<Parse Flash DefineFont tag@>;
			break;
			
		    case stagDefineFont2:@/
		    	@<Parse Flash DefineFont2 tag@>;
		    	break;

	    	    case stagDefineFontInfo:@/
		    	@<Parse Flash DefineFontInfo tag@>;
			break;
			
		    case stagDefineText2:   	// Like |stagDefineText|, but colour is RGBA
		    	variant = 2;
			@,@/
			// Note fall-through

		    case stagDefineText:@/
		    	@<Parse Flash DefineText tags@>;
			break;
			
		    case stagDefineEditText:@/
		    	@<Parse Flash DefineEditText tag@>;
		    	break;
			
		    case stagFrameLabel:@/
		    	@<Parse Flash FrameLabel tag@>;
			break;
			
		    case stagDoAction:@/
		    	@<Parse Flash DoAction tag@>;
			break;

		    default:
#ifdef FLASH_PARSE_DEBUG
			cout << "nextString ignoring tag type " << getTagType() << " data length: " <<
			    getTagDataLength() << endl;
#endif
			ignoreTag();
			break;
		}
		if (!strings.empty()) {
		    goto haveStrings;
		}
	    }
	    if (strings.empty()) {
	    	break;
	    }
	}
	while (isOK()) {
	    get8();
	}
	return false;
    }

@
Since a single tag may contain any number of strings, we place
strings extracted from a tag in the |strings| queue.  Then, after
we're done digesting the tag, if the queue is non-empty, we return
the first string from it.  Subsequent calls return strings from
the queue until it's empty, at which time we resume scouring the
Flash file for more strings.
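
The queue discipline can be sketched in isolation as follows.  This is an
illustration only, not the extractor itself: the class |queuedSource| and
its pretend ``tags'' (plain lists of strings) are hypothetical stand-ins
for the real tag parser.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-in for the extractor: each "tag" is a list of strings.
class queuedSource {
    std::vector<std::vector<std::string> > tags;    // Pretend tag contents
    std::size_t nextTagIndex;                       // Next tag to digest
    std::queue<std::string> strings;                // Strings awaiting delivery
public:
    explicit queuedSource(const std::vector<std::vector<std::string> > &t) :
        tags(t), nextTagIndex(0) {
    }

    // Store the next string in s and return true; return false at end of input.
    bool nextString(std::string &s) {
        while (true) {
            if (!strings.empty()) {         // Deliver queued strings first
                s = strings.front();
                strings.pop();
                return true;
            }
            if (nextTagIndex >= tags.size()) {
                return false;               // Nothing queued and no tags left
            }
            const std::vector<std::string> &tag = tags[nextTagIndex++];
            for (std::size_t i = 0; i < tag.size(); i++) {
                strings.push(tag[i]);       // A tag may yield zero or more strings
            }
        }
    }
};
```

A tag which yields no strings simply causes the loop to move on to the
next tag, so the caller never sees an empty delivery.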

@<Check for strings in the queue and return first if queue not empty@>=
    if (!strings.empty()) {
	s = strings.front();
	strings.pop();
	return true;
    }

@
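The DefineFont tag actually contains only one thing of interest
to us: the number of glyphs in the font.  The tag does not state the
count directly; its offset table holds one 16-bit entry per glyph, so
the first entry (the offset of the first glyph's shape) divided by two
gives the count.  We save the glyph count
in the |fontGlyphCount| map, tagged by the font ID.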

@<Parse Flash DefineFont tag@>=
    {
#ifdef FLASH_PARSE_DEBUG
    	cout << "DefineFont" << endl;
#endif
    	unsigned short fontID = get16();
	unsigned int offsetTable = get16();
#ifdef FLASH_PARSE_DEBUG
	cout << "  Font ID: " << fontID << endl;
    	cout << "  Glyph count: " << ( offsetTable / 2) << endl;
#endif
	fontGlyphCount.insert(make_pair(fontID, offsetTable / 2));
    	ignoreTag(2 * 2);
    }

@
The DefineFont2 tag adds a font name to the fields in the
original DefineFont tag.  We consider this font name as an
eligible string if the |textOnly| constraint isn't |true|.

@<Parse Flash DefineFont2 tag@>=
    {
#ifdef FLASH_PARSE_DEBUG
    	cout << "DefineFont2" << endl;
#endif
    	unsigned short fontID = get16();
    	get16();    	    // Flag bits

    	// Parse the font name
    	unsigned int fontNameLen = get8();
    	string fontName;
	getString(fontName, fontNameLen);
	if (!textOnly) {
	    strings.push(fontName);
	}

    	// Get the number of glyphs.
    	unsigned int nGlyphs = get16();
	fontGlyphCount.insert(make_pair(fontID, nGlyphs));
	ignoreTag(2 + 2 + 1 + fontNameLen + 2);
    }

@
The DefineFontInfo tag is crucial to decoding Flash text strings.
Text in Flash files is stored as glyph indices within a font.  The
font can, in the general case, be defined by an arbitrary stroked
path outline, independent of any standard character set.  For fonts
which employ standard character sets, the optional
DefineFontInfo identifies the character set and provides the
mapping from the glyph indices to characters in the font's
character set.  We save these in maps indexed by the font ID so
we can look them up when we encounter text in that font.
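
As a sketch of what the saved mapping is for, the following stand-alone
function turns a list of glyph indices back into bytes by way of a code
table.  The name |decodeGlyphs| and its parameters are hypothetical; the
real decoding happens inline when text records are parsed, and the
choice here to skip out-of-range indices silently is an assumption made
for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Translate glyph indices into bytes using a per-font code table, where
// codeTable[g] is the character code of glyph g.  With wideCodes set,
// each code contributes two bytes, high byte first.  Indices beyond the
// end of the table are skipped.
std::string decodeGlyphs(const std::vector<unsigned short> &codeTable,
                         const std::vector<unsigned int> &glyphs,
                         bool wideCodes) {
    std::string s;
    for (std::size_t i = 0; i < glyphs.size(); i++) {
        if (glyphs[i] >= codeTable.size()) {
            continue;                       // No such glyph in this font
        }
        unsigned int code = codeTable[glyphs[i]];
        if (wideCodes) {
            s += static_cast<char>((code >> 8) & 0xFF);
        }
        s += static_cast<char>(code & 0xFF);
    }
    return s;
}
```

Without a code table there is nothing to translate through, which is why
text in a font lacking DefineFontInfo cannot be recovered this way.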

@<Parse Flash DefineFontInfo tag@>=
    {
#ifdef FLASH_PARSE_DEBUG
    	cout << "DefineFontInfo" << endl;
#endif
    	unsigned short fontID = get16();
    	unsigned int fontNameLen = get8();
    	string fontName;
	getString(fontName, fontNameLen);
	if (!textOnly) {
	    strings.push(fontName);
	}
	fontFlags fFlags = static_cast<fontFlags>(get8());   	    	
	map<unsigned short, unsigned short>::iterator fp = fontGlyphCount.find(fontID);
	if (fp == fontGlyphCount.end()) {
	    if (verbose) {
    	    	cerr << "DefineFontInfo for font ID " << fontID <<
		    " without previous DefineFont." << endl;
	    }
    	    ignoreTag(2 + 1 + fontNameLen + 1);
	} else {
	    unsigned nGlyphs = fp->second;
	    vector <unsigned short> *v = new vector<unsigned short>(nGlyphs);
	    fontMap.insert(make_pair(fontID, v));
	    fontInfoBits.insert(make_pair(fontID, fFlags));

	    for (unsigned int g = 0; g < nGlyphs; g++) {
		if (fFlags & fontWideCodes) {
		    (*v)[g] = get16();
		} else {
		    (*v)[g] = get8();
		}
	    }
	}
    }

@
Most of the text we're really interested in will be found in the
DefineText tag and its younger sibling DefineText2.  After spitting
out the various wobbly green parts, we digest the list of glyphs
composing the text, going back to the font definition to claw them
back into civilised language which we can filter.

@<Parse Flash DefineText tags@>=
    {
#ifdef FLASH_PARSE_DEBUG
	unsigned short textID = get16();
    	cout << "DefineText.  ID = " << textID << endl;
#else
    	get16();    	    // Ignore textID
#endif
	rect tr;
	getRect(&tr);
	matrix tm;
	getMatrix(&tm);
	unsigned short textGlyphBits = get8();
	unsigned short textAdvanceBits = get8();
	int fontId = -1;
	map <unsigned short, vector<unsigned short> *>::iterator fontp = fontMap.end();
	map <unsigned short, unsigned short>::iterator fgcp = fontGlyphCount.end();
	unsigned int fGlyphs = 0;
	fontFlags fFlags = static_cast<fontFlags>(0);
	
	vector<unsigned short> *fontChars = NULL;

	//  Now it's a matter of parsing the text records

   	while (true) {
	    unsigned int textRecordType = get8();
	    if (textRecordType == 0) {
		break;	    	// 0 indicates end of text records
	    }

	    if (textRecordType & isTextControl) {
#ifdef FLASH_PARSE_DEBUG
    	    	cout << "Text control record." << endl;
#endif
        	if (textRecordType & textHasFont) {
        	    fontId = get16();
#ifdef FLASH_PARSE_DEBUG
    	    	    cout << "    fontId: " << fontId << endl;
#endif
		    fgcp = fontGlyphCount.find(fontId);
		    if (fgcp == fontGlyphCount.end()) {
			fontp =fontMap.end();
			if (verbose) {
			    cerr << "Flash DefineText item references undefined font ID " <<
				fontId << endl;
			}
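		    } else {
			fGlyphs = fgcp->second;
			fontp = fontMap.find(fontId);
			map <unsigned short, fontFlags>::iterator fibp = fontInfoBits.find(fontId);
			if ((fontp == fontMap.end()) || (fibp == fontInfoBits.end())) {
			    // No code table for this font: its glyphs cannot be decoded
			    fgcp = fontGlyphCount.end();
			    fontChars = NULL;
			} else {
			    fontChars = fontp->second;
			    fFlags = fibp->second;
			}
    	    	    }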
        	}
        	if (textRecordType & textHasColor) {
#ifdef FLASH_PARSE_DEBUG
        	    int r = get8();
        	    int g = get8();
        	    int b = get8();
		    if (variant == 2) {
			int a = get8(); 	// Alpha (transparency) channel
			cout << "    tfontColour: (" << r << "," <<
			    g << "," << b << "," << a << ")" << endl;
		    } else {
			cout << "    tfontColour: (" << r << "," <<
			    g << "," << b << ")" << endl;
		    }
#else
    	    	    skip8n((variant == 2) ? 4 : 3);	    // Skip R, G, B (and alpha in DefineText2)
#endif
        	}
        	if (textRecordType & textHasXOffset) {
#ifdef FLASH_PARSE_DEBUG
        	    int iXOffset = get16();
		    cout << "    X offset " << iXOffset << endl;
#else
    	    	    get16();	    // Skip text X offset
#endif
        	}
        	if (textRecordType & textHasYOffset) {
#ifdef FLASH_PARSE_DEBUG
        	    int iYOffset = get16();
		    cout << "    Y offset " << iYOffset << endl;
#else
    	    	    get16();	    // Skip text Y offset
#endif
        	}
        	if (textRecordType & textHasFont) {
#ifdef FLASH_PARSE_DEBUG
        	    int iFontHeight = get16();
		    cout << "    Font Height: " << iFontHeight << endl;
#else
    	    	    get16();	    // Skip text font height
#endif
        	}
	    } else {	// Type 0:  Glyph record
#ifdef FLASH_PARSE_DEBUG
    	    	cout << "Text glyph record." << endl;
#endif
		unsigned int nGlyphs = textRecordType & 0x7F;

		initBits();
		string s = "";

        	for (unsigned int i = 0; i < nGlyphs; i++) {
        	    unsigned int iIndex = getBits(textGlyphBits);
#ifdef FLASH_PARSE_DEBUG
        	    unsigned int iAdvance = getBits(textAdvanceBits);
		    cout << "[" << iIndex << "," << iAdvance << "] " << flush;
#else
    	    	    getBits(textAdvanceBits);	    // Ignore text advance distance
#endif
		    if (fontId < 0) {
			if (verbose) {
			    cerr << "Flash DefineText does not specify font." << endl;
			}
		    } else if (fgcp != fontGlyphCount.end()) {
			if (iIndex >= fGlyphs) {
			    if (verbose) {
				cerr << "Flash DefineText glyph index " <<
				    iIndex << " exceeds font size of " << fGlyphs << "." <<
				    endl;
			    }
			} else {
			    if (fFlags & fontWideCodes) {
			    	unsigned int wc = (*fontChars)[iIndex];
				s += static_cast<char>((wc >> 8) & 0xFF);
				s += static_cast<char>(wc & 0xFF);
			    } else {
			    	s += static_cast<char>((*fontChars)[iIndex]);
			    }
			}
		    }
        	}
#ifdef FLASH_PARSE_DEBUG
        	cout << endl;
		cout << "Decoded: (" << s << ")" << endl;
#endif
    	    	@<Decode non-ANSI Flash text@>;
		strings.push(s);
	    }
	}
    }
    
@
Text strings in a Flash file can be encoded in Shift-JIS and
Unicode in addition to ANSI characters.  If the font is flagged
as using one of those encodings, decode it into an ANSI representation.

@<Decode non-ANSI Flash text@>=
    if (fFlags &  fontUnicode) {
    	UCS_2_Unicode_MBCSdecoder mbd_ucs;	// Unicode decoder
    	Unicode_MBCSinterpreter mbi_ucs; 	// Unicode interpreter
	
	mbi_ucs.setDecoder(mbd_ucs);
	s = mbi_ucs.decodeLine(s);
    } else if (fFlags & fontShiftJIS) {
    	Shift_JIS_MBCSdecoder mbd_sjis;	    	// Shift-JIS decoder
    	Shift_JIS_MBCSinterpreter mbi_sjis; 	// Shift-JIS interpreter
	
	mbi_sjis.setDecoder(mbd_sjis);
	s = mbi_sjis.decodeLine(s);
    }


@
Of course, there isn't just text, there's {\it editable text}, where
morons can type in their credit card numbers after receiving ``so cool
a Flash''.  We deem any initial text in the edit field a string, as
well as the variable name, unless |textOnly| is |true|.

@<Parse Flash DefineEditText tag@>=
    {
#ifdef FLASH_PARSE_DEBUG
    	cout << "Edit text record." << endl;
#endif
	get16();
	rect rBounds;
	getRect(&rBounds);

	unsigned int flags = get16();

#ifdef FLASH_PARSE_DEBUG
    	cout << "DefineEditText.  Flags = 0x" << hex << flags << dec << endl;
#endif

	if (flags & seditTextFlagsHasFont) {
#ifdef FLASH_PARSE_DEBUG
            unsigned short uFontId = get16();
            unsigned short uFontHeight = get16();
	    cout << "FontId: " << uFontId << "  FontHeight: " << uFontHeight << endl;
#else
    	    get16();
	    get16();
#endif
	}

	if (flags & seditTextFlagsHasTextColor) {
            skip8n(4);	    	// Skip colour (including alpha transparency)
	}

	if (flags & seditTextFlagsHasMaxLength) {
#ifdef FLASH_PARSE_DEBUG
            int iMaxLength = get16();
            printf("length:%d ", iMaxLength);
#else
    	    get16();
#endif
	}

	if (flags & seditTextFlagsHasLayout) {
    	    skip8n(1 + (2 * 4));
	}

    	string varname;
	getString(varname);
	if (!textOnly) {
	    strings.push(varname);  	    // Emit variable name as a string
	}

	if (flags & seditTextFlagsHasText ) {
    	    string s;
	    char c;

	    while ((c = get8()) != 0) {
		s += c;
	    }
	    strings.push(s);
	}
    }
    
@
Frames in Flash files can have labels, which can be used to jump to them.
If |textOnly| is not set, we parse these labels and return them as strings,
since they will frequently identify Flash files which appear in junk mail.

@<Parse Flash FrameLabel tag@>=
    {
	string s;

	getString(s);
	if (!textOnly) {
	    strings.push(s);
	}
    }

@
Some of the DoAction tags contain strings we might be interested
in perusing.  Walk through the action items in a DoAction tag and
push any relevant strings onto the |strings| queue.
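
The walk can be sketched over an in-memory buffer as follows.  This is
an illustration under stated assumptions, not the code used here: the
function |getURLStrings| is hypothetical, the 16-bit length is taken to
be stored low byte first, and a truncated record simply ends the walk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collect the URL and target strings of every GetURL action (code 0x83)
// in a buffer of action records.  A code with its high bit set is
// followed by a 16-bit data length (assumed little-endian) and that many
// bytes of data; a zero code ends the list.
std::vector<std::string> getURLStrings(const std::vector<unsigned char> &b) {
    std::vector<std::string> out;
    std::size_t p = 0;
    while ((p < b.size()) && (b[p] != 0)) {
        unsigned int code = b[p++];
        std::size_t dlen = 0;
        if ((code & 0x80) != 0) {
            if (p + 2 > b.size()) {
                break;                      // Truncated length field
            }
            dlen = b[p] + (static_cast<std::size_t>(b[p + 1]) << 8);
            p += 2;
        }
        if (p + dlen > b.size()) {
            break;                          // Truncated data
        }
        if (code == 0x83) {                 // Two NUL-terminated strings
            std::size_t q = p, end = p + dlen;
            for (int n = 0; (n < 2) && (q < end); n++) {
                std::string s;
                while ((q < end) && (b[q] != 0)) {
                    s += static_cast<char>(b[q++]);
                }
                q++;                        // Step over the terminating NUL
                out.push_back(s);
            }
        }
        p += dlen;                          // Advance past this action's data
    }
    return out;
}
```

Because every action with data announces its own length, actions we do
not understand can be skipped without knowing their contents.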

@<Parse Flash DoAction tag@>=
    {
#ifdef FLASH_PARSE_DEBUG
    	cout << "Do action:" << endl;
#endif
    	actionCode ac;

	while (isOK() && (ac = static_cast<actionCode>(get8())) != sactionNone) {
	    unsigned int dlen = 0;
	    if ((ac & 0x80) != 0) {
		dlen = get16();
	    }

	    switch (ac) {
		case sactionGetURL:
		    {
			string url, target;

			getString(url);
			getString(target);
			if (!textOnly) {
			    strings.push(url);
			}
			strings.push(target);
    	    	    }
		    break;

		default:
		    if (dlen > 0) {
			skip8n(dlen);
		    }
#ifdef FLASH_PARSE_DEBUG
    	    	    cout << "  Skipping action code 0x" << hex << ac << dec <<
			" data length " << dlen << endl;
#endif
		    break;
	    }
	}
    }
    
@*1 PDF text extractor.

The |pdfTextExtractor| decodes Portable Document Format (\.{.pdf}) files by
opening a pipe to the \pdfURL{\.{pdftotext}}{http://www.foolabs.com/xpdf/} program.
Since this program cannot read a PDF document from standard input, we transcribe
the PDF stream to a temporary file which is passed to \.{pdftotext} on
the command line; the extracted text is directed to standard output
whence it can be read through the pipe.  The temporary file is deleted
after the PDF decoding is complete.  Naturally, this facility is available
only if the system provides \.{pdftotext} and the machinery needed to
connect to it.

@<Class definitions@>=
#ifdef HAVE_PDF_DECODER
class pdfTextExtractor : public applicationStringParser {
protected:
    bool initialised;
#ifdef HAVE_FDSTREAM_COMPATIBILITY
    fdistream is;
#else
    ifstream is;
#endif
    FILE *ip;
#ifdef HAVE_MKSTEMP
    char tempfn[256];
#else
    char tempfn[L_tmpnam + 2];
#endif
    
public:
    pdfTextExtractor(mailFolder *f = NULL) :
    	applicationStringParser(f),
	initialised(false),
	ip(NULL) {
    }
    
    ~pdfTextExtractor() {
    	close();
    }
    
    virtual string name(void) const {
    	return "PDF";
    }
   
    bool nextString(string &s);
    
    virtual void close(void) {
	if (ip != NULL) {
#ifndef HAVE_FDSTREAM_COMPATIBILITY
	    is.close();
#endif
	    pclose(ip);
	    remove(tempfn);
	    ip = NULL;
	}
	applicationStringParser::close();
    	initialised = false;
    }
};
#endif

@
Since \.{pdftotext} cannot read a PDF file from standard input, we're
forced to transcribe the content to a temporary file.  We do this
the first time |nextString| is called, setting the |initialised|
flag once the deed is done.  Subsequent calls simply return the
decoded text from the pipe, closing things down when end of file
is encountered.

@<Class implementations@>=
#ifdef HAVE_PDF_DECODER
    bool pdfTextExtractor::nextString(string &s) {
    	if (!initialised) {
	    initialised = true;
	    
	    @<Transcribe PDF document to temporary file@>;
	    @<Create pipe to pdftotext decoder@>;	    
	}
	
	if (ip == NULL) {
	    return false;   	    // Could not open pipe; fake EOF
	}
	
	if (getline(is, s)) {
	    return true;
	}
	close();
	return false;
    }
#endif

@
Read the PDF document text and export to a temporary file
whence \.{pdftotext} can read it.  We generate a unique name
for the temporary file with |mkstemp| or, if the system
doesn't provide that function, the POSIX |tmpnam|
alternative.
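
For readers unfamiliar with |mkstemp|, here is a minimal stand-alone
sketch of the pattern on a POSIX system.  The function |writeTempFile|
is hypothetical, and it places the file in \.{/tmp} purely for the sake
of the example; the program itself uses a name relative to the current
directory and writes through a \CPP/ stream instead of |write|.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Write data to a freshly created temporary file and return the file's
// name, or an empty string on failure.  mkstemp replaces the trailing
// XXXXXX of the template with a unique suffix and opens the file.
std::string writeTempFile(const std::string &data) {
    char name[] = "/tmp/PDF_decode_XXXXXX";
    int fd = mkstemp(name);
    if (fd < 0) {
        return "";
    }
    const char *p = data.data();
    std::size_t left = data.size();
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {                   // Write failed: clean up and give up
            close(fd);
            remove(name);
            return "";
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    close(fd);
    return name;
}
```

The caller is responsible for removing the file once the decoder has
finished with it.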

@<Transcribe PDF document to temporary file@>=
#ifdef HAVE_MKSTEMP
    strcpy(tempfn, "PDF_decode_XXXXXX");
    int pdffd = mkstemp(tempfn);
#ifdef HAVE_FDSTREAM_COMPATIBILITY
    fdostream pdfstr(pdffd);
#else
    ofstream pdfstr(pdffd);
#endif
#else
    tmpnam(tempfn);
    ofstream pdfstr(tempfn, ios::out | ios::binary);
#endif
    if (!pdfstr) {
	cerr << "Cannot create PDF temporary file " << tempfn << endl;
	error = eof = true;
	return false;
    }
    while (isOK()) {
	pdfstr.put(static_cast<char>(get8()));
    }
#ifdef HAVE_MKSTEMP
#ifdef HAVE_FDSTREAM_COMPATIBILITY
    ::close(pdffd);
#else
    pdfstr.close();
#endif
#else
    pdfstr.close();
#endif

@
Since \.{pdftotext} does all the heavy lifting here, we need only
invoke it with |popen| and bind the resulting pipe to the \CPP/ input
stream we use to read the decoded text.
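
The essence of reading a command's output through |popen| can be
sketched without any stream binding at all.  The function
|readCommandLines| below is hypothetical and reads the pipe with plain
|fgetc|; it is meant only to show the open, read and |pclose| sequence
on a POSIX system.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Run a command with popen and return its standard output as a list of
// lines, without the trailing newlines.  An empty list is returned if
// the pipe cannot be opened.
std::vector<std::string> readCommandLines(const std::string &command) {
    std::vector<std::string> lines;
    FILE *ip = popen(command.c_str(), "r");
    if (ip == NULL) {
        return lines;
    }
    std::string line;
    int c;
    while ((c = fgetc(ip)) != EOF) {
        if (c == '\n') {
            lines.push_back(line);
            line.clear();
        } else {
            line += static_cast<char>(c);
        }
    }
    if (!line.empty()) {
        lines.push_back(line);          // Final line lacking a newline
    }
    pclose(ip);
    return lines;
}
```

Note that |popen| hands the command to the shell, so a file name
containing shell metacharacters would need quoting.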

@<Create pipe to pdftotext decoder@>=
    string pdfcmd = "pdftotext ";
    pdfcmd += tempfn;
    pdfcmd += " -";
    ip = popen(pdfcmd.c_str(), "r");
    if (ip == NULL) {
	cerr << "Cannot open pipe to pdftotext." << endl;
	error = eof = true;
	return false;
    }
    is.attach(fileno(ip));
    is.clear();

@** Mail folder.

The |mailFolder| class returns successive lines from a mail
folder bound to an input stream.

@<Class definitions@>=
@<Configure compression suffix and command@>@/

class mailFolder {
public:@/
    istream *is;     	    	    // Stream to read mail folder from
    dictionaryWord::mailCategory category;  // Category (Mail or Junk)
    unsigned int nLines;    	    // Number of lines in folder
    unsigned int nMessages; 	    // Number of messages read so far
    bool newMessage;	    	    // On first line of new message ?
    bool expectingNewMessage;	    // Expecting start of new message ?
    bool lastLineBlank;     	    // Was last line in mail folder blank ?
    bool BSDfolder; 	    	    // Mail folder uses ``pure BSD'' message boundary semantics
    bool inHeader;  	    	    // Within message header section
    string lookAheadLine;   	    // Line to save look ahead while parsing headers
    bool lookedAhead;	    	    // Have we a look ahead line ?
    ifstream isc;    	    	    // Input stream for (possibly compressed) input file
#if defined(COMPRESSED_FILES) && defined(HAVE_FDSTREAM_COMPATIBILITY)
    fdistream iscc; 	    	    // Pipe input stream to read compressed input file
#endif
    
    string fromLine;	    	    // ``\.{From\ }'' line for diagnostics
    string messageID;	    	    // Message ID for diagnostics
    
    string lastFromLine;	    // Last ``\.{From\ }'' line shown in diagnostics
    string lastMessageID;   	    // Last message ID shown in diagnostics
    
    //	Compressed file decoding
#if defined(COMPRESSED_FILES) || defined(HAVE_DIRECTORY_TRAVERSAL)
    FILE *ip;	    	    	    // File handle used for |popen| pipe to decompressor
#endif
    
#ifdef HAVE_DIRECTORY_TRAVERSAL
    //	Directory traversal
    bool dirFolder; 	    	    // Are we reading a directory folder ?
    DIR *dh;	    	    	    // Handle for |readdir|
    string dirName, cfName; 	    // Directory name and current file name in directory
    string pathSeparator;   	    // System path separator
#ifdef HAVE_FDSTREAM_COMPATIBILITY
    fdistream ifcdir;	    	    // Stream to read compressed file in directory
#endif
    ifstream ifdir; 	    	    // Stream to read file in directory
    istringstream nullstream;	    // Null stream for empty directory case
#endif
    
    //	Body encoding properties
    string bodyContentType; 	    	// \.{Content-Type}
    string bodyContentTypeCharset;  	// \hskip2em \.{charset=}
    string bodyContentTypeName;  	// \hskip2em \.{name=}
    string bodyContentTransferEncoding; // \.{Content-Transfer-Encoding}
    
    //	MIME multi-part separators and status
    string partBoundary;    	    // MIME part boundary sentinel
    bool multiPart; 	    	    // Is message MIME multi-part ?
    bool inPartHeader;	    	    // In MIME part header ?
    unsigned int partHeaderLines;   // Number of lines in part header
    stack <string> partBoundaryStack;	// |stack| of part boundaries for \.{multipart/alternative} nesting
    
    //	MIME properties of current part
    string mimeContentType; 	    	// \.{Content-Type}
    string mimeContentTypeCharset;  	// \hskip2em \.{charset=}
    string mimeContentTypeName;  	// \hskip2em \.{name=}
    string mimeContentTypeBoundary; 	// \hskip2em \.{boundary=}
    string mimeContentTransferEncoding; // \.{Content-Transfer-Encoding}
    string mimeContentDispositionFilename; // \.{Content-Disposition} \hskip2em \.{filename=}
    
    //	MIME decoders
    MIMEdecoder *mdp;	    	    // Active MIME decoder if any
    identityMIMEdecoder imd;	    // Identity MIME decoder for testing
    base64MIMEdecoder bmd;	    // Base64 MIME decoder for testing
    sinkMIMEdecoder smd;    	    // Sink MIME decoder
    quotedPrintableMIMEdecoder qmd; // Quoted-Printable MIME decoder
    
    //	Multi-byte character set decoding
    MBCSinterpreter *mbi;   	    // Active multi-byte character set interpreter or |NULL|
    EUC_MBCSdecoder mbd_euc;	    // EUC decoder
    GB2312_MBCSinterpreter mbi_gb2312; // GB2312 interpreter
    Big5_MBCSdecoder mbd_big5;	    // Big5 decoder
    Big5_MBCSinterpreter mbi_big5;  // Big5 interpreter
    KR_MBCSinterpreter mbi_kr;	    // Korean (\.{euc-kr}) interpreter
    UTF_8_Unicode_MBCSdecoder mbd_utf_8; // Unicode UTF-8 decoder
    Unicode_MBCSinterpreter mbi_unicode;    // Unicode interpreter
    
    //	Application file string parsing
    applicationStringParser *asp;   // Application string parser or NULL if none
    flashTextExtractor aspFlash;    // Flash animation string parser
#ifdef HAVE_PDF_DECODER
    pdfTextExtractor aspPdf;	    // PDF string parser
#endif
    
    //	Byte stream decoding
    bool byteStream;	    	    // Extract probable strings from binary files ?
    
    list <string> *tlist;   	    // Message transcript list
        
    list <string> *dlist;   	    // Diagnostic message contents list
        
    mailFolder(istream &i, dictionaryWord::mailCategory cat = dictionaryWord::Unknown) {
#if defined(COMPRESSED_FILES) || defined(HAVE_DIRECTORY_TRAVERSAL)
    	ip = NULL;
#endif	
#ifdef HAVE_DIRECTORY_TRAVERSAL
    	dirFolder = false;
#endif
    	set(&i, cat);
    }
    
    mailFolder(string fname, dictionaryWord::mailCategory cat = dictionaryWord::Unknown) {
#if defined(COMPRESSED_FILES) || defined(HAVE_DIRECTORY_TRAVERSAL)
    	ip = NULL;
#endif	
	@<Check whether folder is a directory of messages@>;

#ifdef HAVE_DIRECTORY_TRAVERSAL
    	if (!dirFolder) {
#endif	
#ifdef COMPRESSED_FILES
    	    @<Check for symbolic link to compressed file@>;

	    if (jname.rfind(Compressed_file_type)  ==
    	    	    	   (jname.length() - string(Compressed_file_type).length())) {
		@<Open pipe to read compressed file@>;
	    } else {
#endif
    		if (fname == "-") {
	    	    is = &cin;
		} else {
	    	    isc.open(fname.c_str());
	    	    is = &isc;
		}
#ifdef COMPRESSED_FILES
    	    }
#endif
#ifdef HAVE_DIRECTORY_TRAVERSAL
    	}
#endif
	if (!(*is)) {
	    cerr << "Cannot open mail folder file " << fname << endl;
	    exit(1);
	}
	set (is, cat);
    }
    
    ~mailFolder() {
#ifdef COMPRESSED_FILES
    	if (ip != NULL) {
	    pclose(ip);
	}
#endif
    }
    
    void set(istream *i, dictionaryWord::mailCategory cat = dictionaryWord::Unknown) {
    	is = i;
	nLines = nMessages = 0;
	lookedAhead = false;
	lookAheadLine = "";
	category = cat;
	dlist = NULL;
	tlist = NULL;
	@<Reset MIME decoder state@>;
	bodyContentType = bodyContentTypeCharset =
	    bodyContentTypeName = bodyContentTransferEncoding = "";
	expectingNewMessage = true;
	setNewMessageEligiblity();
	BSDfolder = false;
    }
    
    void setCategory(dictionaryWord::mailCategory c) {
    	category = c;
    }
    
    dictionaryWord::mailCategory getCategory(void) const {
    	return category;
    }
    
    void setBSDmode(bool mode) {
    	BSDfolder = mode;
    }
    
    bool getBSDmode(void) const {
    	return BSDfolder;
    }
    
    void setNewMessageEligiblity(bool stat = true) {
    	lastLineBlank = stat;
    }
    
    void forceInHeader(bool state = true) {
    	inHeader = state;
    }
    
    bool nextLine(string &s);
    
    int nextByte(void);
    
#ifdef HAVE_DIRECTORY_TRAVERSAL
    bool findNextFileInDirectory(string &fname);
    bool openNextFileInDirectory(void);
#endif
    
    static void stringCanonicalise(string &s);
    
    static bool compareHeaderField(string &s, const string target, string &arg);
    
    static bool parseHeaderArgument(string &s, const string target, string &arg);
    
    static bool isSpoofedExecutableFileExtension(const string &s);
    
    bool isNewMessage(void) const {
    	return newMessage;
    }
    
    unsigned int getMessageCount(void) const {
    	return nMessages;
    }
    
    unsigned int getLineCount(void) const {
    	return nLines;
    }
    
    bool isByteStream(void) const {
    	return byteStream;
    }
       
    void describe(ostream &os = cout) const {
    	os << "Mail folder.  Category: " << dictionaryWord::categoryName(category) << endl;
    	os << "  Lines: " << getLineCount() << "  Messages: " << getMessageCount() << endl;
    }
    
    void setDiagnosticList(list <string> *lp) {
    	dlist = lp;
    }
    
    void setTranscriptList(list <string> *lp) {
    	tlist = lp;
    }
    
    unsigned int sizeMessageTranscript(const unsigned int lineOverhead = 1) const;
    void writeMessageTranscript(ostream &os = cout) const;
    void writeMessageTranscript(const string fname = "-") const;
    void clearMessageTranscript(void) {
    	assert(tlist != NULL);
	tlist->clear();
    }
    
    void reportParserDiagnostic(const string s);
    void reportParserDiagnostic(const ostringstream &os);
};
    
@
The |nextLine| method returns the next line from the mail folder
to the caller, while parsing the mail folder into headers,
recognising MIME multi-part messages and their boundaries
and encodings.  We wrap a grand |while| loop around the entire
function so code within it can ignore the current input line
(which may, depending on where you are in the process, be the
concatenation of header lines with continuations), with a
simple |continue|.
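
One of the tests made along the way is whether a line is the closing
sentinel of a multi-part message, which consists of two hyphens, the
boundary string, and two more hyphens.  A minimal stand-alone sketch of
that test follows; the function |isClosingBoundary| is hypothetical,
and the method itself performs the comparison inline with |substr|.

```cpp
#include <cassert>
#include <string>

// True if line begins with "--", the boundary string, and "--", which
// marks the end of the last part of a MIME multi-part message.
bool isClosingBoundary(const std::string &line, const std::string &boundary) {
    return (line.compare(0, 2, "--") == 0) &&
           (line.compare(2, boundary.length(), boundary) == 0) &&
           (line.compare(boundary.length() + 2, 2, "--") == 0);
}
```

The order of the comparisons matters: each later test is evaluated only
when the earlier ones have shown the line to be long enough.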

@<Class implementations@>=
    bool mailFolder::nextLine(string &s) {
    	while (true) {
    	    bool decoderEOF = false;

    	    if (lookedAhead) {
		s = lookAheadLine;
		lookedAhead = false;
	    } else {
		if (mdp != NULL) {
	    	    if ((asp != NULL) ? (!asp->nextString(s)) : (!(mdp->getDecodedLine(s)))) {
		    	if (asp != NULL) {
			    if (Annotate('d')) {
				ostringstream os;

				os << "Closing " << asp->name() << " application file decoder.";
				reportParserDiagnostic(os);
			    }
			    asp->close();
			    asp = NULL;
			}
    	    		s = mdp->getTerminatorSentinel();
			decoderEOF = mdp->isEndOfFile();
			if (decoderEOF) {
			    s = "";
			}
			if (Annotate('d')) {
			    ostringstream os;
			    
			    os << "Closing out " << mdp->name() << " decoder.  " <<
			    	  mdp->getEncodedLineCount() << " lines decoded.";
			    reportParserDiagnostic(os);
			    os.str("");
			    os << "End sentinel: " << s;
			    reportParserDiagnostic(os);
			}
    	    		@<Reset MIME decoder state@>;
    	    		inPartHeader =
    	    		    !((s.substr(0, 2) == "--") &&
		    	      (s.substr(2, partBoundary.length()) == partBoundary) &&
			      (s.substr(partBoundary.length() + 2, 2) == "--"));
			if ((!inPartHeader) && (!(partBoundaryStack.empty()))) {
		    	    partBoundary = partBoundaryStack.top();
			    partBoundaryStack.pop();
			}
		    }
		} else {
	    	    if (!getline(*is, s)) {
		    	@<Advance to next file if traversing directory@>;
			return false;
		    }
		}
	    }
	    nLines++;

    	    if (sloppyheaders) {	    

		/*  Spam seems to have begun to arrive without the required blank line
		    separating a mail message header and the first MIME separator.
		    Since the ``{\tt --}'' sequence which precedes the sentinel
		    (whatever it be) cannot appear in a legitimate header, we check
		    for it here and, if the header runs onto the sentinel, fake a
		    blank line to properly terminate the header.  */
		    
		if (inHeader && multiPart &&
		    (partBoundary != "") &&
		    (s.substr(0, 2) == "--") &&
		    (s.substr(2, partBoundary.length()) == partBoundary)) {
		    if (Annotate('d')) {
			ostringstream os;

			os << "Header runs into --" << partBoundary << " sentinel.  Adding blank line to end header.";
			reportParserDiagnostic(os);
		    }
		    assert(!lookedAhead);
		    lookedAhead = true;
		    lookAheadLine = s;
		    s = "";
		}
    	    }
	    if ((mdp == NULL) && (tlist != NULL) && (!decoderEOF)) {
		tlist->push_back(s);
	    }
	    @<Check for start of new message in folder@>;
	    @<Eliminate any trailing space from line@>;
	    @<Process message header lines@>;
	    @<Parse MIME part header@>;

	    @<Check for MIME part sentinel@>;
	    @<Decode multiple byte character set@>;
	    return true;
	}
    }
    

@
The |nextByte| method is used by the |tokenParser| when scouring
byte stream data for plausible strings.  It must only be used
when |byteStream| is set.  It returns the next byte from the
stream or, at the end of the stream, returns $-1$ and cancels |byteStream|
mode.  How we get out of here depends on a fairly intimate
mutual understanding between |mailFolder| and |tokenParser|
of each other's innards.

@<Class implementations@>=
    int mailFolder::nextByte(void) {
	assert(mdp != NULL);
	int c = mdp->getDecodedChar();
	if (c < 0) {
	    byteStream = false;
	    if (Annotate('d')) {
		ostringstream os;
			    
    	    	os << "End of byte stream.  Deactivating byte stream parser.";
		reportParserDiagnostic(os);
    	    }
	}
	return c;
    }

@
The type of compression and command required to expand compressed
files may differ from system to system.  The following code,
conditional on variables determined by the
\.{autoconf} process, defines the file suffix denoting a
compressed file and the corresponding command used to decode it.
We only support one type of compression on a given system; if
\.{gzip} is available, we use it in preference to \.{compress}.

@<Configure compression suffix and command@>=

#ifdef HAVE_POPEN
#if (defined HAVE_GUNZIP) || (defined HAVE_GZCAT) || (defined HAVE_GZIP)
    #define COMPRESSED_FILES
    static const char Compressed_file_type[] = ".gz";@#

    static const char Uncompress_command[] =@/
    #if (defined HAVE_GUNZIP)@/
        "gunzip -c"@/
    #elif (defined HAVE_GZCAT)@/
        "gzcat"@/
    #elif (defined HAVE_GZIP)@/
        "gzip -cd"@/
    #endif@/
        ;
#elif (defined HAVE_ZCAT) || (defined HAVE_UNCOMPRESS) || (defined HAVE_COMPRESS)@/
    #define COMPRESSED_FILES
    static const char Compressed_file_type[] = ".Z";@#

    static const char Uncompress_command[] =@/
    #if (defined HAVE_ZCAT)@/
        "zcat"@/
    #elif (defined HAVE_UNCOMPRESS)@/
        "uncompress -c"@/
    #elif (defined HAVE_COMPRESS)@/
        "compress -cd"@/
    #endif@/
        ;
#endif
#endif

@
Before testing whether the input file is compressed, see if the
name we were given is a symbolic link.  If so, follow the link
and test the actual file.  We only follow links up to 50 levels.  We copy
the file name given us to |jname|, then attempt to interpret it
as a symbolic link by calling |readlink|, which will fail if
the name is not, in fact, a symbolic link.  If it is, we obtain
the link destination as a \CEE/ string, which is copied into
|jname| prior to the test for a compressed file extension.

@<Check for symbolic link to compressed file@>=
#ifdef HAVE_READLINK
    int maxSlinks = 50;
    
    string jname = fname;
    char slbuf[1024];
    while (maxSlinks-- > 0) {
	int sll = readlink(jname.c_str(), slbuf, (sizeof slbuf) - 1);
	if (sll >= 0) {
	    assert(sll < static_cast<int>(sizeof slbuf));
	    slbuf[sll] = 0;
	    jname = slbuf;
	} else {
	    break;
	}
    }
    if (maxSlinks < 0) {    	    	// Count ran out without reaching a non-link
	cerr << "Warning: probable symbolic link loop for \"" <<
		fname << "\"" << endl;
    }
#endif
    
@
If our input file bears an extension which identifies it as a
compressed file, we use |popen| to create a file handle
connected to a pipe to the appropriate decompression
program.  The pipe is then screwed into the input stream
from which we subsequently read.
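
Deciding whether a name ``bears an extension'' amounts to a suffix
comparison.  The following is a minimal sketch under that reading, not
the code used by the program; |hasSuffix| is a hypothetical helper
introduced only for illustration.  It checks the lengths first, since
subtracting the suffix length from a shorter name would wrap around.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the program): true if name ends with suffix.
bool hasSuffix(const std::string &name, const std::string &suffix) {
    // Compare lengths first so the subtraction below cannot wrap around.
    if (name.length() < suffix.length()) {
        return false;
    }
    return name.compare(name.length() - suffix.length(),
                        suffix.length(), suffix) == 0;
}
```

With this helper, a test such as |hasSuffix(fname, ".gz")| would select
the pipe, and any other name would be opened directly.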

@<Open pipe to read compressed file@>=
    string cmd(Uncompress_command);
    cmd += ' ' + fname;
    ip = popen(cmd.c_str(), "r");
#ifdef HAVE_FDSTREAM_COMPATIBILITY
    iscc.attach(fileno(ip));
    is = &iscc;
#else
    isc.attach(fileno(ip));
    is = &isc;
#endif

@
Some mail systems define mail folders as directories containing
individual messages as files.  If the folder name is in fact a
directory, set up to retrieve the contents of all the files
it contains logically concatenated.

@<Check whether folder is a directory of messages@>=
#ifdef HAVE_DIRECTORY_TRAVERSAL
    dirFolder = false;
    struct stat fs;
    
    if ((stat(fname.c_str(), &fs) == 0) && S_ISDIR(fs.st_mode)) {
    	dh = opendir(fname.c_str());
	if (dh != NULL) {
	    dirFolder = true;
    	    dirName = fname;
	    pathSeparator = '/';	    	// Should detect in configuration process
    	    if (!findNextFileInDirectory(fname)) {
	    	nullstream.str("");
	    	is = &nullstream; 	// Doooh!!!  No mail messages in directory
	    } else {
	    	if (verbose) {
		    cerr << "Processing files from directory \"" <<
		    	    dirName << "\"." << endl;
		}
	    }
    	} else {
	    cerr << "Cannot open mail folder directory \"" << fname << "\"" << endl;
	    exit(1);
	}
    }
#endif

@
When reading a mail folder consisting of a directory of
individual mail messages, upon reaching the end of a message
file we wish to advance seamlessly to the next file, logically
concatenating the files in the directory.  This method, which
should be called whenever the next file in the directory is
required, searches the directory for the next eligible
file and opens it.  We return |true| if the file was opened
successfully and |false| if the end of the directory was
hit whilst looking for the next file.

@<Class implementations@>=
#ifdef HAVE_DIRECTORY_TRAVERSAL
    bool mailFolder::findNextFileInDirectory(string &fname) {
    	assert(dirFolder);
	
	if (dh == NULL) {
	    return false;   	    	// End of directory already encountered
	}
	
	while (true) {
    	    struct dirent *de;
	    struct stat fs;

    	    de = readdir(dh);
	    if (de == NULL) {
		closedir(dh);
		dh = NULL;
		return false;
	    }
	    cfName = dirName + pathSeparator + de->d_name;
    	    if (stat(cfName.c_str(), &fs) == 0) {
		if (S_ISREG(fs.st_mode)) {
		    fname = cfName;
		    return openNextFileInDirectory();
		}
	    } else {
		if (verbose) {
    	    	    cerr << "Cannot get status of " << cfName << ".  Skipping." << endl;
		}
	    }
	}    	
    }
#endif

@
Open the next file in a directory of files which constitute a logical
mail folder.  |findNextFileInDirectory| has already vetted and expanded
the path name, certifying that (at least when it checked) the target was
an extant regular file.

@<Class implementations@>=
#ifdef HAVE_DIRECTORY_TRAVERSAL
    bool mailFolder::openNextFileInDirectory(void) {
    	assert(dirFolder);
	
	if (dh == NULL) {
	    return false;
	}
	
#ifdef COMPRESSED_FILES
    	string fname = cfName;
    	@<Check for symbolic link to compressed file@>;
	
	if ((jname.length() >= (sizeof Compressed_file_type) - 1) &&
	    (jname.rfind(Compressed_file_type) ==
	    	       (jname.length() - ((sizeof Compressed_file_type) - 1)))) {
	    string cmd(Uncompress_command);
	    cmd += ' ' + fname;
	    ip = popen(cmd.c_str(), "r");
#ifdef HAVE_FDSTREAM_COMPATIBILITY
	    ifcdir.attach(fileno(ip));
	    ifcdir.clear();  	// Stupid |attach| doesn't reset |ios::eofbit|!
	    is = &ifcdir;
#else
	    ifdir.attach(fileno(ip));
	    ifdir.clear();  	// Stupid |attach| doesn't reset |ios::eofbit|!
	    is = &ifdir;
#endif
	} else {
#endif
	    ifdir.open(cfName.c_str());
	    if (!ifdir.is_open()) {
		if (verbose) {
	    	    cerr << "Unable to open mail folder directory file \"" <<
	    	    	    cfName << "\"" << endl;
		}
		return false;
	    }
    	    ifdir.clear();  	//  Clean |ios::eofbit| if |open| didn't do so
	    is = &ifdir;
#ifdef COMPRESSED_FILES
	}
#endif
    	expectingNewMessage = true; 	// Expect file to contain a new message
	setNewMessageEligiblity();
	return true;
    }
#endif

@
When we hit end of file, check whether we're traversing a directory
and, if so, advance to the next file within it.  When we
reach the end of the directory, call it quits.

@<Advance to next file if traversing directory@>=
#ifdef HAVE_DIRECTORY_TRAVERSAL
    if (dirFolder) {
    	if (ip != NULL) {
    	    pclose(ip);
	    ip = NULL;
	} else {
	    ifdir.close();  	// Close previous file from directory
	}
	if (findNextFileInDirectory(cfName)) {
	    continue;
	}
    }
#endif


@
Each message in a folder begins with a line containing the text
``\.{From\ }'' starting in the first column.  Well, more or
less$\ldots\,$.  In the beginning there were BSD mail folders,
in which messages were simply concatenated together with the
start of each message indicated by a line beginning with the
``\.{From\ }'' sentinel.  In this scheme, any line in a message
body which matches this pattern must be quoted, usually by
inserting a ``\.{>}'' character in column 1, but this is not
universal.  This was kind of ugly, and could cause problems
when messages began to contain content other than human-readable
text, so then there were Sun message folders, where each message
header indicated the number of bytes in the message with a
``\.{Content-Length}'' header item.  You can imagine how
disastrous this was in the typical \UNIX/ environment where
people pass mail folders and messages through all kinds
of text filters---'nuff said; better forgotten.  These
days the most common form of text file mail folder is a
compromise in which the basic BSD scheme is used, but the
``\.{From\ }'' sentinel only designates the start of a message
if it appears following a blank line.  This avoids the need to quote
such lines in most body copy, while remaining robust against
editing and easy for simple programs to parse.

If |BSDfolder| is set, we follow the original BSD semantics
and recognise any line beginning with ``\.{From\ }'' as starting a new message.
Otherwise, we only treat the sentinel as denoting the start
of message if it follows a blank line or appears at the
start of the folder.

Upon finding the start of a message, we increment the number of
messages in the folder, mark the start of a new message, and
set the |inHeader| flag to indicate we're parsing the header
section of the message.

One complication is that some mail systems which store messages
as files in a directory do {\it not} include the ``\.{From\ }''
sentinel at the start of message files.  We use the
|expectingNewMessage| flag to cope with this.  This flag gets
set at the start of every new file we begin to read (whether
a concatenated mail folder or a file within a directory).  When
this flag is set, the first nonempty line which does not begin with white space is considered
the start of message, even if it isn't the ``\.{From\ }'' sentinel.
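
The rule described above can be written as a single predicate.  This is
a sketch for illustration only: |startsNewMessage| and its parameter
names are hypothetical, and the standard |isspace| stands in for the
program's own |isISOspace|.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch of the start-of-message rule; all names here are illustrative.
bool startsNewMessage(const std::string &line, bool bsdFolder,
                      bool lastLineBlank, bool expectingNewMessage) {
    static const std::string sentinel = "From ";

    // The sentinel always counts in BSD mode, otherwise only after a blank line.
    bool sentinelSeen = (line.compare(0, sentinel.length(), sentinel) == 0);
    if (sentinelSeen && (bsdFolder || lastLineBlank)) {
        return true;
    }
    // At the start of a file, any line beginning with a non-space character counts.
    return expectingNewMessage && !line.empty() &&
           !std::isspace(static_cast<unsigned char>(line[0]));
}
```

A body line such as ``\.{From here on}'' is thus taken as a new message
in a BSD folder, but elsewhere only when the preceding line was blank.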

@d messageSentinel  "From " 	    // First line of each message in folder

@<Check for start of new message in folder@>=
#ifdef BSD_DIAG
if (s.substr(0, (sizeof messageSentinel) - 1) == messageSentinel) {
    if (!BSDfolder && !lastLineBlank) {
    	cerr << "*** NonBSD From line ditched: " << s << endl;
    }
}
#endif
    if (((s.substr(0, (sizeof messageSentinel) - 1) == messageSentinel) &&
    	 (BSDfolder || lastLineBlank))
    	|| 
    	(expectingNewMessage && (s.length() > 0) && (!isISOspace(s[0])))) {
	nMessages++;
	newMessage = true;
	expectingNewMessage = false;
	inHeader = true;
	multiPart = false;
	inPartHeader = false;
	partHeaderLines = 0;
	bodyContentType = bodyContentTypeCharset =
	    bodyContentTypeName = bodyContentTransferEncoding = "";
	fromLine = s;	    	    // Save last ``\.{From\ }'' line for diagnostics
	lastFromLine = lastMessageID = messageID = "";
	while (!partBoundaryStack.empty()) {
    	    ostringstream os;
	    
    	    os << "Orphaned part boundary on stack: \"" << partBoundaryStack.top() << "\"";
	    reportParserDiagnostic(os);
	    partBoundaryStack.pop();
	}
	@<Reset MIME decoder state@>;
    } else {
	newMessage = false;
    }

@
To facilitate message parsing, we delete any white space from the
ends of lines.  Mail transfer agents are explicitly permitted to
do this, and all forms of encoding are proof against it.  If the
line is blank after pruning trailing white space, we note this
to use in testing for the start of the next message for non-BSD
folders.

@<Eliminate any trailing space from line@>=
    while ((s.length() > 0) && (isISOspace(s[s.length() - 1]))) {
	s.erase(s.length() - 1);
    }
    setNewMessageEligiblity(s.empty());

@
If we're within the message header section, there are various
things we want to be on the lookout for.  First, of course, is
the blank line that denotes the end of the header.  If the
header declares the content type of the body to be MIME
multi-part, we need to save the part boundary separator
for later use.  As it happens, this code works equally well
for parsing the part headers which follow the sentinel
denoting the start of a new part in a MIME multi-part message.

@<Process message header lines@>=
    if (inHeader || inPartHeader) {
    	if (s == "") {
	    if (inHeader) {
	    	if ((!multiPart) && (bodyContentTransferEncoding != "")) {
    	    	    mimeContentType = bodyContentType;
		    mimeContentTypeCharset = bodyContentTypeCharset;
		    mimeContentTypeName = bodyContentTypeName;
		    mimeContentTransferEncoding = bodyContentTransferEncoding;
		    multiPart = true;
		    partBoundary = "";
    	    	}
    	    }
	    inHeader = inPartHeader = false;
    	    @<Activate MIME decoder if required@>;
    	}
	@<Check for continuation of mail header lines@>;
	
	@<Save Message-ID for diagnostics@>;
	@<Process multipart MIME header declaration@>;
	@<Process body content type declarations@>;
	@<Check for encoded header line and decode@>;
    }
    
@
At the end of a MIME part, switch off the decoder and reset the
part properties to void.

@<Reset MIME decoder state@>=
    mimeContentType = mimeContentTypeCharset =
    	mimeContentTypeName = mimeContentDispositionFilename =
    	mimeContentTypeBoundary = mimeContentTransferEncoding = "";
    mdp = NULL;
    mbi = NULL;
    asp = NULL;
    byteStream = false;

@
Statements in the message header section may be continued onto
multiple lines.  A continuation is denoted by white space in
the first column of the continuation line.  To simplify
header parsing, we look ahead and concatenate all
continuations into one single header statement.  The twiddling
with |lal| in the following code is to ensure the integrity
of transcripts.  We delete trailing space from the look
ahead line before concatenating it, but if we in fact looked
ahead to a line which is not a continuation, we want to
eventually save it in the transcript as it originally arrived,
complete with trailing space, so we replace it with the original
line before deleting the trailing space.
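
The concatenation of continuations can be illustrated in isolation.
The sketch below is not the program's code: |unfoldHeaders| is a
hypothetical function working on a vector of lines rather than a
stream, it uses the standard |isspace| in place of |isISOspace|, and it
omits the transcript bookkeeping.  Like the code which follows, it
appends the continuation text without inserting a separating space.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Illustrative only: join continuation lines onto the header line they extend.
std::vector<std::string> unfoldHeaders(const std::vector<std::string> &lines) {
    std::vector<std::string> out;

    for (std::string line : lines) {
        // Prune trailing white space before testing the line.
        while (!line.empty() &&
               std::isspace(static_cast<unsigned char>(line.back()))) {
            line.pop_back();
        }
        bool continuation = !line.empty() && !out.empty() &&
                            std::isspace(static_cast<unsigned char>(line[0]));
        if (continuation) {
            // Skip the leading white space and append what remains.
            std::string::size_type p = 0;
            while ((p < line.length()) &&
                   std::isspace(static_cast<unsigned char>(line[p]))) {
                p++;
            }
            out.back() += line.substr(p);
        } else {
            out.push_back(line);
        }
    }
    return out;
}
```

Given the lines ``\.{Subject: Buy}'' and a continuation containing
``\.{now}'', the result is the single statement ``\.{Subject: Buynow}''.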

@<Check for continuation of mail header lines@>=
    @<Check for lines with our sentinel already present in the header@>;
    
    while ((inHeader || inPartHeader) && getline(*is, lookAheadLine)) {
    	string lal = lookAheadLine;
	
	while ((lookAheadLine.length() > 0) && (isISOspace(lookAheadLine[lookAheadLine.length() - 1]))) {
	    lookAheadLine.erase(lookAheadLine.length() - 1);
	}
    	if ((lookAheadLine.length() > 0) && isISOspace(lookAheadLine[0])) {
	    string::size_type p = 1;
	    while (isISOspace(lookAheadLine[p])) {
	    	p++;
	    }
	    s += lookAheadLine.substr(p);
    	    if ((tlist != NULL) && (!isSpoofedHeader)) {
	    	tlist->push_back(lal);
	    }
	    continue;
	}
	lookedAhead = true;
	lookAheadLine = lal;
	break;
    }
    if (isSpoofedHeader) {
    	ostringstream os;
	
    	os << "Spoofed header rejected: " << s;
	reportParserDiagnostic(os.str());
    	continue;
    }
    
@
A clever junk mail author might try to evade filtering
based on the header items we include in the
\.{--transcript} by including his own, on the assumption
that a downstream filter would not detect the multiple
items and filter on the first one it found.  To prevent this,
and to make it more convenient when feeding transcripts back
through the program (for testing the effects of different
settings or for training on new messages), we
detect header lines which begin with our |Xfile| sentinel
and completely delete them from the transcript.  The
|isSpoofedHeader| flag causes continuation lines, if any,
to be deleted as well.  (At this writing we never use
continuations of our header items, but better safe than
sorry.)

@<Check for lines with our sentinel already present in the header@>=
    bool isSpoofedHeader = false;
    if (inHeader) {
    	string sc = s, scx = Xfile;
	
	stringCanonicalise(sc);
	stringCanonicalise(scx);
	scx += '-';
     	if (sc.substr(0, scx.length()) == scx) {
	    if (tlist != NULL) {
    	    	tlist->pop_back();
	    }
	    isSpoofedHeader = true;
	}
    }
    
@
When processing mail folders in bulk, as when generating a dictionary,
we want to identify parser diagnostics with the message which they
refer to.  While processing the header, we save the \.{Message-ID}
tag, with which |reportParserDiagnostic| prefixes the message
in its \.{--verbose} mode output.  Messages which lack a
\.{Message-ID} header item must be identified from the ``\.{From\ }'' line.
\pdfURL{RFC~2822}{http://www.ietf.org/rfc/rfc2822.txt?number=2822}
specifies that \.{Message-ID} {\it should} be present, but it remains
an optional field.

@<Save Message-ID for diagnostics@>=
    {
    	string arg;
	
	if (inHeader && compareHeaderField(s, "message-id", arg)) {
	    messageID = arg;
	    lastMessageID = "";
	}
    }

@
It is possible for the main body of a message to be
encoded with a \.{Content-Transfer-Encoding}
specification.  While encoding is usually encountered
in MIME multi-part messages, junk mail sometimes takes
advantage of encoding to hide trigger words from
content-based filters.  If the message body is encoded,
we need to interpose the appropriate filter before
parsing it.
    
@<Process body content type declarations@>=
    {
    	string arg, par;
	
    	if (compareHeaderField(s, "content-type", arg)) {
	    if (parseHeaderArgument(s, "charset", par)) {
		stringCanonicalise(par);
		bodyContentTypeCharset = par;
	    }
	    if (parseHeaderArgument(s, "name", par)) {
		bodyContentTypeName = par;
	    }
	    bodyContentType = arg;
	}
	if (inHeader && compareHeaderField(s, "content-transfer-encoding", arg)) {
    		bodyContentTransferEncoding = arg;	    
	}
    }
    
@
Message header lines may contain sequences of
characters encoded in \.{Quoted-Printable} or
\.{Base64} form (since mail headers must not contain 8 bit
characters). To better extract words from these lines, we test
for such subsequences and replace them with the decoded
text.  Because, in the fullness of time, this code
will be fed every conceivable kind of nonconforming trash, it
must be completely bulletproof. The flailing around with |p4|
protects against falling into a loop when decoding a
sequence fails.
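
To make the scanning strategy concrete, here is a reduced sketch which
handles only the \.{Quoted-Printable} (``\.{?q?}'') form and leaves
anything else untouched.  The functions |decodeQ| and
|decodeEncodedWords| are hypothetical stand-ins, not the decoder
classes used below, and no character set interpretation is attempted.
The search position is advanced past the opening ``\.{=?}'' before
anything else is tried, which is what keeps a failed decode from
looping.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Illustrative Q decoder: "_" stands for a space, "=XX" for a byte in hexadecimal.
std::string decodeQ(const std::string &text) {
    std::string out;

    for (std::string::size_type i = 0; i < text.length(); i++) {
        if (text[i] == '_') {
            out += ' ';
        } else if ((text[i] == '=') && (i + 2 < text.length()) &&
                   std::isxdigit(static_cast<unsigned char>(text[i + 1])) &&
                   std::isxdigit(static_cast<unsigned char>(text[i + 2]))) {
            out += static_cast<char>(
                std::strtol(text.substr(i + 1, 2).c_str(), NULL, 16));
            i += 2;
        } else {
            out += text[i];         // Copy malformed escapes through unchanged
        }
    }
    return out;
}

// Replace each "=?charset?q?text?=" sequence in a header line by its decoded text.
std::string decodeEncodedWords(std::string line) {
    const std::string::size_type npos = std::string::npos;
    std::string::size_type from = 0, start;

    while ((start = line.find("=?", from)) != npos) {
        from = start + 2;           // Always advance, so a failed decode cannot loop
        std::string::size_type mark = line.find('?', from);    // End of charset name
        if ((mark == npos) || (mark + 2 >= line.length()) ||
            (std::tolower(static_cast<unsigned char>(line[mark + 1])) != 'q') ||
            (line[mark + 2] != '?')) {
            continue;               // Not a Q encoded word; resume after "=?"
        }
        std::string::size_type end = line.find("?=", mark + 3);
        if (end == npos) {
            continue;               // Unterminated; leave the text as it is
        }
        std::string decoded = decodeQ(line.substr(mark + 3, end - (mark + 3)));
        line.replace(start, (end - start) + 2, decoded);
        from = start + decoded.length();
    }
    return line;
}
```

Truncated or malformed sequences fall through the |continue|
statements and are left in the line exactly as they arrived.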

@<Check for encoded header line and decode@>=
    if (inHeader) {
	string sc = s;
	string::size_type p, p1, p2, p3, p4;
	char etype;
	unsigned int ndecodes = 0;
	string charset;
	
	stringCanonicalise(sc);
	p4 = 0;
	while (((p = sc.find("=?", p4)) != string::npos)) {
    	    p4 = p + 2;
	    if (((p1 = sc.find("?q?", p4)) != string::npos) ||
	    	((p1 = sc.find("?b?", p4)) != string::npos)) {
		charset = sc.substr(p4, p1 - p4);
		etype = sc[p1 + 1];
	    	p4 = p1 + 3;
	    	if ((p2 = sc.find("?=", p4)) != string::npos) {
		    p1 += 3;
		    p3 = p2 - p1;
    	    	    string drt;
		    if (etype == 'q') {
		    	drt = quotedPrintableMIMEdecoder::decodeEscapedText(sc.substr(p1, p3), this);
		    } else {
		    	assert(etype == 'b');
		    	drt = base64MIMEdecoder::decodeEscapedText(sc.substr(p1, p3), this);
		    }
		    @<Interpret header quoted string if character set known@>;
		    sc.replace(p, (p2 - p) + 2, drt);
    	    	    p4 = p + drt.length();
		    ndecodes++;
		}
	    }
	}
	if (ndecodes > 0) {
	    s = sc;
	}	    
    }

@
After decoding the \.{Quoted-Printable} or \.{Base64} sequence from
the header line, examine its character set specification.  If it
is a character set we know how to decode and interpret,
instantiate the appropriate components and replace the decoded
sequence with its interpretation.  There is no need to further
process \.{ISO-8859} sequences.

@<Interpret header quoted string if character set known@>=
    if (charset.substr(0, 6) == "gb2312") {
	EUC_MBCSdecoder mbd_euc;	    // EUC decoder
	GB2312_MBCSinterpreter mbi_gb2312;  // GB2312 interpreter
	
	mbd_euc.setMailFolder(this);
	mbi_gb2312.setDecoder(mbd_euc);
	drt = mbi_gb2312.decodeLine(drt);
    } else if (charset == "big5") {
    	Big5_MBCSdecoder mbd_big5;          // Big5 decoder
    	Big5_MBCSinterpreter mbi_big5;      // Big5 interpreter
	
	mbd_big5.setMailFolder(this);
	mbi_big5.setDecoder(mbd_big5);
	drt = mbi_big5.decodeLine(drt);
    } else if (charset == "utf-8") {
    	UTF_8_Unicode_MBCSdecoder mbd_utf_8;	// Unicode UTF-8 decoder
    	Unicode_MBCSinterpreter mbi_unicode;	// Unicode interpreter
	
	mbd_utf_8.setMailFolder(this);
	mbi_unicode.setDecoder(mbd_utf_8);
	drt = mbi_unicode.decodeLine(drt);
    } else if (charset == "euc-kr") {
	EUC_MBCSdecoder mbd_euc;	    // EUC decoder
	KR_MBCSinterpreter mbi_kr;	    // Korean (\.{euc-kr}) interpreter
	
	mbd_euc.setMailFolder(this);
	mbi_kr.setDecoder(mbd_euc);
	drt = mbi_kr.decodeLine(drt);
    } else if ((charset.substr(0, 8) == "iso-8859") ||
    	       (charset == "us-ascii")) {
    	// No decoding or interpretation required for ISO-8859 or US-ASCII
    } else {
    	ostringstream os;
	
    	os << "Header line: no interpreter for (" << charset << ") character set.";
	reportParserDiagnostic(os.str());
    }
        
@
Here we parse interesting fields from a MIME message part
header.

@<Parse MIME part header@>=
    if (multiPart && inPartHeader) {
    	string arg, par;
	
	partHeaderLines++;
    	if (compareHeaderField(s, "content-type", arg)) {
	    if (parseHeaderArgument(s, "charset", par)) {
		stringCanonicalise(par);
		mimeContentTypeCharset = par;
	    }
	    if (parseHeaderArgument(s, "boundary", par)) {
	    	mimeContentTypeBoundary = par;
	    }
	    if (parseHeaderArgument(s, "name", par)) {
		mimeContentTypeName = par;
	    }
	    mimeContentType = arg;
	}
	
    	if (compareHeaderField(s, "content-transfer-encoding", arg)) {
    	    mimeContentTransferEncoding = arg;	    
    	}
	
    	if (compareHeaderField(s, "content-disposition", arg)) {
	    if (parseHeaderArgument(s, "filename", par)) {
		mimeContentDispositionFilename = par;
	    }
    	}
    }
    
@
A multi-part message in MIME format will contain a declaration
in the header which identifies the body as being in that format
and provides a part separator sentinel which appears before each
subsequent part.  We test for the MIME declaration and save the
part boundary sentinel for later use.
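
As an illustration of the extraction, the sketch below pulls the
boundary out of a single header line.  It is an assumption-laden
simplification: |extractBoundary| is a hypothetical name, the search is
made on a lower case copy while the result is cut from the original
(boundary strings are case sensitive), and an unquoted boundary is
assumed to run to the end of the line.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative: extract the boundary parameter from a Content-Type header line.
// Returns the empty string if the line declares no boundary.
std::string extractBoundary(const std::string &header) {
    const std::string::size_type npos = std::string::npos;
    static const std::string key = "boundary=";

    // Search a lower case copy, but cut the result from the original line.
    std::string lower = header;
    for (std::string::size_type i = 0; i < lower.length(); i++) {
        lower[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(lower[i])));
    }
    std::string::size_type p = lower.find(key);
    if (p == npos) {
        return "";
    }
    p += key.length();
    if ((p < header.length()) && (header[p] == '"')) {
        p++;                        // Skip the opening quote
        std::string::size_type q = header.find('"', p);
        return header.substr(p, (q == npos) ? npos : (q - p));
    }
    return header.substr(p);        // Unquoted: take the rest of the line
}
```

Note that the length passed to |substr| is always the distance from
the start of the boundary to its end, never an absolute position.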

@<Process multipart MIME header declaration@>=
    string::size_type p, p1;
    string arg;

    if (inHeader && compareHeaderField(s, "content-type", arg)) {
    	string sc = s;
	stringCanonicalise(sc);
    	if ((p = sc.find("multipart/", 13)) != string::npos) {
	    if ((p = sc.find("boundary=", p + 10)) != string::npos) {
	    	if (s[p + 9] == '\"') {
		    p1 = sc.find("\"", p + 10);
		    p += 10;
		} else {
		    p += 9;
		    p1 = sc.length();	    // Unquoted boundary runs to the end of the line
		}
		multiPart = true;
	    	partBoundary = s.substr(p, (p1 - p));
		if (Annotate('d')) {
		    ostringstream os;

		    os << "Multi-part boundary: \"" << partBoundary << "\"";
		    reportParserDiagnostic(os);
		}
    	    }
    	}
    }
    
@
If we're in the body of a MIME multi-part message, we must
test each line against the |partBoundary| sentinel declared
in the ``\.{Content-type:}'' header statement.  If the line
is a part boundary, we then must parse the part header
which follows.
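
The comparison against the sentinel distinguishes three cases: a line
which is not a sentinel at all, one which opens a new part, and the
closing sentinel which bears a trailing ``\.{--}''.  The following
sketch shows that classification on its own; |classifySentinel| and
the enumeration are hypothetical names, and trailing white space is
assumed to have been pruned already.

```cpp
#include <cassert>
#include <string>

enum SentinelKind { NotSentinel, PartStart, FinalSentinel };

// Illustrative: classify a body line against the current part boundary.
SentinelKind classifySentinel(const std::string &line,
                              const std::string &boundary) {
    // A sentinel is "--" followed immediately by the boundary string.
    if (boundary.empty() ||
        (line.compare(0, 2, "--") != 0) ||
        (line.compare(2, boundary.length(), boundary) != 0)) {
        return NotSentinel;
    }
    // Whatever follows the boundary decides between a new part and the end.
    std::string tail = line.substr(2 + boundary.length());
    return (tail == "--") ? FinalSentinel : PartStart;
}
```

Only the |PartStart| case calls for parsing a part header; the final
sentinel ends the multi-part body.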

@<Check for MIME part sentinel@>=
    if (multiPart && (!inHeader) && 
    	(partBoundary != "") &&
    	(s.substr(0, 2) == "--") &&
    	(s.substr(2, partBoundary.length()) == partBoundary) &&
	(s.substr(partBoundary.length() + 2) != "--")) {
    	inPartHeader = true;
	mimeContentType = mimeContentTypeCharset = mimeContentTypeBoundary =
	mimeContentTransferEncoding = "";
    }
    
@
If we're in the body of text encoded in a multiple-byte character
set, pass the text through the interpreter to convert it into
a form we can better recognise.

@<Decode multiple byte character set@>=
    if ((mbi != NULL) && (!inHeader) && (!inPartHeader)) {
    	s = mbi->decodeLine(s);
    }
    
@
If we've just reached the end of a MIME part header, determine if the
body which follows requires decoding.  If so, activate the
appropriate decoder and place it in the pipeline between the
raw mail folder and our parsing code.

@<Activate MIME decoder if required@>=
    if (multiPart) {
	assert(mdp == NULL);

#ifdef TYPE_LOG
    	/* If |TYPE_LOG| is defined, we create a file containing all
	   of the part properties we've seen.  You can obtain a list
	   of things you may need to worry about by processing
	   one of the fields $n$ of this file with a command
	   like \.{cut -f$n$ /tmp/typelog.txt \vbar{} sort \vbar{} uniq}. */
    	typeLog << mimeContentType << "\t" <<
	    	   mimeContentTypeCharset << "\t" <<
		   mimeContentTransferEncoding << endl;
#endif

    	@<Check for change of sentinel within message@>;
	
	@<Check for application file types for which we have a decoder@>;
	
	@<Detect binary parts worth parsing for embedded ASCII strings@>;
    
    	@<Test for Content-Types we always ignore@>@;
	
	@<Process Content-Types we are interested in parsing@>;
    }
    
@
The sentinel which delimits parts of a multi-part message may
be changed in the middle of the message by a \.{Content-Type}
of \.{multipart/alternative} specifying a new \.{boundary=}. 
Detect this and change the part boundary on the fly. These
parts usually seem devoid of content, but just in case fake a
content type of \.{text/plain} so anything which may be there
gets looked at.
       
@<Check for change of sentinel within message@>=
    if (mimeContentType == "multipart/alternative") {
	if (mimeContentTypeBoundary != "") {
	    partBoundaryStack.push(partBoundary);
	    partBoundary = mimeContentTypeBoundary;
	} else {
	    if (Annotate('d')) {
		ostringstream os;

		os << "Boundary missing from Content-Type of multipart/alternative.";
		reportParserDiagnostic(os);
	    }
	}
    }
    
@
We have decoders for certain application file types.  Check the
\.{Content-Type} for types we can decode, and if it's indeed
one we can, splice the appropriate decoder into the pipeline.

@<Check for application file types for which we have a decoder@>=
#ifdef HAVE_PDF_DECODER
    if (mimeContentType == "application/pdf") {
	asp = &aspPdf;
    } else
#endif
    	   if ((mimeContentType == "application/x-shockwave-flash") ||
	       (mimeContentType == "image/vnd.rn-realflash")) {
	asp = &aspFlash;
    }
    if (asp != NULL) {
	asp->setMailFolder(this);
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Activating " << asp->name() << " application file decoder.";
	    reportParserDiagnostic(os);
	}
    }
    
@
Certain MIME \.{Content-Type} declarations denote binary files
best classified by parsing them for ASCII strings.  Test for
such files and invoke the requisite decoder unless binary
stream parsing has been disabled by setting
|streamMinTokenLength| to zero or the file is already
scheduled for parsing by an application-specific string
parser.

Thanks to a hideous design error in Microsoft Outlook, mail worms
can spoof the test for executable content by declaring an attachment
as an innocuous file type such as an image or audio file, and then cause
it to be executed simply by specifying a file name with one of the many
Microsoft executable file extensions.  We check for such spoofed
attachments and pass them through the byte stream parser as well.

@<Detect binary parts worth parsing for embedded ASCII strings@>=
    if ((asp == NULL) && (streamMinTokenLength > 0) &&
	    ((mimeContentType.substr(0, 12) == "application/") ||
	     (((mimeContentType.substr(0, 6) == "audio/") ||
	       (mimeContentType.substr(0, 6) == "image/")) &&
	      (isSpoofedExecutableFileExtension(mimeContentTypeName) ||
	       isSpoofedExecutableFileExtension(mimeContentDispositionFilename))
	     )
	    )
        ) {
//cout << "* * *  Content-type name = \"" << mimeContentTypeName << "\"" << endl;
//cout << "* * *  Content-Disposition filename = \"" << mimeContentDispositionFilename << "\"" << endl;
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Activating byte stream parser for \"" << mimeContentType << "\"";
	    reportParserDiagnostic(os);
	}
	byteStream = true;
    }
@
Test for Content-Types we are never interested in parsing,
regardless of their encoding. This includes images, video, and
most application-specific files which \UNIX/ \.{strings} would
make no sense of.  These parts are dispatched to the sink
decoder for disposal.  Note that some of these items may
be compressed files and/or archives (\.{zip}, \.{gzip}, \.{tar}, etc.)
which might be comprehensible if we could enlist the appropriate
utilities, but we'll defer that refinement for now.

@<Test for Content-Types we always ignore@>=
    if (Annotate('d')) {
	ostringstream os;

    	reportParserDiagnostic("");
	os << "mimeContentType: {" << mimeContentType << "}";
	reportParserDiagnostic(os);
	os.str("");
	os << "mimeContentTypeCharset: {" << mimeContentTypeCharset << "}";
	reportParserDiagnostic(os);
	os.str("");
	os << "mimeContentTransferEncoding: {" << mimeContentTransferEncoding << "}";
	reportParserDiagnostic(os);
    }
	
    if ((asp == NULL) &&
    	((mimeContentType.substr(0, 6) == "image/") ||
	 (mimeContentType.substr(0, 6) == "video/"))
       ) {
    	smd.set(is, this, partBoundary, tlist);
	mdp = &smd;
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Activating MIME sink decoder with sentinel: \"" << partBoundary <<
	    	  "\" due to Content-Type = " << mimeContentType;
	    reportParserDiagnostic(os);
	}
    	if (dlist) {
	    dlist->push_back(Xfile + "-Decoder: Sink");
	}
    }

@
Next, check for content types we're always interested in
parsing.  This includes most forms labeled as text
and embedded mail messages.  If the content is of
interest but is encoded, make sure we have the requisite
decoder and, if so, plumb it into the pipeline.
	   
@<Process Content-Types we are interested in parsing@>=
    else if (byteStream || (asp != NULL) ||
	     (mimeContentType == "plain/txt") ||
	     (mimeContentType.substr(0, 5) == "text/") ||
	     (mimeContentType == "message/rfc822")) {	    

	@<Test for multiple byte character sets and activate decoder if available@>; 

    	@<Verify Content-Transfer-Encoding and activate decoder if necessary@>;

    	@<Cancel byte stream interpretation for non-binary encoded parts@>;
    
    	@<Test for message/rfc822 embedded as part@>;
    }

@
Just because we're {\it interested} in the contents
of this part doesn't necessarily mean we can
{\it comprehend} it.  It must be encoded in a
form we can either read directly or have a decoder
for, and it must be in a character set we
understand.  We begin by testing the character set,
accepting only those we read directly or have
interpreters for.
	       
@<Test for multiple byte character sets and activate decoder if available@>=
    bool gibberish = false; 

    if (mimeContentTypeCharset.substr(0, 6) == "gb2312") {
    	mbd_euc.setMailFolder(this);
    	mbi_gb2312.setDecoder(mbd_euc);
	mbi = &mbi_gb2312;
    }

    if (mimeContentTypeCharset == "big5") {
    	mbd_big5.setMailFolder(this);
    	mbi_big5.setDecoder(mbd_big5);
	mbi = &mbi_big5;
    }

    if (mimeContentTypeCharset == "utf-8") {
    	mbd_utf_8.setMailFolder(this);
    	mbi_unicode.setDecoder(mbd_utf_8);
	mbi = &mbi_unicode;
    }

    if (mimeContentTypeCharset == "euc-kr") {
    	mbd_euc.setMailFolder(this);
    	mbi_kr.setDecoder(mbd_euc);
	mbi = &mbi_kr;
    }

#ifdef CHECK_FOR_GIBBERISH_CHARACTER_SETS  
    if ((mimeContentTypeCharset.length() == 0) ||
	(mimeContentTypeCharset == "us-ascii") ||
	(mimeContentTypeCharset.substr(0, 8) == "iso-8859") ||
	(mimeContentTypeCharset == "windows-1251")) {
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Accepting part in Content-Type-Charset: " << mimeContentTypeCharset << "  (" <<
	    	  mimeContentType << " " << mimeContentTransferEncoding << ")";
	    reportParserDiagnostic(os);
	}
    } else {
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Rejecting part in Content-Type-Charset: " << mimeContentTypeCharset << "  (" <<
	    	   mimeContentType << " " << mimeContentTransferEncoding << ")";
	    reportParserDiagnostic(os);
	}
    	gibberish = true;
    }
#endif

@
If the contents appear to be in a character set we understand,
we still aren't home free---the part may be encoded in a manner
for which we lack a decoder.  Analyse the
\.{Content-Transfer-Encoding} specification and select the
appropriate decoder. If we lack a decoder, we must regretfully
consign the part to the sink decoder.

If we end up accreting any additional decoders, this should
probably be re-written to look up the decoder in a
|map<string, MIMEdecoder *>| and use common code for
every decoder.

@<Verify Content-Transfer-Encoding and activate decoder if necessary@>=
    if (!gibberish) {
    	if ((mimeContentTransferEncoding.length() == 0) ||
	    (mimeContentTransferEncoding.substr(0, 4) == "7bit") ||
	    (mimeContentTransferEncoding.substr(0, 4) == "8bit") ||
	    (mimeContentTransferEncoding == "ascii")) {
    	    imd.set(is, this, partBoundary, tlist); // Identity
	    mdp = &imd;
	} else if (mimeContentTransferEncoding == "base64") {
    	    bmd.set(is, this, partBoundary, tlist); // \.{Base64}
	    mdp = &bmd;
	} else if (mimeContentTransferEncoding == "quoted-printable") {
    	    qmd.set(is, this, partBoundary, tlist); // \.{Quoted-Printable}
	    mdp = &qmd;
	} else {
	    gibberish = true;
    	    smd.set(is, this, partBoundary, tlist); // Sink
	    mdp = &smd;
	}
	
	assert(mdp != NULL);
	if (Annotate('d')) {
	    ostringstream os;

	    os << (gibberish ? "Rejecting" : "Accepting") <<
	    	  " part in Content-Transfer-Encoding: " << mimeContentTransferEncoding << "  (" <<
	          mimeContentTypeCharset << " " << mimeContentType << ")";
	    reportParserDiagnostic(os);
	}
    	if (dlist) {
	    dlist->push_back(Xfile + "-Decoder: " + mdp->name());
	}
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Activating MIME " << mdp->name() << " decoder with sentinel: " << partBoundary;
	    reportParserDiagnostic(os);
	}
    }
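
@
The table-driven lookup suggested above might be sketched as follows.
This is only an illustration under stated assumptions: |DecoderKind| and
|selectDecoder| are hypothetical names standing in for the real decoder
objects, and the encoding name is assumed to have been canonicalised to
lower case already.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the identity, Base64, Quoted-Printable and sink decoders.
enum class DecoderKind { Identity, Base64, QuotedPrintable, Sink };

// Assumption: the encoding name arrives already canonicalised to lower case.
inline DecoderKind selectDecoder(const std::string &encoding) {
    static const std::map<std::string, DecoderKind> table = {
        {"", DecoderKind::Identity},
        {"ascii", DecoderKind::Identity},
        {"base64", DecoderKind::Base64},
        {"quoted-printable", DecoderKind::QuotedPrintable}
    };

    // The 7bit and 8bit encodings are matched on their first four characters.
    if (encoding.compare(0, 4, "7bit") == 0) {
        return DecoderKind::Identity;
    }
    if (encoding.compare(0, 4, "8bit") == 0) {
        return DecoderKind::Identity;
    }

    const std::map<std::string, DecoderKind>::const_iterator it = table.find(encoding);
    if (it == table.end()) {
        return DecoderKind::Sink;       // No decoder available: discard the part
    }
    return it->second;
}
```

Adding a decoder would then amount to adding one row to the table, with
the diagnostic and activation code shared by every entry.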

@
If we think we're about to process a byte stream, but it isn't
actually encoded, think again and treat the content as regular
text, which it in all likelihood actually is.

@<Cancel byte stream interpretation for non-binary encoded parts@>=
    if (byteStream && (mdp == NULL)) {
	if (Annotate('d')) {
	    ostringstream os;

	    os << "Canceling byte stream mode due to Content-Transfer-Encoding: {" <<
    	    	  mimeContentTransferEncoding << "}  (" <<
    	    	  mimeContentTypeCharset << " " << mimeContentType << ")";
	    reportParserDiagnostic(os);
	}
	byteStream = false;
    }
    
@
The \.{Content-Type} of ``\.{message/rfc822}'' permits one MIME message
to be embedded into another.  This is commonly used when forwarding
messages and to return the original message when sending a bounce
back to the sender.  Upon encountering an embedded message, we reset
the MIME decoder, then force the parser back into the state of
processing a message header.  This will cause any \.{Content-Type}
specifying a \.{boundary} in the embedded message to be parsed,
permitting us to properly decode MIME parts belonging to the embedded
message.

@<Test for message/rfc822 embedded as part@>=
    if (mimeContentType == "message/rfc822") {
    	@<Reset MIME decoder state@>;
    	forceInHeader();
    }

@
Canonicalise a string in place to all lower-case characters.
This works for ISO-8859 accented letters as well as
ASCII, although such characters should not appear as raw text
within header items.  This is a |static| method and may be
used without reference to a |mailFolder| object.

@<Class implementations@>=
    void mailFolder::stringCanonicalise(string &s)
    {
    	for (unsigned int i = 0; i < s.length(); i++) {
	    if (isISOupper(s[i])) {
	    	s[i] = toISOlower(s[i]);
	    }
	}
    }
    
@
To facilitate parsing of header fields, this static method
performs a case-insensitive test for header field |target|
(which must be supplied in lower case) and, if it is found,
stores its argument into |arg|, set to canonical lower case.

@<Class implementations@>=
    bool mailFolder::compareHeaderField(string &s, const string target, string &arg)
    {
	if (s.length() > target.length()) {
    	    string sc = s;

	    stringCanonicalise(sc);
	    if ((sc.substr(0, target.length()) == target) &&
		(sc[target.length()] == ':')) {
		unsigned int i;
		
		for (i = target.length() + 1; i < sc.length(); i++) {
		    if (!isISOspace(sc[i])) {
		    	break;
		    }
		}
		if (i < sc.length()) {
		    int n = 0;
		    
		    while ((i + n) < sc.length()) {
		    	if (isISOspace(sc[i + n]) || (sc[i + n] == ';')) {
			    break;
			}
			n++;
		    }
		    arg = sc.substr(i, n);
		} else {
		    arg = "";
		}
	    	return true;
	    }
	}
    	return false;
    }
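
@
To make the behaviour of the header field test concrete, here is a
simplified, free-standing sketch.  It is not the method above: the names
|lowerAscii| and |headerField| are hypothetical, and case folding is
restricted to ASCII, whereas |stringCanonicalise| also folds ISO-8859
accented letters.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// ASCII-only stand-in for stringCanonicalise (an assumption of this sketch).
inline std::string lowerAscii(std::string s) {
    for (std::string::size_type i = 0; i < s.length(); i++) {
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

// Match "target:" at the start of the line and return the first word after it,
// in lower case.  The target must itself be given in lower case.
inline bool headerField(const std::string &line, const std::string &target,
                        std::string &arg) {
    const std::string lc = lowerAscii(line);
    if (lc.length() <= target.length()) return false;
    if (lc.compare(0, target.length(), target) != 0) return false;
    if (lc[target.length()] != ':') return false;

    // Skip white space after the colon
    std::string::size_type i = target.length() + 1;
    while (i < lc.length() && std::isspace(static_cast<unsigned char>(lc[i]))) {
        i++;
    }

    // The argument runs up to the next white space or semicolon
    std::string::size_type n = 0;
    while ((i + n) < lc.length()) {
        const unsigned char c = static_cast<unsigned char>(lc[i + n]);
        if (std::isspace(c)) break;
        if (c == ';') break;
        n++;
    }
    arg = lc.substr(i, n);
    return true;
}
```

Given the line ``\.{Content-Type: Text/HTML; charset=x}'' and the target
``\.{content-type}'', the sketch stores ``\.{text/html}'' into the
argument; any parameters after the semicolon are left for a separate
argument parser.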
    
@
This static method tests for an argument to a header field and
stores the argument, if present, into |arg|.  The argument
name is canonicalised to lower case, but the argument is
left as-is.  Quotes are deleted from quoted arguments.

@<Class implementations@>=
    bool mailFolder::parseHeaderArgument(string &s, const string target, string &arg)
    {
	if (s.length() > target.length()) {
    	    string sc = s;
	    string::size_type p, p1;

	    stringCanonicalise(sc);
	    if (((p = sc.find(target)) != string::npos) &&
	    	(sc.length() > (p + target.length())) &&
		(sc[p + target.length()] == '=')) {
		p += target.length() + 1;
		if (p < s.length()) {
		    if (s[p] == '"') {
		    	if ((p1 = s.find('"', p + 1)) != string::npos) {
			    arg = s.substr(p + 1, p1 - (p + 1));
			    return true;
			}
		    } else {
			string::size_type i = p;

			for (; i < s.length(); i++) {
			    if (!isISOspace(s[i])) {
		    		break;
			    }
			}
			if (i < s.length()) {
			    int n = 0;

			    while ((i + n) < s.length()) {
		    		if ((isISOspace(s[i + n])) || (s[i + n] == ';')) {
				    break;
				}
				n++;
			    }
			    arg = s.substr(i, n);
			} else {
			    arg = "";
			}
	    		return true;
		    }
		}
	    }
	}
    	return false;
    }
    
@
Certain versions of Microsoft Outlook contain a horrific bug where
Outlook decides whether an attachment is executable based on its
``\.{Content-Type}'' declaration, but then actually decides whether to
execute it based on its ``file type'' (the extension on the file
name, for example ``\.{.EXE}'').  Predictably, mail worm programs
exploit this by tagging their payload as an innocuous file type
such as an audio or image file, but with an executable extension.

This static method tests an attachment's name against a list of
vulnerable extensions.  If it matches, this is almost certainly
a worm, which we should filter through the byte stream parser
rather than process normally.  This will crack out the strings
embedded in the worm, which will help us to fingerprint subsequent
worms of the same type.

The list of vulnerable extensions was compiled empirically from
examining mail worms collected over a three-year period.  I do not
know if the list is exhaustive; Microsoft vulnerability experts aware
of any I omitted are encouraged to let me know about them.

@<Class implementations@>=
    bool mailFolder::isSpoofedExecutableFileExtension(const string &s)
    {
    	string sc = s;
	
	stringCanonicalise(sc);
	if ((sc.length() > 4) && (sc[sc.length() - 4] == '.')) {
	    string ext = sc.substr(sc.length() - 3);
	    stringCanonicalise(ext);
	    return ((ext == "exe") ||
	    	    (ext == "bat") ||
		    (ext == "scr") ||
		    (ext == "lnk") ||
		    (ext == "pif") ||
		    (ext == "com"));
    	}
	return false;
    }
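
@
The extension test can be tried in isolation with a sketch such as the
one below.  It is a set-based variant for illustration only:
|looksLikeSpoofedExecutable| is a hypothetical name, and case folding is
limited to ASCII, which suffices for the extensions listed.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Returns true when the name ends in a dot followed by one of the
// three-letter extensions from the list above, compared without regard to case.
inline bool looksLikeSpoofedExecutable(const std::string &name) {
    static const std::set<std::string> risky = {
        "exe", "bat", "scr", "lnk", "pif", "com"
    };

    if (name.length() <= 4) return false;               // Need a base name too
    if (name[name.length() - 4] != '.') return false;   // Three-letter extensions only

    std::string ext = name.substr(name.length() - 3);
    for (std::string::size_type i = 0; i < ext.length(); i++) {
        ext[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(ext[i])));
    }
    return risky.count(ext) != 0;
}
```

A name such as ``\.{holiday.jpg.SCR}'' is flagged because only the final
extension is examined, which is exactly the disguise the worms rely upon.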
    
@
Calculate the size in bytes of the message transcript if written
to a monolithic file with |lineOverhead| bytes (by default 1)
per line.

@<Class implementations@>=
    unsigned int mailFolder::sizeMessageTranscript(const unsigned int lineOverhead) const {
    	assert(tlist != NULL);
    	unsigned int n = tlist->size(), totsize = 0;
	if ((n > 1) &&
	    (tlist->back().substr(0, (sizeof messageSentinel) - 1) == messageSentinel)) {
	    n--;
	}
	list<string>::iterator p = tlist->begin();
    	for (unsigned int i = 0; i < n; i++) {
	    totsize += p->length() + lineOverhead;
	    p++;
	}
	return totsize;
    }
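
@
The size computation is a simple sum, shown here on a plain list of
lines.  This sketch uses a hypothetical free function, |transcriptSize|,
and deliberately omits the handling of a trailing message sentinel
performed by the method above.

```cpp
#include <cassert>
#include <list>
#include <string>

// Sum of line lengths plus a fixed per-line overhead (1 for LF, 2 for CR/LF).
inline unsigned int transcriptSize(const std::list<std::string> &lines,
                                   unsigned int lineOverhead = 1) {
    unsigned int total = 0;
    for (std::list<std::string>::const_iterator p = lines.begin();
         p != lines.end(); ++p) {
        total += static_cast<unsigned int>(p->length()) + lineOverhead;
    }
    return total;
}
```

For the three lines ``\.{From: a}'', an empty line, and ``\.{hi}'', the
result is 12 with the default overhead and 15 with an overhead of 2.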

@
Write the message transcript saved in |tlist| to the designated
file name |fname|.  If |fname| is ``\.{-}'', the transcript is
written to standard output.  Depending upon their provenance,
transcripts may or may not contain the POP3 line end terminator
CR at the end of lines.  We append the line feed, which automatically
provides the correct line termination for \UNIX/ mail folders
and the CR/LF required for POP3 messages.

@<Class implementations@>=
    void mailFolder::writeMessageTranscript(ostream &os) const {
    	assert(tlist != NULL);
    	unsigned int n = tlist->size();
	if ((n > 1) &&
	    (tlist->back().substr(0, (sizeof messageSentinel) - 1) == messageSentinel)) {
	    n--;
	}
	list<string>::iterator p = tlist->begin();
    	for (unsigned int i = 0; i < n; i++) {
	    os << *p++ << endl;
	}
    }
    
    void mailFolder::writeMessageTranscript(const string fname) const {
	if (fname != "-") {
    	    ofstream of(fname.c_str());
	    writeMessageTranscript(of);
	    of.close();
	} else {
	    writeMessageTranscript(cout);
	}
    }

@
When we detect an error within the message, it's reported
to standard error if we're in |verbose| mode and appended to
the |parserDiagnostics| for inclusion in the transcript if
the ``\.{p}'' or ``\.{d}'' annotation is selected.  This method is
|public| so higher-level parsing routines can use it
to append their own diagnostics.  Since in many cases we
compose the diagnostic in an |ostringstream|, we overload
a variant which accepts one directly as an argument.

@<Class implementations@>=
    void mailFolder::reportParserDiagnostic(const string s) {
    	if (verbose) {
	    if ((lastFromLine != fromLine) || (lastMessageID != messageID)) {
	    	cerr << fromLine << endl;
		if (messageID != "") {
	    	    cerr << "Message-ID: " << messageID << ":" << endl;
		}
		lastFromLine = fromLine;
		lastMessageID = messageID;
	    }
	    cerr << "    " << s << endl;
	}
    	if (Annotate('p') || Annotate('d')) {
	    parserDiagnostics.push(s);
	}
    }
    
    void mailFolder::reportParserDiagnostic(const ostringstream &os) {
    	reportParserDiagnostic(os.str());
    }
    
@** Token definition.

A |tokenDefinition| object provides the means by which the
|tokenParser| (below) distinguishes tokens in a stream of
text.  Tokens are defined by three arrays, each indexed by
ISO character codes between 0 and 255.  The first, |isToken|,
is |true| for characters which comprise tokens.  The second,
|notExclusively|, is |true| for characters which may appear
in tokens, but only in the company of other characters.  The
third, |notAtEnd|, is |true| for characters which may appear
within a token, but not at the start or the end of one.

@<Class definitions@>=
class tokenDefinition {
protected:@/
    static const int numTokenChars = 256;
    bool isToken[numTokenChars],
    	 notExclusively[numTokenChars],
	 notAtEnd[numTokenChars];
    unsigned int minTokenLength, maxTokenLength;

public:@/
    tokenDefinition() {
    	clear();
    }
    
    void clear(void) {
    	for (int i = 0; i < numTokenChars; i++) {
	    isToken[i] = notExclusively[i] = notAtEnd[i] = false;
	}
	setLengthLimits(1, 65535);
    }
    
    void setLengthLimits(unsigned int lmin = 0, unsigned int lmax = 0) {
    	if (lmin > 0) {
	    minTokenLength = lmin;
	}
	if (lmax > 0) {
	    maxTokenLength = lmax;
	}
    }
    
    unsigned int getLengthMin(void) const {
    	return minTokenLength;
    }
    
    unsigned int getLengthMax(void) const {
    	return maxTokenLength;
    }
    
    bool isTokenMember(const int c) const {
    	assert(c >= 0 && c < numTokenChars);
	return isToken[c];
    }
    
    bool isTokenNotExclusively(const int c) const {
    	assert(c >= 0 && c < numTokenChars);
	return notExclusively[c];
    }
    
    bool isTokenNotAtEnd(const int c) const {
    	assert(c >= 0 && c < numTokenChars);
	return notAtEnd[c];
    }
    
    bool isTokenLengthAcceptable(string::size_type@, l) const {
    	return (l >= minTokenLength) && (l <= maxTokenLength);
    }
    
    bool isTokenLengthAcceptable(const string t) const {
    	return isTokenLengthAcceptable(t.length());
    }
    
    void setTokenMember(bool v, const int cstart, const int cend = -1) {
    	assert(cstart >= 0 && cstart < numTokenChars);
    	assert((cend == -1) || (cend >= cstart && cend < numTokenChars));
	const int clast = (cend == -1) ? cstart : cend; // Default range is just |cstart|
	for (int i = cstart; i <= clast; i++) {
	    isToken[i] = v;
	}
    }
    
    void setTokenNotExclusively(bool v, const int cstart, const int cend = -1) {
    	assert(cstart >= 0 && cstart < numTokenChars);
    	assert((cend == -1) || (cend >= cstart && cend < numTokenChars));
	const int clast = (cend == -1) ? cstart : cend; // Default range is just |cstart|
	for (int i = cstart; i <= clast; i++) {
	    notExclusively[i] = v;
	}
    }
    
    void setTokenNotAtEnd(bool v, const int cstart, const int cend = -1) {
    	assert(cstart >= 0 && cstart < numTokenChars);
    	assert((cend == -1) || (cend >= cstart && cend < numTokenChars));
	const int clast = (cend == -1) ? cstart : cend; // Default range is just |cstart|
	for (int i = cstart; i <= clast; i++) {
	    notAtEnd[i] = v;
	}
    }
    
    void setISO_8859defaults(unsigned int lmin = 0, unsigned int lmax = 0);
    void setUS_ASCIIdefaults(unsigned int lmin = 0, unsigned int lmax = 0);
};

@
Initialise a |tokenDefinition| for parsing ISO-8859 text with
our chosen defaults for punctuation embedded in such tokens.
Any pre-existing definitions are cleared.

@<Class implementations@>=
    void tokenDefinition::setISO_8859defaults(unsigned int lmin, unsigned int lmax) {
    	clear();
    	setLengthLimits(lmin, lmax);
	for (unsigned int c = 0; c < 256; c++) {
    	    isToken[c] = (isascii(c) && isdigit(c)) || isISOalpha(c) ||
	    	    		 (c == '-') || (c == '\'') || (c == '$');
	    notExclusively[c] = (isdigit(c) || (c == '-')) ? 1 : 0;
	}
#define CI(x)	static_cast<int>(x)
	notAtEnd[CI('-')] = notAtEnd[CI('\'')] = true;
#undef CI
    }
    

@
Initialise a |tokenDefinition| for parsing US-ASCII text with
our chosen defaults for punctuation embedded in such tokens.
Any pre-existing definitions are cleared.

@<Class implementations@>=
    void tokenDefinition::setUS_ASCIIdefaults(unsigned int lmin, unsigned int lmax) {
    	clear();
    	setLengthLimits(lmin, lmax);
	for (unsigned int c = 0; c < 128; c++) {
    	    isToken[c] = isalpha(c) || isdigit(c) ;
	    notExclusively[c] = (isdigit(c) || (c == '-')) ? 1 : 0;
	}
#define CI(x)	static_cast<int>(x)
	isToken[CI('_')] = notExclusively[CI('_')] = true;
	notAtEnd[CI('-')] = notAtEnd[CI('\'')] = true;
#undef CI
    }
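
@
The interplay of the three character classes is easiest to see on a
small example.  The following sketch is a simplified, ASCII-only
analogue, not the parser used by this program: the functions |isMember|,
|isNotExclusively|, |isNotAtEnd| and |tokenise| are hypothetical, the
rules are loosely modelled on the ISO-8859 defaults, length limits are
ignored, and the ``not exclusively'' count is taken after trailing
characters have been pruned.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Characters which may form part of a token.
static bool isMember(unsigned char c) {
    if (std::isalnum(c)) return true;
    return std::string("-'$").find(static_cast<char>(c)) != std::string::npos;
}

// Characters permitted in a token, but only in the company of others.
static bool isNotExclusively(unsigned char c) {
    if (std::isdigit(c)) return true;
    return c == '-';
}

// Characters permitted within a token, but not at its start or end.
static bool isNotAtEnd(unsigned char c) {
    if (c == '-') return true;
    return c == '\'';
}

static std::vector<std::string> tokenise(const std::string &line) {
    std::vector<std::string> tokens;
    std::string::size_type p = 0;

    while (p < line.length()) {
        const unsigned char c = static_cast<unsigned char>(line[p]);
        if (!isMember(c)) { p++; continue; }    // Skip separators
        if (isNotAtEnd(c)) { p++; continue; }   // May not start a token

        std::string token;                      // Accumulate, folding to lower case
        while (p < line.length()) {
            const unsigned char m = static_cast<unsigned char>(line[p]);
            if (!isMember(m)) break;
            token += static_cast<char>(std::tolower(m));
            p++;
        }
        while (!token.empty()) {                // Prune forbidden trailing characters
            if (!isNotAtEnd(static_cast<unsigned char>(token[token.length() - 1]))) break;
            token.erase(token.length() - 1);
        }

        std::string::size_type necount = 0;     // Count "not exclusively" characters
        for (std::string::size_type i = 0; i < token.length(); i++) {
            if (isNotExclusively(static_cast<unsigned char>(token[i]))) necount++;
        }
        if (necount == token.length()) continue;
        tokens.push_back(token);
    }
    return tokens;
}
```

With these rules a telephone number such as ``\.{555-1212}'' is
discarded, since every character in it is of the ``not exclusively''
class, while ``\.{it's}'' survives intact because the apostrophe is
embedded within the token.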


@** Token parser.

A |tokenParser| reads lines from a |mailFolder| and returns
tokens as defined by its active |tokenDefinition|.  Separate
|tokenDefinition|s can be defined for use while parsing regular
text and binary byte streams, respectively. A |tokenParser| has
the ability to save the lines parsed from a message in a
|messageQueue|, permitting subsequent analysis.  Note
that what is saved is ``what the parser saw''---after MIME
decoding or elision of ignored parts.

@<Class definitions@>=
class tokenParser {
protected:@/
    mailFolder *source;
    string cl;
    string::size_type clp;
    bool atEnd, inHTML, inHTMLcomment;
    tokenDefinition *td;    	    // Token definition for text mode
    tokenDefinition *btd;   	    // Token definition for byte stream parsing
    
    bool saveMessage;	    	    // Save current message in |messageQueue| ?
    
    bool assemblePhrases;   	    // Are we assembling phrases ?
    deque <string> phraseQueue;     // Phrase assembly queue
    deque <string> pendingPhrases;  // Queue of phrases awaiting return

public:@/
    list <string> messageQueue;     // Current message
    
    tokenParser() {
    	source = NULL;	    	    // So the |assert|s below can catch unset pointers
    	td = btd = NULL;
    }

    void setSource(mailFolder &mf) {
    	source = &mf;
	cl = "";
	clp = 0;
	atEnd = inHTML = inHTMLcomment = false;
	saveMessage = false;
	messageQueue.clear();
	phraseQueue.clear();
	pendingPhrases.clear();
	@<Check phrase assembly parameters and activate if required@>;
    }
    
    void setTokenDefinition(tokenDefinition &t, tokenDefinition &bt) {
    	td = &t;
	btd = &bt;
    }
    
    void setTokenLengthLimits(unsigned int lMax, unsigned int lMin = 1,
    	    	    	      unsigned int blMax = 1, unsigned int blMin = 1) {
    	assert(td != NULL);
    	td->setLengthLimits(lMin, lMax);
    	assert(btd != NULL);
    	btd->setLengthLimits(blMin, blMax);
    }
    
    unsigned int getTokenLengthMin(void) const {
    	return td->getLengthMin();
    }
    
    unsigned int getTokenLengthMax(void) const {
    	return td->getLengthMax();
    }
    
    void reportParserDiagnostic(const string s) const {
    	assert(source != NULL);
	source->reportParserDiagnostic(s);
    }
    
    void reset(void) {
	if (inHTML) {
	    reportParserDiagnostic("<HTML> tag unterminated at end of message.");
	}
	if (inHTMLcomment) {
	    reportParserDiagnostic("HTML comment unterminated at end of message.");
	}
    	inHTML = inHTMLcomment = false;
    	clearMessageQueue();
	phraseQueue.clear();
	pendingPhrases.clear();
    }

    bool nextToken(dictionaryWord &d);
    
    void assembleAllPhrases(dictionaryWord &d);
    
    @<Message queue utilities@>;
    
    bool isNewMessage(void) const {
    	return atEnd || (source->isNewMessage());
    }

private:@/    
    void nextLine(void) {
    	while (true) {
    	    if (!(source->nextLine(cl))) {
		atEnd = true;
		cl = "";
		break;
	    }
	    if (saveMessage) {
	    	messageQueue.push_back(cl);
	    }
	    if (source->isNewMessage()) {
	    	reset();
	    }
	    break;
	}
	clp = 0;
    }
};

@
The |nextToken| method stores the next token from the input
source into its dictionary word argument and returns |true| if
a token was found or |false| if the end of the input source was
encountered whilst scanning for the next token.

@d ChIx(c) (static_cast<unsigned int>((c)) & 0xFF)

@<Class implementations@>=
    bool tokenParser::nextToken(dictionaryWord &d) {
	string token;

	while (!atEnd) {
	
	    @<Check for assembled phrases in queue and return next if so@>;

    	    token = "";
	    string::size_type necount = 0;

    	    if (source->isByteStream()) {
		@<Parse plausible tokens from byte stream@>;
	    }

    	    //  Ignore non-token characters until start of next token
    	    while ((clp < cl.length()) &&
		   (inHTMLcomment ||
    		   (!(td->isTokenMember(ChIx(cl[clp]))))
		   )) {
		@<Check for HTML comments and ignore them@>;
		@<Check for within HTML content@>;
		clp++;
	    }

	    //  If end of line encountered before token start, advance to next line
	    if (clp >= cl.length()) {
		nextLine();
		continue;
	    }

	    //  Check for characters we don't accept as the start of a token
    	    if (td->isTokenNotAtEnd(ChIx(cl[clp]))) {
		clp++;
		continue;
	    }

	    //  First character of token recognised; store and scan balance

    	    if (td->isTokenNotExclusively(ChIx(cl[clp]))) {
		necount++;
	    }
	    token += cl[clp++];
    	    while ((clp < cl.length())) {
    		if ((!inHTMLcomment) && (td->isTokenMember(ChIx(cl[clp])))) {
    		    if (td->isTokenNotExclusively(ChIx(cl[clp]))) {
	    		necount++;
		    }
		    token += cl[clp++];
		} else {
		    @<Check for HTML comments and ignore them@>;
		    if (inHTMLcomment) {
		    	clp++;
			continue;
		    }
	    	    break;
		}
	    }

	    //  Prune characters we don't accept at the end of a token
	    while ((token.length() > 0) &&
    		   td->isTokenNotAtEnd(ChIx(token[token.length() - 1]))
		   ) {
		 token.erase(token.length() - 1);
	    }

	    //  Verify that the token meets our minimum and maximum length constraints

    	    if (!(td->isTokenLengthAcceptable(token))) {
		continue;
	    }

	    @/@#
	    /*  We've either hit the end of the line or encountered a character
		that's not considered part of a token.  Return the token, leaving
		the class variables ready to carry on finding the next token
		when we're called again.  But first, if the token is composed
		entirely of characters in the |notExclusively| class, we discard
		it.  */

	    if (necount == token.length()) {
		continue;
	    }

	    d.set(token);
	    d.toLower();	    	    // Convert to canonical form
    	    @<Check for phrase assembly and generate phrases as required@>;
	    if (pTokenTrace && saveMessage) {
		messageQueue.push_back(string("  \"") + d.text + "\"");
	    }
    	    return true;
	}
	return false;
    }
    
@
If we're assembling phrases, there may be one or more already
assembled phrases sitting in the |pendingPhrases| queue.  If so,
remove the first one from the queue and return it.

@<Check for assembled phrases in queue and return next if so@>=
    if (!pendingPhrases.empty()) {
    	token = pendingPhrases.front();
	pendingPhrases.pop_front();
	d.set(token);
	d.toLower();
	if (pTokenTrace && saveMessage) {
	    messageQueue.push_back(string("  \"") + d.text + "\"");
	}
	return true;
    }

@
We wish to skip comments in HTML inclusions in mail, as junk
mail frequently uses void HTML comments to break up trigger
words for detectors. Strictly speaking, a space (or end of
line) is required after the HTML begin comment and before
the end comment delimiters, but most browsers don't enforce this
and real-world HTML frequently violates this rule.  So,
we treat any sequence of characters between the delimiters
as an HTML comment.

@d HTMLCommentBegin "<!--"   	    // HTML comment start sentinel
@d HTMLCommentEnd   "-->"   	    // HTML comment end sentinel

@<Check for HTML comments and ignore them@>=
    if (inHTML && !inHTMLcomment && (cl.substr(clp, 4) == HTMLCommentBegin)) {
    	inHTMLcomment = true;
	clp += 4;   	    	    // Skip over first HTML comment sentinel
#ifdef HTML_COMMENT_DEBUG
    	cout << "------------------------------ HTML Comment begin: " << cl << endl;
#endif
	continue;
    }
    if (inHTML && inHTMLcomment && (cl.substr(clp, 3) == HTMLCommentEnd)) {
    	inHTMLcomment = false;
	clp += 3;
#ifdef HTML_COMMENT_DEBUG
    	cout << "------------------------------ HTML Comment end: " << cl << endl;
#endif
	continue;
    }
#ifdef HTML_COMMENT_DEBUG
    if (inHTMLcomment) {
	cout << cl[clp];
	if (clp == (cl.length() - 1)) {
    	    cout << endl;
	}
    }
#endif
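
@
The effect of comment elision on a run of text can be illustrated with
a small free-standing filter.  This is a sketch under stated
assumptions, not the code above: |stripHtmlComments| is a hypothetical
function, it works on whole lines rather than character by character,
and it ignores the |inHTML| gate, treating every line as HTML.

```cpp
#include <cassert>
#include <string>

// Remove everything between the begin and end comment sentinels.  The flag
// inComment carries the state across lines, so a comment may span several.
inline std::string stripHtmlComments(const std::string &line, bool &inComment) {
    std::string out;
    std::string::size_type p = 0;

    while (p < line.length()) {
        if (inComment) {
            const std::string::size_type e = line.find("-->", p);
            if (e == std::string::npos) {
                return out;                     // Comment continues on the next line
            }
            inComment = false;
            p = e + 3;                          // Skip the end sentinel
        } else {
            const std::string::size_type b = line.find("<!--", p);
            if (b == std::string::npos) {
                out.append(line, p, std::string::npos);
                break;
            }
            out.append(line, p, b - p);         // Keep the text before the comment
            inComment = true;
            p = b + 4;                          // Skip the begin sentinel
        }
    }
    return out;
}
```

A trigger word broken up with empty comments is thus rejoined before
tokenisation, which is the whole point of the exercise.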

@
To avoid accidentally blundering into HTML comment discarding
in non-HTML text, we look for start and end HTML tags and only
activate HTML comment detection inside something which is
plausibly HTML.  Note that unclosed HTML tags and comments
are automatically closed out when |reset| is called
at the start of a new message from the mail folder.

@<Check for within HTML content@>=
    if (cl[clp] == '<' && ((clp + 6) <= cl.length())) {
    	if ((cl[clp + 1] == 'H' || cl[clp + 1] == 'h') &&
	    (cl[clp + 5] == '>' || cl[clp + 5] == ' ')) {
	    string tag;
	    for (int i = 1; i < 5; i++) {
	    	tag += (islower(cl[clp + i])) ? toupper(cl[clp + i]) : cl[clp + i];
	    }
	    if (tag == "HTML") {
	    	inHTML = true;
#ifdef HTML_COMMENT_DEBUG
    	    	cout << "------------------------------ In HTML: " << cl << endl;
#endif
	    }
    	}
    }
    
    if (cl[clp] == '<' && ((clp + 7) <= cl.length())) {
    	if ((cl[clp + 1] == '/') && (cl[clp + 2] == 'H' || cl[clp + 2] == 'h') &&
	    (cl[clp + 6] == '>')) {
	    string tag;
	    for (int i = 2; i < 6; i++) {
	    	tag += (islower(cl[clp + i])) ? toupper(cl[clp + i]) : cl[clp + i];
	    }
	    if (tag == "HTML") {
	    	inHTML = false;
#ifdef HTML_COMMENT_DEBUG
    	    	cout << "------------------------------ Out of HTML: " << cl << endl;
#endif
	    }
    	}
    }
    
@
If the item being read from the |mailFolder| has been identified
as a binary byte stream, read it character by character and parse
for probable strings.  We use the byte stream |tokenDefinition|
|btd| to determine token composition, permitting stricter
construction of plausible tokens in binary byte streams.

We get here only when our |source| identifies itself as
chewing through a byte stream with |isByteStream|.  While in
a byte stream, the |mailFolder| permits calls to its
|nextByte| method, which returns bytes directly from the
active stream decoder.  At the end of the stream (usually
denoted by the end sentinel of the MIME part containing
the stream), |nextByte| returns $-1$ and clears the
byte stream indicator.  We escape from here when that happens,
and go around the main loop in |nextToken| again, which will,
now that byte stream mode is cleared, resume dealing with the
mail folder at the |nextLine| level, where all of the housekeeping
related to the end of the byte stream will be dealt with.

This code is so similar to the main loop it's embedded in
it should probably be abstracted out as a token recogniser
engine parameterised by the means of obtaining bytes and
the token definition it applies.  I may get around to this
when I'm next in clean freak mode, but for the nonce I'll
leave it as-is until I'm sure no additional special pleading
is required when cracking byte streams.

@<Parse plausible tokens from byte stream@>=
    int b;
    
    while ((b = source->nextByte()) >= 0) {
    
    	//  Ignore non-token characters until start of next token
    	if (!(btd->isTokenMember(b))) {
    	    continue;
	}
	
	
	//  Check for characters we don't accept as the start of a token
    	if (btd->isTokenNotAtEnd(b)) {
	    continue;
	}
	
	//  First character of token recognised; store and scan balance

    	if (btd->isTokenNotExclusively(b)) {
	    necount++;
	}
	
	token += static_cast<char>(b);
    	while (((b = source->nextByte()) >= 0) &&
    	    	btd->isTokenMember(b)
	       ) {
    	    if (btd->isTokenNotExclusively(b)) {
	    	necount++;
	    }
	    token += static_cast<char>(b);
	}
		
	//  Prune characters we don't accept at the end of a token
	while ((token.length() > 0) &&
    	       btd->isTokenNotAtEnd(ChIx(token[token.length() - 1]))
	       ) {
	     token.erase(token.length() - 1);
	}
	
	//  Verify that the token meets our minimum and maximum length constraints

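    	if (!(btd->isTokenLengthAcceptable(token))) {
    	    token = "";
	    necount = 0;    	    // Don't carry the count over to the next candidate
	    continue;
	}

    	/* Verify that the token isn't composed exclusively of characters
	   permitted in a token but not allowed to comprise it in entirety. */
	if (necount == token.length()) {
    	    token = "";
	    necount = 0;    	    // Don't carry the count over to the next candidate
	    continue;
	}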
	d.set(token);
	d.toLower();	    	    // Convert to canonical form
	@<Check for phrase assembly and generate phrases as required@>;
	if (pTokenTrace && saveMessage) {
	    messageQueue.push_back(string("  \"") + d.text + "\"");
	}
	return true;
    }
    continue;
    
@
If the user has so requested, we can assemble tokens into
phrases in a given length range.  The default minimum and
maximum length phrase is 1 word, which causes individual tokens
to be returned as they are parsed.  When the maximum is greater
than one word, consecutive tokens (but never crossing a
|reset| or |setSource| boundary) are assembled into phrases
and output as pseudo-tokens of each length from the minimum
to maximum length phrase.

Here we examine the phrase length parameters, report any
erroneous specifications, and determine whether phrase assembly
is required at all.

@<Check phrase assembly parameters and activate if required@>=
    assemblePhrases = false;
    if ((phraseMin != 1) || (phraseMax != 1)) {
    	if ((phraseMin >= 1) && (phraseMax >= phraseMin)) {
	    if ((phraseLimit > 0) && (phraseLimit < ((phraseMax * 2) - 1))) {
	    	cerr << "Invalid --phraselimit setting.  Too small for specified --phrasemax." << endl;
	    } else {
	    	assemblePhrases = true;
	    }
	} else {
	    cerr << "Invalid --phrasemin/max parameters.  Must be 1 <= min <= max." << endl;
    	}
    }

@
When |assemblePhrases| is set, each arriving token is used to
generate all phrases including itself and previous tokens
within the specified phrase length limits.  Check for phrase
assembly and invoke the |assembleAllPhrases| method if
required.

@<Check for phrase assembly and generate phrases as required@>=
    if (assemblePhrases) {
    	assembleAllPhrases(d);
	continue;
    }
    
@
If we're assembling phrases, we take each token parsed (which has
already been stored into the |dictionaryWord| argument |d|
in canonical form) and place it on the |phraseQueue|
queue, removing the oldest element from the head if the queue
is longer than |phraseMax|.  Then, if the queue contains |phraseMin|
elements or more, iterate over the range of phrase lengths we
wish to generate, creating phrases and storing them onto
|pendingPhrases| for subsequent return.

@<Class implementations@>=
    void tokenParser::assembleAllPhrases(dictionaryWord &d) {
    	phraseQueue.push_back(d.text);
	if (phraseQueue.size() > phraseMax) {
	    phraseQueue.pop_front();
	    assert(phraseQueue.size() == phraseMax);
	}
	
	for (unsigned int p = phraseMin; p <= phraseMax; p++) {
	    if (p <= phraseQueue.size()) {
		deque<string>::const_reverse_iterator wp = phraseQueue.rbegin();
		string phrase = "";
		for (unsigned int i = 0; i < p; i++) {

		    phrase = (*wp) + ((phrase == "") ? "" : " ") + phrase;
		    wp++;
		}
		if ((phraseLimit == 0) || (phrase.length() <= phraseLimit)) {
		    pendingPhrases.push_back(phrase);
		}
	    }
	}
    }
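
@
A worked example may make the sliding window clearer.  The class below
is a free-standing sketch of the same idea, with the hypothetical name
|PhraseAssembler|; it returns the phrases ending at each new token
directly rather than queueing them, and it omits the |phraseLimit|
length check.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

class PhraseAssembler {
public:
    PhraseAssembler(unsigned int pmin, unsigned int pmax) :
        phraseMin(pmin), phraseMax(pmax) {
        assert(pmin >= 1);
        assert(pmax >= pmin);
    }

    // Add one token and return every phrase of permitted length ending with it,
    // shortest first.
    std::vector<std::string> add(const std::string &token) {
        window.push_back(token);
        if (window.size() > phraseMax) {
            window.pop_front();                 // Discard the oldest token
        }

        std::vector<std::string> phrases;
        for (unsigned int p = phraseMin; p <= phraseMax; p++) {
            if (p > window.size()) break;       // Not enough tokens yet
            std::string phrase;
            for (std::deque<std::string>::size_type i = window.size() - p;
                 i < window.size(); i++) {
                if (!phrase.empty()) phrase += ' ';
                phrase += window[i];
            }
            phrases.push_back(phrase);
        }
        return phrases;
    }

private:
    unsigned int phraseMin, phraseMax;
    std::deque<std::string> window;             // Most recent tokens, oldest first
};
```

With a minimum of 1 and a maximum of 2, feeding the tokens ``buy'',
``cheap'', ``pills'' yields ``buy''; then ``cheap'' and ``buy cheap'';
then ``pills'' and ``cheap pills''.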

@
The |messageQueue| can be used to store the lines of a message:
``what the parser saw,'' after MIME decoding (but not elision
of HTML comments or other processing in the parser itself).  This
is handy when debugging the lower level stuff.  To enable saving
messages in the queue, call |setSaveMessage| with an argument
of |true|.  The contents of |messageQueue| may be examined
directly (it is a |public| member of the class), or written
to an |ostream| with |writeMessageQueue|.  One little
detail---if you examine the |messageQueue| after the start
of the next message in a folder has been detected, the first
line of the next message will be the last item in the
message queue; |writeMessageQueue| understands this and
doesn't write the line, but if you're looking at the queue
yourself it's up to you to cope with this.

@<Message queue utilities@>=    
    void setSaveMessage(bool v) {
    	saveMessage = v;
	source->setDiagnosticList(saveMessage ? (&messageQueue) : NULL);
    }
    
    bool getSaveMessage(void) const {
    	return saveMessage;
    }
    
    void clearMessageQueue(void) {
    	if (saveMessage) {
    	    string s;

    	    if (isNewMessage()) {
		s = messageQueue.back();
	    }
    	    messageQueue.clear();
	    if (isNewMessage()) {
		messageQueue.push_back(s);
	    }
	}
    }
    
    void writeMessageQueue(ostream &os) {
    	list<string>::size_type l = messageQueue.size(), n = 0;
	
    	for (list<string>::iterator p = messageQueue.begin();
	    p != messageQueue.end(); p++, n++) {
	    if (!((n == (l - 1)) &&
	    	(p->substr(0, (sizeof messageSentinel) - 1) == messageSentinel))) {
	    	os << *p << endl;
	    }
	}
    }

@** Classify message.

The |classifyMessage| class reads input from a
|mailFolder| and returns the junk probability for successive
messages.  The input |mailFolder| need not hold several
messages; it may contain only a single one.

@<Class definitions@>=
class classifyMessage {
public:@/
    mailFolder *mf;
    tokenParser tp;
    unsigned int nExtremal;
    dictionary *d;
    fastDictionary *fd;
    double unknownWordProbability;
    
    classifyMessage(mailFolder &m,
    	    	    dictionary &dt,
		    fastDictionary *fdt = NULL,
    	    	    unsigned int nExt = 15, double uwp = 0.2);
    
    double classifyThis(bool createTranscript = false);
    
protected:@/
    void addSignificantWordDiagnostics(list <string> &l,
    	    	list<string>::iterator where,
		multimap <double, string> &rtokens, string endLine = "");
};

@
The constructor initialises the classifier for the default
parsing of ISO-8859 messages.

@<Global functions@>=
    classifyMessage::classifyMessage(mailFolder &m,
    	    	    dictionary &dt, fastDictionary *fdt,
    	    	    unsigned int nExt, double uwp) {
    	mf = &m;
	tp.setSource(m);
    	tp.setTokenDefinition(isoToken, asciiToken);
    	tp.setTokenLengthLimits(maxTokenLength, minTokenLength,
	    streamMaxTokenLength, streamMinTokenLength);
    	if (pDiagFilename.length() > 0) {
    	    tp.setSaveMessage(true);
    	}
    	d = &dt;
	fd = fdt;
	nExtremal = nExt;
	unknownWordProbability = uwp;
    }

@
The |classifyThis| method reads the next message from the mail folder
and returns the probability that it is junk.  If the end of the mail
folder is encountered, $-1$ is returned.

@<Class implementations@>=
    double classifyMessage::classifyThis(bool createTranscript) {
    	dictionaryWord dw;
	double junkProb = -1;
	
	if (createTranscript || (transcriptFilename != "")) {
	    mf->setTranscriptList(&messageTranscript);
	    if (Annotate('p') || Annotate('d')) {
	    	saveParserDiagnostics = true;
	    }
	}
	
	@<Build set of unique tokens in message@>;

    	@<Classify message tokens by probability of significance@>;
	
    	@<Compute probability message is junk from most significant tokens@>;

	if (tp.getSaveMessage()) {
	    @<Add classification diagnostics to parser diagnostics queue@>;
	    ofstream mdump(pDiagFilename.c_str());
	    tp.writeMessageQueue(mdump);
	    mdump.close();
	}
	
	if (createTranscript || (transcriptFilename != "")) {
	    @<Add annotation to message transcript@>;
	    if (transcriptFilename != "") {
	    	mf->writeMessageTranscript(transcriptFilename);
	    }
	}
	
	return junkProb;
    }

@
Just one more thing$\ldots\,$.  We need to define an absolute value function
for floating point quantities.  Make it so.

@<Class definitions@>=
#ifdef OLDWAY
    double abs(double x) {
    	return (x < 0) ? (-(x)) : x;
    }
#endif
    
@
Read the next message from the mail folder and build the
|set| |utokens| of unique tokens in the message.  |set|
insertion automatically discards tokens which appear more
than once.

@<Build set of unique tokens in message@>=
    set <string> utokens;

    while (tp.nextToken(dw)) {
	utokens.insert(dw.get());
    }

@
Once we've obtained a list of tokens in the message, we now
wish to filter it by the significance of the probability that
a token appears in junk or legitimate mail.  This is simply
the absolute value of the difference of the token's
|junkProbability| from 0.5---the probability for a
token equally likely to appear in junk and legitimate
mail.  We construct a |multimap| called |rtokens| which
maps this significance value to the token string; since
any number of tokens may have the same significance,
we must use a |multimap| as opposed to a |map|.

We count on |multimap| being an ordered collection class
which, when traversed by its |reverse_iterator|,
will return tokens in order of significance.  This
is guaranteed: the \CPP/ standard requires the associative
containers to iterate in non-descending order of their keys,
so a traversal in reverse visits the largest significance
values first.
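
The ranking step can be sketched in isolation.  This is an
illustration under stated assumptions rather than the code used here:
|mostSignificant| is a hypothetical function, and the token
probabilities are supplied in a plain |map| instead of the
program's dictionary classes.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Rank tokens by the distance of their junk probability from 0.5 and
// return the n most significant, most significant first.
std::vector<std::string> mostSignificant(
        const std::map<std::string, double> &junkProbability,
        std::size_t n) {
    std::multimap<double, std::string> ranked;
    for (std::map<std::string, double>::const_iterator t =
             junkProbability.begin(); t != junkProbability.end(); ++t) {
        ranked.insert(std::make_pair(std::fabs(t->second - 0.5), t->first));
    }

    // Reverse traversal visits the largest significance values first.
    std::vector<std::string> result;
    for (std::multimap<double, std::string>::const_reverse_iterator rp =
             ranked.rbegin();
         rp != ranked.rend() && result.size() < n; ++rp) {
        result.push_back(rp->second);
    }
    return result;
}
```

Given probabilities of 0.99, 0.05, 0.8 and 0.5 for four tokens, the
significances are 0.49, 0.45, 0.3 and 0, so asking for the top two
yields the first and second tokens in that order.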

@<Classify message tokens by probability of significance@>=
    multimap <double, string> rtokens;
    
    for (set<string>::iterator t = utokens.begin(); t != utokens.end(); t++) {
	double pdiff;
	dictionary::iterator dp;

    	if ((fd != NULL) && fd->isDictionaryLoaded()) {
	    pdiff = fd->find(*t);
	    if (pdiff < 0) {
	    	pdiff = unknownWordProbability;
	    }
	    pdiff = abs(pdiff - 0.5);
	} else {
	    if (((dp = d->find(*t)) != d->end()) &&
		(dp->second.getJunkProbability() >= 0)) {
		pdiff = abs(dp->second.getJunkProbability() - 0.5);
	    } else {
		pdiff = abs(unknownWordProbability - 0.5);
	    }
	}
	
	rtokens.insert(make_pair(pdiff, *t));
    }

@
Given the list of most significant tokens, we now use Bayes' theorem
to compute the aggregate probability the message is junk.  If
$p_i$ is the probability word $i$ of the most significant
$n$ (\.{nExtremal}) words in a message appears in junk mail,
the probability the message as a whole is junk is:
$${\prod\limits_{i=1}^{n}p_i}\over{{\prod\limits_{i=1}^{n}p_i}+{\prod\limits_{i=1}^{n}(1-p_i)}}$$

@<Compute probability message is junk from most significant tokens@>=
    unsigned int n = min(static_cast<multimap <double, string>::size_type>(nExtremal), rtokens.size());
    multimap <double, string>::const_reverse_iterator rp = rtokens.rbegin();
    double probP = 1, probQ = 1;
    if (verbose) {
	cerr << "Rank   Probability   Token" << endl;
    }
    
    for (unsigned int i = 0; i < n; i++) {
    	double p;
	
    	if ((fd != NULL) && fd->isDictionaryLoaded()) {
	    p = fd->find(rp->second);
	    if (p < 0) {
	    	p = unknownWordProbability;
	    }
	} else {
    	    dictionary::iterator dp = d->find(rp->second);
    	    p = ((dp == d->end()) || (dp->second.getJunkProbability() < 0)) ?
	    	unknownWordProbability : dp->second.getJunkProbability();

	}
    	if (verbose) {
	    cerr << setw(3) << setiosflags(ios::right) << (i + 1) << "      " <<
	    	    setw(9) << setprecision(5) << setiosflags(ios::left) << p <<
	    	    "  " << rp->second << endl;
	}
    	probP *= p;
	probQ *= (1 - p);
	rp++;
    }
    junkProb = probP / (probP + probQ);
    if (verbose) {
    	cerr << "ProbP = " << probP << ", ProbQ = " << probQ << endl;
    }

@
When parser diagnostics are enabled, add lines to the header
of the message in the diagnostic queue to indicate the
words we used, their individual probabilities, and the
resulting classification of the message as a whole.

@<Add classification diagnostics to parser diagnostics queue@>=
    ostringstream os;
    list<string>::iterator p;
    
    /* Find the end of the header in the message.  If this
       fails we simply append the diagnostics to the end of
       the message. */
    for (p = tp.messageQueue.begin(); p != tp.messageQueue.end(); p++) {
	if (p->length() == 0) {
	   break;
	}
    }
    
    os << Xfile << "-Junk-Probability: " << setprecision(5) << junkProb;
    tp.messageQueue.insert(p, os.str());
    os.str("");
    
    addSignificantWordDiagnostics(tp.messageQueue, p, rtokens);

@
If we're producing a message transcript, just before writing it
add the annotations to the end of the header which indicate the
junk probability and classification of the message based on the
threshold settings.  After these, other annotations requested by
the \.{--annotate} option are appended.

The test for the end of the message header where we insert the
annotations is a little curious.  When we're processing a message
received from a |POP3Proxy| server, the transcript will contain
the CR from the CR/LF termination sequences as required by POP3\null.
(The final line feed will have been stripped by |getline|
as the message was read.)
Preserving these terminators allows us to use the standard
mechanisms of |mailFolder| without lots of special flags,
so we deem a line the end of the header if it's either
zero length (read from a \UNIX/ mail folder with |getline|)
{\it or} if it contains a single CR (received from a
POP3 server).  In the latter case, we set |transEndl| so
as to terminate annotations we add to the transcript with
CR/LF as well.
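
The end-of-header test can be sketched as a free-standing function.
This is only an illustration: |findHeaderEnd| is a hypothetical
helper, and the transcript is modelled as a bare |list| of strings.

```cpp
#include <list>
#include <string>

// Hypothetical helper: return an iterator to the line which ends the
// header (or end() if there is none), storing in endLine the terminator
// to append to any annotations inserted before it.
std::list<std::string>::iterator findHeaderEnd(
        std::list<std::string> &lines, std::string &endLine) {
    endLine = "";
    std::list<std::string>::iterator p;
    for (p = lines.begin(); p != lines.end(); ++p) {
        if (p->empty()) {           // UNIX mail folder: blank line
            break;
        }
        if (*p == "\r") {           // POP3: bare CR left by getline
            endLine = "\r";
            break;
        }
    }
    return p;
}
```

An annotation is then added with |lines.insert(p, text + endLine)|,
which places it immediately before the separator line, so it becomes
the last line of the header.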

@<Add annotation to message transcript@>=
    ostringstream os;
    list<string>::iterator p;
    string transEndl = "";

    /* Find the end of the header in the message.  If this
       fails simply append the annotations to the end of
       the message. */
    for (p = messageTranscript.begin(); p != messageTranscript.end(); p++) {
	if (p->length() == 0) {
	    break;
	}
	if (*p == "\r") {
	    transEndl = "\r";
    	    break;
	}
    }
    
    double jp = junkProb;
    /*	If the probability is so small that it would be edited
    	in scientific notation, force it to zero so it's easier to
	parse.  */
    if (jp < 0.001) {
    	jp = 0;
    }
    os << Xfile << "-Junk-Probability: " << setprecision(3) << jp << transEndl;
    messageTranscript.insert(p, os.str());
    os.str("");
    os << Xfile << "-Classification: ";
    if (junkProb >= junkThreshold) {
	os << "Junk";
    } else if (junkProb <= mailThreshold) {
	os << "Mail";
    } else {
    	os << "Indeterminate";
    }
    os << transEndl;
    messageTranscript.insert(p, os.str());
    
    if (Annotate('w')) {
    	addSignificantWordDiagnostics(messageTranscript, p, rtokens, transEndl);
    }
    
    if (Annotate('p') || Annotate('d')) {
    	while (!parserDiagnostics.empty()) {
	    ostringstream os;
	    
	    os << Xfile << "-Parser-Diagnostic: " << parserDiagnostics.front() << transEndl;
	    messageTranscript.insert(p, os.str());
	    parserDiagnostics.pop();
	}
    }
    
@
Here's the little function which adds the most significant words
and their probabilities to either the parser diagnostics or
the transcript.  We break it out into a function to avoid
duplicating the code.

@<Class implementations@>=
    void classifyMessage::addSignificantWordDiagnostics(list <string> &l,
    	    list<string>::iterator where,
	    multimap <double, string> &rtokens, string endLine) {
	unsigned int n = min(static_cast<multimap <double, string>::size_type>(nExtremal), rtokens.size());
	multimap <double, string>::const_reverse_iterator rp = rtokens.rbegin();

	for (unsigned int i = 0; i < n; i++) {
    	    dictionary::iterator dp = d->find(rp->second);
    	    double wp = ((dp == d->end()) || ((dp->second.getJunkProbability() < 0))) ?
	    	    	    unknownWordProbability : dp->second.getJunkProbability();
    	    ostringstream os;

	    os << Xfile << "-Significant-Word: " <<
		setw(3) << setiosflags(ios::right) << (i + 1) << "  " <<
		setw(8) << setprecision(5) << setiosflags(ios::left) << wp <<
		"  \"" << rp->second << "\"" << endLine;
    	    l.insert(where, os.str());
    	    os.str("");
	    rp++;
	}
    }
    
@** POP3 proxy server.

If the system provides the required network access facilities,
we can act as a POP3 proxy server, mediating the protocol
defined by
\pdfURL{RFC~1939}{http://www.ietf.org/rfc/rfc1939.txt?number=1939}.
The |POP3Proxy| class manages this service when invoked from the
command line.

@*1 POP3 proxy server class definition.

We begin by defining the |POP3Proxy| class, which implements a
general purpose POP3 proxy capability.

@d POP_MAX_MESSAGE 512
@d POP_BUFFER  ((POP_MAX_MESSAGE) + 2)

@<Class definitions@>=
#ifdef POP3_PROXY_SERVER

@<Declare signal handler function for broken pipes@>@\

typedef void (*POP3ProxyFilterFunction)(const string command, const string argument, char *replyBuffer, int *replyLength, string &reply);

class POP3Proxy {
protected:@/
    unsigned short popProxyPort; 	// Port on which POP proxy server listens
    string serverName;	    	    	// Domain name or IP address of POP server
    unsigned short serverPort;	    	// Port on which POP server listens
    bool opened;    	    	    	// Have we established connection ?
    
private:@/
    set <string> multiLine, cMultiLine; // POP3 multi-line command lists
    int listenSocket;	    	    	// Socket on which we listen for connections
    POP3ProxyFilterFunction filterFunction; // Filter function for replies from server
    
public:@/

    POP3Proxy(unsigned short proxyPort = 9110,
    	      string serverN = "",
	      unsigned short serverP = 110,
	      POP3ProxyFilterFunction filterF = NULL
	     ) :
	    	    popProxyPort(proxyPort),
		    serverName(serverN),
		    serverPort(serverP),
    	    	    opened(false),
		    listenSocket(-1),
		    filterFunction(filterF) {
    	@<Define multi-line and conditional multi-line commands@>;
    }
    
    ~POP3Proxy() {
    	if (listenSocket != -1) {
	    close(listenSocket);
	    signal(SIGPIPE, SIG_DFL);
	}
    }
    
    void setPopProxyPort(unsigned short p) {
    	@<Check for POP3 connection already opened@>;
    	popProxyPort = p;
    }
    
    void setServerName(string &s) {
    	@<Check for POP3 connection already opened@>;
    	serverName = s;
    }
    
    void setServerPort(unsigned short p) {
    	@<Check for POP3 connection already opened@>;
    	serverPort = p;
    }
    
    void setFilterFunction(POP3ProxyFilterFunction ff) {
    	filterFunction = ff;
    }
    
    bool acceptConnections(int maxBacklog = 25);
    
    bool serviceConnection(void);
    
    bool operateProxyServer(int maxBacklog = 25);
    
};
#endif

@
Some of the POP3 protocol commands return multiple-line responses,
terminated with a line containing a single ``\.{.}'' (text lines
beginning with a period are quoted by prepending a single period).  We
initialise the |multiLine| |set| with commands which always
return multiple-line results and |cMultiLine| with those
which return multiple-line results when invoked with no
arguments.
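
The decision these two sets support can be sketched as a single
predicate.  The function |expectsMultiLine| is a hypothetical name
for this illustration; it assumes the command has already been
forced to lower case and that the server's status was positive.

```cpp
#include <set>
#include <string>

// Sketch: decide whether a successful POP3 command yields a multi-line
// reply.  The command must already be folded to lower case.
bool expectsMultiLine(const std::string &command,
                      const std::string &argument) {
    static const char *const always[] = { "capa", "retr", "top" };
    static const char *const ifNoArgument[] = { "list", "uidl" };
    static const std::set<std::string> multiLine(always, always + 3);
    static const std::set<std::string> cMultiLine(ifNoArgument,
                                                  ifNoArgument + 2);
    if (multiLine.count(command) > 0) {
        return true;                // Always multi-line
    }
    // Multi-line only when invoked with no argument
    return argument.empty() && (cMultiLine.count(command) > 0);
}
```

Thus ``\.{LIST}'' alone is followed by a scan listing, while
``\.{LIST 3}'' is answered by the status line only.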

@<Define multi-line and conditional multi-line commands@>=
    multiLine.insert("capa");	    // Extension in \pdfURL{RFC~2449}{http://www.ietf.org/rfc/rfc2449.txt?number=2449}
    multiLine.insert("retr");
    multiLine.insert("top");

    cMultiLine.insert("list");
    cMultiLine.insert("uidl");

@
The requestor is supposed to define all the properties of the
POP3 connection before it is opened.  Here we check for violations
of this rule and chastise offenders.

@<Check for POP3 connection already opened@>=
#ifndef NDEBUG
    if (opened) {
    	cerr << "Attempt to modify POP3 connection settings after connection opened." << endl;
	abort();
    }
#endif

@
In order to accept connections, we need to create a socket,
|listenSocket|, which is bound to the port address on which
we listen.  We accept connections from any IP address.
The |acceptConnections| method must be called to activate the
socket before connections may be processed.

@<Class implementations@>=
#ifdef POP3_PROXY_SERVER
    bool POP3Proxy::acceptConnections(int maxBacklog) {
	struct sockaddr_in name;

	listenSocket = socket(AF_INET, SOCK_STREAM, 0);
	if (listenSocket < 0) {
            perror("POP3Proxy opening socket to listen for connections");
	    listenSocket = -1;
	    return false;
	}

	/* Create name with wildcards. */

	name.sin_family = AF_INET;
	name.sin_addr.s_addr = INADDR_ANY;
	name.sin_port = htons(popProxyPort);
	if (bind(listenSocket, (struct sockaddr *) &name, sizeof name) < 0) {
    	    close(listenSocket);
            perror("POP3Proxy binding socket to listen for connections");
	    listenSocket = -1;
	    return false;
	}

	if (listen(listenSocket, maxBacklog) < 0) {
    	    close(listenSocket);
            perror("POP3Proxy calling listen for connection socket");
	    listenSocket = -1;
	    return false;
	}

	signal(SIGPIPE, absentPlumber); // Catch "broken pipe" signals from disconnects
	opened = true;
	return opened;
    }
#endif    

@
The |serviceConnection| method waits for the next client connection
to the |listenSocket|, accepts it, and then conducts the dialogue
with the client.

@<Class implementations@>=
#ifdef POP3_PROXY_SERVER
    bool POP3Proxy::serviceConnection(void) {
    	assert(opened);
	
    	int clientSocket;   	    // Socket for talking to client
	struct sockaddr_in from;    // Client IP address
	socklen_t fromlen;	    // Length of client address

    	@<Wait for next client connection and accept it@>;
	
	@<Conduct dialogue with client@>;
	
	return true;
    }
#endif

@
First of all, we have to camp on the |listenSocket|
with |accept| until somebody connects to it.  At that point
we obtain the |clientSocket| we'll use to conduct the
dialogue with the client.

@<Wait for next client connection and accept it@>=
    errno = 0;
    do {
	fromlen = sizeof from;
	clientSocket = accept(listenSocket, (struct sockaddr *) &from, &fromlen);
	if (clientSocket >= 0) {
	    break;
	}
    } while (errno == EINTR);
    if (clientSocket < 0) {
        perror("POP3Proxy accepting connection from client");
	return false;
    }
    if (verbose) {
       cout << "Accepting POP3 connection from " << inet_ntoa(from.sin_addr) << endl;
    }
    
@
Once a connection has been accepted, we use the |clientSocket|
to conduct the dialogue until it's concluded.

@<Conduct dialogue with client@>=
    int clientLength, serverLength;
    char clientBuffer[POP_BUFFER], serverBuffer[POP_BUFFER];
    int serverSocket;
    u_int32_t serverIP;
    struct hostent *h;
    int cstat = -1;
    bool ok = true;
    string command, argument, reply;

    @<Look up address of server@>;
    @<Open connection to server@>;
    @<Read the greeting from the server and relay to the client@>;
    @<Conduct client/server dialogue@>;
    @<Close the connection to the client and server@>;
    
@
We need to obtain the IP address of the server host we're supposed
to be connecting to.  This can be specified by the user either in
``dotted quad'' notation, for example, ``\.{192.168.82.13}'' or as
a fully qualified domain name such as ``\.{pop3.fourmilab.ch}''.
In the former case, we convert the address to binary with |inet_addr|;
in the latter, we invoke the resolver with |gethostbyname| to
obtain the IP address.  We do not handle IPv6 addresses at the present
time.

@<Look up address of server@>=
    if (isdigit(serverName[0]) && (serverIP = inet_addr(serverName.c_str())) != static_cast<u_int32_t>(-1)) {
    	cstat = 0;
    } else {
	h = gethostbyname(serverName.c_str());
	if (h != NULL) {
	    memcpy(&serverIP, h->h_addr, sizeof serverIP);
	    cstat = 0;
	} else {
            cerr << "POP3Proxy: POP3 server " << serverName.c_str() << " unknown." << endl;
	    close(clientSocket);
	    return false;
	}
    }

@
Once we've determined the IP address of the POP3 server, we next
need to open a socket connection to it on the TCP/IP port
on which it listens.

@<Open connection to server@>=
    struct sockaddr_in serverHost;
    serverHost.sin_family = AF_INET;

    serverSocket = socket(AF_INET, SOCK_STREAM, 0);
    if (serverSocket < 0) {
        perror("POP3Proxy opening socket to POP server");
	cstat = -1;
    } else {
    	if (popProxyTrace) {
    	    cerr << "POP3: serverSocket opened." << endl;
    	}
	serverHost.sin_port = htons(serverPort);
	memcpy((char *) &serverHost.sin_addr.s_addr, (char *) (&serverIP),
	      sizeof serverHost.sin_addr.s_addr);

	errno = 0;
	do {
	    cstat = connect(serverSocket, (struct sockaddr *) &(serverHost), sizeof serverHost);
    	    if (popProxyTrace) {
    		cerr << "POP3: serverSocket connected." << endl;
	    }
	    if (cstat == 0) {
    	    	if (popProxyTrace) {
    	    	    cerr << "POP3: Connected to POP server on " << inet_ntoa(serverHost.sin_addr) << 
    	    	    	    ":" << ntohs(serverHost.sin_port) << endl;
    	    	}
		break;
	    } else {
    	    	perror("POP3Proxy connection to POP server failed");
	    }
	} while (errno == EINTR);

	if (cstat < 0) {
    	    cerr << "POP3Proxy: Cannot connect to POP3 server " << serverName.c_str() << endl;
	}
    }

@    
Read the greeting from the server and forward it to the client.
We do this prior to the dialogue loop to avoid tangled
logic there when processing requests with multiple-line
replies.

@<Read the greeting from the server and relay to the client@>=
    serverLength = recv(serverSocket, serverBuffer, POP_MAX_MESSAGE, 0);
    if (serverLength < 0) {
    	perror("POP3Proxy reading greeting from server");
	ok = false;
    } else {
	clientLength = send(clientSocket, serverBuffer, serverLength, 0);
	if (clientLength < 0) {
    	    perror("POP3Proxy forwarding greeting to client");
	    ok = false;
	}
    }

@
This is the main client/server dialogue loop.  We read
successive requests from the client, forward them to the
server, then receive the reply from the server (which,
depending on the request, may contain variable-length
information after the obligatory status line).  Before
returning the reply to the client, we check whether this is
a mail body we wish to pass through the filtering step
and proceed accordingly.  Finally, the results are written
back to the client.  If the command we've just completed
is ``\.{QUIT}'', we're done with this client.

@<Conduct client/server dialogue@>=
    while (ok) {
    
    	@<Read request from client@>;
	@<Check for blank request and discard@>;
	@<Forward request to server@>;
	@<Parse request and argument into canonical form@>;
	@<Read status line from server@>;
	@<Read multi-line reply from server if present@>;
	
	@<Fiddle with the reply from the server as required@>;
	
	@<Relay the status line from the server to the client@>;
	@<Relay multi-line reply, if any, to the client@>;
	
	if (command == "quit") {
	    break;
	}
    }

@
Read the next request from the client.  Requests are always
a single line consisting of |POP_MAX_MESSAGE| characters
or fewer.

@<Read request from client@>=
    if (popProxyTrace) {
    	cerr << "POP3: Reading request from client." << endl;
    }
    clientLength = recv(clientSocket, clientBuffer, POP_MAX_MESSAGE, 0);
    if (popProxyTrace) {
    	cerr << "POP3: Read " << clientLength << " request bytes from client." << endl;
    }
    if (clientLength <= 0) {
	break;
    }

@
RFC~1939 is silent on the issue, but the POP3 server I tested
with seems to silently discard blank lines without issuing an
``\.{-ERR}'' response.  Since this can hang up our proxy cycle,
eat blank lines without passing them on to the server.  This
shouldn't happen with a properly operating client, but it's
all too easy to do when testing with Telnet, and besides, we
have to cope with screwball clients which may do anything.

@<Check for blank request and discard@>=
    if (isspace(clientBuffer[0])) {
	continue;
    }

@
Pass on the client request to the server.

@<Forward request to server@>=
    serverLength = send(serverSocket, clientBuffer, clientLength, 0);
    if (serverLength != clientLength) {
	perror("POP3Proxy forwarding request to server");
	break;
    }

@
In order to determine whether the server will respond with
a multi-line reply in addition to a status line, we must
examine the command and its arguments.  The command, which
is case-insensitive, is forced to lower case to
facilitate comparisons.  Note that since we've already
forwarded the request to the server, it's OK to diddle
|clientBuffer| here.
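
The canonicalisation can be sketched on a |string| rather than on
the raw buffer.  The function |parseRequest| is a hypothetical
name for this illustration and is not part of the proxy class.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Sketch: split a raw POP3 request line into a lower-case command and
// its argument, discarding the trailing CR/LF and surrounding spaces.
std::pair<std::string, std::string> parseRequest(const std::string &line) {
    std::string::size_type end = line.length(), i = 0;

    // Trim trailing white space, including the CR/LF terminator.
    while (end > 0 &&
           std::isspace(static_cast<unsigned char>(line[end - 1]))) {
        end--;
    }

    // The command is everything up to the first white space.
    std::string command, argument;
    for (; i < end; i++) {
        unsigned char ch = static_cast<unsigned char>(line[i]);
        if (std::isspace(ch)) {
            break;
        }
        command += static_cast<char>(std::tolower(ch));
    }

    // Skip the separator; the remainder is the argument, kept verbatim.
    while (i < end && std::isspace(static_cast<unsigned char>(line[i]))) {
        i++;
    }
    if (i < end) {
        argument = line.substr(i, end - i);
    }
    return std::make_pair(command, argument);
}
```

The casts to |unsigned char| keep the character classification
functions well defined should a client send bytes with the high bit
set.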

@<Parse request and argument into canonical form@>=
    while ((clientLength > 0) && isspace(clientBuffer[clientLength - 1])) {
    	clientLength--;
    }
    command = argument = "";
    int i;
    for (i = 0; i < clientLength; i++) {
    	if (isspace(clientBuffer[i])) {
	    break;
	}
	char ch = clientBuffer[i];
	if (isalpha(ch) && isupper(ch)) {
	    ch = tolower(ch);
	}
	command += ch;
    }
    
    while ((i < clientLength) && isspace(clientBuffer[i])) {
    	i++;
    }
    
    if (i < clientLength) {
    	argument = string(clientBuffer + i, clientLength - i);
    }

    if (popProxyTrace) {
    	cerr << "POP3: Client command (" << command << ")  Argument (" << argument << ")" << endl;
    }

@
Now we're ready to read the status line from the server.  This will
begin with ``\.{+OK}'' if the request was successful and
``\.{-ERR}'' if not.

@<Read status line from server@>=
    serverLength = 0;
    int rl = -1;
    while (true) {
    	rl = recv(serverSocket, serverBuffer + serverLength, 1, 0);
	if (rl <= 0) {	    	    // Error, or the server closed the connection
	    perror("POP3Proxy reading request status from server");
	    rl = -1;
	    break;
	}
	serverLength++;
	if (serverBuffer[serverLength - 1] == '\n') {
	    break;
	}
	if (serverLength >= POP_MAX_MESSAGE) {
	    cerr << "POP3Proxy reply from server too long." << endl;
	    rl = -1;
	    break;
	}
    }
    if (rl < 0) {
    	break;
    }
    if (popProxyTrace) {
    	cerr << "POP3: Server reply is " << serverLength << " bytes" << endl;
    }

@
If the status from the server is positive and the command
is one which elicits a multiple-line reply, read the reply
from the server until the terminating sentinel, a single period
followed by the CR/LF line terminator.  Any line in the reply
which begins with a period is quoted by prefixing a
period.

We concatenate replies from the server into the |reply| string
until the end sentinel is encountered.
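
The test for the end sentinel can be sketched as a predicate on the
accumulated text.  The function |replyComplete| is a hypothetical
name for this illustration; it checks the length before looking
at the tail, so short partial replies are handled safely.

```cpp
#include <string>

// Sketch: true once the accumulated reply ends with the POP3 multi-line
// terminator, a line consisting of a single period.
bool replyComplete(const std::string &reply) {
    static const std::string terminator("\r\n.\r\n");
    if (reply == ".\r\n") {                 // Empty multi-line body
        return true;
    }
    return (reply.length() >= terminator.length()) &&
           (reply.compare(reply.length() - terminator.length(),
                          terminator.length(), terminator) == 0);
}
```

A quoted line such as ``\.{..}'' does not match, since the sentinel
must be a period standing alone between two line terminators.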

@<Read multi-line reply from server if present@>=
    reply = "";
    if ((serverBuffer[0] == '+') &&
	((multiLine.find(command) != multiLine.end()) ||
	 ((argument == "") && (cMultiLine.find(command) != cMultiLine.end())))) {
	 int bll;
	 char bp[POP_BUFFER];

    	if (popProxyTrace) {
    	    cerr << "POP3: Reading multi-line reply from server." << endl;
    	}  
	do {
    	    bll = recv(serverSocket, bp, POP_MAX_MESSAGE, 0);
	    if (bll <= 0) { 	    // Error, or the server closed the connection
		perror("POP3Proxy reading multi-line reply to request from server");
		break;
	    }
#ifdef POP3_TRACE_TRANSFER_DETAIL
    	    if (popProxyTrace) {
    	    	cerr << "POP3: Appending " << bll << " bytes to multi-line reply." << endl;
    	    }
#endif
	    reply += string(bp, bll);
	} while ((reply != ".\r\n") &&
	    ((reply.length() < 5) ||
	     (reply.compare(reply.length() - 5, 5, "\r\n.\r\n") != 0)));
    }
    
@
Here's where we permit the |filterFunction| to get into
the act.  If there's a |filterFunction|, we hand it everything it
needs to modify the status line and reply from the server.  Note
that even though we go to the effort to pass the canonicalised and
parsed command and argument, it's up to the filter function
to compose the rough-and-ready status string in the
|serverBuffer| string, which must be zero terminated.

@<Fiddle with the reply from the server as required@>=
    if (popProxyTrace) {
    	cerr << "POP3: Calling filter function." << endl;
    }
    if (filterFunction != NULL) {
    	serverBuffer[serverLength] = 0;
	filterFunction(command, argument, serverBuffer, &serverLength, reply);
    }
    if (popProxyTrace) {
    	cerr << "POP3: Returned from filter function." << endl;
    }
    
@
Send the status line received from the server back to the client.
Why wait so long?  Because if we've modified the multi-line
reply, we also may wish to modify the status line to reflect
the length of the modified reply.

@<Relay the status line from the server to the client@>=
    clientLength = send(clientSocket, serverBuffer, serverLength, 0);
    if (clientLength != serverLength) {
	perror("POP3Proxy relaying status of request to client");
	break;
    }
    if (popProxyTrace) {
    	cerr << "POP3: Relaying " << serverLength <<
	    	" byte status line to client: " << serverBuffer;
	if ((serverLength == 0) || (serverBuffer[serverLength - 1]) != '\n') {
	    cerr << endl;   	// ``Can't happen''---but just in case
	}
    }
    
@
If the server's response included a multi-line reply, relay it
to the client.  We write it with a single |send| unless
|POP3_MAX_CLIENT_WRITE| is defined, in which case we write
the reply in chunks of that size; if you wish to be ultra-conservative,
you might define it to be |POP_MAX_MESSAGE|.
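
The chunked transfer can be sketched independently of the socket.
In this illustration |writeInChunks| is a hypothetical function
and its |writer| argument stands in for |send|; neither name
appears in the program itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Sketch: deliver data in pieces of at most maxWrite bytes.  The writer
// callback stands in for send() and returns the number of bytes it
// accepted; the function returns the total number delivered.
template <typename Writer>
std::size_t writeInChunks(const std::string &data, std::size_t maxWrite,
                          Writer writer) {
    std::size_t written = 0;
    if (maxWrite == 0) {
        return 0;                   // Nothing can ever be written
    }
    while (written < data.length()) {
        std::size_t want = std::min(maxWrite, data.length() - written);
        std::size_t got = writer(data.data() + written, want);
        written += got;
        if (got != want) {
            break;      // Short write: caller compares result to length
        }
    }
    return written;
}
```

As in the section which follows, a short write ends the loop early
and the caller detects the failure by comparing the total written
with the length of the reply.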

@<Relay multi-line reply, if any, to the client@>=
    if (reply != "") {
    	if (popProxyTrace) {
    	    cerr << "POP3: Relaying " << reply.length() << " byte multi-line reply to client." << endl;
    	}
	
#ifdef POP3_MAX_CLIENT_WRITE
	clientLength = 0;
	int rpl = reply.length();
	
	while (clientLength < ((int) reply.length())) {
    	    int bcl, pcl;

	    bcl = min(rpl, POP3_MAX_CLIENT_WRITE);
#ifdef POP3_TRACE_TRANSFER_DETAIL
    	    if (popProxyTrace) {
    		cerr << "POP3: Writing " << bcl << " bytes of multi-line reply to client." << endl;
    	    }
#endif
	    pcl = send(clientSocket, reply.data() + clientLength, bcl, 0);
	    if (pcl != bcl) {
    		if (popProxyTrace) {
    		    cerr << "POP3: Error writing " << bcl << " bytes: wrote " << pcl << " bytes." << endl;
    		}
		break;  	    	// Note that test below will error transfer
	    }
	    clientLength += pcl;
	    rpl -= pcl;
	}
#else
	clientLength = send(clientSocket, reply.data(), reply.length(), 0);
#endif

	if (clientLength != static_cast<int>(reply.length())) {
	    perror("POP3Proxy relaying multi-line reply to request to client");
	    break;
	}
#ifdef POP3_TRACE_TRANSFER_DETAIL
    	if (popProxyTrace) {
    	    cerr << "POP3: <<<<<< Relaying " << reply.length() << " byte multi-line reply body to client. >>>>>>" << endl;
    	    cerr << reply;
    	    cerr << "POP3: <<<<<< End multi-line reply body. >>>>>>" << endl;
    	}	    
#endif
    }
    
@
We're all done.  Having relayed the reply to the ``\.{quit}'' command,
or having something go blooie in the processing loop, we close
the client and server sockets and get ready to bail out from
servicing this connection.

@<Close the connection to the client and server@>=
    close(clientSocket);
    close(serverSocket);
    if (verbose) {
    	cerr << "Closing POP3 connection from " << inet_ntoa(from.sin_addr) << endl;
    }
    
@
If you simply wish to run a POP3 proxy server until the end of time,
you can invoke this method which puts it all together.  We return
only if something blows up, after which the caller is well-advised
to destroy the |POP3Proxy| object and try again.

@<Class implementations@>=
#ifdef POP3_PROXY_SERVER
    bool POP3Proxy::operateProxyServer(int maxBacklog) {
    	if (acceptConnections(maxBacklog)) {
    	    while (serviceConnection()) ;
	}
	return false;
    }
#endif

@
Various alarums and diversions will result in our receiving a
|SIGPIPE| signal whilst acting as a POP3 server.  These may be
safely ignored, as the following function does.

@<Declare signal handler function for broken pipes@>=
static RETSIGTYPE absentPlumber(int)
{
    if (popProxyTrace) {
    	cerr << "POP3: Caught SIGPIPE--continuing." << endl;
    }
    signal(SIGPIPE, absentPlumber);	      /* Reset signal just in case */
}

@*1 POP3 proxy server implementation.

Using the |POP3Proxy| class defined above, the following code actually
provides the proxying for \PRODUCT, including filtering
retrieved messages and returning them to the client annotated with
their classification.

@
This is the entire proxy server!  It is invoked by the main program
after processing command line options if |popProxyServer| has been
set.  It creates a |POP3Proxy| with the specified arguments and
puts it to work.  There is no escape from here except through
catastrophic circumstances.

@<Operate POP3 proxy server, filtering replies@>=
    if (dict.empty() && (!fDict.isDictionaryLoaded())) {
    	cerr << "You cannot operate a --pop3proxy server "
	    	"unless you have first loaded a dictionary." << endl;
	return 1;
    }
    
    if (verbose) {
    	cerr << "Starting POP3 proxy server on port " << popProxyPort <<
	    	" with server " <<  popProxyServer << ":" << popProxyServerPort << endl;
    }
    POP3Proxy pp(popProxyPort, popProxyServer, popProxyServerPort, &popFilter);
    
    pp.operateProxyServer();

@
The |popFilter| function handles the actual filtering of messages
retrieved by the POP proxy server.  It takes the text of each message,
creates a mail folder to read it as an |istringstream|, then classifies
the message, generating a transcript annotated with the classification,
which is returned to the client in lieu of the raw message received
from the server.

@<Utility functions@>=
#ifdef POP3_PROXY_SERVER
void popFilter(const string command, const string argument, char *replyBuffer, int *replyLength, string &reply) {
    if ((command == "retr") && ((*replyLength) > 0) && (replyBuffer[0] == '+')) {
	
    	@<Create mail folder to read reply from POP3 server@>;
	@<Classify the message, generating an in-memory transcript of the results@>;

#define not_POPFILTER_TRACE
#ifdef POPFILTER_TRACE
cerr << "Classification done." << endl;
#endif
#ifdef OLDWAY
    	ostringstream os;
#else
    	unsigned int mtl = mf.sizeMessageTranscript();
#ifdef POPFILTER_TRACE
cerr << "Message transcript predicted size: " << mtl << endl;
#endif
    	char *mtbuf = new char[mtl + 16];
	ostrstream os(mtbuf, mtl + 16);
#endif
	mf.writeMessageTranscript(os);
#ifdef POPFILTER_TRACE
cerr << "Transcript written." << endl;
#endif
	mf.clearMessageTranscript();
#ifdef POPFILTER_TRACE
cerr << "Transcript cleared." << endl;
cerr << "Message transcript actual size: " << os.tellp() << endl;
#endif
    	reply.erase();
#ifndef OLDWAY
    	os << '\0';
#endif
	reply = os.str();
#ifdef POPFILTER_TRACE
cerr << "Reply string length: " << reply.length() << endl;
#endif
#ifndef OLDWAY
    	delete[] mtbuf; 	    // Allocated with |new char[]|, so release with |delete[]|
#endif
#ifdef POPFILTER_TRACE
cerr << "Reply created." << endl;
#endif
    	@<Modify POP3 reply message to reflect change in text length@>;
#ifdef POPFILTER_TRACE
cerr << "Reply length modification done." << endl;
#endif	
    }
}
#endif

@
We use the |reply| from the POP3 server to initialise an
|istrstream| whence |mailFolder| can read the message.
As usual, POP3 throws us a curve ball.  When returning
message text with a ``\.{RETR}'' command, the POP3
server (or at least the ones I've tested) {\it does
not} return the initial ``\.{From\ }'' line which
denotes the start of a message in a normal
\UNIX/ mail folder.  In order to correctly parse
the message header, we must invoke |forceInHeader|
on the |mailFolder| rather than rely on the
``\.{From\ }'' to set this state.
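
The effect of |forceInHeader| can be pictured with a small state
machine.  The sketch below is only an illustration under assumed
behaviour: the class |HeaderScanner| and its |feed| method are
hypothetical names, not part of |mailFolder|, and the real parser
tracks far more state.  It shows why a message delivered without
a ``\.{From\ }'' line would have its header lines mistaken for body
text unless the header state is forced.

```cpp
#include <string>

// Hypothetical stand-in for the header/body state kept by a mail folder
// reader; it is not the program's mailFolder class.
class HeaderScanner {
public:
    // Enter header state without waiting for a "From " separator line.
    void forceInHeader() { inHeader = true; }

    // Consume one line (without its line terminator) and report whether
    // it was treated as part of a message header.
    bool feed(const std::string &line) {
        if (line.compare(0, 5, "From ") == 0) {
            inHeader = true;        // Separator line starts a new message
            return true;
        }
        if (inHeader && line.empty()) {
            inHeader = false;       // Blank line ends the header
            return false;
        }
        return inHeader;
    }

private:
    bool inHeader = false;
};
```

A scanner that has not been forced treats a leading
``\.{Subject:}'' line as body text; after a call to |forceInHeader|
the same line is recognised as a header line.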

@<Create mail folder to read reply from POP3 server@>=
    istrstream is(reply.data(), reply.length());

    mailFolder mf(is, dictionaryWord::Mail);
    mf.forceInHeader();

@
Now we can classify the message in the |mailFolder| we've just
created by instantiating a |classifyMessage| object attached to
the folder.  We then call |classifyThis| with a |true| argument
which causes it to generate a transcript with the classification
annotations included, leaving it in the in-memory
|messageTranscript|.

@<Classify the message, generating an in-memory transcript of the results@>=
    classifyMessage cm(mf, dict, &fDict, significantWords, novelWordProbability);
    double jp = cm.classifyThis(true);
    if (verbose) {
    	cerr << "Message junk probability: " << setprecision(5) << jp << endl;
    }

@
Strictly speaking, the only part of the status reply to
a successful ``\.{RETR}'' request is ``\.{+OK}'', but
many POP3 servers actually suffix the length in octets
of the multi-line data which follows (but {\it not} including
the three byte terminator of a period followed by CR/LF)
at the end.  As Russell Nelson observes in
\pdfURL{RFC~1957}{http://www.ietf.org/rfc/rfc1957.txt?number=1957},
sometimes implementations are mistaken for standards,
especially by those who prefer \.{telnet} experiments
to actually reading the RFCs.  So, on the off chance that
some misguided POP3 client might be parsing this value to
decide how many text bytes to read from the socket,
we go to the trouble here to re-generate the reply with the
actual length of the filtered reply, reflecting the
annotations we've added to the header.
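
The arithmetic is simple enough to show in isolation.  The
following sketch, which is not the program's code, rebuilds the
status line from the filtered text.  The function names
|retrStatusLine| and |storeStatusLine| and the |capacity| parameter
are assumptions introduced for illustration; the interface used in
this program passes only |replyBuffer| and |replyLength|.

```cpp
#include <cstring>
#include <sstream>
#include <string>

// Build the status line for a filtered RETR reply.  The octet count
// excludes the three-byte ".\r\n" terminator, which is assumed to be
// present at the end of filteredReply.
std::string retrStatusLine(const std::string &filteredReply) {
    std::ostringstream rs;
    std::string::size_type octets =
        (filteredReply.length() >= 3) ? filteredReply.length() - 3 : 0;
    rs << "+OK " << octets << " octets\r\n";
    return rs.str();
}

// Copy the status line into a caller-supplied buffer of known capacity
// and return the number of bytes written, or -1 if it does not fit.
int storeStatusLine(const std::string &status, char *buffer, std::size_t capacity) {
    if (status.length() > capacity) {
        return -1;
    }
    std::memcpy(buffer, status.data(), status.length());
    return static_cast<int>(status.length());
}
```

Checking the capacity before the copy is a design choice of the
sketch; whether the proxy's buffer is always large enough for the
regenerated line depends on code outside this section.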

@<Modify POP3 reply message to reflect change in text length@>=
	ostringstream rs;
    	rs << "+OK " << (reply.length() - 3) << " octets\r\n";
	memcpy(replyBuffer, rs.str().data(), rs.str().length());
	*replyLength = rs.str().length();

@** Main program.

The main program is rather simple.  We initialise the global
variables then chew through the command line, doing whatever
the options request.

@<Main program@>=

@<Global declarations used by component in temporary jig@>;

int main(int argc, char *argv[])
{
    int opt;

    @<Initialise global variables@>;

    @<Process command-line options@>;
    
#ifdef POP3_PROXY_SERVER
    if (popProxyServer != "") {
    	@<Operate POP3 proxy server, filtering replies@>;
    }
#endif

    
   return exitStatus;
}

@
@<Initialise global variables@>=
    memset(messageCount, 0, sizeof messageCount);
    isoToken.setISO_8859defaults(minTokenLength, maxTokenLength);
    asciiToken.setUS_ASCIIdefaults(streamMinTokenLength, streamMaxTokenLength);

@
The master dictionary is global to the main program and all
of its support functions.  It's declared after all the class
definitions it requires.  We also support a |fastDictionary|
for classification runs.  If loaded, it takes precedence over
any loaded |dictionary|.

@<Master dictionary@>=
static dictionary dict;     	    	// Master dictionary
static fastDictionary fDict;	    	// Fast dictionary

@
@<Global variables@>=
static unsigned int messageCount[2];	// Total messages per category
static list <string> messageTranscript; // Message transcript list
static queue <string> parserDiagnostics; // List of diagnostics generated by the parser
static bool saveParserDiagnostics = false;  // Save parser diagnostics in |parserDiagnostics| ?

@
The |addFolder| procedure reads a mail folder and adds the tokens
it contains to the master dictionary |dict| with the specified |category|.
The global |messageCount| for the given |category| is updated to reflect
the number of messages added from the folder.

@<Utility functions@>=
static void addFolder(const char *fname, dictionaryWord::mailCategory cat)
{
    if (verbose) {
    	cerr << "Adding " << (bsdFolder ? "BSD " : "") << "folder " <<
	    	fname << " as " << dictionaryWord::categoryName(cat) << ":" << endl;
    }
    
    mailFolder mf(fname, cat);
    mf.setBSDmode(bsdFolder);
    bsdFolder = false;	    	    // Reset BSD folder semantics
    tokenParser tp;
    
    tp.setSource(mf);
    tp.setTokenDefinition(isoToken, asciiToken);
    tp.setTokenLengthLimits(maxTokenLength, minTokenLength,
    	    streamMaxTokenLength, streamMinTokenLength);
    if (pDiagFilename.length() > 0) {
    	tp.setSaveMessage(true);
    }
    dictionaryWord dw;
    unsigned int ntokens = 0;
    
    while (tp.nextToken(dw)) {
    	dict.add(dw, mf.getCategory());
	ntokens++;
	@<Prune unique words from dictionary if autoPrune threshold is exceeded@>;
    }
    messageCount[mf.getCategory()] += mf.getMessageCount();
    
    if (verbose) {
    	cerr << "  Added " << mf.getMessageCount() << " messages, " <<
	    	ntokens << " tokens in " << mf.getLineCount() << " lines." << endl;
	cerr << "  Dictionary contains " << dict.size() << " unique tokens." << endl;
	cerr << "  Dictionary size " << dict.estimateMemoryRequirement() << " bytes." << endl;
    }
}

@
If \.{--autoprune} is specified, the memory consumed by the
dictionary is estimated as tokens are added and, if the
threshold is exceeded, all unique words are pruned from the
dictionary.  If, after the prune is complete, the dictionary
still exceeds 90\% of the pruning threshold, we're on the verge
of beginning to thrash, pruning over and over to no effect.
If this is the case, we automatically increase the
\.{--autoprune} setting by 25\% to stave off thrashing
(while, of course, running the risk of {\it paging} thrashing
if physical memory is exceeded).
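
As a minimal sketch of this policy, separated from the dictionary
itself, the hypothetical function |adjustAutoPrune| below returns
the threshold to use after a prune has completed.  It widens the
multiplication to 64 bits, an assumption made here so that the 90\%
test cannot overflow for thresholds above roughly 477 megabytes;
the program's own test uses |unsigned int| arithmetic.

```cpp
// Return the --autoprune threshold to use after a prune which left the
// dictionary occupying sizeAfterPrune bytes.  Hypothetical helper, not
// part of the program.
unsigned int adjustAutoPrune(unsigned int threshold, unsigned int sizeAfterPrune) {
    // Ninety percent of the threshold, computed in 64 bits
    unsigned long long ninetyPercent =
        (static_cast<unsigned long long>(threshold) * 9) / 10;

    if (sizeAfterPrune > ninetyPercent) {
        // Still too close to the limit: raise the threshold by a quarter
        return static_cast<unsigned int>(threshold * 1.25);
    }
    return threshold;   // Prune freed enough memory; leave the threshold alone
}
```

With a threshold of 1000 bytes, a dictionary left at 950 bytes after
pruning raises the threshold to 1250, while one left at 900 bytes or
fewer leaves it unchanged.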

@<Prune unique words from dictionary if autoPrune threshold is exceeded@>=
    if ((autoPrune != 0) && (dict.estimateMemoryRequirement() > autoPrune)) {
	if (verbose) {
	    cerr << "Dictionary size " << dict.estimateMemoryRequirement() <<
		    "; starting automatic prune." << endl;
	}
	dict.purge(1);
	if (dict.estimateMemoryRequirement() > ((autoPrune * 9) / 10)) {
	    cerr << "Dictionary size after --autoprune is larger than 90%" << endl;
	    cerr << "of --autoprune setting of " << autoPrune << " bytes." << endl;
	    autoPrune = static_cast<unsigned int>(autoPrune * 1.25);
	    cerr << "Increasing --autoprune threshold 25% to " << autoPrune <<
		    " to avoid thrashing." << endl;
    	}
    }

@
The |updateProbability| function recomputes word probabilities in
the dictionary.  It should be called after any changes are made to the
contents of the dictionary.  Any operation which recomputes the
probabilities makes us ineligible for optimising out probability
computation when loading the first dictionary, so we clear
the |singleDictionaryRead| flag.

@<Global functions@>=
static void updateProbability(void)
{
    dict.computeJunkProbability(messageCount[dictionaryWord::Mail], messageCount[dictionaryWord::Junk],
    	    mailBias, minOccurrences);
    singleDictionaryRead = false;
}

@
The |printDictionary| function dumps the dictionary in human-readable form
to a specified output stream.

@<Global functions@>=
static void printDictionary(ostream &os = cout)
{
    updateProbability();
    os << "Dictionary contains " << dict.size() << " unique tokens." << endl;
    for (dictionary::iterator dp = dict.begin(); dp != dict.end(); dp++) {
    	dp->second.describe(os);
    }
}

@
The |classifyMessages| function classifies the first message in the mail
folder |fname|.

@<Global functions@>=
static double classifyMessages(const char *fname)
{
    double jp;
    
    if (dict.empty() && !fDict.isDictionaryLoaded()) {
    	cerr << "You cannot --classify or --test a message "
	    	"unless you have first loaded a dictionary." << endl;
	jp = 0.5; 	    // Beats me--call it fifty-fifty junk probability
    } else {
	mailFolder mf(fname, dictionaryWord::Mail);

	classifyMessage cm(mf, dict, &fDict, significantWords, novelWordProbability);

	jp = cm.classifyThis();
	if (verbose) {
    	    cerr << "Message junk probability: " << setprecision(5) << jp << endl;
	}
    }
    nTested++;
    return jp;
}
           
@*1 Header include files.

The following include files provide access to system and
library components.

@<Include header files@>=
#include "config.h" 	    // Configuration definitions from \.{./configure}

@<Tweak configuration when building for Win32@>@/

@<C++ standard library include files@>@/
@<C library include files@>@/
@<Conditional C library include files@>@/

#ifdef WIN32
#define __GNU_LIBRARY__
#undef __GETOPT_H__
#endif
#include "getopt.h"     // Use our own |getopt|, which supports |getopt_long|
#include "statlib.h"	// Statistical library

@<Configuration of conditional capabilities@>@/

@<Network library include files@>@/

@
We use the following \CPP/ standard library include files.
Note that current \CPP/ theology prescribes that these files
not bear the traditional \.{.h} extension; since some libraries
have gotten it into their pointy little heads to natter about
this, we conform.  If you're using an older \CPP/ system, you
may have to restore the \.{.h} extension if one or more of these
come up ``not found''.

@<C++ standard library include files@>=
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstdlib>
#include <string>
#include <sstream>
#ifdef HAVE_FDSTREAM_COMPATIBILITY
#include "fdstream.hpp"
#endif
#ifdef HAVE_NEW_STRSTREAM
#include "mystrstream_new.h"
#else
#include "mystrstream.h"
#endif
#include <vector>
#include <algorithm>
#include <map>
#include <stack>
#include <deque>
#include <queue>
#include <list>
#include <set>
#include <bitset>
#include <functional>
#include <cmath>
using namespace std;

@
We also use the following \CEE/ library include files for
low-level operations.

@<C library include files@>=
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <ctype.h>
#include <string.h>
#include <assert.h>

@
Some \CEE/ library header files are included only on platforms which
support the facilities they provide.  This is determined by the
\.{./configure} script, which sets variables in
\.{config.h} which we use to include them if present.

@<Conditional C library include files@>=
#ifdef HAVE_STAT
#include <sys/stat.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#ifdef HAVE_DIRENT_H
#include <dirent.h>
#endif
#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

@
The following libraries are required to support the network
operations required by the POP3 proxy server.  If the minimal
subset required to support the server is not present, it will be
disabled.

@<Network library include files@>=
#if defined(HAVE_SOCKET) && defined(HAVE_SIGNAL)
#define POP3_PROXY_SERVER
#endif

#ifdef POP3_PROXY_SERVER
#include <signal.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <errno.h>
#endif

@
Some capabilities of the program depend in non-trivial ways on
the presence of certain system features detected by the
\.{./configure} script.  Here we test for the prerequisites
and define an internal tag to enable the feature if all are
met.

@<Configuration of conditional capabilities@>=
#if defined(HAVE_GNUPLOT) && defined(HAVE_NETPBM) && defined(HAVE_SYSTEM)
#define HAVE_PLOT_UTILITIES
#endif

#if defined(HAVE_DIRENT_H) && defined(HAVE_STAT)
#define HAVE_DIRECTORY_TRAVERSAL
#endif

#if defined(HAVE_PDFTOTEXT) && defined(HAVE_POPEN) && (defined(HAVE_MKSTEMP) || defined(HAVE_TMPNAM))
#define HAVE_PDF_DECODER
#endif

@
It's a pain in the posterior to have to edit the
\.{config.h} file to disable features not supported on
Win32 platforms.  Since we can't run \.{./configure}
there, the process can't be automated.  So, we take the
lazy way out and manually undefine features absent on
Win32, even if they were auto-detected on the platform
which generated \.{config.h}.  Tacky.

@<Tweak configuration when building for Win32@>=
#ifdef WIN32
#undef HAVE_MMAP
#endif


@
The following global variables are used to keep track of command
line options.

@d Annotate(c)	(annotations.test(c)) 	// Test if annotation is requested

@<Command line arguments@>=
static double mailBias = 2.0;	    	// Bias for words in legitimate mail
static unsigned int minOccurrences = 5; // Minimum occurrences to trust probability
static double junkThreshold = 0.9;  	// Threshold above which we classify mail as junk
static double mailThreshold = 0.9;  	// Threshold below which we classify as mail
static int significantWords = 15;   	// Number of words to use in classifying message
static double novelWordProbability = 0.2;   // Probability assigned to words not in dictionary
static bitset <1 << (sizeof(char) * 8)> annotations;	// Annotations requested in transcript
#ifdef POP3_PROXY_SERVER
static int popProxyPort = 9110;     	// POP3 proxy server listen port
static string popProxyServer = "";	// POP3 server (IP address or fully-qualified domain name)
static int popProxyServerPort = 110;	// POP3 server port
#endif
static bool bsdFolder = false;	    	// Does mail folder use pure BSD ``\.{From\ }'' semantics ?

@
These globals are used to check for inconsistent option
specifications.

@<Command line arguments@>=
static unsigned int nTested = 0;    	// Number of messages tested

@
The following options are referenced in class definitions and must
be placed in the |@<Global variables@>| section so they'll be declared
prior to references to them.

@<Global variables@>=
static bool verbose = false;	    	// Print verbose processing information
#ifdef TYPE_LOG
static ofstream typeLog("/tmp/typelog.txt");
#endif
static string pDiagFilename = "";   	// Parser diagnostic file name
static string transcriptFilename = "";	// Message transcript file name
static bool pTokenTrace = false;    	// Include detailed token trace in pDiagFilename output ?
static unsigned int maxTokenLength = 64, minTokenLength = 1;	// Maximum and minimum token length limits
static unsigned int streamMaxTokenLength = 64, streamMinTokenLength = 5;    // Maximum and minimum byte stream token length limits
static bool singleDictionaryRead = true; // Can we optimise probability computation after dictionary import ?
static unsigned int phraseMin = 1, phraseMax = 1;   // Minimum and maximum phrase length in words
static unsigned int phraseLimit = 48;	// Maximum phrase length in characters
static unsigned int autoPrune = 0;  	// Automatic prune based on dictionary memory consumption
static bool popProxyTrace = false;  	// Should POP3 server write trace to |cerr|?
static bool sloppyheaders = false;  	// Accept messages with malformed MIME headers

@ Procedure |usage|
prints how-to-call information.  This serves as a reference for the
option processing code which follows.  Don't forget to update
|usage| when you add an option!

@<Global functions@>=
static void usage(void)
{
    cout << PRODUCT << "  --  Annoyance Filter.  Call" << endl;
    cout << "                      with "<< PRODUCT << " [options]" << endl;
    cout << "" << endl;
    cout << "Options:" << endl;
    cout << "    --annotate options     Specify optional annotations in --transcript" << endl;
    cout << "    --autoprune n          Automatically prune unique words when dictionary exceeds n bytes" << endl;
    cout << "    --biasmail n           Set frequency bias for words and phrases in legitimate mail to n" << endl;
    cout << "    --binword n            Scan binary streams for words >= n characters (0 = none)" << endl;
    cout << "    --bsdfolder            Next --mail or --junk folder uses BSD \"From \" separator" << endl;
    cout << "    --classify fname       Classify first message in fname" << endl;
    cout << "    --clearjunk            Clear junk counts in dictionary" << endl;
    cout << "    --clearmail            Clear mail counts in dictionary" << endl;
    cout << "    --copyright            Print copyright information" << endl;
    cout << "    --csvread fname        Import dictionary from fname in CSV format" << endl;
    cout << "    --csvwrite fname       Export dictionary to fname in CSV format" << endl;
    cout << "    --fread fname          Load fast dictionary from fname" << endl;
    cout << "    --fwrite fname         Write fast dictionary to fname" << endl;
    cout << "    --help, -u             Print this message" << endl;
#ifdef Jig
    cout << "    --jig                  Test component in temporary jig" << endl;
#endif
    cout << "    --junk, -j folder      Add folder contents to junk mail dictionary" << endl;
    cout << "    --list                 Print dictionary on standard output" << endl;
    cout << "    --mail, -m folder      Add folder contents to legitimate mail dictionary" << endl;
    cout << "    --newword n            Set probability for words not in dictionary to n" << endl;
    cout << "    --pdiag fname          Print parser diagnostics to fname" << endl;
    cout << "    --phraselimit n        Set phrase maximum length to n characters" << endl;
    cout << "    --phrasemax n          Set phrase maximum to n words" << endl;
    cout << "    --phrasemin n          Set phrase minimum to n words" << endl;
#ifdef HAVE_PLOT_UTILITIES
    cout << "    --plot fname           Plot histogram of word probabilities in dictionary" << endl;
#endif
#ifdef POP3_PROXY_SERVER
    cout << "    --pop3port n           Listen for POP3 proxy requests on port n (default 9110)" << endl;
    cout << "    --pop3server serv[:p]  Operate POP3 proxy for server, port p (default 110)" << endl;
    cout << "    --pop3trace            Trace POP3 proxy traffic on standard error" << endl;
#endif
    cout << "    --prune                Prune infrequently used words from dictionary" << endl;
    cout << "    --ptrace               Include detailed trace in --pdiag output" << endl;
    cout << "    --read, -r fname       Import dictionary from fname" << endl;
    cout << "    --sigwords n           Classify message based on n most significant words" << endl;
    cout << "    --sloppyheaders        Accept messages with malformed MIME part separators" << endl;
    cout << "    --statistics           Print statistics of dictionary" << endl;
    cout << "    --test, -t fname       Test first message in fname" << endl;
    cout << "    --threshjunk n         Set junk threshold to n" << endl;
    cout << "    --threshmail n         Set mail threshold to n" << endl;
    cout << "    --transcript fname     Write annotated message transcript to fname" << endl;
    cout << "    --verbose, -v          Print processing information" << endl;
    cout << "    --version              Print version number" << endl;
    cout << "    --write fname          Export dictionary to fname" << endl;
    cout << "" << endl;
    cout << "by John Walker" << endl;
    cout << "http://www.fourmilab.ch/" << endl;
}

@
We use |getopt_long| to process command line options.  This
permits aggregation of single letter options without arguments and
both \.{-d}{\it arg} and \.{-d\ }{\it arg} syntax.  Long options,
preceded by \.{--}, are provided as alternatives for all single
letter options and are used exclusively for less frequently used
facilities.
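
For readers unfamiliar with |getopt_long|, here is a minimal,
self-contained sketch of the pattern.  It assumes the system
\.{<getopt.h>} rather than the copy bundled with this program, and
handles only a handful of options; the |ParsedOptions| structure and
the |parseOptions| function are hypothetical.

```cpp
#include <cstdlib>
#include <getopt.h>
#include <string>

// Hypothetical container for the handful of options parsed below.
struct ParsedOptions {
    bool verbose = false;
    bool error = false;
    int sigWords = 15;
    std::string junkFolder;
    std::string mailFolder;
};

ParsedOptions parseOptions(int argc, char *argv[]) {
    // Long options with a single letter equivalent return that letter;
    // the others return a case number above the ASCII range.
    static const struct option longOptions[] = {
        { "junk", required_argument, nullptr, 'j' },
        { "mail", required_argument, nullptr, 'm' },
        { "sigwords", required_argument, nullptr, 219 },
        { "verbose", no_argument, nullptr, 'v' },
        { nullptr, 0, nullptr, 0 }
    };
    ParsedOptions po;
    int opt;

    optind = 1;     // Start scanning at the first argument
    opterr = 0;     // Report problems through po.error, not on stderr
    while ((opt = getopt_long(argc, argv, "j:m:v", longOptions, nullptr)) != -1) {
        switch (opt) {
            case 'j':
                po.junkFolder = optarg;
                break;
            case 'm':
                po.mailFolder = optarg;
                break;
            case 'v':
                po.verbose = true;
                break;
            case 219:
                po.sigWords = std::atoi(optarg);
                break;
            default:        // Unknown option or missing argument
                po.error = true;
                break;
        }
    }
    return po;
}
```

Called with the arguments
\.{-v --junk spam.mbox -mham.mbox --sigwords=20}, this sets all four
option fields, exercising both the attached and the separate forms
of option argument.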

@<Process command-line options@>=
    @q *** DID YOU REMEMBER TO ADD THE OPTION TO USAGE()? *** @>
    @q   Next free case number: 234  Freed up: none    @>
    static const struct option long_options[] = {@/
	{ "annotate", 1, NULL, 222 },@/
    	{ "autoprune", 1, NULL, 232 },@/
    	{ "biasmail", 1, NULL, 225 },@/
    	{ "binword", 1, NULL, 221 },@/
	{ "bsdfolder", 0, NULL, 231 },@/
    	{ "classify", 1, NULL, 209 },@/
    	{ "clearjunk", 0, NULL, 215 },@/
    	{ "clearmail", 0, NULL, 216 },@/
    	{ "copyright", 0, NULL, 200 },@/
    	{ "csvread", 1, NULL, 205 },@/
    	{ "csvwrite", 1, NULL, 207 },@/
    	{ "fread", 1, NULL, 228 },@/
    	{ "fwrite", 1, NULL, 229 },@/
	{ "help", 0, NULL, 'u' },@/
#ifdef Jig
	{ "jig", 0, NULL, 206 },@/
#endif
	{ "junk", 1, NULL, 'j' },@/
    	{ "list", 0, NULL, 202 },@/
	{ "mail", 1, NULL, 'm' },@/
	{ "newword", 1, NULL, 220 },@/
	{ "pdiag", 1, NULL, 212 },@/
    	{ "phraselimit", 1, NULL, 224 },@/
    	{ "phrasemax", 1, NULL, 223 },@/
    	{ "phrasemin", 1, NULL, 217 },@/
#ifdef HAVE_PLOT_UTILITIES
    	{ "plot", 1, NULL, 211 },@/
#endif
#ifdef POP3_PROXY_SERVER
	{ "pop3port", 1, NULL, 226 },@/
	{ "pop3server", 1, NULL, 227 },@/
	{ "pop3trace", 0, NULL, 230 },@/
#endif
    	{ "prune", 0, NULL, 203 },@/
	{ "ptrace", 0, NULL, 213 },@/
    	{ "purge", 0, NULL, 203 },@/	    // For compatibility, it's \.{--prune} now
    	{ "read", 1, NULL, 'r' },@/
    	{ "sigwords", 1, NULL, 219 },@/
	{ "sloppyheaders", 0, NULL, 233 },@/
	{ "statistics", 0, NULL, 210 },@/
    	{ "test", 1, NULL, 't' },@/
    	{ "threshjunk", 1, NULL, 208 },@/
    	{ "threshmail", 1, NULL, 214 },@/
	{ "transcript", 1, NULL, 204 },@/
	{ "verbose", 0, NULL, 'v' },@/
	{ "version", 0, NULL, 201 },@/
    	{ "write", 1, NULL, 218 },@/
	{ 0, 0, 0, 0 }@/
    };
    int option_index = 0;
    bool lastOption = false;	    	    // Set |true| to exit command line processing after option
    int exitStatus = 0;    	    	    // Program exit status
    
    while ((!lastOption) &&
    	(opt = getopt_long(argc, argv, "j:m:r:t:uv", long_options, &option_index)) != -1) {

        switch (opt) {

    	    @#
            case 222:	    	    // \.{--annotate} {\it options}  Add annotation {\it options} to \.{--transcript} output
	    	while ((*optarg) != 0) {
		    unsigned int ch = (*optarg++) & 0xFF;

		    if (isascii(ch) && isalpha(ch) && isupper(ch)) {
			ch = tolower(ch);
		    }
    	    	    annotations.set(ch);
		}
		break;

    	    @#
            case 232:	    	    // \.{--autoprune} {\it n}  Automatically prune unique words when dictionary exceeds {\it n} bytes
	    	autoPrune = atoi(optarg);
		if (verbose) {
		    cerr << "Unique words will be automatically pruned from dictionary when it exceeds " <<
		    	    autoPrune << " bytes." << endl;
		}
		break;

    	    @#
            case 225:	    	    // \.{--biasmail} {\it n}  Set frequency bias of words in legitimate mail to {\it n}
	    	mailBias = atof(optarg);
		if (verbose) {
		    cerr << "Frequency bias for words and phrases in legitimate mail set to " <<
		    	    mailBias << "." << endl;
		}
		break;

    	    @#
            case 221:	    	    // \.{--binword} {\it n}  Parse binary streams for words of {\it n} characters or more
	    	streamMinTokenLength = atoi(optarg);
		if (verbose) {
		    if (streamMinTokenLength > 0) {
		    	cerr << "Binary streams will be parsed for words of " <<
			    	streamMinTokenLength << " characters or more." << endl;
		    } else {
		    	cerr << "Binary streams will not be parsed for words." << endl;
		    }
		}
		break;
		
    	    @#
            case 231:	    	    // \.{--bsdfolder}  Next \.{--mail} or \.{--junk} folder uses BSD ``\.{From\ }'' separator
	    	bsdFolder = true;
	    	break;

    	    @#
            case 209:	    	    // \.{--classify} {\it fname}  Classify message in {\it fname}
	    	{
		    if (optind < argc) {
		    	cerr << "Warning: command line arguments after \"--classify " <<
			    	optarg << "\" will be ignored." << endl;
    	    	    }
		    double score = classifyMessages(optarg);
		    
	    	    if (score >= junkThreshold) {
			cout << "JUNK" << endl;
			exitStatus = 3;
    	    	    } else if (score <= mailThreshold) {
		    	cout << "MAIL" << endl;
		    	exitStatus = 0;
		    } else {
		    	cout << "INDT" << endl; 	// ``INDeTerminate''
		    	exitStatus = 4;
		    }
		    lastOption = true;	    // Bail out, ignoring any (erroneous) subsequent options
		    break;
		}
		
    	    @#
            case 215:	    	    // \.{--clearjunk}  Clear junk counts in dictionary
	    	dict.resetCat(dictionaryWord::Junk);
		messageCount[dictionaryWord::Junk] = 0;
	    	break;
		
    	    @#
            case 216:	    	    // \.{--clearmail}  Clear mail counts in dictionary
	    	dict.resetCat(dictionaryWord::Mail);
		messageCount[dictionaryWord::Mail] = 0;
	    	break;
		
    	    @#
            case 200:	    	    // \.{--copyright}  Print copyright information
                cout << "This program is in the public domain.\n";
                return 0;

    	    @#
	    case 205:	    	    // \.{--csvread} {\it fname}  Import dictionary from CSV {\it fname}
	    	{   ifstream is(optarg);
		    if (!is) {
		    	cerr << "Cannot open CSV dictionary file " << optarg << endl;
			return 1;
		    }
	    	    dict.importCSV(is);
		    if (!singleDictionaryRead) {
	    	    	updateProbability();
		    }
		    singleDictionaryRead = false;
		    is.close();
		}
	    	break;

    	    @#
            case 207:	    	    // \.{--csvwrite} {\it fname}  Export dictionary to CSV {\it fname}
	    	{   ofstream of(optarg);
		    if (!of) {
		    	cerr << "Cannot create CSV export file " << optarg << endl;
			return 1;
		    }
	    	    updateProbability();
	    	    dict.exportCSV(of);
		    of.close();
		}
		break;

    	    @#
	    case 228:	    	    // \.{--fread} {\it fname}  Load fast dictionary from {\it fname}
	    	if (!fDict.load(optarg)) {
		    cerr << "Unable to load fast dictionary file." << endl;
		    return 1;
		}
	    	break;

    	    @#
            case 229:	    	    // \.{--fwrite} {\it fname}  Export dictionary to fast dictionary {\it fname}
	    	if (dict.size() == 0) {
		    cerr << "No dictionary loaded when --fwrite command issued." << endl;
		    return 1;
		}
    	    	fastDictionary::exportDictionary(dict, optarg);
		break;
		
    	    @#
            case 'u':@;	    	    // \.{-u}, \.{--help}  Print how-to-call information
            case '?':	    	    // \.{-?}  Indication of error parsing command line
                usage();
                return 0;
		
#ifdef Jig
    	    @#
            case 206:	    	    // \.{--jig}  Test component in temporary jig
	    	{
		    @<Test component in temporary jig@>;
		}
	    	break;
#endif
		
    	    @#
	    case 'j':	    	    // \.{-j}, \.{--junk} {\it folder}  Add {\it folder} contents to junk mail dictionary
    	    	addFolder(optarg, dictionaryWord::Junk);
	    	updateProbability();
		break;

    	    @#
            case 202:	    	    // \.{--list}  Print dictionary on standard output
	    	printDictionary();
	    	break;
		
    	    @#
	    case 'm':	    	    // \.{-m}, \.{--mail} {\it folder}  Add {\it folder} contents to legitimate mail dictionary
    	    	addFolder(optarg, dictionaryWord::Mail);
	    	updateProbability();
		break;

    	    @#
            case 220:	    	    // \.{--newword} {\it n}  Set probability for words not in dictionary to {\it n}
	    	novelWordProbability = atof(optarg);
		if (verbose) {
		    cerr << "Probability for words not in dictionary set to " << novelWordProbability  << "." << endl;
		}
		break;

    	    @#
            case 212:	    	    // \.{--pdiag} {\it fname}  Write parser diagnostic log to  {\it fname}
	    	pDiagFilename = optarg;
    	    	break;

    	    @#
            case 224:	    	    // \.{--phraselimit} {\it n}  Set phrase maximum length to {\it n} characters
	    	phraseLimit = atoi(optarg);
		if (verbose) {
		    cerr << "Phrase maximum length set to " << phraseLimit << " characters."  << endl;
		}
		break;

    	    @#
            case 223:	    	    // \.{--phrasemax} {\it n}  Set phrase maximum to {\it n} words
	    	phraseMax = atoi(optarg);
		if (verbose) {
		    cerr << "Phrase maximum length set to " << phraseMax << " word"  <<
		    	(phraseMax == 1 ? "" : "s") << "." << endl;
		}
		break;

    	    @#
            case 217:	    	    // \.{--phrasemin} {\it n}  Set phrase minimum to {\it n} words
	    	phraseMin = atoi(optarg);
		if (verbose) {
		    cerr << "Phrase minimum length set to " << phraseMin << " word"  <<
		    	(phraseMin == 1 ? "" : "s") << "." << endl;
		}
		break;

    	    @#
#ifdef HAVE_PLOT_UTILITIES
            case 211:	    	    // \.{--plot} {\it fname}  Plot dictionary histogram as {\it fname}\.{.png}
	    	updateProbability();
		dict.plotProbabilityHistogram(optarg);
    	    	break;
#endif

    	    @#
#ifdef POP3_PROXY_SERVER
            case 226:	    	    // \.{--pop3port} {\it n}  Listen for POP3 proxy requests on port {\it n} (default 9110)
	    	popProxyPort = atoi(optarg);
		if (verbose) {
		    cerr << "POP3 proxy server will listen on port " << popProxyPort << endl;
		}
		break;
#endif

    	    @#
#ifdef POP3_PROXY_SERVER
            case 227:	    	    // \.{--pop3server} {\it serv:p}  Operate POP3 proxy for server {\it serv:p}.  Port {\it p} defaults to 110
	    	{
		    if (optind < argc) {
		    	cerr << "Warning: command line arguments after \"--pop3server " <<
			    	optarg << "\" will be ignored." << endl;
    	    	    }
		    string sarg = optarg;
		    string::size_type pind = sarg.find_last_of(':');
		    if (pind != string::npos) {
		    	if ((pind < (sarg.length() - 1)) &&
			    (pind > 0) &&
			    isdigit(sarg[pind + 1])) {
		    	    popProxyServerPort = atoi(sarg.substr(pind + 1).c_str());
			} else {
			    cerr << "Invalid port number specification in --pop3server argument." << endl;
			    return 1;
			}
			sarg = sarg.substr(0, pind);
		    }
		    popProxyServer = sarg;
		    if (verbose) {
		    	cerr << "POP3 server will act as proxy for " << popProxyServer << ":" <<
			    popProxyServerPort << endl;
		    }
		    lastOption = true;	    // Bail out, ignoring any (erroneous) subsequent options
		    break;
		}
#endif

    	    @#
#ifdef POP3_PROXY_SERVER
            case 230:	    	    // \.{--pop3trace}  Trace POP3 proxy server operations on |cerr|
	    	popProxyTrace = true;
	    	break;
#endif
		
    	    @#
            case 203:	    	    // \.{--prune}  Purge dictionary of infrequently used words
	    	updateProbability();
	    	dict.purge();
	    	break;

    	    @#
            case 213:	    	    // \.{--ptrace}  Include token by token trace in \.{--pdiag} output
	    	pTokenTrace = true;
	    	break;

    	    @#
	    case 'r':	    	    // \.{-r}, \.{--read} {\it fname}  Read dictionary from {\it fname}
	    	{
#ifdef HAVE_MMAP
		    int fileHandle = open(optarg, O_RDONLY);
		    if (fileHandle == -1) {
		    	cerr << "Cannot open dictionary file " << optarg << endl;
			return 1;
		    }
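		    long fileLength = lseek(fileHandle, 0, SEEK_END);
		    lseek(fileHandle, 0, SEEK_SET);
		    void *mapped = mmap((caddr_t) 0, fileLength,
			    PROT_READ, MAP_SHARED | MAP_NORESERVE,
			    fileHandle, 0);
		    if (mapped == MAP_FAILED) {     // Don't hand a bogus pointer to the stream
		    	cerr << "Cannot map dictionary file " << optarg << endl;
			close(fileHandle);
			return 1;
		    }
		    char *dp = static_cast<char *>(mapped);
		    istrstream is(dp, fileLength);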
#else
		    ifstream is(optarg, ios::binary);
		    if (!is) {
		    	cerr << "Cannot open dictionary file " << optarg << endl;
			return 1;
		    }
#endif
	    	    dict.importFromBinaryFile(is);
#ifdef HAVE_MMAP
    		    munmap(dp, fileLength);
		    close(fileHandle);
#else
		    is.close();
#endif
		    if (!singleDictionaryRead) {
	    	    	updateProbability();
		    }
		    singleDictionaryRead = false;
		}
	    	break;

    	    @#
            case 219:	    	    // \.{--sigwords} {\it n}  Classify message based on {\it n} most significant words
	    	significantWords = atoi(optarg);
		if (verbose) {
		    cerr << "Significant words set to " << significantWords << "." << endl;
		}
		break;
		
    	    @#
	    case 234:	    	    // \.{--sloppyheaders}  Accept messages with malformed MIME part separators
	    	sloppyheaders = true;
		break;

    	    @#
            case 210:	    	    // \.{--statistics}  Print statistics of dictionary
	    	updateProbability();
	    	dict.printStatistics();
	    	break;

    	    @#
            case 't':	    	    // \.{-t}, \.{--test} {\it fname}  Test message in {\it fname}
	    	{   double score = classifyMessages(optarg);
		
		    if (transcriptFilename != "-") {
	    	    	cout << "Junk probability " << score << endl;
		    }
		}
		break;

    	    @#
            case 208:	    	    // \.{--threshjunk} {\it n}  Set junk threshold to {\it n}
	    	junkThreshold = atof(optarg);
		if (verbose) {
		    cerr << "Junk threshold set to " << setprecision(5) << junkThreshold << "." << endl;
		}
		break;

    	    @#
            case 214:	    	    // \.{--threshmail} {\it n}  Set mail threshold to {\it n}
	    	mailThreshold = atof(optarg);
		if (verbose) {
		    cerr << "Mail threshold set to " << setprecision(5) << mailThreshold << "." << endl;
		}
		break;

    	    @#
            case 204:	    	    // \.{--transcript} {\it fname}  Write annotated message transcript to {\it fname}
	    	transcriptFilename = optarg;
    	    	break;
		
    	    @#
	    case 'v':	    	    // \.{-v}, \.{--verbose}  Print processing information
	    	verbose = true;
		break;

    	    @#
            case 201:	    	    // \.{--version}  Print version information
	    	{
		    @<Print program version information@>;
		}
                return 0;
		
    	    @#
	    case 218:	    	    // \.{--write} {\it fname}  Write dictionary to {\it fname}
	    	{   ofstream of(optarg, ios::binary);
		    if (!of) {
		    	cerr << "Cannot create dictionary file " << optarg << endl;
			return 1;
		    }
	    	    updateProbability();
	    	    dict.exportToBinaryFile(of);
		    of.close();
		}
	    	break;

    	    @#
	    default:
	    	cerr << "***Internal error: unhandled case " << opt <<
		    	" in option processing." << endl;
		return 1;
        }
    }
    
    @<Check for inconsistencies in option specifications@>;
    
@
Some combinations of option specifications make no sense or indicate
the user doesn't understand how they're processed.  Check for
such circumstances and issue warnings to point out the error of
the user's ways.

@<Check for inconsistencies in option specifications@>=
    if (pTokenTrace && (pDiagFilename == "")) {
    	cerr << "Warning: --ptrace requested but no --pdiag file specified." << endl;
    }
    
    if ((transcriptFilename != "") && (nTested == 0)) {
    	cerr << "Warning: --transcript requested but no message --test or --classify done." << endl;
    }
    
    if ((pDiagFilename != "") && (nTested == 0)) {
    	cerr << "Warning: --pdiag requested but no message --test or --classify done." << endl;
    }
    
    if (annotations.count() > 0 && (transcriptFilename == "")
#ifdef POP3_PROXY_SERVER
    	    	    	    	&& (popProxyServer == "")
#endif
        ) {
    	cerr << "Warning: --annotate requested but no --transcript or --pop3server requested." << endl;
    }
    
@
Print a primate-readable message giving the version of the program,
source and contact information, and optional features compiled in.

@<Print program version information@>=
    cout << PRODUCT @, " " @, VERSION << endl;
    cout << "Last revised: " @, REVDATE << endl;
    @<List optional capabilities configured in this build@>;
    cout << "The latest version is always available from:" << endl;
    cout << "    http://www.fourmilab.ch/annoyance-filter/" << endl;
    cout << "Please report bugs to:" << endl;
    cout << "    bugs@@fourmilab.ch" << endl;
    
@
This little utility function worries about printing the label
before the first optional capability and keeping track of how
many we've printed in order to say ``none'' if that's the case.

@<Utility functions@>=
static unsigned int nOptionalCaps = 0;

static void printOptionalCapability(const string &s)
{
    if (nOptionalCaps == 0) {
    	cout << "Optional capabilities configured:" << endl;
    }
    nOptionalCaps++;	    	    // Count every capability printed, not just the first
    cout << "    " << s << "." << endl;
}
    
@
Show which optional features detected by \.{configure} were
built into the program.

@<List optional capabilities configured in this build@>=
#ifdef HAVE_PDF_DECODER
    printOptionalCapability("Decoding strings in PDF attachments");
#endif

#ifdef HAVE_DIRECTORY_TRAVERSAL
    printOptionalCapability("Directory traversal in the --mail and --junk options");
#endif

#ifdef HAVE_MMAP
    printOptionalCapability("Memory mapped access to dictionary and fast dictionary files");
#endif

#ifdef HAVE_PLOT_UTILITIES
    printOptionalCapability("Plotting distribution histogram (--plot option)");
#endif

#ifdef POP3_PROXY_SERVER
    printOptionalCapability("POP3 proxy server");
#endif

    if (nOptionalCaps == 0) {
    	cout << "Optional capabilities configured: none." << endl;
    }
        
@** Character set definitions and translation tables.

The following sections define the character set used in the
program and provide translation tables among various representations
used in formats we emit.

@
Define the various kinds of tokens we parse from the input stream.

@<Master dictionary@>=
static tokenDefinition isoToken;   	    // ISO-8859 token definition
static tokenDefinition asciiToken;  	    // US-ASCII token definition

@*1 ISO 8859-1 character types.

The following definitions provide equivalents of the \.{ctype.h} macros
which work for 8-bit ISO 8859-1 characters.  They require that
\.{ctype.h} be included before they're used.
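
As a hedged illustration of the rules these macros encode, the sketch
below restates them as plain range tests.  The function names are
hypothetical and appear nowhere in the program, which consults bit
vector tables instead.  The sketch assumes the ISO 8859-1 layout:
accented upper case letters at \.{0xC0}--\.{0xDE}, their lower case
forms \.{0x20} higher, and the multiplication sign (\.{0xD7}) and
division sign (\.{0xF7}) belonging to neither case.

```cpp
// Illustrative sketch only: these helpers are hypothetical and are not
// part of the program.  Each takes one ISO 8859-1 code in the range 0-255.

constexpr bool latin1IsUpper(unsigned c)
{
    return (c >= 'A' && c <= 'Z') ||
           (c >= 0xC0 && c <= 0xDE && c != 0xD7);   // 0xD7 is the multiplication sign
}

constexpr bool latin1IsLower(unsigned c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 0xDF && c <= 0xFF && c != 0xF7);   // 0xF7 is the division sign
}

constexpr unsigned latin1ToUpper(unsigned c)
{
    // Sharp s (0xDF) and y with diaeresis (0xFF) have no upper case form
    return (latin1IsLower(c) && c != 0xDF && c != 0xFF) ? c - 0x20 : c;
}

constexpr unsigned latin1ToLower(unsigned c)
{
    return latin1IsUpper(c) ? c + 0x20 : c;
}

// Compile-time checks of the special cases
static_assert(latin1ToUpper(0xE9) == 0xC9, "e acute maps to E acute");
static_assert(latin1ToUpper(0xDF) == 0xDF, "sharp s is left unchanged");
static_assert(latin1ToUpper(0xFF) == 0xFF, "y diaeresis is left unchanged");
static_assert(latin1ToLower(0xD7) == 0xD7, "multiplication sign is not a letter");
```

The checks exercise the two lower case letters, \.{0xDF} and \.{0xFF},
which |toISOupper| must leave unchanged, and one of the two symbols
which sit in the middle of the accented letter ranges.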

@<Global variables@>=

#define ISOch(x)    	(static_cast<unsigned char>((x) & 0xFF))
#define isISOspace(x)   (isascii(ISOch(x)) && isspace(ISOch(x)))
#define isISOalpha(x)   ((isoalpha[ISOch(x) / 8] & (0x80 >> (ISOch(x) % 8))) != 0)
#define isISOupper(x)   ((isoupper[ISOch(x) / 8] & (0x80 >> (ISOch(x) % 8))) != 0)
#define isISOlower(x)   ((isolower[ISOch(x) / 8] & (0x80 >> (ISOch(x) % 8))) != 0)
#define toISOupper(x)   (isISOlower(x) ? (isascii(ISOch(x)) ?  \
                            toupper(x) : (((ISOch(x) != 0xDF) && \
                            (ISOch(x) != 0xFF)) ? \
                            (ISOch(x) - 0x20) : (x))) : (x))
#define toISOlower(x)   (isISOupper(x) ? (isascii(ISOch(x)) ?  \
                            tolower(x) : (ISOch(x) + 0x20)) \
                            : (x))

@
The following tables are bit vectors which define membership in the character
classes tested for by the preceding macros.
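
To make the layout concrete, here is a small sketch of the lookup,
assuming the convention used by the macros above: code $c$ lives in
byte $c/8$ of the table, and the most significant bit of each byte
stands for the lowest of its eight codes.  The names \.{inSet} and
\.{upperSet} are hypothetical; the table values repeat those of
|isoupper|.

```cpp
// Illustrative sketch only: inSet and upperSet are hypothetical names.
// upperSet repeats the 32 bytes (256 bits) of the isoupper table.
constexpr unsigned char upperSet[32] = {
    0,0,0,0,0,0,0,0,127,255,255,224,0,0,0,0,0,0,0,0,0,0,0,0,255,255,254,254,
    0,0,0,0
};

// Test membership of code c (0-255) in a 32 byte bit vector.
// Byte c / 8 holds the bit; mask 0x80 selects the first code of each group of eight.
constexpr bool inSet(const unsigned char *set, unsigned c)
{
    return (set[c / 8] & (0x80 >> (c % 8))) != 0;
}

// 'A' is code 65: byte 8 (value 127), mask 0x40, so the bit is set
static_assert(inSet(upperSet, 'A'), "A is upper case");
// '@' is code 64: byte 8, mask 0x80, which 127 does not contain
static_assert(!inSet(upperSet, '@'), "@ is not a letter");
// 0xD7 is code 215: byte 26 (value 254), mask 0x01, so the bit is clear
static_assert(!inSet(upperSet, 0xD7), "multiplication sign is not a letter");
```

Packing eight codes per byte keeps each class table to 32 bytes, at
the cost of a divide, a remainder, and a shift for every test.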

@<Global variables@>=
const unsigned char isoalpha[32] = {
    0,0,0,0,0,0,0,0,127,255,255,224,127,255,255,224,0,0,0,0,0,0,0,0,255,255,
    254,255,255,255,254,255
};

const unsigned char isoupper[32] = {
    0,0,0,0,0,0,0,0,127,255,255,224,0,0,0,0,0,0,0,0,0,0,0,0,255,255,254,254,
    0,0,0,0
};

const unsigned char isolower[32] = {
    0,0,0,0,0,0,0,0,0,0,0,0,127,255,255,224,0,0,0,0,0,0,0,0,0,0,0,1,255,255,
    254,255
};

@
To perform component tests during the development process,
we provide a {\it test jig} in which the component may
be figuratively mounted and exercised.  When the program is compiled
with \.{Jig} defined, a \.{--jig} option (without
argument) is included to activate the test.

@<Test component in temporary jig@>=
#ifdef Jig
#endif

@
The component in the temporary test jig may require some
items declared in global context.  Here's where you can put
such declarations.

@<Global declarations used by component in temporary jig@>=
#ifdef Jig
#endif

@** Overall program structure.

Here we put all the pieces together in the order required
by the digestive tract of the \CPP/ compiler.  Like programmers,
who must balance their diet among the four basic food groups:
sugar, salt, fat, and caffeine, compilers require a
suitable mix of definitions, declarations, classes,
and functions to get along.  Compilers are rather more
picky than programmers in the order in which these
delectations are consumed.

@c
@h

@<Include header files@>@/
@<Global variables@>@/
@<Class definitions@>@/
@<Command line arguments@>@/
@<Class implementations@>@/
@<Master dictionary@>@/
@<Global functions@>@/
@<Utility functions@>@/
@<Main program@>@/

@q  Release History and Development Log  @>

@i log.w

@** Index.
The following is a cross-reference table for \PRODUCT. 
Single-character identifiers are not indexed, nor are
reserved words.  Underlined entries indicate where
an identifier was declared.
