This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

TagSoup is licensed under the Apache License, Version 2.0.  You may obtain a
copy of this license at http://www.apache.org/licenses/LICENSE-2.0 .  You may
also have additional legal rights not granted by this license.

TagSoup is distributed in the hope that it will be useful, but unless required
by applicable law or agreed to in writing, TagSoup is distributed on an "AS IS"
BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied;
not even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

TAGSOUP(1)                        User Commands                       TAGSOUP(1)




NAME
       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS
       java -jar tagsoup-1.2.1.jar [ options ] [ files ]

DESCRIPTION
       Rectify  arbitrary HTML into clean XHTML, using a tailored description of
       HTML.  The output will be well-formed  XML,  but  not  necessarily  valid
       XHTML.

       --files
              multiple input files should be processed into corresponding output
              files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name  begins
              with ``utf'', the output will not contain character entities; oth‐
              erwise, all non-ASCII characters are represented as entities)

       --html output rectified HTML rather than XML, omitting the  XML  declara‐
              tion and any namespace declarations

       --method=html
              output  rectified  HTML  rather than XML (end-tags are omitted for
              empty elements, and no character escaping is done  in  script  and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output  lexical  features  (specifically  comments and any DOCTYPE
              declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute  names  to  under‐
              scores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only
              content) via the SAX ContentHandler method ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don't allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be output with specified system iden‐
              tifier

       --doctype-public=public-id
              force DOCTYPE declaration to be output with specified public iden‐
              tifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify version pseudo-attribute in output XML  declaration  (does
              not affect actual version of XML output)

       --nocdata
              treat the CDATA-content elements script and style as ordinary ele‐
              ments (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number
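
       Two of the rules in the option list above are simple enough to sketch.
       The Python below is only an illustration of the documented behaviour of
       --output-encoding and --nocolons, not TagSoup's own code.  The function
       names are hypothetical, and two details are assumptions: that the
       ``utf'' prefix test ignores case, and that the entities written for
       non-ASCII characters are decimal character references.

```python
def encode_for_output(text: str, encoding: str) -> str:
    """Apply the --output-encoding rule to a run of character data.

    Only the non-ASCII handling is modelled here; escaping of markup
    characters such as & and < is out of scope for this sketch.
    """
    # Assumption: the "utf" prefix test is case-insensitive.
    if encoding.lower().startswith("utf"):
        return text  # UTF output: no character entities are needed
    # Assumption: each non-ASCII character becomes a decimal reference.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def strip_colons(name: str) -> str:
    """Apply the --nocolons rule to an element or attribute name."""
    return name.replace(":", "_")


if __name__ == "__main__":
    print(encode_for_output("caf\u00e9", "UTF-8"))
    print(encode_for_output("caf\u00e9", "ISO-8859-1"))
    print(strip_colons("o:p"))
```

       Under these assumptions, text written with a non-UTF output encoding
       stays pure ASCII, so it survives any later re-encoding of the file.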

       TagSoup is a parser and reformatter for nasty, ugly HTML.  Its normal
       processing mode is to accept HTML files on the command line, or from the
       standard input if none are given, and output them as clean XML to the
       standard output.  By default, the input encoding is assumed to be the
       platform-local encoding and the output encoding is UTF-8; the --encoding
       and --output-encoding options override these defaults.

       When the --files option is given, each input file is  processed  into  an
       output  file  of  the  corresponding  name, with the extension changed to
       xhtml.  If the extension is already xhtml, it is changed to xhtml_.
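
       The renaming rule for --files can be sketched in a few lines of Python.
       This is an illustration of the rule as described, not TagSoup's
       implementation; the function name output_name is hypothetical, and the
       handling of a file name with no extension (appending .xhtml) is an
       assumption, since the text above does not cover that case.

```python
import os.path


def output_name(src: str) -> str:
    """Return the output file name that --files would pair with src."""
    root, ext = os.path.splitext(src)
    if ext == ".xhtml":
        # Already xhtml: the extension becomes xhtml_ so the input survives.
        return src + "_"
    # Any other extension is replaced by .xhtml.
    # Assumption: a name with no extension simply gains .xhtml.
    return root + ".xhtml"


if __name__ == "__main__":
    for name in ("index.html", "page.xhtml", "notes"):
        print(name, "->", output_name(name))
```

       The trailing underscore guarantees that the output name always differs
       from the input name, so an input file is never overwritten.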

       TagSoup will repair, by whatever means necessary, violations of XML well-
       formedness.   In particular, it will fix up malformed attribute names and
       supply missing attribute-value quotation marks.  More  significantly,  it
       supplies  end-tags  where  HTML  allows them to be omitted, and sometimes
       where it doesn't.  It will even supply start-tags  where  necessary;  for
       example, if a document begins with a <li> tag, TagSoup will automatically
       prefix it with <html><body><ul>.
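
       One way to picture how missing start-tags are supplied is as a walk up
       a table of implied parents.  The Python sketch below is a toy model
       only: the table IMPLIED_PARENT and the function missing_start_tags are
       hypothetical, and the table holds just the <li> chain from the example
       above rather than TagSoup's real description of HTML.

```python
# Hypothetical parent table; only the <li> chain is taken from the text.
IMPLIED_PARENT = {"li": "ul", "ul": "body", "body": "html"}


def missing_start_tags(first_tag: str) -> str:
    """Return the start-tags to prefix when a document opens with first_tag."""
    chain = []
    parent = IMPLIED_PARENT.get(first_tag)
    while parent is not None:
        chain.append(parent)  # collected innermost first
        parent = IMPLIED_PARENT.get(parent)
    # Emit outermost first, as the tags would appear in the document.
    return "".join(f"<{name}>" for name in reversed(chain))


if __name__ == "__main__":
    print(missing_start_tags("li"))
```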

BUGS
       TagSoup can be fooled by missing close quotes after attribute values, and
       by  incorrect  character  encodings  (it  does  not  contain  an encoding
       guesser).

       TagSoup doesn't understand namespace declarations, which are not properly
       part of HTML.  Instead, any element or attribute name beginning with the
       prefix foo: will be put into the artificial namespace urn:x-prefix:foo.

       For the same reasons, namespace-qualified attributes like xml:space can't
       be  returned  as  default values, though an explicit attribute in the xml
       namespace will be returned with the proper namespace URI.
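
       The prefix-to-namespace mapping described in this section can be
       sketched as follows.  The Python is an illustration of the stated rule,
       not TagSoup's code; the function name namespace_for is hypothetical,
       and returning None for an unprefixed name is an assumption made to keep
       the sketch focused on prefixed names.

```python
# The namespace URI reserved for the xml prefix by the XML Namespaces spec.
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def namespace_for(qname: str):
    """Return the namespace URI assigned to a prefixed name, else None."""
    prefix, sep, _local = qname.partition(":")
    if not sep:
        return None  # assumption: unprefixed names are outside this rule
    if prefix == "xml":
        return XML_NAMESPACE  # explicit xml: attributes keep the proper URI
    # Any other prefix gets an artificial namespace built from the prefix.
    return "urn:x-prefix:" + prefix


if __name__ == "__main__":
    for name in ("foo:bar", "xml:space", "div"):
        print(name, "->", namespace_for(name))
```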

AUTHOR
       John Cowan <cowan@ccil.org>

COPYRIGHT
       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying  conditions.   There
       is  NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.



TagSoup 1.2.1                       July 2011                         TAGSOUP(1)
