html-parsing: Permissive Parsing of HTML to SXML

Neil Van Dyke

License: LGPLv3
Web: http://www.neilvandyke.org/racket/html-parsing/

1 Introduction

The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure.

html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML.

html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.
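As a rough illustration of the SXPath processing mentioned above, the following sketch parses a small page with html->xexp (documented in the Interface section) and collects the href values of its links. It assumes that the separate sxml package is installed and that its sxpath procedure is used for the query; the sample page and the names page, xexp, and hrefs are made up for this example.

```racket
#lang racket/base
;; Sketch only: parse HTML with html-parsing, then query it with SXPath.
;; Assumes the html-parsing and sxml packages are both installed.
(require html-parsing
         sxml)

;; Hypothetical sample page, not taken from the library's documentation.
(define page
  (string-append
   "<html><head><title>Links</title></head>"
   "<body><a href=\"url\">link</a></body></html>"))

;; Convert the HTML text to SXML/xexp.
(define xexp (html->xexp page))

;; Select the text of the href attribute of every a element.
(define hrefs ((sxpath '(// a @ href *text*)) xexp))

;; Print one href per line.
(for-each displayln hrefs)
```

The same pattern should apply to pages with syntax errors, since the query runs on the recovered SXML/xexp structure rather than on the raw text.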

2 Interface

procedure
(html->xexp input) → xexp
input : (or/c input-port? string?)

Parse HTML permissively from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:

> (html->xexp
   (string-append
    "<html><head><title></title><title>whatever</title></head>"
    "<body> <a href=\"url\">link</a><p align=center>"
    "<ul compact style=\"aa\"> <p>BLah<!-- comment <comment> -->"
    " <i> italic <b> bold <tt> ened</i> still &lt; bold </b>"
    "</body><P> But not done yet..."))

(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))

Note that, in the emitted SXML/xexp, the text token "still < bold" is not inside the b element. This is one quirk of how old Web browsers handled invalid HTML that this parser does not try to emulate.
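Because input may be either an input port or a string, the same markup can be supplied both ways. The sketch below is a minimal illustration of that; the markup string and the names markup, from-string, and from-port are hypothetical, and the example prints whatever the parser returns rather than assuming a particular structure.

```racket
#lang racket/base
;; Sketch only: html->xexp accepts a string or an input port.
(require html-parsing)

;; Hypothetical markup with deliberately unclosed elements.
(define markup "<p>Hello <b>world")

;; Parse directly from the string.
(define from-string (html->xexp markup))

;; Parse the same text through an input port.
(define from-port (html->xexp (open-input-string markup)))

;; Show the parsed SXML/xexp, then whether both parses agree.
(write from-string)
(newline)
(displayln (equal? from-string from-port))
```

For a file on disk, a port-based call such as (call-with-input-file path html->xexp) should work the same way, where path is whatever file name the caller supplies.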

3 History

Version 11:0 — 2022-07-19
- An object element is no longer considered always-empty. Incrementing major version again, because this could break parses.
Version 10:0 — 2022-07-19
- To pass a "microformats" test suite (impliedname.html), an area element can now be a child of a span element. In the future, we might be even more flexible about where span elements are permitted. (Thanks to Jacob Hall for discussing.)
Version 9:0 — 2022-04-16
- Header elements may once again appear as children of li elements (which we broke in the previous version), as we see how far we can stretch a 20-year-old hack for invalid HTML. (Thanks to Simon Budinsky for reporting.)
Version 8:0 — 2022-04-03
- The original "H" elements (h1, h2, etc.) now are parsed with "parent constraints" for handling invalid HTML, to accommodate a need to parse mid-1990s HTML in which <p> was used as a separator or terminator, rather than a start delimiter. There is a chance that this change will break a real-world scraper or other tool.
Version 7:1 — 2022-04-02
- Include a test case #:fail that was unsaved in DrRacket.
Version 7:0 — 2022-04-02
- Fixed a quirks-handling bug in which p elements would be (directly or indirectly) nested under other p elements in cases in which there was no body element, but there was an html element. (Thanks to Jonathan Simpson for reporting.)
Version 6:1 — 2022-01-22
- Permit details element to be parent of p element in quirks handling. (Thanks to Jacder for reporting.)
Version 6:0 — 2018-05-22
- Fix to permit p elements as children of blockquote elements. Incrementing major version number because this is a breaking change to behavior that had stood for 17 years, but it seems an appropriate change for modern HTML, and it fixes a particular real-world problem. (Thanks to Sorawee Porncharoenwase for reporting.)
Version 5:0 — 2018-05-15
- In a breaking change of handling invalid HTML, most named character entity references that are invalid because (possibly among multiple reasons) they are not terminated by semicolon, now are treated as literal strings (including the ampersand), rather than as named character entities. For example, parser input string "A&B Co." will now parse as (p "A&B Co.") rather than as (p "A" (& B) " Co."). (Thanks to Greg Hendershott for suggesting this, and discussing.)
- For support of historical quirks handling, five early HTML named character entities (specifically, amp, apos, lt, gt, quot) do not need to be terminated with a semicolon, and will even be recognized if followed immediately by an alphabetic character. For example, "a&ltz" will now parse as (p "a<z"), rather than as (p (& ltz)).
- Invalid character entity references that are terminated by EOF rather than semicolon may now be parsed as literal strings, rather than as entity references.
Version 4:3 — 2016-12-15
- Error message “%html-parsing:parse-html: invalid input type:” now abbreviates the invalid value, to avoid possibly huge messages. (Thanks to John B. Clements.)
Version 4:2 — 2016-03-02
- Tweaked info.rkt, filenames.
Version 4:1 — 2016-02-25
- Updated deps.
- Documentation tweaks.
Version 4:0 — 2016-02-21
- Moving from PLaneT to new package system.
- Moved unit tests into main source file.
Version 3:0 — 2015-04-24
- Numeric character entities now parse to Racket strings instead of Racket characters, to bring SXML/xexp back closer to SXML. (Thanks to John Clements for reporting.)
Version 2:0 — 2012-06-13
- Converted to McFly.
Version 0.3 — Version 1:2 — 2011-08-27
- Converted test suite from Testeez to Overeasy.
Version 0.2 — Version 1:1 — 2011-08-27
- Fixed embarrassing bug due to code tidying. (Thanks to Danny Yoo for reporting.)
Version 0.1 — Version 1:0 — 2011-08-21
- Part of forked development from HtmlPrag, parser originally written 2001-04.

4 Legal

Copyright 2001–2012, 2015–2016, 2018, 2022 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.