CommonMark: Standard Markdown
The source of this manual is available on GitHub.
(require commonmark) | package: commonmark-lib |
1 Quick start
For information about the Markdown syntax supported by commonmark, see the CommonMark website.
In commonmark, processing Markdown is split into two steps: parsing and rendering. To get started, use string->document or read-document to parse Markdown input into a document structure:
> (require commonmark)
> (define doc (string->document "*Hello*, **markdown**!"))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())
A document is an abstract syntax tree that represents Markdown content. If you’d like, you can choose to render it however you wish, but most uses of Markdown render it to HTML, so commonmark provides the document->html and write-document-html functions, which render a document to HTML in the way recommended by the CommonMark specification:
> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>
The document->xexprs function can also be used to render a document to a list of X-expressions, which can make it more convenient to incorporate rendered Markdown into a larger HTML document (though do be aware of the caveats involving HTML blocks and HTML spans described in the documentation for document->xexprs):
> (document->xexprs doc)
'((p (em "Hello") ", " (strong "markdown") "!"))
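As a minimal sketch of incorporating rendered Markdown into a larger HTML document, the X-expressions can be spliced into a page skeleton with quasiquote. The page structure and the names body-xexprs, page, and page-html are illustrative assumptions; xexpr->string comes from Racket's xml library, not from commonmark, and is imported with only-in to avoid clashing with other names that xml exports:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/render/html
         (only-in xml xexpr->string)) ; from Racket's xml library

;; Parse a Markdown fragment and render it to a list of X-expressions.
(define body-xexprs
  (document->xexprs (string->document "*Hello*, **markdown**!")))

;; Splice the rendered blocks into an illustrative page skeleton.
(define page
  `(html (head (title "Example"))
         (body ,@body-xexprs)))

;; Serialize the whole page to a string.
(define page-html (xexpr->string page))
```

Because document->xexprs returns a list, unquote-splicing (,@) is needed to place its elements directly inside the body element.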
2 Parsing
(require commonmark/parse) | package: commonmark-lib |
procedure
(string->document str) → document?
str : string?
> (define doc (string->document "*Hello*, **markdown**!"))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())
> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>
This function cannot fail: every string of Unicode characters can—
procedure
(read-document in) → document?
in : input-port?
> (define doc (read-document (open-input-string "*Hello*, **markdown**!")))
> doc
(document
 (list (paragraph (list (italic "Hello") ", " (bold "markdown") "!")))
 '())
> (write-document-html doc)
<p><em>Hello</em>, <strong>markdown</strong>!</p>
This function can sometimes be more efficient than (string->document (port->string in)), but probably not significantly so, as the entire document structure must be realized in memory regardless.
parameter
(current-parse-footnotes? parse-footnotes?) → void? parse-footnotes? : any/c
= #f
Note that the value of current-parse-footnotes? only affects parsing, not rendering. If a document containing footnotes is rendered to HTML, the footnotes will still be rendered even if (current-parse-footnotes?) is #f.
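The following sketch shows one way to enable footnote parsing for a single call using parameterize, and compares the result with the default behavior. The input string and the names src, plain-doc, and noted-doc are illustrative assumptions:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/struct)

(define src "Racket[^1] is fun.\n\n[^1]: A footnote.\n")

;; With the default value of #f, no footnotes are collected.
(define plain-doc (string->document src))

;; Footnote parsing is enabled only within the parameterize body.
(define noted-doc
  (parameterize ([current-parse-footnotes? #t])
    (string->document src)))

(define plain-count (length (document-footnotes plain-doc)))
(define noted-count (length (document-footnotes noted-doc)))
(define noted-labels
  (map footnote-definition-label (document-footnotes noted-doc)))
```

Since only parsing consults the parameter, noted-doc can be rendered later outside the parameterize form and its footnotes will still appear.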
Added in version 1.1 of package commonmark-lib.
3 Rendering HTML
(require commonmark/render/html) | package: commonmark-lib |
procedure
(document->html doc) → string?
doc : document?
> (document->html (string->document "*Hello*, **markdown**!"))
"<p><em>Hello</em>, <strong>markdown</strong>!</p>"
procedure
(write-document-html doc [out]) → void?
doc : document? out : output-port? = (current-output-port)
> (write-document-html (string->document "*Hello*, **markdown**!"))
<p><em>Hello</em>, <strong>markdown</strong>!</p>
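The optional out argument makes it possible to stream from one port to another. The sketch below combines read-document and write-document-html; convert-markdown is a hypothetical helper, not part of commonmark, and string ports stand in for the file or network ports a real program might use:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/render/html)

;; Hypothetical helper: read Markdown from `in`, write HTML to `out`.
(define (convert-markdown in out)
  (write-document-html (read-document in) out))

;; String ports are used here so the example is self-contained.
(define out (open-output-string))
(convert-markdown (open-input-string "*Hello*, **markdown**!") out)
(define html (get-output-string out))
```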
procedure
(document->xexprs doc) → (listof xexpr/c)
doc : document?
> (document->xexprs (string->document "*Hello*, **markdown**!"))
'((p (em "Hello") ", " (strong "markdown") "!"))
Note that HTML blocks and HTML spans are not parsed and may not even contain valid HTML, which makes them difficult to represent as an X-expression. As a workaround, raw HTML will be represented as cdata elements:
> (document->xexprs
   (string->document "A paragraph with <marquee>raw HTML</marquee>."))
(list
 (list
  'p
  "A paragraph with "
  (cdata #f #f "<marquee>")
  "raw HTML"
  (cdata #f #f "</marquee>")
  "."))
This generally works out okay, since cdata elements render directly as their unescaped content, but it is, strictly speaking, an abuse of cdata.
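To illustrate how the cdata workaround behaves in practice, the sketch below serializes the X-expressions from the previous example with xexpr->string from Racket's xml library (an assumption about how the result might be consumed, not something commonmark requires). The raw HTML passes through unescaped:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/render/html
         (only-in xml xexpr->string)) ; from Racket's xml library

(define xexprs
  (document->xexprs
   (string->document "A paragraph with <marquee>raw HTML</marquee>.")))

;; Serialize each top-level X-expression and concatenate the results.
(define html (apply string-append (map xexpr->string xexprs)))
```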
parameter
(current-italic-tag tag) → void? tag : symbol?
= 'em
parameter
(current-bold-tag tag) → void? tag : symbol?
= 'strong
Reasonable alternate values for current-italic-tag and current-bold-tag include 'i, 'b, 'mark, 'cite, or 'dfn, all of which are elements with semantic (rather than presentational) meaning in HTML5. Of course, the “most correct” choice depends on how italic spans and bold spans will actually be used, so no one set of choices can be universally called the best.
> (parameterize ([current-italic-tag 'cite]
                 [current-bold-tag 'mark])
    (document->xexprs
     (string->document
      (string-append
       "> First, programming is about stating and solving problems,\n"
       "> and this activity normally takes place in a context with its\n"
       "> own language of discourse; **good programmers ought to\n"
       "> formulate this language as a programming language**.\n"
       "\n"
       "— *The Racket Manifesto* (emphasis mine)"))))
'((blockquote
   (p
    "First, programming is about stating and solving problems,\n"
    "and this activity normally takes place in a context with its\n"
    "own language of discourse; "
    (mark
     "good programmers ought to\n"
     "formulate this language as a programming language")
    "."))
  (p "— " (cite "The Racket Manifesto") " (emphasis mine)"))
4 Document structure
(require commonmark/struct) | package: commonmark-lib |
struct
(struct document (blocks footnotes) #:transparent) blocks : (listof block?) footnotes : (listof footnote-definition?)
Changed in version 1.1 of package commonmark-lib: Added the footnotes field.
struct
(struct footnote-definition (blocks label) #:transparent) blocks : (listof block?) label : string?
Footnotes are an extension to the CommonMark specification and are not enabled by default; see Footnotes in the Extensions section of this manual for more details.
A footnote definition contains a flow that can be referenced by a footnote reference via its footnote label.
Note: although footnote definitions are syntactically blocks in Markdown input, they are not a type of block (as recognized by the block? predicate) and cannot be included directly in the main document flow. Footnote definitions are collected into the separate document-footnotes field of the document structure during parsing, since they represent auxiliary definitions, and their precise location in the Markdown input does not matter.
(This is quite similar to the way the parser processes link reference definitions, except that footnote definitions must be retained separately for later rendering, whereas link reference definitions can be discarded after all link targets have been resolved.)
Added in version 1.1 of package commonmark-lib.
4.1 Blocks
See § Blocks and inlines in the CommonMark specification for more information about blocks.
Returns #t if v is a block: a paragraph, itemization, block quote, code block, HTML block, heading, or thematic break. Otherwise, returns #f.
A flow is a list of blocks. The body of a document, the contents of a block quote, and each item in an itemization are flows.
See § Paragraphs in the CommonMark specification for more information about paragraphs.
A paragraph is a block that contains inline content. In HTML output, it corresponds to a <p> element. Most blocks in a document are usually paragraphs.
struct
(struct itemization (blockss style start-num) #:transparent) blockss : (listof (listof block?)) style : (or/c 'loose 'tight) start-num : (or/c exact-nonnegative-integer? #f)
See § Lists and § List items in the CommonMark specification for more information about itemizations.
An itemization is a block that contains a list of flows. In HTML output, it corresponds to a <ul> or <ol> element.
The style field records whether the itemization is loose or tight: if style is 'tight, paragraphs in HTML output are not wrapped in <p> tags.
If start-num is #f, then the itemization represents a bullet list. Otherwise, the itemization represents an ordered list, and the value of start-num is its start number.
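The fields of an itemization can be inspected directly after parsing. In the sketch below, first-block is a hypothetical helper, and the three inputs are chosen to show a tight bullet list, an ordered list that starts at 3, and a loose bullet list:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/struct)

;; Hypothetical helper: parse a string and return its first block.
(define (first-block str)
  (car (document-blocks (string->document str))))

(define bullets  (first-block "- one\n- two\n"))
(define numbered (first-block "3. three\n4. four\n"))
(define spaced   (first-block "- one\n\n- two\n")) ; blank line between items

(define bullets-start  (itemization-start-num bullets))  ; #f for a bullet list
(define numbered-start (itemization-start-num numbered)) ; the list's start number
(define bullets-style  (itemization-style bullets))
(define spaced-style   (itemization-style spaced))
(define numbered-items (length (itemization-blockss numbered)))
```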
struct
(struct blockquote (blocks) #:transparent) blocks : (listof block?)
See § Block quotes in the CommonMark specification for more information about block quotes.
A block quote is a block that contains a nested flow. In HTML output, it corresponds to a <blockquote> element.
struct
(struct code-block (content info-string) #:transparent) content : string? info-string : (or/c string? #f)
See § Indented code blocks and § Fenced code blocks in the CommonMark specification for more information about code blocks.
A code block is a block that has unformatted content and an optional info string. In HTML output, it corresponds to a <pre> element that contains a <code> element.
The CommonMark specification does not mandate any particular treatment of the info string, but it notes that “the first word is typically used to specify the language of the code block.” In HTML output, the language is indicated by adding a CSS class to the rendered <code> element consisting of language- followed by the language name, per the spec’s recommendation.
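As a small sketch of how the info string surfaces in both the document structure and the HTML output, the example below parses a fenced code block (using tilde fences) and an indented code block. The helper first-block and the sample inputs are illustrative assumptions, and the HTML is checked only for the language- class rather than matched exactly:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/render/html
         commonmark/struct)

;; Hypothetical helper: parse a string and return its first block.
(define (first-block str)
  (car (document-blocks (string->document str))))

(define fenced-src "~~~racket\n(+ 1 2)\n~~~\n")
(define fenced   (first-block fenced-src))
(define indented (first-block "    (+ 1 2)\n"))

;; Only the fenced block carries an info string.
(define fenced-info   (code-block-info-string fenced))
(define indented-info (code-block-info-string indented))

;; Check that the rendered <code> element carries the language class.
(define has-language-class?
  (regexp-match? #rx"class=\"language-racket\""
                 (document->html (string->document fenced-src))))
```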
struct
(struct html-block (content) #:transparent) content : string?
See § HTML Blocks in the CommonMark specification for more information about HTML blocks.
An HTML block is a block that contains raw HTML content (and will be left unescaped in HTML output). Note that, in general, the content may not actually be well-formed HTML, as CommonMark simply treats everything that “looks sufficiently like” HTML—
struct
(struct heading (content depth) #:transparent) content : inline? depth : (integer-in 1 6)
See § ATX headings and § Setext headings in the CommonMark specification for more information about headings.
A heading has inline content and a heading depth. In HTML output, it corresponds to one of the <h1> through <h6> elements.
A heading depth is an integer between 1 and 6, inclusive, where higher numbers correspond to more-nested headings.
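For illustration, the sketch below reads the heading depth of an ATX heading and of both setext heading styles. The helper first-block is hypothetical:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/struct)

;; Hypothetical helper: parse a string and return its first block.
(define (first-block str)
  (car (document-blocks (string->document str))))

;; ATX headings: depth is the number of leading # characters.
(define atx-depth (heading-depth (first-block "## Section\n")))

;; Setext headings: = underlines give depth 1, - underlines give depth 2.
(define setext-1-depth (heading-depth (first-block "Title\n=====\n")))
(define setext-2-depth (heading-depth (first-block "Subtitle\n--------\n")))
```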
value
procedure
(thematic-break? v) → boolean?
v : any/c
See § Thematic breaks in the CommonMark specification for more information about thematic breaks.
A thematic break is a block. It is usually rendered as a horizontal rule, and in HTML output, it corresponds to an <hr> element.
4.2 Inline content
See § Blocks and inlines in the CommonMark specification for more information about inline content.
Returns #t if v is inline content: a string, italic span, bold span, code span, link, image, footnote reference, HTML span, hard line break, or list of inline content. Otherwise, returns #f.
See § Emphasis and strong emphasis in the CommonMark specification for more information about italic spans.
An italic span is inline content that contains nested inline content. By default, in HTML output, it corresponds to an <em> element (but an alternate tag can be used by modifying current-italic-tag).
See § Emphasis and strong emphasis in the CommonMark specification for more information about bold spans.
A bold span is inline content that contains nested inline content. By default, in HTML output, it corresponds to a <strong> element (but an alternate tag can be used by modifying current-bold-tag).
See § Code spans in the CommonMark specification for more information about code spans.
A code span is inline content that contains unformatted content. In HTML output, it corresponds to a <code> element.
struct
(struct link (content dest title) #:transparent) content : inline? dest : string? title : (or/c string? #f)
See § Links in the CommonMark specification for more information about links.
A link is inline content that contains nested inline content, a link destination, and an optional link title. In HTML output, it corresponds to an <a> element.
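Because the document structure is an ordinary tree of transparent structs, it can be traversed with match. The sketch below collects every link destination in a document; the function names are hypothetical, and for brevity the traversal only descends into the block and inline types named in its patterns, returning nothing for everything else:

```racket
#lang racket/base
(require racket/list
         racket/match
         commonmark/parse
         commonmark/struct)

;; Hypothetical helper: link destinations found in inline content.
(define (inline-link-dests v)
  (match v
    [(? list?) (append-map inline-link-dests v)]
    [(link content dest _) (cons dest (inline-link-dests content))]
    [(italic content) (inline-link-dests content)]
    [(bold content) (inline-link-dests content)]
    [_ '()])) ; strings and other inline content contribute nothing

;; Hypothetical helper: link destinations found in a single block.
(define (block-link-dests b)
  (match b
    [(paragraph content) (inline-link-dests content)]
    [(heading content _) (inline-link-dests content)]
    [(blockquote blocks) (append-map block-link-dests blocks)]
    [(itemization blockss _ _)
     (append-map block-link-dests (append* blockss))]
    [_ '()])) ; code blocks, HTML blocks, thematic breaks

(define (document-link-dests doc)
  (append-map block-link-dests (document-blocks doc)))

(define dests
  (document-link-dests
   (string->document
    (string-append
     "See [a](https://a.example) and *[b](https://b.example)*.\n"
     "\n"
     "> [c](https://c.example)\n"))))
```

Note that footnote definitions live in the separate document-footnotes field, so a complete traversal would also need to visit their blocks.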
struct
(struct image (description source title) #:transparent) description : inline? source : string? title : (or/c string? #f)
See § Images in the CommonMark specification for more information about images.
An image is inline content with a source path or URL that should point to an image. It has an inline content description (which is used as the alt attribute in HTML output) and an optional title. In HTML output, it corresponds to an <img> element.
struct
(struct footnote-reference (label) #:transparent) label : string?
Footnotes are an extension to the CommonMark specification and are not enabled by default; see Footnotes in the Extensions section of this manual for more details.
A footnote reference is inline content that references a footnote definition with a matching footnote label. In HTML output, it corresponds to a superscript <a> element.
Added in version 1.1 of package commonmark-lib.
See § Raw HTML in the CommonMark specification for more information about HTML spans.
An HTML span is inline content that contains raw HTML content (and will be left unescaped in HTML output). Note that, in general, the content may not actually be well-formed HTML, as CommonMark simply treats everything that “looks sufficiently like” HTML—
value
procedure
(line-break? v) → boolean?
v : any/c
See § Hard line breaks in the CommonMark specification for more information about hard line breaks.
A hard line break is inline content used for separating inline content within a block. In HTML output, it corresponds to a <br> element.
5 Extensions
By default, commonmark adheres precisely to the CommonMark specification, which is the subset of Markdown that behaves consistently across implementations. However, many Markdown libraries implement extensions beyond what is specified, several of which are useful enough to have become de facto standards across major Markdown implementations.
Unfortunately, since such features are not precisely specified, implementations of Markdown extensions rarely agree on how exactly they ought to be parsed and rendered, especially when interactions with other Markdown features leave edge cases and ambiguities. commonmark therefore deviates from the standard only if explicitly instructed to do so, and hopefully programmers who choose to venture into such uncharted waters understand they bear some responsibility for what they are getting themselves into.
This section documents all of the extensions commonmark currently supports. Note that, due to their inherently ill-specified nature, it can sometimes be difficult to determine whether a divergence in behavior between two Markdown implementations constitutes a bug or two incompatible features. For that reason, backwards compatibility of extensions’ behavior may not be perfectly maintained wherever the interpretation is not sufficiently “obvious”. Consider yourself warned.
5.1 Footnotes
Footnotes enjoy support from a wide variety of Markdown implementations, including PHP Markdown Extra, Python-Markdown, Pandoc, GitHub Flavored Markdown, and markdown. The [^label] syntax for references and definitions is nearly universal, but minor differences exist in interpretation, and rendering varies significantly. commonmark’s implementation is not precisely identical to any of them, but it was originally based on the cmark-gfm implementation of GitHub Flavored Markdown.
Footnotes allow auxiliary information to be lifted out of the main document flow to avoid cluttering the body text. When footnote parsing is enabled via the current-parse-footnotes? parameter, shortcut reference links with a link label that begins with a ^ character are instead parsed as footnote references. For example, the following paragraph includes three footnote references:
Racket is a programming language[^1] descended from Scheme.[^scheme]
Although not all Racket programs retain Lisp syntax, most Racket
programs still include a great many parentheses.[^(()())]
Text between the [^ and ] characters constitutes the footnote label, and the content of the footnote is provided via a footnote definition with a matching footnote label. Footnote definitions have similar syntax to link reference definitions, but unlike link reference definitions the body of a footnote definition is an arbitrary flow. For example, the following syntax defines two footnotes matched by the footnote references above:
[^1]: Technically, the name *Racket* refers to both the runtime
environment and the primary language used to program it.
[^scheme]: The original name for the Racket project was PLT Scheme,
but it was renamed in 2010 [to avoid confusion and to reflect its
departure from its roots](https://racket-lang.org/new-name.html).
Syntactically, footnote definitions are a type of container block and may appear within any flow, though they are not semantically children of any flow in which they appear. Their placement does not affect their interpretation—
As mentioned above, a footnote definition may contain an arbitrary flow consisting of any number of blocks. All lines after the first must be indented by 4 spaces to be included in the definition (unless they are lazy continuation lines). For example, the following footnote definition includes a block quote, an indented code block, and a paragraph:
[^long note]:
    > This is a block quote that is nested inside
    > the body of a footnote.

        This is an indented code block
        inside of a footnote.

    This paragraph is also inside the footnote.
A footnote reference must match a footnote definition somewhere in the document to be parsed as a footnote reference. If no such definition exists, the label will be parsed as literal text. Each footnote definition can be referenced an arbitrary number of times.
When footnotes are parsed, each footnote reference is represented in-place by an instance of footnote-reference, but footnote definitions are removed from the main document flow and collected into a list of footnote-definition instances in a separate document-footnotes field. This allows renderers to more easily match references to their corresponding definitions and ensures that the placement of definitions within a document cannot affect the rendered output.
When given a document containing footnotes, the default HTML renderer mimics the output produced by cmark-gfm. Specifically, the renderer appends a <section class="footnotes"> element to the end of the output, which wraps an <ol> element containing the footnotes’ content:
markdown
Here is a paragraph[^1] with
two footnote references.[^2]

[^1]: Here is the first footnote.
[^2]: And here is the second.
Each rendered footnote definition includes a backreference link, denoted by a ↩ character, that links to the corresponding footnote reference in the body text. If a definition is referenced multiple times, the rendered footnote will include multiple backreference links:
markdown
Here is a paragraph[^1] that
references a footnote twice.[^1]

[^1]: Here is the footnote.
In both of the previous examples, the chosen footnote labels happen to line up with the rendered footnote numbers, but in general, that does not need to be the case. Footnote references are always rendered numerically, in the order they appear in the document, regardless of the footnote labels used in the document’s source:
markdown
Here are some footnotes[^a]
with non-numeric[^b] names.

And here are some footnotes[^2]
numbered out of order.[^3]

[^a]: Here is footnote a.
[^b]: Here is footnote b.
[^2]: Here is footnote 2.
[^3]: Here is footnote 3.
Although footnotes are visually renumbered by the renderer, the generated links and link anchors are based on the original footnote labels. This means that a link to a particular footnote definition will remain stable even if a document is modified, as long as its label remains unchanged.
In a similar vein, the order in which footnote definitions appear does not matter, as they will be rendered in the order they are first referenced in the document. If a definition is never referenced, it will not be rendered at all:
markdown
Here is a paragraph[^1] with
two footnote references.[^3]

[^3]: Here is footnote 3.
[^2]: Here is footnote 2.
[^1]: Here is footnote 1.
Footnote references may appear inside footnote definitions, and commonmark will not object (though your readers might). Footnotes that are first referenced in a footnote definition will be numbered so that they immediately follow the referencing footnote:
markdown
Here is a paragraph[^1] with
two footnote references.[^2]

[^1]: Here is footnote 1.[^3]
[^2]: Here is footnote 2.
[^3]: Here is footnote 3.
Note that while matching footnote references to their corresponding definitions is handled by the parser, pruning and renumbering of footnote definitions is handled entirely by the renderer, which allows alternate renderers to use alternate schemes if they so desire.
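The division of labor between parser and renderer can be observed with a short sketch. The input below defines two footnotes but references only one; the labels and variable names are illustrative assumptions, and the rendered HTML is probed with regular expressions rather than matched exactly:

```racket
#lang racket/base
(require commonmark/parse
         commonmark/render/html
         commonmark/struct)

(define src
  (string-append
   "Here is a paragraph.[^used]\n"
   "\n"
   "[^used]: This footnote is referenced.\n"
   "\n"
   "[^unused]: This footnote is never referenced.\n"))

(define doc
  (parameterize ([current-parse-footnotes? #t])
    (string->document src)))

;; Rendering happens outside the parameterize; only parsing consults it.
(define html (document->html doc))

;; The parser retains every definition it finds...
(define parsed-count (length (document-footnotes doc)))

;; ...while the renderer emits only those that are referenced.
(define has-section?     (regexp-match? #rx"class=\"footnotes\"" html))
(define renders-used?    (regexp-match? #rx"is referenced" html))
(define renders-unused?  (regexp-match? #rx"never referenced" html))
```

An alternate renderer could instead walk document-footnotes directly and choose its own numbering or pruning scheme.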
6 Comparison with markdown
The commonmark library is not the first Markdown parser implemented in Racket: it is long predated by the venerable markdown library, which in fact also predates the CommonMark specification itself. The libraries naturally provide similar functionality, but there are some key differences:
Most obviously and most significantly, commonmark conforms to the CommonMark specification, while markdown does not. This has both pros and cons:
commonmark enjoys consistency with other CommonMark implementations and is therefore likely to behave better on existing Markdown content than markdown is. Additionally, commonmark handles some tricky edge cases more gracefully than markdown does, such as parsing of emphasis adjacent to Unicode punctuation.
On the other hand, markdown is more featureful than commonmark, as it provides some extensions that commonmark does not. Additionally, some users may find some of the ways that markdown’s parser diverges from the CommonMark specification more intuitive (which is largely just a matter of personal taste).
commonmark provides a full Markdown AST, while markdown always parses directly to HTML (in the form of X-expressions). For many users, this difference is unlikely to be important, as almost all uses of Markdown render it to HTML, anyway. However, the option to process the intermediate representation affords additional flexibility if it is needed.
commonmark is appreciably faster than markdown. On most documents, markdown is about 5× slower than commonmark, but the performance gap increases dramatically given unusually large inputs: markdown is about 8× slower to parse a 4 MiB document and 28× slower to parse an 11 MiB document.
Takeaway: if you need the extra features provided by markdown, use markdown; otherwise, use commonmark.