Version: 4.1.0.2

HTML: Parsing Library

 (require html)

The html library provides functions to read HTML documents and structures to represent them.

(read-xhtml port) → html?

  port : input-port?

(read-html port) → html?

  port : input-port?

Reads (X)HTML from a port, producing an html instance.

(read-html-as-xml port) → (listof content?)

  port : input-port?

Reads HTML from a port, producing an xexpr compatible with the xml library (which defines content?).
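
As a rough illustration, the sketch below parses a small fragment with read-html-as-xml and converts each result with xml->xexpr from the xml library. It assumes the returned values are xml content structures, as the content? contract suggests; the input string is made up for the example.

```scheme
#lang scheme
; Sketch only: the HTML string is illustrative, and we assume
; read-html-as-xml returns xml content structures.
(require (prefix-in h: html)
         (prefix-in x: xml))

(define contents
  (h:read-html-as-xml
   (open-input-string "<p>Hello <b>world</b></p>")))

; Convert each xml content structure to an X-expression.
(define xexprs (map x:xml->xexpr contents))

(printf "~s~n" xexprs)
```

The prefixes follow the same convention as the example in section 1, since html and xml export names that clash with each other and with the base language.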

1 Example

  (module html-example scheme

    ; Some of the symbols in html and xml conflict with
    ; each other and with the scheme/base language, so we prefix
    ; to avoid namespace conflict.
    (require (prefix-in h: html)
             (prefix-in x: xml))

    (define an-html
      (h:read-xhtml
       (open-input-string
        (string-append
         "<html><head><title>My title</title></head><body>"
         "<p>Hello world</p><p><b>Testing</b>!</p>"
         "</body></html>"))))

    ; extract-pcdata: html-content -> (listof string)
    ; Pulls out the pcdata strings from some-content.
    (define (extract-pcdata some-content)
      (cond [(x:pcdata? some-content)
             (list (x:pcdata-string some-content))]
            [(x:entity? some-content)
             (list)]
            [else
             (extract-pcdata-from-element some-content)]))

    ; extract-pcdata-from-element: html-element -> (listof string)
    ; Pulls out the pcdata strings from an-html-element.
    (define (extract-pcdata-from-element an-html-element)
      (match an-html-element
        ; A struct pattern needs one sub-pattern per field, including
        ; the attributes field inherited from html-element.
        [(struct h:html-full (attributes content))
         (apply append (map extract-pcdata content))]

        [(struct h:html-element (attributes))
         '()]))

    (printf "~s~n" (extract-pcdata an-html)))

  > (require 'html-example)
  ("My title" "Hello world" "Testing" "!")

2 HTML Structures

pcdata, entity, and attribute are defined in the xml documentation.

An html-content is either an html-element, a pcdata, or an entity.

(struct html-element (attributes))

  attributes : (listof attribute)

Every structure below inherits from html-element.

(struct (html-full html-element) (content))

  content : (listof html-content)

Every HTML tag that may include content inherits from html-full and adds no fields of its own.
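
The hierarchy can be checked with the structure predicates. The following is a minimal sketch, assuming the usual make- constructors and predicates generated for each structure; br and div are two of the structures listed below.

```scheme
#lang scheme
; Sketch only: br and div are two of the structures listed below.
(require (prefix-in h: html))

(define a-br (h:make-br '()))                ; attributes only
(define a-div (h:make-div '() (list a-br)))  ; attributes, then content

(h:html-element? a-br)    ; #t: br inherits from html-element
(h:html-full? a-br)       ; #f: br has no content field
(h:html-full? a-div)      ; #t: div inherits from html-full
(h:html-element? a-div)   ; #t: html-full inherits from html-element

; The content field is reached through the html-full accessor.
(length (h:html-full-content a-div))  ; 1
```

Because every content-bearing tag shares the html-full accessor, a traversal such as extract-pcdata in section 1 needs only two cases for elements.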

(struct (html html-full) ())

An html is (make-html (listof attribute) (listof Contents-of-html))

A Contents-of-html is either a head or a body.

(struct (div html-full) ())

A div is (make-div (listof attribute) (listof G2))

(struct (center html-full) ())

A center is (make-center (listof attribute) (listof G2))

(struct (blockquote html-full) ())

A blockquote is (make-blockquote (listof attribute) (listof G2))

(struct (ins html-full) ())

An ins is (make-ins (listof attribute) (listof G2))

(struct (del html-full) ())

A del is (make-del (listof attribute) (listof G2))

(struct (dd html-full) ())

A dd is (make-dd (listof attribute) (listof G2))

(struct (li html-full) ())

An li is (make-li (listof attribute) (listof G2))

(struct (th html-full) ())

A th is (make-th (listof attribute) (listof G2))

(struct (td html-full) ())

A td is (make-td (listof attribute) (listof G2))

(struct (iframe html-full) ())

An iframe is (make-iframe (listof attribute) (listof G2))

(struct (noframes html-full) ())

A noframes is (make-noframes (listof attribute) (listof G2))

(struct (noscript html-full) ())

A noscript is (make-noscript (listof attribute) (listof G2))

(struct (style html-full) ())

A style is (make-style (listof attribute) (listof pcdata))

(struct (script html-full) ())

A script is (make-script (listof attribute) (listof pcdata))

(struct (basefont html-element) ())

A basefont is (make-basefont (listof attribute))

(struct (br html-element) ())

A br is (make-br (listof attribute))

(struct (area html-element) ())

An area is (make-area (listof attribute))

(struct (alink html-element) ())

An alink is (make-alink (listof attribute))

(struct (img html-element) ())

An img is (make-img (listof attribute))
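
Structures that inherit directly from html-element carry only attributes. The sketch below builds an img by hand and reads its attributes back. It assumes the xml library's make-attribute takes start and stop source positions before the name and value, and that a symbol is acceptable as a position; 'example and "logo.png" are placeholders.

```scheme
#lang scheme
; Sketch only: 'example is a placeholder source position and
; "logo.png" is an illustrative attribute value.
(require (prefix-in h: html)
         (prefix-in x: xml))

(define src-attr (x:make-attribute 'example 'example 'src "logo.png"))
(define an-img (h:make-img (list src-attr)))

; html-element-attributes works on every structure in this section.
(for-each
 (lambda (attr)
   (printf "~a=~s~n" (x:attribute-name attr) (x:attribute-value attr)))
 (h:html-element-attributes an-img))
```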

(struct (param html-element) ())

A param is (make-param (listof attribute))

(struct (hr html-element) ())

An hr is (make-hr (listof attribute))

(struct (input html-element) ())

An input is (make-input (listof attribute))

(struct (col html-element) ())

A col is (make-col (listof attribute))

(struct (isindex html-element) ())

An isindex is (make-isindex (listof attribute))

(struct (base html-element) ())

A base is (make-base (listof attribute))

(struct (meta html-element) ())

A meta is (make-meta (listof attribute))

(struct (option html-full) ())

An option is (make-option (listof attribute) (listof pcdata))

(struct (textarea html-full) ())

A textarea is (make-textarea (listof attribute) (listof pcdata))

(struct (title html-full) ())

A title is (make-title (listof attribute) (listof pcdata))
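
For the structures whose content is a list of pcdata, the text can be recovered with pcdata-string from the xml library. A minimal sketch, assuming make-pcdata takes start and stop source positions before the string and accepts a symbol for each ('example is a placeholder):

```scheme
#lang scheme
; Sketch only: 'example is a placeholder source position.
(require (prefix-in h: html)
         (prefix-in x: xml))

(define a-title
  (h:make-title '() (list (x:make-pcdata 'example 'example "My title"))))

; Collect the strings held in the title's content.
(define title-strings
  (map x:pcdata-string (h:html-full-content a-title)))

(printf "~s~n" title-strings)
```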

(struct (head html-full) ())

A head is (make-head (listof attribute) (listof Contents-of-head))

A Contents-of-head is either

(struct (tr html-full) ())

A tr is (make-tr (listof attribute) (listof Contents-of-tr))

A Contents-of-tr is either a td or a th.

(struct (colgroup html-full) ())

A colgroup is (make-colgroup (listof attribute) (listof col))

(struct (thead html-full) ())

A thead is (make-thead (listof attribute) (listof tr))

(struct (tfoot html-full) ())

A tfoot is (make-tfoot (listof attribute) (listof tr))

(struct (tbody html-full) ())

A tbody is (make-tbody (listof attribute) (listof tr))
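
The table structures nest in the obvious way: a tbody holds tr structures, and each tr holds its cells. The sketch below builds a two-row body by hand and counts the cells in each row. It assumes td is one of the Contents-of-tr, and the empty attribute and content lists are illustrative.

```scheme
#lang scheme
; Sketch only: rows and cells are built by hand with empty
; attribute lists and empty cell content.
(require (prefix-in h: html))

(define row-1
  (h:make-tr '() (list (h:make-td '() '()) (h:make-td '() '()))))
(define row-2
  (h:make-tr '() (list (h:make-td '() '()))))
(define a-tbody (h:make-tbody '() (list row-1 row-2)))

; Count the cells in each row of the body.
(define cells-per-row
  (map (lambda (row) (length (h:html-full-content row)))
       (h:html-full-content a-tbody)))

(printf "~s~n" cells-per-row)
```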

(struct (tt html-full) ())

A tt is (make-tt (listof attribute) (listof G5))

(struct (i html-full) ())

An i is (make-i (listof attribute) (listof G5))

(struct (b html-full) ())

A b is (make-b (listof attribute) (listof G5))

(struct (u html-full) ())

A u is (make-u (listof attribute) (listof G5))

(struct (s html-full) ())

An s is (make-s (listof attribute) (listof G5))

(struct

 

(strike