#LuaRock "htmlparser"
Parse HTML text into a tree of elements with selectors
##License
MIT; see ./doc/LICENSE
##Usage
Start off with

    require("luarocks.loader")
    local htmlparser = require("htmlparser")

Then, parse some html:

    local root = htmlparser.parse(htmlstring)
The input to parse may be the contents of a complete html document, or any valid html snippet, as long as all tags are correctly opened and closed.

Now, find specific contained elements by selecting:

    local elements = root:select(selectorstring)

Or in shorthand:

    local elements = root(selectorstring)
This will return a Set of elements, all of which are of the same type as the root element, and thus support selecting as well, if ever needed:

    for e in pairs(elements) do
      print(e.name)
      local subs = e(subselectorstring)
      for sub in pairs(subs) do
        print("", sub.name)
      end
    end
The root element is a container for the top-level elements in the parsed text, i.e. the `<html>` element in a parsed html document would be a child of the returned root element.
##Selectors
Supported selectors are a subset of jQuery's selectors:

- `"*"` all contained elements
- `"element"` elements with the given tagname
- `"#id"` elements with the given id attribute value
- `".class"` elements with the given classname in the class attribute
- `"[attribute]"` elements with an attribute of the given name
- `"[attribute='value']"` equals: elements with the given value for the attribute with the given name
- `"[attribute!='value']"` not equals: elements without an attribute of the given name, or with that attribute, but with a value that is different from the given value
- `"[attribute|='value']"` prefix: attribute's value is given value, or starts with given value, followed by a hyphen (`-`)
- `"[attribute*='value']"` contains: attribute's value contains given value
- `"[attribute~='value']"` word: attribute's value is a space-separated token, where one of the tokens is the given value
- `"[attribute^='value']"` starts with: attribute's value starts with given value
- `"[attribute$='value']"` ends with: attribute's value ends with given value
- `":not(selector)"` elements not matched by the given selector
- `"ancestor descendant"` elements matched by `descendant` that are contained, at any depth, in an element matched by `ancestor`
- `"parent > child"` elements matched by `child` that are direct children of an element matched by `parent`

Selectors can be combined; e.g. `".class:not([attribute]) element.class"`
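The attribute operators listed above differ only in how the attribute's value is compared to the given value. As a rough illustration of those rules, here is a minimal sketch in plain Lua. `attribute_matches` is a hypothetical helper written for this README, not a function of htmlparser, and it is not meant to mirror how the library implements matching.

```lua
-- Hypothetical helper, NOT part of htmlparser: restates the matching rules
-- from the list above. `actual` is the attribute's value, or nil if the
-- element does not have the attribute at all.
local function attribute_matches(actual, op, expected)
  if op == "" then return actual ~= nil end         -- [attribute]
  if op == "!=" then return actual ~= expected end  -- a missing attribute also matches
  if actual == nil then return false end            -- all other operators need a value
  if op == "=" then return actual == expected end
  if op == "|=" then                                -- exact value, or value followed by "-"
    return actual == expected
      or actual:sub(1, #expected + 1) == expected .. "-"
  end
  if op == "*=" then                                -- plain substring search, no patterns
    return actual:find(expected, 1, true) ~= nil
  end
  if op == "~=" then                                -- one of the space-separated tokens
    for word in actual:gmatch("%S+") do
      if word == expected then return true end
    end
    return false
  end
  if op == "^=" then return actual:sub(1, #expected) == expected end
  if op == "$=" then
    return expected == "" or actual:sub(-#expected) == expected
  end
  error("unsupported operator: " .. tostring(op))
end

print(attribute_matches("en-US", "|=", "en"))      -- true
print(attribute_matches("english", "|=", "en"))    -- false
print(attribute_matches("box wide", "~=", "wide")) -- true
print(attribute_matches(nil, "!=", "anything"))    -- true
```

Note the difference between `|=` and `^=`: `"english"` starts with `"en"`, so `^=` would match it, but `|=` requires either the exact value or a hyphen right after the prefix.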
###Limitations
- Attribute values in selectors currently cannot contain any spaces, since space is interpreted as a delimiter between the `ancestor` and `descendant`, `parent` and `>`, or `>` and `child` parts of the selector
- Likewise, for the `parent > child` relation, the spaces before and after the `>` are mandatory
- `<!` elements are not parsed, including doctype and comments
- Textnodes are not separate entries in the tree, so the content of `<p>line1<br />line2</p>` is plainly `"line1<br />line2"`
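To see why the first two limitations follow from treating space as a delimiter, consider the sketch below. It assumes the selector string is simply cut at whitespace. `split_on_space` is a hypothetical stand-in written for this illustration, and the library's actual tokenizer may well differ in detail.

```lua
-- Illustrative only, NOT htmlparser code: cut a selector string into parts
-- at every run of whitespace.
local function split_on_space(selector)
  local parts = {}
  for part in selector:gmatch("%S+") do
    parts[#parts + 1] = part
  end
  return parts
end

-- With spaces around ">", the relation shows up as its own part.
local spaced = split_on_space("div > p.intro")
print(#spaced, spaced[1], spaced[2], spaced[3])   -- 3  div  >  p.intro

-- Without the spaces, ">" stays glued to its neighbours.
local fused = split_on_space("div>p.intro")
print(#fused, fused[1])                           -- 1  div>p.intro

-- A space inside an attribute value cuts the value in two.
local broken = split_on_space("[title='hello world']")
print(#broken, broken[1], broken[2])              -- 2  [title='hello  world']
```

Under that assumption, a selector such as `[title='hello world']` never reaches the attribute matcher as one piece, which is consistent with the restriction stated above.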
##Examples
See ./doc/samples.lua
##Element type
All tree elements provide, apart from `:select` and `()`, the following accessors:
###Basic
- `.name` the element's tagname
- `.attributes` a table with keys and values for the element's attributes; `{}` if none
- `.id` the value of the element's id attribute; `nil` if not present
- `.classes` an array with the classes listed in element's class attribute; `{}` if none
- `:getcontent()` the raw text between the opening and closing tags of the element; `""` if none
- `.nodes` an array with the element's child elements, `{}` if none
- `.parent` the element that contains this element; `root.parent` is `nil`
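As a quick tour of the basic accessors, the sketch below parses a one-element snippet and prints each of them. It assumes the htmlparser rock is installed and can be found by `require`. The snippet and the variable names are made up for this example, and the comments describe what the accessor list above leads one to expect rather than verified output.

```lua
-- Assumes the htmlparser rock is installed and reachable via package.path.
local htmlparser = require("htmlparser")

-- Example input invented for this sketch.
local root = htmlparser.parse('<div id="main" class="box wide"><p>hello</p></div>')

-- root is only a container; the <div> is its first (and only) child.
local div = root.nodes[1]

print(div.name)               -- the tagname
print(div.id)                 -- value of the id attribute
print(div.attributes.class)   -- raw value of the class attribute
for _, class in ipairs(div.classes) do
  print("class:", class)      -- one line per class name
end
print(div:getcontent())       -- raw text between the opening and closing tags
print(div.nodes[1].name)      -- tagname of the first child element
print(div.parent == root)     -- whether the div's parent is the root container
```

Going through `root.nodes` instead of a selector keeps the example independent of how selection results are iterated.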
###Other
- `:gettext()` the raw text of the complete element, starting with `"<tagname"` and ending with `"/>"`
- `.level` how deep the element is in the tree; root level is `0`
- `.root` the root element of the tree; `root.root` is `root`
- `.deepernodes` a Set containing all elements in the tree beneath this element, including this element's `.nodes`; `{}` if none
- `.deeperelements` a table with a key for each distinct tagname in `.deepernodes`, containing a Set of all deeper element nodes with that name; `{}` if none
- `.deeperattributes` as `.deeperelements`, but keyed on attribute name
- `.deeperids` as `.deeperelements`, but keyed on id value
- `.deeperclasses` as `.deeperelements`, but keyed on class name