<!DOCTYPE html>
< html >
< head >
< meta charset = 'utf-8' / >
< meta http-equiv = "X-UA-Compatible" content = "chrome=1" / >
< meta name = "description" content = "LuaRock &quot;htmlparser&quot;: Parse HTML text into a tree of elements with selectors" / >
< link rel = "stylesheet" type = "text/css" media = "screen" href = "stylesheets/stylesheet.css" >
< title > LuaRock "htmlparser" < / title >
< / head >
< body >
<!-- HEADER -->
< div id = "header_wrap" class = "outer" >
< header class = "inner" >
< a id = "forkme_banner" href = "https://github.com/wscherphof/lua-htmlparser" > View on GitHub< / a >
< h1 id = "project_title" > LuaRock "htmlparser" < / h1 >
< h2 id = "project_tagline" > Parse HTML text into a tree of elements with selectors< / h2 >
< section id = "downloads" >
< a class = "zip_download_link" href = "https://github.com/wscherphof/lua-htmlparser/zipball/master" > Download this project as a .zip file< / a >
< a class = "tar_download_link" href = "https://github.com/wscherphof/lua-htmlparser/tarball/master" > Download this project as a tar.gz file< / a >
< / section >
< / header >
< / div >
<!-- MAIN CONTENT -->
< div id = "main_content_wrap" class = "outer" >
< section id = "main_content" class = "inner" >
< h2 >
< a name = "install" class = "anchor" href = "#install" > < span class = "octicon octicon-link" > < / span > < / a > Install< / h2 >
< p > Htmlparser is a listed < a href = "http://luarocks.org/repositories/rocks/" > LuaRock< / a > . Install using < a href = "http://www.luarocks.org/" > LuaRocks< / a > : < code > luarocks install htmlparser< / code > < / p >
< h3 >
< a name = "dependencies" class = "anchor" href = "#dependencies" > < span class = "octicon octicon-link" > < / span > < / a > Dependencies< / h3 >
< p > Htmlparser depends on < a href = "http://www.lua.org/download.html" > Lua 5.2< / a > , and on the < a href = "http://wscherphof.github.com/lua-set/" > "set"< / a > LuaRock, which is installed automatically along with it. To make it possible to run the tests, < a href = "https://github.com/dcurrie/lunit" > lunitx< / a > is installed as a LuaRock as well.< / p >
< h2 >
< a name = "usage" class = "anchor" href = "#usage" > < span class = "octicon octicon-link" > < / span > < / a > Usage< / h2 >
< p > Start off with< / p >
< div class = "highlight highlight-lua" > < pre > < span class = "nb" > require< / span > < span class = "p" > (< / span > < span class = "s2" > "< / span > < span class = "s" > luarocks.loader"< / span > < span class = "p" > )< / span >
< span class = "kd" > local< / span > < span class = "n" > htmlparser< / span > < span class = "o" > =< / span > < span class = "nb" > require< / span > < span class = "p" > (< / span > < span class = "s2" > "< / span > < span class = "s" > htmlparser"< / span > < span class = "p" > )< / span >
< / pre > < / div >
< p > Then, parse some html:< / p >
< div class = "highlight highlight-lua" > < pre > < span class = "kd" > local< / span > < span class = "n" > root< / span > < span class = "o" > =< / span > < span class = "n" > htmlparser< / span > < span class = "p" > .< / span > < span class = "n" > parse< / span > < span class = "p" > (< / span > < span class = "n" > htmlstring< / span > < span class = "p" > )< / span >
< / pre > < / div >
< p > The input to < code > parse< / code > may be the contents of a complete HTML document, or any valid HTML snippet, as long as all tags are correctly opened and closed.
Now, find specific contained elements by selecting:< / p >
< div class = "highlight highlight-lua" > < pre > < span class = "kd" > local< / span > < span class = "n" > elements< / span > < span class = "o" > =< / span > < span class = "n" > root< / span > < span class = "p" > :< / span > < span class = "nb" > select< / span > < span class = "p" > (< / span > < span class = "n" > selectorstring< / span > < span class = "p" > )< / span >
< / pre > < / div >
< p > Or in shorthand:< / p >
< div class = "highlight highlight-lua" > < pre > < span class = "kd" > local< / span > < span class = "n" > elements< / span > < span class = "o" > =< / span > < span class = "n" > root< / span > < span class = "p" > (< / span > < span class = "n" > selectorstring< / span > < span class = "p" > )< / span >
< / pre > < / div >
< p > This will return a list of elements, all of which are of the same type as the root element, and thus support selecting as well, if ever needed:< / p >
< div class = "highlight highlight-lua" > < pre > < span class = "k" > for< / span > < span class = "n" > _< / span > < span class = "p" > ,< / span > < span class = "n" > e< / span > < span class = "k" > in< / span > < span class = "nb" > ipairs< / span > < span class = "p" > (< / span > < span class = "n" > elements< / span > < span class = "p" > )< / span > < span class = "k" > do< / span >
< span class = "nb" > print< / span > < span class = "p" > (< / span > < span class = "n" > e< / span > < span class = "p" > .< / span > < span class = "n" > name< / span > < span class = "p" > )< / span >
< span class = "kd" > local< / span > < span class = "n" > subs< / span > < span class = "o" > =< / span > < span class = "n" > e< / span > < span class = "p" > (< / span > < span class = "n" > subselectorstring< / span > < span class = "p" > )< / span >
< span class = "k" > for< / span > < span class = "n" > _< / span > < span class = "p" > ,< / span > < span class = "n" > sub< / span > < span class = "k" > in< / span > < span class = "nb" > ipairs< / span > < span class = "p" > (< / span > < span class = "n" > subs< / span > < span class = "p" > )< / span > < span class = "k" > do< / span >
< span class = "nb" > print< / span > < span class = "p" > (< / span > < span class = "s2" > "< / span > < span class = "s" > "< / span > < span class = "p" > ,< / span > < span class = "n" > sub< / span > < span class = "p" > .< / span > < span class = "n" > name< / span > < span class = "p" > )< / span >
< span class = "k" > end< / span >
< span class = "k" > end< / span >
< / pre > < / div >
< p > The root element is a container for the top-level elements in the parsed text; for example, the < code > &lt;html&gt;< / code > element in a parsed HTML document would be a child of the returned root element.< / p >
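< p > As a minimal sketch of how the steps above fit together, the snippet below parses the small fragment used in the Limitations section and walks from the root container to its children. It assumes the < code > htmlparser< / code > rock is installed, and it uses only accessors documented on this page; the variable names are illustrative.< / p >

```lua
require("luarocks.loader")
local htmlparser = require("htmlparser")

-- Fragment taken from the Limitations section of this page.
local root = htmlparser.parse("<p>line1<br />line2</p>")

-- The root is only a container, so the parsed <p> is its first child node.
local p = root.nodes[1]
print(p.name)
print(p:getcontent())   -- documented on this page as "line1<br />line2"

-- Child elements are reachable through .nodes; text is not a separate node.
for _, child in ipairs(p.nodes) do
  print(child.name, child.level)
end
```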
< h2 >
< a name = "selectors" class = "anchor" href = "#selectors" > < span class = "octicon octicon-link" > < / span > < / a > Selectors< / h2 >
< p > Supported selectors are a subset of < a href = "http://api.jquery.com/category/selectors/" > jQuery's selectors< / a > :< / p >
< ul >
< li >
< code > "*"< / code > all contained elements< / li >
< li >
< code > "element"< / code > elements with the given tagname< / li >
< li >
< code > "#id"< / code > elements with the given id attribute value< / li >
< li >
< code > ".class"< / code > elements with the given classname in the class attribute< / li >
< li >
< code > "[attribute]"< / code > elements with an attribute of the given name< / li >
< li >
< code > "[attribute='value']"< / code > equals: elements with the given value for the given attribute< / li >
< li >
< code > "[attribute!='value']"< / code > not equals: elements without the given attribute, or having the attribute, but with a different value< / li >
< li >
< code > "[attribute|='value']"< / code > prefix: attribute's value is given value, or starts with given value, followed by a hyphen (< code > -< / code > )< / li >
< li >
< code > "[attribute*='value']"< / code > contains: attribute's value contains given value< / li >
< li >
< code > "[attribute~='value']"< / code > word: attribute's value is a space-separated list of tokens, one of which is the given value< / li >
< li >
< code > "[attribute^='value']"< / code > starts with: attribute's value starts with given value< / li >
< li >
< code > "[attribute$='value']"< / code > ends with: attribute's value ends with given value< / li >
< li >
< code > ":not(selectorstring)"< / code > elements not selected by given selector string< / li >
< li >
< code > "ancestor descendant"< / code > elements selected by the < code > descendant< / code > selector string, that are a descendant of any element selected by the < code > ancestor< / code > selector string< / li >
< li >
< code > "parent > child"< / code > elements selected by the < code > child< / code > selector string, that are a child element of any element selected by the < code > parent< / code > selector string< / li >
< / ul > < p > Selectors can be combined; e.g. < code > ".class:not([attribute]) element.class"< / code > < / p >
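< p > To make the attribute operators in the list above concrete, here is a plain-Lua reading of their descriptions. This is only a sketch of the semantics as worded on this page, not the library's implementation; < code > matches< / code > is a hypothetical helper and is not part of the htmlparser API.< / p >

```lua
-- Hypothetical helper: value is the attribute's value (nil if the attribute
-- is absent), op is the operator text, expected is the value in the selector.
local function matches(value, op, expected)
  if op == "" then return value ~= nil end          -- [attribute]
  if op == "!=" then return value ~= expected end   -- also true when absent
  if value == nil then return false end
  if op == "=" then return value == expected end
  if op == "|=" then                                -- exact, or prefix plus "-"
    return value == expected
      or value:sub(1, #expected + 1) == expected .. "-"
  end
  if op == "*=" then                                -- plain substring search
    return value:find(expected, 1, true) ~= nil
  end
  if op == "~=" then                                -- one space-separated token
    for word in value:gmatch("%S+") do
      if word == expected then return true end
    end
    return false
  end
  if op == "^=" then return value:sub(1, #expected) == expected end
  if op == "$=" then
    return expected == "" or value:sub(-#expected) == expected
  end
  error("unknown operator: " .. tostring(op))
end

print(matches("en-US", "|=", "en"))            --> true
print(matches("english", "|=", "en"))          --> false
print(matches("btn primary", "~=", "primary")) --> true
print(matches(nil, "!=", "x"))                 --> true
```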
< h2 >
< a name = "element-type" class = "anchor" href = "#element-type" > < span class = "octicon octicon-link" > < / span > < / a > Element type< / h2 >
< p > All tree elements provide, apart from < code > :select< / code > and < code > ()< / code > , the following accessors:< / p >
< h3 >
< a name = "basic" class = "anchor" href = "#basic" > < span class = "octicon octicon-link" > < / span > < / a > Basic< / h3 >
< ul >
< li >
< code > .name< / code > the element's tagname< / li >
< li >
< code > .attributes< / code > a table with keys and values for the element's attributes; < code > {}< / code > if none< / li >
< li >
< code > .id< / code > the value of the element's id attribute; < code > nil< / code > if not present< / li >
< li >
< code > .classes< / code > an array with the classes listed in the element's class attribute; < code > {}< / code > if none< / li >
< li >
< code > :getcontent()< / code > the raw text between the opening and closing tags of the element; < code > ""< / code > if none< / li >
< li >
< code > .nodes< / code > an array with the element's child elements; < code > {}< / code > if none< / li >
< li >
< code > .parent< / code > the element that contains this element; < code > root.parent< / code > is < code > nil< / code >
< / li >
< / ul > < h3 >
< a name = "other" class = "anchor" href = "#other" > < span class = "octicon octicon-link" > < / span > < / a > Other< / h3 >
< ul >
< li >
< code > .index< / code > the element's sequence number in order of appearance; root index is < code > 0< / code >
< / li >
< li >
< code > :gettext()< / code > the complete element text, starting with < code > "&lt;tagname"< / code > and ending with < code > "/&gt;"< / code > or < code > "&lt;/tagname&gt;"< / code >
< / li >
< li >
< code > .level< / code > how deep the element is in the tree; root level is < code > 0< / code >
< / li >
< li >
< code > .root< / code > the root element of the tree; < code > root.root< / code > is < code > root< / code >
< / li >
< li >
< code > .deepernodes< / code > a < a href = "http://wscherphof.github.com/lua-set/" > Set< / a > containing all elements in the tree beneath this element, including this element's < code > .nodes< / code > ; < code > {}< / code > if none< / li >
< li >
< code > .deeperelements< / code > a table with a key for each distinct tagname in < code > .deepernodes< / code > , containing a < a href = "http://wscherphof.github.com/lua-set/" > Set< / a > of all deeper element nodes with that name; < code > {}< / code > if none< / li >
< li >
< code > .deeperattributes< / code > as < code > .deeperelements< / code > , but keyed on attribute name< / li >
< li >
< code > .deeperids< / code > as < code > .deeperelements< / code > , but keyed on id value< / li >
< li >
< code > .deeperclasses< / code > as < code > .deeperelements< / code > , but keyed on class name< / li >
< / ul > < h2 >
< a name = "limitations" class = "anchor" href = "#limitations" > < span class = "octicon octicon-link" > < / span > < / a > Limitations< / h2 >
< ul >
< li > Attribute values in selector strings cannot contain any spaces< / li >
< li > The spaces before and after the < code > &gt;< / code > in a < code > parent &gt; child< / code > relation are mandatory< / li >
< li >
< code > &lt;!< / code > elements (including doctype, comments, and CDATA) are not parsed; markup within CDATA is < em > not< / em > escaped< / li >
< li > Text nodes are not separate tree elements; in < code > local root = htmlparser.parse("&lt;p&gt;line1&lt;br /&gt;line2&lt;/p&gt;")< / code > , < code > root.nodes[1]:getcontent()< / code > is < code > "line1&lt;br /&gt;line2"< / code > , while < code > root.nodes[1].nodes[1].name< / code > is < code > "br"< / code >
< / li >
< li > Start or end tags that are < a href = "http://www.w3.org/TR/html5/syntax.html#optional-tags" > omitted< / a > are not implied: every element except the < a href = "http://www.w3.org/TR/html5/syntax.html#void-elements" > void elements< / a > needs an explicit end tag< / li >
< li > No validation is done for tag or attribute names or nesting of element types. The list of void elements is in fact the only part specific to HTML< / li >
< / ul > < h2 >
< a name = "examples" class = "anchor" href = "#examples" > < span class = "octicon octicon-link" > < / span > < / a > Examples< / h2 >
< p > See < code > ./doc/sample.lua< / code > < / p >
< h2 >
< a name = "tests" class = "anchor" href = "#tests" > < span class = "octicon octicon-link" > < / span > < / a > Tests< / h2 >
< p > See < code > ./tst/init.lua< / code > < / p >
< h2 >
< a name = "license" class = "anchor" href = "#license" > < span class = "octicon octicon-link" > < / span > < / a > License< / h2 >
< p > LGPL+; see < code > ./doc/LICENSE< / code > < / p >
< / section >
< / div >
<!-- FOOTER -->
< div id = "footer_wrap" class = "outer" >
< footer class = "inner" >
< p class = "copyright" > LuaRock "htmlparser" maintained by < a href = "https://github.com/wscherphof" > wscherphof< / a > < / p >
< p > Published with < a href = "http://pages.github.com" > GitHub Pages< / a > < / p >
< / footer >
< / div >
< / body >
< / html >