lua-htmlparser/index.html
2013-12-06 05:40:28 -08:00

<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8' />
<meta http-equiv="X-UA-Compatible" content="chrome=1" />
<meta name="description" content="LuaRock &quot;htmlparser&quot; : Parse HTML text into a tree of elements with selectors" />
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>LuaRock &quot;htmlparser&quot;</title>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/wscherphof/lua-htmlparser">View on GitHub</a>
<h1 id="project_title">LuaRock &quot;htmlparser&quot;</h1>
<h2 id="project_tagline">Parse HTML text into a tree of elements with selectors</h2>
<section id="downloads">
<a class="zip_download_link" href="https://github.com/wscherphof/lua-htmlparser/zipball/master">Download this project as a .zip file</a>
<a class="tar_download_link" href="https://github.com/wscherphof/lua-htmlparser/tarball/master">Download this project as a tar.gz file</a>
</section>
</header>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h2>
<a name="install" class="anchor" href="#install"><span class="octicon octicon-link"></span></a>Install</h2>
<p>Htmlparser is a listed <a href="http://luarocks.org/repositories/rocks/">LuaRock</a>. Install using <a href="http://www.luarocks.org/">LuaRocks</a>: <code>luarocks install htmlparser</code></p>
<h3>
<a name="dependencies" class="anchor" href="#dependencies"><span class="octicon octicon-link"></span></a>Dependencies</h3>
<p>Htmlparser depends on <a href="http://www.lua.org/download.html">Lua 5.2</a> and on the "set" LuaRock, which is installed automatically along with it. To make it possible to run the tests, <a href="https://github.com/dcurrie/lunit">lunitx</a> comes along as a LuaRock as well.</p>
<h2>
<a name="usage" class="anchor" href="#usage"><span class="octicon octicon-link"></span></a>Usage</h2>
<p>Start off with</p>
<div class="highlight highlight-lua"><pre><span class="nb">require</span><span class="p">(</span><span class="s2">"</span><span class="s">luarocks.loader"</span><span class="p">)</span>
<span class="kd">local</span> <span class="n">htmlparser</span> <span class="o">=</span> <span class="nb">require</span><span class="p">(</span><span class="s2">"</span><span class="s">htmlparser"</span><span class="p">)</span>
</pre></div>
<p>Then, parse some html:</p>
<div class="highlight highlight-lua"><pre><span class="kd">local</span> <span class="n">root</span> <span class="o">=</span> <span class="n">htmlparser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="n">htmlstring</span><span class="p">)</span>
</pre></div>
<p>The input to <code>parse</code> may be the contents of a complete HTML document or any valid HTML snippet, as long as all tags are correctly opened and closed.
Now, find specific contained elements by selecting:</p>
<div class="highlight highlight-lua"><pre><span class="kd">local</span> <span class="n">elements</span> <span class="o">=</span> <span class="n">root</span><span class="p">:</span><span class="nb">select</span><span class="p">(</span><span class="n">selectorstring</span><span class="p">)</span>
</pre></div>
<p>Or in shorthand:</p>
<div class="highlight highlight-lua"><pre><span class="kd">local</span> <span class="n">elements</span> <span class="o">=</span> <span class="n">root</span><span class="p">(</span><span class="n">selectorstring</span><span class="p">)</span>
</pre></div>
<p>This will return a Set (from the "set" LuaRock) of elements, all of which are of the same type as the root element and thus support selecting as well, if ever needed:</p>
<div class="highlight highlight-lua"><pre><span class="k">for</span> <span class="n">e</span> <span class="k">in</span> <span class="nb">pairs</span><span class="p">(</span><span class="n">elements</span><span class="p">)</span> <span class="k">do</span>
  <span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
  <span class="kd">local</span> <span class="n">subs</span> <span class="o">=</span> <span class="n">e</span><span class="p">(</span><span class="n">subselectorstring</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">sub</span> <span class="k">in</span> <span class="nb">pairs</span><span class="p">(</span><span class="n">subs</span><span class="p">)</span> <span class="k">do</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="s">"</span><span class="p">,</span> <span class="n">sub</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>The root element is a container for the top level elements in the parsed text, i.e. the <code>&lt;html&gt;</code> element in a parsed html document would be a child of the returned root element.</p>
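<p>The steps above can be put together in one short script. This is a minimal sketch: the sample markup and the <code>"li.item"</code> selector are made up for illustration, and only the calls described in this section are used. Since the result is a Set, the order in which <code>pairs</code> visits the elements is not guaranteed.</p>
<div class="highlight highlight-lua"><pre>require("luarocks.loader")
local htmlparser = require("htmlparser")

-- Hypothetical sample input: a well-formed snippet in which every
-- non-void tag is explicitly closed.
local htmlstring = [[
&lt;div id="menu"&gt;
  &lt;ul class="nav"&gt;
    &lt;li class="item first"&gt;Home&lt;/li&gt;
    &lt;li class="item"&gt;About&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;
]]

local root = htmlparser.parse(htmlstring)

-- Shorthand for root:select("li.item"): li elements whose class
-- attribute lists "item".
local items = root("li.item")

for li in pairs(items) do
  -- Print the tag name and the raw text between the tags.
  print(li.name, li:getcontent())
end
</pre></div>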
<h2>
<a name="selectors" class="anchor" href="#selectors"><span class="octicon octicon-link"></span></a>Selectors</h2>
<p>Supported selectors are a subset of jQuery's selectors:</p>
<ul>
<li>
<code>"*"</code> all contained elements</li>
<li>
<code>"element"</code> elements with the given tagname</li>
<li>
<code>"#id"</code> elements with the given id attribute value</li>
<li>
<code>".class"</code> elements with the given classname in the class attribute</li>
<li>
<code>"[attribute]"</code> elements with an attribute of the given name</li>
<li>
<code>"[attribute='value']"</code> equals: elements with the given value for the given attribute</li>
<li>
<code>"[attribute!='value']"</code> not equals: elements without the given attribute, or having the attribute, but with a different value</li>
<li>
<code>"[attribute|='value']"</code> prefix: attribute's value is given value, or starts with given value, followed by a hyphen (<code>-</code>)</li>
<li>
<code>"[attribute*='value']"</code> contains: attribute's value contains given value</li>
<li>
<code>"[attribute~='value']"</code> word: attribute's value is a space-separated token, where one of the tokens is the given value</li>
<li>
<code>"[attribute^='value']"</code> starts with: attribute's value starts with given value</li>
<li>
<code>"[attribute$='value']"</code> ends with: attribute's value ends with given value</li>
<li>
<code>":not(selectorstring)"</code> elements not selected by given selector string</li>
<li>
<code>"ancestor descendant"</code> elements selected by the <code>descendant</code> selector string, that are a descendant of any element selected by the <code>ancestor</code> selector string</li>
<li>
<code>"parent &gt; child"</code> elements selected by the <code>child</code> selector string, that are a child element of any element selected by the <code>parent</code> selector string</li>
</ul><p>Selectors can be combined; e.g. <code>".class:not([attribute]) element.class"</code></p>
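<p>To make the attribute operators listed above concrete, the sketch below restates them as plain string tests. It is an illustration of the described semantics only: <code>matches</code> is a hypothetical helper that is not part of htmlparser, and it does not claim to mirror how the library implements selection.</p>
<div class="highlight highlight-lua"><pre>-- Hypothetical helper, not part of htmlparser.
-- op:     one of "=", "!=", "|=", "*=", "~=", "^=", "$="
-- actual: the element's attribute value, or nil if the attribute is absent
-- value:  the value given in the selector string
local function matches(op, actual, value)
  if op == "!=" then
    -- Not equals also selects elements that lack the attribute.
    return actual ~= value
  end
  if actual == nil then
    return false
  end
  if op == "=" then
    return actual == value
  elseif op == "|=" then
    -- Exactly the value, or the value followed by a hyphen.
    return actual == value or actual:sub(1, #value + 1) == value .. "-"
  elseif op == "*=" then
    -- Plain substring search (fourth argument disables patterns).
    return actual:find(value, 1, true) ~= nil
  elseif op == "~=" then
    -- One of the space-separated tokens equals the value.
    for token in actual:gmatch("%S+") do
      if token == value then
        return true
      end
    end
    return false
  elseif op == "^=" then
    return actual:sub(1, #value) == value
  elseif op == "$=" then
    return actual:sub(#actual - #value + 1) == value
  end
  error("unknown operator: " .. tostring(op))
end

print(matches("|=", "en-US", "en"))          -- true
print(matches("~=", "item first", "first"))  -- true
print(matches("*=", "item first", "em fi"))  -- true
print(matches("!=", nil, "x"))               -- true
print(matches("$=", "photo.png", ".jpg"))    -- false
</pre></div>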
<h2>
<a name="element-type" class="anchor" href="#element-type"><span class="octicon octicon-link"></span></a>Element type</h2>
<p>All tree elements provide, apart from <code>:select</code> and <code>()</code>, the following accessors:</p>
<h3>
<a name="basic" class="anchor" href="#basic"><span class="octicon octicon-link"></span></a>Basic</h3>
<ul>
<li>
<code>.name</code> the element's tagname</li>
<li>
<code>.attributes</code> a table with keys and values for the element's attributes; <code>{}</code> if none</li>
<li>
<code>.id</code> the value of the element's id attribute; <code>nil</code> if not present</li>
<li>
<code>.classes</code> an array with the classes listed in element's class attribute; <code>{}</code> if none</li>
<li>
<code>:getcontent()</code> the raw text between the opening and closing tags of the element; <code>""</code> if none</li>
<li>
<code>.nodes</code> an array with the element's child elements, <code>{}</code> if none</li>
<li>
<code>.parent</code> the element that contains this element; <code>root.parent</code> is <code>nil</code>
</li>
</ul><h3>
<a name="other" class="anchor" href="#other"><span class="octicon octicon-link"></span></a>Other</h3>
<ul>
<li>
<code>:gettext()</code> the complete element text, starting with <code>"&lt;tagname"</code> and ending with <code>"/&gt;"</code> or <code>"&lt;/tagname&gt;"</code>
</li>
<li>
<code>.level</code> how deep the element is in the tree; root level is <code>0</code>
</li>
<li>
<code>.root</code> the root element of the tree; <code>root.root</code> is <code>root</code>
</li>
<li>
<code>.deepernodes</code> a Set containing all elements in the tree beneath this element, including this element's <code>.nodes</code>; <code>{}</code> if none</li>
<li>
<code>.deeperelements</code> a table with a key for each distinct tagname in <code>.deepernodes</code>, containing a Set of all deeper element nodes with that name; <code>{}</code> if none</li>
<li>
<code>.deeperattributes</code> as <code>.deeperelements</code>, but keyed on attribute name</li>
<li>
<code>.deeperids</code> as <code>.deeperelements</code>, but keyed on id value</li>
<li>
<code>.deeperclasses</code> as <code>.deeperelements</code>, but keyed on class name</li>
</ul><h2>
<a name="limitations" class="anchor" href="#limitations"><span class="octicon octicon-link"></span></a>Limitations</h2>
<ul>
<li>Attribute values in selector strings cannot contain any spaces, nor any of <code>#</code>, <code>.</code>, <code>[</code>, <code>]</code>, <code>:</code>, <code>(</code>, or <code>)</code>
</li>
<li>The spaces before and after the <code>&gt;</code> in a <code>parent &gt; child</code> relation are mandatory </li>
<li>
<code>&lt;!</code> elements (including doctype, comments, and CDATA) are not parsed; markup within CDATA is <em>not</em> escaped</li>
<li>Textnodes are no separate tree elements; in <code>local root = htmlparser.parse("&lt;p&gt;line1&lt;br /&gt;line2&lt;/p&gt;")</code>, <code>root.nodes[1]:getcontent()</code> is <code>"line1&lt;br /&gt;line2"</code>, while <code>root.nodes[1].nodes[1].name</code> is <code>"br"</code>
</li>
<li>No start or end tags are implied when <a href="http://www.w3.org/TR/html5/syntax.html#optional-tags">omitted</a>. Only the <a href="http://www.w3.org/TR/html5/syntax.html#void-elements">void elements</a> should not have an end tag</li>
<li>No validation is done for tag or attribute names or nesting of element types. The list of void elements is in fact the only part specific to HTML</li>
</ul><h2>
<a name="examples" class="anchor" href="#examples"><span class="octicon octicon-link"></span></a>Examples</h2>
<p>See <code>./doc/sample.lua</code></p>
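<p>As a smaller illustration, the sketch below exercises a few of the accessors from the Element type section on the snippet used under Limitations. The <code>id</code> and <code>class</code> attributes are added here for illustration and are not part of the original snippet; the commented results are the ones stated under Limitations.</p>
<div class="highlight highlight-lua"><pre>require("luarocks.loader")
local htmlparser = require("htmlparser")

-- The Limitations snippet, with illustrative id and class attributes added.
local root = htmlparser.parse('&lt;p id="intro" class="lead big"&gt;line1&lt;br /&gt;line2&lt;/p&gt;')

-- The root is a container; the p element is its first child.
local p = root.nodes[1]

print(p.name)                 -- the tagname
print(p.id)                   -- value of the id attribute
print(p.attributes["class"])  -- attribute value as written in the markup

-- .classes is an array with one entry per listed class.
for _, class in ipairs(p.classes) do
  print(class)
end

-- Text nodes are not separate tree elements, so the content is raw text.
print(p:getcontent())         -- line1&lt;br /&gt;line2
print(p.nodes[1].name)        -- br

-- The complete element text, including the opening and closing tags.
print(p:gettext())
</pre></div>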
<h2>
<a name="tests" class="anchor" href="#tests"><span class="octicon octicon-link"></span></a>Tests</h2>
<p>See <code>./tst/init.lua</code></p>
<h2>
<a name="license" class="anchor" href="#license"><span class="octicon octicon-link"></span></a>License</h2>
<p>LGPL+; see <code>./doc/LICENSE</code></p>
</section>
</div>
<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
<p class="copyright">LuaRock &quot;htmlparser&quot; maintained by <a href="https://github.com/wscherphof">wscherphof</a></p>
<p>Published with <a href="http://pages.github.com">GitHub Pages</a></p>
</footer>
</div>
</body>
</html>