[![Build Status](https://travis-ci.org/msva/lua-htmlparser.png?branch=master)](https://travis-ci.org/msva/lua-htmlparser)
[![Coverage Status](https://coveralls.io/repos/msva/lua-htmlparser/badge.png?branch=master)](https://coveralls.io/r/msva/lua-htmlparser?branch=master)
[![License](http://img.shields.io/badge/License-LGPL+-brightgreen.svg)](doc/LICENSE)

# LuaRock "htmlparser"
Parse HTML text into a tree of elements that can be queried with selectors.

[1]: https://api.jquery.com/category/selectors/
## Install
Htmlparser is a listed [LuaRock](http://luarocks.org/repositories/rocks/). Install it using [LuaRocks](http://www.luarocks.org/): `luarocks install htmlparser`
### Dependencies
Htmlparser depends on [Lua 5.1-5.4](https://www.lua.org/download.html) or [LuaJIT](https://luajit.org/download.html), which provides a 5.1-compatible API/ABI.

To run the tests, [lunitx](https://github.com/dcurrie/lunit) is also needed; it is available as a LuaRock as well.

## Usage
Start off with:
```lua
local htmlparser = require("htmlparser")
```
Then, parse some html:
```lua
local root = htmlparser.parse(htmlstring)
```
Optionally, you can pass a loop-limit value (an integer). It sets the depth of the tree after which the parser gives up; the default is 1000. The global variable `htmlparser_looplimit` is also supported, but the optional argument takes priority over the global value.

The input to parse may be the contents of a complete html document, or any valid html snippet, as long as all tags are correctly opened and closed.
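
As a rough sketch of the two ways to set the loop limit: the example below assumes the limit is accepted as the second positional argument of `parse`, which the text above does not state explicitly. The html snippet and the value `5000` are arbitrary illustrations.

```lua
local htmlparser = require("htmlparser")

-- A small, well-formed snippet: every tag is opened and closed.
local htmlstring = "<ul><li>one</li><li>two</li></ul>"

-- Assumption: the loop limit is the second positional argument.
local root = htmlparser.parse(htmlstring, 5000)

-- Alternatively, set the global before parsing; an explicit
-- argument would take priority over it.
htmlparser_looplimit = 5000
local same = htmlparser.parse(htmlstring)

print(#root("li"), #same("li"))
```

Passing the argument keeps the setting local to one call, while the global affects every later call that does not pass its own value.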

Now, find specific contained elements by selecting:
```lua
local elements = root:select(selectorstring)
```
Or in shorthand:
```lua
local elements = root(selectorstring)
```
This will return a list of elements, all of which are of the same type as the root element, and thus support selecting as well, if ever needed:
```lua
-- Print each selected element's tag name, followed by its matching descendants
for _, e in ipairs(elements) do
  print(e.name)
  -- Every element supports the same selector calls as the root
  local subs = e(subselectorstring)
  for _, sub in ipairs(subs) do
    print("", sub.name)
  end
end
```
The root element is a container for the top-level elements in the parsed text; e.g. the `<html>` element in a parsed html document is a child of the returned root element.
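
Putting the steps above together, a minimal end-to-end sketch might look like this. The html string and the variable names are made up for illustration; only `parse`, `:select`, the call shorthand, `.name`, `.nodes` and `:getcontent()` come from this README.

```lua
local htmlparser = require("htmlparser")

-- Illustrative input; any well-formed document or snippet would do.
local html = '<html><body><div id="main" class="box big">'
  .. '<p class="note">one</p><p>two</p></div></body></html>'

local root = htmlparser.parse(html)

-- The root is only a container: <html> is its first child.
print(root.nodes[1].name)

-- Select by tag name; root("p") is the equivalent shorthand.
local paragraphs = root:select("p")
for _, p in ipairs(paragraphs) do
  print(p.name, p:getcontent())
end

-- Selected elements can be queried again with their own selector.
local main = root("#main")[1]
local notes = main("p.note")
print(#notes)
```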
## Selectors
Supported selectors are a subset of [jQuery's selectors][1]:
- `"*"` all contained elements
- `"element"` elements with the given tagname
- `"#id"` elements with the given id attribute value
- `".class"` elements with the given classname in the class attribute
- `"[attribute]"` elements with an attribute of the given name
- `"[attribute='value']"` equals: elements with the given value for the given attribute
- `"[attribute!='value']"` not equals: elements without the given attribute, or having the attribute, but with a different value
- `"[attribute|='value']"` prefix: attribute's value is the given value, or starts with the given value followed by a hyphen (`-`)
- `"[attribute*='value']"` contains: attribute's value contains the given value
- `"[attribute~='value']"` word: attribute's value is a space-separated list of tokens, one of which is the given value
- `"[attribute^='value']"` starts with: attribute's value starts with the given value
- `"[attribute$='value']"` ends with: attribute's value ends with the given value
- `":not(selectorstring)"` elements not selected by given selector string
- `"ancestor descendant"` elements selected by the `descendant` selector string, that are a descendant of any element selected by the `ancestor` selector string
- `"parent > child"` elements selected by the `child` selector string, that are a child element of any element selected by the `parent` selector string

Selectors can be combined; e.g. `".class:not([attribute]) element.class"`
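
To make the attribute operators listed above concrete, here is a small plain-Lua sketch of their semantics as described in the list. The `matches` helper is hypothetical and is not part of htmlparser; it only illustrates how each operator compares an attribute's value with the given value, and it does not show how the library implements selection.

```lua
-- Hypothetical helper, not part of htmlparser.
-- value: the attribute's value, or nil if the attribute is absent
-- op: "" for a bare [attribute] test, otherwise one of the operators
-- expected: the value given in the selector
local function matches(value, op, expected)
  if op == "" then
    return value ~= nil                   -- [attribute]
  elseif op == "!=" then
    return value ~= expected              -- also true when absent
  elseif value == nil then
    return false                          -- all others need the attribute
  elseif op == "=" then
    return value == expected
  elseif op == "|=" then                  -- exact, or prefix plus hyphen
    return value == expected
      or value:sub(1, #expected + 1) == expected .. "-"
  elseif op == "*=" then                  -- plain substring search
    return value:find(expected, 1, true) ~= nil
  elseif op == "~=" then                  -- one space-separated token
    for word in value:gmatch("%S+") do
      if word == expected then return true end
    end
    return false
  elseif op == "^=" then
    return value:sub(1, #expected) == expected
  elseif op == "$=" then
    return value:sub(-#expected) == expected
  end
  error("unsupported operator: " .. tostring(op))
end

print(matches("en-US", "|=", "en"))      -- prefix followed by a hyphen
print(matches("box big", "~=", "big"))   -- whole-word match
print(matches(nil, "!=", "x"))           -- absent attribute
```

Empty given values are not treated specially in this sketch.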
## Element type
All tree elements provide, apart from `:select` and `()`, the following accessors:

### Basic
- `.name` the element's tagname
- `.attributes` a table with keys and values for the element's attributes; `{}` if none
- `.id` the value of the element's id attribute; `nil` if not present
- `.classes` an array with the classes listed in element's class attribute; `{}` if none
- `:getcontent()` the raw text between the opening and closing tags of the element; `""` if none
- `.nodes` an array with the element's child elements; `{}` if none
- `.parent` the element that contains this element; `root.parent` is `nil`
### Other
- `.index` the element's sequence number in order of appearance; root index is `0`
- `:gettext()` the complete element text, starting with `"<tagname"` and ending with `"/>"` or `"</tagname>"`
- `.level` how deep the element is in the tree; root level is `0`
- `.root` the root element of the tree; `root.root` is `root`
- `.deepernodes` a Set containing all elements in the tree beneath this element, including this element's `.nodes`; `{}` if none
- `.deeperelements` a table with a key for each distinct tagname in `.deepernodes`, containing a Set of all deeper element nodes with that name; `{}` if none
- `.deeperattributes` as `.deeperelements`, but keyed on attribute name
- `.deeperids` as `.deeperelements`, but keyed on id value
- `.deeperclasses` as `.deeperelements`, but keyed on class name
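
A short sketch of the basic accessors on a single element, using a made-up snippet. It only touches the fields listed above and leaves out the `.deeper*` tables, since iterating them depends on how the Set is represented.

```lua
local htmlparser = require("htmlparser")

-- Illustrative input with one attribute-bearing element and one child.
local root = htmlparser.parse('<div id="main" class="box big"><p>one</p></div>')
local div = root.nodes[1]

print(div.name)                          -- tag name
print(div.id)                            -- value of the id attribute
print(div.attributes.class)              -- raw class attribute value
print(table.concat(div.classes, ", "))   -- classes as an array
print(div:getcontent())                  -- text between the tags
print(div:gettext())                     -- complete element text
print(div.level)                         -- depth; the root itself is 0
print(div.parent == root, div.root == root)
print(div.nodes[1].name)                 -- first child element
```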
## Limitations
- Attribute values in selector strings cannot contain any spaces
- The spaces before and after the `>` in a `parent > child` relation are mandatory
- `<!` elements (including doctype, comments, and CDATA) are not parsed; markup within CDATA is *not* escaped
- Text nodes are not separate tree elements; in `local root = htmlparser.parse("<p>line1<br />line2</p>")`, `root.nodes[1]:getcontent()` is `"line1<br />line2"`, while `root.nodes[1].nodes[1].name` is `"br"`
- No start or end tags are implied when [omitted](http://www.w3.org/TR/html5/syntax.html#optional-tags). Only the [void elements](http://www.w3.org/TR/html5/syntax.html#void-elements) should not have an end tag
- No validation is done for tag or attribute names or nesting of element types. The list of void elements is in fact the only part specific to HTML

## Examples
See `./doc/sample.lua`
## Tests
See `./tst/init.lua`
## License
LGPL+; see `./doc/LICENSE`