lua-htmlparser/README.md
2013-03-28 12:40:47 +01:00


#LuaRock "htmlparser"
Parse HTML text into a tree of elements with selectors
###License
MIT; see ./doc/LICENSE
###Usage
Start off with
```lua
require("luarocks.loader")
local htmlparser = require("htmlparser")
```
Then, parse some HTML:
```lua
local root = htmlparser.parse(htmlstring)
```
The input to `parse` may be the contents of a complete HTML document, or any valid HTML snippet, as long as all tags are correctly opened and closed.
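Since `parse` takes a string, a document stored on disk has to be read into memory first. A minimal sketch using only Lua's standard `io` library; the helper name `readfile` and the file name in the note below are illustrative and not part of htmlparser:
```lua
-- Hypothetical helper: read a whole file into a string.
-- Returns nil plus an error message if the file cannot be opened.
local function readfile(path)
  local file, err = io.open(path, "rb")
  if not file then
    return nil, err
  end
  local contents = file:read("*a") -- "*a" reads everything up to end of file
  file:close()
  return contents
end
```
The result can then be handed straight to the parser, e.g. `htmlparser.parse(assert(readfile("page.html")))`.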
Now, find specific elements by selecting:
```lua
local elements = root:select(selectorstring)
```
Or in shorthand:
```lua
local elements = root(selectorstring)
```
This will return a Set of elements, all of which are of the same type as the root element and thus support selecting as well, if ever needed:
```lua
for e in pairs(elements) do
  print(e.name)
  local subs = e(subselectorstring)
  for sub in pairs(subs) do
    print("", sub.name)
  end
end
```
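Because the result is a Set rather than an array, `pairs` visits the elements in no guaranteed order. When stable output matters, the names can be collected and sorted first. A minimal sketch, assuming the Set keeps its elements as keys (as the `pairs` loops above suggest); `sortednames` is a hypothetical helper, and the hand-built table stands in for a real result so the snippet runs without the rock:
```lua
-- Hypothetical helper: collect the tag names of all elements in a Set
-- and return them as a sorted array.
local function sortednames(elements)
  local names = {}
  for e in pairs(elements) do
    names[#names + 1] = e.name
  end
  table.sort(names)
  return names
end

-- Stand-in for the result of root(selectorstring): elements are the keys.
local elements = {
  [{ name = "p" }] = true,
  [{ name = "a" }] = true,
  [{ name = "div" }] = true,
}

print(table.concat(sortednames(elements), ", ")) -- prints "a, div, p"
```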
###Selectors
- `"element"`
- `"#id"`
- `".class"`
- `"[attribute]"`
- `"[attribute=value]"`
- `"[attribute!=value]"`
- `"[attribute|=value]"`
- `"[attribute*=value]"`
- `"[attribute~=value]"`
- `"[attribute^=value]"`
- `"[attribute$=value]"`
- `":not(selector)"`
- `"ancestor descendant"`
- `"parent > child"`
Selectors can be combined; e.g. `".class:not([attribute]) element.class"`
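
To make the list above concrete, the sketch below runs a few of these selectors against a small snippet and prints how many elements each one matched. The HTML and the `count` helper are made up for illustration, and the snippet assumes the rock is installed:
```lua
require("luarocks.loader")
local htmlparser = require("htmlparser")

-- Illustrative input; every tag is explicitly opened and closed.
local root = htmlparser.parse([[
<div id="main" class="box wide">
  <p class="intro">Hello</p>
  <p>World</p>
  <a href="next.html">Next</a>
</div>
]])

-- Hypothetical helper: count the elements in a Set.
local function count(elements)
  local n = 0
  for _ in pairs(elements) do
    n = n + 1
  end
  return n
end

local selectors = {
  "p",             -- by tag name
  "#main",         -- by id
  ".intro",        -- by class
  "[href]",        -- by attribute presence
  "div > p",       -- direct children only
  "p:not(.intro)", -- negation
}

for _, selector in ipairs(selectors) do
  print(selector, count(root(selector)))
end
```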
####Limitations
- Attribute values in selectors currently cannot contain any spaces, since space is interpreted as a delimiter between `ancestor` and `descendant`, `parent` and `>`, or `>` and `child` parts of the selector
- Likewise, for the `parent > child` relation, the spaces before and after the `>` are mandatory
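
One way to work around the first limitation is to select on the attribute's presence only and then compare the value in plain Lua via `.attributes`. A minimal sketch, assuming the Set keeps its elements as keys; `filterbyattribute` is a hypothetical helper, and the hand-built tables stand in for parsed elements so the snippet runs without the rock:
```lua
-- Hypothetical helper: keep only the elements whose attribute `name`
-- has exactly the given value (which may contain spaces).
local function filterbyattribute(elements, name, value)
  local matches = {}
  for e in pairs(elements) do
    if e.attributes[name] == value then
      matches[e] = true
    end
  end
  return matches
end

-- Stand-ins for the elements that root("[title]") would return.
local first = { name = "a", attributes = { title = "hello world" } }
local second = { name = "a", attributes = { title = "hello" } }
local candidates = { [first] = true, [second] = true }

local matches = filterbyattribute(candidates, "title", "hello world")
for e in pairs(matches) do
  print(e.name, e.attributes.title) -- only `first` is printed
end
```
With a parsed tree, the candidates would come from `root("[title]")`. Note that the helper returns a plain table keyed by element, not an htmlparser Set.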
###Element type
All tree elements provide, apart from `:select` and `()`, the following accessors:
- `.name` = the element's tagname
- `.attributes` = a table with keys and values for the element's attributes; `{}` if none
- `.id` = the value of the element's id attribute; `nil` if not present
- `.classes` = an array with the classes listed in the element's class attribute; `{}` if none
- `:getcontent()` = the raw text between the opening and closing tags of the element; `""` if none
- `.nodes` = an array with the element's child elements; `{}` if none
- `.parent` = the element that contains this element; `root.parent` is `nil`
- `:gettext()` = the raw text of the complete element, starting with `"<tagname"` and ending with `"/>"`
- `.level` = how deep the element is in the tree; root level is `0`
- `.root` = the root element of the tree; `root.root` is `root`
- `.deepernodes` = a Set containing all elements in the tree beneath this element, including this element's `.nodes`; `{}` if none
- `.deeperelements` = a table with a key for each distinct tagname in `.deepernodes`, containing a Set of all deeper element nodes with that name; `{}` if none
- `.deeperattributes` = as `.deeperelements`, but keyed on attribute name
- `.deeperids` = as `.deeperelements`, but keyed on id value
- `.deeperclasses` = as `.deeperelements`, but keyed on class name
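
The accessors above are enough to walk a tree recursively. A minimal sketch that builds an indented outline from `.nodes`, `.level` and `.name`; `outline` is a hypothetical helper, and the hand-built tables stand in for a parsed tree so the snippet runs without the rock:
```lua
-- Hypothetical helper: append one line per element, indented by its level.
local function outline(node, lines)
  lines = lines or {}
  lines[#lines + 1] = string.rep("  ", node.level) .. node.name
  for _, child in ipairs(node.nodes) do
    outline(child, lines)
  end
  return lines
end

-- Stand-in tree carrying only the documented fields used above.
local tree = {
  name = "html", level = 0,
  nodes = {
    { name = "head", level = 1, nodes = {} },
    { name = "body", level = 1, nodes = {
      { name = "p", level = 2, nodes = {} },
    } },
  },
}

print(table.concat(outline(tree), "\n"))
-- html
--   head
--   body
--     p
```
With a real tree, passing the value returned by `htmlparser.parse` as `node` should work the same way, provided its elements expose the fields listed above.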