An HTML parser for Lua.

#LuaRock "htmlparser"

Parse HTML text into a tree of elements with selectors

###License
MIT; see ./doc/LICENSE

###Usage
Start off with:

    require("luarocks.loader")
    local htmlparser = require("htmlparser")

Then, parse some HTML:

    local root = htmlparser.parse(htmlstring)

The input to parse may be the contents of a complete HTML document or any valid HTML snippet, as long as all tags are correctly opened and closed.

Now, find specific elements by selecting:

    local elements = root:select(selectorstring)

Or in shorthand:

    local elements = root(selectorstring)

This will return a Set of elements, all of which are of the same type as the root element and thus support selecting as well, if ever needed:

    for e in pairs(elements) do
    	print(e.name)
    	local subs = e(subselectorstring)
    	for sub in pairs(subs) do
    		print("", sub.name)
    	end
    end
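
Because the result is a Set (a table whose keys are the elements), pairs visits the matches in no guaranteed order. When stable output matters, the names can be collected into an array and sorted first. The following is a minimal sketch that assumes only the Set shape and the .name field described above; sorted_names is a hypothetical helper, not part of htmlparser, and the stand-in tables merely imitate parsed elements so the snippet runs without the library:

```lua
-- Hypothetical helper (not part of htmlparser): collect the tag names of
-- every element in a Set and return them as a sorted array.
local function sorted_names(elements)
	local names = {}
	for e in pairs(elements) do -- the elements are the keys of the Set
		names[#names + 1] = e.name
	end
	table.sort(names)
	return names
end

-- Stand-in elements imitating the documented shape; with the real
-- library this would be the result of root:select(selectorstring).
local fake = {
	[{ name = "p" }] = true,
	[{ name = "div" }] = true,
	[{ name = "a" }] = true,
}

print(table.concat(sorted_names(fake), ", ")) -- a, div, p
```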

###Selectors

  • "element"
  • "#id"
  • ".class"
  • "[attribute]"
  • "[attribute=value]"
  • "[attribute!=value]"
  • "[attribute|=value]"
  • "[attribute*=value]"
  • "[attribute~=value]"
  • "[attribute^=value]"
  • "[attribute$=value]"
  • ":not(selector)"
  • "ancestor descendant"
  • "parent > child"

Selectors can be combined; e.g. ".class:not([attribute]) element.class"

####Limitations

  • Attribute values in selectors currently cannot contain spaces, since a space is interpreted as the delimiter between the ancestor and descendant, parent and >, or > and child parts of the selector
  • Likewise, for the parent > child relation, the spaces before and after the > are mandatory
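
To see why both limitations follow, it helps to picture the selector string being split on whitespace. The sketch below only illustrates that idea in plain Lua; it is not htmlparser's actual implementation, and split_selector is a hypothetical name:

```lua
-- Illustration only: split a selector string on runs of whitespace,
-- the way a space-delimited selector grammar would see it.
local function split_selector(selector)
	local parts = {}
	for part in selector:gmatch("%S+") do
		parts[#parts + 1] = part
	end
	return parts
end

-- A well-formed "parent > child" selector yields three parts.
print(table.concat(split_selector("ul > li"), " | "))             -- ul | > | li

-- Without the surrounding spaces, ">" is not seen as a separate part.
print(table.concat(split_selector("ul>li"), " | "))               -- ul>li

-- A space inside an attribute value cuts the selector in two.
print(table.concat(split_selector("[title=hello world]"), " | ")) -- [title=hello | world]
```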

###Element type
The tree elements provide, apart from :select and (), the following accessors:

  • .name = the element's tag name
  • .attributes = a table with keys and values for the element's attributes
  • .id = the value of the element's id attribute, if present
  • .classes = an array with the classes listed in the element's class attribute, if any
  • :getcontent() = the raw text between the opening and closing tags of the element
  • .nodes = an array with the element's child elements
  • .parent = the element that contains this element; root.parent is nil
  • :gettext() = the raw text of the complete element, starting with "<tagname" and ending with "/>"
  • .level = how deep the element is in the tree; root level is 0
  • .root = the root element of the tree; root.root is root
  • .deepernodes = a Set containing all elements in the tree beneath this element, including this element's .nodes
  • .deeperelements = a table with a key for each distinct tagname in .deepernodes, containing a Set of all deeper element nodes with that name
  • .deeperattributes = as .deeperelements, but keyed on attribute name
  • .deeperids = as .deeperelements, but keyed on id value
  • .deeperclasses = as .deeperelements, but keyed on class name
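
As an illustration of how these accessors fit together, the following sketch walks a tree depth-first using only .name, .level and .nodes. Here outline is a hypothetical helper, and the hand-built tables (including the arbitrary name "root") stand in for a parsed tree so that the snippet runs without the library; with htmlparser installed, the result of htmlparser.parse(htmlstring) would presumably be passed instead.

```lua
-- Hypothetical helper: return one line per element, indented by .level.
local function outline(element, lines)
	lines = lines or {}
	lines[#lines + 1] = string.rep("  ", element.level) .. element.name
	for _, child in ipairs(element.nodes) do -- .nodes is an array of children
		outline(child, lines)
	end
	return lines
end

-- Hand-built stand-in imitating the documented element shape.
local li1 = { name = "li", level = 2, nodes = {} }
local li2 = { name = "li", level = 2, nodes = {} }
local ul = { name = "ul", level = 1, nodes = { li1, li2 } }
local root = { name = "root", level = 0, nodes = { ul } }

print(table.concat(outline(root), "\n"))
```

Passing the accumulator table down the recursion avoids building and concatenating intermediate arrays at every level.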