From abbab47865e7b9f9ed78a46028e651e7fc58bb9c Mon Sep 17 00:00:00 2001
From: Wouter Scherphof
Date: Mon, 8 Apr 2013 05:57:33 -0700
Subject: [PATCH] Create gh-pages branch via GitHub

---
 index.html  | 57 +++++++++++++++++++++++++----------------------------
 params.json |  2 +-
 2 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/index.html b/index.html
index defbbcc..f711728 100644
--- a/index.html
+++ b/index.html
@@ -31,11 +31,7 @@
-

License

- -

MIT; see ./doc/LICENSE

- -

Install

+

Install

Htmlparser is a listed LuaRock. Install using LuaRocks: luarocks install htmlparser

@@ -96,9 +92,9 @@ Now, find specific contained elements by selecting:

  • "[attribute]" elements with an attribute of the given name
  • -"[attribute='value']" equals: elements with the given value for the attribute with the given name
  • +"[attribute='value']" equals: elements with the given value for the given attribute
  • -"[attribute!='value']" not equals: elements without an attribute of the given name, or with that attribute, but with a value that is different from the given value
  • +"[attribute!='value']" not equals: elements without the given attribute, or having the attribute, but with a different value
  • "[attribute|='value']" prefix: attribute's value is given value, or starts with given value, followed by a hyphen (-)
  • @@ -117,27 +113,6 @@ Now, find specific contained elements by selecting:

    "parent > child" elements selected by the child selector string, that are a child element of any element selected by the parent selector string
  • Selectors can be combined; e.g. ".class:not([attribute]) element.class"

    -

    Limitations

    - -
      -
    • Attribute values in selectors currently cannot contain any spaces, since space is interpreted as a delimiter between the ancestor and descendant, parent and >, or > and child parts of the selector
    • -
    • Consequently, for the parent > child relation, the spaces before and after the > are mandatory
    • -
    • Attribute values in selectors currently also cannot contain any of #, ., [, ], :, (, or ) -
    • -
    • -<! elements are not parsed, including doctype, comments, and CDATA
    • -
    • Textnodes are not separate entries in the tree, so the content of <p>line1<br />line2</p> is plainly "line1<br />line2" -
    • -
    • All start and end tags should be explicitly specified in the text to be parsed; omitted tags (as permitted by the HTML spec) are NOT implied. Only the void elements naturally don't need (and mustn't have) an end tag
    • -
    • The HTML text is not validated in any way; tag and attribute names and the nesting of different tags is completely arbitrary. The only HTML-specific part of the parser is that it knows which tags are void elements
    • -

    Examples

    - -

    See ./doc/sample.lua

    - -

    Tests

    - -

    See ./tst/init.lua

    -

    Element type

    All tree elements provide, apart from :select and (), the following accessors:

    @@ -164,7 +139,7 @@ Now, find specific contained elements by selecting:

    • -:gettext() the raw text of the complete element, starting with "<tagname" and ending with "/>"
      +:gettext() the complete element text, starting with "<tagname" and ending with "/>" or "</tagname>"
    • .level how deep the element is in the tree; root level is 0
    @@ -182,7 +157,29 @@ Now, find specific contained elements by selecting:

      .deeperids as .deeperelements, but keyed on id value
    • .deeperclasses as .deeperelements, but keyed on class name
    • -
    +

    Limitations

    + +
      +
    • Attribute values in selector strings cannot contain any spaces, nor any of #, ., [, ], :, (, or ) +
    • +
    • The spaces before and after the > in a parent > child relation are mandatory
    • +
    • +<! elements (including doctype, comments, and CDATA) are not parsed; markup within CDATA is not escaped
    • +
    • Textnodes are not separate tree elements; in local root = htmlparser.parse("<p>line1<br />line2</p>"), root.nodes[1]:getcontent() is "line1<br />line2", while root.nodes[1].nodes[1].name is "br" +
    • +
    • No start or end tags are implied when omitted. Only the void elements should not have an end tag
    • +
    • No validation is done for tag or attribute names or nesting of element types. The list of void elements is in fact the only part specific to HTML
    • +

    Examples

    + +

    See ./doc/sample.lua

    + +

    Tests

    + +

    See ./tst/init.lua

    + +

    License

    + +

    MIT; see ./doc/LICENSE

diff --git a/params.json b/params.json
index cc0a10d..498a1f9 100644
--- a/params.json
+++ b/params.json
@@ -1 +1 @@
-{"name":"LuaRock \"htmlparser\"","tagline":"Parse HTML text into a tree of elements with selectors","body":"[1]: http://wscherphof.github.com/lua-set/\r\n[2]: http://api.jquery.com/category/selectors/\r\n\r\n##License\r\nMIT; see `./doc/LICENSE`\r\n\r\n##Install\r\nHtmlparser is a listed [LuaRock](http://luarocks.org/repositories/rocks/). Install using [LuaRocks](http://www.luarocks.org/): `luarocks install htmlparser`\r\n\r\n###Dependencies\r\nHtmlparser depends on [Lua 5.2](http://www.lua.org/download.html), and on the [\"set\"][1] LuaRock, which is installed along automatically. To be able to run the tests, [lunitx](https://github.com/dcurrie/lunit) also comes along as a LuaRock\r\n\r\n##Usage\r\nStart off with\r\n```lua\r\nrequire(\"luarocks.loader\")\r\nlocal htmlparser = require(\"htmlparser\")\r\n```\r\nThen, parse some html:\r\n```lua\r\nlocal root = htmlparser.parse(htmlstring)\r\n```\r\nThe input to parse may be the contents of a complete html document, or any valid html snippet, as long as all tags are correctly opened and closed.\r\nNow, find specific contained elements by selecting:\r\n```lua\r\nlocal elements = root:select(selectorstring)\r\n```\r\nOr in shorthand:\r\n```lua\r\nlocal elements = root(selectorstring)\r\n```\r\nThis will return a [Set][1] of elements, all of which are of the same type as the root element, and thus support selecting as well, if ever needed:\r\n```lua\r\nfor e in pairs(elements) do\r\n\tprint(e.name)\r\n\tlocal subs = e(subselectorstring)\r\n\tfor sub in pairs(subs) do\r\n\t\tprint(\"\", sub.name)\r\n\tend\r\nend\r\n```\r\nThe root element is a container for the top level elements in the parsed text, i.e. the `<html>` element in a parsed html document would be a child of the returned root element.\r\n\r\n##Selectors\r\nSupported selectors are a subset of [jQuery's selectors][2]:\r\n\r\n- `\"*\"` all contained elements\r\n- `\"element\"` elements with the given tagname\r\n- `\"#id\"` elements with the given id attribute value\r\n- `\".class\"` elements with the given classname in the class attribute\r\n- `\"[attribute]\"` elements with an attribute of the given name\r\n- `\"[attribute='value']\"` equals: elements with the given value for the attribute with the given name\r\n- `\"[attribute!='value']\"` not equals: elements without an attribute of the given name, or with that attribute, but with a value that is different from the given value\r\n- `\"[attribute|='value']\"` prefix: attribute's value is given value, or starts with given value, followed by a hyphen (`-`)\r\n- `\"[attribute*='value']\"` contains: attribute's value contains given value\r\n- `\"[attribute~='value']\"` word: attribute's value is a space-separated token, where one of the tokens is the given value\r\n- `\"[attribute^='value']\"` starts with: attribute's value starts with given value\r\n- `\"[attribute$='value']\"` ends with: attribute's value ends with given value\r\n- `\":not(selectorstring)\"` elements not selected by given selector string\r\n- `\"ancestor descendant\"` elements selected by the `descendant` selector string, that are a descendant of any element selected by the `ancestor` selector string\r\n- `\"parent > child\"` elements selected by the `child` selector string, that are a child element of any element selected by the `parent` selector string\r\n\r\nSelectors can be combined; e.g. `\".class:not([attribute]) element.class\"`\r\n\r\n###Limitations\r\n- Attribute values in selectors currently cannot contain any spaces, since space is interpreted as a delimiter between the `ancestor` and `descendant`, `parent` and `>`, or `>` and `child` parts of the selector\r\n- Consequently, for the `parent > child` relation, the spaces before and after the `>` are mandatory\r\n- Attribute values in selectors currently also cannot contain any of `#`, `.`, `[`, `]`, `:`, `(`, or `)`\r\n- `<!` elements are not parsed, including doctype, comments, and CDATA\r\n- Textnodes are not separate entries in the tree, so the content of `<p>line1<br />line2</p>` is plainly `\"line1<br />line2\"`\r\n- All start and end tags should be explicitly specified in the text to be parsed; omitted tags (as [permitted](http://www.w3.org/TR/html5/syntax.html#optional-tags) by the HTML spec) are NOT implied. Only the [void](http://www.w3.org/TR/html5/syntax.html#void-elements) elements naturally don't need (and mustn't have) an end tag\r\n- The HTML text is not validated in any way; tag and attribute names and the nesting of different tags is completely arbitrary. The only HTML-specific part of the parser is that it knows which tags are void elements\r\n\r\n##Examples\r\nSee `./doc/sample.lua`\r\n\r\n##Tests\r\nSee `./tst/init.lua`\r\n\r\n##Element type\r\nAll tree elements provide, apart from `:select` and `()`, the following accessors:\r\n\r\n###Basic\r\n- `.name` the element's tagname\r\n- `.attributes` a table with keys and values for the element's attributes; `{}` if none\r\n- `.id` the value of the element's id attribute; `nil` if not present\r\n- `.classes` an array with the classes listed in element's class attribute; `{}` if none\r\n- `:getcontent()` the raw text between the opening and closing tags of the element; `\"\"` if none\r\n- `.nodes` an array with the element's child elements, `{}` if none\r\n- `.parent` the element that contains this element; `root.parent` is `nil`\r\n\r\n###Other\r\n- `:gettext()` the raw text of the complete element, starting with `\"<tagname\"` and ending with `\"/>\"`\r\n- `.level` how deep the element is in the tree; root level is `0`\r\n- `.root` the root element of the tree; `root.root` is `root`\r\n- `.deepernodes` a [Set][1] containing all elements in the tree beneath this element, including this element's `.nodes`; `{}` if none\r\n- `.deeperelements` a table with a key for each distinct tagname in `.deepernodes`, containing a [Set][1] of all deeper element nodes with that name; `{}` if none\r\n- `.deeperattributes` as `.deeperelements`, but keyed on attribute name\r\n- `.deeperids` as `.deeperelements`, but keyed on id value\r\n- `.deeperclasses` as `.deeperelements`, but keyed on class name\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}
\ No newline at end of file
+{"name":"LuaRock \"htmlparser\"","tagline":"Parse HTML text into a tree of elements with selectors","body":"[1]: http://wscherphof.github.com/lua-set/\r\n[2]: http://api.jquery.com/category/selectors/\r\n\r\n##Install\r\nHtmlparser is a listed [LuaRock](http://luarocks.org/repositories/rocks/). Install using [LuaRocks](http://www.luarocks.org/): `luarocks install htmlparser`\r\n\r\n###Dependencies\r\nHtmlparser depends on [Lua 5.2](http://www.lua.org/download.html), and on the [\"set\"][1] LuaRock, which is installed along automatically. To be able to run the tests, [lunitx](https://github.com/dcurrie/lunit) also comes along as a LuaRock\r\n\r\n##Usage\r\nStart off with\r\n```lua\r\nrequire(\"luarocks.loader\")\r\nlocal htmlparser = require(\"htmlparser\")\r\n```\r\nThen, parse some html:\r\n```lua\r\nlocal root = htmlparser.parse(htmlstring)\r\n```\r\nThe input to parse may be the contents of a complete html document, or any valid html snippet, as long as all tags are correctly opened and closed.\r\nNow, find specific contained elements by selecting:\r\n```lua\r\nlocal elements = root:select(selectorstring)\r\n```\r\nOr in shorthand:\r\n```lua\r\nlocal elements = root(selectorstring)\r\n```\r\nThis will return a [Set][1] of elements, all of which are of the same type as the root element, and thus support selecting as well, if ever needed:\r\n```lua\r\nfor e in pairs(elements) do\r\n\tprint(e.name)\r\n\tlocal subs = e(subselectorstring)\r\n\tfor sub in pairs(subs) do\r\n\t\tprint(\"\", sub.name)\r\n\tend\r\nend\r\n```\r\nThe root element is a container for the top level elements in the parsed text, i.e. the `<html>` element in a parsed html document would be a child of the returned root element.\r\n\r\n##Selectors\r\nSupported selectors are a subset of [jQuery's selectors][2]:\r\n\r\n- `\"*\"` all contained elements\r\n- `\"element\"` elements with the given tagname\r\n- `\"#id\"` elements with the given id attribute value\r\n- `\".class\"` elements with the given classname in the class attribute\r\n- `\"[attribute]\"` elements with an attribute of the given name\r\n- `\"[attribute='value']\"` equals: elements with the given value for the given attribute\r\n- `\"[attribute!='value']\"` not equals: elements without the given attribute, or having the attribute, but with a different value\r\n- `\"[attribute|='value']\"` prefix: attribute's value is given value, or starts with given value, followed by a hyphen (`-`)\r\n- `\"[attribute*='value']\"` contains: attribute's value contains given value\r\n- `\"[attribute~='value']\"` word: attribute's value is a space-separated token, where one of the tokens is the given value\r\n- `\"[attribute^='value']\"` starts with: attribute's value starts with given value\r\n- `\"[attribute$='value']\"` ends with: attribute's value ends with given value\r\n- `\":not(selectorstring)\"` elements not selected by given selector string\r\n- `\"ancestor descendant\"` elements selected by the `descendant` selector string, that are a descendant of any element selected by the `ancestor` selector string\r\n- `\"parent > child\"` elements selected by the `child` selector string, that are a child element of any element selected by the `parent` selector string\r\n\r\nSelectors can be combined; e.g. `\".class:not([attribute]) element.class\"`\r\n\r\n##Element type\r\nAll tree elements provide, apart from `:select` and `()`, the following accessors:\r\n\r\n###Basic\r\n- `.name` the element's tagname\r\n- `.attributes` a table with keys and values for the element's attributes; `{}` if none\r\n- `.id` the value of the element's id attribute; `nil` if not present\r\n- `.classes` an array with the classes listed in element's class attribute; `{}` if none\r\n- `:getcontent()` the raw text between the opening and closing tags of the element; `\"\"` if none\r\n- `.nodes` an array with the element's child elements, `{}` if none\r\n- `.parent` the element that contains this element; `root.parent` is `nil`\r\n\r\n###Other\r\n- `:gettext()` the complete element text, starting with `\"<tagname\"` and ending with `\"/>\"` or `\"</tagname>\"`\r\n- `.level` how deep the element is in the tree; root level is `0`\r\n- `.root` the root element of the tree; `root.root` is `root`\r\n- `.deepernodes` a [Set][1] containing all elements in the tree beneath this element, including this element's `.nodes`; `{}` if none\r\n- `.deeperelements` a table with a key for each distinct tagname in `.deepernodes`, containing a [Set][1] of all deeper element nodes with that name; `{}` if none\r\n- `.deeperattributes` as `.deeperelements`, but keyed on attribute name\r\n- `.deeperids` as `.deeperelements`, but keyed on id value\r\n- `.deeperclasses` as `.deeperelements`, but keyed on class name\r\n\r\n##Limitations\r\n- Attribute values in selector strings cannot contain any spaces, nor any of `#`, `.`, `[`, `]`, `:`, `(`, or `)`\r\n- The spaces before and after the `>` in a `parent > child` relation are mandatory \r\n- `<!` elements (including doctype, comments, and CDATA) are not parsed; markup within CDATA is not escaped\r\n- Textnodes are not separate tree elements; in `local root = htmlparser.parse(\"<p>line1<br />line2</p>\")`, `root.nodes[1]:getcontent()` is `\"line1<br />line2\"`, while `root.nodes[1].nodes[1].name` is `\"br\"`\r\n- No start or end tags are implied when [omitted](http://www.w3.org/TR/html5/syntax.html#optional-tags). Only the [void elements](http://www.w3.org/TR/html5/syntax.html#void-elements) should not have an end tag\r\n- No validation is done for tag or attribute names or nesting of element types. The list of void elements is in fact the only part specific to HTML\r\n\r\n##Examples\r\nSee `./doc/sample.lua`\r\n\r\n##Tests\r\nSee `./tst/init.lua`\r\n\r\n##License\r\nMIT; see `./doc/LICENSE`\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}
\ No newline at end of file