# svg/sax

A maintained fork of [sax-js](https://github.com/isaacs/sax-js), a sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it. It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

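
The list above notes that an object model can be built on top of the parser's events. Here is a minimal sketch of that idea, assuming only the `onopentag`, `ontext`, and `onclosetag` handlers described in this README. The helper name `attachTreeBuilder` and the node shape (`name`, `attributes`, `children`) are hypothetical and chosen for illustration only.

```javascript
// Hypothetical helper: wires handlers onto a parser and returns a function
// that yields the root node once parsing is finished.
function attachTreeBuilder(parser) {
  var root = null;
  var stack = []; // currently open elements, innermost last

  parser.onopentag = function (node) {
    var element = { name: node.name, attributes: node.attributes, children: [] };
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(element);
    } else {
      // A later top-level element would replace an earlier one in this sketch.
      root = element;
    }
    stack.push(element);
  };

  parser.ontext = function (text) {
    // Text outside any open element is ignored in this sketch.
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(text);
    }
  };

  parser.onclosetag = function () {
    stack.pop();
  };

  return function getRoot() {
    return root;
  };
}
```

With a parser created by `sax.parser(strict)`, one would call `var getRoot = attachTreeBuilder(parser)` before writing any data and read `getRoot()` after `close()`.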
## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.

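
Registering extra entities on `parser.ENTITIES` could look roughly like the sketch below. It assumes that `parser.ENTITIES` is a plain, mutable object mapping entity names (without `&` and `;`) to replacement strings, which this README does not spell out. The helper name `addEntities` and the `company` entity are hypothetical.

```javascript
// Hypothetical helper: copies extra entity definitions onto parser.ENTITIES.
// Assumption: ENTITIES maps a bare entity name to its replacement string.
function addEntities(parser, extra) {
  Object.keys(extra).forEach(function (name) {
    parser.ENTITIES[name] = extra[name];
  });
  return parser;
}
```

For example, `addEntities(parser, { company: "Example Corp" })` would be called before writing a document that uses `&company;`.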
## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text. t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag. node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute. attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
```

## Arguments

Pass the following arguments to the parser function. All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting. All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`).

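
To make the `trim` and `normalize` settings concrete, here is a rough emulation of their effect on a single text node. This is not the parser's own code: it assumes `trim` strips leading and trailing whitespace and `normalize` collapses each run of whitespace into one space, and the function name `applyTextOptions` is hypothetical.

```javascript
// Rough emulation of the `trim` and `normalize` options on one text node.
// Assumption: `trim` strips leading/trailing whitespace and `normalize`
// collapses each whitespace run into one space; the real parser may differ.
function applyTextOptions(text, opt) {
  var result = text;
  if (opt.trim) {
    result = result.trim();
  }
  if (opt.normalize) {
    result = result.replace(/\s+/g, " ");
  }
  return result;
}
```

The settings themselves are passed as the second argument, for example `sax.parser(strict, { trim: true, normalize: true })`.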
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

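
The three methods can be combined to feed a document in pieces while recovering from errors. The sketch below assumes that clearing `parser.error` and calling `resume` from inside the `error` handler is enough to continue. The helper name `parseChunks` is hypothetical.

```javascript
// Hypothetical helper: feeds chunks to a parser and collects errors instead
// of stopping at the first one.
function parseChunks(parser, chunks) {
  var errors = [];

  parser.onerror = function (err) {
    errors.push(err);
    // Clear the stored error, then ask the parser to carry on.
    parser.error = null;
    parser.resume();
  };

  chunks.forEach(function (chunk) {
    parser.write(chunk);
  });
  parser.close();

  return errors;
}
```

A call such as `parseChunks(parser, ["<xml>Hello, ", "world!</xml>"])` returns the list of errors seen along the way, which is empty for well-formed input.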
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

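
The position members are handy for error messages. Here is a small sketch, assuming position tracking has not been disabled through the `position` option. The helper name `describePosition` is hypothetical.

```javascript
// Hypothetical helper: formats the parser's current position for messages.
// The values are reported as-is; whether they count from 0 or 1 is not
// stated in this README.
function describePosition(parser) {
  return "line " + parser.line + ", column " + parser.column +
    " (offset " + parser.position + ")";
}
```

Inside an `onerror` handler, `describePosition(this)` gives a string that can be appended to the error message.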
## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The list of supported events is also in the exported
`EVENTS` array.

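
Because every handler follows the `on<eventname>` pattern, a listener can be attached to all events in a loop. A minimal sketch follows; the helper name `traceEvents` is hypothetical, and `eventNames` is expected to be the exported `EVENTS` array.

```javascript
// Hypothetical helper: records every event named in `eventNames`.
// Each recorded entry holds the event name and its single argument.
function traceEvents(parser, eventNames) {
  var trace = [];
  eventNames.forEach(function (name) {
    parser["on" + name] = function (arg) {
      trace.push({ event: name, arg: arg });
    };
  });
  return trace;
}
```

A call such as `var trace = traceEvents(parser, sax.EVENTS)` replaces any handlers assigned earlier, including `onerror`, so it is best suited to debugging.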
`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered. Argument: object with a
`name` field and an empty `attributes` set. Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

`attribute` - An attribute node. Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node. Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

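
Since `cdata` may fire several times per block, a consumer that wants the whole block has to join the pieces between `opencdata` and `closecdata`. One way to do that is sketched below. The helper name `collectCdata` and its `onBlock` callback are hypothetical.

```javascript
// Hypothetical helper: joins the pieces of each CDATA block and passes the
// full text to `onBlock` when the block closes.
function collectCdata(parser, onBlock) {
  var parts = null; // null while no CDATA block is open

  parser.onopencdata = function () {
    parts = [];
  };

  parser.oncdata = function (chunk) {
    if (parts) {
      parts.push(chunk);
    }
  };

  parser.onclosecdata = function () {
    if (parts) {
      onBlock(parts.join(""));
    }
    parts = null;
  };
}
```

Collecting the pieces in an array and joining once at the end avoids repeated string concatenation for large blocks.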
`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`noscript` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special xml characters.
If you pass `noscript: true`, then this behavior is suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue. I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test. Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.