HTMLPurifier_Lexer_DOMLex
Parser that uses PHP 5's DOM extension (part of the core).
In PHP 5, the DOM XML extension was revamped into DOM and added to the core. It gives us a forgiving HTML parser, which we use to transform the HTML into a DOM, and then into tokens. It is blazingly fast (for large documents, it performs twenty times faster than HTMLPurifier_Lexer_DirectLex) and is the default choice for PHP 5.
- Full name:
\HTMLPurifier_Lexer_DOMLex
- Parent class:
\HTMLPurifier_Lexer
Properties
factory
Methods
__construct
tokenizeHTML
Lexes an HTML string into tokens.
public tokenizeHTML(string $html, \HTMLPurifier_Config $config, \HTMLPurifier_Context $context): \HTMLPurifier_Token[]
Parameters:
Parameter | Type | Description |
---|---|---|
$html | string | |
$config | \HTMLPurifier_Config | |
$context | \HTMLPurifier_Context | |
tokenizeDOM
Iterative function that tokenizes a node, putting it into an accumulator.
To iterate is human, to recurse divine - L. Peter Deutsch
Parameters:
Parameter | Type | Description |
---|---|---|
$node | \DOMNode | DOMNode to be tokenized. |
$tokens | \HTMLPurifier_Token[] | Array-list of already tokenized tokens. |
$config | mixed | |
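
The idea of tokenizing a node iteratively into an accumulator can be illustrated with a minimal sketch. This is not the library's tokenizeDOM(): the function name `sketchTokenize` and the array-shaped tokens are assumptions made for illustration, and only PHP's DOM extension is used. An explicit stack takes the place of recursion.

```php
<?php
// Hypothetical sketch, not HTML Purifier's tokenizeDOM().
// Walks a DOM subtree with an explicit stack and appends simple
// array tokens (['start'|'end'|'empty'|'text', value]) to an accumulator.
function sketchTokenize(DOMNode $root): array
{
    $tokens = [];
    // Each stack entry is [node, closing]; a closing entry emits an end token.
    $stack = [[$root, false]];
    while ($stack) {
        [$node, $closing] = array_pop($stack);
        if ($closing) {
            $tokens[] = ['end', $node->nodeName];
            continue;
        }
        if ($node->nodeType === XML_TEXT_NODE) {
            $tokens[] = ['text', $node->nodeValue];
            continue;
        }
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // comments and other node types are skipped in this sketch
        }
        if (!$node->hasChildNodes()) {
            $tokens[] = ['empty', $node->nodeName];
            continue;
        }
        $tokens[] = ['start', $node->nodeName];
        $stack[] = [$node, true];
        // Push children in reverse so they are popped in document order.
        for ($i = $node->childNodes->length - 1; $i >= 0; $i--) {
            $stack[] = [$node->childNodes->item($i), false];
        }
    }
    return $tokens;
}

$doc = new DOMDocument();
$doc->loadHTML('<div>Hi <b>there</b><br></div>');
$div = $doc->getElementsByTagName('div')->item(0);
$tokens = sketchTokenize($div);
```

The real method also handles comments, CDATA and the implicit wrapper element, which this sketch leaves out.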
getTagName
Portably retrieve the tag name of a node; deals with older versions of libxml like 2.7.6
Parameters:
Parameter | Type | Description |
---|---|---|
$node | \DOMNode | |
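
One way to retrieve a tag name portably is to probe several node properties in turn. The sketch below assumes that approach; the helper name `portableTagName` and the property order are illustrative assumptions, not the library's code.

```php
<?php
// Hypothetical helper: returns the first available name-like property.
function portableTagName(DOMNode $node): ?string
{
    foreach (['tagName', 'nodeName', 'localName'] as $prop) {
        if (isset($node->$prop)) {
            return $node->$prop;
        }
    }
    return null;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>text</p>');
$p = $doc->getElementsByTagName('p')->item(0);
$elementName = portableTagName($p);          // element node
$textName = portableTagName($p->firstChild); // text node has no tagName
```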
getData
Portably retrieve the data of a node; deals with older versions of libxml like 2.7.6
Parameters:
Parameter | Type | Description |
---|---|---|
$node | \DOMNode | |
createStartNode
protected createStartNode(\DOMNode $node, \HTMLPurifier_Token[]& $tokens, bool $collect, mixed $config): bool
Parameters:
Parameter | Type | Description |
---|---|---|
$node | \DOMNode | DOMNode to be tokenized. |
$tokens | \HTMLPurifier_Token[] | Array-list of already tokenized tokens. |
$collect | bool | Says whether or not start and close tokens are collected; set to false at the first recursion because it is the implicit DIV tag you are dealing with. |
$config | mixed | |
Return Value:
Whether the token needs an end token.
createEndNode
Parameters:
Parameter | Type | Description |
---|---|---|
$node | \DOMNode | |
$tokens | \HTMLPurifier_Token[] | |
transformAttrToAssoc
Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
Parameters:
Parameter | Type | Description |
---|---|---|
$node_map | \DOMNamedNodeMap | DOMNamedNodeMap of DOMAttr objects. |
Return Value:
Associative array of attributes.
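
The conversion described above can be sketched with PHP's DOM extension alone. The helper name `attrMapToAssoc` is hypothetical and the body is an illustration rather than the library's implementation.

```php
<?php
// Hypothetical helper: flattens a DOMNamedNodeMap of DOMAttr objects
// into a name => value array.
function attrMapToAssoc(?DOMNamedNodeMap $nodeMap): array
{
    $attrs = [];
    if ($nodeMap === null) {
        return $attrs; // non-element nodes have no attribute map
    }
    foreach ($nodeMap as $attr) {
        $attrs[$attr->name] = $attr->value;
    }
    return $attrs;
}

$doc = new DOMDocument();
$doc->loadHTML('<a href="https://example.com/" title="Example">link</a>');
$a = $doc->getElementsByTagName('a')->item(0);
$assoc = attrMapToAssoc($a->attributes);
```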
muteErrorHandler
An error handler that mutes all errors
Parameters:
Parameter | Type | Description |
---|---|---|
$errno | int | |
$errstr | string | |
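
A muting handler is typically installed only around the parse call and removed straight afterwards. The sketch below assumes that pattern; `muteHandler` is a hypothetical stand-in for the method documented here.

```php
<?php
// Hypothetical stand-in for a muting handler: returning true tells PHP
// the error has been handled, so nothing is reported.
function muteHandler(int $errno, string $errstr): bool
{
    return true;
}

$doc = new DOMDocument();
set_error_handler('muteHandler');
try {
    // Markup that libxml does not recognise may raise warnings;
    // they are swallowed while the handler is installed.
    $ok = $doc->loadHTML('<div><foo>text</foo></div>');
} finally {
    restore_error_handler(); // always put the previous handler back
}
```

Restoring the handler in a `finally` block keeps the muting scoped to the parse, so later errors are reported normally.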
callbackUndoCommentSubst
Callback function for undoing escaping of stray angled brackets in comments
Parameters:
Parameter | Type | Description |
---|---|---|
$matches | array | |
callbackArmorCommentEntities
Callback function that entity-izes ampersands in comments so that callbackUndoCommentSubst doesn't clobber them
Parameters:
Parameter | Type | Description |
---|---|---|
$matches | array | |
wrapHTML
Wraps an HTML fragment in the necessary HTML
protected wrapHTML(string $html, \HTMLPurifier_Config $config, \HTMLPurifier_Context $context, mixed $use_div = true): string
Parameters:
Parameter | Type | Description |
---|---|---|
$html | string | |
$config | \HTMLPurifier_Config | |
$context | \HTMLPurifier_Context | |
$use_div | mixed | |
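
As a rough illustration of what wrapping a fragment "in the necessary HTML" can look like, the sketch below builds a minimal document around it. The function name `wrapFragment`, the exact markup, and the omission of the `$config` and `$context` parameters are all simplifying assumptions.

```php
<?php
// Hypothetical simplification: no doctype handling, no config or context.
function wrapFragment(string $html, bool $useDiv = true): string
{
    $ret = '<html><head>'
         . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
         . '</head><body>';
    // The optional DIV gives the fragment a single known root element.
    $ret .= $useDiv ? '<div>' . $html . '</div>' : $html;
    return $ret . '</body></html>';
}

$wrapped = wrapFragment('Hello <em>world</em>');
$doc = new DOMDocument();
$doc->loadHTML($wrapped);
$div = $doc->getElementsByTagName('div')->item(0);
```

With the wrapper in place, the parser sees a complete document with a declared encoding, and the fragment's nodes can be found under the single DIV.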
Inherited methods
create
Retrieves or sets the default Lexer as a Prototype Factory.
By default HTMLPurifier_Lexer_DOMLex will be returned. There are a few exceptions involving special features that only DirectLex implements.
- This method is static.
Parameters:
Parameter | Type | Description |
---|---|---|
$config | \HTMLPurifier_Config | |
Throws:
__construct
parseText
Parameters:
Parameter | Type | Description |
---|---|---|
$string | mixed | |
$config | mixed | |
parseAttr
Parameters:
Parameter | Type | Description |
---|---|---|
$string | mixed | |
$config | mixed | |
parseData
Parses special entities into the proper characters.
This string will translate escaped versions of the special characters into the correct ones.
Parameters:
Parameter | Type | Description |
---|---|---|
$string | string | String character data to be parsed. |
$is_attr | mixed | |
$config | mixed | |
Return Value:
Parsed character data.
tokenizeHTML
Lexes an HTML string into tokens.
public tokenizeHTML(mixed $string, \HTMLPurifier_Config $config, \HTMLPurifier_Context $context): \HTMLPurifier_Token[]
Parameters:
Parameter | Type | Description |
---|---|---|
$string | mixed | String HTML. |
$config | \HTMLPurifier_Config | |
$context | \HTMLPurifier_Context | |
Return Value:
array representation of HTML.
escapeCDATA
Translates CDATA sections into regular sections (through escaping).
- This method is static.
Parameters:
Parameter | Type | Description |
---|---|---|
$string | string | HTML string to process. |
Return Value:
HTML with CDATA sections escaped.
escapeCommentedCDATA
Special CDATA case that is especially convoluted for script elements.
- This method is static.
Parameters:
Parameter | Type | Description |
---|---|---|
$string | string | HTML string to process. |
Return Value:
HTML with CDATA sections escaped.
removeIEConditional
Special Internet Explorer conditional comments should be removed.
- This method is static.
Parameters:
Parameter | Type | Description |
---|---|---|
$string | string | HTML string to process. |
Return Value:
HTML with conditional comments removed.
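
Removing conditional comments can be sketched with a single regular expression. The pattern and the function name `stripIeConditionals` below are assumptions for illustration and may differ from what the library uses.

```php
<?php
// Hypothetical sketch: drops <!--[if ...]> ... <![endif]--> blocks.
function stripIeConditionals(string $html): string
{
    return preg_replace(
        '#<!--\[if [^>]+\]>.*?<!\[endif\]-->#si',
        '',
        $html
    ) ?? $html; // fall back to the input if PCRE reports an error
}

$clean = stripIeConditionals('a<!--[if IE]><p>old</p><![endif]-->b');
```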
CDATACallback
Callback function for escapeCDATA() that does the work.
- This method is static.
Parameters:
Parameter | Type | Description |
---|---|---|
$matches | array | PCRE matches array, with index 0 the entire match and 1 the inside of the CDATA section. |
Return Value:
Escaped internals of the CDATA section.
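
How escapeCDATA() and its callback cooperate can be sketched as follows. The function name `escapeCdataSketch`, the pattern, and the inline closure are illustrative assumptions; the point is that index 1 of the matches array holds the inside of the CDATA section, which is then escaped.

```php
<?php
// Hypothetical sketch: replaces each CDATA section with its escaped contents.
function escapeCdataSketch(string $html): string
{
    return preg_replace_callback(
        '/<!\[CDATA\[(.+?)\]\]>/s',
        static function (array $matches): string {
            // $matches[0] is the whole section, $matches[1] its inside.
            return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
        },
        $html
    ) ?? $html; // fall back to the input if PCRE reports an error
}

$escaped = escapeCdataSketch('<p><![CDATA[1 < 2 & 3 > 2]]></p>');
```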
normalize
Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
public normalize(string $html, \HTMLPurifier_Config $config, \HTMLPurifier_Context $context): string
Parameters:
Parameter | Type | Description |
---|---|---|
$html | string | HTML. |
$config | \HTMLPurifier_Config | |
$context | \HTMLPurifier_Context | |
extractBody
Takes a string of HTML (fragment or document) and returns the content
Parameters:
Parameter | Type | Description |
---|---|---|
$html | mixed | |
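
Assuming "the content" means whatever sits between the body tags, the behaviour can be sketched like this. The function name `extractBodySketch` and the pattern are assumptions, and the sketch ignores edge cases such as a body tag that appears inside a comment.

```php
<?php
// Hypothetical sketch: returns the inside of <body>...</body> if present,
// otherwise the input unchanged (it is treated as a fragment already).
function extractBodySketch(string $html): string
{
    if (preg_match('|<body[^>]*>(.*)</body>|is', $html, $matches)) {
        return $matches[1];
    }
    return $html;
}

$full = '<html><head><title>t</title></head><body class="x"><p>Hi</p></body></html>';
$inner = extractBodySketch($full);
$fragment = extractBodySketch('<p>Hi</p>');
```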
Automatically generated on 2025-03-18