Class TagParseState
java.lang.Object
  org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
    org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
      org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState
- Direct Known Subclasses:
HTMLParseState, XMLFuzzyParseState, XMLParseState
public class TagParseState extends SingleCharacterReceiver
This class represents a basic xml/html tag parser. It is capable of recognizing the following xml and html constructs:
'<' <token> <attrs> '>' ... '</' <token> '>'
'<' <token> <attrs> '/>'
'<?' <token> <attrs> '?>'
'<![' [<token>] '[' ... ']]>'
'<!' <token> ... '>'
'<!--' ... '-->'
Each of these, except the comment, has supporting protected methods that the parsing engine calls. Overriding these methods allows an extending class to perform higher-level data extraction and parsing. Of these, the messiest is the <! ... > construct (a 'btag'), since it can contain multiple nested btags, cdata-like escapes, and <? ... ?> constructs ('qtags'). Ideally the parser should produce a sequence of preparsed tokens from these tags. Because btags can be nested, the parser also tracks their depth with a btag depth counter. In this case it is therefore the btag depth, not the parse state, that determines whether the parser is operating inside a btag.
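The depth counter described above can be sketched in isolation. The following is a minimal standalone illustration, not the ManifoldCF implementation: the class and method names (BTagDepthSketch, maxBTagDepth) are hypothetical, and the sketch assumes that every '<!' opens a btag and every '>' seen at depth > 0 closes one. Comments, quoted strings and cdata-like escapes are deliberately ignored.

```java
/** Standalone illustration only; not part of the ManifoldCF API. */
final class BTagDepthSketch {
    /**
     * Returns the deepest btag nesting found in the input.
     * Simplification: comments, quoted strings and cdata-like escapes are not handled.
     */
    static int maxBTagDepth(String input) {
        int depth = 0;     // plays the role of a btag depth counter
        int maxDepth = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<' && i + 1 < input.length() && input.charAt(i + 1) == '!') {
                depth++;                              // entering a (possibly nested) btag
                maxDepth = Math.max(maxDepth, depth);
                i++;                                  // skip the '!'
            } else if (c == '>' && depth > 0) {
                depth--;                              // leaving the innermost btag
            }
        }
        return maxDepth;
    }

    public static void main(String[] args) {
        // A DOCTYPE with an internal subset nests one btag inside another.
        System.out.println(maxBTagDepth("<!DOCTYPE note [ <!ELEMENT note (#PCDATA)> ]>")); // prints 2
    }
}
```

While the counter is above zero, the scanner is inside a btag regardless of what else it has seen, which is the point made in the class description.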
-
-
Field Summary
Fields
Modifier and Type    Field    Description
protected java.lang.StringBuilder    accumBuffer    This is the only buffer we actually accumulate stuff in.
protected java.lang.StringBuilder    ampBuffer    Buffer of characters seen after ampersand.
protected int    bTagDepth    The btag depth, which indicates btag behavior when > 0.
protected java.util.List<AttrNameValue>    currentAttrList
protected java.lang.String    currentAttrName
protected java.lang.StringBuilder    currentAttrNameBuffer
protected int    currentState
protected java.lang.String    currentTagName
protected java.lang.StringBuilder    currentTagNameBuffer
protected java.lang.StringBuilder    currentValueBuffer
protected boolean    inAmpersand    Whether we've seen an ampersand
protected static java.util.Map<java.lang.String,java.lang.String>    mapLookup
protected static int    TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE
protected static int    TAGPARSESTATE_IN_ATTR_NAME
protected static int    TAGPARSESTATE_IN_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_BANG_TOKEN
protected static int    TAGPARSESTATE_IN_BRACKET_TOKEN
protected static int    TAGPARSESTATE_IN_CDATA_BODY
protected static int    TAGPARSESTATE_IN_COMMENT
protected static int    TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_END_TAG_NAME
protected static int    TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE
protected static int    TAGPARSESTATE_IN_QTAG_ATTR_NAME
protected static int    TAGPARSESTATE_IN_QTAG_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_QTAG_NAME
protected static int    TAGPARSESTATE_IN_QTAG_SAW_QUESTION
protected static int    TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_TAG_NAME
protected static int    TAGPARSESTATE_IN_TAG_SAW_SLASH
protected static int    TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE
protected static int    TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH
protected static int    TAGPARSESTATE_NEED_FINAL_BRACKET
protected static int    TAGPARSESTATE_NORMAL
protected static int    TAGPARSESTATE_SAWCOMMENTDASH
protected static int    TAGPARSESTATE_SAWDASH
protected static int    TAGPARSESTATE_SAWEXCLAMATION
protected static int    TAGPARSESTATE_SAWLEFTANGLE
protected static int    TAGPARSESTATE_SAWRIGHTBRACKET
protected static int    TAGPARSESTATE_SAWSECONDCOMMENTDASH
protected static int    TAGPARSESTATE_SAWSECONDRIGHTBRACKET
-
Fields inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
charBuffer
-
-
Constructor Summary
Constructors
Constructor    Description
TagParseState()
-
Method Summary
Modifier and Type    Method    Description
protected boolean    acceptNewTag()    Allow parsing within tag.
protected static java.lang.String    attributeDecode(java.lang.String input)    Decode an html attribute
boolean    dealWithCharacter(char thisChar)    Deal with a character.
protected boolean    dumpValues(java.lang.String value)
protected static boolean    isPunctuation(char x)    Is a character markup language punctuation?
protected static boolean    isWhitespace(char x)    Is a character markup language whitespace?
protected static java.lang.String    mapChunk(java.lang.String input)    Map an entity reference back to a character
protected java.lang.StringBuilder    newBuffer()    Allocate the buffer.
protected boolean    noteBTag(java.lang.String tagName)    This method is called for every <! <token> ... > construct, or 'btag'.
protected boolean    noteBTagToken(java.lang.String token)    This method gets called for every token inside a btag.
protected boolean    noteEndBTag()    This method is called for the end of every btag, or any time there's a naked '>' in the document.
protected boolean    noteEndEscaped()    Called for the end of every cdata-like tag.
protected boolean    noteEndTag(java.lang.String tagName)    This method gets called for every end tag.
protected boolean    noteEscaped(java.lang.String token)    Called for the start of every cdata-like tag, e.g. <![ <token> [ ... ]]>
protected boolean    noteEscapedCharacter(char thisChar)    This method gets called for every character that is found within an escape block, e.g. CDATA.
protected boolean    noteNormalCharacter(char thisChar)    This method gets called for every character that is not part of a tag etc.
protected boolean    noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)    This method is called for every <? ... ?> construct, or 'qtag'.
protected boolean    noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes)    This method gets called for every tag.
protected boolean    outputAmpBuffer()    Interpret ampersand buffer.
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver
dealWithCharacters, dealWithRemainder
-
Methods inherited from class org.apache.manifoldcf.connectorcommon.fuzzyml.CharacterReceiver
finishUp
-
Field Detail
-
TAGPARSESTATE_NORMAL
protected static final int TAGPARSESTATE_NORMAL
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWLEFTANGLE
protected static final int TAGPARSESTATE_SAWLEFTANGLE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWEXCLAMATION
protected static final int TAGPARSESTATE_SAWEXCLAMATION
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWDASH
protected static final int TAGPARSESTATE_SAWDASH
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_COMMENT
protected static final int TAGPARSESTATE_IN_COMMENT
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWCOMMENTDASH
protected static final int TAGPARSESTATE_SAWCOMMENTDASH
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWSECONDCOMMENTDASH
protected static final int TAGPARSESTATE_SAWSECONDCOMMENTDASH
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_TAG_NAME
protected static final int TAGPARSESTATE_IN_TAG_NAME
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_ATTR_NAME
protected static final int TAGPARSESTATE_IN_ATTR_NAME
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_TAG_SAW_SLASH
protected static final int TAGPARSESTATE_IN_TAG_SAW_SLASH
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_END_TAG_NAME
protected static final int TAGPARSESTATE_IN_END_TAG_NAME
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE
protected static final int TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_NAME
protected static final int TAGPARSESTATE_IN_QTAG_NAME
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_ATTR_NAME
protected static final int TAGPARSESTATE_IN_QTAG_ATTR_NAME
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_SAW_QUESTION
protected static final int TAGPARSESTATE_IN_QTAG_SAW_QUESTION
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_QTAG_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE
protected static final int TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE
protected static final int TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_BRACKET_TOKEN
protected static final int TAGPARSESTATE_IN_BRACKET_TOKEN
- See Also:
- Constant Field Values
-
TAGPARSESTATE_NEED_FINAL_BRACKET
protected static final int TAGPARSESTATE_NEED_FINAL_BRACKET
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_BANG_TOKEN
protected static final int TAGPARSESTATE_IN_BANG_TOKEN
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_CDATA_BODY
protected static final int TAGPARSESTATE_IN_CDATA_BODY
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWRIGHTBRACKET
protected static final int TAGPARSESTATE_SAWRIGHTBRACKET
- See Also:
- Constant Field Values
-
TAGPARSESTATE_SAWSECONDRIGHTBRACKET
protected static final int TAGPARSESTATE_SAWSECONDRIGHTBRACKET
- See Also:
- Constant Field Values
-
TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH
protected static final int TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH
- See Also:
- Constant Field Values
-
currentState
protected int currentState
-
bTagDepth
protected int bTagDepth
The btag depth, which indicates btag behavior when > 0.
-
accumBuffer
protected java.lang.StringBuilder accumBuffer
This is the only buffer we actually accumulate stuff in.
-
currentTagNameBuffer
protected java.lang.StringBuilder currentTagNameBuffer
-
currentAttrNameBuffer
protected java.lang.StringBuilder currentAttrNameBuffer
-
currentValueBuffer
protected java.lang.StringBuilder currentValueBuffer
-
currentTagName
protected java.lang.String currentTagName
-
currentAttrName
protected java.lang.String currentAttrName
-
currentAttrList
protected java.util.List<AttrNameValue> currentAttrList
-
inAmpersand
protected boolean inAmpersand
Whether we've seen an ampersand
-
ampBuffer
protected java.lang.StringBuilder ampBuffer
Buffer of characters seen after ampersand.
-
mapLookup
protected static final java.util.Map<java.lang.String,java.lang.String> mapLookup
-
-
Method Detail
-
dealWithCharacter
public boolean dealWithCharacter(char thisChar) throws ManifoldCFException
Deal with a character. No exceptions are allowed, since those would represent syntax errors, and we don't want those to cause difficulty.
- Specified by:
dealWithCharacter in class SingleCharacterReceiver
- Returns:
- true if done.
- Throws:
ManifoldCFException
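To illustrate how a character-at-a-time state machine of this kind can work, here is a minimal standalone sketch that recognizes only the '<!--' ... '-->' construct. It is not the ManifoldCF implementation: the state names echo the TAGPARSESTATE_* constants listed above, but the transitions, the class name CommentStateSketch and the method stripComments are assumptions made for illustration.

```java
/** Standalone illustration only; not the ManifoldCF implementation. */
final class CommentStateSketch {
    // Names echo the TAGPARSESTATE_* constants; the transitions are assumptions.
    enum State {
        NORMAL, SAWLEFTANGLE, SAWEXCLAMATION, SAWDASH,
        IN_COMMENT, SAWCOMMENTDASH, SAWSECONDCOMMENTDASH
    }

    /** Returns the input with every <!-- ... --> comment removed. */
    static String stripComments(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // possible start of a comment
        State state = State.NORMAL;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case NORMAL:
                    if (c == '<') { pending.append(c); state = State.SAWLEFTANGLE; }
                    else { out.append(c); }
                    break;
                case SAWLEFTANGLE:
                case SAWEXCLAMATION:
                case SAWDASH: {
                    char expected = (state == State.SAWLEFTANGLE) ? '!' : '-';
                    if (c != expected) {
                        // Not a comment after all: emit what was held back and re-read c.
                        out.append(pending);
                        pending.setLength(0);
                        state = State.NORMAL;
                        i--;
                    } else if (state == State.SAWDASH) {
                        pending.setLength(0); // "<!--" confirmed, so drop it
                        state = State.IN_COMMENT;
                    } else {
                        pending.append(c);
                        state = (state == State.SAWLEFTANGLE) ? State.SAWEXCLAMATION : State.SAWDASH;
                    }
                    break;
                }
                case IN_COMMENT:
                    if (c == '-') state = State.SAWCOMMENTDASH;
                    break;
                case SAWCOMMENTDASH:
                    state = (c == '-') ? State.SAWSECONDCOMMENTDASH : State.IN_COMMENT;
                    break;
                case SAWSECONDCOMMENTDASH:
                    if (c == '>') state = State.NORMAL;
                    else if (c != '-') state = State.IN_COMMENT;
                    break;
            }
        }
        out.append(pending); // unfinished "<", "<!" or "<!-" at end of input
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("a<!-- x -->b")); // prints ab
    }
}
```

Like the method documented here, the sketch never rejects malformed input: a sequence that turns out not to be a comment is passed through unchanged, and an unterminated comment is silently dropped.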
-
acceptNewTag
protected boolean acceptNewTag()
Allow parsing within tag.
-
newBuffer
protected java.lang.StringBuilder newBuffer()
Allocate the buffer.
-
outputAmpBuffer
protected boolean outputAmpBuffer() throws ManifoldCFException
Interpret ampersand buffer.
- Throws:
ManifoldCFException
-
dumpValues
protected boolean dumpValues(java.lang.String value) throws ManifoldCFException
- Throws:
ManifoldCFException
-
noteTag
protected boolean noteTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method gets called for every tag. Override this method to intercept tag begins.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndTag
protected boolean noteEndTag(java.lang.String tagName) throws ManifoldCFException
This method gets called for every end tag. Override this method to intercept tag ends.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
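The override-and-halt contract of noteTag and noteEndTag can be sketched without the library on the classpath. In the following, MiniTagScanner is a hypothetical stand-in written for this example, not TagParseState: it handles only attribute-free '<name>' and '</name>' tags, its noteTag takes just the tag name, and nothing throws ManifoldCFException. Only the idea that a callback returns true to halt further processing is taken from the documentation above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the callback contract; NOT the real TagParseState. */
class MiniTagScanner {
    /** Called for each start tag; return true to halt further processing. */
    protected boolean noteTag(String tagName) { return false; }

    /** Called for each end tag; return true to halt further processing. */
    protected boolean noteEndTag(String tagName) { return false; }

    /** Scans the input and returns true if a callback asked to halt. */
    final boolean scan(String input) {
        int pos = 0;
        while (true) {
            int open = input.indexOf('<', pos);
            if (open < 0) return false;
            int close = input.indexOf('>', open);
            if (close < 0) return false;
            String body = input.substring(open + 1, close);
            boolean halt = body.startsWith("/")
                ? noteEndTag(body.substring(1))
                : noteTag(body);
            if (halt) return true;
            pos = close + 1;
        }
    }
}

/** Records start tags and stops once the head element has ended. */
class HeadOnlyScanner extends MiniTagScanner {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean noteTag(String tagName) {
        seen.add(tagName);
        return false; // keep going
    }

    @Override
    protected boolean noteEndTag(String tagName) {
        return tagName.equals("head"); // halt after </head>
    }

    public static void main(String[] args) {
        HeadOnlyScanner scanner = new HeadOnlyScanner();
        scanner.scan("<html><head><title></title></head><body></body></html>");
        System.out.println(scanner.seen); // prints [html, head, title]
    }
}
```

A real subclass would extend TagParseState (or one of its known subclasses) and override the methods with the signatures documented on this page.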
-
noteQTag
protected boolean noteQTag(java.lang.String tagName, java.util.List<AttrNameValue> attributes) throws ManifoldCFException
This method is called for every <? ... ?> construct, or 'qtag'. Override it to intercept such constructs.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTag
protected boolean noteBTag(java.lang.String tagName) throws ManifoldCFException
This method is called for every <! <token> ... > construct, or 'btag'. Override it to intercept these.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndBTag
protected boolean noteEndBTag() throws ManifoldCFException
This method is called for the end of every btag, or any time there's a naked '>' in the document. Override it if you want to intercept these.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscaped
protected boolean noteEscaped(java.lang.String token) throws ManifoldCFException
Called for the start of every cdata-like tag, e.g. <![ <token> [ ... ]]>
- Parameters:
token - may be empty!!!
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEndEscaped
protected boolean noteEndEscaped() throws ManifoldCFException
Called for the end of every cdata-like tag.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteBTagToken
protected boolean noteBTagToken(java.lang.String token) throws ManifoldCFException
This method gets called for every token inside a btag.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteNormalCharacter
protected boolean noteNormalCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is not part of a tag etc. Override this method to intercept such characters.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
noteEscapedCharacter
protected boolean noteEscapedCharacter(char thisChar) throws ManifoldCFException
This method gets called for every character that is found within an escape block, e.g. CDATA. Override this method to intercept such characters.
- Returns:
- true to halt further processing.
- Throws:
ManifoldCFException
-
attributeDecode
protected static java.lang.String attributeDecode(java.lang.String input)
Decode an html attribute
-
mapChunk
protected static java.lang.String mapChunk(java.lang.String input)
Map an entity reference back to a character
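As an illustration of what decoding an attribute and mapping an entity reference back to a character can involve, here is a minimal standalone sketch. It is not the ManifoldCF code: the class EntityDecodeSketch, the methods decodeEntity and decodeAttribute, and the contents of the LOOKUP table are all assumptions (this page does not list what mapLookup contains), as is the choice to leave unrecognized references untouched.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone illustration only; not the ManifoldCF implementation. */
final class EntityDecodeSketch {
    // Hypothetical table; the real contents of mapLookup are not shown on this page.
    private static final Map<String, String> LOOKUP = new HashMap<>();
    static {
        LOOKUP.put("amp", "&");
        LOOKUP.put("lt", "<");
        LOOKUP.put("gt", ">");
        LOOKUP.put("quot", "\"");
    }

    /** Maps the text between '&' and ';' to a string, or returns null if it is not recognized. */
    static String decodeEntity(String chunk) {
        if (chunk.startsWith("#")) {
            try {
                int codePoint = (chunk.startsWith("#x") || chunk.startsWith("#X"))
                    ? Integer.parseInt(chunk.substring(2), 16)
                    : Integer.parseInt(chunk.substring(1));
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                // Covers NumberFormatException and invalid code points.
                return null;
            }
        }
        return LOOKUP.get(chunk);
    }

    /** Replaces recognized entity references; everything else is copied unchanged. */
    static String decodeAttribute(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                int semi = input.indexOf(';', i + 1);
                if (semi >= 0) {
                    String mapped = decodeEntity(input.substring(i + 1, semi));
                    if (mapped != null) {
                        out.append(mapped);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeAttribute("a &amp; b &#65;&#x42;")); // prints a & b AB
    }
}
```

Passing unknown references through rather than failing keeps the sketch tolerant of malformed input, in the spirit of a fuzzy parser.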
-
isWhitespace
protected static boolean isWhitespace(char x)
Is a character markup language whitespace?
-
isPunctuation
protected static boolean isPunctuation(char x)
Is a character markup language punctuation?
-
-