All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public class DeDuplicatingTokenFilter
extends org.apache.lucene.analysis.FilteringTokenFilter
Internally, each token is hashed and reduced modulo 256 to a single byte (so
each token maps to one of 256 possible values) and then recorded in a trie of
previously seen byte sequences using a DuplicateByteSequenceSpotter. This trie
is passed into the TokenFilter constructor so that a single object can be
reused across multiple documents.
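The hashing and trie recording described above can be pictured with a minimal, self-contained sketch. It is not the DuplicateByteSequenceSpotter implementation: the class name `ByteTrieSketch`, the hash function, and the map-based nodes are all assumptions chosen for readability, and the sketch records whole sequences rather than whatever windowing the real class uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical stand-in, not the Elasticsearch class. */
final class ByteTrieSketch {

    /** One trie node: children keyed by byte, plus a count of prior visits. */
    static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int sightings;
    }

    private final Node root = new Node();

    /** Reduces a token to one of 256 values. The hash itself is an assumption. */
    static byte toByte(String token) {
        int h = 0;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return (byte) Math.floorMod(h, 256);
    }

    /**
     * Walks (and extends) the trie along the byte path for the given tokens.
     * Returns how many times this exact byte path was recorded before.
     */
    int record(List<String> tokens) {
        Node node = root;
        for (String token : tokens) {
            node = node.children.computeIfAbsent(toByte(token), k -> new Node());
        }
        return node.sightings++;
    }
}
```

Because only one byte is kept per token, two different tokens can land on the same trie branch, so a structure like this trades some accuracy for a compact, fixed fan-out. Keeping one `ByteTrieSketch` instance alive across calls mirrors the idea of reusing a single spotter object across documents.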
The emitDuplicates setting controls whether duplicate tokens are filtered from
the results or are output. When emitDuplicates is true, the
DuplicateSequenceAttribute attribute can be used to inspect the number of
prior sightings.
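The two modes of emitDuplicates can be illustrated with a small sketch. This is a simplification under stated assumptions: it tracks single tokens in a plain map rather than token sequences in a trie, and the class `EmitDuplicatesSketch` and its `filter` method are hypothetical names, not part of the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: models the emitDuplicates switch on single tokens. */
final class EmitDuplicatesSketch {

    /**
     * Returns (token, priorSightings) pairs. Duplicates are dropped unless
     * emitDuplicates is true, in which case they are kept with their count.
     * The 'seen' map is supplied by the caller so it can span several documents.
     */
    static List<Map.Entry<String, Integer>> filter(
            List<String> tokens, Map<String, Integer> seen, boolean emitDuplicates) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : tokens) {
            // merge returns the updated count, so subtract one for prior sightings
            int prior = seen.merge(token, 1, Integer::sum) - 1;
            if (prior == 0 || emitDuplicates) {
                out.add(Map.entry(token, prior));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> seen = new HashMap<>();
        System.out.println(filter(List.of("a", "b", "a"), seen, false));
        System.out.println(filter(List.of("a", "c"), seen, true));
    }
}
```

With emitDuplicates set to false the repeated token is simply dropped; with it set to true the token is kept and the count plays the role that DuplicateSequenceAttribute plays in the real filter.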
Constructor | Description |
---|---|
DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter) | |
DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates) | |
Modifier and Type | Method | Description |
---|---|---|
protected boolean | accept() | |
Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter: end, incrementToken, reset
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
Parameters:
in - The input token stream
byteStreamDuplicateSpotter - object which retains trie of token sequences
emitDuplicates - true if duplicate tokens are to be emitted (use the
DuplicateSequenceAttribute attribute to inspect the number of prior
sightings of tokens as part of a sequence).