Class DeDuplicatingTokenFilter

  • All Implemented Interfaces: java.lang.AutoCloseable

    public class DeDuplicatingTokenFilter
    extends org.apache.lucene.analysis.FilteringTokenFilter
    Inspects token streams for duplicate sequences of tokens. Only sequences of at least a minimum length are considered; 6 is a good heuristic because it avoids filtering common idioms and phrases while still detecting the longer runs that are typical of cut-and-paste copies of text.

    Internally, each token is hashed and reduced by a modulo operation to a single byte (so each token maps to one of 256 possible values), then recorded in a trie of previously seen byte sequences using a DuplicateByteSequenceSpotter. The trie is passed into the TokenFilter constructor so that a single object can be reused across multiple documents.
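    The byte-per-token trie described above can be sketched as follows. This is a minimal illustration, not the DuplicateByteSequenceSpotter implementation: the class name ByteTrieSketch, the use of String.hashCode() as the hash function, and the fixed window of 6 tokens are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical stand-in, not the real DuplicateByteSequenceSpotter. */
class ByteTrieSketch {
    /** Assumed sequence length, following the "6 is a good heuristic" remark. */
    static final int SEQUENCE_LENGTH = 6;

    /** One trie node: children keyed by byte, plus a count of sequences ending here. */
    private static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int sightings;
    }

    private final Node root = new Node();

    /** Reduces a token to one of 256 values. String.hashCode() is an assumption. */
    static byte toByte(String token) {
        return (byte) Math.floorMod(token.hashCode(), 256);
    }

    /** Records one window of tokens and returns how often it was seen before. */
    int record(List<String> window) {
        if (window.size() != SEQUENCE_LENGTH) {
            throw new IllegalArgumentException("expected " + SEQUENCE_LENGTH + " tokens");
        }
        Node node = root;
        for (String token : window) {
            // Walk (or create) the path for this byte sequence.
            node = node.children.computeIfAbsent(toByte(token), b -> new Node());
        }
        return node.sightings++;
    }

    public static void main(String[] args) {
        // One trie instance is shared, as if across two documents.
        ByteTrieSketch trie = new ByteTrieSketch();
        List<String> phrase = List.of("all", "rights", "reserved", "by", "the", "publisher");
        System.out.println(trie.record(phrase)); // first document: prints 0
        System.out.println(trie.record(phrase)); // second document: prints 1
    }
}
```

    Because each token is reduced to a single byte, distinct tokens can collide; the sketch accepts this in exchange for a compact trie, which is the trade-off the description implies.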

    The emitDuplicates setting controls whether duplicate tokens are filtered from the results or output. When emitDuplicates is true, the DuplicateSequenceAttribute attribute can be used to inspect the number of prior sightings.
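    The effect of the emitDuplicates setting can be modelled with a small self-contained sketch. It does not use the Lucene API: EmitDuplicatesSketch, SeenToken and priorSightings are hypothetical names, and the rule that a token counts as a duplicate when it has at least one prior sighting is an assumption drawn from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the emitDuplicates switch; not the filter's actual code. */
class EmitDuplicatesSketch {

    /** A token paired with a hypothetical count of prior sightings in a sequence. */
    static final class SeenToken {
        final String term;
        final int priorSightings;

        SeenToken(String term, int priorSightings) {
            this.term = term;
            this.priorSightings = priorSightings;
        }
    }

    /** Returns the tokens that would be output under the given setting. */
    static List<String> apply(List<SeenToken> tokens, boolean emitDuplicates) {
        List<String> out = new ArrayList<>();
        for (SeenToken t : tokens) {
            boolean duplicate = t.priorSightings > 0; // assumed duplicate rule
            if (emitDuplicates) {
                // Everything is output; the count stays available for inspection.
                out.add(t.term + "(" + t.priorSightings + ")");
            } else if (!duplicate) {
                // Duplicates are silently dropped.
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SeenToken> tokens = List.of(
                new SeenToken("fresh", 0),
                new SeenToken("copied", 2),
                new SeenToken("text", 2));
        System.out.println(apply(tokens, false)); // prints [fresh]
        System.out.println(apply(tokens, true));  // prints [fresh(0), copied(2), text(2)]
    }
}
```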

    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

    • Field Summary

      • Fields inherited from class org.apache.lucene.analysis.TokenFilter

      • Fields inherited from class org.apache.lucene.analysis.TokenStream

    • Method Summary

      Modifier and Type Method Description
      protected boolean accept()  
      • Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter

        end, incrementToken, reset
      • Methods inherited from class org.apache.lucene.analysis.TokenFilter

      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • DeDuplicatingTokenFilter

        public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in,
                                        DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
      • DeDuplicatingTokenFilter

        public DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in,
                                        DuplicateByteSequenceSpotter byteStreamDuplicateSpotter,
                                        boolean emitDuplicates)
        Parameters:
        in - the input token stream
        byteStreamDuplicateSpotter - the object that retains the trie of token sequences
        emitDuplicates - true if duplicate tokens are to be emitted (use the DuplicateSequenceAttribute attribute to inspect the number of prior sightings of a token as part of a sequence)
    • Method Detail

      • accept

        protected boolean accept()
        Specified by:
        accept in class org.apache.lucene.analysis.FilteringTokenFilter