Class DeDuplicatingTokenFilter

java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.FilteringTokenFilter
org.apache.lucene.analysis.miscellaneous.DeDuplicatingTokenFilter
All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable

public class DeDuplicatingTokenFilter
extends org.apache.lucene.analysis.FilteringTokenFilter
Inspects token streams for duplicate sequences of tokens. A sequence must reach a minimum length before it is treated as a duplicate; 6 is a good heuristic because it avoids filtering common idioms and phrases while still detecting the longer sections that are typical of cut-and-paste copies of text.

Internally, each token is hashed and reduced modulo 256 to a single byte (so there are 256 possible values for each token) and then recorded in a trie of previously seen byte sequences using a DuplicateByteSequenceSpotter. This trie is passed into the TokenFilter constructor so that a single object can be reused across multiple documents.

The emitDuplicates setting controls whether duplicate tokens are filtered from the results or output. When emitDuplicates is true, the DuplicateSequenceAttribute attribute can be used to inspect the number of prior sightings.
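
The approach described above can be illustrated with a small standalone sketch. It is not the code of this class or of DuplicateByteSequenceSpotter: the names DedupSketch, SequenceTrie, hashToByte, priorSightings and filter are hypothetical, the hash is an arbitrary choice, and the sequence length of 3 is used only to keep the demo short (the description above suggests 6). It assumes that a token counts as a duplicate when it falls inside any fixed-length window whose byte sequence was recorded before.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** One trie node; count is how many times a full window ended here. */
    static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int count;
    }

    /** Hypothetical stand-in for the spotter: a trie of fixed-length byte sequences. */
    static final class SequenceTrie {
        final Node root = new Node();

        /** Records the window and returns how many times it was seen before. */
        int recordAndCountPrior(byte[] window) {
            Node node = root;
            for (byte b : window) {
                node = node.children.computeIfAbsent(b, k -> new Node());
            }
            return node.count++;
        }
    }

    /** Reduces a token to one of 256 values (illustrative hash, an assumption). */
    static byte hashToByte(String token) {
        return (byte) Math.floorMod(token.hashCode(), 256);
    }

    /** For each token, the highest prior-sighting count of any window containing it. */
    static int[] priorSightings(List<String> tokens, int seqLen, SequenceTrie trie) {
        int[] sightings = new int[tokens.size()];
        for (int end = seqLen - 1; end < tokens.size(); end++) {
            int start = end - seqLen + 1;
            byte[] window = new byte[seqLen];
            for (int i = 0; i < seqLen; i++) {
                window[i] = hashToByte(tokens.get(start + i));
            }
            int prior = trie.recordAndCountPrior(window);
            for (int i = start; i <= end; i++) {
                sightings[i] = Math.max(sightings[i], prior);
            }
        }
        return sightings;
    }

    /** Drops duplicate tokens unless emitDuplicates is true. */
    static List<String> filter(List<String> tokens, int[] sightings, boolean emitDuplicates) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (emitDuplicates || sightings[i] == 0) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SequenceTrie trie = new SequenceTrie(); // shared across both documents
        List<String> doc1 = Arrays.asList("a", "b", "c", "d", "e");
        List<String> doc2 = Arrays.asList("x", "a", "b", "c", "y");

        priorSightings(doc1, 3, trie); // first document only populates the trie
        int[] seen = priorSightings(doc2, 3, trie);

        System.out.println(filter(doc2, seen, false)); // prints [x, y]
        System.out.println(filter(doc2, seen, true));  // prints [x, a, b, c, y]
    }
}
```

Because tokens are reduced to a single byte, unrelated tokens can collide; requiring a minimum sequence length makes it much less likely that a chance collision is reported as a duplicate. Sharing one trie across documents is what lets the second document detect text copied from the first.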

  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

    org.apache.lucene.util.AttributeSource.State
  • Field Summary

    Fields inherited from class org.apache.lucene.analysis.TokenFilter

    input

    Fields inherited from class org.apache.lucene.analysis.TokenStream

    DEFAULT_TOKEN_ATTRIBUTE_FACTORY
  • Constructor Summary

    Constructors 
    Constructor Description
    DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
    DeDuplicatingTokenFilter(org.apache.lucene.analysis.TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
  • Method Summary

    Modifier and Type Method Description
    protected boolean accept()  

    Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter

    end, incrementToken, reset

    Methods inherited from class org.apache.lucene.analysis.TokenFilter

    close

    Methods inherited from class org.apache.lucene.util.AttributeSource

    addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

  • Method Details

    • accept

      protected boolean accept() throws java.io.IOException
      Specified by:
      accept in class org.apache.lucene.analysis.FilteringTokenFilter
      Throws:
      java.io.IOException