org.apache.lucene.analysis.miscellaneous.DuplicateByteSequenceSpotter

public class DuplicateByteSequenceSpotter extends Object

A Trie structure for analysing byte streams for duplicate sequences. Bytes from a stream are added one at a time using the addByte method and the number of times it has been seen as part of a sequence is returned.

The minimum required length for a duplicate sequence detected is 6 bytes.

The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie.

If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:

(a)-(b)-(c)-(def as an int)

DuplicateByteSequenceSpotter.RootTreeNode objects are used for the first two levels of the tree (representing bytes a and b in the example sequence). The combinations of objects at these 2 levels are few so internally these objects allocate an array of 256 child node objects to quickly address children by indexing directly into the densely packed array using a byte value. The third level in the tree holds DuplicateByteSequenceSpotter.LightweightTreeNode nodes that have few children (typically much less than 256) and so use a dynamically-grown array to hold child nodes as simple int primitives. These ints represent the final 3 bytes of a sequence and also hold a count of the number of times the entire sequence path has been visited (count is a single byte).

The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.

In future we may look at using one of these strategies when memory is tight:

auto-pruning methods to remove less-visited parts of the tree
auto-reset to wipe the whole tree and restart when a memory threshold is reached
halting any growth of the tree

Tests on real-world-text show that the size of the tree is a multiple of the input text where that multiplier varies between 10 and 5 times as the content size increased from 10 to 100 megabytes of content.

Field Summary

Fields

Modifier and Type

Field

Description

static final int

MAX_HIT_COUNT

static final int

TREE_DEPTH
Constructor Summary

Constructors

Constructor

Description

DuplicateByteSequenceSpotter()
Method Summary

Modifier and Type

Method

Description

short

addByte(byte b)

Add a byte to the sequence.

final long

getEstimatedSizeInBytes()

int[]

getNodesAllocatedByDepth()

int

getNodesResizedByDepth()

void

startNewSequence()

Reset the sequence detection logic to avoid any continuation of the immediately previous bytes.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- TREE_DEPTH
  
  public static final int TREE_DEPTH
  See Also:
  
  Constant Field Values
- MAX_HIT_COUNT
  
  public static final int MAX_HIT_COUNT
  See Also:
  
  Constant Field Values
Constructor Details
- DuplicateByteSequenceSpotter
  
  public DuplicateByteSequenceSpotter()
Method Details
- startNewSequence
  
  public void startNewSequence()
  
  Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method.
- addByte
  
  public short addByte(byte b)
  
  Add a byte to the sequence.
  
  Parameters:
  
  b - the next byte in a sequence
  
  Returns:
  
  number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
- getEstimatedSizeInBytes
  
  public final long getEstimatedSizeInBytes()
- getNodesAllocatedByDepth
  
  public int[] getNodesAllocatedByDepth()
  
  Returns:
  
  Performance info - the number of nodes allocated at each depth
- getNodesResizedByDepth
  
  public int getNodesResizedByDepth()
  
  Returns:
  
  Performance info - the number of resizing of children arrays, at each depth

Class DuplicateByteSequenceSpotter

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Details

TREE_DEPTH

MAX_HIT_COUNT

Constructor Details

DuplicateByteSequenceSpotter

Method Details

startNewSequence

addByte

getEstimatedSizeInBytes

getNodesAllocatedByDepth

getNodesResizedByDepth