java.lang.Object
- org.apache.lucene.analysis.miscellaneous.DuplicateByteSequenceSpotter

```
public class DuplicateByteSequenceSpotter
extends java.lang.Object
```
A Trie structure for analysing byte streams for duplicate sequences. Bytes from a stream are added one at a time using the addByte method and the number of times it has been seen as part of a sequence is returned. The minimum required length for a duplicate sequence detected is 6 bytes. The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie. If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:
(a)-(b)-(c)-(def as an int)
DuplicateByteSequenceSpotter.RootTreeNode objects are used for the first two levels of the tree (representing bytes a and b in the example sequence). The combinations of objects at these 2 levels are few so internally these objects allocate an array of 256 child node objects to quickly address children by indexing directly into the densely packed array using a byte value. The third level in the tree holds DuplicateByteSequenceSpotter.LightweightTreeNode nodes that have few children (typically much less than 256) and so use a dynamically-grown array to hold child nodes as simple int primitives. These ints represent the final 3 bytes of a sequence and also hold a count of the number of times the entire sequence path has been visited (count is a single byte).
The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.
In future we may look at using one of these strategies when memory is tight:
1. auto-pruning methods to remove less-visited parts of the tree
2. auto-reset to wipe the whole tree and restart when a memory threshold is reached
3. halting any growth of the tree
Tests on real-world-text show that the size of the tree is a multiple of the input text where that multiplier varies between 10 and 5 times as the content size increased from 10 to 100 megabytes of content.

Field Summary

Fields
Modifier and Type Field Description

static int MAX_HIT_COUNT

static int TREE_DEPTH

Constructor Summary

Constructors
Constructor Description

DuplicateByteSequenceSpotter()

Method Summary

Modifier and Type	Method	Description
`short`	`addByte(byte b)`	Add a byte to the sequence.
`long`	`getEstimatedSizeInBytes()`
`int[]`	`getNodesAllocatedByDepth()`
`int`	`getNodesResizedByDepth()`
`void`	`startNewSequence()`	Reset the sequence detection logic to avoid any continuation of the immediately previous bytes.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - TREE_DEPTH
```
public static final int TREE_DEPTH
```
    See Also:
    
    Constant Field Values
  - MAX_HIT_COUNT
```
public static final int MAX_HIT_COUNT
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - DuplicateByteSequenceSpotter
```
public DuplicateByteSequenceSpotter()
```
- Method Detail
  - startNewSequence
```
public void startNewSequence()
```
    Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method.
  - addByte
```
public short addByte(byte b)
```
    Add a byte to the sequence.
    
    Parameters:
    
    b - the next byte in a sequence
    
    Returns:
    
    number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
  - getEstimatedSizeInBytes
```
public final long getEstimatedSizeInBytes()
```
  - getNodesAllocatedByDepth
```
public int[] getNodesAllocatedByDepth()
```
    Returns:
    
    Performance info - the number of nodes allocated at each depth
  - getNodesResizedByDepth
```
public int getNodesResizedByDepth()
```
    Returns:
    
    Performance info - the number of resizing of children arrays, at each depth

Fields
Modifier and Type	Field	Description
`static int`	`MAX_HIT_COUNT`
`static int`	`TREE_DEPTH`

Class DuplicateByteSequenceSpotter

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Detail

TREE_DEPTH

MAX_HIT_COUNT

Constructor Detail

DuplicateByteSequenceSpotter

Method Detail

startNewSequence

addByte

getEstimatedSizeInBytes

getNodesAllocatedByDepth

getNodesResizedByDepth