public class DuplicateByteSequenceSpotter extends java.lang.ObjectA Trie structure for analysing byte streams for duplicate sequences. Bytes from a stream are added one at a time using the addByte method and the number of times it has been seen as part of a sequence is returned. The minimum required length for a duplicate sequence detected is 6 bytes. The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie. If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:
(a)-(b)-(c)-(def as an int)
DuplicateByteSequenceSpotter.RootTreeNodeobjects are used for the first two levels of the tree (representing bytes a and b in the example sequence). The combinations of objects at these 2 levels are few so internally these objects allocate an array of 256 child node objects to quickly address children by indexing directly into the densely packed array using a byte value. The third level in the tree holds
DuplicateByteSequenceSpotter.LightweightTreeNodenodes that have few children (typically much less than 256) and so use a dynamically-grown array to hold child nodes as simple int primitives. These ints represent the final 3 bytes of a sequence and also hold a count of the number of times the entire sequence path has been visited (count is a single byte).
The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.
In future we may look at using one of these strategies when memory is tight:
- auto-pruning methods to remove less-visited parts of the tree
- auto-reset to wipe the whole tree and restart when a memory threshold is reached
- halting any growth of the tree
Constructors Constructor Description
All Methods Instance Methods Concrete Methods Modifier and Type Method Description
addByte(byte b)Add a byte to the sequence.
startNewSequence()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes.
public void startNewSequence()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method.
public short addByte(byte b)Add a byte to the sequence.
b- the next byte in a sequence
- number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
public final long getEstimatedSizeInBytes()
public int getNodesAllocatedByDepth()
- Performance info - the number of nodes allocated at each depth
public int getNodesResizedByDepth()
- Performance info - the number of resizing of children arrays, at each depth