The minimum required length for a duplicate sequence detected is 6 bytes.
The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie.
If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:
(a)-(b)-(c)-(def as an int)
DuplicateByteSequenceSpotter.RootTreeNode objects are used for the first two levels of the tree
(representing bytes a and b in the example sequence). The combinations of
objects at these 2 levels are few so internally these objects allocate an
array of 256 child node objects to quickly address children by indexing
directly into the densely packed array using a byte value. The third level in
the tree holds
DuplicateByteSequenceSpotter.LightweightTreeNode nodes that have few children
(typically much less than 256) and so use a dynamically-grown array to hold
child nodes as simple int primitives. These ints represent the final 3 bytes
of a sequence and also hold a count of the number of times the entire sequence
path has been visited (count is a single byte).
The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.
In future we may look at using one of these strategies when memory is tight:
- auto-pruning methods to remove less-visited parts of the tree
- auto-reset to wipe the whole tree and restart when a memory threshold is reached
- halting any growth of the tree
Field SummaryModifier and TypeFieldDescription
Method SummaryModifier and TypeMethodDescription
(byte b)Add a byte to the sequence.
()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes.
startNewSequencepublic void startNewSequence()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method.
addBytepublic short addByte(byte b)Add a byte to the sequence.
b- the next byte in a sequence
- number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
getEstimatedSizeInBytespublic final long getEstimatedSizeInBytes()
getNodesAllocatedByDepthpublic int getNodesAllocatedByDepth()
- Performance info - the number of nodes allocated at each depth
getNodesResizedByDepthpublic int getNodesResizedByDepth()
- Performance info - the number of resizing of children arrays, at each depth