StrTree

An Algorithm for Detecting Duplicate/Previously Encountered Strings

This section, including the code, was contributed by Jeremy Cattone.


Download now: STRTREE.ZIP (11.8 k)

Overview

A challenge I've seen pop up often while writing search tools is the ability to verify that your code is not traversing the same source multiple times. A spider traversing the web, for example, would never wish to traverse the same URL twice in a session - to do so would merely allow circular loops and endless headaches.
To address this issue, I wrote an N-tree based search algorithm that breaks strings down by their leading characters. Identical portions are 'clipped' and stored in a single node, with the remainder of the string in similarly divided child nodes.

The advantage to this approach over typical linear searches is two-fold.

The memory overhead can be reduced as redundant text is partially eliminated
Since the tree is ordered, a search can be performed in O(Log(N)) time (someone please correct me if my estimate of this algorithm is incorrect!)

On a 600 MHz PIII system, the following benchmarks were obtained for inserting 3000 random strings of length 0-254 characters into an array that already contains 12,000 strings.

Linear search: 2.7 seconds
StringTree: 0.02 seconds

A reasonable improvement!

 

Download now: STRTREE.ZIP (11.8 k)

Comments

Post new comment

The content of this field is kept private and will not be shown publicly.