Online String Matching ---------------------- The `online string matching' problem is finding all the substrings of a text string that match a given pattern string, assuming the pattern is given first, and is much shorter than the text string, or one pattern is matched against many text strings, so the pattern string may be preprocessed to optimize the match. Assume there is a single text string t of length n with characters t[0], ..., t[n-1], a pattern string p of length m with characters p[0], ..., p[m-1], and you are asked to find all s >= 0 such that t[s,...,s+m-1] is an exact match to p. Here s is called the `shift', and you are asked to find all the `matching shifts', i.e., all s for which p matches t[s,...,s+m-1]. A crude algorithm will check all values of s with simple substring comparison and have at worst case O(mn) char- acter comparisons. The Knuth algorithm improves this by having exactly n character table lookups, but there are algorithms with O(mn) worst case character comparisons or table lookups that outperform the Knuth algorithm in typical real world cases. You are being asked to code and evaluate several of these. All of the algorithms you are being asked to evaluate work by comparing characters from the end of the pattern p toward the beginning of p. When this comparison is completed, either by a character mis-match or by exhaus- tion of p when s is a matching shift, s is incremented by a value that is a function of p and one or two char- acters read from t. This shift increment is taken from a precomputed table S indexed by the characters read. For example, if for particular p and t the first char- acter comparison for each s always fails, c = t[s+m-1] is the character looked up in S, and c is never in p at all, then the shift increment can always be m, and there will be only n/m character comparisons and n/m table lookups, for a total of 2n/m operations. The ways of computing S that you are to study are- 1. Horspool Algorithm. c = t[s+m-1]. j = S[c] is chosen as the smallest j such that 1 <= j < m and p[m-1-j] = c, or j = m. 2. Quick-Search Algorithm. c = t[s+m]. j = S[c] is chosen as the smallest j such that 1 <= j <= m and p[m-j] = c, or j = m+1. 3. Berry-Ravindran Algorithm. c1 = t[s+m-1], c2=t[s+m]. j = S[c1,c2] is the smallest j such that 1 <= j < m and p[m-1-j] = c1 and p[m-j] = c2, or j = m and p[0] = c2, or j = m+1. You are given pairs t and p are are being asked to com- pute the number of character comparisons plus the number of table lookups required for each pair and each algor- ithm to find all the matching shifts. The Knuth algor- ithm whose number of comparisons is always 0 and whose number of table lookups is always n is also to be included in the output. Note that we assume that for each s you begin with a comparison of t[s+m-1] and p[m-1] and continue with comparisons from right to left until you get a mismatch or p runs out of characters. We also assume that initially s = 0 and you stop only when the next compar- ison or table lookup cannot be done because it would require reading a character beyond the end of t. Input ----- For each of several test cases, first a line containing just the test case name, then one or more lines contain- ing the text t, followed by a line containing just `END', and then one or more lines each containing a pattern p, followed by a line containing just `END'. Thus each test case has a single text and one or more patterns. The text t is made by putting a line feed at the end of each text line and concatenating the lines, so the lines are separated by line feeds and the last line ends in a line feed. Each pattern is on a line by itself. The maximum length of every line is 80 characters, and the total maximum length of the text t with line feeds is 1,000,000 characters. Each line contains only graphic ASCII characters (ASCII characters with codes in the range 33 .. 126) or single spaces (ASCII code 32). None of the texts or patterns are empty. Input ends with an end of file. Output ------ For each test case, first a line containing the test case name, and then for each pattern line 6 lines, first an exact copy of the pattern input line, and then the following 5 lines: Pattern Length = #, Text Length = #, Matches = # Knuth: # = # comparisons + # lookups Horspool: # = # comparisons + # lookups Quick-Search: # = # comparisons + # lookups Berry-Ravindran: # = # comparisons + # lookups where # denotes an integer >= 0. The first two #'s are m and n respectively, and the 4'th though 6'th #'s are n, 0, and n respectively. The `Matches' # is the number of matching shifts and can be computed by any of the three algorithms being studied, as all three algorithms agree. Sample Input ------ ----- -- SAMPLE 1 -- abcaaaabcaaaabc abcaaabcaaabc END abc aaa END -- SAMPLE 2 -- Hopping hippers flipping, Flipping floppers sticking, Sticking stoppers clicking, Clicking cloppers hopping. END ing pers ick END Sample Output ------ ------ -- SAMPLE 1 -- abc Pattern Length = 3, Text Length = 30, Matches = 6 Knuth: 30 = 0 comparisons + 30 lookups Horspool: 38 = 25 comparisons + 13 lookups Quick-Search: 32 = 22 comparisons + 10 lookups Berry-Ravindran: 32 = 22 comparisons + 10 lookups aaa Pattern Length = 3, Text Length = 30, Matches = 6 Knuth: 30 = 0 comparisons + 30 lookups Horspool: 51 = 35 comparisons + 16 lookups Quick-Search: 48 = 35 comparisons + 13 lookups Berry-Ravindran: 41 = 30 comparisons + 11 lookups -- SAMPLE 2 -- ing Pattern Length = 3, Text Length = 109, Matches = 8 Knuth: 109 = 0 comparisons + 109 lookups Horspool: 94 = 55 comparisons + 39 lookups Quick-Search: 78 = 47 comparisons + 31 lookups Berry-Ravindran: 78 = 47 comparisons + 31 lookups pers Pattern Length = 4, Text Length = 109, Matches = 4 Knuth: 109 = 0 comparisons + 109 lookups Horspool: 74 = 43 comparisons + 31 lookups Quick-Search: 64 = 39 comparisons + 25 lookups Berry-Ravindran: 61 = 37 comparisons + 24 lookups ick Pattern Length = 3, Text Length = 109, Matches = 4 Knuth: 109 = 0 comparisons + 109 lookups Horspool: 84 = 46 comparisons + 38 lookups Quick-Search: 66 = 37 comparisons + 29 lookups Berry-Ravindran: 66 = 37 comparisons + 29 lookups Reference --------- For a whole host of online string matching algorithms see The Exact Online String Matching Problem: A Review of the Most Recent Results, Simone Faro and Thierry Lecroq, ACM Computing Surveys, vol 45, no 2, 2013. File: onlinestring.txt Author: Bob Walton Date: Mon Oct 14 02:27:24 EDT 2013 The authors have placed this file in the public domain; they make no warranty and accept no liability for this file. RCS Info (may not be true date or author): $Author: walton $ $Date: 2013/10/14 06:44:06 $ $RCSfile: onlinestring.txt,v $ $Revision: 1.11 $