My Rsync -- ----- The UNIX rsync program copies a file F at location L to a remote location L' which is accessible from L only by slow communications. It optimizes the case where an older version F' of the file already exists at L'. F' is divided into disjoint S byte blocks, and the MD5 signatures of these blocks are communicated by L' to L. Then L sends to L' the bytes of F as follows: if the next S bytes to be sent match a block of F', the identi- fier of that block is sent, and otherwise the next byte is sent. Here S bytes match a block in F' if both have the same MD5 signature, which is only 16 bytes, so L' only has to send 16 bytes for every S bytes of F', and this is faster than having L' send all of F' if S is much greater than 16. However, if we do as we have said, then for every byte of F the MD5 signature of the S byte block starting at that byte would have to be computed. This is too expen- sive computationally. So L' reports for every block both its MD5 signature and a 32-bit rolling checksum. L uses the rolling checksum to find blocks in F that might with high probability be identical to blocks in F', and then computes the MD5 signatures of just those blocks, to check if the blocks are indeed identical. What do we mean by a rolling checksum? We are looking at the sequence of S byte blocks of F that begin at all the possible different byte offsets in F. Suppose we have a pointer into F and relative to that pointer the next S + 1 bytes are B(0), B(1), B(2), ..., B(S-1), B(S) An example rolling checksum for the current block is b = (B(0) + B(1) + B(2) + ... + B(S-1)) mod 2**16 The value of this checksum for the next block in the sequence is bnext = (B(1) + B(2) + ... + B(S-1) + B(S)) mod 2**16 = (b + B(S) - B(0)) mod 2**16 That is, bnext can be computed quickly from b and the byte B(0) we are discarding and the byte B(S) we are adding to make the next block from the current block. We call b a `rolling' checksum because bnext can be computed quickly from b. Another example of a rolling checksum is c = (S*B0 + (S-1)*B1 + (S-2)*B2 + ... + 1*B(S-1)) mod 2**16 for which cnext = (S*B1 + (S-1)*B2 + (S-2)*B3 + ... + 1*B(S)) mod 2**16 = (c + bnext - S*B0) mod 2**16 Here we use bnext to help compute cnext. Lastly, we can combine these two rolling checksums into one: d = b + 2**16 * c which is the 32-bit rolling checksum that we will use. Note that in the above a byte is an UNSIGNED 8 bit integer (an `unsigned char' in C/C++ , and as JAVA does not have unsigned integer data, you must convert each byte to an int and then & with 0xFF in JAVA). Input ----- The standard input consists of test cases. Each test case begins with a line containing the name of the test case. The second line of the test case contains a data file name (the name of F), and the third line contains the block size. The lines following this each describe one block of the remote file F', and each of these lines holds an MD5 signature followed by a single space followed by a rolling checksum. The signature is 32 hexadecimal digits (0, ..., 9, A, ..., F), and the rolling checksum is 8 hexadecimal digits. The last line of the test case contains just `.', which signals the end of the test case. The input file name will not contain any white-space characters, and the block size will be a decimal number. No standard input line will be longer than 80 characters. The standard input will be terminated by an end of file after the last test case. You must open each input file F for reading, and NOT for writing. If you open it for reading and writing, your program may fail, and WORSE, it might work when you test it and then fail when the judge tests it because when the judge runs it your program will not be allowed to open files for writing. Output ------ For each test case, first output an exact copy of the first line of the test case, the test case name. Then for each offset in file F of a block whose rolling checksum matches the rolling checksum of some block of F', output the line offset block-number where block-number is the block number of the block of F' whose MD5 sum matches that of the block of F at the given offset, or is -1 if there is none. These lines must be in order of increasing offset. The blocks of F' are numbered 0, 1, 2 ... . Lastly output a line containing just `.' to end the test case output. Notes ----- To compute an MD5 sum of an S byte block: In C: #include <openssl/md5.h> unsigned char signature[16]; unsigned char block[S]; ... read block ... MD5( block, S, signature ); In C++: extern "C" { #include <openssl/md5.h> } unsigned char signature[16]; unsigned char block[S]; ... read block ... MD5( block, S, signature ); In JAVA: import java.security.*; static byte[] MD5 ( byte[] block ) throws NoSuchAlgorithmException { MessageDigest md = MessageDigest.getInstance ( "MD5" ); return md.digest ( block ); } Here the MD5 sum is called a signature and is represent- ed as a 16 byte string, where each byte represents 2 hexadecimal digits, with the first byte representing the highest order digits. If you use gcc or g++ directly (instead of using the Makefile you are provided) you need to use the -lssl library option. On modern computers computation of MD5 sums is so fast compared to input/output CPU time that we were unable to construct sensible test cases where the optimization of using rolling sums to reduce the amount of MD5 sum computation actually made a large difference in CPU time. Sample Input ------ ----- -- SAMPLE 1 -- sample1.dat 256 0665A333D10B4F10495EDCD35E8F2904 94127CB5 7CB5FA5E037EFF272462C92867AFC1B9 BE387EB4 D94B841EB5F4C528ACFFA4D2BD068503 94127CB5 . -- SAMPLE 2 -- sample2.dat 4096 B26EBEE4CB66A9CC513F48293E676CA9 FB23138E 6167E71B305291265C85B37F758DB1BB D217FCA6 87AC778780609BDF81C7E5C144BE48EA 48C909AF . Sample Output ------ ------ -- SAMPLE 1 -- 0 0 256 1 512 2 . -- SAMPLE 2 -- 0 0 8314 2 . File: myrsync.txt Author: Bob Walton <walton@seas.harvard.edu> Date: Sun Oct 18 22:10:45 EDT 2009 The authors have placed this file in the public domain; they make no warranty and accept no liability for this file. RCS Info (may not be true date or author): $Author: hc3-judge $ $Date: 2009/10/19 02:12:22 $ $RCSfile: myrsync.txt,v $ $Revision: 1.14 $