The PAXOS Algorithm
--- ----- ---------

The PAXOS algorithm was developed by Leslie Lamport (and
independently by others) to make a decision rapidly in a
distributed computer system in which any processor can fail.
You have been asked to simulate the algorithm in order to get
a sense of how it works.

A distributed system is a set of processes connected to each
other by communication channels.  For the purposes of this
problem, we assume there are two channels between every pair
of processes, one channel going in each direction.

In order for a distributed system to run as fast as possible,
it should be `asynchronous'.  This means that as soon as a
process receives a message, it computes its new state and
sends appropriate messages to other processes, without waiting
for any particular time.  However, there is a theorem that any
asynchronous distributed algorithm can be stopped dead if just
one process fails at exactly the wrong moment; thus an
asynchronous distributed algorithm can fail because just one
of its processes fails.

The PAXOS algorithm can be thought of as an asynchronous
decision-making algorithm that can be run more than once to
make the same decision.  That is, there can be more than one
instance of the algorithm.  If one instance appears to be
failing, another instance can be started.  The algorithm has
the critically important property that if several of the
instances succeed in coming to a decision, they will all come
to the SAME decision.

The algorithm description is as follows:

(1) For our purposes, the decision is to be made between two
    values, labeled `B' and `C'.  In the real world the
    decision is most often between `aborting' and `committing'
    a database transaction, so you can think of `B' as meaning
    `abort' and `C' as meaning `commit'.

(2) Each algorithm instance has a single master process.  This
    process will make the decision and send it to all the
    other processes.  The master of the first instance is
    generally chosen by the nature of the decision being made,
    and the masters of later instances are simply processes
    that, using clocks, have decided that previous algorithm
    instances are unduly delayed and a new algorithm instance
    needs to be started.  All the processes other than the
    master for an instance are called the slaves of that
    instance.  A process can be the master of some algorithm
    instances and a slave in other instances.

(3) Each instance is assigned an identifier.  No two instances
    may have the same identifier, and the identifiers are
    ordered so that instances started later are later in the
    ordering.  In the real world identifiers are typically
    numbers whose high order bits are the time of day and
    whose low order bits are a unique process identifier.

(4) Processes communicate by messages sent over channels.  We
    assume there are n processes numbered from 1 through n,
    and there is exactly one directed channel connecting
    message sending process j to message receiving process k,
    for every pair of processes j and k with j != k.  We
    assume each channel delivers messages reliably and in the
    order they were sent, BUT with arbitrary delay.  Such a
    channel can be built on top of unreliable communications
    by an appropriate channel protocol, which we do not
    consider here (in the real world the internet TCP protocol
    would likely be used).  We assume each channel can hold at
    most one message that has been sent but not yet received.
    We assume that if a second message is sent to the channel
    when the channel is not empty, any previous message in the
    channel is discarded and lost.  This behavior just
    simplifies the code of our simulation; a real-world
    channel probably would not discard messages very often.
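For the simulation, such a channel reduces to a single message
slot.  A minimal Python sketch of this channel model (the
Channel class and its method names are illustrative, not part
of the problem statement):

    class Channel:
        """One-slot directed channel from one process to another.

        Sending overwrites any message still waiting in the
        channel, as assumed in (4) above; receiving empties it."""

        def __init__(self):
            self.pending = None   # the single undelivered message, or None

        def send(self, message):
            # A second send discards whatever was still undelivered.
            self.pending = message

        def receive(self):
            # Return the waiting message, or None if the channel is
            # empty, and leave the channel empty.
            message, self.pending = self.pending, None
            return message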
(5) The algorithm messages have the following formats, where
    lower case letters are variables and upper case letters
    are constants:

    m s N i
        New-instance (N) message, sent from instance master
        process m to instance slave process s, announcing that
        a new instance has been created with identifier i.

    s m A i pd pi
        New-instance acknowledgment (A) message, sent from
        instance slave s to instance master m, for the instance
        with identifier i.  pd and pi are the values of
        variables maintained by the slave as part of its state
        (see below).

    m s P i d
        Proposal (P) message, sent from instance master process
        m to instance slave process s, for the instance with
        identifier i, proposing that the instance decision be d
        (d == B or C).

    s m Q i
        Proposal acknowledgment (Q) message, sent merely to
        acknowledge the above proposal (P) message.

    m s F i
        Final (F) message, sent from instance master process m
        to instance slave process s, for the instance with
        identifier i, stating that the proposed decision for
        instance i has become the final decision of the set of
        all instances, and no subsequent instance will ever
        propose a different decision.

(6) Each process maintains the following variables:

    imax
        The latest (maximum) instance identifier for which the
        process has either received or sent a new-instance (N),
        proposal (P), or final (F) message.  Initialized to -1.
        All instance identifiers are integers >= 0.

    pd
        The decision value in the last proposal (P) message the
        process has either sent or has received and not
        ignored.  Initialized to X (meaning no proposal has
        been received).

    pi
        The instance identifier in the last proposal (P)
        message the process has either sent or has received and
        not ignored.  Initialized to -1.

    Note that a master maintains these variables as if it were
    also a slave that receives and does not ignore all the
    messages the master sends.  Here -1 is treated as an
    instance identifier value that is less than any actual
    instance identifier.

(7) Any received message is ignored (discarded) if its
    instance identifier i is LESS THAN the receiver's current
    imax variable value.

(8) There are n processes.  Let m be the smallest integer such
    that 2m > n.  Any set of processes with m members is called
    a `majority'.  PAXOS uses the fact that any two majorities
    must overlap.
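Items (6) through (8) translate almost directly into code.  A
Python sketch (the Process class, its field names, and the
helper names are illustrative, not dictated by the problem):

    def majority_size(n):
        # Smallest m such that 2*m > n, as in (8); for example,
        # n = 3 gives m = 2 and n = 4 gives m = 3.
        return n // 2 + 1

    class Process:
        def __init__(self, number):
            self.number = number
            self.imax = -1   # latest instance id in any N, P, or F
                             # message sent or received (see (6))
            self.pd = 'X'    # decision in the last P message sent
                             # or received and not ignored
            self.pi = -1     # instance id of that last P message

        def ignores(self, i):
            # Rule (7): discard a received message whose instance
            # identifier is less than the current imax.
            return i < self.imax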
(9) The instance algorithm is:

    (a) The master picks an instance identifier i based on the
        current time.  This must be greater than the imax
        variable value of the master, and should be greater
        than the identifier of any previous instance.

    (b) The master sends new-instance (N) messages to all
        slaves.  Upon receiving and not ignoring the N message,
        a slave sends an acknowledgment (A) message back to the
        master containing the slave's pd and pi variable
        values.

    (c) As soon as the master receives m-1 acknowledgment (A)
        messages, it makes a proposed decision.  It knows at
        this point the pd and pi values of m processes, namely
        the m-1 acknowledging slaves and the master itself.  If
        all pd values are X, meaning `not yet set', the master
        is free to make any decision it wants (this will be the
        case for the first instance).  Otherwise the master
        chooses the pd value whose associated pi value is
        greatest.

    (d) The master sends proposal (P) messages to all slaves
        containing the new proposed decision.  Any slave
        receiving and not ignoring this message returns a
        proposal acknowledgment (Q) message to the master.
        During this activity the pd and pi variable values of
        the master and of all acknowledging slaves are updated.

    (e) As soon as the master receives m-1 proposal
        acknowledgment (Q) messages, it sends a final (F)
        message to all slaves.  Note that in our simulation
        slaves ignore final (F) messages, though in the real
        world they would not.


Event Specification
-------------------

The simulation you have been asked to perform represents
algorithm execution as a sequence of events.  The events are
numbered 1, 2, 3, etc., and the simulation input contains a
list of event specification lines that are each one of the
following:

N m d
    Process m becomes the master of a new instance whose
    instance identifier is the number of the current event and
    whose decision will be d (B or C) if the master is free to
    choose in step (c) above.  The event consists of the new
    master updating its variables and sending new-instance (N)
    messages to every other process.

R j k
    Process k receives the message in the channel whose
    sending process is j and whose receiving process is k, if
    there is a message in that channel.  If there is no
    message, there is no event.  If there is a message, the
    event consists of the receiving process updating its
    variables and sending messages to other processes as
    specified by the algorithm steps above.

Events in general send messages (except for events where a
received message is ignored).  Sending a message consists of
placing the message in the appropriate channel.  If necessary
any previous message in the channel is first discarded.


Input
-----

The input consists of several test cases.  Each test case
consists of

    a line containing the name of the test case
    a line containing the number n of processes
    zero or more event specification lines
    a line containing only the character `E'

Here 2 <= n <= 32.  The input terminates with an end of file.


Output
------

For each test case the line containing the name of the test
case is output.  Then for each event the output is as follows:

(a) If the event is the creation of a new instance, a line of
    the form `e: NEW INSTANCE m d' is output, where e is the
    event number, and also the identifier of the new instance,
    and m and d are taken from the `N m d' event
    specification.

(b) If the event is the reception of a message, a line of the
    form `e: message action' is output, where e is the event
    number (1, 2, 3, etc.), the `message' is the received
    message in the message format given above with single
    spaces surrounding it and separating its parts, and the
    `action' is one of the following words:

    IGNORED
        The message is ignored as per (7) above.

    COMMITTING
        The message is the m-1'st proposal (P) message to be
        received and not ignored by a slave for a given
        instance.

    ACCEPTED
        All other cases.

Note that if for an event specification `R j k' the channel is
empty, there is no event, there is no output, and the current
event number is not incremented.

In general, items in an output line are to be separated by a
single space character, except there is no space before a `:'.
There should be no other whitespace characters in any output
line.  You may assume the test case name input line is
properly formatted, and you should just copy it to the output.

At the end of each test case print one empty line.  This will
appear before the name of the next test case, or before the
end of the output file.
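One way the top-level bookkeeping for a test case might be
organized is sketched below.  It assumes the Channel and
Process classes sketched earlier, plus two hypothetical
helpers, start_instance and deliver, that carry out the
algorithm steps of (9); those helpers are not written out
here.  Messages are assumed to be stored as the
space-separated strings of (5), so a received message can be
echoed into the output line directly:

    import sys

    def run_case(name, n, specs, start_instance, deliver, out=sys.stdout):
        # specs is the list of parsed event specifications for one
        # test case, e.g. ('N', 1, 'C') or ('R', 1, 2).
        channels = {(j, k): Channel()
                    for j in range(1, n + 1)
                    for k in range(1, n + 1) if j != k}
        processes = {p: Process(p) for p in range(1, n + 1)}

        print(name, file=out)
        event = 0
        for spec in specs:
            if spec[0] == 'N':
                event += 1        # the instance identifier is this event number
                m, d = spec[1], spec[2]
                start_instance(processes, m, event, d, channels)
                print(f'{event}: NEW INSTANCE {m} {d}', file=out)
            else:
                j, k = spec[1], spec[2]
                message = channels[(j, k)].receive()
                if message is None:
                    continue      # empty channel: no event, no output
                event += 1
                # deliver applies rules (7) and (9) for process k and
                # returns IGNORED, COMMITTING, or ACCEPTED.
                action = deliver(processes, k, message, channels)
                print(f'{event}: {message} {action}', file=out)
        print(file=out)           # each test case ends with a blank line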
Example Input
------- -----

ERRORLESS CASE
3
N 1 C
R 1 2
R 1 2
R 1 3
R 2 1
R 3 1
R 1 2
R 1 3
R 2 1
R 3 1
R 1 2
R 1 3
E
ONE FAILURE CASE
3
N 1 C
R 1 2
R 1 3
N 2 B
R 2 1
R 3 1
R 2 3
R 1 2
R 3 2
R 2 1
R 2 3
R 1 2
R 3 2
R 2 1
R 2 3
E


Example Output
------- ------

ERRORLESS CASE
1: NEW INSTANCE 1 C
2: 1 2 N 1 ACCEPTED
3: 1 3 N 1 ACCEPTED
4: 2 1 A 1 X -1 ACCEPTED
5: 3 1 A 1 X -1 ACCEPTED
6: 1 2 P 1 C COMMITTING
7: 1 3 P 1 C ACCEPTED
8: 2 1 Q 1 ACCEPTED
9: 3 1 Q 1 ACCEPTED
10: 1 2 F 1 ACCEPTED
11: 1 3 F 1 ACCEPTED

ONE FAILURE CASE
1: NEW INSTANCE 1 C
2: 1 2 N 1 ACCEPTED
3: 1 3 N 1 ACCEPTED
4: NEW INSTANCE 2 B
5: 2 1 N 4 ACCEPTED
6: 3 1 A 1 X -1 IGNORED
7: 2 3 N 4 ACCEPTED
8: 1 2 A 4 X -1 ACCEPTED
9: 3 2 A 4 X -1 ACCEPTED
10: 2 1 P 4 B COMMITTING
11: 2 3 P 4 B ACCEPTED
12: 1 2 Q 4 ACCEPTED
13: 3 2 Q 4 ACCEPTED
14: 2 1 F 4 ACCEPTED
15: 2 3 F 4 ACCEPTED

[The last output line is blank.]


Remarks
-------

A non-distributed reliable system can be made from a computer
and a disk.  The computer runs instances of an algorithm which
makes decisions.  To make a decision, an instance first
proposes it, then writes it to disk, then declares the
decision final.  If the computer crashes, a new instance of
the algorithm is started, which begins by reading the disk to
find out all the previously proposed decisions.  Since it does
not know the extent to which these have been acted on, it must
assume each of these proposed decisions is final.  The disk is
referred to as `stable storage', because it survives crashes
and provides reliability.

The PAXOS algorithm uses the set of processes to implement
stable storage (without any disks).  The proposal (P) messages
write the proposed decision to stable storage (the set of
processes), and the proposal acknowledgment (Q) messages
confirm that the decision has been written.  In the
non-distributed case the proposed decision has been
successfully written to stable storage when its last bit has
been successfully copied to disk.  In the PAXOS case the
proposed decision has been successfully written to stable
storage when the m-1'st slave which will not ignore the
proposal's P message has received that P message.  At this
point any subsequent instance will read the proposed decision.
So at this point the proposed decision is `committed'.  The
master knows it has successfully written stable storage when
it receives the m-1'st Q message.

The new-instance (N) messages and their acknowledgment (A)
messages correspond to reading stable storage.  The master
knows it has successfully read stable storage when it receives
the m-1'st A message, and any proposed decision read is the
one associated with the most recent instance known to any of
the slaves that sent the A messages.  Because two majority
sets of processes must overlap, any committed proposed
decision becomes known to the master.  Here we are using the
fact that if a committed proposed decision is overwritten by a
new proposed decision, the new proposed decision must be the
SAME as the old proposed decision, by step (c) of the
algorithm.  One can formalize all this in an inductive
mathematical proof that using PAXOS, once a proposed decision
is committed by an instance no future instance can propose any
other decision.

The analogy we have drawn between the non-distributed and
distributed cases is not precise.  For example, in the
distributed case several instances can run simultaneously (in
DIFFERENT processes), and a proposed decision that is never
committed can end up being read as the latest proposed
decision by a subsequent algorithm instance.
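The inductive argument above ultimately rests on the choice
rule of step (9)(c), which is small enough to write down
directly.  A Python sketch (the function name and calling
convention are illustrative, not required by the problem):

    def choose_proposal(free_choice, known):
        # known holds the (pd, pi) pairs the master has learned:
        # its own pair plus those carried by the m-1 acknowledgment
        # (A) messages.  free_choice is the decision the master may
        # pick when every pd is 'X'.
        best_pd, best_pi = 'X', -1
        for pd, pi in known:
            if pd != 'X' and pi > best_pi:
                best_pd, best_pi = pd, pi
        return free_choice if best_pd == 'X' else best_pd

    # A master that would like to propose B, but learns that some
    # process already accepted proposal C in instance 4, must
    # propose C again:
    assert choose_proposal('B', [('X', -1), ('C', 4)]) == 'C'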
File:     paxos.txt
Author:   Bob Walton
Date:     Wed Oct 15 09:25:31 EDT 2008

The authors have placed this file in the public domain; they
make no warranty and accept no liability for this file.

RCS Info (may not be true date or author):

    $Author: walton $
    $Date: 2008/10/15 13:26:34 $
    $RCSfile: paxos.txt,v $
    $Revision: 1.10 $