Repair Replication

Background

MySQL Replication is asynchronous and one-way. (Even master-master replication is 2 one-way relationships, not a single two-way relationship.)

The slave server copies events from the master's binary log into its own relay log, then applies events from its relay log to its own live database. There is no feedback mechanism from the slave to the master if applying events to the slave's database causes an unrecoverable error; the slave stops replicating and waits for help from an administrator.

This procedure will solve a data corruption problem that halts a master-master replication.

Specifically, we'll handle a primary key collision. The table important.stuff has a typical AUTO_INCREMENT primary key. Both servers in this master-master replication accept INSERTs to the same table. Because MySQL replication is asynchronous, it is possible for both masters to write a new row that uses the same primary key. When these new rows replicate, the receiving master will have a duplicate key on a column that must be unique, and replication will stop in both directions.

Steps

Test Replication from A to B

At the beginning of this lesson, replication should be working correctly.

Log in to a.example.com

From A, insert a new record in the important.stuff table.

a ~ $ mysql -u root
mysql> INSERT INTO important.stuff SET details='Gift from A to B';
Query OK, 1 row affected (0.03 sec)

Log in to b.example.com

Now on B, check that the record has replicated.

b ~ $ mysql -u root
mysql> SELECT * FROM important.stuff ORDER BY id DESC LIMIT 1\G
*************************** 1. row ***************************
      id: 7
 details: Gift from A to B
happened: 2013-03-27 04:09:40
1 rows in set (0.00 sec)
After this step is completed:
  1. INSERT a new record on A in the table important.stuff
  2. Within 1 second, that record is replicated to B
Test Replication from the B to A

Now test that changes to B also replicate to A.

Still on B, insert a new record in the important.stuff table.

mysql> INSERT INTO important.stuff SET details='Gift from B to A';
Query OK, 1 row affected (0.03 sec)

mysql>

Now on A, check that the record has replicated.

mysql> SELECT * FROM important.stuff ORDER BY id DESC LIMIT 1\G
*************************** 1. row ***************************
      id: 9
 details: Gift from B to A
happened: 2013-03-27 04:09:40
1 rows in set (0.00 sec)
After this step is completed:
  1. INSERT a new record on B in the table important.stuff
  2. Within 1 second, that record is replicated to A
Pause Replication

In this step, we're going to force a collision by slowing down replication.

MySQL replication uses two threads on the slave machine. The I/O Thread copies binary log data from the master to the relay log on the slave. Then the SQL Thread applies changes from the relay log to the slave database.

Stopping the slave I/O thread is no different than having a short network outage between the two servers. Both servers are healthy, but they're unable to tell each other about changes. Because MySQL replication is asynchronous, neither server stops accepting client traffic.

On both servers, stop the I/O thread.

mysql> STOP SLAVE IO_THREAD;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: a.example.com
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 3415
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 803
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3415
              Relay_Log_Space: 959
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 20
1 row in set (0.00 sec)

mysql>
        
After this step is completed:
  1. SHOW SLAVE STATUS on A shows Slave_IO_Running is No
  2. SHOW SLAVE STATUS on B shows Slave_IO_Running is No
Causing a Primary Key Collision

Insert a new row on A

mysql> INSERT INTO important.stuff SET details='Unique data from A';
Query OK, 1 row affected (0.03 sec)

Insert something complicated on B

mysql> INSERT INTO important.stuff SET details='Unique data from B';
Query OK, 1 row affected (0.03 sec)

Restart the replication I/O Thread on both servers.

mysql> START SLAVE;
Query OK, 0 rows affected (0.00 sec)

Now check replication status on both servers. The INSERTs have caused replication to halt.

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: b.example.com
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 4127
               Relay_Log_File: mysql-relay-bin.000003
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '11' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT INTO important.stuff SET details='Unique data from B''
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3884
              Relay_Log_Space: 1585
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '11' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT INTO important.stuff SET details='Unique data from B''
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 20
1 row in set (0.00 sec)

mysql>
After this step is completed:
  1. SHOW SLAVE STATUS on A shows Slave_IO_Running is Yes
  2. SHOW SLAVE STATUS on A shows Slave_SQL_Running is No
  3. SHOW SLAVE STATUS on A shows Last_Errno is 1062
  4. SHOW SLAVE STATUS on B shows Slave_IO_Running is Yes
  5. SHOW SLAVE STATUS on B shows Slave_SQL_Running is No
  6. SHOW SLAVE STATUS on B shows Last_Errno is 1062
Update Auto Increment for Master-Master Replication

MySQL has two settings that can help master-master pairs avoid duplicate auto-increment values.

auto_increment_increment tells MySQL how much to increase the next AUTO_INCREMENT value. The default increment is 1: the next value is the last value +1. Changing the increment to 2 will make MySQL use every-other value (e.g., 1,3,5,7 instead of 1,2,3,4).

auto_increment_offset tells MySQL where to start numbering AUTO_INCREMENT values. The default offset is 1.

The formula for what AUTO_INCREMENT value to assign the Nth row is auto_increment_offset + (N * auto_increment_increment)

We're going to use these settings so that A uses odd numbers (offset 1, increment 2) and B uses even numbers (offset 2, increment 2). This will prevent future auto_increment key collisions.

On server A, edit /etc/my.cnf

mysql> exit
a ~ $ sudoedit /etc/my.cnf

Add these two lines to directly under [mysqld]

/etc/my.cnf
[mysqld] auto_increment_increment = 2 auto_increment_offset = 1 relay_log = mysql-relay-bin log_slave_updates = 1 log_bin = mysql-bin server_id = 10 ......some content not shown

Now restart MySQL:

a ~ $ sudo service mysqld restart
Stopping mysqld:                                           [  OK  ]
Starting mysqld:                                           [  OK  ]

On server B, edit /etc/my.cnf

mysql> exit
b ~ $ sudoedit /etc/my.cnf

Add these two lines to directly under [mysqld]

/etc/my.cnf
[mysqld] auto_increment_increment = 2 auto_increment_offset = 2 relay_log = mysql-relay-bin log_slave_updates = 1 log_bin = mysql-bin server_id = 20 ......some content not shown

Now restart MySQL:

b ~ $ sudo service mysqld restart
Stopping mysqld:                                           [  OK  ]
Starting mysqld:                                           [  OK  ]
After this step is completed:
  1. On A /etc/my.cnf sets auto_increment_increment to 2
  2. On A /etc/my.cnf sets auto_increment_offset to 1
  3. On A the MySQL variable auto_increment_increment has value 2
  4. On A the MySQL variable auto_increment_offset has value 1
  5. On B /etc/my.cnf sets auto_increment_increment to 2
  6. On B /etc/my.cnf sets auto_increment_offset to 2
  7. On B the MySQL variable auto_increment_increment has value 2
  8. On B the MySQL variable auto_increment_offset has value 1
Clean Up Conflicts on B

The new AUTO_INCREMENT settings will help prevent this problem with future INSERTs, but it won't fix the rows that are blocking replication right now.

Look at the Last_Error in the slave status to see the id of the row causing the problem:

b ~ $ mysql -u root
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: a.example.com
......some content not shown
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '11' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT INTO important.stuff SET details='Unique data from B''
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3882
......some content not shown

Note the value of the duplicate entry. (It's 11 in this example.) Look up that record:

mysql> select * from important.stuff where id=11;
+----+--------------------+---------------------+
| id | details            | happened            |
+----+--------------------+---------------------+
| 11 | Unique data from B | 2013-04-06 19:33:44 |
+----+--------------------+---------------------+
1 row in set (0.00 sec)

mysql>

Now delete the row with that ID, only from B. It's important that the act of deleting the row never replicate to A, or it will delete the unique data with the same id. By turning off binary logging, this change will apply to the local database, but won't be seen by the I/O thread on A.

mysql> SET sql_log_bin="OFF";
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like "sql_log_bin";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sql_log_bin   | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> delete from important.stuff where id=11;
Query OK, 1 row affected (0.03 sec)

Now turn binary logging back on, then re-INSERT the row. We want this version of the INSERT to replicate to A (and of course everything after this.) Providing a NULL value for the id will let us make sure the new auto-increment settings work.

mysql> SET sql_log_bin="ON";
Query OK, 0 rows affected (0.00 sec)
mysql> insert into important.stuff VALUES(NULL, 'Unique data from B', '2013-04-06 19:33:44');
Query OK, 1 row affected (0.02 sec)

Select the last row from the table to see what id was assigned to the new row.

mysql> select * from important.stuff order by id DESC LIMIT 1;
+----+--------------------+---------------------+
| id | details            | happened            |
+----+--------------------+---------------------+
| 12 | Unique data from B | 2013-04-06 19:33:44 |
+----+--------------------+---------------------+
1 row in set (0.00 sec)

mysql>

Note the new id. It should be higher than the old value and, because it was inserted on B, it should be even.

Start replication on B.

mysql> slave start;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: a.example.com
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 107
               Relay_Log_File: mysql-relay-bin.000006
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 555
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
1 row in set (0.00 sec)

The formerly conflicting row from A now happily coexists with the new row on B.

mysql> select * from important.stuff where id >= 11;
+----+--------------------+---------------------+
| id | details            | happened            |
+----+--------------------+---------------------+
| 11 | Unique data from A | 2013-04-06 19:33:44 |
| 12 | Unique data from B | 2013-04-06 19:33:44 |
+----+--------------------+---------------------+
2 rows in set (0.00 sec)

mysql>
After this step is completed:
  1. SHOW SLAVE STATUS on B shows Slave_IO_Running is Yes
  2. SHOW SLAVE STATUS on B shows Slave_SQL_Running is Yes
  3. SHOW SLAVE STATUS on B shows Last_Errno is 0
Clean Up Conflicts on A

on A, we still need to skip the insert from B with the bad index.

a ~ $ mysql -u root
mysql> SET GLOBAL sql_slave_skip_counter = 1;
Query OK, 0 rows affected (0.00 sec)

mysql>

Now start replication on A.

mysql> slave start;
Query OK, 0 rows affected (0.00 sec)

mysql>

You should see the row from B with the new id.

mysql> select * from important.stuff where id >= 11;
+----+--------------------+---------------------+
| id | details            | happened            |
+----+--------------------+---------------------+
| 11 | Unique data from A | 2013-04-06 19:33:44 |
| 12 | Unique data from B | 2013-04-06 19:33:44 |
+----+--------------------+---------------------+
2 rows in set (0.00 sec)

mysql>
After this step is completed:
  1. SHOW SLAVE STATUS on A shows Slave_IO_Running is Yes
  2. SHOW SLAVE STATUS on A shows Slave_SQL_Running is Yes
  3. SHOW SLAVE STATUS on A shows Last_Errno is 0
Test that Normal Replication has Resumed

Re-test normal replication, following the instructions from step 1 and step 2.

After this step is completed:
  1. INSERT a new record on A in the table important.stuff
  2. Within 1 second, that record is replicated to B
  3. INSERT a new record on B in the table important.stuff
  4. Within 1 second, that record is replicated to A
Test that Delayed Replication Can't Cause Duplicate Keys

Re-test paused replication, following the instructions from step 3 and step 4.

This time, the changes we made in step 5 will prevent a duplicate key, and replication will not halt.

After this step is completed:
  1. On A, stop the replication I/O Thread.
  2. On B, stop the replication I/O Thread.
  3. INSERT a new record on A in the table important.stuff
  4. INSERT a new record on B in the table important.stuff
  5. On A, start the replication I/O Thread.
  6. On B, start the replication I/O Thread.
  7. New record replicates from A to B
  8. New record replicates from B to A
  9. On A, SHOW SLAVE STATUS shows Last_Errno is 0
  10. On B, SHOW SLAVE STATUS shows Last_Errno is 0