June 27, 2012

INSERT, Don’t DELETE

I’ve been working on a data archival project over the last couple of weeks and thought it would be interesting to discuss something a bit counter-intuitive. Absolutes are never true, but when getting rid of a large portion of a table, it’s usually more efficient to insert the data being kept into a new table than to delete the old data from the existing table.

Here is our example table from the IMDB database.

mysql> show create table title\G
*************************** 1. row ***************************
       Table: title
Create Table: CREATE TABLE `title` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` text NOT NULL,
  `imdb_index` varchar(12) DEFAULT NULL,
  `kind_id` int(11) NOT NULL,
  `production_year` int(11) DEFAULT NULL,
  `imdb_id` int(11) DEFAULT NULL,
  `phonetic_code` varchar(5) DEFAULT NULL,
  `episode_of_id` int(11) DEFAULT NULL,
  `season_nr` int(11) DEFAULT NULL,
  `episode_nr` int(11) DEFAULT NULL,
  `series_years` varchar(49) DEFAULT NULL,
  `md5sum` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_title` (`title`(10)),
  KEY `idx_pcode` (`phonetic_code`),
  KEY `idx_epof` (`episode_of_id`),
  KEY `idx_md5` (`md5sum`),
  KEY `title_kind_id_exists` (`kind_id`),
  CONSTRAINT `title_episode_of_id_exists` FOREIGN KEY (`episode_of_id`) REFERENCES `title` (`id`),
  CONSTRAINT `title_kind_id_exists` FOREIGN KEY (`kind_id`) REFERENCES `kind_type` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2172322 DEFAULT CHARSET=latin1

It has about 2.2 million rows:

mysql> SELECT COUNT(*) FROM title;
+----------+
| count(*) |
+----------+
|  2172320 |
+----------+
1 row in set (1.35 sec)

To make this a fair test, let’s say that we want to get rid of half the records. In this contrived example, ‘g’ is the median starting letter. So, we’re going to get rid of everything that matches /^[0-9a-gA-G]/. The title column is indexed, so building the result set should be reasonably efficient.

mysql> SELECT COUNT(*) FROM title WHERE title <= 'g';
+----------+
| count(*) |
+----------+
|  1085426 |
+----------+
1 row in set (2.48 sec)

mysql> SELECT COUNT(*) FROM title WHERE title > 'g';
+----------+
| count(*) |
+----------+
|  1086894 |
+----------+
1 row in set (2.60 sec)

We’ll do it my way first and insert the records to keep into a new table:

mysql> CREATE TABLE new_title LIKE title;
Query OK, 0 rows affected (0.15 sec)

mysql> INSERT INTO new_title SELECT * FROM title WHERE title > 'g';
Query OK, 1086894 rows affected (1 min 5.18 sec)
Records: 1086894  Duplicates: 0  Warnings: 0

mysql> RENAME TABLE title TO old_title, new_title TO title;
Query OK, 0 rows affected (0.08 sec)

mysql> DROP TABLE old_title;
Query OK, 0 rows affected (0.26 sec)

The total time is about 1 min 5 secs. There are a few caveats:

  1. You must be able to stop all writes to the table. This can be done by operating on a slave with the slave threads stopped (see the sketch after this list).
  2. The DROP TABLE statement can take a long time on ext filesystems, though it is nearly instant on xfs.
  3. DROP TABLE can cause stalls on a very active server due to InnoDB buffer pool purging and filesystem activity. Percona Server has a global variable that helps with this, innodb_lazy_drop_table, and MySQL 5.6 will implement a similar solution. Ovais from Percona compared the performance of these options on their blog.
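
For the first caveat, here’s a rough sketch of what the whole swap might look like when run on a slave with replication paused. The statements mirror the example above, but treat it as an outline rather than a tested procedure:

-- pause the replication SQL thread so no writes are applied to the table
mysql> STOP SLAVE SQL_THREAD;

mysql> CREATE TABLE new_title LIKE title;
mysql> INSERT INTO new_title SELECT * FROM title WHERE title > 'g';
mysql> RENAME TABLE title TO old_title, new_title TO title;
mysql> DROP TABLE old_title;

-- resume replication once the swap is done
mysql> START SLAVE SQL_THREAD;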

Now let’s look at the performance of deleting the unneeded records:

mysql> DELETE FROM title WHERE title <= 'g';
Query OK, 1085426 rows affected (1 min 26.27 sec)

1 min 26 secs is a little slower, but not too bad. The thing is, you’re probably deleting these records to save some disk space, and after the DELETE the table on disk is still the same size: InnoDB marks the freed pages as reusable within the tablespace, but it doesn’t give the space back to the filesystem.

anders:imdb root# ls -lh title.* 
-rw-rw----  1 _mysql  wheel   8.8K Jun 26 18:09 title.frm
-rw-rw----  1 _mysql  wheel   572M Jun 26 18:23 title.ibd
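
You can also see this from inside MySQL. With innodb_file_per_table, the Data_free column of SHOW TABLE STATUS reports the space InnoDB considers free inside the table’s tablespace (the numbers are estimates, so don’t expect them to line up exactly with the file size):

mysql> SHOW TABLE STATUS LIKE 'title'\G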

To reclaim the space, we need to run OPTIMIZE TABLE:

mysql> OPTIMIZE TABLE title;
+-------------+----------+----------+-------------------------------------------------------------------+
| Table       | Op       | Msg_type | Msg_text                                                          |
+-------------+----------+----------+-------------------------------------------------------------------+
| title       | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| title       | optimize | status   | OK                                                                |
+-------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (1 min 23.72 sec)

anders:test root# ls -lh title.*
-rw-rw----  1 _mysql  wheel   8.8K Jun 26 18:19 title.frm
-rw-rw----  1 _mysql  wheel   304M Jun 26 18:21 title.ibd

The space is reclaimed, but it took a total of 2 min 50 secs to get to the same point that the INSERT approach reached in 1 min 5 secs, making INSERT nearly 3x faster. This is only relevant with innodb_file_per_table set or when using MyISAM. Without innodb_file_per_table, there is no way to reclaim the space without rebuilding the entire database from a mysqldump.
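
If you’re not sure how a server is configured, checking the setting is quick. Keep in mind that turning innodb_file_per_table on only affects tables created or rebuilt afterwards; existing tables stay in the shared tablespace until they are rebuilt:

mysql> SHOW VARIABLES LIKE 'innodb_file_per_table';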

I picked 1/2 of the table to make this a fair example. As the number of records being removed gets smaller, deleting becomes the more efficient way to get rid of the rows. DELETE can also be done online, whereas the INSERT technique requires some downtime.
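
If you do take the DELETE route on a live server, the usual way to keep it online-friendly is to delete in batches rather than in one huge statement. A minimal sketch, with an arbitrary batch size of 10,000:

-- delete a small chunk at a time so each statement holds locks and
-- generates undo for only a few thousand rows
mysql> DELETE FROM title WHERE title <= 'g' LIMIT 10000;

-- repeat, pausing briefly between batches so replication and the buffer
-- pool can keep up, until the statement reports 0 rows affected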