I’ve been working on a data archival project over the last couple weeks and thought it would be interesting to discuss something a bit counter-intuitive. Absolutes are never true, but when getting rid of data, it’s usually more efficient to insert the data being kept into a new table rather than deleting the old data from the existing table.
Here is our example table from the IMDB database.
mysql> show create table title\G *************************** 1. row *************************** Table: title Create Table: CREATE TABLE `title` ( `id` int(11) NOT NULL AUTO_INCREMENT, `title` text NOT NULL, `imdb_index` varchar(12) DEFAULT NULL, `kind_id` int(11) NOT NULL, `production_year` int(11) DEFAULT NULL, `imdb_id` int(11) DEFAULT NULL, `phonetic_code` varchar(5) DEFAULT NULL, `episode_of_id` int(11) DEFAULT NULL, `season_nr` int(11) DEFAULT NULL, `episode_nr` int(11) DEFAULT NULL, `series_years` varchar(49) DEFAULT NULL, `md5sum` varchar(32) DEFAULT NULL, PRIMARY KEY (`id`), KEY `idx_title` (`title`(10)), KEY `idx_pcode` (`phonetic_code`), KEY `idx_epof` (`episode_of_id`), KEY `idx_md5` (`md5sum`), KEY `title_kind_id_exists` (`kind_id`), CONSTRAINT `title_episode_of_id_exists` FOREIGN KEY (`episode_of_id`) REFERENCES `title` (`id`), CONSTRAINT `title_kind_id_exists` FOREIGN KEY (`kind_id`) REFERENCES `kind_type` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=2172322 DEFAULT CHARSET=latin1
It has about 2.2 million rows
mysql> SELECT COUNT(*) FROM title; +----------+ | count(*) | +----------+ | 2172320 | +----------+ 1 row in set (1.35 sec)
To make this a fair test, let’s say that we want to get rid of half the records. In this contrived example, ‘g’ is the median starting letter. So, we’re going to get rid of everything that matches /^[0-9a-gA-G]/. The title column is indexed, so building the result set should be reasonably efficient.
mysql> SELECT COUNT(*) FROM title WHERE title <= 'g'; +----------+ | count(*) | +----------+ | 1085426 | +----------+ 1 row in set (2.48 sec) mysql> SELECT COUNT(*) FROM title WHERE title > 'g'; +----------+ | count(*) | +----------+ | 1086894 | +----------+ 1 row in set (2.60 sec)
We’ll do it my way first and insert the records to keep into a new table:
mysql> CREATE TABLE new_title LIKE title; Query OK, 0 rows affected (0.15 sec) mysql> INSERT INTO new_title SELECT * FROM title WHERE title > 'g'; Query OK, 1086894 rows affected (1 min 5.18 sec) Records: 1086894 Duplicates: 0 Warnings: 0 mysql> RENAME TABLE original_title TO old_title, new_title TO title Query OK, 0 rows affected (0.08 sec) mysql> DROP TABLE old_title; Query OK, 0 rows affected (0.26 sec)
The total time is about 1min 5secs. There are a caveats:
Now let’s look at the performance of deleting the unneeded records:
mysql> DELETE FROM title WHERE title <= 'g'; Query OK, 1085426 rows affected (1 min 26.27 sec)
1min, 26secs is a little slower, but not too bad. The thing is, you’re probably deleting these records to save some disk space. The table on disk is still the same size.
anders:imdb root# ls -lh title.* -rw-rw---- 1 _mysql wheel 8.8K Jun 26 18:09 title.frm -rw-rw---- 1 _mysql wheel 572M Jun 26 18:23 title.ibd
To reclaim the space, we need to run OPTIMIZE TABLE
mysql> OPTIMIZE TABLE title; +-------------+----------+----------+-------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+----------+----------+-------------------------------------------------------------------+ | title | optimize | note | Table does not support optimize, doing recreate + analyze instead | | title | optimize | status | OK | +-------------+----------+----------+-------------------------------------------------------------------+ 2 rows in set (1 min 23.72 sec) anders:test root# ls -lh title.* -rw-rw---- 1 _mysql wheel 8.8K Jun 26 18:19 title.frm -rw-rw---- 1 _mysql wheel 304M Jun 26 18:21 title.ibd
The space is reclaimed, but it took a total of 2min 50s to get to the same point as in the INSERT example which was nearly 3x faster. This is only relevant with innodb_file_per_table set or when using MyISAM. Without innodb_file_per_table, there is no way to reclaim the space without rebuilding the entire database from a mysqldump.
I picked 1/2 of the table to make this a fair example. As the number of records being removed gets smaller, deleting becomes a more efficient way to get rid of the rows. DELETE can also be done online, where the INSERT technique requires some down time.