Recently I had the pleasure of performance testing a query on MySQL. The target data set had to be on the order of single-digit-millions of rows to simulate production scale data. The query had to run in a reasonable time, and the web application code had to process it without running out of memory. Using the web application itself to produce this much data would have been impossible, as a realistic scenario covered a broad time range and I wanted to test this within one day, not one year! Fortunately, inserting this scale of dummy data is not hard using straight SQL.
What the data looks like
- let’s presume that we need 5 million records in a table
- the primary key is a compound key defined by a DATETIME and another number
- there are other data columns
- data should be evenly distributed over both parts of the primary key
Things to keep in mind
We will create a stored procedure that runs in MySQL to create all this data. Before we do, there are some things to keep in mind.
- we might want to remove foreign keys on that table (if, say, the second part of the primary key is also a foreign key to another table and we don’t have that other table also populated with corresponding data). This is only valid if you’re ONLY selecting against this table
- we might want to include a dummy variable unique to the test data so that it can easily be removed later. If we can’t specify a value like this for any values in our records, another possibility is to create dummy data over a time range that doesn’t overlap our actual data, say, all simulated data is from the year 1981.
- Insert a small subset of dummy data to make sure the data works with your query. There’s nothing like waiting hours for the insert to complete, only to find you accidentally swapped two columns.
Here is the stored procedure. Note that the time will be evenly distributed so that there are 10 unique records every minute for an entire year. This adds up to over 5 million records. Also note the dummy value “DELETEME” so that later we can delete this data with a DELETE FROM WHERE.
delimiter $$ CREATE PROCEDURE dummy() BEGIN DECLARE d INT DEFAULT 1; DECLARE k INT DEFAULT 1; DECLARE t INT DEFAULT 1; -- insert 10 somethings every minute for a full year -- should create over 5 million records, took 6 hours to run -- the type where p is inserted can be a number or a string WHILE (d <= 365) DO WHILE (k<=10) DO WHILE (t <= 1440) DO INSERT INTO `TABLE_NAME` VALUES (DATE_ADD(MAKEDATE(2013, d), INTERVAL t MINUTE), k, ‘...’, ’DELETEME'); SET t=t+1; END WHILE; SET t=1; SET k=k+1; END WHILE; SET v=1; SET d=d+1; END WHILE; END;
To run the stored procedure from inside MySQL:
To verify your data has all been inserted, we can use the dummy identifying value:
SELECT COUNT(*) FROM TABLE_NAME WHERE COLUMN_NAME='DELETEME';
Clean up after yourself
Finally, after all the data has been inserted and the tests have been run, we can remove all this data with a simple delete. It’s a good thing we have an identifying value for the dummy data!
DELETE FROM TABLE_NAME WHERE COLUMN_NAME='DELETEME';