
Intro

So this topic is about various techniques for getting data from a file into a MySQL database.  I’ll be talking about using PHP to load and parse a file, and the different methods within PHP to insert the data into the DB.  I’ll also talk about MySQL’s built-in functionality to load data directly from a file.

So, for an impartial test (and because the total permutations are more than I want to program)… I’ve split each step of the import out on its own to show the performance of the various types of processing.

I’ll start off with getting the data out of the file.  Once I’ve shown the best way to parse the rows and columns, I’ll show the fastest way to get that data into the DB.

Now kiddies… let’s get started.

Generating test files

I created a script which will generate a test file based on command line inputs.  The files are tab delimited, and every column of every row has a unique value.  See the script.
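For reference, a generator along these lines is just a couple of nested loops.  This is a hypothetical sketch (the file naming and the “valN” cell values are my own invention, not the exact script):

<?php
// Hypothetical sketch of a test-file generator: column and row counts come
// from the command line, every cell gets a unique value, columns are tab
// delimited and rows are newline terminated.
list(, $cols, $rows) = $argv;                       // e.g. php gen_file.php 5 25000
$name = sprintf('%d_col_%d_records_%s.txt', $cols, $rows, date('YmdHis'));
$fh   = fopen($name, 'w');

$val = 0;
for ($r = 0; $r < $rows; $r++) {
    $fields = array();
    for ($c = 0; $c < $cols; $c++) {
        $fields[] = 'val' . $val++;                 // unique value per cell
    }
    fwrite($fh, implode("\t", $fields) . "\n");
}
fclose($fh);
echo "Wrote $rows rows to $name\n";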

Large for loops

I’ve found that most code uses sizeof() or count() as part of the for loop definition. What actually happens is that sizeof() or count() is re-evaluated on every iteration of the loop, meaning PHP checks the size of the array each time around.  So… if you have an array of length 100 it checks the size of the array 101 times (one check per iteration plus the final exit check).  PHP is pretty fast… so this really doesn’t matter until you have large arrays.  Well… we are dealing with data files here.  We want to go as fast as we can.

The simple thing to do is calculate sizeof() or count() once and save the result in a variable, call it $size_of.  Use $size_of in your for loop and this reduces the execution time and load dramatically.  I created a test script to demonstrate the time differences for various array sizes.  Check out the code here.
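To make the difference concrete, here’s the shape of the two loops (a minimal sketch, not the test script itself):

<?php
$data = range(1, 1000000);

// Slow: count() is re-evaluated on every single iteration
for ($i = 0; $i < count($data); $i++) {
    // ... work ...
}

// Fast: calculate the size once, then reuse it
$size_of = count($data);
for ($i = 0; $i < $size_of; $i++) {
    // ... work ...
}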

Testing arrays of size 100, 1000, 10000, 100000, 500000, and 1000000, the results are:

  • Using sizeof() in for loop of size 100 took 5.31673431396E-5 seconds
  • Using count() in for loop of size 100 took 4.60147857666E-5 seconds
  • Using saved variable, $size_of, in for loop of size 100 took 2.19345092773E-5 seconds
  • Using sizeof() in for loop of size 1000 took 0.000462055206299 seconds
  • Using count() in for loop of size 1000 took 0.000478982925415 seconds
  • Using saved variable, $size_of, in for loop of size 1000 took 0.000200986862183 seconds
  • Using sizeof() in for loop of size 10000 took 0.00419116020203 seconds
  • Using count() in for loop of size 10000 took 0.00372004508972 seconds
  • Using saved variable, $size_of, in for loop of size 10000 took 0.00174808502197 seconds
  • Using sizeof() in for loop of size 100000 took 0.0383322238922 seconds
  • Using count() in for loop of size 100000 took 0.0381410121918 seconds
  • Using saved variable, $size_of, in for loop of size 100000 took 0.0178201198578 seconds
  • Using sizeof() in for loop of size 500000 took 0.203524112701 seconds
  • Using count() in for loop of size 500000 took 0.207995176315 seconds
  • Using saved variable, $size_of, in for loop of size 500000 took 0.0991399288177 seconds
  • Using sizeof() in for loop of size 1000000 took 0.378456115723 seconds
  • Using count() in for loop of size 1000000 took 0.37501001358 seconds
  • Using saved variable, $size_of, in for loop of size 1000000 took 0.17529797554 seconds

So as you can see, the fastest way to loop through an array of any size is to calculate the size one time, rather than as part of the for loop.  Based on the 1m loop, pre-calculating brings the time down to roughly 46% of the in-loop version (0.175 vs 0.378 seconds).  All my further test scripts will pre-calculate the array size for maximum efficiency.

Of course this only applies if you are incrementing in your for loop.  If you decrement, the choice between sizeof(), count(), or a saved variable becomes a moot point, since the size is only evaluated once to seed the counter.  But decrementing may not be appropriate in all situations. Here’s the test script.
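For completeness, the decrementing version looks like this (a sketch, not the test script); the size is only computed once, in the loop’s initializer:

<?php
$data = range(1, 1000000);

// count() only runs once here, to seed $i, so the loop condition
// never has to look the size up again.
for ($i = count($data) - 1; $i >= 0; $i--) {
    // ... work ...
}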

Using a 1,000,000-element array with a decrementing for loop, the execution times are:

  • Using sizeof() in for loop of size 1000000 took 0.168756008148 seconds
  • Using count() in for loop of size 1000000 took 0.162810087204 seconds
  • Using saved variable, $size_of, in for loop of size 1000000 took 0.162010908127 seconds

Oh, and a decrementing loop is about 8% faster than an incrementing one… why?  Someone once explained it to me… but I don’t remember (the usual explanation is that comparing the counter against the constant 0 is cheaper than reading a variable on every pass).

Extracting data from the file

There are 3 ways to extract rows and columns from a file: split(), explode(), and preg_split().

I set up a script which takes the extraction type as a parameter and ran the test on 3 file types:

  1. 5 columns and 25k records
  2. 10 columns and 100k records
  3. 2 columns and 1m records

The results are:

  • A five column file with 25k records took 167.383776188 seconds with split()
  • A five column file with 25k records took 0.0651090145111 seconds with explode()
  • A five column file with 25k records took 0.0954439640045 seconds with preg_split()
  • A ten column file with 100k records took 5345.68132997 seconds (89 minutes) with split()
  • A ten column file with 100k records took 0.360288143158 seconds with explode()
  • A ten column file with 100k records took 0.592931032181 seconds with preg_split()
  • A two column file with 1m records took 106985.44959 seconds (29.7 hours) with split()
  • A two column file with 1m records took 1.58133602142 seconds with explode()
  • A two column file with 1m records took 2.47510886192 seconds with preg_split()

So, unless you MUST use regular expressions, use explode().  If you MUST use regular expressions, then avoid split() and use the newer preg_split().  After an analysis like this one… you can see why split() is DEPRECATED as of PHP 5.3.0 and has since been removed entirely (in PHP 7.0.0).
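A minimal sketch of the explode() approach (the file name is just an example): split the file into rows on newlines, then split each row into columns on tabs.

<?php
$contents = file_get_contents('/pathtofile/five_col_25k_records.txt');  // example path
$rows     = explode("\n", trim($contents));

$row_count = count($rows);                  // pre-calculated, as discussed above
for ($i = 0; $i < $row_count; $i++) {
    $columns = explode("\t", $rows[$i]);
    // $columns[0], $columns[1], ... now hold the individual fields
}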

Moving on… now that we know the best method to parse the data file, let’s look at ways to get the data into the database.

Inserting into MySQL using mysql_query()

The simplest method of querying the database is the “old school” mysql_query() function.

However, it is not the best.  It’s fine for one-off queries, but if you are doing repetitive queries, such as importing data, it is not the right approach.

See the code
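If you don’t have the code handy, the approach is roughly this (connection details and table/column names here are made up for illustration):

<?php
// One mysql_query() call per record: simple, but every record is a
// brand new statement for MySQL to parse and execute.
$conn = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $conn);

$rows      = explode("\n", trim(file_get_contents('five_col_25k_records.txt')));
$row_count = count($rows);
for ($i = 0; $i < $row_count; $i++) {
    $c   = explode("\t", $rows[$i]);
    $sql = sprintf(
        "INSERT INTO five_col_25k_records (col1, col2, col3, col4, col5)
         VALUES ('%s', '%s', '%s', '%s', '%s')",
        mysql_real_escape_string($c[0], $conn),
        mysql_real_escape_string($c[1], $conn),
        mysql_real_escape_string($c[2], $conn),
        mysql_real_escape_string($c[3], $conn),
        mysql_real_escape_string($c[4], $conn)
    );
    mysql_query($sql, $conn);
}
mysql_close($conn);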

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.94345879555 seconds to import.  Minus the looping of records/columns that’s 1.8783497810389 seconds just for inserts.
  • A ten column file with 100k records took 9.01914596558 seconds to import.  Minus the looping of records/columns that’s 8.658857822422 seconds just for inserts.
  • A two column file with 1m records took 90.6988818645 seconds to import. Minus the looping of records/columns that’s about 89 seconds just for inserts.

Inserting into MySQL using mysqli

MySQLi stands for MySQL Improved.  It’s better than mysql_query(), but only if you know how to use it.  The approach here, one unique query per record, is no better than mysql_query().

See the code
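The mysqli version is essentially the same loop, just through the object-oriented interface (connection details and table/column names are again illustrative):

<?php
// Still one query per record, which is why the timings come out
// almost identical to mysql_query().
$mysqli = new mysqli('localhost', 'user', 'password', 'test');

$rows      = explode("\n", trim(file_get_contents('two_col_1m_records.txt')));
$row_count = count($rows);
for ($i = 0; $i < $row_count; $i++) {
    $c   = explode("\t", $rows[$i]);
    $sql = sprintf(
        "INSERT INTO two_col_1m_records (col1, col2) VALUES ('%s', '%s')",
        $mysqli->real_escape_string($c[0]),
        $mysqli->real_escape_string($c[1])
    );
    $mysqli->query($sql);
}
$mysqli->close();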

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.95893096924 seconds to import.  So… close enough to mysql_query()
  • A ten column file with 100k records took 9.13561987877 seconds to import.  So… close enough to mysql_query()
  • A two column file with 1m records took 91.0189270973 seconds to import. So… close enough to mysql_query()

As you can see… mysqli doesn’t provide any benefit if you execute a unique query per record.  In fact, it is slightly slower than mysql_query().  To see the real power of MySQLi, read on.

Inserting into MySQL using mysqli prepared statements

The reason why $mysqli->query() and mysql_query() provide about the same performance is that the way I’ve implemented the import is not optimal.  For a 25k record file, I’m asking MySQL to prepare and execute 25k separate queries.  A wonderful feature that MySQLi provides, and mysql_query() does not, is prepared statements.  These allow me to ask MySQL to prepare a statement once and then execute it 25k times without any re-preparation.  This reduces the DB load by re-using the same SQL and swapping the unique values in through placeholders.  We still have to execute 25k queries, but at least now we’ve removed the overhead of preparing the query 25k times.  That’s the theory behind prepared statements… it is my belief that MySQL doesn’t handle prepared statements well.  Don’t believe me?  Try running any of my above insert examples without escaping the values… the insert times will MATCH my results below for prepared statements.

See the code
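Again, if the code isn’t handy, the prepared-statement version looks roughly like this (table and column names are illustrative): prepare once, bind the placeholders once, execute per record.

<?php
$mysqli = new mysqli('localhost', 'user', 'password', 'test');

// Prepare the statement once; the ?'s are placeholders for each row's values.
$stmt = $mysqli->prepare('INSERT INTO two_col_1m_records (col1, col2) VALUES (?, ?)');
$stmt->bind_param('ss', $col1, $col2);               // bound by reference, once

$rows      = explode("\n", trim(file_get_contents('two_col_1m_records.txt')));
$row_count = count($rows);
for ($i = 0; $i < $row_count; $i++) {
    list($col1, $col2) = explode("\t", $rows[$i]);   // bound variables pick up the new values
    $stmt->execute();                                // no escaping needed; values are sent as data
}
$stmt->close();
$mysqli->close();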

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.95480203629 seconds to import.
  • A ten column file with 100k records took 7.96701002121 seconds to import.
  • A two column file with 1m records took 65.1619710922 seconds to import.

So… as the number of records increases, we see bigger and bigger decreases in the import time.  The 1m record file imported almost 30% faster.

There are other benefits to using prepared statements.

  1. SQL injection prevention by default… no need to apply mysql_real_escape_string() on each value prior to insert.  This makes your code easier to read, and actually makes it faster too.
  2. Programmatically importing data gives you the ability to catch import errors.

Next I will describe the built-in MySQL import command, which is the fastest by far.

Using MySQL load data infile

Now, the fastest method to get data into MySQL tables… the built-in MySQL load data infile command.  This command gets the data in SUPER fast, but it has some drawbacks which I will explain after the test results.

Now to get down to MySQL load data infile…

Here’s the MySQL command I issued, you can use this in a PHP script or execute on the command line as I did.

mysql> load data local infile
-> '/pathtofile/two_col_1m_records20100129234259.txt'
-> into table two_col_1m_records fields terminated by '\t' lines terminated by '\n';
Query OK, 1000000 rows affected (16.46 sec)
Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
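If you’d rather issue the same command from PHP instead of the mysql prompt, something along these lines should work (note that LOCAL infile may need to be enabled on both the client and the server):

<?php
$mysqli = new mysqli('localhost', 'user', 'password', 'test');

// Same statement as above, sent through mysqli. The \t and \n escapes are
// meant for MySQL, not PHP, so they are double-escaped in this string.
$sql = "LOAD DATA LOCAL INFILE '/pathtofile/two_col_1m_records20100129234259.txt'
        INTO TABLE two_col_1m_records
        FIELDS TERMINATED BY '\\t'
        LINES TERMINATED BY '\\n'";

if (!$mysqli->query($sql)) {
    echo 'Import failed: ' . $mysqli->error . "\n";
}
$mysqli->close();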

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 0.07 seconds to import.
  • A ten column file with 100k records took 0.52 seconds to import.
  • A two column file with 1m records took 16.46 seconds to import.

So you can see how fast this is.  75% faster than MySQLi prepared statements inserting the 1m record file.

I’ve played around with it for large data imports, but I stick with MySQLi prepared statements because I can catch errors.  The biggest drawback to load data infile is that you cannot catch an error and do something with the faulty record.  I have a batch import process which inserts data into a table and needs to ensure that all of the data gets into the table.  If I encounter any issues importing the data, I need to capture the failed data for re-processing later.  Load data infile only allows you to skip failed records and doesn’t tell you which records were bad.

I am very fond of Oracle external tables, which are a great way to import files into Oracle tables.  I worked with external tables before load data infile, so I was surprised to learn that there is no concept of a .bad file in MySQL.  That made me shy away from load data infile for any situation where the data files were not created directly by MySQL.  Even Oracle’s SQL*Loader has .bad files :(

As I mentioned, I have a process which batch imports data from files into MySQL tables.  As part of this process I check each error that occurs and compare it to a list of ignorable errors.  If the error is not in the ignorable list (primary key violations are ignorable in my case), then I write the bad record to a .bad file.  The process re-tries the .bad file at a later time, and if it fails again, manual intervention is required.
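As a rough sketch of that error handling (MySQL error 1062, duplicate key, stands in for my ignorable list; everything else, names and paths, is illustrative):

<?php
$mysqli = new mysqli('localhost', 'user', 'password', 'test');
$stmt   = $mysqli->prepare('INSERT INTO two_col_1m_records (col1, col2) VALUES (?, ?)');
$stmt->bind_param('ss', $col1, $col2);

$ignorable = array(1062);                            // 1062 = duplicate key; ignorable in my case
$bad       = fopen('two_col_1m_records.bad', 'a');   // failed records get queued for a re-try

$rows      = explode("\n", trim(file_get_contents('two_col_1m_records.txt')));
$row_count = count($rows);
for ($i = 0; $i < $row_count; $i++) {
    list($col1, $col2) = explode("\t", $rows[$i]);
    if (!$stmt->execute() && !in_array($stmt->errno, $ignorable)) {
        fwrite($bad, $rows[$i] . "\n");              // capture the bad record
    }
}
fclose($bad);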

Another case where load data infile is not feasible is when you need to add additional data by joining the file’s contents to another table.  In a case like this, load the file into a staging table of its own, then do the rest of your processing with joins.
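For example, that two-step processing could look like this (the staging, final, and other table names and columns are made up for illustration):

mysql> load data local infile '/pathtofile/staging_data.txt'
-> into table staging_table fields terminated by '\t' lines terminated by '\n';

mysql> insert into final_table (col1, col2, extra_col)
-> select s.col1, s.col2, o.extra_col
-> from staging_table s
-> join other_table o on o.col1 = s.col1;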

Conclusions

When I get a hankerin’ to develop a new tool or automate some process… I tend to jump right in and get it done as quickly as I can.  Later I find myself going back through my code trying to pinpoint any areas where I can gain a little more efficiency.  I hope my tests and trials give you the insight to start off with efficient code and worry about other things instead.

Bottom line… if the situation allows, use MySQL load data infile, otherwise stick with MySQLi prepared statements.
