Chris Johnson's Blog

Just another tech weblog

To squeeze the last bit of performance out of your application, you should make sure your HTTP responses are efficient.

There are two simple ways to add efficiency.

  1. Minify your CSS and JavaScript.
    1. Minification is basically a process which takes your well-formatted CSS and JS and removes all those tabs, multiple spaces, and line breaks.  This makes it much less readable, but it also reduces the content size when your server sends it to the browser.
  2. Use conditional gets on pages whose content doesn’t change each time they load.
    1. Conditional gets are built into all modern browsers.  There are two aspects to conditional gets: 1) The server tells the browser when the content was last modified via the Last-Modified header. 2) The browser tells the server the last modified date of its cached copy via the If-Modified-Since header.  The combination of these two headers allows your web server to determine whether it needs to send the browser any content or whether it can tell the browser to use what is cached.  It is much faster to respond with “Not Modified” than to always respond with the entire output of a PHP page, which reduces the amount of data transferred to the browser.

So how do we do it?

Well… there is a great set of classes for Minify and HTTP conditional gets that I use.  It can be found on the Google Code repo here: http://code.google.com/p/minify/

So here is some example code which uses both Minify on CSS and conditional gets:

<?php
include_once 'min/lib/HTTP/ConditionalGet.php';

// exits with a 304 if the client already has the current version cached
HTTP_ConditionalGet::check(max(array(
    filemtime('clientstyles.css'),
    filemtime('ie7styles.css'),
    filemtime('ie6styles.css'),
    filemtime('styles.php')
)), true);

include_once 'min/lib/Minify/CSS.php';
include_once 'includes/browser_detection.php';

$user_browser = browser_detection('browser');
//echo $user_browser;

// if IE version < 7, serve the IE6 styles; otherwise serve the modern styles
if ($user_browser == 'msie' || $user_browser == 'msie 6' || $user_browser == 'msie 5') {
    $css = file_get_contents('ie6styles.css');
} else {
    $css = file_get_contents('ie7styles.css');
}

echo Minify_CSS::minify($css);

// now add client specific styles
echo Minify_CSS::minify(file_get_contents('clientstyles.css'));
?>

So… what is going on here? Well… the point of this code is to determine which browser the user is on and output either the IE6 styles or the modern browser styles.  Now for the details.

First, we call the static check method with two parameters: 1) the last modified time as a unix timestamp (since I’m using multiple files in the code, I take the max of the file modification times), and 2) true, which tells the class that this content is public and can be cached by proxy servers, not just the user’s browser.

The check method does what it says… it checks the If-Modified-Since header sent by the browser against the last modified date you provide in the first parameter.  If the browser has the most current version, this method responds with a 304 Not Modified and exits; no content is output.  If the browser has an older version, it sets a new Last-Modified header and continues.
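Under the hood, the exchange looks roughly like this hand-rolled sketch (for illustration only; this is not the library’s actual implementation, and the file name is just an example):

<?php
// Minimal illustration of a conditional GET, assuming a single source file.
$lastModified = filemtime('styles.php');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    // the browser's cached copy is current: send headers only, no body
    header('HTTP/1.1 304 Not Modified');
    exit;
}

// the cached copy is stale (or missing): send the content with a fresh Last-Modified
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
// ... output the CSS here ...
?>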

The middle part, before the echos, determines which browser the user is on and gets the CSS from the appropriate file… pretty straightforward.

The last part, the echos, is where we use the Minify class.  We call the static minify method, passing in the CSS file’s contents.  This method strips out the extra whitespace and line breaks and returns the minified version, which we echo to the browser.

So… this simple bit of code, combining Minify and conditional gets, helps reduce both the load time and the load on your server.  Questions? Please comment below.

The latest thing I’ve been working on is creating objects that are cache-able.
The use cases for caching objects are many. I’m mainly looking to cache objects in order to reduce the load on the database, but an added benefit of caching, if done correctly, is that you gain fault tolerance for when the DB goes down.
This post will show you a simple object which uses caching via object serialization (writing the object to a file). There are more advanced caching methods (memcache) that I’ll likely look into next… but this is the easiest to set up, especially if you don’t host the website on your own machines and can’t install stuff… like memcache.

This class is built on a simple “generic” object class that I found on the internet (PHP Classes possibly? Not sure) and modified a bit to my liking. To this class I added 5 functions (source code linked at the bottom):

  1. save()
    1. This function saves a serialized version of the object to a file.
  2. open()
    1. This function reads the serialized version of the object back from the file.
  3. isCacheValid()
    1. A simple function that reports whether the object data is still valid, based on the cache expiration timestamp.
  4. setCacheExpireTimeStamp()
    1. The setter function for the unix timestamp at which the object data becomes too old.
  5. setCacheDuration()
    1. The setter function for the duration, in seconds, that the object data should be cached before it becomes too old.

The generic object class and these five functions give me the ability to create a simple object which can be cached to disk.  This class is my base class, to be extended when you build a cache-able object that has its own functions.
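The actual classes are linked at the bottom of this post; as a rough sketch of the idea (class, property, and file names here are placeholders of my own, not the linked source), the caching half of the base class looks something like this:

<?php
// Rough sketch of a cache-able base object (placeholder names).
class GenericCacheable
{
    protected $cache_file;          // where the serialized object lives on disk
    protected $cache_expire_ts = 0; // unix timestamp when the cached data goes stale

    public function setCacheExpireTimeStamp($ts)
    {
        $this->cache_expire_ts = $ts;
    }

    public function setCacheDuration($seconds)
    {
        $this->cache_expire_ts = time() + $seconds;
    }

    public function isCacheValid()
    {
        return time() < $this->cache_expire_ts;
    }

    // save(): serialize the whole object to disk
    public function save()
    {
        return file_put_contents($this->cache_file, serialize($this)) !== false;
    }

    // open(): read the serialized object back and copy its properties into $this
    public function open()
    {
        if (!file_exists($this->cache_file)) {
            return false;
        }
        $cached = unserialize(file_get_contents($this->cache_file));
        if ($cached === false) {
            return false;
        }
        foreach (get_object_vars($cached) as $name => $value) {
            $this->$name = $value;
        }
        return true;
    }
}
?>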

The extended class I created I’m calling “content” (source code linked at the bottom).  This class, as stated, extends the generic class.  What I’ve added to this class is the logic around the persistent storage of the object data in the database, MySQL in this case.

There are two new object variables that I’m adding that were not part of the generic class: 1) $dur, the default caching time in seconds, and 2) $track_cache, a boolean which, if true, will track when the cache is hit, when it is evicted (refreshed), and when a refresh using the DB fails.  This is intended for debugging or for tuning the cache duration.

Here are the functions of the content class:

  1. __construct()
    1. The constructor.  If you pass in a content id, the constructor calls the generic class’s constructor, which gets the object from disk if it exists.  The content constructor also checks whether the cache is still valid (the current unix timestamp is before the expiration of the cached object); if the cache is invalid, it tries to refresh the cache using the database.
  2. set_cache()
    1. Sets some cache variables needed for caching to disk.
  3. cache_hit()
    1. This writes data to a file when we use the cached object data… but is only called if $track_cache is true.
  4. cache_evict()
    1. This writes data to a file when we evict the cached object data (refresh it from the db)… but is only called if $track_cache is true.
  5. cache_refresh_failed()
    1. This writes data to a file when we try to refresh the object from the db and something failed… but is only called if $track_cache is true.
  6. update()
    1. Updates the object in the database and also in the cache file.
  7. create()
    1. Creates the object in the database and also in the cache file.
  8. refresh()
    1. Gets the object data from the database and re-caches it.
  9. connect_to_db()
    1. Creates a mysqli database connection object.

The most important part of this class is how we handle MySQL failures.  Once the object is created in the database we will always have a version of it in cache, valid or not.  I’ve heard of people who cache data, but do it purely for the performance improvements… and they set up the caching incorrectly.  They cache data to disk, time passes… read the cached data, see that it is too old… evict it (dump the data) and request the data from the database again.  Well… what happens in this situation if the database is down?  You just lost all your cache data… and can’t get it back… not good.  I have to admit… this is how I had set up caching in the past.  I am not aware of any times when this caused issues… but won’t be doing it this way ever again.

The right way to evict data is AFTER you get the latest data from the resource (database, API, etc).  This way if you request an update and the update fails… at least you can continue on with the older data… which most likely is correct anyway.  If this method is used across your site… you can function for days without the database and no one would know… unless they try to create or update data that is.
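In code, that pattern looks roughly like this (a simplified sketch building on the base-class sketch above, not the linked content class itself; refresh_from_db() is a stand-in for the real MySQL read):

<?php
// Simplified illustration of "evict only after a successful refresh".
class Content extends GenericCacheable
{
    public $data;
    public $dur = 300; // default cache duration in seconds (placeholder value)

    public function refresh()
    {
        $fresh = $this->refresh_from_db();   // ask the database first

        if ($fresh === false) {
            // DB is down or the query failed: keep serving the stale cached copy
            return false;
        }

        // only now do we evict: overwrite the cached copy with the fresh data
        $this->data = $fresh;
        $this->setCacheDuration($this->dur);
        $this->save();
        return true;
    }

    protected function refresh_from_db()
    {
        // placeholder: run the real query here and return the data, or false on failure
        return false;
    }
}
?>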

PHP Files:

generic.class.php

content.class.php

Questions? Post a comment!

Recently I performed a server migration; migrating from an old machine with multiple failure points to my new highly redundant server :)

As part of this process I, obviously, copied all my files from the old server to the new server.  This included all my photos (over 13 thousand these days), and my Photoshop Elements “catalog”.

I access my Elements catalog over the network from my laptop; mounting the file server as the “n:\” drive.  Nothing changed on the laptop from a drive perspective, but the new server has a different network path (server name and share name).

This caused an unexpected result with my catalog file… All 13k files were listed with the path of the old server name and share name and were shown as “offline”.  I read on some Adobe forums that I should have performed a “move” in the application to avoid this.  Unfortunately there was no information on how to correct this after the fact… very sad.

After additional searching, I found that as of Elements version 5.0 (I think) the data structure was changed to one using a SQLite database.  The Adobe website says that you can edit the DB directly, but warns that this is “dangerous” and shouldn’t be done.  I disregarded their warning, after backing up the catalog file.

You can get a SQLite command line application from the SQLite website (Windows), or use your Linux distribution’s package for SQLite3.  After installing, accessing the DB is as simple as sqlite3 dbfile.  This opens up the SQLite command line for the database file specified.

All kinds of new commands here… I’m used to MySQL and Oracle, so this was interesting to see the DBMS specific commands for SQLite.

If you are looking for all the tables in the DB… the command is .tables

If you are looking to describe a table, I couldn’t find a dedicated command for this; the closest is .schema table_name, which shows the create table statement.

Using these commands I was able to figure out which tables I needed to view/edit. Those tables were: volume_table (holds information about the file mounts, remote and local) and media_table (holds the file location information).

What I found interesting is that the media_table table holds the full network path to each file. I’m curious why Adobe would have a volume_table which holds the network path, and also store this information in media_table for each record. Why not just store the base path in volume_table and only the sub-folders in media_table? It sounds a little more complex than needed… but hey, I don’t know the insides of the Elements program :)

So… what I needed to change to point Elements to the correct location was the full_filepath, file_path_search_index, and volume_id columns in the media_table table.

So here’s a sample query for the update.
update media_table
   set full_filepath = replace(full_filepath, 'text to find', 'text to replace'),
       file_path_search_index = replace(file_path_search_index, 'text to find', 'text to replace'),
       volume_id = 12345
 where full_filepath like 'text to find%';

This changes the volume id to the new volume, and replaces the base server/share name (text to find) for each file with the new server/share name (text to replace).

The replace syntax is not what I was expecting; it took me a few tries to get it right, and the documentation wasn’t very helpful… but once I figured it out, everything was perfect :)

So… if you have a similar problem as I did… you can try what I tried. Of course, create a backup first! Any problems… feel free to contact me in the comments!

Thanks go out to Evan’s blog and Adobe’s forums for the kick in the right direction.

TOC

  1. Intro
  2. Generating test files
  3. Large for loops
  4. Extracting data from the file
  5. Inserting into MySQL using mysql_query()
  6. Inserting into MySQL using mysqli
  7. Inserting into MySQL using mysqli prepared statements
  8. Using MySQL load data infile
  9. Conclusions

Intro

So this topic is about various techniques of getting data from a file into a MySQL database.  I’ll be talking about using PHP to load and parse a file, and different methods within PHP to insert the data in the DB.  I’ll also talk about MySQL’s built in functionality to load data directly from a file.

So, for an impartial test, and because the total permutations are more than I want to program… I’ve split out each step of the import to show the performance of the various types of processing.

I’ll start off with getting the data from the file.  Once I show the best way to get data from the file, parsing columns and also rows, I will show how to get the data into the DB the fastest way.

Now kiddies… let’s get started.

Generating test files

I created a script which will generate a test file based on command line inputs.  The files are tab delimited, and every column of every row has a unique value.  See the script.

Large for loops

I’ve found that most code uses sizeof() or count() as part of the for loop definition. What actually happens is a re-evaluation of sizeof() or count() on every iteration, meaning PHP checks the size of the array each time through the loop.  So… if you have an array of length 100, it checks the size of the array 101 times (once per iteration plus the final check that exits the loop).  PHP is pretty fast… so this really doesn’t matter until you have large arrays.  Well… we are dealing with data files here.  We want to go as fast as we can.

The simple thing to do is calculate sizeof() or count() once and save the result into a variable, call it $size_of.  Use $size_of in your for loop and this reduces the execution time and load dramatically.  I created a test script to demonstrate the time differences for various array sizes.  Check out the code here.
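As a rough illustration of the pattern (not the linked benchmark script itself):

<?php
// Illustration only: the same loop written both ways.
$data = range(1, 100000);

// slower: count() is re-evaluated on every iteration
for ($i = 0; $i < count($data); $i++) {
    // ... work ...
}

// faster: compute the size once and reuse it
$size_of = count($data);
for ($i = 0; $i < $size_of; $i++) {
    // ... work ...
}
?>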

With a test of an array sized 100, 1000, 10000, 100000, 500000, and 1000000 the results are:

  • Using sizeof() in for loop of size 100 took 5.31673431396E-5 seconds
  • Using count() in for loop of size 100 took 4.60147857666E-5 seconds
  • Using saved variable, $size_of, in for loop of size 100 took 2.19345092773E-5 seconds
  • Using sizeof() in for loop of size 1000 took 0.000462055206299 seconds
  • Using count() in for loop of size 1000 took 0.000478982925415 seconds
  • Using saved variable, $size_of, in for loop of size 1000 took 0.000200986862183 seconds
  • Using sizeof() in for loop of size 10000 took 0.00419116020203 seconds
  • Using count() in for loop of size 10000 took 0.00372004508972 seconds
  • Using saved variable, $size_of, in for loop of size 10000 took 0.00174808502197 seconds
  • Using sizeof() in for loop of size 100000 took 0.0383322238922 seconds
  • Using count() in for loop of size 100000 took 0.0381410121918 seconds
  • Using saved variable, $size_of, in for loop of size 100000 took 0.0178201198578 seconds
  • Using sizeof() in for loop of size 500000 took 0.203524112701 seconds
  • Using count() in for loop of size 500000 took 0.207995176315 seconds
  • Using saved variable, $size_of, in for loop of size 500000 took 0.0991399288177 seconds
  • Using sizeof() in for loop of size 1000000 took 0.378456115723 seconds
  • Using count() in for loop of size 1000000 took 0.37501001358 seconds
  • Using saved variable, $size_of, in for loop of size 1000000 took 0.17529797554 seconds

So, as you can see, the fastest way to loop through an array of any size is to calculate the size one time rather than as part of the for loop.  Based on the 1m loop, the pre-calculated version runs in roughly 46% of the time of the in-loop version.  All my further test scripts will pre-calculate the array size for maximum efficiency.

Of course this only applies if you are incrementing in your for loop.  If you decrement, the size is only evaluated once, in the loop’s initialization expression, so the choice of sizeof(), count(), or a saved variable becomes a moot point.  But decrementing may not be appropriate in all situations. Here’s the test script.

Using a 1,000,000 size array with decrementing for loop the execution times are:

  • Using sizeof() in for loop of size 1000000 took 0.168756008148 seconds
  • Using count() in for loop of size 1000000 took 0.162810087204 seconds
  • Using saved variable, $size_of, in for loop of size 1000000 took 0.162010908127 seconds

Oh and a decrementing loop is about 8% faster than an incrementing one… why?  Someone once explained it to me… but I don’t remember

Extracting data from the file

There are 3 ways to extract rows and columns from a file: split(), explode(), and preg_split().

I have set up a script which takes the extraction type as a parameter, and I performed the test on 3 file types:

  1. 5 columns and 25k records
  2. 10 columns and 100k records
  3. 2 columns and 1m records

The results are:

  • A five column file with 25k records took 167.383776188 seconds with split()
  • A five column file with 25k records took 0.0651090145111 seconds with explode()
  • A five column file with 25k records took 0.0954439640045 seconds with preg_split()
  • A ten column file with 100k records took 5345.68132997 seconds (89 minutes) with split()
  • A ten column file with 100k records took 0.360288143158 seconds with explode()
  • A ten column file with 100k records took 0.592931032181 seconds with preg_split()
  • A two column file with 1m records took 106985.44959 seconds (29.7 hours) with split()
  • A two column file with 1m records took 1.58133602142 seconds with explode()
  • A two column file with 1m records took 2.47510886192 seconds with preg_split()

So, unless you MUST use regular expressions, use explode().  If you MUST use regular expressions, then avoid split() and use the newer preg_split().  After an analysis like this one… you can see why split() is DEPRECATED as of PHP 5.3.0 and scheduled for removal in a future release.
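For reference, the explode() approach to a tab-delimited file looks roughly like this (the file name is a placeholder):

<?php
// Parse a tab-delimited file into rows and columns with explode().
$rows    = explode("\n", trim(file_get_contents('datafile.txt')));
$size_of = count($rows);   // pre-calculate, per the for-loop results above

for ($i = 0; $i < $size_of; $i++) {
    $columns = explode("\t", $rows[$i]);
    // $columns[0], $columns[1], ... are the field values for this record
}
?>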

Moving on… now that we know the best method to parse the data file, lets look at ways to get the data into the database.

Inserting into MySQL using mysql_query()

The simplest method of querying the database is the “old school” mysql_query() function.

However, this is not the best.  This is fine for single queries, but if you are doing repetitive queries, such as importing data, it is not the right approach.

See the code
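The linked script isn’t reproduced here, but the approach is roughly this sketch (connection details and table/column names are placeholders):

<?php
// One mysql_query() INSERT per record: simple, but every statement is parsed from scratch.
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $link);

$rows    = explode("\n", trim(file_get_contents('datafile.txt')));
$size_of = count($rows);

for ($i = 0; $i < $size_of; $i++) {
    $cols = explode("\t", $rows[$i]);
    $sql  = "INSERT INTO two_col_1m_records (col1, col2) VALUES ('"
          . mysql_real_escape_string($cols[0], $link) . "', '"
          . mysql_real_escape_string($cols[1], $link) . "')";
    mysql_query($sql, $link);   // one full parse + execute per record
}
?>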

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.94345879555 seconds to import.  Minus the looping of records/columns that’s 1.8783497810389 seconds just for inserts.
  • A ten column file with 100k records took 9.01914596558 seconds to import.  Minus the looping of records/columns that’s 8.658857822422 seconds just for inserts.
  • A two column file with 1m records took 90.6988818645 seconds to import. Minus the looping of records/columns that’s about 89 seconds just for inserts.

Inserting into MySQL using mysqli

MySQLi stands for MySQL Improved.  It’s better than mysql_query(), but only if you know how to use it; used naively, with one query per record as in this test, it is no better than mysql_query().

See the code
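Again, the linked script isn’t reproduced here, but it follows the same per-record pattern, just through the mysqli object (placeholder names):

<?php
// Same approach as before, but via mysqli: still one parse + execute per record.
$mysqli = new mysqli('localhost', 'user', 'password', 'test');

$rows    = explode("\n", trim(file_get_contents('datafile.txt')));
$size_of = count($rows);

for ($i = 0; $i < $size_of; $i++) {
    $cols = explode("\t", $rows[$i]);
    $mysqli->query("INSERT INTO two_col_1m_records (col1, col2) VALUES ('"
        . $mysqli->real_escape_string($cols[0]) . "', '"
        . $mysqli->real_escape_string($cols[1]) . "')");
}
?>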

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.95893096924 seconds to import.  So… close enough to mysql_query()
  • A ten column file with 100k records took 9.13561987877 seconds to import.  So… close enough to mysql_query()
  • A two column file with 1m records took 91.0189270973 seconds to import. So… close enough to mysql_query()

As you can see… mysqli doesn’t provide any benefit if you execute a unique query per record.  In fact, it is slightly slower than mysql_query().  To see the real power of MySQLi, read on.

Inserting into MySQL using mysqli prepared statements

The reason $mysqli->query() and mysql_query() provide about the same performance is that the way I’ve implemented the import is not optimal.  For a 25k record file, I’m asking MySQL to prepare and execute 25k queries.  A wonderful feature that MySQLi provides, and mysql_query() does not, is prepared statements.  These allow me to ask MySQL to prepare a statement once and then execute it 25k times without re-preparing it.  This reduces the DB load by re-using the same SQL, with the unique values supplied through placeholders.  We still have to execute 25k queries, but at least now we’ve removed the overhead of preparing the query 25k times.  That is the theory behind prepared statements… though it is my belief that MySQL doesn’t handle prepared statements especially well.  Don’t believe me?  Try running any of my above insert examples without escaping the values… the insert times will MATCH my results below for prepared statements.

See the code
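Roughly, the prepared-statement version looks like this sketch (a simplified two-column case with placeholder names, not the exact linked script):

<?php
// Prepare once, execute many times; values are bound, so no manual escaping is needed.
$mysqli = new mysqli('localhost', 'user', 'password', 'test');

$rows    = explode("\n", trim(file_get_contents('datafile.txt')));
$size_of = count($rows);

$col1 = $col2 = '';
$stmt = $mysqli->prepare('INSERT INTO two_col_1m_records (col1, col2) VALUES (?, ?)');
$stmt->bind_param('ss', $col1, $col2);   // bound by reference

for ($i = 0; $i < $size_of; $i++) {
    list($col1, $col2) = explode("\t", $rows[$i]);
    $stmt->execute();   // re-uses the already-prepared statement
}

$stmt->close();
?>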

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 1.95480203629 seconds to import.
  • A ten column file with 100k records took 7.96701002121 seconds to import.
  • A two column file with 1m records took 65.1619710922 seconds to import.

So… as the number of records increases, we see dramatic decreases in the import time.  The 1m record file imported almost 30% faster.

There are other benefits to using prepared statements.

  1. SQL injection prevention by default… no need to apply mysql_real_escape_string() to each value prior to insert.  This makes your code easier to read, and actually makes it faster too.
  2. Programmatically importing data gives you the ability to catch import errors.

I will describe using built in MySQL import commands next, the fastest by far.

Using MySQL load data infile

Now, the fastest method to get data in MySQL tables… the built in mysql load data infile command.  This command gets the data in SUPER fast, but has some drawbacks which I will explain after the test results.

Now to get down to MySQL load data infile…

Here’s the MySQL command I issued, you can use this in a PHP script or execute on the command line as I did.

mysql> load data local infile
-> '/pathtofile/two_col_1m_records20100129234259.txt'
-> into table two_col_1m_records fields terminated by '\t' lines terminated by '\n';
Query OK, 1000000 rows affected (16.46 sec)
Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
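If you would rather issue the same command from a PHP script, as mentioned above, a sketch might look like this (connection details are placeholders, and both the client and the server need LOCAL infile enabled):

<?php
// Run LOAD DATA LOCAL INFILE through mysqli.
$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_LOCAL_INFILE, true);   // must be set before connecting
$mysqli->real_connect('localhost', 'user', 'password', 'test');

$sql = "LOAD DATA LOCAL INFILE '/pathtofile/two_col_1m_records20100129234259.txt'
        INTO TABLE two_col_1m_records
        FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";

if (!$mysqli->query($sql)) {
    echo 'Import failed: ' . $mysqli->error;
}
?>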

Using the 3 files I generated above, here are the results:

  • A five column file with 25k records took 0.07 seconds to import.
  • A ten column file with 100k records took 0.52 seconds to import.
  • A two column file with 1m records took 16.46 seconds to import.

So you can see how fast this is.  75% faster than MySQLi prepared statements inserting the 1m record file.

I’ve played around with it for large data imports, but I stick with MySQLi prepared statements because I can catch errors.  The biggest drawback to load data infile is that you cannot catch an error and do something with the faulty record.  I set up a batch import process which inserts data into a table and needs to ensure that all data gets into the table.  If I encounter any issues importing the data, I need to capture the failed data for re-processing later.  Load data infile only allows you to skip failed records and doesn’t tell you which records were bad. I am very fond of Oracle external tables, which are a great way to import files into Oracle tables.  I worked with external tables prior to load data infile, so I was surprised to learn that there is no concept of a .bad file for MySQL.  That made me shy away from load data infile for any situation where the data files were not created directly from MySQL.  Even Oracle’s SQL*Loader has .bad files :(

As I mentioned, I have a process which batch imports data from files into MySQL tables.  As part of this process I check each error that occurs and compare it to a list of ignorable errors.  If the error is not in the ignorable list (primary key violations are ignorable in my case), then I write the bad record to a .bad file.  I have the process re-try the .bad file again at a later time and if that fails again, manual intervention is required.
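The error check itself can be as simple as the following sketch (the error codes and file name are placeholders for my own setup, not a general recommendation):

<?php
// Decide whether a failed insert is ignorable or should be written to a .bad file.
function handle_insert_error(mysqli_stmt $stmt, $raw_record, $bad_file = 'import.bad')
{
    $ignorable = array(1062);   // 1062 = duplicate entry (primary key violation) in MySQL

    if (in_array($stmt->errno, $ignorable)) {
        return;   // safe to skip this record
    }

    // anything else: keep the raw record in a .bad file for a later re-try
    file_put_contents($bad_file, $raw_record . "\n", FILE_APPEND);
}
?>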

Another case where it is not feasible to use load data infile is if you need to add additional data by joining it to another table.  With a case like this, load the file into its own table, then do the rest of your processing with joins.

Conclusions

When I get a hankerin’ to develop a new tool or automate some process… I tend to jump right in and get it done as quickly as I can.  I find myself later going through my code trying to pinpoint any areas where I can gain a little more efficiency.  I hope that my tests and trials provide you with the insight to start off with the efficient code and worry about other things instead.

Bottom line… if the situation allows, use MySQL load data infile, otherwise stick with MySQLi prepared statements.


Plugin Overview

This plugin provides the ability to add comment blocks mid post. The use case for this is: WP is used to show clients creative documents, a single post is dedicated to a single project. Each revision is appended to the post. Adding in the CJ Revision Feedback tag will allow clients to provide feedback for a specific revision. Admin users can mark the feedback as completed and approved while adding sub-comments, and close out feedback on a revision. The plugin includes multiple customizable elements including: look and feel, email notifications, and post-specific email notifications.

Developed on WordPress 2.8.6, so earlier versions haven’t been tested.

Installation

  1. Upload all files from the zip to the cj-revision-feedback folder in the /wp-content/plugins/ directory
  2. Activate the plugin through the ‘Plugins’ menu in WordPress
  3. Place [revision_feedback revision=1] in your post to enable mid post comments

Usage

Basic Tag Format:
[revision_feedback revision=# nobutton]

Each tag added should have a unique revision number, starting with 1 and incrementing from there. To disable new feedback for a revision, just add nobutton to the tag. This only prevents users from adding new comments; they can still view comments, and admin users can still do everything.

Sample tags:
[revision_feedback revision=1]

This tag is the first revision tag. At this point the client makes comments. Once revision 1 is completed, the tag is updated by adding nobutton:
[revision_feedback revision=1 nobutton]

Now the “add feedback” button will not show. At this point you would want to open comments for the next revision by adding:
[revision_feedback revision=2]

Rinse and repeat until the final revisions are approved.

If you add the same revision number to a post more than once, the admin view (not the user view) will show an error message on the screen. Duplicate revision numbers don’t break anything, but they will cause new and existing comments to show in two places.

Features/settings

If you visit the “Settings” section of the admin page you will see a line item for CJ Revision Feedback. The sections below detail what each feature/setting is for.

Show feedback author and date
By checking this box the user id and comment date will be displayed next to each comment.

Allow Anonymous User Comments (disabling will require users to be logged in to comment)
Check this checkbox to allow anonymous users (people not logged in) to make revision comments.

Layer Top Offset
This defines how many pixels below or above the clickable button the user and admin comment layers will appear. A positive number is below, a negative number is above. The top left of the button is 0 offset.

Admin Layer Left Offset
This defines how many pixels to the left or right of the clickable button the admin layer should be positioned. A negative number is left, a positive number is right.

User Layer Left Offset
This defines how many pixels to the left or right of the clickable button the user layer should be positioned. A negative number is left, a positive number is right.

Send notification emails on new comments
By checking this box (and providing a default address) you will enable email notification for new user comments.

Default Notification Address (will be sent all new comment notifications)
This is the default notification address used for all new comments. This email address will receive an email for each new user comment.

Revision Editor Roles (Administrator is always an editor)
This setting gives you the ability to specify which user types to consider as an admin. An “admin” is a user who can make sub-comments, and mark a comment as completed or approved.

Post Specific Email Notification Addresses

It is possible to specify email notification addresses on a per post basis. This enables sending new comment notifications to non-default addresses scoped to the specific post.
To do this, create a new custom field named “notification_addresses” with the value set to a comma-separated list of email addresses. For example: email1@email.com,email2@email.com, email3@email.com (you can add spaces between email addresses).

Customizing the look and feel


Images
You can customize the buttons by replacing the gif images in the cj-revision-feedback folder.

Styles
You can customize the colors of the comment layers and some of the positioning by editing the CSS file in the cj-revision-feedback folder.

Bugs/Issues?

Please notify me by posting a comment below.