Recently I found a great data source for historical weather information. I’ve always found it annoying that there are all these fancy desktop, iPhone, and web apps that allow me to get all kind of info on the current weather situations… but I have to pay to get the historical information.
I live in California, SF Bay area, and stumbled upon http://cdec.water.ca.gov/ when searching the internet for reservoir levels to see if the recent storms got us past where we “should” be for late Jan. The name of the site alone was what caught my eye “California Data Exchange Center”… mmmmmmm data, drool, drool. I started looking into the data that the site has, first the reservoir data, then I found the weather stations . I looked at some of the stations around my area and found one in particular run by a fire house that had LOTS of the historical data I’ve always wanted. Rain by hour and day, temperature by hour and day, wind by hour and day. This one station had HOURLY data back to 1992 for 4 different data sets. Once again… mmmmm data, drool, drool. Then I stumbled upon the “Query Tools” and saw “Historical Data (csv)”. Me being the nerd I am decided that since I can’t find historical weather data on weather sites, I would create a database, for fun, and build some data load scripts. So… that’s what this post is about. I haven’t built the web front end to get fancy graphs and charts… but will post that late too.
Here’s the structure of the weather database I set up:
Station meta data:
CREATE TABLE `stations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`station_id` varchar(10) NOT NULL,
`station_name` varchar(255) NOT NULL,
`elevation` varchar(20) DEFAULT NULL,
`operator` varchar(255) DEFAULT NULL,
`river_basin` varchar(255) DEFAULT NULL,
`hydrologic_area` varchar(255) DEFAULT NULL,
`latitude` varchar(20) DEFAULT NULL,
`longitude` varchar(20) DEFAULT NULL,
`nearby_city` varchar(255) DEFAULT NULL,
`country` varchar(255) DEFAULT NULL,
`added_on` datetime NOT NULL,
`modified_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`active` int(11) DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `station_id_idx` (`station_id`)
This table holds all the info about the weather station. Some of this data is required when downloading new data from CDEC, the rest is just because I can get it.
Data Collection Types:
CREATE TABLE `data_collection_types` (
`id` int(11) NOT NULL,
`sensor_number` int(11) NOT NULL DEFAULT '0',
`sensor_name` varchar(255) DEFAULT NULL,
`duration` varchar(1) DEFAULT NULL,
`last_data_collection` date DEFAULT NULL,
`active` int(11) DEFAULT '1',
PRIMARY KEY (`id`,`sensor_number`)
This table holds the data about the types of data a station collects. CDEC refers to this as a sensor. Most of the data in this table is needed to download new data from CDEC. I’m using last_data_collection to store the last data download date so I don’t constantly re-download the same data.
There really isn’t anything special about these two tables… so read on.
I’ve set up about 10 stations and 31 data types that I collect data from the CDEC for. Some of the data ranges back to 1954, but 97% is from after 1994. After I first loaded the data into a table I had 2.1 million records of data. All the data is provided in the same 3 column output, which made it very easy to store all kinds of data in the same table using the collection types id and station id to distinguish the sources.
The data table I store the actual data in is structured like this:
mysql> desc data_collected;
| Field | Type | Null | Key | Default | Extra |
| id | int(11) | NO | PRI | 0 | |
| sensor_number | int(11) | NO | PRI | 0 | |
| data_date | date | NO | PRI | 0000-00-00 | |
| data_time | char(4) | NO | PRI | | |
| value | varchar(255) | NO | | NULL | |
5 rows in set (0.00 sec)
Queries on the table did take a little time, even with the primary key index; and I like “dicking” around with shit… so I decided I would partition the table by date range. So, I set out to write the create table statement to partition by range of year and subpartition by month. At first I created the table by manually writting the SQL out. Well… I ended up with 17 partitons with 12 subpartitions each… lots of typing.
So I decided to build a simple PHP command line (works on web too) script to generate the SQL text. For those of you who know partitioning and subpartitioning, you may know that you can choose to name subpartitions or not as part of the create table statement. For me, I wanted to name the sub partitions to include the date range in the name so I could easily tell how much data was in each range (you’ll see why this was good later).
So… here’s the code (finally!!):
$sql="CREATE TABLE `data_collected` (
`id` int(11) NOT NULL DEFAULT '0',
`sensor_number` int(11) NOT NULL DEFAULT '0',
`data_date` date NOT NULL DEFAULT '0000-00-00',
`data_time` char(4) NOT NULL DEFAULT '',
`value` varchar(255) NOT NULL,
PRIMARY KEY (`id`,`sensor_number`,`data_date`,`data_time`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (YEAR(data_date))
SUBPARTITION BY HASH (MONTH(data_date))
$sql.="\nPARTITION p$m VALUES LESS THAN ($i)
(SUBPARTITION s_dec_$m ENGINE = MyISAM,
SUBPARTITION s_jan_$m ENGINE = MyISAM,
SUBPARTITION s_feb_$m ENGINE = MyISAM,
SUBPARTITION s_mar_$m ENGINE = MyISAM,
SUBPARTITION s_apr_$m ENGINE = MyISAM,
SUBPARTITION s_may_$m ENGINE = MyISAM,
SUBPARTITION s_jun_$m ENGINE = MyISAM,
SUBPARTITION s_jul_$m ENGINE = MyISAM,
SUBPARTITION s_aug_$m ENGINE = MyISAM,
SUBPARTITION s_sep_$m ENGINE = MyISAM,
SUBPARTITION s_oct_$m ENGINE = MyISAM,
SUBPARTITION s_nov_$m ENGINE = MyISAM)";
mysql_query("drop table if exists data_collected");
mysql_query($sql) or die(mysql_error());
So what’s going on here is I am generating the partition and subpartition text in a for loop. When I analyzed the data, I found that there was little data <1994 so put that all in the first partition. MySQL allows range partitioning and hash subpartitioning. This breaks each year into 12 sub partitions. Now one thing to notice is that the first sub partition is for Dec and the second for Jan. I didn’t realize it when I first set this up, but hash is using this formula
mod(month(data_date),12) in my case. Well… mod(12,12) = 0 or the 1st partition. I noticed this because all my Jan 2010 data was in my s_Feb_2010 partition. You can see the partition distribution with:
select partition_name,subpartition_name,table_rows from information_schema.partitions where table_schema='weather' and partition_name is not null;
This query is exactly why I chose to name the partitions and subpartitions
Then I searched on the hash formula to figure out why it was this way… and well moved the dec partition to the 1st partition.
So… long story short. Here’s a simple PHP script that can be used to generate a data table which is partitioned by year, and contains a sub partiton for each month. Partitioning is really meant for tables with large amounts of data. The benefits are:
- The ability to store sections of data on different disks. Especially helpful if you have large data tables on small disks. Also a plus since you can put “archive” partitions on the slower disks.
- The ability to have large amounts of data in a table and still maintain fast inserts/selects
- The ability to remove large amounts of data very quickly (drop the partition)
For more information on MySQL partitions please read:
In a later post I’ll talk about why I like Oracle partitions better than MySQL and also how I’m loading the data into these tables.