Skip to main content

The Small Company and the Data Warehouse

Once upon a time…….
No, not that kind of story. There was a time no long ago when having a data warehouse meant either a large IT staff or hiring consultants and adding hardware. Then the work of keeping the warehouse updated, what if the hardware failed. The work went on and on.
Today, thanks to cloud services like Amazon RedShift, any size company can utilize the benefits of a data warehouse. They take care of all the storage, all the headaches of worrying about natural disasters. All you have to handle is the data. Simple.

Data, Simple?

Well, yes data is a large part of your warehouse solution. Luckily there are tools available to help you load and maintain your data warehouse more easily than ever. Some even provide enhancements to make moving your data, say from MySQL to RedShift.

Let’s look at a common example. A small company wants to move their customer and inventory data into a data warehouse to perform analysis on company efficiency. The use a common CRM tool and a third-party inventory management system. However, both of these applications use a MySQL database for data storage. A large advantage to using MySQL is that many companies have created tools to work with this database. An advantage to being popular. And some users have also created data warehouses using MySQL.
So, we want to upload our data. Not quite yet. As with any data warehouse there is some verification and transforming to be done before the loading begins.

Extract: you want to audit your data, ensure you have removed any garbage data and that your customer and inventory records are up to date. Then you can export the data.
Transform: you will need to transform your data to allow imports into the warehouse. You can use several languages to run the transformation commands, such as Python, Ruby, C# or Java. The exact transformations will depend upon your data and how compatible it may be with the warehouse in its native format.
Load: insert your data into staging tables or the data warehouse tables. And now it is ready for analysis.

Basically, ETL code can be grouped into several categories:

• Extract: select from original DB tables
• Load: copy to stage target tables
• Insert/Update: transform data in stage tables to data warehouse standards
• Final Load: copy transformed data from stage tables to data warehouse tables.

Example Time

Here is a code example using the PySpark API from Apache Spark:

Database connections:

# Define database connection parameters

MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar"
MYSQL_PASSWORD = '********'
MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME+"&password="+MYSQL_PASSWORD


Transformation of data and tables:

# Perform INNER JOIN on the two data frames on EMP_NO column
# As of Spark 1.4 you don't have to worry about duplicate column on join result
df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date", "salary", "from_date", "to_date")

# Adding a column 'year' to the data frame for partitioning the hive table
df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date))

# Adding a load date column to the data frame
df_final = df_add_year.withColumn('Load_date', F.current_date())


# Registering data frame as a temp table for SparkSQL
hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")


Data Loading:

# Inserting data into the Target table
            SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")

Code quoted from:

To the Warehouse

All this groomed data needs to be loaded into the appropriate tables in the data warehouse. In our scenario, the warehouse architecture is prepped with the outlook of comparing customer purchases to inventory movements. The architecture/environment/service that will make up the data warehouse is a large variable. There are plenty of options.
• Use MySQL and create a data warehouse from a new instance.
• Use a hardware/application data warehouse solution.
• Use a hosted data warehouse.
• Use a cloud-based data warehouse such as Amazon RedShift.
Each option has advantages. The deciding factor is the needs of the business. If you have limited developer skills in house, then a managed solution will be easier and less frustrating. Cloud solutions are growing quickly due to the double benefit of scalable size and no hardware to maintain.
What is it important to remember is the function of the data warehouse. Warehouses allow you to bring together logically related, but format disparate data and translate it into a cohesive data set. This collected data can then be analyzed from many perspectives to allow you to make informed business decisions. Looking at reports from separate systems does not provide the holistic view you can get from a data warehouse.

End Result

In the end, companies gain great insights from properly built and maintained data warehouses. With the plethora of solutions available today, data warehouses can now be used by any size business. Get the most out of all that data in your systems.
Cloud based storage and Integration as a Service (IaaS) data warehouse solutions are growing because of the management advantages and ROI. Companies of every size use IaaS solutions for faster implementation and quick ROI. The advantages of flexible cloud-based services and open source applications such as MySQL are easily visible.
Trying to keep up with technology developments is difficult for technology professionals, much less business owners with many demands on their time. If you are curious about cloud services including data warehouses, software as a service or integration as a service, the StratoScale team has compiled the IaaS Wiki. A growing collection of information from across the cloud and services world. If you have cloud questions, start by looking there.

Nagios plugin for checking mysql synchronization state

I want to share this simple nagios plugin for monitoring mysql slaves replications states:


This plugin can be downloaded here. To install copy this script to nagios plugins directory (/usr/lib/nagios/plugins/). View script help for full options list:

# /usr/lib/nagios/plugins/ -h

Check mysql sync between master and slave for Nagios version 0.1
Usage: /usr/lib/nagios/plugins/ [-v] -H <host> -p <port> -l <username> -x <password> -w <warn level> -c <crit level>  [-V]
-v, --verbose
   print extra debugging information 
-h, --help
   print this help message
-H, --hostname=HOST
   name or IP address of host to check
-l, --login
    username for mysql database (Default nagios)
-x, --passwd=PASSWD
-p, --port=PORT
   MYSQL port (Default 3361)
-w, --warn=INTEGER
   warning level for seconds behind sync
-c, --crit=INTEGER 
   critical level for seconds behind sync
-V, --version
   prints version number

Run this command in your mysql master for grant to nagios user the required privileges:


Bacula: migrate database from mysql to postgresql

I’ve followed these steps to migrate bacula database from mysql 5.1.49 to postgresql 8.4.9. I am running bacula 5.2.3 compiled from sources on Debian Squeeze.

1- Install postgresql database:

# apt-get install postgresql-8.4 postgresql-server-dev-8.4 postgresql-client-8.4

2- Compile and install bacula with postgresql support:

# cd /usr/src/bacula-5.2.5
# ./configure --with-postgresql --with-openssl --with-python
# make install

3- Create bacula user on postgresql:

# su - postgres
# psql -d template1 -U postgres
template1=# CREATE USER bacula WITH PASSWORD 'myPassword';

4- Create bacula database

# sh /etc/bacula/create_postgresql_database

5- Create database tables

# sh /etc/bacula/make_postgresql_tables

6- Grant privilejes on bacula database to bacula user.

# sh /etc/bacula/grant_postgresql_privileges

7- take dump file from mysql. I’ve used this script to take the mysql dump and change some fields, like zero timestamp that does not fit with postgresql.

#!/bin/perl -w

my $line=undef;
my $catalog="bacula";

system ("mysqldump --single-transaction --compatible=postgresql --compact --no-create-info $catalog > $catalog.sql");
print "$catalog dump completed!\n";
open FD_CAT, "<$catalog.sql";
open FD_OUT, ">out-$catalog.sql";
while (<FD_CAT>)
 if (m/^(INSERT\s+INTO\s+\")(\w+)(\"\s+.+)$/){          
   $line=~ s/\\\'/\'\'/g;
   $line=~ s/\'0000-00-00 00:00:00\'/to_timestamp\(0\)/g;
   print FD_OUT "$line\n";
   print "line not found: $_";

close (FD_CAT);
close (FD_OUT);
print "Dump filtered commpleted for $catalog\n";

8- Import dump file into postgresql

# psql -d bacula -f out-bacula.sql

9- Change the catalog definition in bacula-dir.conf from mysql to postgresql

Edit /etc/bacula-dir.conf and change driver and port

dbdriver = "dbi:postgresql"; dbaddress =; dbport =5432

10- Restart bacula director.

# /etc/init.d/bacula-dir restart

mysql replication and binary logs expiration

Mysql replication is very well documented on mysql server web site and has a well explained howto that you can follow when you are setting up replication on your mysql databases.

You have to set in mysql server configuration file the log_bin parameter specifying the binary log file name that will be replicated across all the slaves. Mysql slaves take this binary logs from master and apply them to database; the problem I found is that Mysql server does not manage this binary logs expiration automatically, so if you are not careful and your database has a lot of changes your system can run out of disk space.

To address this issue, mysql has another parameter “expire_logs_days” that with you can specify the value in days of the binary logs expiration:

expire_logs_days = 5
max_binlog_size = 1G

In this config, 5 days after logs replication, mysql will automatically remove them. You have to adjust this value to be sure your hard disk is not going to run out of space and your slave has time enough to apply the binary log changes considering the number of days slaves might lag behind the master.

In the same way, you can use purge binary logs mysql command to purge old binary logs:

PURGE BINARY LOGS BEFORE '2008-04-02 22:46:26';

Access denied for user ‘debian-sys-maint’@’localhost’ (using password: YES)


I had this error  when I moved all database server files,  phisically,  from one server to another for replication purposes and used the mysql startup script. The new server is a Debian Squeeze,  in debian there is a user debian-sys-maint   to do maintenance tasks,  to avoid this error with the mysql startup script you have to create this user in your new database server.


The password for this user can be seen in this file:

# cat /etc/mysql/debian.cnf

# Automatically generated for Debian scripts. DO NOT TOUCH!
[client] host = localhost
user = debian-sys-maint
password = password
socket = /var/run/mysqld/mysqld.sock
[mysql_upgrade] host = localhost
user = debian-sys-maint
password = password
socket = /var/run/mysqld/mysqld.sock
basedir = /usr

To create the user,  log in as mysql root user and run this statement:

GRANT ALL PRIVILEGES ON *.* TO 'debian-sys-maint'@'localhost' IDENTIFIED BY 'password'; 

In my case,  the original server was a Centos 5.5,  so that the user debian-sys-maint did not exist,  but in case the server was another Debian you can use the same statement to only reset the user password.


Notice: Undefined variable: wp_sh_class_name in /var/www/ on line 1002

Notice: Undefined variable: wp_sh_class_name in /var/www/ on line 1002

Warning: Use of undefined constant XML - assumed 'XML' (this will throw an Error in a future version of PHP) in /var/www/ on line 1048