Once upon a time…….
No, not that kind of story. There was a time no long ago when having a data warehouse meant either a large IT staff or hiring consultants and adding hardware. Then the work of keeping the warehouse updated, what if the hardware failed. The work went on and on.
Today, thanks to cloud services like Amazon RedShift, any size company can utilize the benefits of a data warehouse. They take care of all the storage, all the headaches of worrying about natural disasters. All you have to handle is the data. Simple.
Well, yes data is a large part of your warehouse solution. Luckily there are tools available to help you load and maintain your data warehouse more easily than ever. Some even provide enhancements to make moving your data, say from MySQL to RedShift.
Let’s look at a common example. A small company wants to move their customer and inventory data into a data warehouse to perform analysis on company efficiency. The use a common CRM tool and a third-party inventory management system. However, both of these applications use a MySQL database for data storage. A large advantage to using MySQL is that many companies have created tools to work with this database. An advantage to being popular. And some users have also created data warehouses using MySQL.
So, we want to upload our data. Not quite yet. As with any data warehouse there is some verification and transforming to be done before the loading begins.
Extract: you want to audit your data, ensure you have removed any garbage data and that your customer and inventory records are up to date. Then you can export the data.
Transform: you will need to transform your data to allow imports into the warehouse. You can use several languages to run the transformation commands, such as Python, Ruby, C# or Java. The exact transformations will depend upon your data and how compatible it may be with the warehouse in its native format.
Load: insert your data into staging tables or the data warehouse tables. And now it is ready for analysis.
Basically, ETL code can be grouped into several categories:
• Extract: select from original DB tables
• Load: copy to stage target tables
• Insert/Update: transform data in stage tables to data warehouse standards
• Final Load: copy transformed data from stage tables to data warehouse tables.
Here is a code example using the PySpark API from Apache Spark:
# Define database connection parameters
MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar" MYSQL_USERNAME = '<USER_NAME >' MYSQL_PASSWORD = '********' MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME+"&password="+MYSQL_PASSWORD
Transformation of data and tables:
# Perform INNER JOIN on the two data frames on EMP_NO column # As of Spark 1.4 you don't have to worry about duplicate column on join result df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date", "salary", "from_date", "to_date") # Adding a column 'year' to the data frame for partitioning the hive table df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date)) # Adding a load date column to the data frame df_final = df_add_year.withColumn('Load_date', F.current_date()) df_final.repartition(10) # Registering data frame as a temp table for SparkSQL hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")
# Inserting data into the Target table hive_ctx.sql("INSERTWRITE TABLE EMPLOYEES.EMPLOYEE_DIM PARTITION (year, Load_date) \ SELECT EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, GENDER, HIRE_DATE, \ SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")
Code quoted from:
To the Warehouse
All this groomed data needs to be loaded into the appropriate tables in the data warehouse. In our scenario, the warehouse architecture is prepped with the outlook of comparing customer purchases to inventory movements. The architecture/environment/service that will make up the data warehouse is a large variable. There are plenty of options.
• Use MySQL and create a data warehouse from a new instance.
• Use a hardware/application data warehouse solution.
• Use a hosted data warehouse.
• Use a cloud-based data warehouse such as Amazon RedShift.
Each option has advantages. The deciding factor is the needs of the business. If you have limited developer skills in house, then a managed solution will be easier and less frustrating. Cloud solutions are growing quickly due to the double benefit of scalable size and no hardware to maintain.
What is it important to remember is the function of the data warehouse. Warehouses allow you to bring together logically related, but format disparate data and translate it into a cohesive data set. This collected data can then be analyzed from many perspectives to allow you to make informed business decisions. Looking at reports from separate systems does not provide the holistic view you can get from a data warehouse.
In the end, companies gain great insights from properly built and maintained data warehouses. With the plethora of solutions available today, data warehouses can now be used by any size business. Get the most out of all that data in your systems.
Cloud based storage and Integration as a Service (IaaS) data warehouse solutions are growing because of the management advantages and ROI. Companies of every size use IaaS solutions for faster implementation and quick ROI. The advantages of flexible cloud-based services and open source applications such as MySQL are easily visible.
Trying to keep up with technology developments is difficult for technology professionals, much less business owners with many demands on their time. If you are curious about cloud services including data warehouses, software as a service or integration as a service, the StratoScale team has compiled the IaaS Wiki. A growing collection of information from across the cloud and services world. If you have cloud questions, start by looking there.