
The Small Company and the Data Warehouse

Once upon a time…
No, not that kind of story. There was a time, not long ago, when having a data warehouse meant either a large IT staff or hiring consultants and buying hardware. Then came the ongoing work of keeping the warehouse updated, and the worry about what would happen if the hardware failed. The work went on and on.
Today, thanks to cloud services like Amazon RedShift, a company of any size can enjoy the benefits of a data warehouse. The provider takes care of the storage and the headaches of hardware failures and natural disasters. All you have to handle is the data. Simple.

Data, Simple?

Well, yes, data is a large part of your warehouse solution. Luckily, there are tools available to help you load and maintain your data warehouse more easily than ever. Some even provide enhancements that make moving your data, say from MySQL to RedShift, far simpler.

Let’s look at a common example. A small company wants to move its customer and inventory data into a data warehouse to analyze company efficiency. It uses a common CRM tool and a third-party inventory management system, and both of these applications use a MySQL database for storage. A big advantage of MySQL is that many companies have built tools to work with it, one of the perks of being popular, and some users have even built data warehouses on MySQL itself.
So, we want to upload our data. Not quite yet. As with any data warehouse, there is some verification and transformation to be done before the loading begins.

Extract: audit your data, make sure any garbage records have been removed, and confirm that your customer and inventory records are up to date. Then you can export the data.
Transform: you will need to transform your data so it can be imported into the warehouse. You can use several languages to run the transformation commands, such as Python, Ruby, C# or Java. The exact transformations depend on your data and how compatible its native format is with the warehouse.
Load: insert your data into staging tables or directly into the data warehouse tables. Now it is ready for analysis.

Basically, ETL code can be grouped into several categories:

• Extract: select from original DB tables
• Load: copy to stage target tables
• Insert/Update: transform data in stage tables to data warehouse standards
• Final Load: copy transformed data from stage tables to data warehouse tables.

Example Time

Here is a code example using the PySpark API from Apache Spark:

Database connections:

# Define database connection parameters

MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar"
MYSQL_USERNAME = '<USER_NAME >'
MYSQL_PASSWORD = '********'
MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME+"&password="+MYSQL_PASSWORD
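
The quoted example starts at the join, so the Spark setup and the extract step are not shown. Here is a minimal sketch of what that part might look like, assuming a HiveContext named hive_ctx and source tables named employees and salaries (the table names are assumptions inferred from the DataFrame names used below):

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext(appName="mysql_to_warehouse_etl")
hive_ctx = HiveContext(sc)

# Extract: read the source tables from MySQL over JDBC into data frames
df_employees = hive_ctx.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="employees",               # assumed source table name
    driver="com.mysql.jdbc.Driver").load()

df_salaries = hive_ctx.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="salaries",                # assumed source table name
    driver="com.mysql.jdbc.Driver").load()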

 

Transformation of data and tables:

# Perform INNER JOIN on the two data frames on EMP_NO column
# As of Spark 1.4 you don't have to worry about duplicate column on join result
df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date", "salary", "from_date", "to_date")

# Adding a column 'year' to the data frame for partitioning the hive table
df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date))

# Adding a load date column to the data frame
df_final = df_add_year.withColumn('Load_date', F.current_date())

# Repartition before writing (repartition returns a new data frame, so reassign it)
df_final = df_final.repartition(10)

# Registering data frame as a temp table for SparkSQL
hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")

 

Data Loading:

# Inserting data into the Target table
hive_ctx.sql("INSERTWRITE TABLE EMPLOYEES.EMPLOYEE_DIM PARTITION (year, Load_date) \
            SELECT EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, GENDER, HIRE_DATE, \
            SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")

Code quoted from:

https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/
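
One thing the quoted snippet assumes is that the target table EMPLOYEES.EMPLOYEE_DIM already exists in Hive, is partitioned by the same columns, and that dynamic partitioning is enabled. A sketch of that setup follows; the column types and ORC storage format are assumptions on my part:

# Allow dynamic partitioning for the PARTITION (year, Load_date) insert
hive_ctx.sql("SET hive.exec.dynamic.partition = true")
hive_ctx.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Create the database and the partitioned target table if they do not exist yet
hive_ctx.sql("CREATE DATABASE IF NOT EXISTS EMPLOYEES")
hive_ctx.sql("CREATE TABLE IF NOT EXISTS EMPLOYEES.EMPLOYEE_DIM ( \
             emp_no INT, birth_date DATE, first_name STRING, last_name STRING, \
             gender STRING, hire_date DATE, salary INT, from_date DATE, to_date DATE) \
             PARTITIONED BY (year INT, Load_date DATE) STORED AS ORC")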

To the Warehouse

All this groomed data needs to be loaded into the appropriate tables in the data warehouse. In our scenario, the warehouse schema is designed with an eye toward comparing customer purchases to inventory movements. The platform that will host the data warehouse is the big variable, and there are plenty of options:
• Use MySQL and create a data warehouse from a new instance.
• Use a hardware/application data warehouse solution.
• Use a hosted data warehouse.
• Use a cloud-based data warehouse such as Amazon RedShift.
Each option has advantages. The deciding factor is the needs of the business. If you have limited developer skills in house, then a managed solution will be easier and less frustrating. Cloud solutions are growing quickly due to the double benefit of scalable size and no hardware to maintain.
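
If the cloud route is chosen, say Amazon RedShift, the usual final step is to stage the transformed data in S3 and load it with a COPY command. Here is a minimal sketch of that load using Python and psycopg2; the cluster endpoint, credentials, bucket, IAM role and table name are all placeholders rather than values from the example above:

import psycopg2

# Connect to the RedShift cluster (endpoint and credentials are placeholders)
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="warehouse",
    user="etl_user",
    password="********")

# COPY the staged files from S3 into the target table (bucket and role are hypothetical)
with conn.cursor() as cur:
    cur.execute("""
        COPY employees.employee_dim
        FROM 's3://example-etl-bucket/employee_dim/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
        FORMAT AS CSV;
    """)
conn.commit()
conn.close()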
What is important to remember is the function of the data warehouse. Warehouses let you bring together logically related but differently formatted data and translate it into a cohesive data set. This collected data can then be analyzed from many perspectives, allowing you to make informed business decisions. Looking at reports from separate systems does not provide the holistic view you can get from a data warehouse.

End Result

In the end, companies gain great insights from properly built and maintained data warehouses. With the plethora of solutions available today, data warehouses can now be used by any size business. Get the most out of all that data in your systems.
Cloud-based storage and Integration as a Service (IaaS) data warehouse solutions are growing because of their management advantages and ROI. Companies of every size use IaaS solutions for faster implementation and a quicker return on investment. The advantages of flexible cloud-based services and open source applications such as MySQL are easy to see.
Keeping up with technology developments is difficult for technology professionals, much less for business owners with many demands on their time. If you are curious about cloud services, including data warehouses, software as a service or integration as a service, the StratoScale team has compiled the IaaS Wiki, a growing collection of information from across the cloud and services world. If you have cloud questions, start by looking there.

mkfs.xfs: cannot open /dev/[device]: Device or resource busy

If you hit this error trying to format a partition after you have installed a new disk in your server:

~# mkfs -t xfs -f -i size=2048 /dev/sdav2
mkfs.xfs: cannot open /dev/sdav2: Device or resource busy

Maybe you have multipath enabled and it has claimed the new device. Run this command to check whether multipath has picked it up:

~# multipath -ll /dev/sdav
2796a30585a4b6e45 dm-12 ,
size=2.2T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- #:#:#:# sdav 66:48  active undef running

Then, to fix it, remove the device from multipath:

~# multipath -f /dev/sdav
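
If multipath keeps claiming the device after a rescan or a reboot, one option (this goes beyond the original fix, so adjust it to your setup) is to blacklist it by WWID in /etc/multipath.conf, using the WWID shown in the multipath -ll output above, and then reload the daemon:

blacklist {
    wwid 2796a30585a4b6e45
}

~# systemctl reload multipathd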

And finally you can format it without any problem:

~# mkfs -t xfs -f -i size=2048 /dev/sdav2
meta-data=/dev/device            isize=2048   agcount=4, agsize=3277258 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=13109032, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=6400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Resize a VPS disk in an automated way

These are the commands I run to automatically resize the VPS disks after I have grown the disk through the Proxmox interface.

#!/bin/bash

# This is only needed if you use GPT type tables
echo "Fix" | /usr/local/sbin/parted  ---pretend-input-tty /dev/sdb print
# Resize partition 1 using all the space
/usr/local/sbin/parted /dev/sdb resizepart 1 100%
#  I use LVM so the physical volume must be resized first
pvresize /dev/sdb1
# Resize the logical volume using all space available
lvresize /dev/vgdata/home -l +100%FREE
# Finally notify file system to use all the space
resize2fs /dev/vgdata/home

Note: I use parted 3.2; if your distribution ships an older version, uninstall it and install it from source.
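
To confirm that the filesystem actually picked up the new space, a quick check afterwards (assuming the vgdata/home volume from the script above is mounted on /home):

~# lvs vgdata
~# df -h /home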

OpenLDAP – how to set the root password for the config schema

If we need to set the password for the admin user under the config branch (cn=admin,cn=config), we can follow these steps:

Create an LDIF file named rootpw_cnconfig.ldif with the following content (the password can be given in plain text or, better, as a hash generated with slappasswd):

dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: NEWPASSWORD

And run the following command:

ldapmodify -Y EXTERNAL -H ldapi:/// -f rootpw_cnconfig.ldif
SASL/EXTERNAL authentication started
SASL username: gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth
SASL SSF: 0
modifying entry "olcDatabase={0}config,cn=config"

From now on, we can use this password in the ldap commands as follows whenever we need to make changes to the schema:

ldapmodify -D "cn=admin,cn=config" -W -f file.ldif

How to add indexes in OpenLDAP

If you find entries like this in the OpenLDAP log when searches are performed, it means those searches are being run against an attribute that has no index in LDAP:

slapd[14318]: <= mdb_equality_candidates: (associatedDomain) not indexed

These searches can be quite expensive if the LDAP directory holds a lot of data, so if this kind of search is frequent it is worth adding an index on the attribute. To do so, we create a file add-index.ldif with the following content:

dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcDbIndex
olcDbIndex: associatedDomain eq

In this case we create an eq index, as you can see on the last line, because the searches are of the form ‘associatedDomain=domain’. Note that replace overwrites any existing olcDbIndex values, so if the database already defines other indexes you should list them as well or use add: olcDbIndex instead. There are other index types depending on the kind of search that will be run against the attribute:

pres should be used if searches of the form ‘objectclass=person’ or ‘attribute=mail’ will be used.
approx MUST be used if searches of the form ‘sn~=person’ (a ‘sounds-like’ search) will be used.
eq should be used if searches of the form ‘sn=smith’ will be used, i.e. no wildcards are included (uses the EQUALITY rule only).
sub should be used if searches of the form ‘sn=sm*’ will be used, i.e. wildcards are included (uses the SUBSTR rule). This rule may be refined by using subinitial (optimised for ‘sn=s*’), subany (optimised for ‘sn=*n*’) or subfinal (optimised for ‘sn=*th’). One or more sub parameters may be included.

We could also combine several index types like this:

olcDbIndex: associatedDomain eq,sub

Once the file has been created, we just need to run the following command and enter the password of the config admin user:

$ ldapadd -h localhost -D "cn=admin,cn=config" -f add-index.ldif -W
Enter LDAP Password: 
modifying entry "olcDatabase={1}mdb,cn=config"

If we don’t know the password for cn=admin,cn=config, we can change it as explained in the post above on setting the config schema root password, or use the following command, which does not require explicit authentication because it relies on the user’s own SASL EXTERNAL authentication:

ldapmodify -Y EXTERNAL -H ldapi:/// -f add-index.ldif
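
Afterwards, you can verify that the index has been applied by reading the olcDbIndex values back, again relying on SASL EXTERNAL authentication (this assumes you run it as root on the LDAP server itself):

ldapsearch -Y EXTERNAL -H ldapi:/// -b "olcDatabase={1}mdb,cn=config" olcDbIndex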