ETL Technology is Central to the Big Data World

It allows organisations to efficiently manage multiple operations with limited resources and improve customer data analytics

Published: 14 Feb 2014

IBM has always been a company in a state of constant renewal and reinvention. Through economic upheavals and natural disasters, tech bubbles and recessions, it continues to engage clients, governments, local communities and universities to improve how the world works. Differentiated by values, strengthened by collaboration and experienced through its people, today its solutions in Cloud, Big Data, Analytics, Mobility, Predictive Intelligence and more are making the world smarter. Through these blog posts, IBMers will explore some essential areas of business and life that are deeply interlinked with technology, and invite all readers to share their experiences and comments as we continue on this journey of discovery and innovation.

How many of us have been levied a late payment charge on our credit card and ended up haggling with customer service to get it waived? Many of us, I am sure.

I prided myself on being punctual and always paying on time. That was until the inevitable happened. I was once travelling overseas when I realised that my credit card payment was due in a day. I could still pay online, so I went to the bank’s site, but failed to remember my net banking password. And in order to reset the password, the bank wanted me to visit its nearest branch, which was then 9,000 miles away. I informed the bank of my dilemma, but received only a canned reply via email.

Cut to the present – the age of Big Data. I was narrating my aforementioned experience to the CIO of a leading bank. He asked me how Big Data could help him solve such a problem and uncover more issues so as to improve customer experience. There is a huge amount of valuable information captured in customer emails and phone calls that a bank receives on a daily basis – a classic example of Big Data. Today, we have all the tools to extract insights from the available information.

However, the CIO mentioned three requirements:

(1) The bank wants to do deeper analytics for its high-value customers only.

(2) It wants to bring down training costs for the people who execute Big Data projects.

(3) It wants to reuse existing business processes (for marketing campaigns, for instance) rather than design new ones.

These points resonate across industries that can leverage Big Data. One way to meet all three requirements is through the Extract-Transform-Load (ETL) technology that enterprises have been investing in for years. ETL technology is used to integrate data from multiple sources and combines three database functions into one tool: it pulls data out of a source database, transforms it, and places it into a target database.
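The three phases can be sketched in a few lines of Python. This is purely illustrative: the in-memory lists stand in for the source system and the target warehouse, and the field names are invented for the example.

```python
# A minimal sketch of the three ETL phases, using in-memory lists as
# stand-ins for the source and target databases (illustrative only).

def extract(source_rows):
    """Pull raw records out of the source system."""
    return list(source_rows)

def transform(rows):
    """Clean and reshape the data, e.g. normalise names and drop nulls."""
    return [
        {"id": r["id"], "name": r["name"].strip().title()}
        for r in rows
        if r.get("name")
    ]

def load(rows, target):
    """Place the transformed records into the target store."""
    target.extend(rows)
    return target

source = [{"id": 1, "name": "  alice smith "}, {"id": 2, "name": None}]
warehouse = load(transform(extract(source)), [])
```

Real ETL tools do the same extract-transform-load choreography, but against databases, files and message queues instead of Python lists, with connectors, scheduling and error handling built in.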

Requirement 1
A customer is unlikely to know that he or she is a high-value customer, and hence unlikely to mention this in email communication with the bank. The information about high-value customers resides in the data warehouse, from where it needs to be extracted and loaded into the Big Data environment – something that can be done easily with ETL tools.

Once the high-value customer information is available, email and call analytics can be restricted to only those customers on the high-value list. The bank implemented such an approach, and one of the biggest pain points for high-value customers turned out to be the inability to reset the net banking password online!
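The filtering step above is conceptually simple: join the incoming email stream against the high-value list extracted from the warehouse. A small sketch (all IDs and messages are hypothetical):

```python
# Sketch: restrict email analytics to customers found in the
# high-value list extracted from the warehouse (IDs are hypothetical).

high_value = {"C1001", "C1007"}  # customer IDs loaded from the warehouse

emails = [
    {"customer_id": "C1001", "text": "Cannot reset my net banking password"},
    {"customer_id": "C2040", "text": "Branch timings query"},
    {"customer_id": "C1007", "text": "Password reset link does not work"},
]

# Only high-value customers' emails enter the (expensive) analytics step.
to_analyse = [e for e in emails if e["customer_id"] in high_value]
```

Doing this filter early means the costly text-analytics stage runs on a fraction of the daily email volume.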

Requirement 2
Fortunately, ETL tools have evolved of late and can substantially reduce training costs for banks. Hadoop is an open source framework used extensively to process Big Data via MapReduce, a programming model for processing large amounts of data in parallel on a Hadoop cluster. However, finding resources skilled in Hadoop and MapReduce can be challenging. Modern ETL tools let the job developers who design data transformation workflows keep using their familiar tools, and translate those workflows into Hadoop jobs.

If an ETL developer has to find the IP addresses that have made more than a million requests on the bank’s website, he or she would normally need to write a MapReduce job that processes the web-log data stored in Hadoop. With the advancement in ETL technology, however, a job developer can use the standard ETL design tools to create an ETL job that reads data from multiple sources in Hadoop (files, Hive, HBase), then joins, aggregates and filters it to answer the query on IP addresses.

ETL tools have added the capability to “translate” an ETL job into a MapReduce job using Jaql technology: the ETL job is rewritten as a Jaql query, which is then executed as a MapReduce job on Hadoop. This is a key innovation that lowers the entry barrier to Big Data technology and allows ETL job developers to carry out Big Data analytics.

Requirement 3
Banks have already invested heavily in technology for running marketing campaigns. Their systems use the data warehouse as the source of information. However, the output of Big Data analytics resides in Hadoop. This is where ETL tools play a crucial role—they help take the output of the analytics and move it to the data warehouse so that it can be acted upon by the existing business processes of the enterprise.

An example: if a bank wants to identify customers who have been unhappy in their last three interactions with the bank via email or phone, this can be done using analytics on Hadoop. The output of the analytics would be a simple list of customer names. ETL tools can read this data from Hadoop and update the data warehouse. This enables the customer retention marketing campaign to act upon the additional insight without requiring any change to the campaign management tools, thereby saving significant costs for the enterprise.
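The final "load back" step can be sketched as a simple merge of the Hadoop output into an existing warehouse table. The table layout, column names and IDs below are hypothetical, chosen only to illustrate that the campaign tools see an updated flag rather than a new system:

```python
# Sketch: merge the Hadoop analytics output (a list of unhappy
# customers) into a warehouse table so that existing campaign tools can
# act on it. Table, columns and IDs are hypothetical.

warehouse = {
    "C1001": {"segment": "high-value", "unhappy_flag": False},
    "C2040": {"segment": "retail", "unhappy_flag": False},
}

hadoop_output = ["C1001"]  # customers unhappy in their last 3 interactions

for cust_id in hadoop_output:
    if cust_id in warehouse:
        # Update the existing row; the campaign process is unchanged.
        warehouse[cust_id]["unhappy_flag"] = True
```

Because the insight lands as a column update in the warehouse, the existing campaign management tooling picks it up through the queries it already runs.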

In summary, ETL tools are far from being passé. They are central to the Big Data ecosystem and play a crucial role in enabling data analytics.

 - By Manish Bhide, PhD, STSM for InfoSphere Information Server at India Software Labs, IBM India

 Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

  • Shwetank

    Nicely explained from technical point of view. Especially good for those who are working in the ETL technologies.....

    on Jul 13, 2015
  • Ambadas Kshirsagar

    Hello Manish, A very nicely illustrated article in simple words and easy to understand examples. It is important that existing systems continue to be compatible and effective though the newer ones are more efficient and capable. Redesigning a system involves lot of issues including challenges of functional and business rules, particularly, since the subject matter experts would have left or moved.

    on Apr 15, 2014
  • sundar

Hi Manish, nice article to know about Hadoop. Thanks

    on Mar 10, 2014
  • Abhijit Bhagwat

    Nice to read about something so technical in a simplified and effective version. With eCommerce auguring transforming the market place, Big Data seems to be coming of age.

    on Feb 27, 2014
  • sudarshan

well, if a cat has to pass through it can use the same door that was made for the dog too... Big Data analytics can be used directly instead of loading the results into these small DW solutions like SAP BW or something and then looking for analysis... kindly explain, thanks

    on Feb 25, 2014
    • Manish Bhide

Sudarshan, enterprises which have already invested heavily in a DW have a whole lot of processes/applications that make use of the warehouse. They are unlikely to change these processes/apps. Hence it is important for Big Data and warehouses to co-exist.

      on Mar 3, 2014
      • Sudarshan

Yes, of course, on that I agree completely. It's more for economic/commercial reasons that they cannot get into Big Data all at once... thanks Manish!!!

        on Mar 4, 2014
  • Namit Kabra

    Nice Article Manish

    on Feb 25, 2014