888-285-2754   —   510-903-1055   —   510-903-1072 fax
  Products Services Download
  Support Contact Purchase
 

Delivering data   —   where and when you need it

 

Extract, Transform and Load

The Data Migration and Integration Challenge

Overview

The business world runs on data. Data supports competitive advantage and data supports collaboration. The ability to effectively manage and communicate data has become a critical aspect of a corporation's ability to effectively do business. Data buried in a single archival data store isn't enough: knowledge is power, but information is communication and effective communication, internal and external, makes a business strong. Moving and managing data on behalf of a variety of internal and external consumers is now a mandatory core competency for a successful business. Extracting data from data stores, transforming data to meet different requirements and loading data into targeted data stores is a mission critical business challenge and opportunity.

This paper is intended to give a brief overview of the concepts and terms associated with the field of data extraction, transformation and loading (ETL). SchemaResearch's contribution is briefly outlined, for examples and real world cases consult documents "SRTransport in a Hospital Setting" and "SRTransport Deployed".

Extraction

Data can come from many different sources, among them:
  • Databases
  • XML Documents
  • Spreadsheets
  • ERP Systems

Within the data processing industry, each of these areas by themselves constitute major and complex fields of research. The tools and terminology can be extremely sophisticated. However, each discipline is motivated by the need to unambiguously structure and communicate data.

Databases are specialized data stores which allow sophisticated organization of data. While technically complex and presenting daunting modeling challenges, databases make data access relatively simple from a programming point of view.

XML is a simple, flexible and transparent methodology for organizing and transmitting structured data. Less a data store and more a communication protocol, XML is used to convey rich data sets between data management systems.

Accounting tools that have been computerized, spreadsheets are useful and familiar ways of organizing data. They include a variety of easily mastered analytical tools and provide tools for the clear display of data.

ERP (Enterprise Resource Planning) Systems are applications which aid businesses in managing inventories, manufacturing and planning, among other things. They often are required to exchange data with partner business' ERP systems to improve supply chain management and address related issues.

Transformation

In moving from one repository or repository type to another, data is often repurposed to fill significantly new needs. Data warehouse contents may be summarized and simplified as they are digested for easier access in a data mart. ERP transaction log records may be reformatted from RDBMS columns and tables into XML documents sent to the systems of supply chain partners. Typical transformation types include the following:

  • Restructure
  • Substitute
  • Derive
  • Reformat

Restructure. Data can often be usefully restructured to improve or clarify access for new purposes. Data warehouses, for example, are used to store long term data which may be accessed for data intensive statistical studies. Data marts may be used to encapsulate smaller data sets used for short window reporting purposes. Because the querying problems associated with each purpose are very different, the data may be structured differently to emphasize improved access appropriate to each problem. However, the underlying data may be the same and may derive from the same source. Part of a transform before load into the two different data stores may be structural, laying the data out in different forms to facilitate solving different problems.

Substitute. It is often useful to alter data, often to facilitate reporting. For example, replacing the nulls with "N/A" or "Unknown" might make reports simpler to develop and bring a slight performance gain. Acronyms and abbrievations can be replaced with meaningful, readable strings or data codified under one rubric can be recodified under another.

Derive. Computing aggregate values such as sums and averages contributes substantially to the cost of running reports; by developing data marts and deriving data during transform, that cost can be eliminated.

Reformat. Similar to restructuring, reformating consists of converting data into an entirely different form, e.g. reconstituting relational database data as XML documents. This is particularly useful in transmitting data between systems - often the systems involved are databases and the communication protocol is XML based.

Load

After extracting and transforming data, it is loaded into target repositories. These include:

  • Data Warehouse
  • Data Mart
  • Data Interchange

A DataMart is a specialized repository derived from primary data stores and targeting user communities with specific needs such as specialized reporting. DataMarts are designed with user access, understanding and usability as priorities. Data is typically expressed in terms that are familiar to the target community and summary data is generated on load to simplify reporting and analysis tasks. Data loaded into DataMar ts may be derived from operational activity (in which case, older operational activity is typically aged off, often already captured in a DataWarehouse) or from long term stores, such as DataWarehouses.

A DataWarehouse is a large scale archival repository. While it can be used for more massive analytic studies, such as long term trending, it is more typically a resource from which smaller, more focused DataMarts are derived. Data loaded into a DataWarehouse may have been derived from transaction logs and operational activity which needs to be tracked over long periods of time. DataMining and Decision Support Systems often rely on DataWarehouses.

Data Interchange. These are repositories made available specifically for consumption by entities external to the system. Data may be organized for high performance extraction and will typically be transformed and loaded into other data stores for analytical purposes.

SRTransport

SchemaResearch provides tools that support sophisticated extract, transform and load problems. The flagship product, SRTransport, provides a unique, intuitive user interface supporting the rapid development, testing and deployment of ETL solutions. Both extraction and load support all popular databases (Oracle, Sybase, SQL-Access and DB2) as well as popular data exchange formats (XML, CSV). It provides a powerful language which supports complex transforms and the ability to script transforms in popular languages (XSL, Python, Perl, sed) is under development. SRTransport is a simple enough tool to empower knowledge workers and a powerful enough tool to empower DBAs and IT workers. It can be used both for stand alone, one time data transfers, but is more profitably and powerfully used in supporting day-to-day information supply operations.

Schema Research Corporation, founded 1992 webmaster