Extract, Transform and Load
The Data Migration and Integration Challenge
Overview
The business world runs on data.
Data supports competitive advantage
and data supports collaboration.
The ability to effectively manage and communicate data
has become a critical aspect of a corporation's ability
to effectively do business.
Data buried in a single archival data store isn't enough:
knowledge is power, but information is communication and
effective communication, internal and external, makes a business strong.
Moving and managing data on behalf of a variety of internal
and external consumers
is now a mandatory core competency for a successful business.
Extracting data from data stores,
transforming data to meet different requirements
and loading data into targeted data stores is
a mission critical business challenge and opportunity.
This paper is intended to give a brief overview of the
concepts and terms associated with the field of data
extraction, transformation and loading (ETL).
SchemaResearch's contribution is briefly outlined,
for examples and real world cases consult documents "SRTransport in a Hospital Setting"
and "SRTransport Deployed".
Extraction
Data can come from many different sources, among them:
-
Databases
-
XML Documents
-
Spreadsheets
-
ERP Systems
Within the data processing industry, each of these areas by themselves
constitute major and complex fields of research.
The tools and terminology can be extremely sophisticated.
However, each discipline is motivated by the need
to unambiguously structure and communicate data.
Databases are specialized data stores which allow sophisticated
organization of data.
While technically complex and presenting daunting modeling challenges,
databases make data access relatively simple from a programming point of view.
XML is a simple, flexible and transparent methodology for organizing and transmitting
structured data.
Less a data store and more a communication protocol,
XML is used to convey rich data sets between data management systems.
Accounting tools that have been computerized,
spreadsheets are useful and familiar ways of organizing data.
They include a variety of easily mastered analytical tools
and provide tools for the clear display of data.
ERP (Enterprise Resource Planning) Systems are applications which aid businesses in
managing inventories, manufacturing and planning, among other things.
They often are required to exchange data with partner business' ERP systems
to improve supply chain management and address related issues.
Transformation
In moving from one repository or repository type to another,
data is often repurposed to fill significantly new needs.
Data warehouse contents may be summarized and simplified as
they are digested for easier access in a data mart.
ERP transaction log records may be reformatted from RDBMS
columns and tables into XML documents sent to the systems
of supply chain partners.
Typical transformation types include the following:
-
Restructure
-
Substitute
-
Derive
-
Reformat
Restructure.
Data can often be usefully restructured to improve or clarify
access for new purposes.
Data warehouses, for example, are used to store long term data
which may be accessed for data intensive statistical studies.
Data marts may be used to encapsulate smaller data sets used
for short window reporting purposes.
Because the querying problems associated with each purpose
are very different,
the data may be structured differently to emphasize improved
access appropriate to each problem.
However, the underlying data may be the same and may
derive from the same source.
Part of a transform before load into the two different
data stores may be structural,
laying the data out in different forms to facilitate solving different
problems.
Substitute.
It is often useful to alter data,
often to facilitate reporting.
For example,
replacing the nulls with "N/A"
or "Unknown" might make reports simpler to develop
and bring a slight performance gain.
Acronyms and abbrievations can be replaced with
meaningful, readable strings
or data codified under one rubric
can be recodified under another.
Derive.
Computing aggregate values such as sums and averages
contributes substantially to the cost of running reports;
by developing data marts and deriving data during transform,
that cost can be eliminated.
Reformat.
Similar to restructuring,
reformating consists of converting data into an entirely different
form, e.g. reconstituting relational database data as XML documents.
This is particularly useful in transmitting data between
systems - often the systems involved are databases and the
communication protocol is XML based.
Load
After extracting and transforming data,
it is loaded into target repositories.
These include:
-
Data Warehouse
-
Data Mart
-
Data Interchange
A DataMart is a specialized repository derived from primary data stores and targeting
user communities with specific needs such as specialized reporting.
DataMarts are designed with user access, understanding and usability as priorities.
Data is typically expressed in terms that are familiar to the target community and
summary data is generated on load to simplify reporting and analysis tasks.
Data loaded into DataMar ts may be derived from operational activity (in which
case, older operational activity is typically aged off, often already captured in a DataWarehouse)
or from long term stores, such as DataWarehouses.
A DataWarehouse is a large scale archival repository.
While it can be used for more massive analytic studies, such as long term trending,
it is more typically a resource from which smaller, more focused DataMarts are derived.
Data loaded into a DataWarehouse may have been derived from transaction logs and
operational activity which needs to be tracked over long periods of time.
DataMining and Decision Support Systems often rely on DataWarehouses.
Data Interchange.
These are repositories made available specifically for consumption
by entities external to the system.
Data may be organized for high performance extraction and
will typically be transformed and loaded into other data stores for analytical purposes.
SRTransport
SchemaResearch provides tools that support sophisticated
extract, transform and load problems.
The flagship product,
SRTransport,
provides a unique,
intuitive user interface supporting the rapid development, testing and deployment of
ETL solutions.
Both extraction and load support all popular
databases (Oracle, Sybase, SQL-Access and DB2)
as well as popular data exchange formats (XML, CSV).
It provides a powerful language which supports complex
transforms and the ability to script transforms in
popular languages (XSL, Python, Perl, sed) is under development.
SRTransport is a simple enough tool to empower knowledge workers
and a powerful enough tool to empower DBAs and IT workers.
It can be used both for stand alone, one time data transfers,
but is more profitably and powerfully used in supporting
day-to-day information supply operations.