Talend and “The Data Vault”

In my previous blog, “Beyond ‘The Data Vault’”, I examined various data storage options and a practical architecture/design for an Enterprise Data Vault Warehouse.  As you may have realized by now, I am quite smitten with this innovative data modeling methodology and recommend that anyone developing a ‘Data Lake’ or Data Warehouse on Big Data platforms consider it a critical design paradigm.  Dan Linstedt has worked hard to expand this concept and recently introduced Data Vault 2.0 (shameless plug), which extends the methodology, the architecture, and the data model (HUB/LNK/SAT) implementation and best practices.

Today’s massive acceleration and accumulation of data truly demands robust technology, a reliable architecture, and clear best practices to accommodate it, access it, and extract ‘Real Value’ from it.  Moving this voluminous data from point ‘A’ to point ‘B’ is a serious challenge.  Talend is designed for this challenge and embraces it head on.  So where do Talend and “The Data Vault” converge?  Well, let’s find out…

Talend Data Fabric

Data Integration is the foundation upon which data system architects and implementation specialists develop ETL/ELT processes for getting data from somewhere to elsewhere.  We then add Data Quality, Data Governance, Master Data Management, and Web Services as needed, implementing many acronyms along the way: ESB, SOAP, REST, CDC, HDFS, Spark, Hive, and many more.  Talend Data Fabric is a full-featured development platform you are obviously familiar with, so let’s leave the marketing fluff and sales pitches to those who live and breathe that sort of thing.  In my mind, it’s the technology, the methodology, and the best practices that truly matter.

In the case of a Data Vault (DV), Talend is emerging as a key technology used by many companies to address their DV data processing requirements.  There are two specific capabilities of interest:

  • Schema Generation - using external metadata, the ability to synchronize a Data Vault model with the ETL/ELT toolset (a rough sketch of this idea follows the list)
  • Job Templates - using the synchronized Data Vault model, the ability to generate and maintain Talend jobs for ingestion, manipulation, and subsequent downstream processing
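
To make the schema generation idea a bit more concrete, here is a minimal, purely illustrative sketch of how hub DDL could be rendered from external metadata.  None of this is a Talend feature; the metadata shape, table naming conventions, and MD5 hash-key column are assumptions I am making for the example.

    # Illustrative sketch only: render Data Vault hub DDL from simple metadata.
    # The metadata shape, naming conventions, and hash-key column are assumptions,
    # not a Talend-generated artifact.

    HUB_TEMPLATE = """CREATE TABLE {hub_name} (
      {hub_name}_hash_key CHAR(32)    NOT NULL PRIMARY KEY,  -- MD5 of the business key
      load_date           TIMESTAMP   NOT NULL,
      record_source       VARCHAR(64) NOT NULL,
      {business_key_columns}
    );"""

    def generate_hub_ddl(entity: str, business_keys: dict) -> str:
        """Build the CREATE TABLE statement for one hub from its business key columns."""
        columns = ",\n  ".join(
            f"{name} {sql_type} NOT NULL" for name, sql_type in business_keys.items()
        )
        return HUB_TEMPLATE.format(hub_name=f"hub_{entity}", business_key_columns=columns)

    if __name__ == "__main__":
        # Example: a customer hub keyed on a single business key column.
        print(generate_hub_ddl("customer", {"customer_number": "VARCHAR(20)"}))

In practice this would extend to links and satellites and would be driven by the same metadata that defines the Data Vault model itself.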

While today’s Talend does not yet support automated facilities to manage these needs, work has begun to examine how, where, and when to implement them.  Talend is happy to be working closely with Dan Linstedt on these features, and we hope to bring the most robust ‘Data Vault’ enabled data processing tool to market soon.  If I have any sway, that is.  Meanwhile, many customers currently using Talend with their ‘Data Vault’ are happily demonstrating considerable success the old-fashioned way.  They code it!

WWDVC 2016

Talend has joined the WWDVC 2016 - World Wide Data Vault Consortium, held in Stowe, Vermont this May 25th through May 29th, as a Platinum Sponsor.  As an event sponsor presenting the product roadmap, features (for those new to Talend), and specifics on how to build data integration jobs for Data Vaults, we expect to finally bring ETL/ELT technology to this prestigious audience.  A three-hour hands-on session will demonstrate how to build Talend jobs for both ‘Relational’ and ‘Big Data’ Data Vault environments.  Ed Ost, Talend’s ‘Director, Worldwide Technology Alliances, Channel Partners’, has fashioned the simple Enterprise Data Vault depicted in my previous blog into a sandbox for attendees to learn from and play in.

Here are some sample implementations using the Amazon AWS cloud-based infrastructure:

The Relational flow demonstrates a dump/load technique: source data is dumped into an AWS S3 bucket and then processed into a Data Vault defined and stored in RDS.  From there, one can ELT the data directly into a de-normalized Redshift Data Mart for analytical queries.
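
As a rough illustration of the first two hops in that Relational flow, the sketch below pushes a CSV extract into S3 and then bulk-loads it into an RDS staging table.  It assumes an RDS PostgreSQL instance, the boto3 and psycopg2 client libraries, and placeholder bucket, table, and credential names; the Talend jobs that turn the staged rows into hubs, links, and satellites are not shown.

    # Sketch of the dump/load hops in the Relational flow.  Bucket, table, and
    # connection details are placeholders; none of this is generated by Talend.
    import boto3
    import psycopg2

    def dump_to_s3(local_file: str, bucket: str, key: str) -> None:
        """Dump step: push the source extract into the S3 staging bucket."""
        boto3.client("s3").upload_file(local_file, bucket, key)

    def load_into_rds(bucket: str, key: str, conn_params: dict, staging_table: str) -> None:
        """Load step: pull the staged file down and COPY it into an RDS PostgreSQL staging table."""
        local_copy = "/tmp/staged_extract.csv"
        boto3.client("s3").download_file(bucket, key, local_copy)
        with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
            with open(local_copy) as extract:
                cur.copy_expert(f"COPY {staging_table} FROM STDIN WITH CSV HEADER", extract)

    if __name__ == "__main__":
        dump_to_s3("orders_extract.csv", "dv-staging-bucket", "raw/orders_extract.csv")
        load_into_rds(
            "dv-staging-bucket",
            "raw/orders_extract.csv",
            {"host": "my-rds-host", "dbname": "dv", "user": "etl_user", "password": "..."},
            "staging.orders",
        )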

The Big Data flow demonstrates a direct read/load from the source data using ‘Sqoop’ into an S3/Redshift Data Vault with Point-In-Time and Bridge tables that allow equi-join queries, and it also uses direct ELT to populate a de-normalized Redshift Data Mart for analytical queries.
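
The ELT step in that flow boils down to an INSERT ... SELECT that equi-joins the satellites through the Point-In-Time table.  Below is a hedged sketch of that pattern executed against Redshift via psycopg2; the pit_customer, hub_customer, satellite, and mart table and column names are all hypothetical.

    # Hedged sketch of the PIT-based equi-join ELT into a de-normalized mart table.
    # All table and column names are hypothetical; adjust them to your own model.
    import psycopg2

    ELT_SQL = """
    INSERT INTO mart.customer_flat (customer_number, customer_name, customer_address, snapshot_date)
    SELECT h.customer_number,
           sn.customer_name,
           sa.customer_address,
           p.snapshot_date
    FROM   dv.pit_customer         p
    JOIN   dv.hub_customer         h  ON h.hub_customer_hash_key  = p.hub_customer_hash_key
    JOIN   dv.sat_customer_name    sn ON sn.hub_customer_hash_key = p.hub_customer_hash_key
                                     AND sn.load_date             = p.sat_customer_name_load_date
    JOIN   dv.sat_customer_address sa ON sa.hub_customer_hash_key = p.hub_customer_hash_key
                                     AND sa.load_date             = p.sat_customer_address_load_date;
    """

    def run_elt(conn_params: dict) -> None:
        """Execute the equi-join ELT inside Redshift (connection details are placeholders)."""
        with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
            cur.execute(ELT_SQL)

The Point-In-Time table carries, for each snapshot date, the load date of the satellite row in effect at that time, which is what turns the usual range lookups against satellites into simple equi-joins.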

The Big Data flow using Spark demonstrates a variant where the source data is read and loaded into an S3/Hive Data Vault using Spark, and then a more traditional ETL process populates a de-normalized Redshift Data Mart for analytical queries.
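
For the Spark variant, the first hop could look something like the PySpark sketch below: read a source extract from S3, derive the hub hash key, and append the rows to a Hive-managed hub table.  The bucket path, database, and table names are assumptions for illustration, not the sandbox configuration used at the conference.

    # PySpark sketch: load a source extract from S3 into a Hive-managed hub table.
    # Paths, database, and table names are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("dv_hub_customer_load")
             .enableHiveSupport()
             .getOrCreate())

    source = (spark.read
              .option("header", "true")
              .csv("s3a://dv-staging-bucket/raw/customers/"))

    hub_customer = (source
        .select(
            # Hash the business key to form the hub key (Data Vault 2.0 style).
            F.md5(F.upper(F.trim(F.col("customer_number")))).alias("hub_customer_hash_key"),
            F.col("customer_number"),
            F.current_timestamp().alias("load_date"),
            F.lit("crm_extract").alias("record_source"))
        .dropDuplicates(["hub_customer_hash_key"]))

    # Append hub rows to the Hive-managed table (created on the first run).
    hub_customer.write.mode("append").saveAsTable("dv.hub_customer")

A real hub load would also anti-join against the keys already in the hub so known business keys are not re-inserted; that detail is left out to keep the sketch short.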

This year’s event will be special, not because I am speaking, no; but because W. H. (Bill) Inmon is!  The Father of Data Warehousing himself.  Other Talend partners, like Analytix/DS and Snowflake, are also sponsoring this event, so it should prove to be the absolute best place for meeting, networking, and discussing Data Vaults with thought leaders from around the world.

Is Your “Data Lake” a Swamp?

Now that we’ve looked at how Talend and the ‘Data Vault’ work together, I want to address a key topic that discomforts me.  As most ‘buzz’ words go, the ‘Data Lake’ is fast becoming the most annoying, misunderstood, and misused term yet.  The essence of the idea is to dump ALL your data into this puddle, this pond, this lake, this ocean; or perhaps this swamp, this cesspool!  Without technology, methodology, and best practices, that’s exactly what you’ll get.  I do like the idea of dumping everything into a Data Lake.  In fact, it adheres to one of Dan Linstedt’s precepts: “100% of the data 100% of the time”.  If you put all your data into the lake, then you’ll already have it when a business need arises.  No downtime required!  The real focus should be on the query instead.  I’m sure you’ve guessed that Talend and the ‘Data Vault’ are the answer.  Creating a Data Lake based upon the Data Vault model and methodology, in a Big Data environment, with robust data integration tools (Talend) plus pliable best practices is, I believe, the right way to go.  Swim in the lake, don’t drown in it!

Conclusion

So, this is not my usual lengthy blog; yes, I can keep it brief.  The reality is: ‘busy is, as busy does’.  I hope to see many of you Data Vault enthusiasts at the WWDVC in May.  If not, let’s keep the dialog going here; post your comments, ask your questions, raise the debate!  I’d be more than happy to respond in kind.

Comments

Kent Graziano
Dale - Nice article. Assuming you could just as easily follow the first two patterns using Snowflake DB instead of Redshift?
GraemeP
Hi Dale, any update on the plans for Talend to support automated schema generation and job templates for Data Vault?
JimF
When landing data in Hadoop and using Hive, do you create external tables on the raw data files that consist of hubs/satellites/links or do you use Hive managed tables and transform the data?
Dale Anderson
@JimF Thanks for posting your question on Data Vaults in Hadoop. You would definitely use Hive managed tables against the raw data files. Per Dan Linstedt: "Typically Hive managed 'internal' tables are used for better access and speed - at least for the structured data, and at a minimum for the hubs and links. SOME satellites can sit in Hive 'external' tables (such as word docs, etc.) as long as the hash keys are properly computed."
