Big Data introduces new dimensions to data management. Taken together, these dimensions create a unique set of challenges that, if not addressed properly, can generate Big Problems.
Volume
Big Data volume is a relative concept. From gigabytes to exabytes, the palette of data volumes that organizations must process is broad.
Processing this breadth of volumes requires the support of advanced loading and transformation technologies such as ELT, distributed processing, parallelization, and MapReduce, natively supported by Talend.
Velocity
Real time is an important dimension of Big Data. Or more accurately, right time is. Big Data is about getting the right data to the right place at the right time, so that it can be leveraged efficiently. Of course, the more often Big Data is loaded and transformed, the greater the strain it puts on systems.
The key to timely Big Data management is Talend’s support for a broad palette of latencies, from sub-second to nightly or weekly batches and everything in between. Technologies such as Change Data Capture, event-based triggering, and Web Services are critical to the Velocity dimension.
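To make the idea of Change Data Capture concrete, here is a minimal sketch that compares two snapshots of a table and emits only the rows that changed. It is an illustration under stated assumptions, not Talend's implementation: the function name `capture_changes` and the snapshot layout (a dictionary keyed by primary key) are hypothetical, and production CDC typically reads database logs or triggers rather than comparing full snapshots.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]


def capture_changes(previous: Dict[int, Row], current: Dict[int, Row]) -> List[Change]:
    """Compare two snapshots keyed by primary key and list the change events.

    Hypothetical helper: only the differences are returned, so a downstream
    load can move the changed rows instead of reloading the whole table.
    """
    changes: List[Change] = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            changes.append(("delete", key, row))
    return changes


if __name__ == "__main__":
    before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
    after = {1: {"name": "Ada"}, 2: {"name": "Robert"}, 3: {"name": "Cy"}}
    for event in capture_changes(before, after):
        print(event)
```

Because only the changed rows travel downstream, the same source can be synchronized more often without a proportional increase in load.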
Variety
The most challenging dimension of Big Data is getting to the data. Spread across a broad variety of sources, storage technologies, locations, and owners, Big Data can be elusive. All of this data must nonetheless be collected and accounted for to support informed decision making.
Collecting Big Data where it resides leverages Talend’s broad connectivity to conventional and non-conventional sources, seamless support for many formats, and the ability to discover and inventory data sources.
Complexity
Beyond Volume and Velocity, Complexity greatly affects the ability to process Big Data. Big Data management turns raw data into actionable information, and involves complex transformations, algorithms, data cleansing, matching, and more. The Variety of sources, contents, and formats adds to this Complexity.
Processing Big Data is made possible by Talend’s support for advanced transformations, its native MapReduce mode, extensibility, and the ability to offload the processing to optimized systems such as Hadoop.
Validation
One of the processing complexities of Big Data is delivering information that the organization can rely on and that is proven to be trustworthy, consistent, and fresh.
This Validation dimension of Big Data requires confidence in the sources and producers, but also the ability to reconcile and deduplicate Big Data and to enrich it as needed, all at the right time. Talend’s unique Big Data Quality features natively enable this validation process.
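As an illustration of what reconciling and deduplicating records can involve, the sketch below normalizes a matching key and keeps the freshest record per key. It is a simplified example, not Talend's Data Quality logic: the field names `email` and `updated` and both helper functions are assumptions made for this example, and real matching usually relies on fuzzier rules than exact key equality.

```python
from typing import Dict, Iterable, List


def normalize_email(value: str) -> str:
    """Reduce formatting noise so equivalent addresses compare equal."""
    return value.strip().lower()


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Keep one record per normalized email, preferring the latest update.

    Assumes each record has an 'email' string and an 'updated' value that
    sorts chronologically (for example an ISO 8601 date string).
    """
    freshest: Dict[str, dict] = {}
    for record in records:
        key = normalize_email(record["email"])
        kept = freshest.get(key)
        if kept is None or record["updated"] > kept["updated"]:
            freshest[key] = record
    return list(freshest.values())


if __name__ == "__main__":
    rows = [
        {"email": " Ada@Example.com ", "updated": "2011-01-01", "city": "Paris"},
        {"email": "ada@example.com", "updated": "2011-03-01", "city": "Lyon"},
        {"email": "bob@example.com", "updated": "2011-02-01", "city": "Nice"},
    ]
    for row in deduplicate(rows):
        print(row)
```

Choosing the most recent record addresses freshness, while the normalization step addresses consistency across sources that format the same value differently.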
Lineage
Tracing the origin and transformation of any data item is a key challenge for Big Data, and contributes to a proper understanding of the Lineage of information used. It also enables impact analysis when any change occurs.
Talend’s comprehensive repository of metadata and transformation rules, together with its close integration of sources, targets, and processing engines, is a key contributor to effective Lineage.
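Lineage and impact analysis can be pictured as two traversals of the same graph of data flows. The sketch below is a hypothetical, in-memory model and not Talend's metadata repository: the class `LineageGraph`, its methods, and the data set names are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


class LineageGraph:
    """Records which data set feeds which, and answers lineage questions."""

    def __init__(self) -> None:
        self._sources: Dict[str, Set[str]] = defaultdict(set)  # target -> direct sources
        self._targets: Dict[str, Set[str]] = defaultdict(set)  # source -> direct targets

    def record(self, source: str, target: str) -> None:
        """Register that a transformation reads `source` and writes `target`."""
        self._sources[target].add(source)
        self._targets[source].add(target)

    @staticmethod
    def _walk(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
        """Collect every node reachable from `start` by following `edges`."""
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return seen

    def origins(self, item: str) -> Set[str]:
        """Lineage: everything that `item` was derived from."""
        return self._walk(item, self._sources)

    def impacted(self, item: str) -> Set[str]:
        """Impact analysis: everything that depends on `item`."""
        return self._walk(item, self._targets)


if __name__ == "__main__":
    graph = LineageGraph()
    graph.record("crm.customers", "staging.customers")
    graph.record("staging.customers", "report.revenue")
    print(graph.origins("report.revenue"))
    print(graph.impacted("crm.customers"))
```

Walking the edges backward answers where a value came from; walking them forward answers what a change to a source would affect.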
Talend Integration Suite supports the advanced transformation modes required for Big Data Integration:
- traditional, code generation-based ETL (right-time to batch)
- RDBMS-leveraging ETL
- highly scalable MapReduce-based FileScale engine for flat files
- native support for Hadoop
All these modes are provided within the same development and runtime environment, bringing the greatest degree of flexibility to run the most appropriate Big Data Integration mode.
Talend Data Quality provides the advanced capabilities for profiling, cleansing and enriching Big Data, in close integration with Talend Integration Suite.
All transformation modes are supported (ETL, ELT, FileScale, Hadoop), making it the first Data Quality solution leveraging Hadoop natively.
In-database profiling also facilitates the discovery and analysis of Big Data, without requiring prior movement of the data sets.
Apache Hadoop, the open source MapReduce framework, is the de facto standard platform for storing and processing Big Data. Organizations are deploying Hadoop alongside or instead of existing relational and OLAP platforms, major database vendors are incorporating the technology into their own offerings, and cloud-based Hadoop environments like Amazon Elastic MapReduce open Big Data processing to organizations of all sizes.
Talend’s Big Data solutions simplify the process of importing data into Hadoop, exporting data from it, and processing data inside it. They provide the first enterprise open source data management solution to natively support Hadoop’s distributed computing architecture for high-performance, highly scalable Big Data Integration and Big Data Quality.
Talend offers native support for:
- Hadoop Distributed File System (HDFS)
- Pig and Hive, the data flow and data warehouse infrastructure built on top of Hadoop, for structured and complex data processing
- Sqoop, to automatically load and interact with relational data in Hadoop
By leveraging Hadoop’s MapReduce architecture for highly distributed data processing, Talend’s Hadoop components generate native Hadoop code and run data transformations directly inside Hadoop for maximum scalability. This solution makes it easy for customers to combine Hadoop-based integration with traditional data integration processes, whether ETL-based or ELT-based, for superior overall performance.
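For readers unfamiliar with the MapReduce model mentioned above, the following single-process sketch shows its three phases: map, group by key (shuffle), and reduce. It is only an illustration of the programming model and not code generated by Talend or run by Hadoop; the function names and the `region,amount` input format are assumptions. In Hadoop, the map and reduce phases run in parallel across many nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, float]


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Pair]],
    reducer: Callable[[str, List[float]], float],
) -> Dict[str, float]:
    """Run the map, shuffle, and reduce phases sequentially in one process."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def sales_mapper(line: str) -> Iterator[Pair]:
    """Turn a hypothetical 'region,amount' line into a (region, amount) pair."""
    region, amount = line.split(",")
    yield region.strip(), float(amount)


def sum_reducer(key: str, values: List[float]) -> float:
    """Total all amounts collected for one region."""
    return sum(values)


if __name__ == "__main__":
    lines = ["emea,10.5", "apac,4", "emea, 2.5"]
    print(map_reduce(lines, sales_mapper, sum_reducer))
```

Because each mapper call sees one record and each reducer call sees one key, the work can be split across machines without changing the logic, which is what makes the model scale.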