Data Integration (DI) Layer
Data Integration (DI) is the gateway for incoming data into the Data Integration Hub (DIH) system. The DI components are delivered as part of the GigaSpaces DIH Package.
DI contains four components, which are responsible for reading and analyzing Kafka messages and for pushing them into the Space.
DI Layer Overview
The DI module consists of the following four components:
1. Apache Flink
GigaSpaces uses Apache Flink, an open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale, and it allows different types of applications to be deployed at runtime. Flink also supports both streaming and batch modes, which is useful for periodic batch updates. Data pipeline applications are among the most common applications powered by Flink, which is why we chose it for our Smart DIH solution.
Extract transform load (ETL) is a common approach used to convert and move data between storage systems. Often, ETL jobs are periodically triggered to copy data from transactional database systems to an analytical database or data warehouse. Data pipelines serve a similar purpose to ETL jobs in that they transform and enrich data and can move it from one storage system to another. However, they operate in a continuous streaming mode instead of being periodically triggered.
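To make this concrete, the following is a minimal sketch of a continuously running Flink data pipeline that consumes a Kafka topic and transforms each record. The broker address, topic name and the trivial transform are illustrative assumptions, not the actual DI job:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// A continuously running pipeline: consume a Kafka topic, transform each
// record, emit the result. Broker, topic and transform are illustrative.
public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed broker address
                .setTopics("cdc-events")                 // assumed topic name
                .setGroupId("di-sketch")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // Stand-in for real transformation/enrichment logic
        stream.map(String::toUpperCase).print();

        env.execute("di-pipeline-sketch");
    }
}
```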
Additional Information: Apache Flink.
2. Metadata Manager (MDM)
The Metadata Manager (MDM) is a stateful data service which communicates with external components via REST APIs. It can be deployed as a standalone application and uses ZooKeeper (ZK) as a persistent data store.
Functionality
The MDM stores, edits and retrieves information about the following:
- The source table structure
- The structure mapping to the Space type
- The data type conversion maps
- The configurations of the DI Manager and Pipeline, which are DI layer components
- The pluggable CDC templates
- The created and dropped types in the Space
The MDM refreshes its metadata on demand from sources into the MDM data store (ZK). It compares and repairs stored metadata against the objects created in the Space, and provides information about stored metadata over REST to the UI and the DI Manager.
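As an illustration of this pattern, the sketch below persists and reads a piece of table metadata in ZooKeeper using Apache Curator. The znode path and payload are hypothetical and do not reflect the MDM's actual storage layout:

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Store, edit and retrieve a piece of table metadata in ZooKeeper.
// The znode path and JSON payload are hypothetical.
public class MetadataStoreSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String path = "/mdm/tables/customers"; // hypothetical metadata path
        byte[] schema = "{\"columns\":[\"id\",\"full_name\"]}"
                .getBytes(StandardCharsets.UTF_8);

        if (zk.checkExists().forPath(path) == null) {
            zk.create().creatingParentsIfNeeded().forPath(path, schema); // store
        } else {
            zk.setData().forPath(path, schema);                          // edit
        }

        byte[] stored = zk.getData().forPath(path);                      // retrieve
        System.out.println(new String(stored, StandardCharsets.UTF_8));
        zk.close();
    }
}
```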
3. DI Processor
The DI Processor is a Java library deployed to the Flink cluster. It is operated by the Flink Task Manager and is part of the Flink job. It processes Kafka messages and automatically identifies the consumed message format based on a pluggable CDC template. It then converts each message into a Space document and writes the document to the Space.
Flow
- Parsing Kafka messages
- Determining source table information
- Determining the CDC operation type (INSERT, UPDATE or DELETE)
- Extracting all column data from the parsed message. The extraction information is provided by the MDM service and includes the names of the attributes, their types, and the JSON paths used to extract the values.
- Storing the extracted data as a table row (e.g. a Flink Row, for interoperability), as sketched below
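The flow above might look like the following minimal sketch, which assumes a Debezium-style CDC envelope and a two-column table:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.types.Row;

// Parse a Debezium-style CDC envelope into a Flink Row. The field paths
// and the two-column table are assumptions made for the sketch.
public class CdcParseSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static Row parse(String json) throws Exception {
        JsonNode msg = MAPPER.readTree(json);

        String table = msg.at("/source/table").asText(); // source table information
        String op = msg.at("/op").asText();              // "c"=INSERT, "u"=UPDATE, "d"=DELETE
        System.out.println("table=" + table + ", op=" + op);

        // In the real processor, attribute names, types and JSON paths come
        // from the MDM; here they are hard-coded for illustration.
        JsonNode after = msg.at("/after");
        return Row.of(after.at("/id").asInt(), after.at("/full_name").asText());
    }
}
```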
The SpaceDocumentMapper is responsible for converting the table row into the corresponding SpaceDocuments, which are stored in the OpDoc entity together with the operation type.
The conversion is performed according to the source table name. Multiple types of SpaceDocuments can be generated from a single table row.
Conversion may include:
- Mapping of a row column name to the Space document attribute name
- Conversion of types
- Non-trivial transformations
- Calculated expressions
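Such a conversion might look like the sketch below, which maps a three-column row to a SpaceDocument; the Space type name, attribute names and timestamp conversion are assumptions:

```java
import java.time.Instant;

import org.apache.flink.types.Row;

import com.gigaspaces.document.SpaceDocument;

// Map a three-column row (id, full_name, created_at) to a SpaceDocument.
// The type name, attribute names and timestamp conversion are assumptions.
public class RowMapperSketch {
    static SpaceDocument toSpaceDocument(Row row) {
        return new SpaceDocument("Customer")                         // assumed Space type name
                .setProperty("id", (Integer) row.getField(0))
                .setProperty("fullName", (String) row.getField(1))   // column-name to attribute-name mapping
                .setProperty("createdAt",                            // type conversion: ISO string -> epoch millis
                        Instant.parse((String) row.getField(2)).toEpochMilli());
    }
}
```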
OpDoc
The OpDoc entity contains the following information:
- Operation (insert, delete or update)
- Space document
- Partition ID
- Routing property name
- Transaction ID
Example:
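A hedged sketch of an OpDoc-like structure (the actual internal class may differ, and the field names are illustrative):

```java
import java.io.Serializable;

import com.gigaspaces.document.SpaceDocument;

// An OpDoc-like carrier; the field names are illustrative and the actual
// internal class may differ.
public class OpDoc implements Serializable {
    public enum Operation { INSERT, UPDATE, DELETE }

    private Operation operation;        // CDC operation type
    private SpaceDocument document;     // the converted Space document
    private int partitionId;            // target Space partition
    private String routingPropertyName; // property used for routing
    private String transactionId;       // source transaction identifier

    public int getPartitionId() { return partitionId; }
    // Getters and setters for the remaining fields omitted for brevity
}
```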
Keyed by partition ID
This is the process that attaches a Space partition ID to each OpDoc according to the SpaceType routing definition.
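The exact routing hash is internal to GigaSpaces, but a simple modulo-hash scheme along the following lines conveys the idea (an assumption, not the actual algorithm):

```java
// Derive a partition ID from a routing property value, assuming a simple
// modulo-hash scheme; the actual GigaSpaces routing hash is internal.
static int partitionIdFor(Object routingValue, int partitionCount) {
    int h = routingValue.hashCode();
    // Guard against Integer.MIN_VALUE, whose absolute value overflows
    return (h == Integer.MIN_VALUE ? 0 : Math.abs(h)) % partitionCount;
}
```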
Time window aggregation
This process aggregates all OpDocs received during a certain time period for an efficient Space write operation.
Write to Space
In this phase, all aggregated OpDocs are written asynchronously to the appropriate partition in the Space, using the Space task execution mechanism.
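Taken together, the keying, windowing and write steps might look like this Flink sketch, which batches OpDocs per partition per window before the asynchronous Space write; the window length and the OpDoc accessor are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Key OpDocs by partition ID and batch them per tumbling window; each batch
// would then be handed to the asynchronous Space writer. The window length
// and the OpDoc accessor are assumptions.
public class SpaceWriteStage {
    static DataStream<List<OpDoc>> batchByPartition(DataStream<OpDoc> opDocs) {
        return opDocs
                .keyBy(OpDoc::getPartitionId)
                .window(TumblingProcessingTimeWindows.of(Time.milliseconds(500)))
                .process(new ProcessWindowFunction<OpDoc, List<OpDoc>, Integer, TimeWindow>() {
                    @Override
                    public void process(Integer partitionId, Context ctx,
                                        Iterable<OpDoc> docs, Collector<List<OpDoc>> out) {
                        List<OpDoc> batch = new ArrayList<>();
                        docs.forEach(batch::add);
                        out.collect(batch); // one batch per partition per window
                    }
                });
    }
}
```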
4. DI Manager
The DI Manager is an interface for communicating with Flink. It also communicates with external components, such as the UI and the MDM, via REST API. In addition, the DI Manager retrieves the correct schema and table structures from a Source of Record (SOR) and stores them in the MDM.
The DI Manager-Flink operations are:
- Creating, editing and dropping Flink jobs
- Starting and stopping Flink jobs
- Getting Flink job status
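For example, job status can be retrieved through Flink's public REST API; a minimal sketch follows (the JobManager address is an assumption, and the DI Manager's internal calls may differ):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Query job status through Flink's public REST API (GET /jobs/overview).
// The JobManager address is an assumption; the DI Manager's own endpoints
// and calls are internal and may differ.
public class FlinkStatusSketch {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/overview"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON list of jobs and their states
    }
}
```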