21 December 2023

DI Architecture & Components

High-Level DI Architecture

Components

There are several functional components which comprise the DI The Data Integration (DI) layer is a vital part of the Digital Integration Hub (DIH) platform. It is responsible for a wide range of data integration tasks such as ingesting data in batches or streaming data changes. This is performed in real-time from various sources and systems of record (SOR. The data then resides in the In-Memory Data Grid (IMDG), or Space, of the GigaSpaces Smart DIH platform. layer. All serve ongoing DI operations as part of the data integration layer of the Smart DIH Smart DIH allows enterprises to develop and deploy digital services in an agile manner, without disturbing core business applications. This is achieved by creating an event-driven, highly performing, efficient and available replica of the data from multiple systems and applications, platform.

The table below summarizes all the key DI layer components.

Name	Purpose	Details
CDC Change Data Capture. A technology that identifies and captures changes made to data in a database, enabling real-time data integration and synchronization between systems. Primarily used for data that is frequently updated, such as user transactions. (IIIDR) Source Database Agent	Captures the changes from the source System of Records. For example Oracle, DB2, MSSQL.	IIDR IBM Infosphere Data Replication. This is a solution to efficiently capture and replicate data, and changes made to the data in real-time from various data sources, including mainframes, and streams them to target systems. For example, used to move data from databases to the In-Memory Data Grid. It is used for Continuous Data Capture (CDC) to keep data synchronized across environments. Source database agent is a java application that usually installed on a source database server. IIDR Agent captures changes from the source database transaction log files in real time IIDR Oracle agent Port: 11001 IIDR Db2 zos agent Port: 11801 IIDR Db2 AS-400 agent Port: 11111 IIDR MSSQL agent Port: 10501
IIDR Target Kafka Apache Kafka is a distributed event store and stream-processing platform. Apache Kafka is a distributed publish-subscribe messaging system. A message is any kind of information that is sent from a producer (application that sends the messages) to a consumer (application that receives the messages). Producers write their messages or data to Kafka topics. These topics are divided into partitions that function like logs. Each message is written to a partition and has a unique offset, or identifier. Consumers can specify a particular offset point where they can begin to read messages. Agent	Writes the changes captured by IIDR source agent to Kafka.	IIDR Kafka Agent is a Java application that runs on the Linux machine and writes changes captured by the IIDR source agent to Kafka. Port:11710
IIDR Access Server	IIDR administration service and Metadata Manager	IIDR Access Server is responsible for creating all logical IIDR entities and objects such as subscriptions and data stores. All metadata is stored in the internal IIDR database (Pointbase) IIDR AS Port: 10101
DI Manager	This is the primary interface which controls all DI components	Web service, exposes REST APIs REpresentational State Transfer. Application Programming Interface An API, or application programming interface, is a set of rules that define how applications or devices can connect to and communicate with each other. A REST API is an API that conforms to the design principles of the REST, or representational state transfer architectural style. to: 1) Create pipeline and source db connection 2) Stop/ start pipeline 3) Other administration tasks Port: 6080
DI MDM (Metadata Manager)	Stores and retrieves metadata in Zookeeper Apache Zookeeper. An open-source server for highly reliable distributed coordination of cloud applications. It provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes.	Web service, expose REST REpresentational State Transfer. Application Programming Interface An API, or application programming interface, is a set of rules that define how applications or devices can connect to and communicate with each other. A REST API is an API that conforms to the design principles of the REST, or representational state transfer architectural style. API Communicates with DI Manager Stores and retrieves metadata that is essential for a DI operation: 1) Data dictionary about tables, columns and indexes 2) Pipeline configuration 3) Other important metadata records that are required for ongoing DI operations Port: 6081
DI Processor	Java library run by Flink Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. as a job. It is responsible for writing changes to the space.	Java library , deployed to Flink and invoked as a Flink job. Main responsibility to read messages from Kafka , perform a transformation from a Kafka message into a space document and write this change into the space relevant object.
Zookeeper (ZK)	Serves as a persistent data store for DI components. Serves as a ZK that is required by Kafka.	ZK runs on 3 nodes for H/A purposes. ZK data is replicated between all nodes. Port: 2181
Kafka	Serves as a streaming processing platform.	Kafka is deployed in a cluster of 3 nodes when it uses ZK is its dependency. IIDR publishes changes to the Kafka topic and theDI Processor (Flink job) consumes these messages and writes changes to Space Where GigaSpaces data is stored. It is the logical cache that holds data objects in memory and might also hold them in layered in tiering. Data is hosted from multiple SoRs, consolidated as a unified data model.. Kafka Port: 9092

High-Level Data Flow

DI Subscription Manager

DI Subscription Manager is a web service that exposes a set of APIs on a port. Its unified API has control over CDC components. Only CDC components are in direct contact with the SoR.

DI Subscription Manager is a micro-service that is responsible for providing the following functionality:

1. Unified API that controls various CDC engines to implement the GigaSpaces pluggable connector vision. It creates and updates IIDR entities.

Defines CDC flows and entities. Defines a new subscription.
Start / Stop subscription data flow via IIDR
Monitors the status of the IIDR components

2. Unified method to extract data dictionary from various sources, such as the CDC engine , source database , schema registry or enterprise data catalog and populate the DI data dictionary internal repository (MDM)

3. Data dictionary extraction from the IIDR.

Significantly simplifies DI operations
Only IIDR components connect to the source database
There is a unified data dictionary extraction, regardless of the source database type

StreamSQL

StreamSQL allows Smart DIH Digital Integration Hub. An application architecture that decouples digital applications from the systems of record, and aggregates operational data into a low-latency data fabric. users to implement a low code (SQL) approach to define and operate with ad-hoc data flows, such as read from Kafka and write directly to the Space or read from one Kafka topic and write to another Kafka topic.

Behind the scenes StreamSQL utilizes powerful low-code Flink capabilities to define a schema via SQL CREATE TABLE API.

StreamSQL operation activities can be defined using SQL statements, for example:

Define structure of messages in a Kafka topic as a table (CREATE TABLE)
Define a data flow (stream of data or pipeline) as INSERT AS SELECT statement
Perform a join of data flow from different Kafka topics using a standard SQL join statement

One of the useful StreamSQL use cases is IoT when continuous flow of sensors data changes is consumed from Kafka, aggregated into a summary table and pushed the aggregated summary to space for data services consumption.

SpaceDeck

For information about how StreamSQL is implemented in SpaceDeck GigaSpaces intuitive, streamlined user interface to set up, manage and control their environment. Using SpaceDeck, users can define the tools to bring legacy System of Record (SoR) databases into the in-memory data grid that is the core of the GigaSpaces system. refer to StreamSQL.