Cloud Data Foundation

Significantly expedites the data ingestion and data transformation journey from your on-premises systems to the cloud


This service addresses data ingestion, data transformation, and orchestration requirements across a variety of data sources, including stream-, batch-, and API-based sources. It leverages cloud-native services and is built on the Spark framework to accelerate the data ingestion and transformation journey.

The Cloud Data Foundation service enables industrialized end-to-end data processing, allowing organizations to generate trusted real-time insights. Cloud Data Foundation offers the following sub-services:

  • Linked Services
  • Data Ingestion Service
  • Data Processing and Transformation
  • Federated Data lake Management Services
  • Workflow Orchestration Service
  • Catalog Service

Together, these sub-services work to deliver 30% effort savings.



    IDEA CDF provides a flexible, scalable architecture that federates data from multiple sources, ingests both stream and batch data, and builds a cloud-based data lake.

    This platform automates and accelerates data ingestion and curation into the target cloud data warehouse, enabling metadata-driven, pattern-based data ingestion:

    • Supports batch and streaming data ingestion
    • Supports cloud-native, scalable, and elastic data workloads
    • Schema validation and schema drift detection during ingestion
    • Support for incremental data ingestion / CDC
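
    A metadata-driven ingestion pattern means each pipeline is described by a job definition rather than hand-written code. The sketch below illustrates the idea under stated assumptions: the field names ("source_type", "load_type", and so on) are hypothetical, not IDEA CDF's actual metadata model.

```python
# Minimal sketch of a metadata-driven ingestion job definition.
# Field names and supported values are illustrative assumptions,
# not IDEA CDF's actual metadata schema.

REQUIRED_FIELDS = {"job_name", "source_type", "target", "load_type"}
SUPPORTED_SOURCES = {"oracle", "sqlserver", "teradata", "s3", "adls", "kafka"}

def validate_job(metadata: dict) -> list:
    """Return a list of problems found in an ingestion job definition."""
    errors = []
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if metadata.get("source_type") not in SUPPORTED_SOURCES:
        errors.append(f"unsupported source: {metadata.get('source_type')}")
    if metadata.get("load_type") not in {"full", "incremental", "cdc"}:
        errors.append(f"unsupported load type: {metadata.get('load_type')}")
    return errors

job = {
    "job_name": "orders_daily",
    "source_type": "oracle",
    "target": "s3://lake/raw/orders/",   # hypothetical target path
    "load_type": "incremental",
}
assert validate_job(job) == []
```

    Validating the job definition up front is what allows one generic ingestion engine to serve many pipelines.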

    Cloud Data Foundation Features

    A brief summary of the feature areas covered in the following sections:

    • Federated Data Lake & Pipeline
    • Data Processing & Transformation


    Link Service

    The Link service enables the IDEA platform to connect to sources and targets.
    A registered link service is used by other modules to connect to the source and target and perform the necessary operations.

    The following linked services are supported:

    • S3
    • ADLS Gen2
    • HDFS
    • ORACLE
    • Teradata
    • FTP
    • SFTP
    • SQL Server
    • API
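
    Conceptually, a link service is a named connection definition registered once and reused by every downstream module. The sketch below illustrates that register-then-reuse pattern; the class name and fields are illustrative assumptions, not the actual IDEA API.

```python
# Sketch of a link-service registry: connection definitions are
# registered once and looked up by other modules. Names and fields
# are hypothetical, not IDEA's real interface.

class LinkServiceRegistry:
    SUPPORTED = {"S3", "ADLS Gen2", "HDFS", "ORACLE", "Teradata",
                 "FTP", "SFTP", "SQL Server", "API"}

    def __init__(self):
        self._links = {}

    def register(self, name: str, kind: str, **conn) -> None:
        """Store a named connection definition for later reuse."""
        if kind not in self.SUPPORTED:
            raise ValueError(f"unsupported link type: {kind}")
        self._links[name] = {"kind": kind, **conn}

    def get(self, name: str) -> dict:
        """Retrieve a registered link service by name."""
        return self._links[name]

registry = LinkServiceRegistry()
registry.register("sales_db", "SQL Server",
                  host="db.example.com", port=1433)  # hypothetical host
assert registry.get("sales_db")["kind"] == "SQL Server"
```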

    Advanced Crawler

    The Advanced Crawler service extracts metadata from databases and file system servers. The captured metadata is used by other modules for job registration as well as job execution.
    This service enables users to create, view, edit, execute, and monitor crawlers.

    The crawler analyzes data source and data warehouse applications and tables to understand object counts and object schemas. A job is registered against a link service; when executed, a point-in-time snapshot is captured from the source/target and stored in IDEA metadata.
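
    A point-in-time snapshot can be thought of as a timestamped record of object counts and schemas. The sketch below illustrates the idea, with an in-memory dictionary standing in for a real database or file server; it is an assumption-laden simplification, not the crawler's actual implementation.

```python
# Sketch of a crawler's point-in-time snapshot: capture object counts
# and schemas from a source and stamp them with a capture time.
# The "source_tables" dict is a stand-in for a real data source.
from datetime import datetime, timezone

def crawl(source_tables: dict) -> dict:
    """source_tables maps table name -> list of rows (dicts)."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "objects": {
            name: {
                "row_count": len(rows),
                "schema": sorted(rows[0].keys()) if rows else [],
            }
            for name, rows in source_tables.items()
        },
    }

snapshot = crawl({"orders": [{"id": 1, "amount": 9.5}], "customers": []})
assert snapshot["objects"]["orders"]["schema"] == ["amount", "id"]
```

    Storing such snapshots over time is what lets later jobs detect schema drift and track object growth.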

    Supported Sources

    • FTP
    • SFTP
    • S3
    • ADLS Gen2
    • Oracle
    • Teradata
    • HDFS
    • SQL Server
    • GCS
    • Hive

    Federated Data Lake Management

    Data federation is the capability to integrate data from another data store using a single interface. Data federation is an aspect of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.

    In other words, this service:

    • Enables users to access different types of database servers and files in various formats, integrate data from all of these sources, transform the data, and access it through various APIs and languages.
    • Allows data users to access data stores through data federation or to operate them independently, outside the scope of data federation.
    • Makes it possible to present one integrated data store regardless of how and where data is stored. Presenting data as one integrated data set involves transformation, cleansing, and possibly even enrichment.
    • Provides on-demand data integration: integration takes place on the fly, not in batch. Data is accessed and integrated only when consumers ask for it, so it is not stored in an integrated form but remains in its original location and format.
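
    The on-demand aspect can be sketched as follows: data stays in its original, autonomous stores and is joined only at query time. The store names and join key below are illustrative assumptions, not part of the actual service.

```python
# Sketch of on-demand data federation: records remain in their
# original stores and are integrated only when a consumer queries.
# The in-memory stores stand in for heterogeneous databases.

STORES = {
    "crm": [{"id": 1, "name": "Acme"}],       # hypothetical CRM store
    "erp": [{"id": 1, "revenue": 1200}],      # hypothetical ERP store
}

def federated_query(key: str) -> list:
    """Join records across autonomous stores on the fly, by `key`."""
    merged = {}
    for rows in STORES.values():
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

assert federated_query("id") == [{"id": 1, "name": "Acme", "revenue": 1200}]
```

    Nothing is persisted in integrated form: each call re-reads the stores, which is exactly the trade-off data federation makes against a materialized warehouse.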

    Transformation & Pipeline

    Data transformation and pipeline services enable users to run ETL logic on ingested data. Once data lands in the IDEA landscape via data ingestion pipelines, data processing jobs can be created on it for transformation.

    All transformation logic is written inside templates, which are then executed. Templates vary by use case and the transformation logic required. Once created, a template is placed in specific cloud storage and can be configured in a job using the metadata-driven model. IDEA provides various predefined templates, such as currency conversion, join operations, and null removal.
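
    To illustrate the template idea, here is a minimal sketch of a null-removal transformation driven by job metadata rather than hard-coded logic. The function name, config shape, and sample data are hypothetical; a real template would be a Spark job, not plain Python.

```python
# Sketch of a reusable transformation template (null removal),
# parameterized through a metadata-driven job config.
# Names and config layout are illustrative assumptions.

def null_removal_template(rows: list, columns: list) -> list:
    """Drop rows where any of the configured columns is None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

# The job config selects the template and supplies its parameters.
job_config = {"template": "null_removal", "params": {"columns": ["price"]}}

data = [{"sku": "A", "price": 10}, {"sku": "B", "price": None}]
cleaned = null_removal_template(data, **job_config["params"])
assert cleaned == [{"sku": "A", "price": 10}]
```

    Because the logic lives in the template and the specifics live in metadata, the same template serves many datasets without code changes.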

    Data Ingestion

    Batch and stream ingestion from different sources to the target cloud data warehouse storage

    • The Cloud Data Foundation supports batch ingestion using Data Fusion and Dataproc.
    • Cloud Data Fusion is a fully managed, scalable enterprise data integration platform. It brings transactional, social, or machine data in various formats from databases, applications, messaging systems, mainframes, files, SaaS, and IoT devices; offers an easy-to-use visual interface; and provides deployment capabilities to execute data pipelines on ephemeral or dedicated Dataproc clusters in Spark.
    • Dataproc is a fully managed service for running Apache Spark and Hadoop workloads, making it simple and cost-effective to clean, enrich, and move data reliably between various data stores and data streams.
    • Cloud Data Foundation enables stream ingestion through Confluent Kafka. Confluent Platform is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams.
    • It supports 120+ pre-built connectors to quickly connect critical enterprise applications, databases, and systems of record using out-of-the-box, expert-certified connectors.
    • A Debezium/dedicated connector is used to perform CDC from SQL Server to the target cloud storage. The SQL Server CDC source connector captures changes in a SQL Server database and writes them as change event records to Kafka topics. This connector captures all newly inserted, updated, and deleted data.
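
    For illustration, a Debezium SQL Server source connector is typically registered with Kafka Connect using a JSON configuration along these lines. Hostnames, credentials, and table names below are placeholders, and exact property names vary between Debezium versions, so treat this as a sketch rather than a drop-in config:

```json
{
  "name": "sqlserver-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "sqlserver.example.com",
    "database.port": "1433",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.names": "sales",
    "topic.prefix": "sales",
    "table.include.list": "dbo.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.sales"
  }
}
```

    Each captured insert, update, or delete on `dbo.orders` then arrives as a change event record on a Kafka topic prefixed with `sales`.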

    Schema Registry

    Cloud Data Foundation provides support for registering schemas and validating source data and data files against them.

    However, users always have the flexibility to use CDF’s ingestion features to ingest source system data without defining a schema or validating against one.

    The schema registry is supported for the following sources via Kafka:

    • Oracle
    • SQL Server
    • Teradata
    • S3
    • ADLS Gen2
    • FTP
    • SFTP
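
    Schema validation at ingestion time amounts to comparing the registered schema against what actually arrives in a batch. The sketch below shows one simple form of drift detection; the function and schema shape are illustrative assumptions, not the service's real interface.

```python
# Sketch of schema drift detection: compare the registered schema
# against the columns observed in an incoming batch.

def detect_drift(registered: dict, observed_columns: set) -> dict:
    """Report columns that appeared or disappeared vs. the registry."""
    expected = set(registered)
    return {
        "new_columns": sorted(observed_columns - expected),
        "missing_columns": sorted(expected - observed_columns),
    }

registered_schema = {"id": "int", "amount": "decimal", "ts": "timestamp"}
drift = detect_drift(registered_schema, {"id", "amount", "currency"})
assert drift == {"new_columns": ["currency"], "missing_columns": ["ts"]}
```

    A pipeline can then be configured to reject the batch, quarantine it, or evolve the registered schema when drift is found.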

    Data Processing & Transformation

    Prebuilt Classic and Flex templates, with extended support for UDFs, are available.

    The Cloud Data Foundation supports data processing and transformation using DataFlow.

    Dataflow is a managed service for executing a wide variety of data processing patterns.

    The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines.

    Templates allow you to stage your pipelines on the target cloud and run them. Classic templates are staged as execution graphs on Cloud Storage, while Flex templates package the pipeline as a Docker image and stage that image in your project's Container Registry or Artifact Registry.

    Workflow Orchestration

    This service enables users to define a pipeline of seemingly independent tasks. Though many features in a modernization/migration project are developed as independent, modular pieces of functionality, achieving the end goal requires performing one or more of these tasks in a particular sequence.

    In other words,
    it allows users to string together multiple tasks and execute them in a way that achieves the end goal of data processing. A workflow can define simple and/or complex relationships between tasks.
    If a workflow is parameterized, it can be reused across multiple executions for different operations.

    We provide a fully managed, dedicated workflow orchestration service that empowers users to create/author, schedule, and monitor workflows on the pipeline, whether on-premises, across multiple clouds, or in the target DWH.
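
    The sequencing idea, stringing tasks into a dependency graph and executing them in order, can be sketched in a few lines. This is only an illustration of the concept; a real deployment would use a managed orchestrator, and the task names below are hypothetical.

```python
# Sketch of workflow orchestration: tasks form a DAG and run in
# dependency order. Task names and dependencies are illustrative.
from graphlib import TopologicalSorter

def run_workflow(tasks: dict, deps: dict) -> list:
    """tasks: name -> callable; deps: name -> set of upstream names."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()          # each task runs only after its upstreams
    return order

log = []
tasks = {
    "ingest":    lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
deps = {"transform": {"ingest"}, "load": {"transform"}}
assert run_workflow(tasks, deps) == ["ingest", "transform", "load"]
```

    Parameterizing `tasks` and `deps` is what makes the same workflow reusable across executions and operations.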

    Design for Industrialization

    Service Benefits

    Low Code / No Code

    A low-code/no-code web app accelerates tasks for personas such as data engineers and data analysts

    Scalable architecture

    Cloud native, scalable, elastic data ingestion framework

    Data Federation

    Federates data from multiple sources, ingests both stream and batch data, and builds a cloud-based data lake.

    Transformation Framework

    Supports jumpstart templates for data standardization and transformation using Python, Spark, and cloud-native services such as Glue, EMR, Kafka, ADF, and Databricks.

    Based on client engagement scope, the IDEA team can support additional linked services. Each new connector requires adding a microservice and setting up additional metadata configurations to make the new linked service available in the IDEA WebApp. In our experience, the key challenge is testing connectivity with the new data source in a development environment; this requires support from the account or client team to connect to a dev or test instance of the new data source type with some sample test data.
    The service also provides prebuilt transformation notebook templates, built on the Spark framework, which expedite the ETL build process.
    Currently, this service supports data loads to Redshift, Synapse, and Snowflake.
    Next Steps

    To learn more about IDEA by Capgemini and how we can help make data your competitive edge,
    visit: www.capgemini.com/ideabycapgemini

    • Mukesh Jain
      IDEA Head

    • Harsh Vardhan
      IDEA Chief Digital Leader & Chief Technology Architect

    • Eric Reich
      Offer Leader and Global Head AI & Data Engineering VP

    • Aurobindo Saha
      IDEA Principal Sales Architect

    • Sameer Kolhatkar
      IDEA GTM Lead

    • Sandip Brahmachari
      IDEA D&S Lead sandip.brahmachary@capgemini.com

    • Anupam Srivastava
      IDEA Engineering Lead

    • Subramanian Srinivasan
      IDEA Shared Services Lead