DATA TRUST

Trusted framework for data asset reconciliation supported by Data Trust

Data quality monitoring is critical to ensuring the quality of analytical outcomes and to earning the trust of business users. IDEA enables fast-track implementation of Data Quality Validation and Data Profiling as part of data pipelines.
Data Profiling & Validation
IDEA supports continuous data quality measurement and monitoring as part of data pipelines. The metadata-based Data Quality framework in IDEA is implemented using Spark DataFrame/pySpark and supports 30+ data validation rules out of the box. The Data Profiling service examines and analyses data to create valuable summaries of it: it sifts through the data to determine its quality and legitimacy. The process yields a high-level overview that helps discover data quality issues, such as values outside the expected range or unexpected patterns, which the Data Validation service then uses to validate the data. IDEA data profiling can be applied to selected columns of a data file or table; column selection is part of the metadata setup. It supports single-column profiling (Min, Max, Average, etc.) as well as cross-column profiling (Correlation, Covariance, Cross Tabulation).

Data Trust

Data profiling and validation are a must when it comes to cloud and data analytics services. Ensuring the quality of data and validating it are crucial for any data-related service.

This service provides a metadata-based framework for technical data quality checks and data profiling. It helps fast-track the implementation of Data Quality Validation and Data Profiling as part of data pipelines.

Data Quality and Data Profiling dashboards show statistics for all DQ and data profiling jobs.


Trust Features

A brief summary:

  • Data Profiling Service
  • Data Quality Service
  • Data Catalog Service
  • Data Lineage Service

Data Profiling Service

Before data reconciliation is performed, the user may want to understand what the data looks like. To understand the structure of the data and some statistics around it, the user can execute the Data Trust module. While some of the operations performed by Data Reconciliation and Data Profiling are similar, the difference is that reconciliation uses two data stores, applies operations, and compares the results between them, whereas the profiling module uses only one data store.

A quick look at the features

  • Data profiling can be applied to all columns or to selected columns of the relation; column selection is part of the metadata setup. Multiple data profiling jobs can be registered.
  • Allows the user to perform a systematic analysis of the content of a data source.
  • Supports single-column profiling as well as profiling across multiple columns.
  • Can be performed on various data sources such as Teradata, Synapse, and Azure Blob.
  • Single-column profiling: count, value, lists, data distribution.
  • Multi-column profiling: correlation, covariance, cross tabulation (illustrated in the sketch below).
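
The single-column and cross-column measures listed above correspond to standard Spark DataFrame operations. The sketch below is a minimal illustration of that idea in pySpark; the file path and column names are hypothetical, and it is not the IDEA framework's own code.

    # Minimal pySpark profiling sketch (illustrative only; path and column names are hypothetical).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()
    df = spark.read.parquet("abfss://landing@storageacct.dfs.core.windows.net/sales/")

    # Single-column profiling: count, min, max, average of one selected column.
    df.select(
        F.count("amount").alias("count"),
        F.min("amount").alias("min"),
        F.max("amount").alias("max"),
        F.avg("amount").alias("avg"),
    ).show()

    # Cross-column profiling: correlation, covariance and cross tabulation across columns.
    print("corr(amount, quantity):", df.stat.corr("amount", "quantity"))
    print("cov(amount, quantity) :", df.stat.cov("amount", "quantity"))
    df.stat.crosstab("region", "product_category").show()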

Data Quality Service

Data quality validation allows users to check and validate data through data validation rules. The service supports more than 30 validation rules and performs rectification of data where required. The steps below outline how it works (a sketch follows the list):

  • Data Quality Validation is performed by registering the appropriate jobs and their metadata. The metadata specifies the data store to operate on and the DQ/data profiling operations to be performed.
  • Implemented using Spark DataFrame/pySpark.
  • Supports data validation rules such as Null Check, Value Matching, Range Matching, String-Is-Part-Of-List, and Date Format Validation.
  • After loading the data, the data validator applies the DQ rules and classifies the data into 'good' records and 'bad' records.
  • 'Good' records are those that pass the rule conditions, while 'bad' records are those that do not. After applying the rules and classifying the records, the data sets are saved to different files/areas.
  • Supports data stores such as Teradata, Synapse, Azure Blob, AWS S3 and Snowflake.
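
As an illustration of the 'good'/'bad' classification described above, the sketch below applies a few rules expressed as Spark column conditions and writes the two result sets to separate locations. It is a generic pySpark pattern under assumed rule definitions and paths, not the IDEA framework's metadata or API.

    # Generic pySpark sketch of rule-based good/bad record splitting (illustrative only).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
    df = spark.read.csv("path/to/landing/customers.csv", header=True)  # hypothetical input

    # Example rules as boolean conditions, as a metadata layer might supply them.
    rules = [
        F.col("customer_id").isNotNull(),                 # Null Check
        F.col("country").isin("IN", "US", "FR"),          # String-Is-Part-Of-List
        F.col("age").cast("int").between(0, 120),         # Range Matching
    ]

    passes_all = rules[0]
    for rule in rules[1:]:
        passes_all = passes_all & rule

    good = df.filter(passes_all)    # records that satisfy every rule
    bad = df.filter(~passes_all)    # records that violate at least one rule

    good.write.mode("overwrite").parquet("path/to/curated/customers/")     # hypothetical targets
    bad.write.mode("overwrite").parquet("path/to/quarantine/customers/")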

Schema Migration

The Schema Migration service migrates database components from Teradata to the target warehouse based on the mapping information captured in IDEA metadata during the discovery service. The schema migration service depends on two key components:

  • Teradata (aka Source) Database discovery service output
  • Mapping detail

The Teradata discovery service extracts the attributes for the following objects from the DBC tables. Based on compatibility with the target warehouse, the applicable objects are migrated.

The collection of Teradata-specific keywords is stored in mapping metadata tables. These mapping tables are used as a reference, and the relevant target warehouse keywords are populated; if no direct mapping is available, alternatives are used (a simplified illustration follows the list below). Mapping is maintained at the following levels:

  • Database / Dataset level mapping
  • Table level mapping
  • View level mapping
  • Column level mapping
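
As a purely illustrative picture of keyword-level mapping, the snippet below looks up a Teradata keyword or data type and returns a generic target equivalent. The values shown are common examples only; the actual mappings live in IDEA's metadata tables and depend on the chosen target warehouse.

    # Illustrative keyword/data-type lookup (generic example values, not IDEA's metadata tables).
    TERADATA_TO_TARGET = {
        "SEL": "SELECT",          # Teradata shorthand keyword
        "MINUS": "EXCEPT",        # set operator alternative
        "BYTEINT": "SMALLINT",    # nearest alternative where BYTEINT is unsupported
        "VARCHAR": "VARCHAR",     # direct mapping
    }

    def map_keyword(teradata_keyword: str) -> str:
        """Return the target-warehouse equivalent, or the original keyword if no mapping exists."""
        return TERADATA_TO_TARGET.get(teradata_keyword.upper(), teradata_keyword)

    print(map_keyword("sel"))       # -> SELECT
    print(map_keyword("BYTEINT"))   # -> SMALLINT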

Data Migration

The Data Migration service migrates data from Teradata as well as from FTP/SFTP servers to the target warehouse.

The job run is triggered based on the details provided while registering the job and is split into two stages: data is first extracted from Teradata or the FTP/SFTP server to the target cloud storage, and then loaded from cloud storage into the target warehouse.

These two stages are decoupled using a Kafka queue, which helps manage failure re-runs effectively and allows the next table extraction from Teradata to start while the previous table's target warehouse load is still in progress.
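
The hand-off between the two stages can be pictured as the extract stage publishing one event per staged table and the load stage consuming those events independently. The sketch below uses the kafka-python client with an assumed topic name and message shape; it is not IDEA's implementation.

    # Illustrative Kafka hand-off between extract and load stages (kafka-python; topic and message shape assumed).
    import json
    from kafka import KafkaProducer, KafkaConsumer

    TOPIC = "extracted-tables"  # hypothetical topic

    # Stage 1: after a table lands in cloud storage, publish an event describing it.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"table": "sales.orders", "path": "s3://staging/orders/"})
    producer.flush()

    # Stage 2: a separate process loads each staged table into the target warehouse,
    # so the next extraction can start while this load is still running.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        event = message.value
        print(f"loading {event['table']} from {event['path']} into the target warehouse")
        # load_into_warehouse(event)  # hypothetical loader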

The job execution is considered successful only when all the tables associated with the job are successfully loaded from source to the target warehouse.

The re-run option is enabled only for a failed job run. A failed job re-run performs the data migration for the failed tables, and the data load resumes from the point where it failed.

Complexity Analyzer

The Complexity Analyzer scans Teradata BTEQ scripts and classifies them into Simple, Medium, and Complex based on the configured rules, as sketched below.
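
The classification rules are configurable; as a simplified illustration only, a rule-based classifier could bucket scripts by measures such as statement count or the presence of certain constructs. The thresholds and patterns below are hypothetical.

    # Simplified rule-based BTEQ classification sketch (thresholds and patterns are hypothetical).
    import re

    def classify_bteq(script_text: str) -> str:
        statements = [s for s in script_text.split(";") if s.strip()]
        has_recursive = bool(re.search(r"\bWITH\s+RECURSIVE\b", script_text, re.IGNORECASE))
        has_volatile = bool(re.search(r"\bVOLATILE\s+TABLE\b", script_text, re.IGNORECASE))

        if has_recursive or len(statements) > 50:
            return "Complex"
        if has_volatile or len(statements) > 15:
            return "Medium"
        return "Simple"

    print(classify_bteq("SEL * FROM sales.orders;"))  # -> Simple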

ETL Migration

The ETL Migration service consists of two services, which generate the following output based on the mapping document available in IDEA metadata:

  • ETL Code Conversion
  • Stored Procedure-wrapped ETL Conversion

An ETL script might contain DDL (temporary table creation) and DML statements, and ETL migration supports converting both DDL and DML into target-warehouse-compatible statements (a simplified illustration follows this section). ETL migration also uses parsers to generate two files:

  • Error Parsed File (AI powered)
  • Semantic Parsed File (AI powered)

A Percentage Statistics file helps you understand the conversion percentage of the input script. The two error files specify the errors at a granular level for better understanding.
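
To give a flavour of DDL/DML conversion (the real service is mapping-driven and far more complete), the snippet below rewrites a few Teradata-specific constructs into more widely supported equivalents using simple substitutions. Target syntax varies by warehouse, so the rewrites are illustrative only.

    # Deliberately simplified Teradata-to-target rewriting sketch (not the ETL Migration service).
    import re

    REWRITES = [
        (r"\bSEL\b", "SELECT"),                                        # Teradata shorthand
        (r"\bCREATE\s+VOLATILE\s+TABLE\b", "CREATE TEMPORARY TABLE"),  # temporary-table DDL
        (r"\bMINUS\b", "EXCEPT"),                                      # set operator
    ]

    def convert_statement(teradata_sql: str) -> str:
        converted = teradata_sql
        for pattern, replacement in REWRITES:
            converted = re.sub(pattern, replacement, converted, flags=re.IGNORECASE)
        return converted

    print(convert_statement("CREATE VOLATILE TABLE tmp_orders AS (SEL * FROM orders) WITH DATA;"))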


Design for Industrialization

Service Benefits

Gain Trust from the Business Users

Regardless of how simple or complex your analytical needs are, how accurately your data is processed determines the integrity of all data outcomes.

Better Decision Making and Improved Business Outcomes

There is a well-known concept: garbage in, garbage out. Proactive data quality monitoring ensures garbage data is identified and corrected before it reaches business insights, leading to improved decision making. A proper data quality management plan and workflow can significantly boost your business operations.

Proactive Crisis Management and Compliance

Poor data quality can cost companies heavily in terms of lost revenue and reputational damage. Data profiling can eliminate costly errors that are common in databases. Data quality management enables businesses to efficiently identify and address data issues, which helps the enterprise stay compliant and avoid fines.

The IDEA DQ framework is implemented using Spark DataFrame/pySpark. It can run on Apache Spark, Spark on Databricks, EMR Spark, etc.

The IDEA WebApp offers the Data Trust features. Here a Data Engineer can select the ingested data files from the Landing Zone and then add DQ rules using the WebApp interface. This creates Data Quality or Data Profiling jobs in the metadata layer. These DQ jobs need to be integrated when the Data Engineer builds the data pipeline DAG workflow.

The framework supports UDFs for creating custom DQ rules (a generic example follows below). By default, these rules can be applied to data files from object storage (ADLS Gen2 or Amazon S3) as part of batch data processing.
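
As an example of what a custom rule written as a UDF could look like, the sketch below flags records whose email column fails a simple format check. This is a generic pySpark pattern; the rule, column name and landing-zone path are assumptions, not IDEA's rule-registration API.

    # Generic pySpark UDF used as a custom validation rule (illustrative; not IDEA's registration API).
    import re
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("custom-rule-sketch").getOrCreate()

    @F.udf(returnType=BooleanType())
    def is_valid_email(value):
        # Simple format check; the pattern is illustrative only.
        return value is not None and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", value) is not None

    df = spark.read.parquet("s3a://landing-zone/customers/")            # hypothetical landing-zone path
    flagged = df.withColumn("email_rule_passed", is_valid_email(F.col("email")))
    flagged.filter(~F.col("email_rule_passed")).show()                  # records failing the custom rule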
Next Steps

To learn more about IDEA by Capgemini and how we can help make data your competitive edge, visit: www.capgemini.com/ideabycapgemini

  • Mukesh Jain
    IDEA Head
    mukesh.jain@capgemini.com


  • Harsh Vardhan
    IDEA Chief Digital Leader & Chief Technology Architect
    harsh.c.vardhan@capgemini.com

  • Eric Reich
    Offer Leader and Global Head AI & Data Engineering VP
    eric.reich@capgemini.com

  • Aurobindo Saha
    IDEA Principal Sales Architect
    aurobindo.saha@capgemini.com


  • Sameer Kolhatkar
    IDEA GTM Lead
    sameer.kolhatkar@capgemini.com


  • Sandip Brahmachari
    IDEA D&S Lead
    sandip.brahmachary@capgemini.com


  • Anupam Srivastava
    IDEA Engineering Lead
    anupam.a.srivastava@capgemini.com


  • Subramanian Srinivasan
    IDEA Shared Services Lead
    subramanian.srinivasan@capgemini.com