Data Profiling & Validation
IDEA supports continuous data quality measurement and monitoring as part of data pipelines. The metadata-based Data Quality framework in IDEA is implemented using Spark DataFrame/PySpark and supports 30+ data validation rules out of the box. The Data Profiling service enables examining and analysing data to create valuable summaries of it. It sifts through data to determine its quality and legitimacy. The process yields a high-level overview that aids in discovering data quality issues, such as out-of-range values and unexpected patterns in the data, which helps the Data Validation service validate the data. IDEA data profiling can be applied to selected columns of a data file/table; column selection is part of the metadata setup. It supports single-column profiling (Min, Max, Average, etc.) as well as cross-column profiling (Correlation, Covariance, Cross Tabulation).
Data profiling and validation are a must when it comes to cloud and data analytics services. Ensuring the quality of data and validating it are crucial for any data-related service.
This service provides a metadata-based framework to perform technical data quality validation and data profiling, helping fast-track their implementation as part of data pipelines.
Data Quality and Data Profiling dashboards show statistics for all DQ and data profiling jobs.
A brief summary
Data Profiling Service
Before data reconciliation is performed, the user may want to understand what the data looks like. To understand the structure of the data and some statistics about it, the user can execute the Data Trust module. While some of the operations performed by Data Reconciliation and Data Profiling are similar, the difference is that reconciliation uses two data stores, applies operations, and compares the results between them, whereas the profiling module uses only one data store.
A quick look at the features
- Data profiling can be applied to all columns or selected columns of the relation. Column selection is part of the metadata setup. Multiple data profiling jobs can be registered.
- Allows the user to perform systematic analysis of the content of a data source.
- Supports single column and multi-column profiling.
- Data Profiling can be performed on various data sources such as Teradata, Synapse, Azure Blob.
- Single-column profiling: Count, Value Lists, Data Distribution.
- Multi-column profiling: Correlation, Covariance, Cross Tabulation.
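As an illustration of the single- and cross-column measures above, here is a minimal plain-Python sketch (IDEA itself computes these with Spark DataFrames; the sample columns are made up):

```python
import statistics

# Hypothetical sample columns from a profiled relation
age = [23, 35, 31, 42, 28]
salary = [40000, 62000, 55000, 80000, 47000]

# Single-column profiling: Count, Min, Max, Average
profile = {
    "count": len(age),
    "min": min(age),
    "max": max(age),
    "avg": statistics.mean(age),
}

# Cross-column profiling: sample Covariance and Correlation
n = len(age)
mean_a, mean_s = statistics.mean(age), statistics.mean(salary)
covariance = sum((a - mean_a) * (s - mean_s)
                 for a, s in zip(age, salary)) / (n - 1)
correlation = covariance / (statistics.stdev(age) * statistics.stdev(salary))

print(profile, covariance, correlation)
```

Cross tabulation would similarly group the relation by one column and count occurrences of another; in a pipeline these summaries are computed per registered job over the columns named in the metadata.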
Data Quality Service
Data quality validation allows users to check and validate data through data validation rules. This service supports over 30 validation rules and performs rectification of the data where required. The steps below describe how it works:
- Data Quality validation is performed by registering appropriate jobs and their metadata. The metadata specifies the data store to operate on and the DQ/data profiling operations to be performed.
- Implemented using Spark DataFrame/PySpark.
- Supports data validation rules such as Null Check, Value Matching, Range Matching, String-Is-Part-Of-List, and Date Format Validation.
- After loading the data, the data validator applies the DQ rules and classifies the data into 'good' records and 'bad' records.
- 'Good' records are those that pass the rule conditions, while 'Bad' records are those that do not pass the rule conditions. After applying the rules and classifying the records, the data sets are saved to different files / areas.
- Supports data stores like Teradata, Synapse, Azure Blob, AWS S3 and Snowflake.
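A simplified sketch of the good/bad split described above might look like this; the rule names and record fields are hypothetical, and real IDEA jobs apply such rules via Spark rather than plain Python:

```python
# Hypothetical rules mirroring Null Check, Range Matching and
# String-Is-Part-Of-List; names and fields are illustrative only.
RULES = [
    ("null_check", lambda rec: rec.get("customer_id") is not None),
    ("range_check", lambda rec: 0 <= rec.get("age", -1) <= 120),
    ("in_list", lambda rec: rec.get("country") in {"US", "UK", "IN"}),
]

def classify(records):
    """Split records into 'good' (pass all rules) and 'bad' (fail any)."""
    good, bad = [], []
    for rec in records:
        failed = [name for name, check in RULES if not check(rec)]
        if failed:
            bad.append({**rec, "failed_rules": failed})
        else:
            good.append(rec)
    return good, bad
```

After classification, each partition would be written to its own file or area, as described above, with the failed rule names retained so bad records can be rectified.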
The Schema Migration service is used to migrate database components from Teradata to the target warehouse based on the mapping information captured in IDEA metadata during the discovery service. The schema migration service depends on two key components:
- Teradata (aka Source) Database discovery service output
- Mapping details
The Teradata discovery service extracts the attributes for the following objects from the DBC tables. Based on their compatibility with the target warehouse, the applicable objects are migrated.
The collection of Teradata-specific keywords is stored in mapping metadata tables. These mapping tables are used as a reference, and the relevant target warehouse keywords are populated. If no direct mapping is available, alternatives are used. Mapping is maintained at the following levels:
- Database / Dataset level mapping
- Table level mapping
- View level mapping
- Column level mapping
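The keyword lookup with fallback can be sketched as follows; the mappings shown (e.g. Teradata's SEL abbreviation, a FORMAT fallback) are illustrative examples, not the contents of IDEA's actual metadata tables:

```python
# Illustrative direct mappings from Teradata keywords to a target warehouse.
DIRECT_MAP = {
    "SEL": "SELECT",      # Teradata abbreviation for SELECT
    "BYTEINT": "TINYINT",
}

# Alternatives used when no direct mapping exists (assumed example).
ALTERNATIVES = {
    "FORMAT": "TO_CHAR",
}

def map_keyword(keyword: str) -> str:
    if keyword in DIRECT_MAP:
        return DIRECT_MAP[keyword]
    if keyword in ALTERNATIVES:
        return ALTERNATIVES[keyword]
    return keyword  # already compatible, pass through unchanged
```

In IDEA the same lookup pattern is repeated per level (database/dataset, table, view, column), with each level's mappings held in its own metadata table.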
The Data Migration service migrates data from Teradata, as well as from FTP/SFTP servers, to the Target warehouse.
The job run is triggered based on the details provided while registering the job, and it is split into two stages: data is first extracted from Teradata or the FTP/SFTP server to the target cloud storage, and then loaded from that storage into the Target warehouse.
These two stages are decoupled using a Kafka queue, which helps manage failure re-runs effectively and allows the next table's extraction from Teradata to start while the previous table's Target warehouse load is in progress.
The job execution is considered successful only when all the tables associated with the job are successfully loaded from the source to the Target warehouse.
The re-run option is enabled only for failed job runs. A failed-job re-run performs the data migration for the failed tables only, and the data load resumes from the point where it failed.
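The two decoupled stages can be sketched as below, with a standard-library queue standing in for Kafka; the table names and stage functions are hypothetical:

```python
from queue import Queue

staged = Queue()  # stands in for the Kafka queue between the two stages

def extract_stage(tables):
    # Stage 1: extract each table from Teradata / FTP / SFTP to cloud
    # storage, then publish a "table staged" event for stage 2.
    for table in tables:
        staged.put(table)

def load_stage():
    # Stage 2: load staged tables into the Target warehouse. Because the
    # queue decouples the stages, this can run while stage 1 extracts the
    # next table, and failed tables can simply be re-queued on a re-run.
    loaded = []
    while not staged.empty():
        loaded.append((staged.get(), "loaded"))
    return loaded

extract_stage(["orders", "customers"])
results = load_stage()
```

In the real service the stages run concurrently, and only the tables whose events never reached a successful load are replayed on a failed-job re-run.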
The Complexity Analyzer scans Teradata BTEQ scripts and classifies them as Simple, Medium, or Complex based on the configured rules.
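A rule-based classifier of this kind might look like the following sketch; the thresholds and signals (line count, BTEQ branching commands such as .IF/.GOTO/.LABEL) are assumed examples, since the actual rules are configurable:

```python
import re

def classify_bteq(script: str) -> str:
    lines = script.count("\n") + 1
    # Branching commands are a rough signal of BTEQ script complexity.
    branches = len(re.findall(r"\.(IF|GOTO|LABEL)\b", script, re.IGNORECASE))
    if lines <= 50 and branches == 0:
        return "Simple"
    if lines <= 200 and branches <= 5:
        return "Medium"
    return "Complex"
```

A production rule set would typically also weigh statement types, volatile/temporary table usage, and embedded OS commands before assigning a class.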
The ETL Migration service consists of two services, which generate the following output based on the mapping document available in IDEA metadata:
- ETL Code Conversion
- Stored Procedure-wrapped ETL Conversion
An ETL script might contain DDL (temporary table creation) and DML statements, and the ETL migration supports converting both DDL and DML into Target warehouse-compatible statements. The ETL migration also uses parsers to generate two files:
- Error Parsed File (AI powered)
- Semantic Parsed File (AI powered)
The Percentage Statistics file helps you understand the conversion percentage of the input script, while the two parsed files specify the errors at a granular level for better understanding.
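How the conversion percentage might be derived is sketched below; counting converted versus total statements is an assumption about the metric, not IDEA's documented formula:

```python
def conversion_percentage(total_statements: int, converted: int) -> float:
    # Share of input statements successfully converted, as a percentage.
    if total_statements == 0:
        return 0.0
    return round(100.0 * converted / total_statements, 2)
```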