Pentaho Geek Zone
Sunday, 21 July 2024
Azure, GCP and AWS Comparison with respect to Cloud Database Migration
| Features | AWS (Amazon Web Services) | GCP (Google Cloud Platform) | Azure | Pentaho |
| Sources Supported | Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, and more | MySQL, PostgreSQL, SQL Server | SQL Server, MySQL, PostgreSQL, Oracle, MongoDB | Supports almost all databases |
| Database Migration Service | AWS Database Migration Service (DMS). Targets supported: Amazon RDS (all engines), Aurora, Redshift, DynamoDB, S3. Features: continuous data replication, minimal downtime, schema conversion, data validation. | Database Migration Service (DMS). Targets supported: Cloud SQL (MySQL, PostgreSQL, SQL Server). Features: minimal downtime, automated provisioning of Cloud SQL instances, schema and data migration. | Azure Database Migration Service (DMS). Targets supported: Azure SQL Database, Azure SQL Managed Instance, Azure Database for MySQL, Azure Database for PostgreSQL, Cosmos DB. Features: automated assessment, continuous replication, downtime minimization. | Pentaho PDI can be customized. |
| Schema Conversion Tool | AWS Schema Conversion Tool (SCT). Purpose: converts a database schema from one database engine to another. | Database Migration Workbench (Beta): a GUI tool that helps in assessing, planning, and executing database migrations; provides schema assessment and conversion functionality. | Azure Migrate. Purpose: central hub for tracking, assessing, and migrating on-premises databases, VMs, and applications to Azure. | N/A; Pentaho PDI can be customized for a specific source and target database engine. |
| Physical data transport solution | AWS Snowball. Purpose: physical data transport solution for transferring large amounts of data. Usage: useful for petabyte-scale data migrations where network transfer is impractical. | Transfer Appliance. Purpose: physical data transfer solution. Usage: for large-scale data transfers where network-based transfer is not feasible. | Azure Data Box. Purpose: physical data transfer solution. Usage: for transferring large amounts of data to Azure where network transfer is impractical. | Pentaho PDI can be customized for bulk load. |
| Moving large amounts of data | AWS DataSync. Purpose: automates moving large amounts of data between on-premises storage and AWS. Usage: for continuous or one-time data migrations, including file transfers. | BigQuery Data Transfer Service: moves data into BigQuery from SaaS applications, Google services, and other sources. Cloud Data Transfer: transfers data to Google Cloud Storage; includes online transfer, Transfer Appliance, and Storage Transfer Service for large datasets. | Azure Storage Migration Service. Purpose: transfers data to Azure Storage. Usage: supports moving data from on-premises storage systems to Azure Blob Storage, File Storage, and more. | Pentaho PDI can be customized for backup and restore with any FTP client. |
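To make the AWS column a bit more concrete, here is a minimal boto3 sketch that starts an existing DMS replication task and polls its status. The region, task ARN, and the assumption that the replication instance, endpoints, and task already exist are placeholders for illustration, not part of the comparison above.

```python
# Minimal sketch: kick off an existing AWS DMS replication task with boto3.
# The ARN and region below are placeholders; the replication instance,
# source/target endpoints and the task itself must already exist.
import time
import boto3

dms = boto3.client("dms", region_name="us-east-1")
task_arn = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"  # placeholder

# 'start-replication' runs a full load; use 'resume-processing' to continue CDC.
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)

# Poll until the task reaches a terminal state (a full-load-only task ends as "stopped").
while True:
    task = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )["ReplicationTasks"][0]
    print("status:", task["Status"])
    if task["Status"] in ("stopped", "failed"):
        break
    time.sleep(30)
```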
Saturday, 8 July 2023
Which cloud is the best option for creating a data lake
Both Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer robust services for creating data lakes. The choice between the two depends on various factors, including your specific requirements, existing infrastructure, expertise, and budget. Here are some considerations for each cloud provider:
Google Cloud Platform (GCP) for Data Lakes:
- BigQuery: GCP's BigQuery is a fully managed, serverless data warehouse and analytics platform that can be used as a foundation for a data lake. It offers scalable storage and querying capabilities, along with integration with other GCP services.
- Cloud Storage: GCP's Cloud Storage provides a highly scalable and durable object storage solution. It can be used as a landing zone for ingesting and storing raw data before transforming it into a structured format for analysis.
- Dataflow: GCP's Dataflow is a managed data processing service that enables real-time and batch data processing pipelines. It can be used for data transformation, cleansing, and enrichment tasks within a data lake architecture.
- Dataproc: GCP's Dataproc is a managed Apache Hadoop and Apache Spark service. It provides a scalable environment for running big data processing and analytics workloads on GCP.
Amazon Web Services (AWS) for Data Lakes:
- Amazon S3: AWS's Simple Storage Service (S3) is a highly scalable object storage service that can be used as the foundation for a data lake. It provides durability, availability, and security for storing large volumes of data.
- AWS Glue: Glue is a fully managed extract, transform, and load (ETL) service. It can automate the process of cataloging, cleaning, and transforming data to make it ready for analysis within a data lake.
- Amazon Athena: AWS's Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using SQL queries. It provides an on-demand, serverless approach to querying data within a data lake without the need for infrastructure provisioning.
- AWS Lake Formation: AWS Lake Formation is a service that simplifies the process of setting up and managing a data lake. It provides features for data ingestion, metadata management, and access control.
Both GCP and AWS have strong offerings for building data lakes, and the choice depends on your specific needs and preferences. It is recommended to evaluate the features, pricing, scalability, security, and ecosystem of each provider to determine which one aligns best with your requirements and organizational goals.
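To make the AWS option a bit more concrete, the sketch below runs an Athena query over data already cataloged on top of S3 using boto3; the database, table, and results bucket names are made-up placeholders.

```python
# Minimal sketch: query data-lake files in S3 through Athena with boto3.
# Database, table and output bucket names are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# Wait for the query to finish, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```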
Storage types in GCP vs AWS and Azure cloud
Here's a comparison of storage types available in Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure:
1. Object Storage:
- GCP: Cloud Storage
- AWS: Amazon S3 (Simple Storage Service)
- Azure: Azure Blob Storage
2. Block Storage:
- GCP: Persistent Disk
- AWS: Amazon Elastic Block Store (EBS)
- Azure: Azure Managed Disks
3. File Storage:
- GCP: Cloud Filestore
- AWS: Amazon Elastic File System (EFS)
- Azure: Azure Files
4. In-Memory Data Store:
- GCP: Cloud Memorystore (supports Redis and Memcached)
- AWS: Amazon ElastiCache (supports Redis and Memcached)
- Azure: Azure Cache for Redis
5. NoSQL Database:
- GCP: Cloud Firestore, Cloud Bigtable
- AWS: Amazon DynamoDB
- Azure: Azure Cosmos DB
6. Relational Database:
- GCP: Cloud SQL, Cloud Spanner
- AWS: Amazon RDS (Relational Database Service)
- Azure: Azure SQL Database
7. Data Warehousing:
- GCP: BigQuery
- AWS: Amazon Redshift
- Azure: Azure Synapse Analytics (formerly SQL Data Warehouse)
8. Archive Storage:
- GCP: Cloud Storage Coldline, Archive
- AWS: Amazon S3 Glacier
- Azure: Azure Archive Storage
It's important to note that while there are similarities in the storage types provided by GCP, AWS, and Azure, there may be differences in terms of specific features, performance characteristics, pricing models, and regional availability. It's advisable to consult the respective cloud providers' documentation for detailed information on each storage service and evaluate which one best fits your specific requirements.
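As a small illustration of how the object storage tier lines up across the three providers, the sketch below uploads the same local file to S3, Cloud Storage, and Blob Storage with each vendor's Python SDK; the bucket/container names and the Azure connection string are placeholders, and credentials are assumed to come from each SDK's default configuration.

```python
# Minimal sketch: upload the same file to the object store of each cloud.
# Bucket/container names and the Azure connection string are placeholders;
# credentials come from each SDK's default configuration.
import boto3                                       # pip install boto3
from google.cloud import storage                   # pip install google-cloud-storage
from azure.storage.blob import BlobServiceClient   # pip install azure-storage-blob

LOCAL_FILE = "report.csv"

# AWS S3
boto3.client("s3").upload_file(LOCAL_FILE, "my-aws-bucket", "raw/report.csv")

# GCP Cloud Storage
gcs_bucket = storage.Client().bucket("my-gcp-bucket")
gcs_bucket.blob("raw/report.csv").upload_from_filename(LOCAL_FILE)

# Azure Blob Storage
blob_service = BlobServiceClient.from_connection_string("<azure-connection-string>")
with open(LOCAL_FILE, "rb") as data:
    blob_service.get_blob_client("my-container", "raw/report.csv").upload_blob(data, overwrite=True)
```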
How AWS can be used for data warehousing with Pentaho
AWS (Amazon Web Services) offers several services that can be used for data warehousing. Here are some key AWS services commonly used in data warehousing:
1. Amazon Redshift: Redshift is a fully managed data warehousing service that allows you to analyze large volumes of data. It provides a petabyte-scale data warehouse that can handle high-performance analytics workloads. Redshift integrates with various data sources, including Amazon S3, Amazon DynamoDB, and other AWS services. It offers features like columnar storage, parallel query execution, and data compression for efficient data storage and retrieval.
2. Amazon S3 (Simple Storage Service): S3 is an object storage service that can be used as a data lake for storing raw data. It provides scalable storage for structured and unstructured data. You can store data in S3 and then load it into Redshift or other data warehousing systems for analysis. S3 integrates with various AWS services and provides high durability, availability, and security.
Note: Pentaho can be used as the ETL and data loading tool instead of Glue.
3. AWS Glue: Glue is a fully managed extract, transform, and load (ETL) service. It allows you to prepare and transform your data for analytics. Glue provides a serverless environment to run ETL jobs, and it automatically generates code to infer schema and transform data. You can use Glue to prepare data for loading into Redshift or other data warehousing solutions.
4. AWS Data Pipeline: Data Pipeline is a web service for orchestrating and automating the movement and transformation of data between different AWS services. It helps you create data-driven workflows and manage dependencies between various steps in your data processing pipeline. Data Pipeline can be used to schedule and automate data movement and transformation tasks for data warehousing.
5. Amazon Athena: Athena is an interactive query service that allows you to analyze data directly from S3 using standard SQL queries. It enables you to perform ad-hoc queries on your data without the need for data loading or pre-defined schemas. Athena is useful for exploratory data analysis and can be integrated with data warehousing solutions for specific use cases.
6. AWS Glue Data Catalog: Glue Data Catalog is a fully managed metadata repository that integrates with various AWS services. It acts as a central catalog for storing and managing metadata about your data assets, including tables, schemas, and partitions. The Glue Data Catalog can be used to discover and explore data stored in S3 or other data sources, making it easier to manage and query data in your data warehouse.
These are just a few examples of how AWS can be used for data warehousing. Depending on your specific requirements and use case, there may be additional AWS services and tools that can be utilized in your data warehousing architecture.
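For example, the load step that Pentaho (or Glue) would perform can often be reduced to a Redshift COPY from S3. The sketch below issues such a COPY through psycopg2; the cluster endpoint, credentials, table, bucket, and IAM role are placeholders, not a definitive implementation.

```python
# Minimal sketch: load a CSV file from S3 into a Redshift table with COPY.
# Endpoint, credentials, table, bucket and IAM role are placeholders.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)
conn.autocommit = True

copy_sql = """
    COPY analytics.sales
    FROM 's3://my-data-lake/raw/sales.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the file directly from S3

conn.close()
```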
Wednesday, 17 August 2022
Pentaho Email Alerting system
Pentaho can be used to create an alerting tool with which any kind of report can be triggered by email.
Some example use cases:
- Suppose I want today's sales report. To get it, I just send an email to a given email ID with a specific subject such as "send me todays sales report". After a short time (2-3 minutes), an email with the sales report in graphical and tabular form as an attachment (PDF, CSV, Excel, ...) is sent back to the user.
- Database artifact reports from almost any database can be generated and sent to the user on demand.
- It can be configured to send alerts as well, for example:
- if free database/tablespace falls below 10%
- if any specific query is taking longer than expected to execute
- top 10 queries by performance
- top 10 tables by space used
- It can be configured to load data from any Excel sheet (from an email attachment) into predefined tables in any database.
- It can be configured to execute any batch or shell command directly from email.
- The major advantage of this tool is getting reports/alerts without logging into any system; everything can be triggered by email from a specific ID.
- These reports/alerts can also be scheduled to run at any specific time.
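The original tool is built in Pentaho, but the basic loop is easy to sketch in plain Python: poll a mailbox for the trigger subject and reply with the report attached. Everything below (hosts, credentials, and the generate_sales_report helper) is a made-up placeholder used only to illustrate the idea, not the PDI implementation described above.

```python
# Minimal sketch of the same idea outside Pentaho: poll a mailbox for a
# trigger subject and reply with a report attached. Hosts, credentials and
# generate_sales_report() are placeholders, not part of the original post.
import imaplib
import smtplib
from email import message_from_bytes
from email.message import EmailMessage

IMAP_HOST, SMTP_HOST = "imap.example.com", "smtp.example.com"
USER, PASSWORD = "alerts@example.com", "********"
TRIGGER_SUBJECT = "send me todays sales report"


def generate_sales_report() -> bytes:
    """Placeholder: in the Pentaho setup this would be a PDI job/report."""
    return b"date,amount\n2024-07-21,1000\n"


imap = imaplib.IMAP4_SSL(IMAP_HOST)
imap.login(USER, PASSWORD)
imap.select("INBOX")

# Find unread mails whose subject matches the trigger phrase.
_, data = imap.search(None, f'(UNSEEN SUBJECT "{TRIGGER_SUBJECT}")')
for num in data[0].split():
    _, msg_data = imap.fetch(num, "(RFC822)")
    sender = message_from_bytes(msg_data[0][1])["From"]

    reply = EmailMessage()
    reply["From"], reply["To"] = USER, sender
    reply["Subject"] = "Todays sales report"
    reply.set_content("Please find the requested report attached.")
    reply.add_attachment(generate_sales_report(), maintype="text",
                         subtype="csv", filename="sales_report.csv")

    with smtplib.SMTP_SSL(SMTP_HOST) as smtp:
        smtp.login(USER, PASSWORD)
        smtp.send_message(reply)

imap.logout()
```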
Friday, 9 October 2020
How to load data into Anaplan and Amazon S3 using Pentaho
1. Read data from the data source.
2. Get the AWS S3 file location.
3. Load to Anaplan using web services (REST API call).
4. Capture records that failed while loading to Anaplan.
5. Retrieve those failed records using a JSON query and load them into a separate CSV file in Amazon S3.
6. Stop the process if there are no records to load.
https://drive.google.com/file/d/1pMiFN91r63y_jMp4tDnYwpz1EJ_-BbhA/view?usp=sharing
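As a rough illustration of steps 4 and 5, the sketch below writes failed records to a CSV and pushes it to S3 with boto3; the bucket, key, and record layout are placeholders, and the actual Anaplan REST call and JSON parsing happen inside the Pentaho job linked above.

```python
# Minimal sketch of steps 4-5: dump failed records to CSV and push to S3.
# Bucket, key and the record layout are placeholders; the real flow does
# the Anaplan REST call and JSON parsing inside the Pentaho job.
import csv
import io
import boto3

failed_records = [
    {"line": 12, "error": "Invalid list item"},   # example data only
    {"line": 48, "error": "Missing dimension"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["line", "error"])
writer.writeheader()
writer.writerows(failed_records)

boto3.client("s3").put_object(
    Bucket="my-anaplan-bucket",
    Key="failed/anaplan_failed_records.csv",
    Body=buffer.getvalue().encode("utf-8"),
)
```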
Wednesday, 14 August 2019
How to load/retrieve data to Neo4j using Pentaho
Neo4j prerequisite
Download a working copy of the above example from here, and the 2nd file here.
Load data to Neo4j from CSV: download a working copy from here and an example CSV file from here.
How to connect Neo4j from PDI: place the Neo4j driver JAR in the data-integration\lib folder.
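Outside PDI, the same kind of CSV load can be sketched with the official Neo4j Python driver; the bolt URL, credentials, CSV columns, and node labels below are assumptions for illustration, not taken from the downloadable example.

```python
# Minimal sketch: load rows from a CSV file into Neo4j with the Python driver.
# The bolt URL, credentials, CSV columns and node labels are assumptions.
import csv
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session, open("people.csv", newline="") as f:
    for row in csv.DictReader(f):   # expects columns: name, city
        session.run(
            "MERGE (p:Person {name: $name}) "
            "MERGE (c:City {name: $city}) "
            "MERGE (p)-[:LIVES_IN]->(c)",
            name=row["name"],
            city=row["city"],
        )

driver.close()
```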
Sunday, 29 April 2018
How to use the Kafka consumer in Pentaho 8
1. Create the main and sub-transformation as discussed below.
2. Call the sub-transformation from the main transformation.
Note: a sub-transformation is required for the Kafka Consumer step.
Download a working sample from here:
https://drive.google.com/open?id=1Z4C2miczU0BnB4n3r1LcpN78v2UjefWQ
In the Kafka transformation:
In the sub-transformation:
| Field | Type |
| message | None |
| topic | None |
| partition | None |
| offset | None |
| timestamp | Timestamp |
Kafka Version: kafka_2.11-1.1.0
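For comparison, the kafka-python sketch below reads the same fields (message, topic, partition, offset, timestamp) that the PDI Kafka Consumer step exposes; the broker address, topic name, and group id are placeholders.

```python
# Minimal sketch: read the same fields the PDI Kafka Consumer step exposes
# (message, topic, partition, offset, timestamp) with kafka-python.
# Broker address, topic name and group id are placeholders.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "test",                            # topic
    bootstrap_servers="localhost:9092",
    group_id="pdi-demo-group",         # change the group id to re-read from the start
    auto_offset_reset="earliest",
)

for record in consumer:
    print(
        record.value,      # message
        record.topic,      # topic
        record.partition,  # partition
        record.offset,     # offset
        record.timestamp,  # timestamp (epoch millis)
    )
```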
References: https://help.pentaho.com/Documentation/8.0/Products/Data_Integration/Transformation_Step_Reference/Kafka_Consumer
Saturday, 28 April 2018
Load data to Kafka from Pentaho
Set up Kafka and use it from Pentaho
In the bin folder, the sh files are used to set up Kafka in a Linux environment. In the bin\windows folder there are also bat files corresponding to those sh files, which are meant to work in a Windows environment. Some say you can use Cygwin to execute the sh scripts in order to run Kafka; however, there are many additional steps involved, and in the end you may not get the desired outcome. With the correct bat files there is no need for Cygwin, and only the Server JRE is required to run Kafka on Windows.
Step 0: Preparation
Install Java 8 SE Server JRE/JDK
- Unpack it to a folder, for example C:\Java.
- Update the system environment variable PATH to include C:\Java\jre\bin, follow this guide provided by Java.
Download Kafka
- Download the binaries from http://kafka.apache.org/downloads.html
- Unpack it to a folder, for example C:\kafka
Step 1: Update files and configurations
Update Kafka configuration files
- Open config\server.properties and change the log directory property (log.dirs) to a Windows path (for example C:\kafka\kafka-logs).
- Open config\zookeeper.properties and change the data directory property (dataDir) to a Windows path (for example C:\kafka\zookeeper-data).
Step 2: Start the Server
- Start ZooKeeper. Optionally, create a bat file (for example start-zookeeper.bat) with the content: zookeeper-server-start.bat {base folder}\kafka_2.11-1.1.0\config\zookeeper.properties
- Start the Kafka broker. Optionally, create a bat file (for example start-kafka.bat) with the content: kafka-server-start.bat {base folder}\kafkaBinary\kafka_2.11-1.1.0\config\server.properties
Step 3: Create a topic
- Create a topic, for example: bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Step 4: Send some messages
- Start a console producer and type a few messages, for example: bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic test
Step 5: Start a consumer
- Start a console consumer to read the messages back, for example: bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --from-beginning
Common Errors and Solutions
classpath is empty. please build the project first e.g. by running 'gradlew jarall'
- Note: do not download the source files from Apache Kafka; download the binary release.
Kafka java.io.EOFException - NetworkReceive.readFromReadableChannel
- Note: this typically means the client is connecting to the wrong port (for example the ZooKeeper port 2181 instead of the broker port 9092) or the broker is not running.
References
https://kafka.apache.org/
Kafka tool installation:
Load data to a Kafka topic using Pentaho 8.0:
While using the Pentaho consumer, you can use the same configuration as shown in the above example.
Important: you need to change the GROUP ID every time if you want to read data from the start.
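For comparison, the kafka-python sketch below publishes a few messages to a topic, which is essentially the job the PDI Kafka Producer step performs; the broker address and topic name are placeholders.

```python
# Minimal sketch of what the PDI Kafka Producer step does: publish a few
# messages to a topic. Broker address and topic name are placeholders.
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for i in range(5):
    producer.send("test", value=f"message {i}".encode("utf-8"))

producer.flush()   # make sure all messages are actually written
producer.close()
```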
Some useful JSON query tips
Example:
{ "store": { "book": [ { "category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95 }, { "category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99 }, { "category": "fiction", "author": "Herman Melville", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99 }, { "category": "fiction", "author": "J. R. R. Tolkien", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99 } ], "bicycle": { "color": "red", "price": 19.95 } } }
Json Query Result
XPath | JSONPath | Result |
/store/book/author | $.store.book[*].author | the authors of all books in the store |
//author | $..author | all authors |
/store/* | $.store.* | all things in store, which are some books and a red bicycle. |
/store//price | $.store..price | the price of everything in the store. |
//book[3] | $..book[2] | the third book |
//book[last()] | $..book[(@.length-1)] $..book[-1:] | the last book in order. |
//book[position()<3] | $..book[0,1] $..book[:2] | the first two books |
//book[isbn] | $..book[?(@.isbn)] | filter all books with isbn number |
//book[price<10] | $..book[?(@.price<10)] | filter all books cheaper than 10 |
//* | $..* | all Elements in XML document. All members of JSON structure. |
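A few of the JSONPath expressions from the table can be tried directly in Python with the jsonpath-ng package; the sketch below uses a shortened copy of the store document and sticks to simple expressions (filter queries such as ?(@.price<10) need the library's extended parser, jsonpath_ng.ext).

```python
# Minimal sketch: evaluate a few of the JSONPath expressions from the table
# above with the jsonpath-ng package. Filter expressions such as
# ?(@.price<10) need the library's extended parser (jsonpath_ng.ext).
import json
from jsonpath_ng import parse  # pip install jsonpath-ng

store_doc = json.loads("""
{ "store": { "book": [
    { "category": "reference", "author": "Nigel Rees",   "title": "Sayings of the Century", "price": 8.95 },
    { "category": "fiction",   "author": "Evelyn Waugh", "title": "Sword of Honour",        "price": 12.99 } ],
  "bicycle": { "color": "red", "price": 19.95 } } }
""")

# $.store.book[*].author -> the authors of all books in the store
print([m.value for m in parse("$.store.book[*].author").find(store_doc)])

# $..author -> all authors anywhere in the document
print([m.value for m in parse("$..author").find(store_doc)])

# $.store.book[1].title -> the title of the second book
print([m.value for m in parse("$.store.book[1].title").find(store_doc)])
```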