Saturday 8 July 2023

which cloud is best option for creating data lake

 Both Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer robust services for creating data lakes. The choice between the two depends on various factors, including your specific requirements, existing infrastructure, expertise, and budget. Here are some considerations for each cloud provider:

Google Cloud Platform (GCP) for Data Lakes:

- BigQuery: GCP's BigQuery is a fully managed, serverless data warehouse and analytics platform that can be used as a foundation for a data lake. It offers scalable storage and querying capabilities, along with integration with other GCP services.

- Cloud Storage: GCP's Cloud Storage provides a highly scalable and durable object storage solution. It can be used as a landing zone for ingesting and storing raw data before transforming it into a structured format for analysis.

- Dataflow: GCP's Dataflow is a managed data processing service that enables real-time and batch data processing pipelines. It can be used for data transformation, cleansing, and enrichment tasks within a data lake architecture.

- Dataproc: GCP's Dataproc is a managed Apache Hadoop and Apache Spark service. It provides a scalable environment for running big data processing and analytics workloads on GCP.

Amazon Web Services (AWS) for Data Lakes:

- Amazon S3: AWS's Simple Storage Service (S3) is a highly scalable object storage service that can be used as the foundation for a data lake. It provides durability, availability, and security for storing large volumes of data.

- Amazon Glue: AWS's Glue is a fully managed extract, transform, and load (ETL) service. It can automate the process of cataloging, cleaning, and transforming data to make it ready for analysis within a data lake.

- Amazon Athena: AWS's Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using SQL queries. It provides an on-demand, serverless approach to querying data within a data lake without the need for infrastructure provisioning.

- AWS Lake Formation: AWS Lake Formation is a service that simplifies the process of setting up and managing a data lake. It provides features for data ingestion, metadata management, and access control.

Both GCP and AWS have strong offerings for building data lakes, and the choice depends on your specific needs and preferences. It is recommended to evaluate the features, pricing, scalability, security, and ecosystem of each provider to determine which one aligns best with your requirements and organizational goals.

Storage types in GCP vs AWS and Azure cloud

 Here's a comparison of storage types available in Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure:

1. Object Storage:

   - GCP: Cloud Storage

   - AWS: Amazon S3 (Simple Storage Service)

   - Azure: Azure Blob Storage

2. Block Storage:

   - GCP: Persistent Disk

   - AWS: Amazon Elastic Block Store (EBS)

   - Azure: Azure Managed Disks

3. File Storage:

   - GCP: Cloud Filestore

   - AWS: Amazon Elastic File System (EFS)

   - Azure: Azure Files

4. In-Memory Data Store:

   - GCP: Cloud Memorystore (supports Redis and Memcached)

   - AWS: Amazon ElastiCache (supports Redis and Memcached)

   - Azure: Azure Cache for Redis

5. NoSQL Database:

   - GCP: Cloud Firestore, Cloud Bigtable

   - AWS: Amazon DynamoDB

   - Azure: Azure Cosmos DB

6. Relational Database:

   - GCP: Cloud Spanner

   - AWS: Amazon RDS (Relational Database Service)

   - Azure: Azure SQL Database

7. Data Warehousing:

   - GCP: BigQuery

   - AWS: Amazon Redshift

   - Azure: Azure Synapse Analytics (formerly SQL Data Warehouse)

8. Archive Storage:

   - GCP: Cloud Storage Coldline, Archive

   - AWS: Amazon S3 Glacier

   - Azure: Azure Archive Storage

It's important to note that while there are similarities in the storage types provided by GCP, AWS, and Azure, there may be differences in terms of specific features, performance characteristics, pricing models, and regional availability. It's advisable to consult the respective cloud providers' documentation for detailed information on each storage service and evaluate which one best fits your specific requirements.

How Aws can used for data ware housing with pentaho

AWS (Amazon Web Services) offers several services that can be used for data warehousing. Here are some key AWS services commonly used in data warehousing:

1. Amazon Redshift: Redshift is a fully managed data warehousing service that allows you to analyze large volumes of data. It provides a petabyte-scale data warehouse that can handle high-performance analytics workloads. Redshift integrates with various data sources, including Amazon S3, Amazon DynamoDB, and other AWS services. It offers features like columnar storage, parallel query execution, and data compression for efficient data storage and retrieval.

2. Amazon S3 (Simple Storage Service): S3 is an object storage service that can be used as a data lake for storing raw data. It provides scalable storage for structured and unstructured data. You can store data in S3 and then load it into Redshift or other data warehousing systems for analysis. S3 integrates with various AWS services and provides high durability, availability, and security.

Pentaho Can be used as ETL and data loading tool instead of glue/

3. AWS Glue: Glue is a fully managed extract, transform, and load (ETL) service. It allows you to prepare and transform your data for analytics. Glue provides a serverless environment to run ETL jobs, and it automatically generates code to infer schema and transform data. You can use Glue to prepare data for loading into Redshift or other data warehousing solutions.

4. AWS Data Pipeline: Data Pipeline is a web service for orchestrating and automating the movement and transformation of data between different AWS services. It helps you create data-driven workflows and manage dependencies between various steps in your data processing pipeline. Data Pipeline can be used to schedule and automate data movement and transformation tasks for data warehousing.

5. Amazon Athena: Athena is an interactive query service that allows you to analyze data directly from S3 using standard SQL queries. It enables you to perform ad-hoc queries on your data without the need for data loading or pre-defined schemas. Athena is useful for exploratory data analysis and can be integrated with data warehousing solutions for specific use cases.

6. AWS Glue Data Catalog: Glue Data Catalog is a fully managed metadata repository that integrates with various AWS services. It acts as a central catalog for storing and managing metadata about your data assets, including tables, schemas, and partitions. The Glue Data Catalog can be used to discover and explore data stored in S3 or other data sources, making it easier to manage and query data in your data warehouse.

These are just a few examples of how AWS can be used for data warehousing. Depending on your specific requirements and use case, there may be additional AWS services and tools that can be utilized in your data warehousing architecture.

Wednesday 17 August 2022

 Pentaho Email Alerting system

Pentaho can used to create  an Alerting  tool using that any kind of report can be triggered using email.

Some use case example,

  1. Suppose I want todays sales report to get this report just send email to given email id with specific subject like  “send me todays sales report” After some time (2-3 min) an email with sales report graphical and tabular  as attachment (pdf ,csv, excel ..)will be send to the user.
  2. Data base Artifact report form almost any database can be generated and send on demand to the user .
  • It can configured to send some alert also
    • if database/table  space is less then 10 %
    • if any specific query is taking longer than expected to execute.
    • Top 10 query by performance.
    • Top 10 table by space used.
  • It can configured to load data from any excel sheet(from email attachment) to any database in predefined tables.
  •  It configure to execute any batch of shell command directly from email.
  • Major advantage of this tool is to get reports/Alerts without logging into any system, everything can be triggered using email from specific id.
  • These reports/Alerts can be scheduled to run on any specific time also,

Friday 9 October 2020

How to Load data into Anaplan And Amazon S3 using Pentaho

How to Load data into Anaplan And Amazon S3 using Pentaho

 Beow expample :-

1.Read Data from data source.

2.Get AWS S3 File loacation

3.Load to Anaplan using web services (rest API call)

4.Capture records which are Failed while  loading to anaplan.

5.Retrive those failed records using Json query and load failed record to seperate csv file in Amazon S3

6.Stop the cthe process if there is no record to load.


Wednesday 14 August 2019

How to load \ retrieve data to neo4j using Pentaho

Neo4j and Pentaho (PDI)
This document explain how to connect neo4j with PDI and load and retrieve data from PDI, Load Data from CSV to neo4j
Neo4j prerequisite
aNeo4j should be up and running
      All required credentials including username and password should be available
c  Example below screenshot.
   Download working copy of above example from here and 2ndFile
   Load data to neo4j from csv download working copy from here and example  csv file from here

How to Connect Neo4j from PDI
1. Get JDBC driver
d)     Get the jdbc driver from below location.This driver is pre complied and ready to use.this driver hasbeen tested with pentaho pdi 8.2 and Neo4j desktop 3.5.6.
f)     Add above driver to add to the <pdi Istalation directoey>/data-integration\lib folder
g)    Restart Spoon
2. Create Connection though pentaho
1.Open spoon new transformation
3.Select Table input step
4.go for  creating new connection

5.Create New connection As per below
·         Connection type as Generic database
·         Custom connection URL as jdbc:neo4j://localhost:7474
·         Custom driver class name as org.neo4j.jdbc.Driver

6.Test connection:-

2. Load Data to Neo4j using Pentaho(PDI)
1.Select Execute SQL script
2.Put SQL script into the  " Execute SQL script "  step. Refer below pic attached.

Run the transformation:
After successful run this will load one record "SHANKAR" to Neo4j

2. Retrieve Data from Neo4j using Pentaho(PDI)
1.Select Table input step  
2.Put SQL script into the  " Table input "  step. Refer below pic attached.

Run the transformation:
After successful run one record will be retrived from Neo4j,
below is log snip shot
2019/08/14 14:01:20 - Write to logs.0 - ------------> Linenr 1------------------------------
2019/08/14 14:01:20 - Write to logs.0 - ====Data retrived from neo4j========
2019/08/14 14:01:20 - Write to logs.0 -
2019/08/14 14:01:20 - Write to logs.0 - Shankar = {"born":1982,"name":"Shankar"}

Load Data from CSV file to neo4j

Load your Csv file to neo4j installation directory or you can put ditrctly http,https,FTP location as well.

<Instalation directory >\.Neo4jDesktop\neo4jDatabases\database-39ba8418-e334-4730-b8b2-1434f4d6db48\installation-3.5.6\import\desktop-csv-import\<csv file name>

Download working copy of above example from here and 2ndFile

Load data to neo4j from csv download working copy from here and expale csv file from here

Some useful Cypher commands:-

1.Load Data from CSV file to neo4j without headers.

LOAD CSV FROM 'file:///desktop-csv-import/NeotestCSV2.csv' AS line

 CREATE (:Artist2 { Test: line[1], Name: (line[2])})

.Load Data from CSV file to neo4j with headers.

LOAD CSV FROM 'file:///desktop-csv-import/NeotestCSV2.csv' AS line
 CREATE (:Artist2 { Test: line[1], Name: (line[2])})

1.check count of loaded record
 MATCH (p:Artist)
 RETURN count(p)

2.Select record from Lable (Table in neo4j)
MATCH (p:Artist)

3. Get the queryId by useing below command
 CALL dbms.listQueries()

4.Kill Running query in neo4j
CALL dbms.killQuery('query-685')

Sunday 29 April 2018

How to use Kafka consumer in pentaho 8

1.Create  main and sub transformation as discussed below sub transformation from main Transformation

Note:-Sub transformation required for Kafka consumer step

Download working sample from here

In the kaka transformation,

1.We are using direct bootstrap server on connection. 
2. we added the consumer group "test-consumer-group1" change consumer group after every run to retrieve Kafka message from start.
Important:-if you not change consumer group, kafka will not retrieve any message unless any new message arrived to topic.
like test-consumer-group1,test-consumer-group2,test-consumer-group3 .....
3. Changed the auto.offset.reset to "earliest" on options tab.

In the sub transformation.

In "Get records from stream" step, we gave the below fields Fieldname Type key None
message None
topic None
partition None
offset None
timestamp Timestamp

Kafka Version: kafka_2.11-1.1.0

I uploaded a sample .ktr that works


Saturday 28 April 2018

Load data to Kafka from pentaho

Set Up Kafka and use from pentaho

In this tutorial, I will show how to set up and run Apache Kafka on Windows/linux and how to load/read data from pentaho. 
Kafka comes with two sets of scripts to run Kafka. In the bin folder, the sh files are used to set up Kafka in a Linux environment. In the bin\windows folder, there are also some bat files corresponds to those sh files which are supposed to work in a Windows environment.  Some say you can use Cygwin to execute the sh scripts in order to run Kafka. However, they are many additional steps involved, and in the end you may not get the desired outcome. With the correct bat files, there is no need to use Cygwin, and only Server JRE is required to run Kafka on Windows.

Step 0: Preparation

Install Java 8 SE Server JRE/JDK

You need Java SE Server JRE in order to run Kafka. If you have JDK installed, you already have Server JRE installed, just check if the folder {JRE_PATH}\bin\server exists. If it is not, follow the following steps to install Java SE Server JRE:
  1. Unpack it to a folder, for example C:\Java.
  2. Update the system environment variable PATH to include C:\Java\jre\bin, follow this guide provided by Java.

Download Kafka

  1. Download the binaries from
  2. Unpack it to a folder, for example C:\kafka

Step 1: Update files and configurations

Update Kafka configuration files

The config files need to be updated corresponding to Windows path naming convention.
Change this path if you using Windows
*** Important :- create this path inside Kafka root directory,other wise Kafka server may not start
  1. Open config\, change


2.. Open config\, change



Step 2: Start the Server

In Windows Command Prompt, switch the current working directory to C:\kafka:

cd C:\kafka

  1. Start Zookeeper
  2. you can create bat file to start Zooker (optional)  kafka-server-start.bat and put below conteent      kafka-server-start.bat {base folder}\kafkaBinary\kafka_2.11-1.1.0\config\
Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don’t already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

  1. you can create bat file to start Kafka (optional)    kafka-server-start.bat and put below conteent      zookeeper-server-start.bat {base folder}\kafka_2.11-1.1.0\config\

> .\bin\windows\kafka-server-start.bat .\config\

Step 3: Create a topic

  1. Create a topic
Let’s create a topic named “test” with a single partition and only one replica:

> .\bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

2.. List topics
We can now see that topic if we run the list topic command:

> .\bin\windows\kafka-topics.bat --list --zookeeper localhost:2181

Step 4: Send some messages

Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default each line will be sent as a separate message.
  1. Start console producer

> .\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic test

2.. Write some messages

This is a message
This is another message

Step 5: Start a consumer

> .\bin\windows\kafka-console-consumer.bat --zookeeper localhost:2181 --topic test --from-beginning

If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
Yay cheers!

Common Errors and Solutions

classpath is empty. please build the project first e.g. by running 'gradlew jarall'

  • Note: Do not download a source files from appache kafka, download a binary file

Kafka - NetworkReceive.readFromReadableChannel

Zookeeper may not configured propery (may be configured with different port number)


Kafka tool installation:

This is GUI interface of kafka where you can view messages,check count of messages and create and delete Kafka topic
Down load kafka tool from
Now install executable file in your system
This is very useful tool to view data in kafka

Load data to Kafka Topic using pentaho 8.0:

step 0:-if you are not using pentaho 8 you may require to download kafka producer and consumer from pentaho wiki

1.start spoon.bat spoon canvas
3.Read  data from file/database or data grid
4.convert data to valid Json object 
5.Load data to kafka topic (may be conversion to valid json object is required)

Download below example from here

while using  pentaho consumer you can use same configration as shown in above example

Important:-you need to change GROUP ID every time if you want to read data from start. 

Some useful Json quary Tips. 


{ "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
    "bicycle": {
      "color": "red",
      "price": 19.95

Json Query Result 

/store/book/author$[*].authorthe authors of all books in the store
//author$..authorall authors
/store/*$.store.*all things in store, which are some books and a red bicycle.
/store//price$.store..pricethe price of everything in the store.
//book[3]$[2]the third book
the last book in order.
the first two books
//book[isbn]$[?(@.isbn)]filter all books with isbn number
//book[price<10]$[?(@.price<10)]filter all books cheapier than 10
//*$..*all Elements in XML document. All members of JSON structure.