Wednesday, 18 June 2014

Pentaho BA Server Performance Tips

Pentaho BA Server Performance Tips
 
The Pentaho BA Server ships in a condition designed to work well for the majority of customers. However, deployments
that drift toward opposite extremes -- very large and very small -- will need to adjust certain settings, and possibly even
remove certain unused functionality, in order to achieve the desired performance goals without adding hardware.
Read through the subsections below and decide which ones apply to your scenario.
 
Move Pentaho Managed Data Sources to JNDI
 
Most production BI environments have finely-tuned data sources for reporting and analysis. If you haven't done any
data warehouse performance-tuning, you may want to consult
Pentaho Analysis (Mondrian) Performance Tips
on page
20
for basic advice before proceeding.
Pentaho provides a Data Source Wizard in the Pentaho User Console and a data source dialogue in the Pentaho
Enterprise Console that enable business users to develop rapid prototype data sources for ad hoc reporting and
analysis. This is a great way to get off the ground quickly, but they are "quick and dirty" and not performant. For
maximum performance, you should establish your own JNDI data connections at the Web application server level, and
tune them for your database.
JNDI data sources can be configured for Pentaho client tools by adding connection details to the
~/.pentaho/
simple-jndi/default.properties
file on Linux, or the
%userprofile%\.pentaho\simple-jndi
\default.properties
file on Windows. Design Studio requires that connection details be added to
/pentaho-
solutions/system/simple-jndi/jdbc.properties
as well.
Manual Cleanup of the /tmp Directory
Every time you generate content on the BA Server, temporary files are created on the local file system in the
/
pentaho-solutions/system/tmp/
directory. In some cases, the BA Server may not properly purge that temporary
content, leaving behind orphaned artifacts that can slowly build up and reduce performance on the volume that contains
the pentaho-solutions directory. One way to address this is to mount the /tmp directory on a separate volume, thereby
siphoning off all disk thrash associated with creating new content. However, you will still have to perform a manual
garbage collection procedure on this directory on a regular basis. You can accomplish this via a script that runs through
your system scheduler; it should be safe to delete any content files in this directory that are more than a week old.
 
Memory Optimization for the Geo Service Plugin
 
The Pentaho Geo Service enables Geo Map visualizations in Analyzer.
If you do not use Analyzer
or are sure that you are not using the Geo Service, you can free up approximately 600MB
of RAM by removing the Geo Service plugin. Simply shut down the BA Server and delete the
/pentaho/server/
biserver-ee/pentaho-solutions/system/pentaho-geo/
directory.
If you are a heavy user of the Geo Service
, update the cache setting for
pentaho-geo-municipality
in the
ehcache.xml
. This file can be found in the
/pentaho/server/biserver-ee/tomcat/webapps/pentaho/WEB-
INF/classes
directory.
<cache
name="pentaho-geo-municipality"
maxElementsInMemory="125000"
eternal="false"
overflowToDisk="true"
timeToIdleSeconds="0"
timeToLiveSeconds="0"
diskPersistent="false"
diskExpiryThreadIntervalSeconds="120"
/>
The default setting,
maxElementsInMemory="125000"
, is a relatively low number and might need to be increased
if you are using the Geo Service. Pentaho suggests a setting between
275000
and
700000
for a heavy usage. When
|
Pentaho BA Server Performance Tips
|
7
you increase the
maxElementsInMemory
setting, less memory is available for other resources. To accomodate higher
maxElementsInMemory
settings, increase the maximum memory allocated to the JVM running Pentaho processes.
The default setting is 768m that there are no memory leaks or zombie sessions inherent in the BI Platform.
 
Please Refer below PDF for more Detail
 
https://drive.google.com/file/d/0B_j1hJxesvxdbDdEZWhkbTZmQW8/edit?usp=sharing

Sunday, 15 June 2014

kettle variables to property file

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice urls and file system paths.

Variables


In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic features that provide the mechanism to configure the transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focussing on how to manage the configuration once you already used variable references inside your your jobs and transformations.

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to the variables so any subsequent jobs and transformations can access the configration data through variable references. The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value the appropriate value. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields called "Key" and "value". Feeding it directly into the "Set variables" step would just lead to creating two variables called Key and Value, and they would be assigned values multiple times for all key/value pairs in the stream. So in order to meaningfully assign variable, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields.

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Normaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep he config.propeties and the "Denormaliser" synchronized, and in practice it's very easy to make mistakes here.
  • Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific for just one particular project.
Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variables names anyway, I was tempted to store the configration values themselves also inside the transformation (using a "generate rows" or "data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: Javascript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignemnt is done with Kettle's built-in setVariable(key, value, scope). The key and value from the incoming stream are passed as arguments to the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transormation

  • "g" - up to the grandparent job of this transormation
For my purpose, I settle for "r".

The bonus is that this set-variables.ktr

download kettle file from below location
https://drive.google.com/file/d/0B_j1hJxesvxdTEdpalZHUjJDenM/edit?usp=sharing

source:-http://rpbouman.blogspot.in/2011/05/managing-kettle-job-configuration.html

Sunday, 11 May 2014

How to apply SCD Type2 using pentaho Kettle

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
Following are the steps:
1. Create required tables in database.
2. Insert following Steps with following properties

Create a job like below:-



Create a  Transformation like below:- 


  you can Download job and tranformation from below link

https://drive.google.com/file/d/0B_j1hJxesvxdV3hMMWJmZndyQUk/edit?usp=sharing

Detail of Each Step is given Below:-
2.1 Table Input
 

2.2 Database Lookup


2.3 Filter Rows

2.4 Get System Info

2.5 Add Sequence 2

2.6 Select Value

2.7 Table Output

2.8 Filter row 2

2.9 Dummy(do nothing)

2.10 Execute SQL Script

2.11 Add System Info 2

2.12 Add Sequence

2.13 Select Value 2

2.14 Table Output 2

3. The overall steps arrangement will be as follows:

4. Save and execute.