Channel: Diethard Steiner on Business Intelligence

How to work with MapReduce Key Value Pairs in Pentaho Data Integration


My main objective for this article is to give you an understanding of how to use multiple fields to group by and multiple fields to aggregate on in Pentaho Data Integration (PDI) MapReduce.


The input key for the mapper is auto-generated and the value is usually the line of text which is read in (fields separated by a comma in our example). This section focuses on the output key value pair of the mapper and the input and output key value pairs of the reducer. We will also not discuss the simple scenario where we only use one field for the key and one field for the value.


I have more than one key field. How do I set up a compound key?


You are aware that the input and output of the mapper and reducer are key value pairs. If you haven’t been exposed that much to the internals of MapReduce and come more from a traditional ETL world, this is probably one of the most important concepts to understand.
Did you ever run a Hive query? Did you have to worry about the key fields? No. Hive does quite some work in the background which many users are never exposed to. So when you come to PDI and create the key for your mapper and reducer transformations, the important point is that you have to separate the fields that form the key by the standard separator of the specified output format of the MapReduce job. If you choose the output format org.apache.hadoop.mapred.TextOutputFormat, the standard separator is a tab.


Option 1: Thankfully, Pentaho introduced a step not too long ago that does just this in an easy fashion: the new Concat Fields step (Wiki entry). This step allows you to create a new field based on several concatenated source fields, separated by a character of your choice, such as a tab. If you specified org.apache.hadoop.mapred.TextOutputFormat as the output format in the Pentaho MapReduce job entry, tab is the standard separator.


4.4.0 release note: “Unfortunately we found an issue (PDI-8857) with this step that was too late to incorporate into 4.4.0. The step adds carriage return and line feed to the fields it creates. The workaround is to use the String operations step with the option "carriage return & line feed" after the step, or to enable the advanced option "Fast data dump (no formatting)".”


Option 2: Use a User Defined Java Expression step. This option was mainly used before the Concat Fields step was available. Generate the output key by writing a Java expression which concatenates the fields you want to group by.
Separate the fields with a tab in the concatenated output key, for example:


date + '\t' + brand


Important: Replace the \t with a real (literal) tab character! The expression should then look like this, with the tab sitting between the quotes:


date + '' + brand


This way, all the fields will be properly separated in the final output. Tab in this case is the standard separator of org.apache.hadoop.mapred.TextOutputFormat.


I have more than one value field. How do I create a compound values field?
What if I want more than one value to aggregate on?


Create a new field, e.g. called output_values, in a Concat Fields or User Defined Java Expression step in the mapper transformation: concatenate all the values and define the separator. Then, in the reducer, split these values (use the Split Fields step), aggregate them (use the Group By step) and finally concatenate them again (use the Concat Fields step).


Let’s walk through a very simple example. We have some sales data which we want to analyze. Let’s say we want the sum of sales and a count of rows by date and brand.


The Kettle job:


Our input data for the Pentaho MapReduce job looks like this (date, brand, department, sales):


$ hadoop fs -cat /user/dsteiner/sales-test/input/sales.txt
2013-04-01,SimplePurpose,Clothes,234.2
2013-04-01,SimplePurpose,Accessories,2314.34
2013-04-01,RedPride,Kitchen,231.34
2013-04-02,SimplePurpose,Clothes,453.34
2013-04-01,SimplePurpose,Accessories,5432.34
2013-04-01,RedPride,Kitchen,432.23
2013-04-03,RedPride,Kitchen


The mapper transformation (simple example):
If we want to inspect what the output of the mapper transformation looks like, we can simply execute the Pentaho MapReduce job entry without specifying a reducer.


Output of the mapper. Note that the key is formed by the first two fields, separated by a tab, and the value is formed by the sales and count fields, separated by a comma:


$ hadoop fs -cat /user/dsteiner/sales-test/output/part-00000
2013-04-01	RedPride	231.34,1
2013-04-01	RedPride	432.23,1
2013-04-01	SimplePurpose	234.2,1
2013-04-01	SimplePurpose	2314.34,1
2013-04-01	SimplePurpose	5432.34,1
2013-04-02	SimplePurpose	453.34,1
2013-04-03	RedPride	,1


The reducer transformation (simple example):


Our output data looks like this (date, brand, sum of sales, count):
$ hadoop fs -cat /user/dsteiner/sales-test/output/part-00000
2013-04-01	RedPride	663.57	2
2013-04-01	SimplePurpose	7980.88	3
2013-04-02	SimplePurpose	453.34	1
2013-04-03	RedPride	0	1


So you can see that we successfully managed to aggregate our data by date and brand and sum up the sales as well as perform a count on the rows.

It’s best if you take a look at my sample files (which you can download from here) to understand all the details. I hope that this brief article shed some light on creating key value pairs for the Pentaho MapReduce framework.


Pentaho Kettle Parameters and Variables: Tips and Tricks


This blog post is not intended to be a formal introduction to using parameters and variables in Pentaho Kettle, but more a practical showcase of possible usages.


Please read my previous blog post Pentaho Data Integration: Scheduling and command line arguments as an introduction on how to pass command line arguments to a Kettle job.


When I mention parameters below, I am always talking about named parameters.

Parameters and Variables

Definitions upfront

Named Parameter: “Named parameters are a system that allows you to parameterize your transformations and jobs.  On top of the variables system that was already in place prior to the introduction in version 3.2, named parameters offer the setting of a description and a default value.  That allows you in turn to list the required parameters for a job or transformation.” (Source)


Variable: “Variables can be used throughout Pentaho Data Integration, including in transformation steps and job entries. You define variables by setting them with the Set Variable step in a transformation or by setting them in the kettle.properties file. [...] The first usage (and only usage in previous Kettle versions) was to set an environment variable. Traditionally, this was accomplished by passing options to the Java Virtual Machine (JVM) with the -D option. The only problem with using environment variables is that the usage is not dynamic and problems arise if you try to use them in a dynamic way. Changes to the environment variables are visible to all software running on the virtual machine. [...] Because the scope of an environment variable is too broad, Kettle variables were introduced to provide a way to define variables that are local to the job in which the variable is set. The "Set Variable" step in a transformation allows you to specify in which job you want to set the variable's scope (i.e. parent job, grand-parent job or the root job).” (Source)

Example

Let’s walk through this very simple example of using parameters and variables. I try to explain all the jobs and transformations involved. The files are also available for download here. You can find the following files in the folder intro_to_parameters_and_variables.

jb_main.kjb

In this extremely simple job we call a subjob called jb_slave.kjb. In this case, we defined hard coded parameter values in the job entry settings. Alternatively, to make this more dynamic, we could have just defined parameters in the job settings.

jb_slave.kjb

This subjob executes the transformations tr_set_variables.ktr and tr_show_param_values.ktr. In this case, in order to access the parameter values from the parent job, we defined the parameters without values in the job settings:
Note: This is just one of the ways you can pass parameters down to the subprocess.

tr_set_variables.ktr

This transformation sets a variable called var1 with scope Valid in parent job so that successive processes can make use of it. In this case the values originate from a Generate Rows step for demonstration purposes; in real world examples you might read in some values from a file or a database table.

tr_show_param_values.ktr

The main transformation has the sole purpose of writing all the parameter and variable values to the log. We retrieve the parameters and variables by using a Get Variables step. We also check if a value is present by using a Filter Rows step. If a value is missing, we Abort the transformation; otherwise the values are written to the log.


There is no need to set the parameter names in this transformation; there is an advantage though if you do:
Missing parameter values will be properly displayed as NULL, which makes it a bit easier to check for them.
If you don't define them in the transformation settings, missing parameter values will be displayed as ${PARAMETERNAME}.


Important: Variables coming from tr_set_variables.ktr MUST NOT be listed in the Parameter tab in the Transformation Settings as this overrides the variable.

Making Parameters available for all subprocesses in an easy fashion

As you saw above, defining the parameters for each subprocess just to be able to pass them down can be a bit labour intensive. Luckily, there is a faster way of doing just this:


  1. In the main job specify the parameters that you want to pass in, in the Job Settings:
    This way parameters and their values can be passed in from the command line, for example (see the sketch after this list).
  2. Right after the Start job entry use the Set Variables job entry. Specify the variable names, reference the parameters you set up in step 1 and set the scope to Valid in the current job.
  3. There is no need to specify any parameters/variables in any of the subprocesses.
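
For illustration, assuming jb_main.kjb defines a named parameter called VAR_COUNTRY (a hypothetical name used here purely as an example), the call via Kitchen could look roughly like this:

$ sh kitchen.sh -file=/path/to/jb_main.kjb -param:VAR_COUNTRY=UK -level=Basic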


To see how this is working, run jb_main.kjb in the passing_down_parameters_in_an_easy_fashion folder (part of the provided examples).

What if I still want to be able to run my subprocess independently sometimes?

You might have situations when you have to run the subprocess independently (in other words: you do not execute it from the parent/main job, but run it on its own). When we pass down parameters or variables, this can be a bit tricky and usually it doesn’t work out of the box. Luckily, there is a way to achieve this:
  1. In the subprocess, specify the parameter that you want to be able to pass in. In our example (which is based on the previous example), we modified the transformation tr_show_param_values.ktr and added the following parameters to the Transformation Settings:
    We also amended the Get Variables step to make use of these parameters:
    This way, we can already run this transformation on its own. Now we only have to adjust the parent job so that we can run it from there as well.
  2. In the parent job, in the Job or Transformation job entry settings, go to the Parameters tab and tick Pass all parameter values down to the sub-transformation/sub-job. Next, as the Parameter set the name of the parameter you defined in the subprocess. As the Value define the variable that you want to pass down: ${variable}. This assumes that this variable was set beforehand by some Set Variables job entry/step.
    In our case, we modified the transformation job entry in the job jb_slave.kjb and added the following mapping to the job entry settings in the Parameters tab:
A sample for this setup is provided in the mulitpurpose_setup_allow_individual_execution_of_subprocesses folder.

Closing remarks


Using parameters and variables in Kettle jobs and transformations allows you to create highly dynamic processes. I hope this tutorial shed some light on how this can be achieved.

Introducing the Kettle Test Framework Beta

Kettle Test Framework (KTF)

Subtitle: Kettle Testing for the Non-Java Developers

Announcing the KTF Beta:

Precautions



Please note that KTF is still in Beta and has undergone only minimal testing. Please report any bugs on the dedicated Github page so that they can be easily fixed for everybody’s advantage. Do not use for any production purposes.


You must not run this process on a production environment! You should only run this process on a dedicated test environment where it is ok to lose all the data in the database. You must run this process on a dedicated test database! This process wipes out all your tables!

The small print upfront

Please note that this is a community contribution and not associated with Pentaho. Use this framework at your own risk. The author makes no guarantees of any kind and should not be held responsible for any negative impact.


You should have a solid understanding of Pentaho Data Integration/Kettle. I made a minimal attempt to document this framework - the rest is up to you to explore and understand.

Motivation

The main idea behind this framework is to create a base for best testing practices when working with the Kettle ETL tool. Please add any ideas which could improve this test framework as an “improvement” on the dedicated Github page.


Code and samples can be downloaded from Github.


Testing data integration processes should be a core part of your activities. Unfortunately, especially for non-Java developers, this is not quite so straightforward (even for Java developers it is not quite that easy to unit test their ETL processes, as highlighted here). This framework tries to fill this gap by using standard Kettle transformations and jobs to run a test suite.

When you create or change a data integration process, you want to be able to check if the output dataset(s) match the ones you expect (the "golden" dataset(s)). Ideally, this process should be automated as well. By using KTF's standard Kettle transformations and jobs to do this comparison every data integration architect should be in the position to perform this essential task.


Some other community members have published blog posts on testing before, from which this framework took a lot of ideas and inspiration (especially Dan Moore’s excellent blog posts [posts, github]). Some books published on Agile BI methodologies were quite inspirational as well (especially Ken Collier’s “Agile Analytics”).
While Dan focused on a complete file based setup, for now I tried to create a framework which works with processes (jobs, transformations) which make use of Table input and Table output steps. In the next phase the aim is to support file based input and output (csv, txt) as well. Other features are listed below.

Contribute!

Report bugs and improvements/ideas on Github.


Features

Let’s have a look at the main features:
  • Supports multiple input datasets
  • Supports multiple output datasets
  • Supports sorting of the output dataset so that a good comparison to the golden output dataset can be made
  • The setup is highly configurable (but a certain structure is enforced - outlined below)
  • Non conflicting parameter/variable names (all prefixed with “VAR_KTF_”)
  • Non intrusive: Just wraps around your existing process files (except some parameters for db connections etc will have to be defined … but probably you have this done already anyways)

Current shortcomings

  • Not fully tested (only tested with included samples and on PostgreSQL)
  • Currently works only with Table input / output transformations. Text/CSV file input/output will be supported in a future version (which should not be too complicated to add).
  • Dictates quite a strict folder structure
  • Limited documentation

Project Folder Structure

Stick to this directory layout for now. In future versions I might make this more flexible.
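
A sketch of the layout, reconstructed from the steps below (the process and test case names are the ones used by the included samples; yours will differ):

config/
  .kettle/kettle.properties
  simple-jndi/jdbc.properties
  test-cases/
    tr_category_sales/          (one folder per process under test, named after the job/transformation)
      input/                    (DDL and Kettle data type definitions)
      output/                   (DDL, data type definitions, sort order definition)
      simpletest/               (one folder per test case)
        input/                  (input dataset(s) for this test case)
        output/                 (golden output dataset(s))
repository/
  main/                         (your jb_* jobs and tr_* transformations)
  test/                         (the KTF jobs and transformations)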


  1. Adjust parameter values in config/.kettle/kettle.properties and JNDI details in config/simple-jndi/jdbc.properties
  2. Your transformations have to be prefixed for now with “tr_” and your jobs with “jb_”. Do not use any special characters or spaces in your job/transformation names. Your transformations and jobs have to be saved within repository/main. There is currently no subfolder structure allowed within this folder.
  3. Within the config/test-cases directory create a folder for the processes you want to test. A process can be a transformation or a job. Name these folders exactly the same as the job/transformation you want to test (just without the file extension).  Each process folder must have an input and output folder which hold the DDL, Kettle data type definitions and in case of the output the sort order definition (see tr_category_sales sample on how to set this up). If your output dataset does not require any sorting, create an empty sort def file (see tr_fact_sales example). Note that KTF can handle more than one output/input dataset.
  4. The process folder must also contain at least one test case folder (which has to have a descriptive name). In the screenshot above it is called “simpletest”. A test case folder must contain an input and output folder which each hold the dedicated datasets for this particular test case. In case of the output folder it will hold the golden output dataset(s), i.e. the dataset(s) that you want to compare your ETL output results to.
  5. Users working on Windows: For all the CSV output steps in transformations under /repository/test set the Format to Windows (Content tab). KTF has not been tested at all on Windows, so you might have to make some other adjustments as well.
  6. Add the environment variables defined in config/set-project-variables.sh to your .bashrc, then run: source ~/.bashrc (see the sketch after this list).
  7. Start Spoon (this setup requires PDI V5) or run the process from the command line.
  8. Run the test suite
  9. Analyze results in tmp folder. If there is an error file for a particular test case, you can easily visually inspect the differences like this:
dsteiner@dsteiner-Aspire-5742:/tmp/kettle-test/tr_category_sales/sales_threshold$ diff fact_sales_by_category.csv fact_sales_by_category_golden.csv
2c2
< 2013-01-01;Accessories;Mrs Susi Redcliff;399;Yes
---
> 2013-01-01;Accessories;Mrs Susi Redcliff;399;No
5c5
< 2013-01-02;Groceries;Mr James Carrot;401;No
---
> 2013-01-02;Groceries;Mr James Carrot;401;Yes
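
A minimal sketch of steps 6 to 8, assuming set-project-variables.sh only contains export statements and that you kick off the test suite with Kitchen; the job name jb_run_test_suite.kjb is purely hypothetical here, so substitute the actual KTF main job from repository/test:

$ echo 'source /path/to/project/config/set-project-variables.sh' >> ~/.bashrc
$ source ~/.bashrc
$ sh kitchen.sh -file=/path/to/project/repository/test/jb_run_test_suite.kjb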


All files related to testing (KTF) are stored in repository/test. You should not have to alter them unless you find a bug or want to modify their behaviour.


To get a better idea of how this works, look at the included examples, especially tr_category_sales (which has multiple inputs and outputs and proper sorting). The other example, tr_fact_sales, has only one input and output and no sorting defined (as it only outputs one figure).

Future improvements

Following improvements are on my To-Do list:
  • Write test results to dedicated database table
  • Improvement of folder structure
  • Support for text file input and output for main processes (jobs/transformations)

FAQ

My input data sets come from more than one data source. How can I test my process with the KTF?

Your process must have parameter driven database connections. This way you can easily point your various JNDI connections to just one testing input database. The main purpose of testing is to make sure the output is as expected, not to test various input database connections. Hence, for testing, you can “reduce” your multiple input connections to one.
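
As a rough illustration (the connection names and the test database name are made up for this example), config/simple-jndi/jdbc.properties could point two source connections at the same test database like this:

SOURCE_CRM/type=javax.sql.DataSource
SOURCE_CRM/driver=org.postgresql.Driver
SOURCE_CRM/url=jdbc:postgresql://localhost:5432/ktf_test
SOURCE_CRM/user=postgres
SOURCE_CRM/password=postgres
SOURCE_ERP/type=javax.sql.DataSource
SOURCE_ERP/driver=org.postgresql.Driver
SOURCE_ERP/url=jdbc:postgresql://localhost:5432/ktf_test
SOURCE_ERP/user=postgres
SOURCE_ERP/password=postgres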

Going Agile: Sqitch Database Change Management


You have your database scripts under a dedicated version control and change management system, right? If not, I recommend doing this now.
While there have been a handful of open source projects around which focus on DB script versioning and change management control, none has really gained big momentum and a lot of them are dormant.
But there is a new player on the ground! A light at the end of the DB change management tunnel, so to speak. David Wheeler has been working on Sqitch over the last year and the results are very promising indeed! Currently the GitHub project shows 7 other contributors, so let’s hope this project gains strong momentum! A new GitHub project for a Sqitch GUI was also just created.

Why I like Sqitch:
  • You can run all the commands from the command line and get very good feedback.
  • Everything seems quite logical and straightforward: It’s easy to get to know the few main commands  and in a very short amount of time you are familiar with the tool.
  • You can use your choice of VCS.
  • It works very well.

Supported DBs are currently MySQL, Oracle, SQLite and PostgreSQL. CUBRID support is under way.



So what do we want to achieve?

Bring all DDL, stored procedures etc. under version control. This is what Git is very good for (or your VCS of choice).

Keep track of the (order of) changes we applied to the database, verify that they are valid, and be able to revert back to a specific state if required. Furthermore, we want to deploy these changes (up to a specific state) to our test and production databases. This is what Sqitch is intended for:


    The write-up below consists of my notes, partially mixed with David’s.


    Installation


    Options:
    PostgreSQL: cpan App::Sqitch DBD::Pg (You also have to have PostgreSQL server installed)
    SQLite: cpan App::Sqitch DBD::SQLite
    Oracle: cpan App::Sqitch DBD::Oracle (You also have to have SQL*Plus installed)
    MySQL: cpan App::Sqitch

    If you want to have support for e.g. PostgreSQL and Oracle you can just run:
    PostgreSQL: cpan App::Sqitch DBD::Pg DBD::Oracle

    For more install options see here.
    Below I will only discuss the setup for PostgreSQL.


    On the terminal run:
    $ sudo cpan App::Sqitch DBD::Pg

    During installation it will ask you for the PostgreSQL version. If you are not sure, run:
    $ psql --version

    It then asks you for a PostgreSQL bin directory. On Ubuntu, this is located in:
    /usr/lib/postgresql/9.1/bin

    Next it will ask you where the PostgreSQL include directory is located. You can find this out by running the following:
    $ pg_config --includedir

    If you don’t have pg_config installed, run first:
    $ sudo apt-get install libpq-dev

    The include location on Ubuntu should be:
    /usr/include/postgresql

    Once installation is finished, check out the man page:
    $ man sqitch

    Within your git project directory, create a dedicated folder:
    $ mkdir sqitch
    $ git add .
    $ cd sqitch
    $ sqitch --engine pg init projectname

    Let's have a look at sqitch.conf:
    $ cat sqitch.conf

    Now let’s add the connection details:
    $ vi sqitch.conf

    uncomment and specify:
    [core "pg"]
    client = psql
    username = postgres
    password = postgres
    db_name = dwh
    host = localhost
    port = 5432
    # sqitch_schema = sqitch

    If psql is not in the path, run:
    $ sqitch config --user core.pg.client /opt/local/pgsql/bin/psql
    Add your details:
    $ sqitch config --user user.name 'Diethard Steiner'
    $ sqitch config --user user.email 'diethard.steiner@'

    Let’s add some more config options: Define the default db so that we don’t have to type it all the time:
    $ sqitch config core.pg.db_name dwh
    Let's also make sure that changes are verified after deploying them:
    $ sqitch config --bool deploy.verify true
    $ sqitch config --bool rebase.verify true

    Check details:
    cat ~/.sqitch/sqitch.conf

    Have a look at the plan file. The plan file defines the execution order of the changes:
    $ cat sqitch.plan

    $ git add .
    $ git commit -am 'Initialize Sqitch configuration.'

    Add your first sql script/change:
    $ sqitch add create_stg_schema -n 'Add schema for all staging objects.'
    Created deploy/create_stg_schema.sql
    Created revert/create_stg_schema.sql
    Created verify/create_stg_schema.sql

    As you can see, Sqitch creates deploy, revert and verify files for you.

    $ vi deploy/create_stg_schema.sql

    Add:
    CREATE SCHEMA staging;

    Make sure you remove the default BEGIN; COMMIT; for this as we are just creating a schema and don’t require any transaction.

    $ vi revert/create_stg_schema.sql

    Add:
    DROP SCHEMA staging;

    $ vi verify/create_stg_schema.sql

    Add:
    SELECT pg_catalog.has_schema_privilege('staging', 'usage');

    This is quite PostgreSQL specific. For other dbs use something like this:
    SELECT 1/COUNT(*) FROM information_schema.schemata WHERE schema_name = 'staging';

    Now test if you can deploy the script and revert it:

    Try to deploy the changes:
    The general command looks like this:
    $ sqitch -d <dbname> deploy

    As we have already specified a default db in the config file, we only have to run the following:
    $ sqitch deploy
    Adding metadata tables to dwh
    Deploying changes to dwh
     + create_stg_schema .. ok

    Note the plus sign in the feedback which means this change was added.

    When you run deploy for the very first time, Sqitch will create maintenance tables in a dedicated schema automatically for you. These tables will (among other things) store in which “version” the DB is.
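
    If you are curious, you can peek at these maintenance tables directly with psql; assuming the default registry schema name sqitch (see the commented-out sqitch_schema entry in sqitch.conf above), this lists them:

    $ psql -d dwh -c '\dt sqitch.*'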

    Check the current deployment status of database dwh:
    $ sqitch -d dwh status
    # On database dwh
    # Project:  yes
    # Change:   bc9068f7af60eb159e2f8cc632f84d7a93c6fca5
    # Name:     create_stg_schema
    # Deployed: 2013-08-07 13:01:33 +0100
    # By:       Diethard Steiner <diethard.steiner@>


    To verify the changes run:
    $ sqitch -d dwh verify
    Verifying dwh
     * create_stg_schema .. ok
    Verify successful


    To revert the changes to the previous state, run:
    $ sqitch revert --to @HEAD^ -y

    Side note
    You can use @HEAD^^ to revert to two changes prior to the last deployed change.

    To revert everything:
    $ sqitch revert
    Revert all changes from dwh? [Yes] Yes
     - create_stg_schema .. ok

    To revert back to a specific script (you can also revert back to a specific tag):
    $ sqitch revert create_dma_schema
    Revert changes to create_dma_schema from dwh? [Yes]

    Let’s inspect the log:
    $ sqitch log

    Note that the actions we took are shown in reverse chronological order, with the revert first and then the deploy.

    Now let's commit it.
    $ git add .
    $ git commit -m 'Added staging schema.'

    Now that we have successfully deployed and reverted the current change, let’s deploy again:
    $ sqitch deploy
    Let’s add a tag:
    $ sqitch tag v1.0.0-dev1 -n 'Tag v1.0.0-dev1.'

    Deployment to target DBs
    So if you want to deploy these changes to your prod DB, for example, you can either do it like this:
    $ sqitch -d <dbname> -u <user> -h <host> -p <port> deploy
    (Important: If you are working with PostgreSQL, make sure you add your password to ~/.pgpass and then comment the password out in sqitch.conf beforehand, otherwise this will not work.)
    Or bundle them up, copy the bundle to your prod server and deploy it there:
    $ sqitch bundle
    Distribute the bundle
    On the prod server:
    $ cd bundle
    $ sqitch -d dwhprod deploy
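
    Regarding the ~/.pgpass note above: an entry has the format hostname:port:database:username:password and the file must be readable only by you. Using the dev connection details from sqitch.conf as an illustration (use the prod connection details for the prod DB):

    $ echo 'localhost:5432:dwh:postgres:postgres' >> ~/.pgpass
    $ chmod 600 ~/.pgpass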

    A future version of Sqitch will have better support for target DBs (see here).

    Using Sqitch with an existing project (where some ddl already exists)

    Sometimes you take over a project and want to bring the existing DDL under version control and change management.

    Thanks to David for providing details on this:

    The easiest option is to export the existing DDL and store it in one deploy file. For the revert file you could use a statement like this then:

       DROP SCHEMA $schema CASCADE;

    Let’s assume we call this change “firstbigchange”:

    The first time you do a deploy to the existing database with Sqitch, do it twice: once with --log-only to apply your first big change, and then, from then on, without:

       $ sqitch deploy --log-only --to firstbigchange
       $ sqitch deploy --mode change

    The --log-only option has Sqitch do everything in the deploy except actually run the deploy scripts. It just skips them, assumes they worked successfully, and logs them. You only want to do this --to that first big dump change, as after that you of course want Sqitch to actually run deploy scripts.

    Using more than one DB

    DS: Currently it seems like there is a Sqitch version for each of these dbs. What if I was working on a project that used two different dbs installed on the same server and I wanted to use Sqitch for both of them (especially for dev I have more than one db installed on the same server/pc)?

    DW: You can have more than one plan and accompanying files in a single project by putting them into subdirectories. They would then be effectively separate Sqitch projects in your SCM. The sqitch.conf file at the top can be shared by them all, though, which is useful for setting up separate database info for them ([core.pg] and [core.mysql] sections, for example).

    If you are starting a new project, you would do it like this:

    $ sqitch --engine pg --top-dir pg init myproject
    $ sqitch --top-dir mysql init myproject

    Then you have sqitch.plan, deploy, revert, and verify in pg/, and sqitch.plan, deploy, revert, and verify in mysql/. To add a new change, add it twice:

    $ sqitch --top-dir pg add foo
    $ sqitch --top-dir mysql add foo

    Pentaho 5.0 Reporting by Example: Beginner’s Guide (Book review)

    Ok, ok ... I said in the previous blog post that it was the last one for this year, but I certainly mustn't let the year end without mentioning the best book yet on Pentaho Report Designer: Pentaho 5.0 Reporting by Example: Beginner’s Guide!
    It has taken quite a long time for somebody to publish a book on PRD for end users. Mariano García Mattío and Dario R. Bernabeu did an excellent job in explaining the broad functionality of PRD in an easily accessible manner. This is certainly a book that I will recommend to anyone starting out with PRD!
    I was also asked to review the book (along with Dan), which was quite an interesting experience. So all in all a big thumbs up for this book ... go out and get it!

    Talend Open Studio Cookbook (Book review)

    I had some time in the holidays to read the recently published book "Talend Open Studio Cookbook" by Rick Barton. I have to admit, I was quite impressed by this book!
    I can highly recommend it to everyone who is looking into getting started with Talend Data Integration but also to someone who has some experience with it, as there are quite a lot of useful tips and tricks mentioned as well.
    The book takes a very practical approach, which ensures that you are up-to-speed in a very short amount of time. It covers all the essentials (from creating data integration jobs to finally scheduling them) plus covers some more advanced topics as well.

    Overall the book is very well structured and brings across best practices in a very easy to understand manner. All the exercises focus on creating only the required functionality, so you start off with an already partially built data integration job and only fill in the required pieces, which makes it indeed a very good experience. The accompanying files all worked very well and I have to applaud the author for providing both the "partially built" job files as well as the completed ones.

    In a nutshell: A highly recommended book - go out and get it!

    Building a Data Mart with Pentaho Data Integration (Video Course)

    I have been an enthusiastic follower of the Pentaho open source business intelligence movement for many years. At the beginning of 2013 I got asked to create a video tutorial/course on populating a star schema with Pentaho Kettle. This was my first foray into video tutorials. This video is now available on the Packt website.

    To me the most interesting experience on this project was finding an open source columnar database. Certainly I could have just gone down the road of using a standard row-oriented one, but having worked on projects which made use of commercial columnar databases, I understood their advantages quite well. To my surprise, the landscape of open source columnar databases was quite small. There has been some revival of sorts in the Hadoop world with Impala etc. (using dedicated file formats), but this was at that time probably a bit too cutting edge. The tutorial required a DB which had established itself for some time and was easy to install: MonetDB. This is the same DB which is actually used by Kettle for Instaview. This gave me the opportunity to discuss bulk loading and talk about some advantages of columnar DBs.

    Creating these videos was not quite as easy as I initially anticipated. I actually spent quite a lot of time on this project and at the end of 2013 rerecorded most of the video sessions to fix some pronunciation problems (although I’ve lived in the UK for 9 years now, I can’t quite hide my roots ;)) as well as rewriting all the files to work with PDI v4.4 (initially I was working with a trunk version of PDI v5).

    I do hope that these videos provide the viewer with a nice introduction to this exciting topic. As I mention at the beginning of the course, this is not an introduction to Pentaho Kettle in general - I do assume that the viewer already has some basic Pentaho Kettle knowledge. Furthermore, I decided to only focus on the Linux command line - but it shouldn’t be all too difficult for the viewer to translate everything to a Windows or Mac OS X environment. Is this course perfect? I don’t think so - but for my first foray into the video tutorial world I do hope it is worthwhile and teaches the viewer a few tips and tricks.

    Lastly I want to thank my reviewers for their support and their honest feedback, Unnati at Packt Publishing for the administrative side and finally Brandon Jackson for his help, support and work on some bugs related to the MonetDB bulk loader!

    Easy way to pass down all parameters to Pentaho subreports

    At yesterday’s Pentaho London User Group (PLUG) meetup I discussed the topic of sub-reports with Thomas Morgner again. For a long time I’ve tried to explain that sub-reports are a concept that shouldn’t be exposed to the actual report designer. We should all just start with a blank canvas (with no bands), add objects (tables, crosstabs, charts, etc.) and then link these objects to the data sources (frankly quite similar to how it is done in some other report designers or CDE). And parameters should just be available everywhere without having to map them to specific objects.
    Anyways, during this discussion Thomas mentioned that there is actually no need to specify all the parameters in the parameter mapping dialog of the subreport. You can just add one mapping with stars (*) in it and all parameters will be passed down automatically to the subreport. Frankly, I was puzzled and astounded by this … I remembered the days when I was working on monster reports with 30 or 40 subreports and all the time had to define 17 or so parameters for each of them. I was quite wondering why this was not documented anywhere. Needless to say that this is a real time saver!
    My next question was then: why is this not the default behaviour? If I don’t specify any parameter mappings, PRD should just pass down all parameters by default. So I created this JIRA case, and we all hope that this will be implemented as soon as possible. Please vote for it!


    Excited as I was about this news, this morning I had to quickly test this approach. Here is the mapping in the subreport:
    Then I just output the values in the details band of the sub-report:
    And here is how the preview looks:
    Now that was easy! Thanks Thomas!

    Pentaho for Big Data Analytics book review

    When I first heard about this book, I got quite excited. I looked up the info on the Packt website and when I saw the page count, 118 pages, a big question mark came up. Then I had a look at the table of contents, and suddenly all my excitement was gone.

    The book feels and reads like a marketing booklet that Pentaho themselves could have published with a title like 'Getting started with Pentaho Big Data within 6 hours'. Certainly, for somebody completely new to this topic, such a high level overview is a great introduction. But if you are already a bit familiar with Pentaho and know a little bit about Big Data, I can't quite see what you would gain from reading this book. Don't get me wrong: the book is well written and easy to understand, but most of the chapters just scratch the surface, in the sense that they help you to get started but then don't go into any further detail. The only chapter that provides a bit more detail is the one on CDE. The chapter on Pentaho Report Designer only shows you how to open an existing report (from the BI Server) and walks you through the structure of a report.

    One thing that I was really expecting to find in this book were some detailed examples about using Pentaho Kettle with Hadoop. The only thing covered is copying a file to HDFS, then to Hive and exporting a dataset from Hive, which is fairly easy to accomplish. At the very minimum, creating a simple map reduce job in Kettle (like the famous wordcount example) could have been covered. And even then, there could have been so much more written about this topic.

    Also, another point is: who actually uses Hive as the data source of choice for powering a dashboard? If the data source has to be something related to Big Data, why not use Impala (or similar projects), where latency wouldn't be such an issue? Or follow the common approach and export the prepared data to a columnar DB like MonetDB etc.

    So to sum it up: if you are new to Pentaho and new to Big Data, this book is well worth a read as a brief introduction. It will help you configure most Pentaho components correctly within a short amount of time and give you some ideas on what can be achieved. Take this as a starting point; more detailed questions will then have to be answered by other sources.

    Sparkl: Create your own app for the Pentaho BI/BA Server


    Installing Sparkl

    This is just a very brief walkthrough (Francesco Corti has already published an excellent step by step tutorial here, so please have a look there for detailed instructions):
    From Home > Marketplace install Sparkl (and any dependencies like CDE etc).
    Once installed, restart the biserver.

    Initial App Setup

    Let’s create our first app: Tools > Sparkl. Click on the BIG plus icon, then on CREATE:
    Assign a unique name and click Create Plugin:
    Then you are informed that:
    So do as you are told ;) and restart the server.

    If on next login your Pentaho User Console looks fairly blank and you find this error message in the log:

    21:23:55,460 ERROR [GenericServlet] GenericServlet.ERROR_0004 - Resource /data-access/resources/gwt/D9EA02CD60EF4F8D3A2BD2613D9BB9A8.cache.html not found in plugin data-access

    … then clear the browser cache and all problems should be solved.

    So once you have your normal PUC back, go to Tools > Sparkl. You should see your app now:
    Click the Edit icon. Time to provide some app details:
    Next click on Elements, where we can define the endpoints. Endpoints can either be CDE dashboards [frontend] or Kettle transformations [backend].  

    Click on Add new Element. Add 1) a dashboard and 2) a Kettle transformation:
    All your app files are stored under:
    pentaho-solutions/system/myPasswordChanger

    The Pentaho Kettle transformation can be found in:
    pentaho-solutions/system/myPasswordChanger/endpoints/kettle/mypasswordchangerendpoint.ktr

    I am sure you can’t wait to have a look at this Kettle transformation, which has just been created for you … well, let’s go ahead and fire up Spoon and open up this file:

    Still excited? Well, you didn’t expect everything to be done for you … did you? That’s why your special insider knowledge is still required here! But more about this later on.

    Creating the dashboard


    As you are probably aware by now, Sparkl makes use of two very popular tools: CDE and Kettle. So if you have ever used CDE to create dashboards before, editing the Sparkl dashboard should be quite familiar to you!
    Ok, let’s edit the dashboard:

    We will use a very simple approach here (the aim is not to create a fancy looking dashboard, but just a very simple prototype):

    1. Create a row called passwordTextRow.
    2. With passwordTextRow still marked, click on the Add HTML icon.
    3. Mark the HTML row and add this HTML snippet on the right hand side:
      <p>Specify your new password:</p>
    4. Then add a new row called passwordInputRow. For this row, add two columns, one called passwordInputColumn and the other one passwordSubmitColumn. The layout should now look like this:
    5. Save the dashboard and switch to the Component Panel.
    6. Create a parameter: Generic > Simple Parameter. Call it passwordParameter.
    7. Add a button: Others > Button Component. Call it passwordSubmitButton. For Label specify Submit and for HtmlObject passwordSubmitColumn (just press CTRL+Space to retrieve the values of the available HtmlObjects):
    8. Add an Input field: Select > TextInput Component. Name it passwordTextInput, assign the Parameter passwordParameter and the HtmlObject passwordInputColumn to it:
    9. Now switch to the Datasource Panel. Remove the SQL dummy query.
    10. From the left hand side open MYPASSWORDCHANGER Endpoints and choose the mypasswordchangerendpoint Endpoint. For this datasource specify the name myPasswordChangerDS in the Properties section:
    11. Switch back to the Components Panel. In the Components area select the Button Component. Click on Advanced Properties.
    • For Action Parameters specify passwordParameter as [["passwordParameter"],["passwordParameter"]] and
    • For Action Datasource specify myPasswordChangerDS. At the time of this writing there were considerations about moving the datasource property to main properties area (instead of advanced properties), so this might have changed by the time you read this.
    • For Listeners specify passwordParameter

    12. Save the dashboard.
    13. Let’s see how our Sparkl plugin looks so far. Choose Tools > MyPasswordChanger:
      And you should see something like this:

    Preparing the Kettle transformation

    You can find detailed documentation of the REST endpoints here (download the doc file). Harris also wrote an excellent blog post about it.

    Extract the documentation zip file (which you just downloaded) and then open the index file in your favourite web browser.

    Click on UserRoleDaoResource from the REST Resources list, then choose /userroledao/updatePassword:
    Let’s inspect the request body element a bit closer, so click on the user link:
    So this tells us that we can either send an XML or JSON document containing the userName and password to this REST endpoint.


    It’s a good idea to first experiment a bit with the REST services by using the popular command line tool curl:

    Let’s start first with a simple get request. Note the use of --user to specify the authentication details:

    $ curl --user admin:password -i -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8080/pentaho/api/userroledao/users

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=7A850427D26D6F6AABA2B5BC3C7F40D7; Path=/pentaho
    Content-Type: application/json
    Transfer-Encoding: chunked
    Date: Tue, 28 Jan 2014 20:56:49 GMT

    {"users":["suzy","pat","tiffany","admin"]}



    Let’s get some role info (here we supply one parameter):

    $ curl --user admin:password -i -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8080/pentaho/api/userroledao/userRoles?userName=admin

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=92D98FAF9C5D23CF0ACADAB8655E287D; Path=/pentaho
    Content-Type: application/json
    Transfer-Encoding: chunked
    Date: Tue, 28 Jan 2014 20:55:29 GMT

    {"roles":["Administrator"]}



    So now let’s try the more complicated task of updating the password: we have to send along an XML document. This can be accomplished by using PUT with the -d option (d for data):

    $ curl -i --user admin:password -H "Content-Type: application/xml" -H "Accept: application/xml" -X PUT -d '<?xml version="1.0" encoding="UTF-8"?><user><userName>admin</userName><password>test123</password></user>' http://localhost:8080/pentaho/api/userroledao/updatePassword

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=3AC126B2276665B4F0FE655EC29A4964; Path=/pentaho
    Content-Length: 0
    Date: Tue, 28 Jan 2014 21:15:08 GMT

    Alternatively you could create an XML file:
    $ vi update.xml

    Copy and paste:
    <?xml version="1.0" encoding="UTF-8"?>
    <user>
     <userName>admin</userName>
     <password>password</password>
    </user>

    Then issue this command:
    $ curl -i --user admin:password -H "Content-Type: application/xml" -H "Accept: application/xml" -X PUT -d @update.xml http://localhost:8080/pentaho/api/userroledao/updatePassword

    If you try to log on to PUC now, you will have to use your new password!

    Enough playing around with curl! We have established that the updatePassword REST API is working as expected.

    The next step is to prepare a basic working Kettle transformation with the same functionality:
    1. Create a new biserver-user with admin rights which we can use just for authentication purposes.
    2. Fire up Spoon and open mypasswordchangerendpoint.ktr
    3. Amend the transformation to look like the one shown in the screenshot below:
    4. Double click on Generate Rows. Set the Limit to 1. Create three fields:
    Field      Type     Value
    url        String   http://localhost:8080/pentaho/api/userroledao/updatePassword
    user       String   your username
    password   String   your password
    5. Double click on Add XML. Into the Output Value field type userXML and into the Root XML element field user:
    6. Next click on the Fields tab. Configure as shown below:
    7. Double click on the REST client step. Configure as outlined below:
    • General:
      • URL name field: url
      • HTTP method: PUT
      • Body field: userXML
      • Application Type: XML
    • Authentication
      • HTTP Login: admin-rest
      • HTTP password: test123 (note: use the details of the specially created rest admin user here!)
  8. Finally configure the logging step to output the essential infos.
  9. Run the transformation.



  10. When finished, restart the server (which is the official way to deploy your transformation).

    Go to Tools > MyPasswordChanger. Click the submit button.

    In a terminal window watch the Tomcat log:
    [dsteiner@localhost biserver-ce]$ tail -f tomcat/logs/catalina.out
    2014/01/29 19:03:49 - Write to log.0 - =======================
    2014/01/29 19:03:49 - Write to log.0 - === KTR REST RESULT ===
    2014/01/29 19:03:49 - Write to log.0 - =======================
    2014/01/29 19:03:49 - Write to log.0 -
    2014/01/29 19:03:49 - Write to log.0 - result = Password was changed successfully!
    2014/01/29 19:03:49 - Write to log.0 -
    2014/01/29 19:03:49 - Write to log.0 - ====================
    2014/01/29 19:03:49 - OUTPUT.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)

    Watch out for the REST call result message. Here we see that the password was changed successfully. You don’t believe it? Log out of PUC and back in (with the new password).

    Ok, now we know that this is working. Let’s make everything a bit more dynamic:
    • username: CPK server side parameter (currently logged in user)
    • password: supplied by CDE dashboard
    • URL: IP and port
    • authentication username
    • authentication password

    We will not test the transformation any more in Spoon.

    How to pass a standard parameter from a dashboard to a Kettle job or transformation

    As you remember, we defined the passwordParameter parameter in CDE earlier on. To pass the value of this parameter to a Kettle job or transformation, simply define the parameter in your job/transformation as you would normally do:

    1. Open mypasswordchangerendpoint.ktr in Spoon.
    2. Right click on the canvas and choose Transformation settings.
    3. Specify a new parameter called passwordParameter. As you might have guessed, this parameter name has to be exactly the same as the one defined in the dashboard:
    4. Change the transformation to look like this one:
      Disable the hop from the original Generate Rows step to Add XML. Add a new Generate Rows and a Get Variables step. For now the new Generate Rows should only supply the url and user values: right now we just want to test if the password parameter is passed on properly. Open up the Get Variables step and create a new field called password which references the variable ${passwordParameter}:
      This setup enables us to use the passwordParameter originating from the dashboard in our Kettle transformation stream.
    5. Just for testing purposes write the password field to the log (so that we see that the value is actually passed on). Change the Write to Log step config respectively.

    Server-side Parameters

    Kettle Properties

    As you might well know, Kettle transformations/jobs running on the server have access to the kettle.properties file as well, as long as everything is properly configured. If you have everything installed on one machine and are using the default settings, there is actually no additional setup required on the biserver side. The biserver will automatically access ~/.kettle/kettle.properties.

    So for the REST authentication (and the server IP and port) I defined the following parameters in the kettle.properties file:
    VAR_PENTAHO_BISERVER_URL=http://localhost:8080/pentaho
    VAR_PENTAHO_BISERVER_USER=admin-rest
    VAR_PENTAHO_BISERVER_PW=test123

    Sparkl server side parameters

    Sparkl currently provides these server-side parameters:
    Parameter                  Description
    cpk.plugin.id              the plugin ID
    cpk.solution.system.dir    the pentaho solution system dir (full path)
    cpk.plugin.dir             the plugin dir (full path)
    cpk.plugin.system.dir      the plugin system dir (full path; this isn't used very often and might become deprecated)
    cpk.webapp.dir             webapp dir (full path)
    cpk.session.username       session username
    cpk.session.roles          session roles (string with session authorities separated by commas)

    Marco Vala explains:

    Additionally, if you add a parameter named cpk.session.SOMETHING, CPK will try to find a session variable named SOMETHING and "inject" its current value (this is one way only, just for reading session variables).

    There is also a handy parameter named cpk.executeAtStart. If its default value is set to "true", CPK will execute that Kettle transformation when the plugin starts.

    Amending our transformation


    We have to implement the following changes:
    1. Remove any fields from the new Generate Rows step. It should be blank:
    2. Open up the Set Variables step and configure it as shown in the screenshot … add the username variable referencing cpk.session.username:
    3. Adjust the Add XML step:
    4. Change the config of the REST Client step so that it no longer reads the URL from a field; instead key the URL into the URL config field like this:
      ${VAR_PENTAHO_BISERVER_URL}/api/userroledao/updatePassword
    5. Also, in the Authentication section reference the parameters we just set up: ${VAR_PENTAHO_BISERVER_USER} and ${VAR_PENTAHO_BISERVER_PW} respectively.
    6. Save and then restart the server.
    7. Then test the dashboard again and watch the Tomcat log.

    Oh well, oh well, all is not perfect right now … you wonder why we actually have to authenticate - this file is already on the server - right? Rest assured, Pedro’s team is already working on this feature!

    So with the solution created so far every user should be able to change their password. The core functionality is provided; we will not cover the rest in this tutorial.

    Closing words

    This is certainly an ugly looking app, the password should not be shown, there has to be a confirmation page etc … basically a long list of tasks still to do. The point here was not to create a fully functional, nice looking Pentaho Sparkl app, but to build a bare bones prototype: Demonstrating how the two main components - Kettle and CTools - can work together; enabling us to create a wide variety of apps for the Pentaho BI-Server ecosystem. Go and explore this exciting new world now and create some fancy Sparkl apps!

    Having problems starting Pentaho Kettle Spoon on Linux? Here are some solutions ...

    End of last year I decided it was time to ditch Ubuntu for a pure Gnome experience. I didn't quite fancy installing Gnome 3.10 on top of Ubuntu due to some users having problems with this, so Fedora looked like a good candidate. And having worked with it for nearly two months now, I must admit that I quite like the new clean Gnome interface (and Fedora). Gnome gets criticized everywhere and a lot of people don't have anything good to say about it, but honestly, to me it seems like Gnome development is going in the right direction. But this is of course a matter of personal taste. Enough of this, let's talk about Pentaho Kettle:

    Quite often, Pentaho Kettle Spoon - the GUI for designing transformations and jobs - starts up just fine on Linux OSes. Sometimes though, there might be some dependencies to install or special flags to set.

    When starting Pentaho Kettle on Fedora I came across this nasty error message:
    spoon.sh: line 166: 10487 Aborted (core dumped) "$_PENTAHO_JAVA"

    On other systems I also got this error message:
    The error was 'BadDrawable (invalid Pixmap or Window parameter)'.
     (Details: serial 13561 error_code 9 request_code 62 minor_code 0)
     (Note to programmers: normally, X errors are reported asynchronously;
      that is, you will receive the error a while after causing it.
      To debug your program, run it with the --sync command line
      option to change this behavior. You can then get a meaningful
      backtrace from your debugger if you break on the gdk_x_error() function.)

    To fix these problems, just add this to the OPT section of spoon.sh:
    -Dorg.eclipse.swt.browser.DefaultType=mozilla
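
    In practice this means extending the line that builds the OPT variable in spoon.sh, roughly like this (a sketch; the exact shape of the OPT line differs between PDI versions):

    OPT="$OPT -Dorg.eclipse.swt.browser.DefaultType=mozilla"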



    Another error message you might come across is the following:
    org.eclipse.swt.SWTError: XPCOM error -2147467259


    One dependency that you might have to install is xulrunner:
    $ yum list xulrunner*
    $ yum install xulrunner.x86_64


    find the directory xulrunner is installed in:
    $ which xulrunner
    /usr/bin/xulrunner


    To fix this, open spoon.sh and add at the end of the OPT section:
    -Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib64/xulrunner/



    This advice was originally posted here and here and in this blog post. Also here: Source1 and Source2.

    NOTE: The latest Firefox has no dependency on XULRunner any more, so PDI should not need it any more either (but I haven't checked this yet)!

    Did you get any other error messages when starting Spoon and found a solution for them? Please comment below and I'll add it to this blog post so that we have a good resource for troubleshooting.

    Matt Casters:

    FYI, the package to install  on Ubuntu is usually  libwebkitgtk-1.0-0 (as documented).  I'm sure it's the same on Fedora.  I would avoid all that xulrunner stuff if possible.
    For those of us on Kubuntu there are bugs in the oxygen-gtk theme, so best switch to another theme like Ambiance or turn off a bunch of fancy-shmancy animations with oxygen-settings.
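
    On Ubuntu/Debian-based systems, installing the package Matt mentions would be something along the lines of:

    $ sudo apt-get install libwebkitgtk-1.0-0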

    Pentaho CDE and Bootstrap: The essential getting started info

    Quite inspired by the recent interesting webinar by Harris Ward (@hazamonzo) of Ivy Information Systems, I prepared this short write-up of the core principles of using the Bootstrap framework with Pentaho CDE (which in essence summarizes the main points of the webinar). I assume that you are familiar with CDE, so I will not go into details on how to install it or on how to create a dashboard. This article only focuses on using Bootstrap with CDE.

    Why Bootstrap

    For a long time Blueprint had been the default framework for CDE. While it worked nicely on the desktop, the content didn't adjust as nicely to tablets and mobile phones.
    More recently CDE added support for the Bootstrap framework. The good news is that you can apply Bootstrap to the selector components like buttons, drop down menus etc. as well, hence paving the way to a uniform look for your dashboard!

    How to use it with CDE

    Configure your CDE Dashboard to use Bootstrap

    The very first thing to do is define the settings for your dashboard. In CDE click on the Settings button in the main menu:
    In the settings screen provide a Name and then set the Dashboard Type to Bootstrap:
    Click Save.

    Generate your standard layout

    Next generate your Layout Structure as usual, just keep the following in mind:
    • With Bootstrap, the total span size of the page is 12 columns (as opposed to Blueprint, which has 24 columns). This means that if your page has only one column, the span size should be 12. If your page has 2 columns, the span size should be 6 for each of them and so on (calc: 12 / no of cde columns).
    • Within each column, nest an HTML element: this one will hold the Bootstrap HTML snippet. Make sure to provide a name for the HTML object in the Properties panel (otherwise the styles might not show up properly in the preview - at least that is what happened in my case).

    Get code from Bootstrap website

    The Bootstrap website offers a wealth of components. They can be found by clicking on either CSS or Components in the main menu. Use the side panel menu to get detailed info about each component:


    One component which you will most likely use is called Panels [reference]. Again, there is a wide choice of panels available; let's take a look at the Panel with heading:
    Copy the last main div section - this is the one we want to base our layout on.
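
    For reference, the markup of a basic Bootstrap 3 panel with a heading looks roughly like this (reproduced from memory, so double-check it against the snippet on the Bootstrap website):

    <div class="panel panel-default">
      <div class="panel-heading">Panel heading</div>
      <div class="panel-body">
        Panel content
      </div>
    </div>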

    Amend Bootstrap HTML

    You can use for example JSFiddle to simply amend the Bootstrap HTML snippet and get a preview of it.


    1. Open up JSFiddle in a web browser tab.
    2. Copy the Bootstrap HTML snippet into the HTML panel
    3. Copy this reference into the CSS panel: @import url('http://getbootstrap.com/dist/css/bootstrap.css')
    4. Adjust the HTML. Do at least the following:
      1. Provide a proper title.
      2. Add an id attribute to the content div. This way we can later on reference it when we want to assign a chart component.
      3. Delete the default content text.
    5. Then click the Run button to get a preview:
    We are happy with the preview, so let’s copy the HTML code.
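
    After the adjustments described above, the snippet might end up looking something like this (the title text is arbitrary and chart1 is simply the id we will reference later on from the chart component):

    <div class="panel panel-default">
      <div class="panel-heading">Sales Overview</div>
      <div class="panel-body" id="chart1"></div>
    </div>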

    Add amended Bootstrap HTML to CDE Layout Structure

    When we initially created the Layout Structure in CDE, we nested an HTML element inside each column - we simply copy the Bootstrap HTML there. Choose the HTML element in the Layout Structure panel, then click on the HTML ellipsis [...] button in the Properties panel:
    Finally paste the Bootstrap HTML snippet:
    Click OK.


    Save the dashboard now and do a preview:
    Our Bootstrap panel is nicely rendered within the CDE dashboard now. It’s time to add some content!

    Create your data sources

    Business as usual - nothing special to consider. Just create your datasources as you would do it for any other dashboard.

    Create your components

    Overall business as usual, just keep in mind that you do not want to reference the HTML object the old way (so based on the elements you defined in the Layout Structure), but instead make use of the div ids defined within the Bootstrap HTML snippets (that you saved within the HTML element in the Layout Structure).


    So for this article I quickly created a CCCBarChart component. The important bit about it is that I specified chart1 for the HTMLObject property:
    Note that this ties back to the div id attribute we defined in our Bootstrap HTML snippet [stored in the Layout Structure HTML element]:


    So let’s do another preview:
    So that is it: It turns out using Bootstrap with Pentaho CDE is actually fairly easy. This example was certainly extremely simple and only meant to give you a brief introduction.

    Last Example

    Let’s add a button:
    1. Adjust the Layout Structure of your dashboard to make room for a button.
    2. Visit the Bootstrap website to find out all about their fancy buttons: reference.
    3. Let’s copy the HTML snippet for the standard button. Jump over to JSFiddle and adjust it until you are happy with the preview. Remember to add a dedicated id attribute.
    4. Copy the HTML snippet again and paste it into the HTML properties field of your HTML element in the Layout Structure panel.
    5. Save and Preview your dashboard.
      Again, a very easy example. I guess now you are certainly interested in creating some more challenging dashboards with Pentaho CDE and Bootstrap!


    And finally, a big thanks to Harris for providing this excellent webinar!

    Pentaho CDE: Create your custom table


    Pentaho Dashboards CDE: Create your custom table

    You want to implement something in your dashboard that is not covered by the out-of-the-box dashboard components? Luckily, with Pentaho CDE the world is open: CDE makes use of the standard web technologies (CSS, JavaScript, HTML), so theoretically you can implement whatever is in the realm of these technologies. Obviously you will need some basic knowledge of these technologies (setup is not as easy any more as filling out some config dialogs), but the possibilities are endless. In this post I’ll briefly talk you through how to source some data and then to create a custom table with it (which you can easily do with one of the CDE components as well, but that’s not the point here … imagine what else you could do):

    1. In CDE, register a Datasource. For example, create a sql over sqlJndi datasource, provide a Name (e.g. qry_generic_select), choose SampleData for JNDI and specify the following query:

       SELECT customername, customernumber, phone FROM customers

    2. In the component section, add a Query Component. This component is most commonly used for displaying simple results, like a single number in a dashboard (e.g. max temperature). Here we will use it to retrieve a bigger result set.
    3. Click on Advanced Properties.
    4. For the Datasource property specify the datasource you created in step 1 (i.e. qry_generic_select)
    5. Provide a name for the Result Var. This is the variable, which will hold the output data of your datasource.
    6. Write a Post Execution function, for example:

       function() {
      document.getElementById('test').innerHTML = JSON.stringify(select_result);
      }
    7. We will only use this function for now to test if the query is working. Later on we will change it.

    8. The setup so far should look like this:
    9. In the Layout Panel create a basic structure which should at least have one column. Name the column test as we referenced it already in our JavaScript function.
    10. Preview your dashboard (partial screenshot):
    11. Let’s change the Post Execution function to return only the first record:

      function() {
      document.getElementById('test').innerHTML = JSON.stringify(select_result[0]);
      }

      And the preview looks like this:

    12. Let’s change the Post Execution function to return only the first entry from the first record:

      function() {
      document.getElementById('test').innerHTML = JSON.stringify(select_result[0][0]);
      }

      And the preview looks like this:

    13. Let’s extend our Post Execution function to create a basic table:

      function() {
        var myContainer = document.getElementById('test');
        var myTable = document.createElement('table');
        var myTr = document.createElement('tr');
        var myTd = document.createElement('td');
        // build table > tr > td and write the first value of the first record into the cell
        myContainer.appendChild(myTable).appendChild(myTr).appendChild(myTd).innerHTML = select_result[0][0];
      }

      Do a preview and make use of your browser’s developer tools to see the generated HTML:

    14. Ok, now that this is working, let’s add some very basic design. Click on Settings in the main CDE menu:
    15. Choose bootstrap from the Dashboard Type pull down menu: Click Save.

    16. Back to the Post Execution function of the Query Component: Now we want to make this a bit more dynamic: every data row must be enclosed by <tr> tags and within each data row every data value must be enclosed by <td> tags. We also have to add the <tbody> element to make it a proper table. And we will apply the Bootstrap Striped Table design:

      // Simple function preparing the table body 

      function() {
        var myContainer = document.getElementById('test');
        var myTable = document.createElement('table');
        var myTBody = document.createElement('tbody');
        var myTr = document.createElement('tr');
        var myTd = document.createElement('td');

        //myTable.id = 'table1';
        myTable.className = 'table table-striped';

        // attach the table and its body to the container div
        myContainer.appendChild(myTable).appendChild(myTBody);

        // one <tr> per result row, one <td> per column value
        for(var i = 0; i < select_result.length; i++) {
          myContainer.lastChild.lastChild.appendChild(myTr.cloneNode());
          for(var j = 0; j < select_result[i].length; j++) {
            myText = document.createTextNode(select_result[i][j]);
            myContainer.lastChild.lastChild.lastChild.appendChild(myTd.cloneNode()).appendChild(myText);
          }
        }
      }

      You can find a text version of this JavaScript code a bit further down as well in case you want to copy it.

    17. Do a preview now and you will see that we have a basic table now:

      Note: In case you are creating this dashboard as part of a Sparkl plugin and you are having trouble seeing the Bootstrap styles applied (and are sure that the problem is not within your code), try to preview the dashboard from within your Sparkl project endpoint listing (which seems to work better for some unknown reason):

    18. One important thing still missing is the header. Let's source this info now. The Query Component provides the following useful objects, which you can access within the Post Execution function:

      this.metadata
      this.queryInfo
      this.resultset

      To get an idea of what exactly is available within the metadata object, you can use, for example, this function:

        document.getElementById('test').innerHTML = JSON.stringify(this.metadata);

      Which reveals the following:

    19. This is the function preparing the full table (header and body):

      // function preparing the full table (header and body)

      function() {
        var myContainer = document.getElementById('test');
        var myTable = document.createElement('table');
        var myTHead = document.createElement('thead');
        var myTh = document.createElement('th');
        var myTBody = document.createElement('tbody');
        var myTr = document.createElement('tr');
        var myTd = document.createElement('td');

        //myTable.id = 'table1';
        myTable.className = 'table table-striped';

        //document.getElementById('test').innerHTML = JSON.stringify(this.metadata);
        myMetadata = this.metadata;

        // build the table header: one <th> per column name from the metadata
        myContainer.appendChild(myTable).appendChild(myTHead).appendChild(myTr);

        for(var s = 0; s < myMetadata.length; s++){
          myHeaderText = document.createTextNode(myMetadata[s]['colName']);
          myContainer.lastChild.lastChild.lastChild.appendChild(myTh.cloneNode()).appendChild(myHeaderText);
        }

        // build the table body: one <tr> per result row, one <td> per column value
        myContainer.lastChild.appendChild(myTBody);

        for(var i = 0; i < select_result.length; i++) {
          myContainer.lastChild.lastChild.appendChild(myTr.cloneNode());
          for(var j = 0; j < select_result[i].length; j++) {
            myText = document.createTextNode(select_result[i][j]);
            myContainer.lastChild.lastChild.lastChild.appendChild(myTd.cloneNode()).appendChild(myText);
          }
        }
      }
    20. And the preview looks like this:

    Voilà, our custom Bootstrap table is finished. This is not to say that you have to create a table this way in CDE: this was just an exercise to demonstrate a bit of the huge amount of flexibility that CDE offers. Take this as a starting point for something even better.

    Setting a variable value dynamically in a Pentaho Data Integration job


    Setting a variable value dynamically in a Pentaho Data Integration job

    On some occasions you might have to set a variable value dynamically in a job, for example so that you can pass it on to the Execute SQL Script job entry. In this blog post we will take a look at how to create an integer representation of the date 30 days ago. And we want to achieve this without using an additional transformation!

    The way to achieve this in a simple fashion on the job level is to use the Evaluate JavaScript job entry [Pentaho Wiki]. While this job entry is not really intended to do this, it currently offers the easiest way to accomplish just this. Just add this job entry to your Kettle job and paste the following JavaScript:

    date = new java.util.Date();
    date.setDate(date.getDate()-30); //Go back 30 full days
    var date_tk_30_days_ago = new java.text.SimpleDateFormat("yyyyMMdd").format(date);
    parent_job.setVariable("VAR_DATE_TK_30_DAYS_AGO", date_tk_30_days_ago);
    true; // remember that this job entry has to return true or false

    To test this let's add a Log job entry:

    Add this log message to the job entry settings:

    The date 30 days ago was: ${VAR_DATE_TK_30_DAYS_AGO}

    And then run the job. You should see something similar to this:

    Certainly you could just pass the value as parameter from the command line to the job, but on some occasions it is more convenient to create the value dynamically inside the job.

    Software used:

    • pdi-4.4.0-stable

    Moving Blog: Goodbye Blogger - Hello Github Pages

    I've posted my articles for many years now on Blogger. While Blogger is fairly low maintenance, getting the formatting right for my articles was always a bit of a hassle. I used to write my articles on Google Docs and then copy and paste them to Blogger. I was quite disappointed when Google decided to axe the direct publishing feature from within Google Docs to Blogger a few years ago and the Blogger editor always seemed a bit limited in functionality to me.

    Writing technical articles should be easy and I really shouldn't worry too much about formatting, so I decided to embrace Markdown full swing. 

    My new blog is hosted on Github Pages.

    This Blog is closed. Archive only.

    Pentaho Report Designer: How to show the parameter display name in your report when it is different from the parameter value

    One of my blog's readers just asked me quite an interesting question: How can I show the parameter display name in my Pentaho report if it is different from the parameter value?


    Note: Just to clarify, the scenario covered here is when the parameter value and display name are different, so for example when you base the parameter value on an id field and the display name on a descriptive field. If the parameter value and display name are set to the same field, then you can simply drag and drop the parameter name onto your report.


    So in our case we defined a parameter called PARAM_OFFICECODE. We set the Parameter Value to OFFICECODE (which happens to be an id) and the Parameter Display Name is set to CITY. We want to use the OFFICECODE to constrain the result set of our main report query (in our case this works better because there happens to be an index on this database table column).

    In the report we would like to show in the header the selected office name (CITY) ... but how do we do this? We can not just simply drag and drop the PARAM_OFFICECODE element onto the report header, because it would only display the id (OFFICECODE) and not the display name (CITY).

    You might think there should be an easy solution to this … and right you are. It’s just not as easy as it could be, but quite close …


    So I quickly put together a bare bone example (don’t expect any fancy report layout … we just want to see if we can solve this problem):


    Our parameter:
    So if we placed this parameter element on the main report, we would just see the OFFICECODE when we ran the report. So how do we get the display name?


    1. If it is possible to access the name field (in our case CITY) via the SQL query, we could change our main report SQL query and add it there as a new field. But this is not very efficient, right?
    2. We could create a new query which takes the code/id (in our case OFFICECODE) as a parameter and returns the name (CITY) and then run this query in a sub-report which could return the value to the main report (this is in fact what you had to do some years back). Well, not that neat either.
    3. Here comes the savior: The SINGLEVALUEQUERY formula function. You can find this one in the Open Formula section. Thomas posted some interesting details about it on his blog some time ago.


    Basically for a very long time we had the restriction that we could only run one query to feed data to our report. With the SINGLEVALUEQUERY and MULTIVALUEQUERY formula functions you can run additional queries and return values to the main report.


    So here we go … to retrieve the display value:
    1. We create an additional query called ds_office_chosen which is constrained by the code/id and returns the (display) name: SELECT city AS office_chosen FROM offices WHERE officecode = ${param_officecode}
    2. We create a new formula element called formula_office_chosen and reference the query ds_office_chosen: =SINGLEVALUEQUERY("ds_office_chosen")
    3. We can now use formula_office_chosen in our report:


    Once this is set up, we can run the report and the display name of the chosen parameter value will be shown:
    My very simple sample report can be downloaded from here.

    How to work with MapReduce Key Value Pairs in Pentaho Data Integration


    How to work with MapReduce Key Value Pairs in Pentaho Data Integration

    My main objective for this article is to provide you an understanding on how to use multiple fields to group by and multiple fields to aggregate on in Pentaho PDI MapReduce.


    The input key for the mapper is auto-generated, the value is usually the line of text which is read in (fields separated by comma in example). This section here will focus on the output key value pair of the mapper and input and output key value pair of the reducer. Also we will not discuss the simple scenario where we only use one field for the key and one field for the value.


    I have more than one key field. How do I set up a compound key?


    You are aware that the input and output of the mapper and reducer are key value pairs. If you haven’t been exposed that much to the internals of MapReduce and come more from a traditional ETL world, this is probably one of the most important concepts to understand.
    Did you ever run a Hive query? Did you have to worry about the key fields … no. Hive is doing quite some work in the background … which some users are never exposed to. So when you come to PDI and create the key for your mapper and reducer transformations, the important point is that you have to separate the fields that form the key by the standard separator of the specified output format of the MapReduce job. If you chose the output format org.apache.hadoop.mapred.TextOutputFormat, tab is the standard separator.


    Option 1: Thankfully Pentaho introduced not too long ago a step to just do this in an easy fashion: Use the new Concat Fields step (Wiki entry). This step allows you to create a new field based on several concatenated source fields which are separated by a character of your choice, such as a tab. If you specified the org.apache.hadoop.mapred.TextOutputFormat in the Pentaho MapReduce job entry as output format, tab is the standard separator.


    “4.4.0 release note: Unfortunately we found an issue (PDI-8857) with this step that was too late to incorporate into 4.4.0. The step adds carriage return and line feed to the fields it creates. Workaround is to use the String operations step with the option "carriage return & line feed" after the step or to enable the advanced option "Fast data dump (no formatting)"


    Option 2: Use a User Defined Java Expression step. This option was mainly used before the Concat Fields step was available. Generate the output key by writing some Java expression which concatenates the fields you want to group by.
    Separate the fields with a tab in the concatenate output key, in example:


    date + '\t' + brand


    Important: Replace the \t with a real (literal) tab character typed between the quotes. The expression should then look like this (the gap between the quotes is an actual tab):


    date + '	' + brand


    This way, all the fields will be properly separated in the final output. Tab in this case is the standard separator of org.apache.hadoop.mapred.TextOutputFormat.


    I have more than one value field. How do I create a compound values field?
    What if I want more than one value to aggregate on?


    Create a new field i.e. called output_values in a Concat Fields or User Defined Java Expression step in the mapper transformation and concatenate all the values and define the separator. Then in the reducer split these values (use the Split Fields step), next aggregate them (use the Group By step) and after this you have to concatenate them again (use the Concat Fields step).
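
    As an illustration only (the field name sales is taken from the example that follows, and the row count is emitted as a constant 1 per row; adjust fields and separator to your own data): the User Defined Java Expression for such a compound value could be as simple as

    sales + "," + "1"

    with a comma as the separator between the individual values.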


    Let’s walk through a very simple example. We have some sales data which we want to analyze. Let’s say we want the sum of sales and a count of rows by date and brand.


    The Kettle job:


    Our input data for the Pentaho MapReduce job looks like this (date, brand, department, sales):


    $ hadoop fs -cat /user/dsteiner/sales-test/input/sales.txt
    2013-04-01,SimplePurpose,Clothes,234.2
    2013-04-01,SimplePurpose,Accessories,2314.34
    2013-04-01,RedPride,Kitchen,231.34
    2013-04-02,SimplePurpose,Clothes,453.34
    2013-04-01,SimplePurpose,Accessories,5432.34
    2013-04-01,RedPride,Kitchen,432.23
    2013-04-03,RedPride,Kitchen


    The mapper transformation (simple example):
    If we want to inspect what the output of the mapper transformation looks like, we can just simply execute the Pentaho MapReduce job entry without specifying a reducer.


    Output of mapper - note that the key is formed by the first two fields, which are separated by a tab, and the value is formed by the sales and count fields, separated by a comma:


    $ hadoop fs -cat /user/dsteiner/sales-test/output/part-00000
    2013-04-01	RedPride	231.34,1
    2013-04-01	RedPride	432.23,1
    2013-04-01	SimplePurpose	234.2,1
    2013-04-01	SimplePurpose	2314.34,1
    2013-04-01	SimplePurpose	5432.34,1
    2013-04-02	SimplePurpose	453.34,1
    2013-04-03	RedPride	,1


    The reducer transformation (simple example):


    Our output data looks like this (date, brand, sum of sales, count):
    $ hadoop fs -cat /user/dsteiner/sales-test/output/part-00000
    2013-04-01	RedPride	663.57	2
    2013-04-01	SimplePurpose	7980.88	3
    2013-04-02	SimplePurpose	453.34	1
    2013-04-03	RedPride	0	1


    So you can see that we successfully managed to aggregate our data by date and brand and sum up the sales as well as perform a count on the rows.

    It’s best if you take a look at my sample files (which you can download from here) to understand all the details. I hope that this brief article shed some light onto creating key value pairs for the Pentaho MapReduce framework.

    Pentaho Kettle Parameters and Variables: Tips and Tricks


    Pentaho Kettle Parameters and Variables: Tips and Tricks

    This blog post is not intended to be a formal introduction to using parameters and variables in Pentaho Kettle, but more a practical showcase of possible usages.


    Please read my previous blog post Pentaho Data Integration: Scheduling and command line arguments as an introduction on how to pass command line arguments to a Kettle job.


    When I mention parameters below, I am always talking about named parameters.

    Parameters and Variables

    Definitions upfront

    Named Parameter: “Named parameters are a system that allows you to parameterize your transformations and jobs.  On top of the variables system that was already in place prior to the introduction in version 3.2, named parameters offer the setting of a description and a default value.  That allows you in turn to list the required parameters for a job or transformation.” (Source)


    Variable: “Variables can be used throughout Pentaho Data Integration, including in transformation steps and job entries. You define variables by setting them with the Set Variable step in a transformation or by setting them in the kettle.properties file. [...] The first usage (and only usage in previous Kettle versions) was to set an environment variable. Traditionally, this was accomplished by passing options to the Java Virtual Machine (JVM) with the -D option. The only problem with using environment variables is that the usage is not dynamic and problems arise if you try to use them in a dynamic way.  Changes to the environment variables are visible to all software running on the virtual machine.  [...] Because the scope of an environment variable is too broad, Kettle variables were introduced to provide a way to define variables that are local to the job in which the variable is set. The "Set Variable" step in a transformation allows you to specify in which job you want to set the variable's scope (i.e. parent job, grand-parent job or the root job).” (Source). “

    Example

    Let’s walk through this very simple example of using parameters and variables. I try to explain all the jobs and transformations involved. The files are also available for download here. You can find the following files in the folder intro_to_parameters_and_variables.

    jb_main.kjb

    In this extremely simple job we call a subjob called jb_slave.kjb. In this case, we defined hard coded parameter values in the job entry settings. Alternatively, to make this more dynamic, we could have just defined parameters in the job settings.

    jb_slave.kjb

    This subjob executes the transformations tr_set_variables.ktr and tr_show_param_values.ktr. In this case, in order to access the parameter values from the parent job, we defined the parameters without values in the job settings:
    Note: This is just one of the ways you can pass parameters down to the subprocess.

    tr_set_variables.ktr

    This transformation sets a variable called var1 with scope Valid in parent job so that successive processes can make use of it. In this case the values originate from a Generate Rows step for demonstration purposes; in real world examples you might read in some values from a file or a database table.

    tr_show_param_values.ktr

    The main transformation has the sole purpose of writing all the parameter and variable values to the log. We retrieve the parameters and variable by using a Get Variables step. We also check if a value is present by using a Filter Rows step. In case one value is missing, we Abort the transformation, otherwise the values are written to the log.


    There is no need to set the parameter names in this transformation; there is an advantage though if you do it: missing parameter values will then be properly displayed as NULL, which makes it a bit easier to check for them. If you don't define them in the transformation settings, missing parameter values will be displayed as ${PARAMETERNAME}.


    Important: Variables coming from tr_set_variables.ktr MUST NOT be listed in the Parameter tab in the Transformation Settings as this overrides the variable.

    Making Parameters available for all subprocesses in an easy fashion

    As you saw above, defining the parameters for each subprocess just to be able to pass them down can be a bit labour intensive. Luckily, there is a faster way of doing just this:


    1. In the main job specify the parameters that you want to pass in in the Job Settings:
      This way parameters and their values can be passed in, for example, from the command line (see the sample command at the end of this section).
    2. Right after the Start job entry use the Set Variables job entry. Specify the variable names, reference the parameters you set up in step 1 and set the scope to Valid in the current job.
    3. There is no need to specify any parameters/variables in any of the subprocesses.


    To see how this is working, run jb_main.kjb in the passing_down_parameters_in_an_easy_fashion folder (part of the provided examples).
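
    If you prefer to run the main job from a terminal rather than from Spoon, the Kitchen call could look roughly like this (the path and the parameter name VAR_TEST are placeholders for illustration; use whatever parameters you defined in the job settings):

    # run the main job with Kitchen and pass a value for a named parameter
    sh kitchen.sh -file=/path/to/jb_main.kjb -param:VAR_TEST=hello -level=Basic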

    What if I still want to be able to run my subprocess independently sometimes?

    You might have some situations, when you have to run the subprocess independently (so in other words: You do not execute it from the parent/main job, but run it on its own). When we pass down parameters or variables, this can be a bit tricky and usually it just doesn’t work out of the box. Luckily, there is a way to achieve this though:
    1. In the subprocess, specify the parameter that you want to be able to pass in. In our example (which is based on the previous example), we modified the transformation tr_show_param_values.ktr and added following parameters to the Transformation Settings:
      We also amended the
      Get Variables step to make use of these parameters:
      This way, we can already run this transformation on its own. Now we only have to adjust the parent job so that we can run it from there as well.
    2. In the parent job, in the Job or Transformation job entry settings, go to the Parameters tab and tick Pass all parameter values down to the sub-transformation/sub-job. Next, as the Parameter set the name of the parameter you defined in the subprocess. As the Value define the variable that you want to pass down: ${variable}. This assumes that this variable was set beforehand by some Set Variables job entry/step.
      In our case, we modified transformation job entry in the job
      jb_slave.kjb and added following mapping to the job entry settings in the Parameters tab:
    A sample for this setup is provided in the mulitpurpose_setup_allow_individual_execution_of_subprocesses folder.

    Closing remarks


    Using parameters and variables in Kettle jobs and transformations allows you to create highly dynamic processes. I hope this tutorial shed some light onto how this can be achieved.

    Introducing the Kettle Test Framework Beta

    Kettle Test Framework (KTF)

    Subtitle: Kettle Testing for the Non-Java Developers

    Announcing the KTF Beta:

    Precautions



    Please note that KTF is still in Beta and has undergone only minimal testing. Please report any bugs on the dedicated Github page so that they can be easily fixed for everybody’s advantage. Do not use for any production purposes.


    You must not run this process on a production environment! You should only run this process on a dedicated test environment where it is ok to lose all the data in the database. You must run this process on a dedicated test database! This process wipes out all your tables!

    The small print upfront

    Please note that this is a community contribution and not associated with Pentaho. You can make use of this framework at your own risk. The author makes no guarantees of any kind and should not be held responsible for any negative impact.


    You should have a solid understanding of Pentaho Data Integration/Kettle. I made a minimal attempt to document this framework - the rest is up to you to explore and understand.

    Motivation

    The main idea behind this framework is to create a base for best test practises in regards to working with the Kettle ETL tool. Please add any ideas which could improve this test framework as “improvement” on the dedicated Github page.


    Code and samples can be downloaded from Github.


    Testing data integration processes should be a core part of your activities. Unfortunately, especially for non-Java developers, this is not quite so straightforward (even for Java developers it is not that easy to unit test their ETL processes, as highlighted here). This framework tries to fill this gap by using standard Kettle transformations and jobs to run a test suite.

    When you create or change a data integration process, you want to be able to check if the output dataset(s) match the ones you expect (the "golden" dataset(s)). Ideally, this process should be automated as well. By using KTF's standard Kettle transformations and jobs to do this comparison every data integration architect should be in the position to perform this essential task.


    Some other community members have published blog posts on testing before, from which this framework took a lot of ideas and inspiration (especially Dan Moore's excellent blog posts [posts, github]). Also, some books published on Agile BI methodologies were quite inspirational, especially Ken Collier's "Agile Analytics".
    While Dan focused on a complete file based setup, for now I tried to create a framework which works with processes (jobs, transformations) which make use of Table input and Table output steps. In the next phase the aim is to support file based input and output (csv, txt) as well. Other features are listed below.

    Contribute!

    Report bugs and improvements/ideas on Github.


    Features

    Let’s have a look at the main features:
    • Supports multiple input datasets
    • Supports multiple output datasets
    • Supports sorting of the output dataset so that a good comparison to the golden output dataset can be made
    • The setup is highly configurable (but a certain structure is enforced - outlined below)
    • Non conflicting parameter/variable names (all prefixed with “VAR_KTF_”)
    • Non intrusive: Just wraps around your existing process files (except some parameters for db connections etc will have to be defined … but probably you have this done already anyways)

    Current shortcomings

    • Not fully tested (only tested with included samples and on PostgreSQL)
    • Currently works only with Table input / output transformations. Text/CSV file input/output will be supported in a future version (which should not be too complicated to add).
    • Dictates quite a strict folder structure
    • Limited documentation

    Project Folder Structure

    Stick to this directory layout for now. In future versions I might make this more flexible.


    1. Adjust parameter values in config/.kettle/kettle.properties and JNDI details in config/simple-jndi/jdbc.properties
    2. Your transformations have to be prefixed for now with "tr_" and your jobs with "jb_". Do not use any special characters or spaces in your job/transformation names. Your transformations and jobs have to be saved within repository/main. There is currently no subfolder structure allowed within this folder.
    3. Within the config/test-cases directory create a folder for the processes you want to test. A process can be a transformation or a job. Name these folders exactly the same as the job/transformation you want to test (just without the file extension).  Each process folder must have an input and output folder which hold the DDL, Kettle data type definitions and in case of the output the sort order definition (see tr_category_sales sample on how to set this up). If your output dataset does not require any sorting, create an empty sort def file (see tr_fact_sales example). Note that KTF can handle more than one output/input dataset.
    4. The process folder must also contain at least one test case folder (which has to have a descriptive name). In the screenshot above it is called “simpletest”. A test case folder must contain an input and output folder which each hold the dedicated datasets for this particular test case. In case of the output folder it will hold the golden output dataset(s) (so that dataset that you want to compare your ETL output results to).
    5. Users working on Windows: For all the CSV output steps in transformations under /repository/test set the Format to Windows (Content tab). KTF has not been tested at all on Windows, so you might have to make some other adjustments as well.
    6. Add environment variables defined in config/set-project-variables.sh to your .bashrc. Then run: source ~/.bashrc
    7. Start Spoon (this setup requires PDI V5) or run the process from the command line.
    8. Run the test suite
    9. Analyze results in tmp folder. If there is an error file for a particular test case, you can easily visually inspect the differences like this:
    dsteiner@dsteiner-Aspire-5742:/tmp/kettle-test/tr_category_sales/sales_threshold$ diff fact_sales_by_category.csv fact_sales_by_category_golden.csv
    2c2
    < 2013-01-01;Accessories;Mrs Susi Redcliff;399;Yes
    ---
    > 2013-01-01;Accessories;Mrs Susi Redcliff;399;No
    5c5
    < 2013-01-02;Groceries;Mr James Carrot;401;No
    ---
    > 2013-01-02;Groceries;Mr James Carrot;401;Yes


    All files related to testing (KTF) are stored in repository/test. You should not have to alter them unless you find a bug or want to modify their behaviour.


    To get a better idea on how this is working, look at the included examples, especially tr_category_sales (has multiple inputs and outputs and proper sorting). The other example, tr_fact_sales has only one output and input and no sorting defined (as it only outputs one figure).

    Future improvements

    Following improvements are on my To-Do list:
    • Write test results to dedicated database table
    • Improvement of folder structure
    • Support for text file input and output for main processes (jobs/transformations)

    FAQ

    My input data sets come from more than one data source. How can I test my process with the KTF?

    Your process must have parameter driven database connections. This way you can easily point your various JNDI connections to just one testing input database. The main purpose of testing is to make sure the output is as expected, not to test various input database connections. Hence for testing, you can "reduce" your multiple input connections to one.
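
    For illustration only (the connection names here are made up, but the key layout follows the standard simple-jndi convention used by config/simple-jndi/jdbc.properties), two logical input connections could both be pointed at the same test database like this:

    # hypothetical connection names; both resolve to the same test database
    sales_source/type=javax.sql.DataSource
    sales_source/driver=org.postgresql.Driver
    sales_source/url=jdbc:postgresql://localhost:5432/ktf_test
    sales_source/user=postgres
    sales_source/password=postgres

    crm_source/type=javax.sql.DataSource
    crm_source/driver=org.postgresql.Driver
    crm_source/url=jdbc:postgresql://localhost:5432/ktf_test
    crm_source/user=postgres
    crm_source/password=postgres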

    Going Agile: Sqitch Database Change Management


    Going Agile: Sqitch Database Change Management

    You have your database scripts under a dedicated version control and change management system, right? If not, I recommend doing this now.
    While there have been a handful of open source projects around which focus on DB script versioning and change management, none has really gained big momentum and a lot of them are dormant.
    But there is a new player on the ground! A light at the end of the DB change management tunnel - so to speak. David Wheeler has been working on Sqitch over the last year and the results are very promising indeed! Currently the GitHub project shows 7 other contributors, so let's hope this project gains strong momentum! A new GitHub project for a Sqitch GUI was also just started.

    Why I like Sqitch:
    • You can run all the commands from the command line and get very good feedback.
    • Everything seems quite logical and straightforward: It’s easy to get to know the few main commands  and in a very short amount of time you are familiar with the tool.
    • You can use your choice of VCS.
    • It works very well.

    Supported DBs are currently MySQL, Oracle, SQLite and PostgreSQL. CUBRID support is under way.



    So what do we want to achieve?

    Bring all DDL, stored procedures etc. under version control. This is what Git is very good for (or your VCS of choice).

    Keep track of the (order of) changes we applied to the database, verify that they are valid, and be able to revert back to a specific state if required. Furthermore, we want to deploy these changes (up to a specific state) to our test and production databases. This is what Sqitch is intended for:


      The write-up below consists of my notes, partially mixed with David's.

      Installation


      Options:
      PostgreSQL: cpan App::Sqitch DBD::Pg (You also have to have PostgreSQL server installed)
      SQLite: cpan App::Sqitch DBD::SQLite
      Oracle: cpan App::Sqitch DBD::Oracle (You also have to have SQL*Plus installed)
      MySQL: cpan App::Sqitch

      If you want to have support for i.e. PostgreSQL and Oracle you can just run:
      PostgreSQL: cpan App::Sqitch DBD::Pg DBD::Oracle

      For more install options see here.
      Below I will only discuss the setup for PostgreSQL.

      # make sure you have build tools installed (make etc.)
      sudo apt-get install build-essential

      # sqitch requires Locals support (asks for Locale::TextDomain and Locale::Messages)
      sudo apt-get install libintl-perl
      # make sure pg_config is installed
      sudo apt-get install libpq-dev
      # RHEL (Fedora, CentOS, etc.) use instead:
      # sudo yum install postgresql-devel

      # Note where the psql include dir is located. You might be asked for it on installing sqitch
      pg_config --includedir

      # Note the version of psql. You might be asked for it on installing sqitch
      psql --version

      # install sqitch (adjust for your DB)
      sudo cpan App::Sqitch DBD::Pg

      Once installation is finished, check out the man page:
      $ man sqitch

      Within your git project directory, create a dedicated folder:
      $ mkdir sqitch
      $ git add .
      $ cd sqitch
      $ sqitch --engine pg init projectname

      Let's have a look at sqitch.conf:
      $ cat sqitch.conf

      Now let’s add the connection details:
      $ vi sqitch.conf

      uncomment and specify:
      [core "pg"]
      client = psql
      username = postgres
      password = postgres
      db_name = dwh
      host = localhost
      port = 5432
      # sqitch_schema = sqitch

      If psql is not in the path, run:
      $ sqitch config --user core.pg.client /opt/local/pgsql/bin/psql
      Add your details:
      $ sqitch config --user user.name 'Diethard Steiner'
      $ sqitch config --user user.email 'diethard.steiner@'

      Let’s add some more config options: Define the default db so that we don’t have to type it all the time:
      $ sqitch config core.pg.db_name dwh
      Let's also make sure that changes are verified after deploying them:
      $ sqitch config --bool deploy.verify true
      $ sqitch config --bool rebase.verify true

      Check details:
      cat ~/.sqitch/sqitch.conf

      Have a look at the plan file. The plan file defines the execution order of the changes:
      $ cat sqitch.plan

      $ git add .
      $ git commit -am 'Initialize Sqitch configuration.'

      Add your first sql script/change:
      $ sqitch add create_stg_schema -n 'Add schema for all staging objects.'
      Created deploy/create_stg_schema.sql
      Created revert/create_stg_schema.sql
      Created verify/create_stg_schema.sql

      As you can see, Sqitch creates deploy, revert and verify files for you.

      $ vi deploy/create_stg_schema.sql

      Add:
      CREATE SCHEMA staging;

      Make sure you remove the default BEGIN; COMMIT; for this as we are just creating a schema and don’t require any transaction.

      $ vi revert/create_stg_schema.sql

      Add:
      DROP SCHEMA staging;

      $ vi verify/create_stg_schema.sql

      Add:
      SELECT pg_catalog.has_schema_privilege('staging', 'usage');

      This is quite PostgreSQL specific. For other dbs use something like this:
      SELECT 1/COUNT(*) FROM information_schema.schemata WHERE schema_name = 'staging';

      Now test if you can deploy the script and revert it:

      Try to deploy the changes:
      The general command looks like this:
      $ sqitch -d <dbname> deploy

      As we have already specified a default db in the config file, we only have to run the following:
      $ sqitch deploy
      Adding metadata tables to dwh
      Deploying changes to dwh
       + create_stg_schema .. ok

      Note the plus sign in the feedback which means this change was added.

      When you run deploy for the very first time, Sqitch will create maintenance tables in a dedicated schema automatically for you. These tables will (among other things) store in which “version” the DB is.

      Check the current deployment status of database dwh:
      $ sqitch -d dwh status
      # On database dwh
      # Project:  yes
      # Change:   bc9068f7af60eb159e2f8cc632f84d7a93c6fca5
      # Name:     create_stg_schema
      # Deployed: 2013-08-07 13:01:33 +0100
      # By:       Diethard Steiner <diethard.steiner@>


      To verify the changes run:
      $ sqitch -d dwh verify
      Verifying dwh
       * create_stg_schema .. ok
      Verify successful


      To revert the changes to the previous state, run:
      $ sqitch revert --to @HEAD^ -y

      Side note
      You can use @HEAD^^ to revert to two changes prior to the last deployed change.

      To revert everything:
      $ sqitch revert
      Revert all changes from dwh? [Yes] Yes
       - create_stg_schema .. ok

      To revert back to a specific script (you can also revert back to a specific tag):
      $ sqitch revert create_dma_schema
      Revert changes to create_dma_schema from dwh? [Yes]

      Let’s inspect the log:
      $ sqitch log

      Note that the actions we took are shown in reverse chronological order, with the revert first and then the deploy.

      Now let's commit it.
      $ git add .
      $ git commit -m 'Added staging schema.'

      Now that we have successfully deployed and reverted the current change, let’s deploy again:
      $ sqitch deploy
      Let’s add a tag:
      $ sqitch tag v1.0.0-dev1 -n 'Tag v1.0.0-dev1.'

      Deployment to target DBs
      So if you want to deploy these changes to your prod DB, for example, you can either do it like this:
      $ sqitch -d <dbname> -u <user> -h <host> -p <port> deploy
      (Important: If you are working with PostgreSQL, make sure you add your password to ~/.pgpass and then comment the password out in sqitch.conf beforehand otherwise this will not work.)
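
      A minimal sketch of the corresponding ~/.pgpass entry, based on the connection details used in sqitch.conf above (the format is host:port:database:user:password):

      $ echo "localhost:5432:dwh:postgres:postgres" >> ~/.pgpass
      $ chmod 600 ~/.pgpass   # PostgreSQL ignores the file unless its permissions are restricted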
      Or bundle them up, copy the bundle to your prod server and deploy it there:
      $ sqitch bundle
      Distribute the bundle
      On the prod server:
      $ cd bundle
      $ sqitch -d dwhprod deploy

      A future version of Sqitch will have better support for target DBs (see here).

      Using Sqitch with an existing project (where some ddl already exists)

      Sometimes you take over a project and want to bring the existing DDL under version control and change management.

      Thanks to David for providing details on this:

      The easiest option is to export the existing DDL and store it in one deploy file. For the revert file you could use a statement like this then:

         DROP $schema CASCADE;

      Let’s assume we call this change “firstbigchange”:

      The first time you do a deploy to the existing database with Sqitch, do it twice: once with --log-only to apply your first big change, and then, from then on, without:

         $ sqitch deploy --log-only --to firstbigchange
         $ sqitch deploy --mode change

      The --log-only option has Sqitch do everything in the deploy except actually run the deploy scripts. It just skips them, assumes they worked successfully, and logs them. You only want to do this --to that first big dump change, as after that you of course want Sqitch to actually run the deploy scripts.

      Using more than one DB

      DS: Currently it seems like there is a Sqitch version for each of these dbs. What if I was working on a project that used two different dbs installed on the same server and I wanted to use Sqitch for both of them (especially for dev I have more than one db installed on the same server/pc)?

      DW: You can have more than one plan and accompanying files in a single project by putting them into subdirectories. They would then be effectively separate Sqitch projects in your SCM. The sqitch.conf file at the top can be shared by them all, though, which is useful for setting up separate database info for them ([core.pg] and [core.mysql] sections, for example).

      If you are starting a new project, you would do it like this:

      $ sqitch --engine pg --top-dir pg init myproject
      $ sqitch --top-dir mysql init myproject

      Then you have sqitch.plan, deploy, revert, and verify in pg/, and sqitch.plan deploy, revert, and verify in mysql/. To add a new change, add it twice:

      $ sqitch --top-dir pg add foo
      $ sqitch --top-dir mysql add foo
