PDI python executor

One of my clients has a Python script that validates incoming data files. One important feature is that it checks the hash of each file to determine whether the file is legitimate. Of course it would be possible to convert the Python script into a Pentaho PDI transformation, but why not use the existing script?
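To give an idea of what such a check looks like, here is a minimal sketch. It is not the client's script: the function names `file_sha256` and `is_legitimate` are made up for illustration, and SHA-256 is an assumption, since the actual script may use a different hash algorithm.

```python
import hashlib
import hmac


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_legitimate(path, expected_hex):
    """Compare the file's digest with the expected one (hypothetical helper).
    The expected value is normalised so upper/lower case does not matter."""
    return hmac.compare_digest(file_sha256(path), expected_hex.strip().lower())
```

The expected digest would normally be delivered alongside the data file, for example in a separate checksum file.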


PDI has a plugin called CPython Script Executor, which was developed in Pentaho Labs. It can be installed via the Marketplace. Unfortunately, the Marketplace entry does not mention what is required to execute a Python script. Luckily, the requirements are listed in the documentation in the GitHub repository: the plugin needs pandas and scikit-learn (sklearn).

Knowing a little bit about Python, I tried to install both using pip. On my Ubuntu laptop that did not work: I could not get sklearn installed. A little browsing brought me to http://scikit-learn.org/stable/install.html, which suggests installing sklearn from the Linux repositories. So I did that, and I also removed the pip-installed pandas and installed it from the Linux repository instead. After that I was able to run the sample PDI transformations provided by Mark Hall.
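A quick way to see whether the Python interpreter can find the two packages is a small check like the one below. This is only a sketch: `missing_modules` is a name I made up, and it only tells you whether the packages can be located by the interpreter you run it with, which may not be the same interpreter PDI uses.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pandas and sklearn are the two requirements mentioned above
    missing = missing_modules(["pandas", "sklearn"])
    if missing:
        print("Not installed: " + ", ".join(missing))
    else:
        print("pandas and sklearn are both available")
```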

First results

The CPython Script Executor is targeted at data scientists, and I guess it is of great value for manipulating big datasets or performing complex calculations. For my purpose, however, it seems rather slow. I tried a simple transformation that reads 10 rows with one field containing the value ‘pietje’. The Python script checks whether the value is ‘pietje’: if so, it returns 1, otherwise 0. This takes about 6 seconds to complete. So a more complicated script with more data probably needs a different approach.
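For reference, the logic of that test is about as simple as it gets. The sketch below reproduces it in plain Python, outside PDI; `flag_pietje` is a hypothetical name and the input is a plain list rather than whatever structure the plugin hands to the script.

```python
def flag_pietje(values, expected="pietje"):
    """Return 1 for every value equal to `expected`, otherwise 0."""
    return [1 if value == expected else 0 for value in values]


# Ten rows that all contain 'pietje', as in the test transformation
rows = ["pietje"] * 10
flags = flag_pietje(rows)
```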