[Graph: Evolution over period with annotations]

Google Analytics records blog posting (test)

Nerd alert:
If we want to interpret web statistics, it is useful to know when important events happened. One such event is the date a new blog post is published. I use Google Analytics and Piwik next to each other, mostly for testing and comparing. Both Google Analytics and Piwik have ‘annotations’. These can be used to analyse your data. However…

The connection between this WordPress site and Piwik is handled by the plugin WP-Piwik. This plugin is responsible for recording website visits, but it can also automatically send an annotation to Piwik when a new blog post is published. These annotations are displayed, for instance, in the ‘Evolution over the period’ graph on the visitors overview page. You can easily see when a certain post was published and compare this with the visit statistics (see graph).
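For the curious: what such a plugin does behind the scenes amounts to a call to Piwik’s `Annotations.add` API method over HTTP. A minimal sketch in Python; the Piwik URL, site id and token below are placeholders for your own setup, not values from WP-Piwik itself:

```python
# Sketch: build the Piwik Annotations.add request for a new blog post.
# Fetching the resulting URL (e.g. with urllib.request.urlopen) would
# create the annotation; here we only construct it.
from urllib.parse import urlencode

def build_annotation_url(piwik_url, id_site, date, note, token_auth):
    """Build the Piwik Annotations.add API request URL."""
    params = {
        "module": "API",
        "method": "Annotations.add",
        "idSite": id_site,
        "date": date,            # e.g. "2014-11-20"
        "note": note,            # e.g. the post title
        "token_auth": token_auth,
        "format": "json",
    }
    return piwik_url + "/index.php?" + urlencode(params)

url = build_annotation_url("https://piwik.example.org", 1,
                           "2014-11-20", "New post published", "anonymous")
```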

Google Analytics also has annotations, but unfortunately the Google Analytics API does not seem to have a method (yet?) to add these annotations from an external source (you could add them manually each time you publish a post, but who is going to do that?). A workaround is provided by the WordPress plugin Google Analytics Internal. It should trigger an Analytics event when we publish a post.
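Under the hood, sending such an event comes down to a Measurement Protocol hit of type ‘event’ posted to Google Analytics. A hedged sketch; the tracking id, client id and the category/action/label names are made-up examples, not necessarily what Google Analytics Internal uses:

```python
# Sketch: payload for a GA Measurement Protocol (v1) event hit, as sent
# when a post is published. POST this payload to
# https://www.google-analytics.com/collect to record the event.
from urllib.parse import urlencode

def publish_event_payload(tracking_id, client_id, post_title):
    """Build the form-encoded body for a GA 'event' hit."""
    return urlencode({
        "v": 1,                # protocol version
        "tid": tracking_id,    # e.g. "UA-12345-1" (placeholder)
        "cid": client_id,      # anonymous client id
        "t": "event",          # hit type
        "ec": "WordPress",     # event category (example)
        "ea": "Publish",       # event action (example)
        "el": post_title,      # event label: the post title
    })

payload = publish_event_payload("UA-12345-1", "555", "New blog post")
```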

This morning I installed this plugin, and now it is time to test whether it works and to investigate how we can use these events to get better insight into the influence of a certain blog post on website visits.

Results in Google Analytics

It took a while, but via custom reports I am able to display the publish event and the page visits at the same time, and I could investigate which events took place. It needs some further investigation to see if we can tailor this more to my wishes… (and some more time to display the graph with the effect of this blog post). [Graph: Visits last week and publish event recording]

Combine two lists of registry agencies (fuzzy match)

The data


The IATI standard has a codelist of organisation registry agencies (Chambers of Commerce) used to create an organisation identifier. The organisation identifier should start with the code of one of these agencies. However, the codelist is not complete (yet).



On the OpenCorporates website it is possible to search for companies that are registered with several ‘registry agencies’. To use the information found via OpenCorporates, the original registry agencies should be on the IATI codelist. So we need to get a list of agencies from the OpenCorporates website that are not yet known to IATI.

The Problem

On both lists we have an (ISO 3166) country code and a registry name. The same registry could have a (slightly?) different name on each list, but (about) the same name could exist in multiple countries. So we want to perform a fuzzy match on the registry name, taking the country into account. We don’t need to compare the UK Companies House with the Companies House of Gibraltar.

Pentaho Data Integration has a fuzzy match step available to match two datasets on one field, using different search algorithms. Unfortunately it is not possible to add another field to restrict the possible matches.

The solution

[Screenshot: PDI fuzzy match]

Instead of a one step approach we need three steps:

  1. merge the datasets based on the country code
  2. calculate the (Levenshtein) distance
  3. determine the best/correct matches
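The three steps above can be sketched outside PDI as well. A minimal Python version with invented sample data; step 3 here simply keeps the candidate with the smallest distance:

```python
# Step 2 helper: classic dynamic-programming Levenshtein edit distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Invented sample rows: (country code, registry name)
iati = [("GB", "Companies House"), ("GI", "Companies House Gibraltar")]
oc   = [("GB", "Companies House (UK)"), ("GI", "Gibraltar Companies House")]

# Step 1: merge on country code; step 2: distance; step 3: best match.
best = {}
for c1, n1 in oc:
    candidates = [(levenshtein(n1.lower(), n2.lower()), n2)
                  for c2, n2 in iati if c1 == c2]
    if candidates:
        best[(c1, n1)] = min(candidates)
```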

In this case this method was sufficient; we only had a few registries in each country. But with other datasets this method is not optimal. I hope somebody will extend the PDI fuzzy match step. A JIRA case has already been filed.


Pentaho Community Meeting 2014: Hackathon

[Photo: Presenting our results at PCM14 (GPX output on screen)]

This year the Pentaho Community Meeting 2014 (PCM14) was in Antwerp and started with a (short) hackathon. Some company groups were formed. Together with Peter Fabricius I joined the people from Cipal (or they joined us). We did not get an assignment, so we had to come up with something nice ourselves. Our first thought was to do ‘something’ with Philae (the lander that just started sending information from the comet “Churyumov-Gerasimenko”). We searched for some data, but we could not find anything useful.

So we decided to take a subject closer to home and wondered if we could map the locations of the PCM14 participants. We already had a kettle transformation to get the location data from Facebook pages (city, country), pass it to a geocoding service to get the latitude and longitude, and save it to a GPX file. It was based on some work Peter did for German rally teams driving to the Orient. We ‘only’ needed to adjust it to our needs, and we needed data to request the Facebook company pages of the participants.

From Bart we got a list of the email addresses of the participants (it has advantages that you are part of a semi-Belgian team and one of the team members was actually working on Bart’s machine ;-)). We were able to grab the domain name without the country code using Libre Office (sorry, we only had an hour to code) and tried to feed it to the Facebook Graph API. It is basically just an HTTP client step to get the info from e.g. http://graph.facebook.com/pentaho. This returns the company page in a nice JSON format (unfortunately(?) the Graph API does not return the location for normal ‘users’ with this method). One request broke the kettle transformation (some strange error), so we removed that organization.
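The ‘grab the domain name’ step we did in Libre Office amounts to a few string operations. A sketch with made-up addresses (real participant emails are of course not shown):

```python
# Sketch: turn an email address into the slug we fed to
# http://graph.facebook.com/<slug> to fetch the company page.
def company_slug(email):
    """Take the domain part of the address and strip the TLD / country code."""
    domain = email.split("@", 1)[1]   # part after the @
    return domain.split(".")[0]       # first label, e.g. "pentaho"

emails = ["someone@pentaho.com", "other@example.co.uk"]  # invented examples
slugs = [company_slug(e) for e in emails]
```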

Facebook returned the country name, but the geocoding tool needed the 2-character country code. Because Peter had only German teams, he had just added GE, but of course this was not an option for us. Fortunately we had a database with the country-to-ISO-code translation, so we could feed the geocoding service the right data, and this also returns some nice JSON.
After about 37 requests we got an error: no content is allowed before the prolog (or something like that). Damn, we hit some rate limiting… So we delayed each request by a second to get all the results. On the first run we did not get all the results. Why? We don’t know…
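The one-second delay between requests is easy to sketch. Here `geocode()` is just a stub standing in for the real HTTP call to the geocoding service:

```python
# Sketch: throttle geocoding requests to stay under a rate limit.
import time

def geocode(place):
    """Stub for the real geocoding HTTP request (invented result)."""
    return {"place": place, "lat": 0.0, "lon": 0.0}

def geocode_all(places, delay=1.0):
    """Geocode each place, sleeping between requests."""
    results = []
    for place in places:
        results.append(geocode(place))
        time.sleep(delay)  # one second between calls avoided the errors
    return results
```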

In the meantime Peter and ‘uh, I forgot his name’ were busy trying to get the BI server installed and prepare a dashboard with a map, which should read from a kettle transformation step and plot the participants. They also had some issues, but……

It was time for the presentations… At that point we did not have anything to show… no results of the kettle transformation, no map… During the setup of one of the presentations I ran the kettle transformation again and, hooray, I got a GPX file. It contained 9 locations of the participants (we had about 55 different companies in our list). Since we did not have the map ready, we could not present it using the BI server. But also in this case ‘Google was our friend’: by uploading it to Google Drive and using the content preview with My GPX Reader (it took some clicks), we were able to show it to the public.
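For reference, a GPX file of this kind is just XML with one `<wpt>` waypoint per participant. A minimal sketch of writing one; the coordinates here are approximate values for Antwerp, invented for illustration:

```python
# Sketch: emit a minimal GPX 1.1 document from a list of waypoints.
def to_gpx(waypoints):
    """Render waypoints (dicts with lat, lon, name) as a GPX string."""
    wpts = "\n".join(
        '  <wpt lat="{lat}" lon="{lon}"><name>{name}</name></wpt>'.format(**w)
        for w in waypoints)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<gpx version="1.1" creator="pcm14-hackathon">\n'
            + wpts + "\n</gpx>\n")

gpx = to_gpx([{"lat": 51.22, "lon": 4.40, "name": "Antwerp"}])
```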

On my way to the podium I noticed Facebook also returns the latitude and longitude, so we did not need the detour via the geocoding service 🙁


After all presentations were given, the jury discussed the products and presentations, and we won!!! (as did all the other teams). We got some nice Raspberry Pi B+ boards. In case you don’t know what it is: basically it is a hand-sized desktop computer with no case and a lot of connectors…

Thanks Bart and Matt for organizing this hackathon!!!

Edit: By request I added a sample input file. I also changed it to read CSV: facebook_locations