Working on the 'Kaggle Display Advertising Challenge' with vw-luigi

By Shotaro Kohama on May 03, 2016

When you tackle a machine learning problem with Vowpal Wabbit, have you ever found it tedious to write repetitive evaluation code such as cross-validation? vw-luigi (https://github.com/shotarok/vw-luigi) helps in exactly that situation.

vw-luigi provides luigi workflows for evaluating models trained with Vowpal Wabbit. All you need to do is prepare the training and test data; vw-luigi then trains a model, predicts with it, and produces the evaluation result automatically.
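As a rough picture of the train and predict steps, the sketch below builds the kind of vw command lines such a workflow could run. This is an assumption about what happens under the hood, not code taken from vw-luigi; the function names and the file names inside the work directory (train.vw, test.vw, model.vw, predict.vw, chosen to match the files used later in this post) are illustrative. Only standard vw options are used: -d (data), -f (write model), -i (read model), -t (test only), -p (write predictions), and --loss_function.

```python
import os
import subprocess


def build_vw_commands(work_dir, loss_func="squared"):
    """Return (train_cmd, predict_cmd) as argument lists for the vw CLI.

    The file layout inside work_dir is an assumption for illustration.
    """
    train = os.path.join(work_dir, "train.vw")
    test = os.path.join(work_dir, "test.vw")
    model = os.path.join(work_dir, "model.vw")
    predict = os.path.join(work_dir, "predict.vw")

    # Train on train.vw and save the model.
    train_cmd = ["vw", "-d", train, "-f", model, "--loss_function", loss_func]
    # Load the model, run in test-only mode, and write predictions.
    predict_cmd = ["vw", "-t", "-d", test, "-i", model, "-p", predict]
    return train_cmd, predict_cmd


def run_pipeline(work_dir, loss_func="squared"):
    """Run training, then prediction; requires vw to be installed."""
    for cmd in build_vw_commands(work_dir, loss_func):
        subprocess.check_call(cmd)
```

The point of a workflow tool like luigi is that these steps, plus the evaluation, are chained and re-run only when their outputs are missing, so you do not have to script that by hand.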

In this post, I’ll walk through an example of using vw-luigi with the ‘Kaggle Display Advertising Challenge’ dataset.

vw-luigi with ‘Kaggle Display Advertising Challenge’

Download Dataset

The ‘Display Advertising Challenge’ was a competition to benchmark the most accurate ML algorithms for click-through rate (CTR) estimation. It was held for 90 days in 2014, and the dataset was provided by Criteo. The data is not available on the competition page on kaggle.com. Currently you can download it from the Criteo Labs page (here). It is available for non-commercial purposes only.

dac dataset

If you download dac.tar.gz from this page and decompress it, you get a readme, train.txt, and test.txt.

Prepare training and test data for vw

According to the readme, the TSV data consists of 40 columns. The first column is a binary value indicating whether the ad was clicked. The following 13 columns are integer values, mostly count features. The remaining 26 columns are categorical features whose values have been hashed onto 32 bits.

To use this data as input to vw, we need to convert the TSV into vw format.

If you save this gist as tsv_to_vw.py, you can convert the training and test data with the following commands.

$ mkdir -p /tmp/work/space
$ cat train.txt | python tsv_to_vw.py > /tmp/work/space/train.vw
$ cat test.txt | python tsv_to_vw.py > /tmp/work/space/test.vw
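As a rough idea of what a converter such as tsv_to_vw.py might do, here is a minimal sketch. It is not the gist's code: the namespace names i and c, the feature naming scheme, the 1 / -1 label mapping, and the has_label switch are all assumptions made for illustration. The column counts follow the readme description above.

```python
import sys

NUM_INT = 13  # integer (count) columns, per the readme
NUM_CAT = 26  # hashed categorical columns, per the readme


def tsv_line_to_vw(line, has_label=True):
    """Convert one tab-separated row into a vw example string."""
    cols = line.rstrip("\n").split("\t")
    if has_label:
        # Assumption: clicked -> 1, not clicked -> -1.
        label = "1" if cols[0] == "1" else "-1"
        cols = cols[1:]
    else:
        label = ""
    ints = cols[:NUM_INT]
    cats = cols[NUM_INT:NUM_INT + NUM_CAT]
    # Numeric features as name:value; missing (empty) fields are skipped.
    int_feats = ["i{}:{}".format(k + 1, v) for k, v in enumerate(ints) if v != ""]
    # Categorical features carry the column index so that equal hashes in
    # different columns remain distinct features.
    cat_feats = ["c{}_{}".format(k + 1, v) for k, v in enumerate(cats) if v != ""]
    example = "{} |i {} |c {}".format(label, " ".join(int_feats), " ".join(cat_feats))
    return example.strip()


def convert_stream(src, dst, has_label=True):
    """Read TSV rows from src and write vw examples to dst."""
    for line in src:
        if line.strip():
            dst.write(tsv_line_to_vw(line, has_label) + "\n")
```

Calling convert_stream(sys.stdin, sys.stdout) at the bottom of the file would turn the sketch into a filter that fits a shell pipeline.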

Evaluate model with vw-luigi

Finally, let’s clone vw-luigi and run it.

$ git clone git@github.com:shotarok/vw-luigi.git ~/vw-luigi
$ cd ~/vw-luigi
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

We can get an evaluation result that includes AUROC, AUPR, and LogLoss, calculated by scikit-learn, as shown below. The following command uses /tmp/work/space/train.vw as training data, /tmp/work/space/test.vw as test data, and squared loss as the loss function.

$ cd ~/vw-luigi
$ source venv/bin/activate
$ ls /tmp/work/space
> train.vw test.vw
$ python -m luigi --module vwluigi EvalTask --loss-func squared --work-dir /tmp/work/space --local-scheduler
 ...
$ ls /tmp/work/space
> model.vw predict.vw result.txt train.vw
$ cat /tmp/work/space/result.txt

The evaluation result looks like the one in this gif.

gif
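To make two of the reported metrics concrete, the sketch below computes LogLoss and AUROC in plain Python. It only illustrates what the numbers measure; it is not vw-luigi's code, which relies on scikit-learn (where the corresponding functions are roc_auc_score and log_loss). The function names, the 0/1 label encoding, and the clipping constant are assumptions.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip so that log(0) never occurs; this also guards against
        # predictions slightly outside [0, 1].
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise version is O(n_pos * n_neg),
    so it is meant for small illustrations only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

AUROC depends only on how the predictions rank the examples, while LogLoss also penalizes poorly calibrated probabilities, which is why it is useful to report both.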