'Kaggle Display Advertising Challenge' working with vw-luigi

May 3, 2016   #English  #ML 

When you tackle a machine learning problem with vowpal wabbit, do you find it tedious to write repetitive evaluation code such as cross-validation? vw-luigi (https://github.com/shotarok/vw-luigi) helps in exactly that situation.

vw-luigi provides luigi workflows for evaluating models trained with vowpal wabbit. All you need to do is prepare the training and test data. vw-luigi then trains a model, predicts with it, and produces the evaluation result automatically.
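To make the train, predict, evaluate sequence concrete, here is a rough sketch of the two vw invocations that such a workflow would need to run. This is not vw-luigi's actual code: the function names are hypothetical, and the file names model.vw and predict.vw are taken from the directory listing later in this post. The flags (`-d`, `-f`, `-t`, `-i`, `-p`, `--loss_function`) are standard vw command-line options.

```python
import os
import subprocess


def train_command(work_dir, loss_func="squared"):
    """Build the vw command that trains on train.vw and saves model.vw."""
    return [
        "vw",
        "-d", os.path.join(work_dir, "train.vw"),   # training data
        "-f", os.path.join(work_dir, "model.vw"),   # where to save the model
        "--loss_function", loss_func,
    ]


def predict_command(work_dir):
    """Build the vw command that scores test.vw with the saved model."""
    return [
        "vw",
        "-t",                                        # test only, no learning
        "-d", os.path.join(work_dir, "test.vw"),     # test data
        "-i", os.path.join(work_dir, "model.vw"),    # model to load
        "-p", os.path.join(work_dir, "predict.vw"),  # where to write predictions
    ]


def run_train_and_predict(work_dir, loss_func="squared"):
    """Run both steps in order; requires the vw binary on PATH."""
    subprocess.check_call(train_command(work_dir, loss_func))
    subprocess.check_call(predict_command(work_dir))
```

Wrapping steps like these in luigi tasks is what lets the workflow skip work whose output files already exist.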

In this post, I’ll walk through an example of using vw-luigi with the ‘Kaggle Display Advertising Challenge’ dataset.

vw-luigi with ‘Kaggle Display Advertising Challenge’

Download Dataset

The ‘Display Advertising Challenge’ was a competition to benchmark the most accurate ML algorithms for click-through rate (CTR) estimation. It ran for 90 days in 2014. The dataset was provided by Criteo. The data is not available on the competition page at kaggle.com, but you can currently download it from the Criteo Labs page (here). It is available for non-commercial purposes only.

dac dataset

If you download dac.tar.gz from this page and decompress it, you get readme, train.txt and test.txt.

Prepare training and test data for vw

According to the readme, the TSV data consists of 40 columns. The first column is a binary value indicating whether the ad was clicked. The next 13 columns are integer features, mostly counts. The remaining 26 columns are categorical features whose values have been hashed onto 32 bits.

To use this data as input for vw, we need to convert the TSV into vw format.

If you save this gist as tsv_to_vw.py, you can convert the training and test data with the following commands.

$ mkdir -p /tmp/work/space
$ cat train.txt | python tsv_to_vw.py > /tmp/work/space/train.vw
$ cat test.txt | python tsv_to_vw.py > /tmp/work/space/test.vw
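The gist itself is not reproduced here. As a minimal sketch of what such a converter could look like (not the gist's actual code), the function below turns one TSV row into one vw example line. Several choices are assumptions of this sketch: the namespace names `i` and `c`, the feature names `I1`..`I13` and `C1`..`C26`, mapping a click to `1` and a non-click to `-1`, skipping empty (missing) values, and treating rows with only 39 columns as unlabeled.

```python
NUM_INT = 13  # integer (mostly count) columns, per the readme
NUM_CAT = 26  # hashed categorical columns, per the readme


def tsv_line_to_vw(line):
    """Convert one tab-separated row into a vw example line."""
    cols = line.rstrip("\n").split("\t")

    label = ""
    if len(cols) == 1 + NUM_INT + NUM_CAT:
        # Assumption: click -> 1, no click -> -1.
        label = "1" if cols[0] == "1" else "-1"
        cols = cols[1:]
    # Otherwise the row is assumed to have no label column.

    ints = cols[:NUM_INT]
    cats = cols[NUM_INT:NUM_INT + NUM_CAT]

    # Integer columns become numeric features (name:value); empty values are skipped.
    int_feats = ["I{}:{}".format(i, v) for i, v in enumerate(ints, 1) if v]
    # Categorical columns become indicator features named after column and value.
    cat_feats = ["C{}_{}".format(i, v) for i, v in enumerate(cats, 1) if v]

    parts = [label] if label else []
    if int_feats:
        parts.append("|i " + " ".join(int_feats))
    if cat_feats:
        parts.append("|c " + " ".join(cat_feats))
    return " ".join(parts)
```

To use it as a filter in a pipeline like the one above, a script would call it on every line of standard input, for example `for line in sys.stdin: print(tsv_line_to_vw(line))`.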

Evaluate model with vw-luigi

Finally, let’s clone vw-luigi and run it.

$ git clone git@github.com:shotarok/vw-luigi.git ~/vw-luigi
$ cd ~/vw-luigi
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

If you use /tmp/work/space/train.vw as training data, /tmp/work/space/test.vw as test data and squared loss as the loss function, the commands below produce an evaluation result that includes AUROC, AUPR and LogLoss, calculated by scikit-learn.

$ cd ~/vw-luigi
$ source venv/bin/activate
$ ls /tmp/work/space
> train.vw test.vw
$ python -m luigi --module vwluigi EvalTask --loss-func squared --work-dir /tmp/work/space --local-scheduler
 ...
$ ls /tmp/work/space
> model.vw predict.vw result.txt train.vw
$ cat /tmp/work/space/result.txt

The evaluation result looks like the one in this GIF.

gif
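For readers who want to check the numbers by hand, the three metrics can be computed with scikit-learn roughly as follows. This is a sketch, not vw-luigi's code: the function name `evaluate` is hypothetical, `average_precision_score` is used here as one common way to summarize the precision-recall curve (vw-luigi may compute AUPR differently), and the clipping step is an assumption needed because a model trained with squared loss can output scores outside the 0 to 1 range, which log loss cannot accept.

```python
import numpy as np
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score


def evaluate(labels, predictions, eps=1e-15):
    """Return AUROC, AUPR and log loss for binary labels and real-valued scores."""
    y_true = np.asarray(labels)
    scores = np.asarray(predictions, dtype=float)

    # Assumption: clip scores into (0, 1) so they can be read as probabilities.
    probs = np.clip(scores, eps, 1 - eps)

    return {
        "AUROC": roc_auc_score(y_true, scores),
        "AUPR": average_precision_score(y_true, scores),
        "LogLoss": log_loss(y_true, probs),
    }
```

The labels would come from the first field of each line in test.vw and the scores from the prediction file, assuming one prediction per line.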