<h1 id="asking-the-right-question">Asking the right question</h1>
<p><em>2020-07-12</em></p>
<p>Supervised learning is the machine learning branch that deals with function approximation: using several input-output pairs generated by an unknown target function, construct a different function that approximates the target function. For example, the target function may be my personal movie preferences, and we might be interested in obtaining a model that can predict (approximately) how much I will enjoy watching some new movie. With such a model we can create a movie recommendation app.</p>
<p>Some functions can be easier to approximate than others (given a definition of approximation difficulty, but I won’t go down that rabbit hole right now), and some tasks can be framed as more than one function. This raises the question - do different framings result in different model performance? To find out I tried playing with two framings of a toy problem.</p>
<h2 id="the-data">The data</h2>
<p>I used the <a href="https://scikit-learn.org/stable/datasets/index.html#olivetti-faces-dataset">Olivetti faces dataset</a>, which contains grayscale, 64x64 images of the faces of 40 subjects (10 images per subject). Here are some of the faces:
<img src="/assets/faces_framing/faces_sample.png" alt="Face data sample" /></p>
<h2 id="the-task">The task</h2>
<p>The task is the classical face recognition task (which has been quite controversial lately due to questionable use in settings such as law enforcement). To make things more interesting, I decided to use only two images from each subject for training, and the rest as the test set. So the goal is to train a model which, given an image, outputs the subject that the model believes this face belongs to.</p>
<h3 id="scope">Scope</h3>
<p>I wanted to focus just on the aspects of training that pertain to the problem framing, and treat it as a general problem. For that purpose I excluded many specifics that would be very important for a real face recognition application:</p>
<ul>
<li>Using existing face recognition models or <a href="https://docs.opencv.org/2.4/modules/contrib/doc/facerec/facerec_tutorial.html">existing techniques specific to face recognition</a></li>
<li>Using <a href="https://link.springer.com/article/10.1186/s40537-019-0197-0">data augmentation</a> to generate more training samples</li>
<li>Obtaining more face data (even without subject information) and performing unsupervised pre-training</li>
<li>Assigning each prediction a confidence score, and fixing a confidence threshold below which no result is reported</li>
</ul>
<p>In short, I wanted to see what difference just changing the target function would make. Since the functions are different the models may be somewhat different as well, but they are trained on the same (base) data.</p>
<h3 id="performance-metric">Performance metric</h3>
<p>To measure model performance, I used the accuracy metric - percentage of correct classifications. For each framing I ran about 100 train/test splits (with two images in the training set and eight in the test set).</p>
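<p>To make the setup concrete, here is a minimal sketch of such a per-subject split (the function name and implementation details are mine, not from the post's source code):</p>

```python
import numpy as np

def per_subject_split(X, y, n_train=2, seed=None):
    """For each subject, put n_train images in the training set
    and the rest in the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for subject in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == subject))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

<p>With 40 subjects and 10 images each, this yields 80 training and 320 test images per run.</p>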
<h2 id="baseline">Baseline</h2>
<p>As a baseline I used a (single) nearest neighbor classifier with the L2 norm. That is, to classify a new face, we compute, for each face in the training set, the sum of squared differences between corresponding pixels, and answer with the subject of the closest training face.</p>
<p><img src="/assets/faces_framing/faces_knn.png" alt="Nearest neighbor face classification" /></p>
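<p>The baseline can be sketched in a few lines of numpy (an illustration of the description above, not the post's actual code):</p>

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """1-NN with squared-L2 distance on raw (flattened) pixels."""
    # Pairwise squared distances between every test and train image;
    # fine for small sets like this, memory-hungry for large ones.
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]
```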
<p>Intuitively it’s hard to tell how well this model would fare. On one hand there should obviously be many similarities between images of the same person (including factors
we would have liked to exclude, such as lighting and clothing).
On the other hand, many of the similarities we perceive in faces will not be reflected in the pixel-level comparison.
In this case the model's accuracy was about <strong>70.5%</strong>, which is quite impressive in my opinion, considering that a random model would achieve about 2.5% accuracy on average.</p>
<p>Let’s see how a more sophisticated model fares.</p>
<h2 id="first-approach">First approach</h2>
<p>The first framing is the explicit one: given an image, we want to know whose face it is, so that’s what we’ll ask the model. The function maps images to subject identifiers.</p>
<p><img src="/assets/faces_framing/first_approach.png" alt="Mapping image to subject ID" /></p>
<p>For the model I used a simple network with Keras:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">),</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">y_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">1200</span><span class="p">)</span></code></pre></figure>
<p>I played with several variations and this seemed to be the best configuration with regard to number of layers, their sizes and activation functions. Its test accuracy was, on average, about <strong>70.9%</strong> - an ever-so-slight improvement.
I think part of the challenge is that classifying faces requires relatively complex features, while we have very little training data (especially considering the number of positive instances per class).
So the model either fails to find a pattern if the network is too small, or overfits if it's too large.</p>
<h2 id="second-approach">Second approach</h2>
<p>Let’s try a less direct framing. We know that if two images belong to the same person, they should be relatively similar, and vice versa. Therefore, instead of training the model to identify faces, we can train the model to <em>compare</em> faces. In this case, instead of 40 classes (one for every subject) we only have two classes: “same person” or “not the same person”.</p>
<p><img src="/assets/faces_framing/second_approach.png" alt="Mapping image pairs to similarity" /></p>
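<p>One way to build the pairwise training set, sketched below (the post doesn't show its pairing code, so the names and details are my assumptions), is to label every unordered pair of training images by whether the subjects match:</p>

```python
import numpy as np
from itertools import combinations

def make_pairs(X, y):
    """Concatenate every unordered pair of training images,
    labeled 1 if both belong to the same subject, else 0."""
    pairs, labels = [], []
    for i, j in combinations(range(len(X)), 2):
        pairs.append(np.concatenate([X[i], X[j]]))
        labels.append(int(y[i] == y[j]))
    return np.array(pairs), np.array(labels)
```

<p>With 80 training images (2 per subject × 40 subjects) this yields 3,160 pairs, only 40 of which are positive - one negative-to-positive ratio of 79:1, matching the class weight used in the training code.</p>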
<p>Training this model was a little trickier:</p>
<ul>
<li>The best architecture turned out to be pretty similar to two (“sideways”) concatenations of the first approach model, which I thought was pretty neat.</li>
<li>Due to a vanishing gradients issue, I had to go with a slower learning rate and slow it even more as the loss decreased.</li>
<li>This time we have an <em>imbalanced</em> classification task, so I gave the positive class a bigger weight.</li>
<li>Training took longer and in a handful of cases (about 5 out of 100) didn’t converge and needed restarting.</li>
</ul>
<p>Another difference is that using this framing, inference isn’t straightforward. Instead, we run the model on the input image along with each of the training images, and pick the subject of the image that the model deemed most similar to the input image.</p>
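<p>This inference procedure might be sketched as follows (a hypothetical helper, assuming pairs are fed as concatenated image vectors and the model outputs two softmax probabilities, with index 1 meaning "same person"):</p>

```python
import numpy as np

def predict_subject(model, x, X_train, y_train):
    """Classify x by pairing it with every training image and
    taking the subject of the most 'same person'-like pair."""
    pairs = np.hstack([np.tile(x, (len(X_train), 1)), X_train])
    same_prob = model.predict(pairs)[:, 1]  # P("same person") per pair
    return y_train[same_prob.argmax()]
```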
<p>Here is the code for the model and training:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="p">.</span><span class="mi">0001</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">45</span><span class="p">):</span>
<span class="n">hist</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">),</span>
<span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">class_weight</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mi">79</span><span class="p">},</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">last_loss</span> <span class="o">=</span> <span class="n">hist</span><span class="p">.</span><span class="n">history</span><span class="p">[</span><span class="s">'loss'</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">lr</span> <span class="o">=</span> <span class="p">.</span><span class="mi">0001</span>
<span class="k">if</span> <span class="n">last_loss</span> <span class="o"><=</span> <span class="p">.</span><span class="mi">1</span><span class="p">:</span>
<span class="n">lr</span> <span class="o">=</span> <span class="p">.</span><span class="mi">00001</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="n">lr</span><span class="p">))</span></code></pre></figure>
<p>The accuracy of this model was, on average, about <strong>74.4%</strong>, which is an improvement over both the first approach and the baseline. However, the spread of the results was larger, resulting in both much worse and much better runs. In this problem, a different framing made quite a significant difference.</p>
<h2 id="combined-approach">Combined approach</h2>
<p>After seeing the better average but also bigger spread of the second approach I wondered if it would be possible to create a model that optimizes for both using a non-linear computation graph.
The idea was this: each input sample would contain two faces, which would each “go through” several dense layers. The images would be transformed by the same layers separately, and the resulting representation would be used in two ways:</p>
<ol>
<li>Classify each face</li>
<li>Concatenate the two representations and, after several more dense layers, classify whether or not they belong to the same person</li>
</ol>
<p>I also used different weights for the two framings, which worked a little better.</p>
<p>Here’s the code for this model and its training:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x1</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">pre_X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face1'</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">pre_X_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face2'</span><span class="p">)</span>
<span class="n">L1</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">x1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face_rep1'</span><span class="p">)</span>
<span class="n">BN1</span> <span class="o">=</span> <span class="n">BatchNormalization</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'batch_norm1'</span><span class="p">)</span>
<span class="n">L2</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">128</span><span class="p">,),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face_rep2'</span><span class="p">)</span>
<span class="n">BN2</span> <span class="o">=</span> <span class="n">BatchNormalization</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'batch_norm2'</span><span class="p">)</span>
<span class="n">L3</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">64</span><span class="p">,),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face_rep3'</span><span class="p">)</span>
<span class="n">O1</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,),</span> <span class="n">name</span><span class="o">=</span><span class="s">'face_class'</span><span class="p">)</span>
<span class="n">R1</span> <span class="o">=</span> <span class="n">BN2</span><span class="p">(</span><span class="n">L2</span><span class="p">(</span><span class="n">BN1</span><span class="p">(</span><span class="n">L1</span><span class="p">(</span><span class="n">x1</span><span class="p">))))</span>
<span class="n">R2</span> <span class="o">=</span> <span class="n">BN2</span><span class="p">(</span><span class="n">L2</span><span class="p">(</span><span class="n">BN1</span><span class="p">(</span><span class="n">L1</span><span class="p">(</span><span class="n">x2</span><span class="p">))))</span>
<span class="n">C1</span> <span class="o">=</span> <span class="n">concatenate</span><span class="p">([</span><span class="n">R1</span><span class="p">,</span> <span class="n">R2</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">'face_rep_concat'</span><span class="p">)</span>
<span class="n">L4</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">128</span><span class="p">,),</span> <span class="n">name</span><span class="o">=</span><span class="s">'comparison_dense'</span><span class="p">)</span>
<span class="n">BN3</span> <span class="o">=</span> <span class="n">BatchNormalization</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'batch_norm3'</span><span class="p">)</span>
<span class="n">O2</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">64</span><span class="p">,),</span> <span class="n">name</span><span class="o">=</span><span class="s">'comparison_res'</span><span class="p">)</span>
<span class="n">face1_res</span> <span class="o">=</span> <span class="n">O1</span><span class="p">(</span><span class="n">L3</span><span class="p">(</span><span class="n">R1</span><span class="p">))</span>
<span class="n">face2_res</span> <span class="o">=</span> <span class="n">O1</span><span class="p">(</span><span class="n">L3</span><span class="p">(</span><span class="n">R2</span><span class="p">))</span>
<span class="n">comparison_res</span> <span class="o">=</span> <span class="n">O2</span><span class="p">(</span><span class="n">BN3</span><span class="p">(</span><span class="n">L4</span><span class="p">(</span><span class="n">C1</span><span class="p">)))</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">face1_res</span><span class="p">,</span> <span class="n">face2_res</span><span class="p">,</span> <span class="n">comparison_res</span><span class="p">])</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">plot_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s">'model.png'</span><span class="p">,</span> <span class="n">show_shapes</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="p">.</span><span class="mi">0005</span><span class="p">),</span>
<span class="n">loss</span><span class="o">=</span><span class="p">[</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">,</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">,</span>
<span class="n">weighted_categorical_crossentropy</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">79</span><span class="p">]),</span>
<span class="p">],</span>
<span class="n">loss_weights</span><span class="o">=</span><span class="p">[.</span><span class="mi">05</span><span class="p">,</span> <span class="p">.</span><span class="mi">05</span><span class="p">,</span> <span class="mf">1.</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">130</span><span class="p">):</span>
<span class="n">hist</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">([</span><span class="n">X1_train</span><span class="p">,</span> <span class="n">X2_train</span><span class="p">],</span> <span class="p">[</span><span class="n">y1_train</span><span class="p">,</span> <span class="n">y2_train</span><span class="p">,</span> <span class="n">y3_train</span><span class="p">],</span>
<span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">last_loss</span> <span class="o">=</span> <span class="n">hist</span><span class="p">.</span><span class="n">history</span><span class="p">[</span><span class="s">'comparison_res_loss'</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">lr</span> <span class="o">=</span> <span class="p">.</span><span class="mi">0005</span>
<span class="k">if</span> <span class="n">last_loss</span> <span class="o"><=</span> <span class="p">.</span><span class="mi">5</span><span class="p">:</span>
<span class="n">lr</span> <span class="o">=</span> <span class="p">.</span><span class="mi">0001</span>
<span class="k">if</span> <span class="n">last_loss</span> <span class="o"><=</span> <span class="p">.</span><span class="mi">1</span><span class="p">:</span>
<span class="n">lr</span> <span class="o">=</span> <span class="p">.</span><span class="mi">00001</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="n">lr</span><span class="p">),</span>
<span class="n">loss</span><span class="o">=</span><span class="p">[</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">,</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">,</span>
<span class="n">weighted_categorical_crossentropy</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">79</span><span class="p">]),</span>
<span class="p">],</span>
<span class="n">loss_weights</span><span class="o">=</span><span class="p">[.</span><span class="mi">05</span><span class="p">,</span> <span class="p">.</span><span class="mi">05</span><span class="p">,</span> <span class="mf">1.</span><span class="p">])</span></code></pre></figure>
<p>Here’s a visual description of what’s happening:</p>
<p><img src="/assets/faces_framing/combined_approach.png" alt="Combined approach model" /></p>
<p>This model took the longest to train. The average accuracy was <strong>73.3%</strong>, better than the baseline and the first approach but not as good as the second; however, it was much more stable and there were no incidents of non-convergence. So it seems the combination indeed let us enjoy the best of both worlds: somewhat better performance while preserving stability.</p>
<h2 id="comparison">Comparison</h2>
<table>
<thead>
<tr>
<th>Model</th>
<th>Description</th>
<th>Mean</th>
<th>Median</th>
<th>5%</th>
<th>95%</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>Nearest neighbor</td>
<td>70.55%</td>
<td>70.625%</td>
<td>65%</td>
<td>76.25%</td>
</tr>
<tr>
<td>First approach</td>
<td>Face classification</td>
<td>70.916%</td>
<td>71.094%</td>
<td>65.587%</td>
<td>76.25%</td>
</tr>
<tr>
<td>Second approach</td>
<td>Similarity classification</td>
<td><strong>74.381%</strong></td>
<td><strong>75.312%</strong></td>
<td>65.75%</td>
<td><strong>81.9%</strong></td>
</tr>
<tr>
<td>Combined approach</td>
<td>first + second</td>
<td>73.328%</td>
<td>73.125%</td>
<td><strong>66.875%</strong></td>
<td>78.656%</td>
</tr>
</tbody>
</table>
<p>Here’s a plot describing the result distributions:
<img src="/assets/faces_framing/result_distributions.png" alt="Result distributions" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>In this instance, framing the task in an alternative, non-straightforward fashion resulted in better model performance.</p>
<p>Bear in mind that this experiment was done on a toy dataset and problem, and the results aren’t necessarily applicable to every problem. However, it highlighted for me the potential in trying out different framings, and going forward I will try to be mindful of alternative framings when I work on supervised tasks.</p>
<p>The source code for this post can be found <a href="https://github.com/andersource/face-classification-problem-framing">here</a>. Not as tidy as I would like, but I think it’s clear enough.</p>
<h1 id="the-case-for-better-than-random-splits">The case for better-than-random splits</h1>
<p><em>2020-04-15</em></p>
<h4 id="tldr-random-splits-are-common-but-maybe-not-balanced-enough-for-some-use-cases-i-made-a-python-library-for-balanced-splitting">tl;dr: Random splits are common, but maybe not balanced enough for some use cases. I made a <a href="https://pypi.org/project/balanced-splits/">python library for balanced splitting</a>.</h4>
<p>Random numbers are cool, and useful for a lot of things. Among others, whenever you want to balance things in some manner,
random assignment is a good first choice: a load balancer that assigns tasks randomly to servers would fare quite well. This idea is so
simple and powerful that balance and randomness often get conflated, and we perceive the results of a random process as balanced.
And they are balanced - <em>on average</em>. Sometimes that’s good enough, and sometimes it’s not.</p>
<h2 id="when-random-isnt-balanced-enough">When random isn’t balanced enough</h2>
<p><a href="https://gamedevelopment.tutsplus.com/articles/solving-player-frustration-techniques-for-random-number-generation--cms-30428">This</a>
article, about random numbers in game design, provides a great example of a situation where an innocent random process leads
to undesired behavior. Using <code class="language-plaintext highlighter-rouge">random(0, 1) <= 0.1</code> to determine the outcome of a positive event
which should happen 10% of the time sounds about right - the player will need about 10 attempts, maybe a little more,
maybe a little less. The “little less” part is no problem, but if we zoom in on the “little more” we see that the distribution has a long tail -
about 12% of players will have to make more than 20 attempts, twice as many as we (presumably) intended. If the game is long and contains,
say, 100 such events, then about 40% of players will experience at least one instance where they will need as many as 50(!) attempts. Definitely not what
we want. So randomness has to be controlled.</p>
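<p>These figures are easy to check, both analytically and with a quick simulation (a sketch, not code from the article):</p>

```python
import random

random.seed(0)

def attempts_until_success(p=0.1):
    """Number of attempts until an event with probability p occurs."""
    n = 1
    while random.random() > p:
        n += 1
    return n

N = 100_000
samples = [attempts_until_success() for _ in range(N)]
# Fraction needing more than 20 attempts: 0.9**20 ≈ 0.122
print(sum(s > 20 for s in samples) / N)
# Chance that at least one of 100 such events needs more than
# 50 attempts: 1 - (1 - 0.9**50)**100 ≈ 0.40
print(1 - (1 - 0.9 ** 50) ** 100)
```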
<h3 id="splitting-students-to-study-groups">Splitting students to study groups</h3>
<p>Several years ago I was responsible for an intensive, several-month training course of about 100 students.
The students were divided into several groups, which became their primary environment within the training - lessons were held for each
group separately, and the instructors were fixed per group and got to know each student quite well. There was a general consensus
that the groups should be balanced, both in demographic composition and with respect to several different aptitude tests.</p>
<p>There was no established process for splitting the students into groups - some of my predecessors used random assignment, others
performed the split manually with an Excel sheet. The person in charge of the previous training complained that
the groups weren’t balanced: some contained a greater percentage of weaker students, creating excessive load on those groups’ instructors
and a higher dropout rate. They also said that, in hindsight, the imbalance was already visible in the groups’ aptitude test distributions.</p>
<p>Fearing that some random fluke would mess things up, I started with a random split and spent about 3 hours manually balancing the groups (the schedule was tight and I didn’t want to risk <a href="https://xkcd.com/1319/">getting lost here</a>), and (related or unrelated) things turned out fine. But it was very tedious, and frustrating enough that when I had the time I wrote a script to automate the task, performing a heuristic search for a split that minimizes the distribution differences between the groups.</p>
<h3 id="balanced-split-search">Balanced split search</h3>
<p>Here is an example of using (crude) <a href="https://en.wikipedia.org/wiki/Simulated_annealing">simulated annealing</a> to search for a split that is “balanced”:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">optimized_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">n_partitions</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">t_start</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">t_decay</span><span class="o">=</span><span class="p">.</span><span class="mi">99</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
<span class="n">score_threshold</span><span class="o">=</span><span class="p">.</span><span class="mi">99</span><span class="p">):</span>
<span class="s">"""Perform an optimized split of a dataset using simulated annealing"""</span>
<span class="n">var_types</span> <span class="o">=</span> <span class="p">[</span><span class="n">guess_var_type</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="n">i</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])]</span>
<span class="k">def</span> <span class="nf">_score</span><span class="p">(</span><span class="n">indices</span><span class="p">):</span>
<span class="n">partitions</span> <span class="o">=</span> <span class="p">[</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">indices</span><span class="p">]</span>
<span class="k">return</span> <span class="n">score</span><span class="p">(</span><span class="n">partitions</span><span class="p">,</span> <span class="n">var_types</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_neighbor</span><span class="p">(</span><span class="n">curr_indices</span><span class="p">):</span>
<span class="n">curr_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">curr_indices</span><span class="p">)</span>
<span class="n">part1</span><span class="p">,</span> <span class="n">part2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">curr_indices</span><span class="p">)),</span>
<span class="n">size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">part1_ind</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">curr_indices</span><span class="p">[</span><span class="n">part1</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">part2_ind</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">curr_indices</span><span class="p">[</span><span class="n">part2</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">temp</span> <span class="o">=</span> <span class="n">curr_indices</span><span class="p">[</span><span class="n">part1</span><span class="p">][</span><span class="n">part1_ind</span><span class="p">]</span>
<span class="n">curr_indices</span><span class="p">[</span><span class="n">part1</span><span class="p">][</span><span class="n">part1_ind</span><span class="p">]</span> <span class="o">=</span> <span class="n">curr_indices</span><span class="p">[</span><span class="n">part2</span><span class="p">][</span><span class="n">part2_ind</span><span class="p">]</span>
<span class="n">curr_indices</span><span class="p">[</span><span class="n">part2</span><span class="p">][</span><span class="n">part2_ind</span><span class="p">]</span> <span class="o">=</span> <span class="n">temp</span>
<span class="k">return</span> <span class="n">curr_indices</span>
<span class="k">def</span> <span class="nf">_T</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
<span class="k">return</span> <span class="n">t_start</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">power</span><span class="p">(</span><span class="n">t_decay</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_P</span><span class="p">(</span><span class="n">curr_score</span><span class="p">,</span> <span class="n">new_score</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
<span class="k">if</span> <span class="n">new_score</span> <span class="o">>=</span> <span class="n">curr_score</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">curr_score</span> <span class="o">-</span> <span class="n">new_score</span><span class="p">)</span> <span class="o">/</span> <span class="n">t</span><span class="p">)</span>
<span class="n">all_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">all_indices</span><span class="p">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array_split</span><span class="p">(</span><span class="n">all_indices</span><span class="p">,</span> <span class="n">n_partitions</span><span class="p">)</span>
<span class="n">best_score</span> <span class="o">=</span> <span class="n">_score</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
<span class="n">new_indices</span> <span class="o">=</span> <span class="n">_neighbor</span><span class="p">(</span><span class="n">indices</span><span class="p">)</span>
<span class="n">new_indices_score</span> <span class="o">=</span> <span class="n">_score</span><span class="p">(</span><span class="n">new_indices</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">new_indices_score</span> <span class="o">>=</span> <span class="n">best_score</span> <span class="ow">or</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><=</span> <span class="n">_P</span><span class="p">(</span><span class="n">best_score</span><span class="p">,</span> <span class="n">new_indices_score</span><span class="p">,</span> <span class="n">_T</span><span class="p">(</span><span class="n">i</span><span class="p">))):</span>
<span class="n">best_score</span> <span class="o">=</span> <span class="n">new_indices_score</span>
<span class="n">indices</span> <span class="o">=</span> <span class="n">new_indices</span>
<span class="k">if</span> <span class="n">best_score</span> <span class="o">>=</span> <span class="n">score_threshold</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">return</span> <span class="p">[</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">indices</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">guess_var_type</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="s">"""Use heuristics to guess at a variable's statistical type"""</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">==</span> <span class="nb">list</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">if</span> <span class="n">x</span><span class="p">.</span><span class="n">dtype</span> <span class="o">==</span> <span class="s">'O'</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">np</span><span class="p">.</span><span class="n">issubdtype</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">number</span><span class="p">):</span>
<span class="k">return</span> <span class="n">VarType</span><span class="p">.</span><span class="n">CATEGORICAL</span>
<span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o"><=</span> <span class="p">.</span><span class="mi">2</span><span class="p">:</span>
<span class="k">return</span> <span class="n">VarType</span><span class="p">.</span><span class="n">CATEGORICAL</span>
<span class="k">return</span> <span class="n">VarType</span><span class="p">.</span><span class="n">CONTINUOUS</span>
<span class="k">def</span> <span class="nf">score</span><span class="p">(</span><span class="n">partitions</span><span class="p">,</span> <span class="n">var_types</span><span class="p">):</span>
<span class="s">"""Score the balance of a particular split of a dataset"""</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">([</span>
<span class="n">score_var</span><span class="p">([</span><span class="n">_get_accessor</span><span class="p">(</span><span class="n">partition</span><span class="p">)[:,</span> <span class="n">i</span><span class="p">]</span>
<span class="k">for</span> <span class="n">partition</span> <span class="ow">in</span> <span class="n">partitions</span><span class="p">],</span> <span class="n">var_types</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">var_types</span><span class="p">))</span>
<span class="p">])</span>
<span class="k">def</span> <span class="nf">score_var</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">,</span> <span class="n">var_type</span><span class="p">):</span>
<span class="s">"""Score the balance of a single variable in a certain split of a dataset"""</span>
<span class="k">if</span> <span class="n">var_type</span> <span class="o">==</span> <span class="n">VarType</span><span class="p">.</span><span class="n">CATEGORICAL</span><span class="p">:</span>
<span class="n">unique_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">))</span>
<span class="n">value_counts</span> <span class="o">=</span> <span class="n">count_values</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">,</span> <span class="n">unique_values</span><span class="p">)</span>
<span class="k">return</span> <span class="n">chi2_contingency</span><span class="p">(</span><span class="n">value_counts</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">pvalues</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">)):</span>
<span class="n">other_partitions</span> <span class="o">=</span> <span class="p">[</span><span class="n">var_partitions</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">))</span> <span class="k">if</span> <span class="n">j</span> <span class="o">!=</span> <span class="n">i</span><span class="p">]</span>
<span class="n">pvalues</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">ks_2samp</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">[</span><span class="n">i</span><span class="p">],</span>
<span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">other_partitions</span><span class="p">))[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">pvalues</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">count_values</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">,</span> <span class="n">unique_values</span><span class="p">):</span>
<span class="s">"""Count the number of appearances of each unique value in each list"""</span>
<span class="n">value2index</span> <span class="o">=</span> <span class="p">{</span><span class="n">v</span><span class="p">:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_values</span><span class="p">)).</span><span class="n">items</span><span class="p">()}</span>
<span class="n">counts</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">unique_values</span><span class="p">)))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">var_partitions</span><span class="p">)):</span>
<span class="k">for</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">var_partitions</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
<span class="n">counts</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">value2index</span><span class="p">[</span><span class="n">value</span><span class="p">]]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">counts</span></code></pre></figure>
<p>To summarize:</p>
<ul>
<li>The search process starts with an initial random split, and generates neighbors (similar splits with a pair of indices swapped).</li>
<li>Solutions are scored based on the minimum p-value of the difference between each variable’s distribution among the groups, using the <a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov-Smirnov test</a> for continuous variables and the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a> for categorical variables (the variable types are determined using simple heuristics).</li>
<li>Each neighbor is compared to the current solution; if it’s better it is immediately accepted and set as the current best solution. Otherwise it is accepted with a probability that depends on the difference in score and the current iteration, using the temperature mechanism of simulated annealing.</li>
<li>This continues for a fixed number of iterations or until we have a good enough split.</li>
</ul>
<h3 id="comparing-the-optimized-split-to-a-random-split">Comparing the optimized split to a random split</h3>
<p>Here are 3 runs of a random dataset generation, and comparison of the optimized split with a random split:
<img src="/assets/random-vs-balanced-splits/random_vs_balanced1.png" alt="Random vs Balanced split 1" />
<img src="/assets/random-vs-balanced-splits/random_vs_balanced2.png" alt="Random vs Balanced split 2" />
<img src="/assets/random-vs-balanced-splits/random_vs_balanced3.png" alt="Random vs Balanced split 3" /></p>
<p>We see that the optimized splits are indeed quite balanced, and visibly more balanced than the random splits. Regarding the random splits - they
are pretty OK, in these instances. If I ran this example a thousand more times, I would definitely get instances with much greater imbalance in the random split. Whether or not this is a problem entirely depends on context. At any rate, the optimized split should be much more consistent.</p>
<h2 id="implication-for-experiment-design">Implication for experiment design</h2>
<p><a href="https://en.wikipedia.org/wiki/Randomized_controlled_trial">Randomized controlled trials</a> are a type of experiment which relies on random splitting to reduce bias. For any single trial it is unlikely that a random split will create an imbalance in exactly the “right” aspect and direction to significantly change the conclusions. But it’s certainly <em>possible</em>, and in aggregate, over thousands of trials, it’s much more likely to happen sometimes.</p>
<h3 id="meta-experiment-simulation">Meta-experiment simulation</h3>
<p>To get a feel for whether and how much splitting strategy could affect the conclusions of randomized trials, I ran a meta-experiment simulation where each experiment had the following set-up:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sample size ~ uniform(50, 200)
n_features ~ uniform(3, 7)
target variable (measured at end of trial) ~ normal(0, 1)
intervention effect size on target variable:
50%: 0
50%: ~ normal(1, .5)
each feature's effect size on target variable:
80%: 0
10%: ~ normal(1, .5)
10%: ~ normal(-1, .5)
generate random dataset, features ~ normal(0, 1)
split dataset to control and intervention based on splitting strategy
resolve for each subject final target variable (base + intervention + features)
accept or reject the null-hypothesis
</code></pre></div></div>
<p>The null hypothesis (that the treatment is ineffective) is rejected if the p-value of a <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test</a> on the target value is less than or equal to 5%.</p>
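<p>A single simulated experiment under this set-up might look like the following sketch (my own illustrative code using a random split - the post doesn’t show the original simulation, and the feature-effect mixture is approximated with a random sign flip):</p>

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def run_experiment():
    """One simulated trial: returns (null rejected?, truly effective?)."""
    n = rng.integers(50, 201)                 # sample size ~ uniform(50, 200)
    n_features = rng.integers(3, 8)           # n_features ~ uniform(3, 7)
    base = rng.normal(0, 1, size=n)           # target variable ~ normal(0, 1)
    # intervention effect: 50% zero, 50% ~ normal(1, .5)
    effect = 0.0 if rng.random() < 0.5 else rng.normal(1, 0.5)
    # feature effects: 80% zero, otherwise ~ normal(+-1, .5)
    feature_effects = np.where(
        rng.random(n_features) < 0.8, 0.0,
        rng.normal(1, 0.5, size=n_features) * rng.choice([1, -1], size=n_features))
    X = rng.normal(0, 1, size=(n, n_features))
    # random split into control / intervention
    idx = rng.permutation(n)
    control, treat = idx[: n // 2], idx[n // 2:]
    y = base + X @ feature_effects            # base + feature contributions
    y[treat] += effect                        # + intervention effect
    p = ttest_ind(y[treat], y[control]).pvalue
    return p <= 0.05, effect != 0

rejected, truly_effective = run_experiment()
```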
<p>For each splitting strategy (random or optimized) I ran 10000 experiment simulations, counting occurrences of false positives and false negatives.
A false positive is when the null hypothesis was rejected although the intervention effect was 0; a false negative is when the null hypothesis was accepted although the intervention effect was nonzero.</p>
<h3 id="results">Results</h3>
<p>Using a random split, 1172 experiments (out of 10k) arrived at the “wrong” conclusion - 113 false positives and 1059 false negatives.
Using the optimized split, 1088 experiments arrived at the wrong conclusion, with 63 false positives and 1025 false negatives.
We see a significant reduction (almost 50%) in the false positive rate, which confirms that splitting strategy could affect an experiment’s results. Remember that this is a toy simulation and the numbers can depend a lot on the specific experiment set-up simulation - the key takeaway is that splitting strategy can affect the conclusions <em>at all</em>.</p>
<h2 id="the-bottom-line">The bottom line</h2>
<p>This could easily seem like a minor point - most of the time, random splits are perfectly good. But the ongoing <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis</a>, which involves many fields in which small-n experiments are quite common, is pushing us to double-check many assumptions and currently-held best practices. Random splits are very common, and performing them in a more balanced fashion doesn’t require much effort. As the crisis probably stems from many different factors, I think it’s a good idea to start adopting various practices aimed at making experiments more robust, and balanced splits seem to be a good candidate.</p>
<h2 id="balanced-splits-python-library">balanced-splits python library</h2>
<p>To help facilitate balanced splitting, I created a python library - <a href="https://pypi.org/project/balanced-splits/"><code class="language-plaintext highlighter-rouge">balanced-splits</code></a> (<a href="https://github.com/andersource/balanced-splits">github</a>) which does just that:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">balanced_splits.split</span> <span class="kn">import</span> <span class="n">optimized_split</span>
<span class="n">sample_size</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="s">'age'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">7.</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">),</span>
<span class="s">'skill'</span><span class="p">:</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">power</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">),</span>
<span class="s">'type'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="s">'T1'</span><span class="p">,</span> <span class="s">'T2'</span><span class="p">,</span> <span class="s">'T3'</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_size</span><span class="p">)</span>
<span class="p">})</span>
<span class="n">A</span><span class="p">,</span> <span class="n">B</span> <span class="o">=</span> <span class="n">optimized_split</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Partition 1</span><span class="se">\n</span><span class="s">===========</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="s">'type'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n\n</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Partition 2</span><span class="se">\n</span><span class="s">===========</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">B</span><span class="p">[</span><span class="s">'type'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">())</span></code></pre></figure>
<p>If you have any questions regarding its use or suggestions for improvement, <a href="mailto:hi@andersource.dev">feel free to contact me</a>.</p>
<p>Happy splitting!</p>
tl;dr: Random splits are common, but maybe not balanced enough for some use cases. I made a python library for balanced splitting.
A random night sky2020-01-19T07:00:00+00:002020-01-19T07:00:00+00:00andersource.github.io/2020/01/19/a-random-night-sky
<link rel="stylesheet" type="text/css" href="/assets/night-sky/index.css" />
<div id="night-container">
<canvas height="100%" width="100%"></canvas>
<button id="repaint">REPAINT</button>
<button id="fullscreen">FULL SCREEN</button>
</div>
<script src="/assets/night-sky/index.js"></script>
F-score Deep Dive2019-09-30T09:00:00+00:002019-09-30T09:00:00+00:00andersource.github.io/2019/09/30/f-score-deep-dive
<p>Recently at work we had a project where we used genetic algorithms to evolve a model for a classification task. Our key metrics were <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision and recall</a>, with precision being somewhat more important than recall (we didn’t know exactly how much more important at the start). At first we considered using multi-objective optimization to find the <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto front</a> and then choose the desired trade-off, but it proved impractical due to performance issues. So we had to define a single metric to optimize. <br />
Since we were using derivative-free optimization we could use any scoring function we wanted, so the <a href="https://en.wikipedia.org/wiki/F1_score">F-score</a> was a natural candidate.
It ended up working quite well, but there were some tricky parts along the way.</p>
<h2 id="general-background">General background</h2>
<p>Accuracy (% correct predictions) is a classical metric for measuring the quality of a classifier. But it’s problematic for many classification tasks, most prominently when the classes
aren’t balanced or when we want to penalize false positives and false negatives differently.<br />
Precision and recall separate the model quality measurement into two metrics, focusing on false positives and false negatives, respectively. But then comparing models becomes less trivial -
is 80% precision, 60% recall better or worse than 99% precision, 40% recall?<br />
Taking the average is a possibility; let’s see how it does:</p>
<p><img src="/assets/f-score/mean.png" alt="Averaging precision and recall" /></p>
<p>So if we have a model with 0% precision and 100% recall, the average is a score of 50%. Such a model is completely trivial from a prediction point of view (always predict positive),
so ideally it should have a score of 0%. More generally, we see that the average exhibits a linear tradeoff policy: you can stay on the same score by simultaneously increasing one metric and decreasing the other by the same amount. When the metrics are close this could make sense, but when there’s a big difference it starts to deviate from intuition.</p>
<h2 id="f-score-to-the-rescue">F-score to the rescue</h2>
<p>The F<sub>1</sub>-score is defined as the <a href="https://en.wikipedia.org/wiki/Harmonic_mean">harmonic mean</a> of precision and recall:</p>
<script type="math/tex; mode=display">F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}}</script>
<p>Let’s visualize it:</p>
<p><img src="/assets/f-score/f1.png" alt="F<sub>1</sub> score visualization" /></p>
<p>This seems much more appropriate for our needs: when there’s a relatively small difference between precision and recall (e.g. along the <code class="language-plaintext highlighter-rouge">y = x</code> line), the score behaves like the average.
But as the difference gets bigger, the score gets more and more dominated by the weaker metric, and further improvement on the already strong metric doesn’t improve it much.<br />
So this is a step in the right direction. But now how do we adjust it to prefer some desired tradeoff between precision and recall?</p>
<h3 id="some-history-and-the-beta-parameter">Some history and the beta parameter</h3>
<p>As far as I understand, the F-score originated in the book <a href="http://www.dcs.gla.ac.uk/Keith/Preface.html">Information Retrieval by C. J. van Rijsbergen</a>, and was popularized at a <a href="https://en.wikipedia.org/wiki/Message_Understanding_Conference">Message Understanding Conference</a> in 1992. More details on the derivation can be found <a href="https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf">here</a>. The full derivation of the measure includes a parameter, beta, to control exactly what we’re looking for - how much we prefer one of the metrics over the other. This is also what the ‘1’ in F<sub>1</sub> stands for - no preference for either (a value between <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">1</code> indicates a preference towards precision, and a value larger than <code class="language-plaintext highlighter-rouge">1</code> indicates a preference towards recall). Here is the full definition:</p>
<script type="math/tex; mode=display">F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}</script>
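<p>The definition translates directly into a few lines of Python (a standalone sketch; scikit-learn’s <code class="language-plaintext highlighter-rouge">fbeta_score</code> computes the same quantity from label arrays):</p>

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.0, 1.0))       # 0.0 -- the trivial always-positive model scores 0
print(f_beta(0.8, 0.6))       # ≈ 0.686 (F1)
print(f_beta(0.8, 0.6, 0.5))  # 0.75 (F0.5 leans on precision)
```

<p>Note that the trivial 0%-precision, 100%-recall model now gets the score of 0 we wanted, with no special-casing needed.</p>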
<h3 id="visualizing-the-f-score">Visualizing the F-score</h3>
<p>First, to develop some intuition regarding the effect of beta on the score, here’s an interactive plot to visualize the F-score for different values of beta. Play with the “bands” parameter to explore how different betas create different areas of (relative) equivalence in score.</p>
<html>
<head>
<title>F-score exploration</title>
<style>
canvas { margin: 0 auto; }
#main { margin: 0 auto; text-align: center;}
input[type=range] { margin: 0 auto; }
</style>
</head>
<body>
<div id="main" style="font-family: monospace; font-size: 0.8em;">
<canvas></canvas><br />
Beta: 0.01 <input type="range" id="beta" min="-2" max="2" value="0" step=".1" oninput="on_input_change(this)" /> 100 <span id="beta_value"></span> <br />
Bands: 5 <input type="range" id="bands" min="5" max="100" value="15" step="5" oninput="on_input_change(this)" /> 100 <span id="bands_value"></span>
</div>
<script src="/assets/f-score/index.js"></script>
</body>
</html>
<h3 id="choosing-a-beta">Choosing a beta</h3>
<p>According to the derivation, a choice of beta equal to the desired ratio between recall and precision should be optimal. In this case, if I understood the math correctly, optimality is defined as follows: take the F-score function for some beta, which is simply a function of two variables. Find its partial derivatives with respect to recall and precision. Now find a place where those partial derivatives are equal, that is, a point on the precision-recall plane where a change in one metric is equivalent to (will lead to the same change as) a change in the other metric. The F-score function is structured in such a way that when <code class="language-plaintext highlighter-rouge">beta = recall / precision</code>, this point of equivalence lies on the straight line passing through the origin with a slope of <code class="language-plaintext highlighter-rouge">recall / precision</code>. In other words, when the ratio between recall and precision is equal to the desired ratio, a change in one metric will have the same effect as an equal change in the other. I sort of get the intuition behind this definition, but I’m not convinced it captures the notion of optimality that anyone using the F-score would find useful.</p>
<h3 id="taking-a-closer-look">Taking a closer look</h3>
<p>When trying to set <code class="language-plaintext highlighter-rouge">beta = desired ratio</code>, the results seemed a little off from what I would expect, and I wanted to make sure the value we’d chosen for beta really was optimal for our use case. I went out on a limb here, and the next part is rather hand-wavy, so I’m not convinced this was the right approach. But here it is anyway.<br />
Imagine the optimizer: crunching numbers, navigating a vast, multidimensional space of classifiers. The navigation is guided by a short-sighted mechanism of offspring and mutations, with each individual classifier being mapped to the 2d plane of precision and recall, and from there to the 1d axis of the F-score. Better classifiers propagate to future generations, slowly moving the optimizer to better sections of the solution space.<br />
Now imagine this navigation on the precision-recall plane. The outcome is governed by two main factors: the topology of the solution space (how hard it is to achieve a certain combination of precision and recall) and the gradients of the F-score (how “good” it is to achieve a certain combination of precision and recall). We can imagine the solution topology as an uneven terrain on which balls (solutions) are rolling and the F-score as a slight wind pushing the balls in desired directions. We would then like the wind to always push in the direction bringing solutions to our desired ratio.
Let’s try to investigate the F-score under this imaginative and wildly unrigorous intuition: we have no idea what the solution topology looks like (though if we did multi-objective optimization we could get a rough sketch, e.g. by looking at the Pareto front at each generation), so we’ll focus on the direction of the F-score “wind”. To do that we’ll need to find the partial derivatives of the F-score w.r.t. precision and recall:</p>
<script type="math/tex; mode=display">\frac{\partial F}{\partial r} = (1 + \beta^2) \cdot \frac{p(\beta^2 p + r) - pr \cdot (1)}{(\beta^2 p + r)^2} =
(1 + \beta^2)\cdot \frac{\beta^2 p^2 + p r - p r}{(\beta^2 p + r)^2} =
\frac{(1 + \beta^2)}{(\beta^2 p + r)^2} \cdot \beta^2p^2</script>
<script type="math/tex; mode=display">\frac{\partial F}{\partial p} = (1 + \beta^2) \cdot \frac{r(\beta^2p + r) - pr \cdot (\beta^2)}{(\beta^2 p + r)^2} =
(1 + \beta^2) \cdot \frac{\beta^2pr + r^2 - \beta^2pr}{(\beta^2 p + r)^2} =
\frac{(1 + \beta^2)}{(\beta^2 p + r)^2} \cdot r^2</script>
<p>We got very similar-looking partial derivatives; let’s take a look at the “slope” along which the score pushes at any given point:</p>
<script type="math/tex; mode=display">\frac{^{\partial F}/_{\partial r}}{^{\partial F}/_{\partial p}} = \frac{\beta^2p^2}{r^2} = (\beta \cdot \frac{p}{r})^2</script>
<p>Interesting: the direction in which the score pushes is <em>constant</em> along straight lines from the origin (though the direction itself usually isn’t along the line).
And there is one such line where we <em>would</em> like the direction to point along the line itself: the line where <code class="language-plaintext highlighter-rouge">r / p = R</code>, our desired ratio. On that line the slope should be equal to <code class="language-plaintext highlighter-rouge">R</code> as well, so we get:</p>
<script type="math/tex; mode=display">R = \frac{\beta^2}{R^2} \\
\beta^2 = R^3 \\
\beta = \sqrt{R^3}</script>
<p>So we have a different definition of optimality which yields a different ideal value for beta.</p>
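<p>Both the gradient-ratio formula and this choice of beta can be sanity-checked numerically (a sketch using central finite differences; the value of <code class="language-plaintext highlighter-rouge">R</code> and the sample point are arbitrary):</p>

```python
def f_beta(p, r, beta):
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

def gradient_slope(p, r, beta, h=1e-6):
    """dF/dr divided by dF/dp, estimated with central differences."""
    dF_dr = (f_beta(p, r + h, beta) - f_beta(p, r - h, beta)) / (2 * h)
    dF_dp = (f_beta(p + h, r, beta) - f_beta(p - h, r, beta)) / (2 * h)
    return dF_dr / dF_dp

R = 2.0           # desired recall : precision ratio
beta = R ** 1.5   # beta = sqrt(R^3) from the derivation above
p = 0.4
r = R * p         # a point on the desired-ratio line

print(gradient_slope(p, r, beta))  # ≈ 2.0, matching R
print((beta * p / r) ** 2)         # the analytic formula agrees
```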
<h2 id="conclusion">Conclusion</h2>
<p>I’m not sure how important this deep plunge into the maths of the F-score is for cases where you don’t have an unusual desired tradeoff between precision and recall, or when you’re just using the F-score to measure a classifier that’s trained with a different loss function. Usually you’re probably safe going with F<sub>1</sub>, F<sub>0.5</sub> or F<sub>2</sub>.<br />
But I certainly feel I have a better understanding of how and why the F-score works, and how to better adjust it for a given scenario.</p>Uncertainty Principle in software R&D2019-09-21T10:00:00+00:002019-09-21T10:00:00+00:00andersource.github.io/2019/09/21/rnd-uncertainty-principle<p><a href="https://en.wikipedia.org/wiki/Uncertainty_principle">Heisenberg’s Uncertainty Principle</a> is an important result in physics, expressing a limit regarding the measurement of certain pairs of particles’ physical properties. In essence, it states that the uncertainty of any measurement of these pairs of properties at the same time has a lower bound. For example, if we’re measuring a particle’s position and velocity, and want to be more certain about the particle’s <em>position</em> (measure the position more precisely),
at some point we would inevitably start becoming less certain about the particle’s <em>velocity</em>, regardless of the measurement tools we use. This limitation doesn’t come from any technical
properties of how we measure those properties. Rather, it points to a loss of mathematical meaning as the measurements get “too precise”.</p>
<p>I believe a similar phenomenon exists in the world of research and development. It seems trivial, but too many times I’ve seen it forgotten (or ignored) when it was inconvenient.</p>
<p>Pick a random project management book or article, and you’ll probably see projects depicted as triangles representing the projects’ constraints in some form. Two of the primary constraints
would be equivalents of <em>time</em> and <em>result</em>: we know what we want, and we know when we want it. In practice we are usually not overly concerned with calculating confidence intervals
for those variables.</p>
<p>But the more <em>novel</em> a project (or subtask) is, the more inherent uncertainty it has. This means that if we’re trying to take on something that no-one in-house has experience with
(and we’re not consulting someone with experience), the error bars on <em>both</em> time and result should be quite large. And if we’re tackling something entirely new (as far as we can tell
from preliminary research), it’s almost meaningless to assign an expected value to both the project’s duration and the result. This is important because after a certain threshold, a change of scope is warranted: as a manager, at some point you stop framing the project as “I want X by Y”, and start framing it as one of either:</p>
<ul>
<li>“I want X and I don’t care how long it takes.”</li>
<li>“I’m willing to give this project until Y, no matter the results.”</li>
</ul>
<p>Of course both of these framings are problematic from the business perspective. But the way I see it, assigning too-small error bars just to make a project’s premise feasible
business-wise is a risky endeavor at best.</p>
<p>Note that even when a project is not very novel, <a href="https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html">we are not great at making practical estimates</a>. Even when we would expect uncertainty to be controlled it comes back to bite us - all the more reason to be extra-careful of underestimating it.</p>Using a mobile device as a rotation controller2019-09-17T09:00:00+00:002019-09-17T09:00:00+00:00andersource.github.io/2019/09/17/device-as-rotation-controller<h2 id="demo">Demo</h2>
<p>Use a QR code scanner with a mobile device to scan this code, and start moving the Earth!</p>
<div id="demo_body" style="text-align: center;">
<div id="qrcode"></div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/108/three.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/qrcode-generator/1.4.3/qrcode.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/peerjs/1.0.4/peerjs.min.js"></script>
<script src="/assets/device-rotation-controller/globe.js"></script>
</div>
<h2 id="putting-it-together">Putting it together</h2>
<p>This actually was a classic case of stacking existing components like lego.</p>
<ul>
<li>The <a href="https://www.w3.org/TR/orientation-event/">DeviceOrientation</a> event is part of the W3C standards, and while it’s still an experimental feature, many browsers already support it.
The documentation even helps you out converting Euler angles (the event’s representation of the device orientation) to quaternions, which are generally useful when dealing with rotations and orientations.</li>
<li><a href="https://threejs.org">three.js</a> is a powerful 3D javascript library; the globe was adapted from <a href="https://threejs.org/examples/software_geometry_earth.html">this example</a>.</li>
<li><a href="https://peerjs.com">PeerJS</a> is a javascript p2p library wrapping WebRTC with a very easy-to-use API, and they even provide a default, free broker server for the initial connection.</li>
<li>I used <a href="https://github.com/kazuhikoarase/qrcode-generator#readme">qrcode-generator</a> to generate the QR code.</li>
</ul>
<h3 id="code">Code</h3>
<p>The code is available on Github:
<a href="https://github.com/andersource/andersource.github.io/blob/master/_includes/device-rotation-controller/globe.html">globe.html</a>,
<a href="https://github.com/andersource/andersource.github.io/blob/master/assets/device-rotation-controller/globe.js">globe.js</a>,
<a href="https://github.com/andersource/andersource.github.io/blob/master/static/rotation-controller-client.html">client.html</a>,
<a href="https://github.com/andersource/andersource.github.io/blob/master/assets/device-rotation-controller/client.js">client.js</a>.</p>Sampling arbitrary probability distributions2019-09-01T21:00:00+00:002019-09-01T21:00:00+00:00andersource.github.io/2019/09/01/sampling-arbitrary-distributions<p>The universe we live in is, to the best of our (current) computational capabilities, wildly non-deterministic.
Until the advent of computers, any desire for determinism had to be sated with the imagination, by defining and manipulating mathematical objects.
Then came along machines that enabled us to specify processes that would carry on with unprecedented determinism, and we <em>loved</em> it.
But even in those machines we couldn’t do without a sprinkle of non-determinism, so we added <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudorandom number generators</a>
and <a href="https://www.random.org">true random number generators</a> (and also <a href="https://xkcd.com/221/">this</a>).</p>
<h3 id="manipulating-randomness">Manipulating randomness</h3>
<p>While most programming languages provide primitives for sampling from random distributions, sampling from your distribution of choice might require some work.</p>
<p>For example, C has <code class="language-plaintext highlighter-rouge">rand()</code> which generates an integer between 0 and <code class="language-plaintext highlighter-rouge">RAND_MAX</code>. To generate an integer in the inclusive range <code class="language-plaintext highlighter-rouge">[min, max]</code> we use
<code class="language-plaintext highlighter-rouge">rand() % (max - min + 1) + min</code>. This is a trivial example, but it wasn’t trivial to me when I first learned it, and the fascination with transforming random numbers has stuck.</p>
<p>Many languages and libraries provide functions for sampling non-uniform distributions, such as the normal distribution. These functions all rely on a source of uniform random numbers,
and use some method to convert the uniform distribution to the desired distribution. One of the most general methods to convert uniformly-generated numbers in the range <code class="language-plaintext highlighter-rouge">[0, 1]</code>
to any probability distribution (both discrete and continuous) is <a href="https://en.wikipedia.org/wiki/Inverse_transform_sampling">inverse transform sampling</a>.
We’ll get to how it works right after the fun part.</p>
<h3 id="the-fun-part">The fun part</h3>
<p>This is actually the reason for the post. Here you can draw whatever discrete probability distribution you like, and sample from it!
Just draw like in a paint program (the dynamic is a bit different because we’re drawing a function). You can choose from several initial distributions.
(This part is best viewed on desktop).</p>
<div>
<style>
canvas {
margin: 10px;
}
button {
border: none;
padding: 10px;
margin: 2px 10px;
background-color: #88A5F0;
color: white;
}
</style>
<div style="text-align: center;">
<canvas id="draw_distribution" width="600px" height="300px"></canvas>
<div>
<button onclick="setDistribution(UNIFORM);">Uniform</button>
<button onclick="setDistribution(NORMAL);">Normal</button>
<button onclick="setDistribution(SKYLINE);">Skyline</button>
</div>
<div>
<button onclick="single_sample()">Single sample</button>
<p id="single_sample_result" style="display: inline-block;">0</p>
</div>
<div>
<span>Sample size: 1000</span>
<input type="range" id="sample_n" min="0" max="3.5" value="1.5" step="0.01" />
<span>~3M</span>
<button onclick="multi_sample()">Multi sample</button>
</div>
<canvas id="multi_sample_result" width="600px" height="300px"></canvas>
</div>
<script src="/assets/arbitrary-distribution-sampler/sampler.js"></script>
</div>
<h3 id="inverse-transform-sampling">Inverse transform sampling</h3>
<p>Let’s develop the idea behind this sampling technique.</p>
<p>First, suppose you want to randomly select one out of four objects, <code class="language-plaintext highlighter-rouge">A, B, C, D</code>, uniformly. Easy: just sample a uniform random number in the range <code class="language-plaintext highlighter-rouge">[0, 1]</code>.
If it’s between 0 and 0.25, select <code class="language-plaintext highlighter-rouge">A</code>; if it’s between 0.25 and 0.5, select <code class="language-plaintext highlighter-rouge">B</code>; etc.</p>
<p>Now suppose we have different probabilities for each object, for example <code class="language-plaintext highlighter-rouge">A: 0.7, B: 0.2, C: 0.08, D: 0.02</code>. Again we can use a uniform random number; if it’s between
0 and 0.7, select <code class="language-plaintext highlighter-rouge">A</code>; if it’s between 0.7 and 0.9, select <code class="language-plaintext highlighter-rouge">B</code>; if it’s between 0.9 and 0.98, select <code class="language-plaintext highlighter-rouge">C</code>; otherwise select <code class="language-plaintext highlighter-rouge">D</code>.</p>
<p>Notice how the test boundaries correspond to cumulative sum elements of the probability distribution? This cumulative sum series is called a CDF - cumulative distribution function.
Its value at a certain point, <em>x</em>, represents the probability that a random sample from that distribution will be less than or equal to <em>x</em>.</p>
<p><em>Inverse sampling</em> the CDF means asking, for a given probability <em>y</em>, at what <em>x</em> does the CDF have a value of <em>y</em>?</p>
<h4 id="example">Example</h4>
<p>We have this probability distribution:
<img src="/assets/arbitrary-distribution-sampler/pdf.jpeg" alt="Some probability distribution" /></p>
<p>Then its CDF would be:
<img src="/assets/arbitrary-distribution-sampler/cdf1.jpeg" alt="Above distribution's CDF" /></p>
<p>To sample a random number from this distribution, we randomly place a horizontal line, and take the <em>x</em> value where it intersects the CDF:
<img src="/assets/arbitrary-distribution-sampler/cdf2.jpeg" alt="Inverse sampling the CDF" /></p>
<p>Finding the corresponding <em>x</em> for a sampled probability can be done relatively efficiently (<em>O(log n)</em>) with a binary search, as the CDF is a non-decreasing series.</p>
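<p>A minimal implementation of the whole procedure might look like this (a sketch; outcome indices 0-3 stand in for the <code class="language-plaintext highlighter-rouge">A, B, C, D</code> example above):</p>

```python
import bisect
import random

def make_sampler(weights):
    """Build a discrete sampler via inverse transform sampling.

    weights: relative (unnormalized) probabilities for outcomes 0..n-1.
    """
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)

    def sample():
        # Binary search for the first CDF entry >= u  (O(log n))
        u = random.random()
        return bisect.bisect_left(cdf, u)

    return sample

sample = make_sampler([0.7, 0.2, 0.08, 0.02])  # A, B, C, D from the text
counts = [0] * 4
for _ in range(100_000):
    counts[sample()] += 1
print([c / 100_000 for c in counts])  # ≈ [0.7, 0.2, 0.08, 0.02]
```

<p>The same idea extends to continuous distributions by inverting the continuous CDF instead of a cumulative sum.</p>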
<h3 id="effect-size-and-sample-size">Effect size and sample size</h3>
<p>Choose the “skyline” distribution, and play with the sample size a bit. Try to find, for each skyline feature, the minimum sample size required to distinguish that feature.
We see that the smaller the feature is, the larger the sample size required to distinguish that feature.</p>
<p>To me this really illustrates the necessity for large sample sizes when measuring weak effects: when the sample size is too small,
the noise is about as large as (or larger than) the effect.</p>
<p>Code for the interactive part of this post can be found <a href="https://github.com/andersource/arbitrary-distribution-sampler">here</a>.</p>Fun with Matrix Exponentiation2019-08-25T14:21:33+00:002019-08-25T14:21:33+00:00andersource.github.io/2019/08/25/fun-with-matrix-exponentiation<p>Well, <em>fun</em> might be a bit of a stretch, but I’ll let you decide for yourself.</p>
<p>Linear algebra has always been an integral part of computer science in many fields, including simulation, computer graphics, image processing, cryptography,
machine learning, and many more. As a result most modern computing platforms contain efficient matrix operation libraries, and a lot of hardware exists to make these operations even faster.
These platforms are often very accessible and easy to integrate in most development environments.</p>
<p>This means that whenever a problem can be framed in terms of linear algebra, the solution’s performance will usually be better than that of a naive implementation, especially
in interpreted environments with specialized linear algebra libraries, such as Python with NumPy, which automatically delegates to optimized native routines when they are available.</p>
<p>Of course the fact that a problem <em>can</em> be framed in terms of linear algebra doesn’t mean it <em>should</em> be: there is a development overhead for implementing the solution
in linear algebra terms, and of course maintaining the solution would require additional knowledge not all maintainers necessarily have. This is a classic pitfall for premature optimization.
But sometimes an algorithm’s bottleneck is some computation which could be reduced to a set of matrix operations, making the entire algorithm run faster.</p>
<p>In this post we’ll examine two problems for which a linear algebra approach offers a great performance improvement: unit conversion and hierarchical aggregations. Specifically we’ll use the operation of <em>matrix exponentiation</em>:
raising a matrix to some power via repeated multiplication.</p>
<h3 id="matrices-and-graphs">Matrices and graphs</h3>
<p>One thing our two problems share in common is the fact that they both conceptually involve graph operations. Graphs can be represented very naturally as <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrices</a>, and it turns out that basic matrix operations, such as multiplication, translate to basic graph operations, such as a single iteration of <a href="https://en.wikipedia.org/wiki/Breadth-first_search">breadth-first search</a>. In this manner we can “translate” the algorithm from an explicit implementation to matrix operation terms.</p>
<h4 id="matrix-multiplication-as-a-bfs-iteration">Matrix multiplication as a BFS iteration</h4>
<p>Let’s look at this graph:</p>
<div>
<svg width="300px" height="500px">
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="0" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L0,6 L6,3 z" fill="#000" />
</marker>
</defs>
<line x1="265" y1="57" x2="211" y2="145" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="227" y="119" text-anchor="middle" stroke="black"></text>
<line x1="190" y1="181" x2="129" y2="282" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="149" y="249" text-anchor="middle" stroke="black"></text>
<line x1="109" y1="318" x2="55" y2="406" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="71" y="380" text-anchor="middle" stroke="black"></text>
<circle stoke="black" fill="#AAEEBB" r="35" cx="265" cy="57"></circle><text x="265" y="57" text-anchor="middle" stroke="black">A</text>
<circle stoke="black" fill="#AAEEBB" r="35" cx="190" cy="181"></circle><text x="190" y="181" text-anchor="middle" stroke="black">B</text>
<circle stoke="black" fill="#AAEEBB" r="35" cx="109" cy="318"></circle><text x="109" y="318" text-anchor="middle" stroke="black">C</text>
<circle stoke="black" fill="#AAEEBB" r="35" cx="34" cy="442"></circle><text x="34" y="442" text-anchor="middle" stroke="black">D</text>
</svg>
</div>
<p>An all-to-all BFS will find several nontrivial paths (consisting of more than one edge) - A to C, B to D, and A to D.</p>
<p>Now let’s look at the graph’s adjacency matrix: we have a row and a column for each node. Cell <code class="language-plaintext highlighter-rouge">(i, j)</code> (row <code class="language-plaintext highlighter-rouge">i</code>, column <code class="language-plaintext highlighter-rouge">j</code>) is <code class="language-plaintext highlighter-rouge">1</code> if there’s an edge from node <code class="language-plaintext highlighter-rouge">i</code> to node <code class="language-plaintext highlighter-rouge">j</code>, and <code class="language-plaintext highlighter-rouge">0</code> otherwise.
Since the graph is directed the matrix is not (necessarily) symmetric. Additionally we’ll put <code class="language-plaintext highlighter-rouge">1</code>’s in the main diagonal cells (for reasons which will become clear soon).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} %]]></script>
<p>The node names aren’t included in the matrix, but instead are implied by the index, assuming some consistent ordering of nodes, in this case <code class="language-plaintext highlighter-rouge">A, B, C, D</code>.</p>
<p>This matrix describes, through non-zero elements, all the trivial paths in the graph - A to B, B to C and C to D.</p>
<p>Let’s see what happens when we multiply the matrix by itself - i.e. raise it to a power of 2
(you might want to brush up on <a href="https://en.wikipedia.org/wiki/Matrix_multiplication">matrix multiplication</a>):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} \cdot
\begin{pmatrix}
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} =
\begin{pmatrix}
1 & 2 & 1 & 0 \\
0 & 1 & 2 & 1 \\
0 & 0 & 1 & 2 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} %]]></script>
<p>And we see two new non-zero cells, representing two paths of length 2: A to C and B to D.</p>
<p>Let’s multiply again by the original matrix, practically taking the 3rd power of the matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 2 & 1 & 0 \\
0 & 1 & 2 & 1 \\
0 & 0 & 1 & 2 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} \cdot
\begin{pmatrix}
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} =
\begin{pmatrix}
1 & 3 & 3 & 1 \\
0 & 1 & 3 & 3 \\
0 & 0 & 1 & 3 \\
0 & 0 & 0 & 1 \\
\end{pmatrix} %]]></script>
<p>And we got another non-zero element, representing the path (of length 3) A to D.</p>
<p>For any graph, continuing in this manner until a multiplication doesn’t turn any zero element to non-zero will indicate the exact connectivity of the
graph. That is, from a given node we can know all possible destination nodes for which there’s a path in the graph.</p>
<p>The reason we added the identity matrix to the adjacency matrix is that if we used the original adjacency matrix, each successive multiplication
would only reveal the “new” nodes, and we would need to maintain another matrix to represent the graph’s connectivity.</p>
<p>There are several additional considerations which I will touch only briefly but deserve attention:</p>
<ul>
<li>The “interesting” property of the elements (in this case) was whether or not they were zero. So, unless the result of the multiplication is actually interesting
(which might be the case), we can use a binary matrix to automatically “check” if an element is positive and accordingly place a <code class="language-plaintext highlighter-rouge">0</code> or a <code class="language-plaintext highlighter-rouge">1</code> in the result element.</li>
<li>If we’re interested in the length of the path to a certain node, we can examine at each iteration which elements changed from zero to non-zero.
The iteration at which an element changed is the length of the path that the element represents, as in our example.</li>
<li>Recovering the path itself is a little trickier but certainly possible. First we need to recall that the original BFS offers path recovery by maintaining a “previous”
mapping, noting for each node which node came before it in the path. This mapping is in the context of a single source node. In our all-to-all version the mapping for
any node is done in the context of every possible starting node. We can do this in the following way: whenever we recognize a new path (in the form of an element
turning from zero to non-zero), we multiply, element-wise, the two vectors that were multiplied (with a dot product) to produce said element. Any non-zero node in
the resulting vector can be used as the previous node in the context of the path.</li>
<li>Conversely, if we’re just interested in path existence, and not length or recovery, we can “take bigger steps”: instead of multiplying the original matrix by itself every iteration,
we can take higher powers. This could potentially make the calculation even faster for tools that optimize matrix exponentiation.</li>
</ul>
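<p>The basic procedure can be sketched with NumPy (a sketch reproducing the four-node chain from the example; <code class="language-plaintext highlighter-rouge">@</code> is matrix multiplication):</p>

```python
import numpy as np

# Adjacency matrix of the A -> B -> C -> D chain, plus the identity
# (the 1's on the main diagonal), as in the worked example above.
M = np.eye(4, dtype=int) + np.diag(np.ones(3, dtype=int), k=1)

reach = M.copy()
while True:
    # Binarize: we only care about path existence, not path counts
    nxt = (reach @ M > 0).astype(int)
    if np.array_equal(nxt, reach):
        break  # fixed point: no new paths discovered
    reach = nxt

print(reach)
# Row A is [1 1 1 1]: there is a path from A to every node.
```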
<p>Next we’ll examine two problems where matrix exponentiation, as a tool for all-to-all BFS, could be useful.</p>
<h3 id="unit-conversion">Unit conversion</h3>
<p>Suppose we are writing a program for dynamic unit conversion: it takes as initial input some known conversions between units, and allows a user to (try to) convert an amount from one unit
to another. The conversions supplied to the program don’t have to be complete, and some conversions might not be possible (e.g. seconds to meters). And of course, we don’t
want to explicitly state all legal conversions - if a user specifies a conversion from seconds to minutes and from minutes to hours, the program should be able to convert seconds to hours.
Note that these conversions aren’t entirely fixed; for example, in general there is no conversion from grams (mass) to ml (volume), but if we’re dealing with, say, water, then
<code class="language-plaintext highlighter-rouge">1ml water = 1g water</code>.</p>
<h4 id="graph-representation">Graph representation</h4>
<p>Say we are given these conversions:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">tbsp</span> <span class="o">-></span> <span class="mi">3</span> <span class="n">tsp</span>
<span class="n">cup</span> <span class="o">-></span> <span class="mi">16</span> <span class="n">tbsp</span>
<span class="n">kg</span> <span class="o">-></span> <span class="mi">1000</span> <span class="n">g</span></code></pre></figure>
<p>We can represent the units as nodes in a graph, and the given conversions as directed and weighted edges. Like this:</p>
<div>
<svg width="300px" height="500px">
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="0" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L0,6 L6,3 z" fill="#000" />
</marker>
</defs>
<line x1="148" y1="330" x2="233" y2="412" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="203" y="384" text-anchor="middle" stroke="black">3</text>
<line x1="36" y1="226" x2="121" y2="305" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="92" y="278" text-anchor="middle" stroke="black">16</text>
<line x1="29" y1="113" x2="123" y2="68" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="92" y="83" text-anchor="middle" stroke="black">1000</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="148" cy="330"></circle><text x="148" y="330" text-anchor="middle" stroke="black">tbsp</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="259" cy="438"></circle><text x="259" y="438" text-anchor="middle" stroke="black">tsp</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="36" cy="226"></circle><text x="36" y="226" text-anchor="middle" stroke="black">cup</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="29" cy="113"></circle><text x="29" y="113" text-anchor="middle" stroke="black">kg</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="156" cy="53"></circle><text x="156" y="53" text-anchor="middle" stroke="black">g</text>
</svg>
</div>
<p>Every conversion query is in fact requesting a path in the graph between two nodes, where the conversion ratio is
the product of the edge weights along the path. For example, converting tablespoons to teaspoons is a path
with a single edge, and the ratio is <code class="language-plaintext highlighter-rouge">3</code>. Converting cups to teaspoons is represented by the path <code class="language-plaintext highlighter-rouge">cup->tbsp->tsp</code>, and
the ratio is <code class="language-plaintext highlighter-rouge">16 * 3 = 48</code>. There is no path from kg to cups, so we cannot perform that conversion.</p>
<p>Note that given the input conversions we should actually construct
a graph that also contains the inverse edge for each given conversion, weighted by the reciprocal ratio.
So converting tablespoons to cups is also possible, with a ratio of <code class="language-plaintext highlighter-rouge">1/16</code>.</p>
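<p>For intuition, a single query could also be answered directly by walking the graph and multiplying edge weights along the way. Here’s a minimal sketch (independent of the implementations below), where <code class="language-plaintext highlighter-rouge">graph</code> is a hypothetical dict that already includes the inverse edges:</p>

```python
def convert_once(graph, from_unit, to_unit, amount, seen=None):
    # graph: unit -> {neighbor: ratio}, with inverse edges included
    if from_unit == to_unit:
        return amount
    seen = (seen or set()) | {from_unit}
    for neighbor, ratio in graph.get(from_unit, {}).items():
        if neighbor in seen:
            continue  # avoid cycles
        result = convert_once(graph, neighbor, to_unit, amount * ratio, seen)
        if result is not None:
            return result
    return None  # no path: conversion impossible

graph = {'cup': {'tbsp': 16}, 'tbsp': {'cup': 1 / 16, 'tsp': 3},
         'tsp': {'tbsp': 1 / 3}}
convert_once(graph, 'cup', 'tsp', 1)  # 48
```

<p>This walks one path per query; the implementations below instead precompute all possible conversions up front.</p>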
<h4 id="naive-solution">Naive solution</h4>
<p>Here is my naive implementation for an all-to-all BFS for this specific problem. The conversions are parsed
from the format above and passed as a list of conversions of the form <code class="language-plaintext highlighter-rouge">(from-unit, to-unit, ratio)</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
</pre></td><td class="code"><pre><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="k">def</span> <span class="nf">add_conversions</span><span class="p">(</span><span class="n">mapping</span><span class="p">,</span> <span class="n">conversions</span><span class="p">):</span>
<span class="k">for</span> <span class="n">from_unit</span><span class="p">,</span> <span class="n">to_unit</span><span class="p">,</span> <span class="n">amount</span> <span class="ow">in</span> <span class="n">conversions</span><span class="p">:</span>
<span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">][</span><span class="n">to_unit</span><span class="p">]</span> <span class="o">=</span> <span class="n">amount</span>
<span class="n">mapping</span><span class="p">[</span><span class="n">to_unit</span><span class="p">][</span><span class="n">from_unit</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">/</span> <span class="n">amount</span>
<span class="k">def</span> <span class="nf">expand_conversions</span><span class="p">(</span><span class="n">mapping</span><span class="p">):</span>
<span class="n">conversions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># If we can go from A to B, and from B to C,
</span> <span class="c1"># then we can get from A to C
</span> <span class="k">for</span> <span class="n">from_unit</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">.</span><span class="n">keys</span><span class="p">():</span>
<span class="k">for</span> <span class="n">to_unit</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">].</span><span class="n">keys</span><span class="p">():</span>
<span class="k">for</span> <span class="n">potential_to_unit</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">[</span><span class="n">to_unit</span><span class="p">].</span><span class="n">keys</span><span class="p">():</span>
<span class="k">if</span> <span class="p">(</span><span class="n">potential_to_unit</span> <span class="o">==</span> <span class="n">from_unit</span> <span class="ow">or</span>
<span class="n">potential_to_unit</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">]):</span>
<span class="k">continue</span>
<span class="n">new_ratio</span> <span class="o">=</span> <span class="p">(</span><span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">][</span><span class="n">to_unit</span><span class="p">]</span> <span class="o">*</span>
<span class="n">mapping</span><span class="p">[</span><span class="n">to_unit</span><span class="p">][</span><span class="n">potential_to_unit</span><span class="p">])</span>
<span class="n">conversions</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">from_unit</span><span class="p">,</span>
<span class="n">potential_to_unit</span><span class="p">,</span>
<span class="n">new_ratio</span><span class="p">))</span>
<span class="k">return</span> <span class="n">conversions</span>
<span class="k">def</span> <span class="nf">make_converter</span><span class="p">(</span><span class="n">conversions</span><span class="p">):</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="p">{})</span>
<span class="c1"># As long as we are discovering new conversions
</span> <span class="c1"># (including the input conversions)
</span> <span class="k">while</span> <span class="n">conversions</span><span class="p">:</span>
<span class="n">add_conversions</span><span class="p">(</span><span class="n">mapping</span><span class="p">,</span> <span class="n">conversions</span><span class="p">)</span>
<span class="n">conversions</span> <span class="o">=</span> <span class="n">expand_conversions</span><span class="p">(</span><span class="n">mapping</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">convert</span><span class="p">(</span><span class="n">from_unit</span><span class="p">,</span> <span class="n">to_unit</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="k">if</span> <span class="n">from_unit</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">if</span> <span class="n">to_unit</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">]:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">return</span> <span class="n">amount</span> <span class="o">*</span> <span class="n">mapping</span><span class="p">[</span><span class="n">from_unit</span><span class="p">][</span><span class="n">to_unit</span><span class="p">]</span>
<span class="k">return</span> <span class="n">convert</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>We simply iterate as long as we discover new conversions. At every iteration, after adding the new conversions
to our mapping dictionary (which includes adding the inverse conversions), we search for potential new conversions: for every node A, we iterate over all nodes B for which a path <code class="language-plaintext highlighter-rouge">A->B</code>
exists. For each such node B, we similarly iterate over all nodes C with a path from B to C. We then check whether a path from A to C
already exists, and if it doesn’t, we add it, with the total ratio being the product of the two separate conversions’ ratios.</p>
<h4 id="linear-algebra-solution">Linear algebra solution</h4>
<p>Since the solution can be reduced to BFS, we can use matrix multiplication as described previously to calculate the conversion matrix.
This time, though, we also need to take the edge weights into account. Along a path they need to be multiplied, which is exactly what matrix
multiplication does. However, this introduces another complication.</p>
<p>Here is the weighted adjacency matrix of the unit conversion graph above
(including the <code class="language-plaintext highlighter-rouge">1</code>’s on the main diagonal), for the ordering <code class="language-plaintext highlighter-rouge">cup, tbsp, tsp, kg, g</code>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 16 & 0 & 0 & 0 \\
\frac{1}{16} & 1 & 3 & 0 & 0 \\
0 & \frac{1}{3} & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1000 \\
0 & 0 & 0 & \frac{1}{1000} & 1
\end{pmatrix} %]]></script>
<p>When multiplied by itself the matrix gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 16 & 0 & 0 & 0 \\
\frac{1}{16} & 1 & 3 & 0 & 0 \\
0 & \frac{1}{3} & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1000 \\
0 & 0 & 0 & \frac{1}{1000} & 1
\end{pmatrix}^{2} =
\begin{pmatrix}
2 & 32 & 48 & 0 & 0 \\
\frac{1}{8} & 3 & 6 & 0 & 0 \\
\frac{1}{48} & \frac{2}{3} & 2 & 0 & 0 \\
0 & 0 & 0 & 2 & 2000 \\
0 & 0 & 0 & \frac{1}{500} & 2
\end{pmatrix} %]]></script>
<p>We see two new non-zero elements, representing the conversions <code class="language-plaintext highlighter-rouge">cup->tsp</code> and <code class="language-plaintext highlighter-rouge">tsp->cup</code>, with the correct ratios. Hooray!
Unfortunately, we also see that all the other non-zero elements have been scaled up by a factor of 2 or more, introducing incorrect ratios into the matrix.
This happens because matrix multiplication takes the <em>sum</em> of the element-wise product of a row and a column, so each
element ends up containing the sum of the ratios over all possible conversion paths. We can solve this by dividing the matrix (element-wise) by a “helper” matrix which counts
how many paths exist between every two units. This matrix is calculated in exactly the same fashion as the conversion matrix, except it’s initialized
with all edge weights set to <code class="language-plaintext highlighter-rouge">1</code>.</p>
<p>In this case, the helper matrix (and its multiplication by itself) would look like this:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{pmatrix}^{2} =
\begin{pmatrix}
2 & 2 & 1 & 0 & 0 \\
2 & 3 & 2 & 0 & 0 \\
1 & 2 & 2 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 2 & 2
\end{pmatrix} %]]></script>
<p>Then we perform the element-wise division, taking care to ignore elements where the denominator is <code class="language-plaintext highlighter-rouge">0</code>
(here we use the standard symbol for <a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)">Hadamard division</a>,
the formal name for element-wise division):</p>
<p><script type="math/tex">% <![CDATA[
\begin{pmatrix}
1 & 16 & 0 & 0 & 0 \\
\frac{1}{16} & 1 & 3 & 0 & 0 \\
0 & \frac{1}{3} & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1000 \\
0 & 0 & 0 & \frac{1}{1000} & 1
\end{pmatrix}^{2} \oslash \begin{pmatrix}
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{pmatrix}^{2} = %]]></script>
<script type="math/tex">% <![CDATA[
= \begin{pmatrix}
2 & 32 & 48 & 0 & 0 \\
\frac{1}{8} & 3 & 6 & 0 & 0 \\
\frac{1}{48} & \frac{2}{3} & 2 & 0 & 0 \\
0 & 0 & 0 & 2 & 2000 \\
0 & 0 & 0 & \frac{1}{500} & 2
\end{pmatrix} \oslash \begin{pmatrix}
2 & 2 & 1 & 0 & 0 \\
2 & 3 & 2 & 0 & 0 \\
1 & 2 & 2 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 2 & 2
\end{pmatrix} =
\begin{pmatrix}
1 & 16 & 48 & 0 & 0 \\
\frac{1}{16} & 1 & 3 & 0 & 0 \\
\frac{1}{48} & \frac{1}{3} & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1000 \\
0 & 0 & 0 & \frac{1}{1000} & 1
\end{pmatrix} %]]></script></p>
<p>And voilà! That’s exactly the matrix we wanted.</p>
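<p>We can verify the calculation numerically. A quick sketch with NumPy, using the same ordering <code class="language-plaintext highlighter-rouge">cup, tbsp, tsp, kg, g</code>:</p>

```python
import numpy as np

# Weighted adjacency matrix with 1's on the main diagonal
M = np.array([[1, 16, 0, 0, 0],
              [1 / 16, 1, 3, 0, 0],
              [0, 1 / 3, 1, 0, 0],
              [0, 0, 0, 1, 1000],
              [0, 0, 0, 1 / 1000, 1]])
H = (M > 0).astype(float)  # helper matrix: counts paths when powered

M2, H2 = M @ M, H @ H
# Element-wise (Hadamard) division, ignoring zero denominators
ratios = np.divide(M2, H2, out=np.zeros_like(M2), where=H2 > 0)
print(ratios[0, 2])  # cup -> tsp: 48.0
```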
<p>Here is the implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numpy.linalg</span> <span class="kn">import</span> <span class="n">matrix_power</span>
<span class="k">def</span> <span class="nf">make_converter</span><span class="p">(</span><span class="n">conversions</span><span class="p">):</span>
<span class="c1"># Establish consistent unit <-> index mappings
</span> <span class="n">index2unit</span> <span class="o">=</span> <span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">conversions</span><span class="p">]).</span>
<span class="n">union</span><span class="p">([</span><span class="n">c</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">conversions</span><span class="p">]))))</span>
<span class="n">unit2index</span> <span class="o">=</span> <span class="p">{</span><span class="n">v</span><span class="p">:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">index2unit</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">conversion_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">unit2index</span><span class="p">)))</span>
<span class="c1"># Add known conversions
</span> <span class="k">for</span> <span class="n">from_unit</span><span class="p">,</span> <span class="n">to_unit</span><span class="p">,</span> <span class="n">amount</span> <span class="ow">in</span> <span class="n">conversions</span><span class="p">:</span>
<span class="n">conversion_matrix</span><span class="p">[</span><span class="n">unit2index</span><span class="p">[</span><span class="n">from_unit</span><span class="p">],</span>
<span class="n">unit2index</span><span class="p">[</span><span class="n">to_unit</span><span class="p">]]</span> <span class="o">=</span> <span class="n">amount</span>
<span class="n">conversion_matrix</span><span class="p">[</span><span class="n">unit2index</span><span class="p">[</span><span class="n">to_unit</span><span class="p">],</span>
<span class="n">unit2index</span><span class="p">[</span><span class="n">from_unit</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span><span class="o">/</span><span class="n">amount</span>
<span class="n">helper_matrix</span> <span class="o">=</span> <span class="p">(</span><span class="n">conversion_matrix</span> <span class="o">></span> <span class="mi">0</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">prev_helper_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">helper_matrix</span><span class="p">))</span>
<span class="c1"># While we are still discovering new paths
</span> <span class="k">while</span> <span class="p">(</span><span class="n">prev_helper_matrix</span> <span class="o">!=</span> <span class="n">helper_matrix</span><span class="p">).</span><span class="nb">any</span><span class="p">():</span>
<span class="n">POWER_STEP</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">prev_helper_matrix</span> <span class="o">=</span> <span class="n">helper_matrix</span>
<span class="n">helper_matrix</span> <span class="o">=</span> <span class="n">matrix_power</span><span class="p">(</span><span class="n">helper_matrix</span><span class="p">,</span> <span class="n">POWER_STEP</span><span class="p">)</span>
<span class="n">conversion_matrix</span> <span class="o">=</span> \
<span class="p">(</span><span class="n">matrix_power</span><span class="p">(</span><span class="n">conversion_matrix</span><span class="p">,</span> <span class="n">POWER_STEP</span><span class="p">)</span> <span class="o">/</span>
<span class="n">np</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="mf">1.</span><span class="p">,</span> <span class="n">helper_matrix</span><span class="p">))</span>
<span class="n">helper_matrix</span> <span class="o">=</span> <span class="p">(</span><span class="n">conversion_matrix</span> <span class="o">></span> <span class="mi">0</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">convert</span><span class="p">(</span><span class="n">from_unit</span><span class="p">,</span> <span class="n">to_unit</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="n">conversion</span> <span class="o">=</span> <span class="n">conversion_matrix</span><span class="p">[</span><span class="n">unit2index</span><span class="p">[</span><span class="n">from_unit</span><span class="p">],</span>
<span class="n">unit2index</span><span class="p">[</span><span class="n">to_unit</span><span class="p">]]</span>
<span class="k">if</span> <span class="n">conversion</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">return</span> <span class="n">conversion</span> <span class="o">*</span> <span class="n">amount</span>
<span class="k">return</span> <span class="n">convert</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>We first create a mapping from node name to index (and the inverse mapping), since we are
going to work with row and column indices to represent different nodes. Then we create
the initial conversion matrix, starting from the identity matrix to include <code class="language-plaintext highlighter-rouge">1</code>’s on the main
diagonal. Finally, as long as new paths are being discovered, we repeatedly raise both the conversion matrix and the helper matrix to a higher power,
normalizing the conversion ratios by the corresponding path counts.</p>
<h4 id="comparison">Comparison</h4>
<p>The two implementations are about the same length (counting lines of code).</p>
<p>I like how with the naive implementation, a maintainer doesn’t even need to know anything about formal graphs to understand both <em>how</em> and <em>why</em> the solution works.</p>
<p>In contrast, to understand the second implementation you need to know how matrix multiplication works,
know about graphs, understand why unit conversion is equivalent to pathfinding, and see how matrix multiplication can be used as a BFS step. That’s quite a lot of baggage.</p>
<p>To compare performance, I downloaded a currency conversion XML, chose a couple of “key” currencies, and included conversions of all other
currencies in terms of those key currencies. The full conversion table contains 148 currencies; I also created partial tables with 52 and 12 currencies.
I ran both implementations 5 times on each file, measuring the time to construct the converter (and of course validating it afterwards).
Here are the results:</p>
<table>
<thead>
<tr>
<th>↓ # currencies / avg. runtime (sec) →</th>
<th>naive</th>
<th>linalg</th>
<th>linalg faster by</th>
</tr>
</thead>
<tbody>
<tr>
<td>12</td>
<td>0.000251</td>
<td>0.0016</td>
<td>0.15 (linalg is slower here)</td>
</tr>
</tbody>
<tbody>
<tr>
<td>52</td>
<td>0.012</td>
<td>0.0028</td>
<td>4.2</td>
</tr>
</tbody>
<tbody>
<tr>
<td>148</td>
<td>0.27</td>
<td>0.015</td>
<td>18</td>
</tr>
</tbody>
</table>
<p>While there’s an initial overhead to setting up the matrix representation and operations,
the linear algebra approach seems to scale better than the naive approach.</p>
<p>I think there could be a more efficient implementation of the naive approach, but I suspect that the linear algebra
implementation would still be faster, both asymptotically and practically for relatively large graphs, due to fast matrix multiplication techniques.</p>
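<p>For completeness, the measurement itself can be done with a small helper like this (a sketch; <code class="language-plaintext highlighter-rouge">make_converter</code> stands for either implementation):</p>

```python
import time

def time_construction(make_converter, conversions, repeats=5):
    # Average wall-clock time (in seconds) to construct the converter
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        make_converter(conversions)
        total += time.perf_counter() - start
    return total / repeats
```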
<h3 id="hierarchical-aggregations">Hierarchical aggregations</h3>
<p>Let’s consider another task. We have Yummly’s <a href="https://www.kaggle.com/c/whats-cooking/data">“What’s Cooking?”</a> public dataset, containing some 40k recipes.
Each recipe is classified into a cuisine and lists its ingredients. In order to better organize the large dataset,
we construct two hierarchies: a cuisine hierarchy and an ingredient hierarchy (containing only “common” ingredients, which appear in at least 100 recipes).</p>
<p>Here is the cuisine hierarchy:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">american</span>
<span class="n">north</span> <span class="n">american</span>
<span class="n">southern_us</span>
<span class="n">cajun_creole</span>
<span class="n">mexican</span>
<span class="n">caribbean</span>
<span class="n">jamaican</span>
<span class="n">south</span> <span class="n">american</span>
<span class="n">brazilian</span>
<span class="n">asian</span>
<span class="n">east</span> <span class="n">asian</span>
<span class="n">chinese</span>
<span class="n">japanese</span>
<span class="n">korean</span>
<span class="n">south</span> <span class="n">asian</span>
<span class="n">indian</span>
<span class="n">southeast</span> <span class="n">asian</span>
<span class="n">thai</span>
<span class="n">vietnamese</span>
<span class="n">filipino</span>
<span class="n">european</span>
<span class="n">southern</span> <span class="n">european</span>
<span class="n">greek</span>
<span class="n">spanish</span>
<span class="n">italian</span>
<span class="n">eastern</span> <span class="n">european</span>
<span class="n">russian</span>
<span class="n">northern</span> <span class="n">european</span>
<span class="n">british</span>
<span class="n">irish</span>
<span class="n">western</span> <span class="n">european</span>
<span class="n">french</span>
<span class="n">african</span>
<span class="n">moroccan</span></code></pre></figure>
<p>And here’s a snippet of the ingredient hierarchy (the full hierarchy contains about 650 nodes):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">dairy</span>
<span class="n">cheese</span>
<span class="n">shredded</span> <span class="n">cheese</span>
<span class="n">cream</span> <span class="n">cheese</span>
<span class="n">cream</span> <span class="n">cheese</span><span class="p">,</span> <span class="n">soften</span>
<span class="n">feta</span> <span class="n">cheese</span>
<span class="n">feta</span> <span class="n">cheese</span> <span class="n">crumbles</span>
<span class="n">cheddar</span> <span class="n">cheese</span>
<span class="n">sharp</span> <span class="n">cheddar</span> <span class="n">cheese</span>
<span class="n">shredded</span> <span class="n">cheddar</span> <span class="n">cheese</span>
<span class="n">shredded</span> <span class="n">sharp</span> <span class="n">cheddar</span> <span class="n">cheese</span>
<span class="n">provolone</span> <span class="n">cheese</span>
<span class="n">parmesan</span> <span class="n">cheese</span>
<span class="n">fresh</span> <span class="n">parmesan</span> <span class="n">cheese</span>
<span class="n">grated</span> <span class="n">parmesan</span> <span class="n">cheese</span>
<span class="n">freshly</span> <span class="n">grated</span> <span class="n">parmesan</span>
<span class="n">mozzarella</span> <span class="n">cheese</span>
<span class="n">part</span><span class="o">-</span><span class="n">skim</span> <span class="n">mozzarella</span> <span class="n">cheese</span>
<span class="n">shredded</span> <span class="n">mozzarella</span> <span class="n">cheese</span>
<span class="n">monterey</span> <span class="n">jack</span>
<span class="n">jack</span> <span class="n">cheese</span>
<span class="n">shredded</span> <span class="n">Monterey</span> <span class="n">Jack</span> <span class="n">cheese</span>
<span class="n">mascarpone</span>
<span class="n">Mexican</span> <span class="n">cheese</span> <span class="n">blend</span>
<span class="n">romano</span> <span class="n">cheese</span>
<span class="n">pecorino</span> <span class="n">romano</span> <span class="n">cheese</span>
<span class="n">parmigiano</span> <span class="n">reggiano</span> <span class="n">cheese</span>
<span class="n">ricotta</span> <span class="n">cheese</span>
<span class="n">ricotta</span>
<span class="n">part</span><span class="o">-</span><span class="n">skim</span> <span class="n">ricotta</span> <span class="n">cheese</span>
<span class="n">goat</span> <span class="n">cheese</span>
<span class="n">fontina</span> <span class="n">cheese</span>
<span class="n">cottage</span> <span class="n">cheese</span>
<span class="n">queso</span> <span class="n">fresco</span>
<span class="n">paneer</span>
<span class="n">cotija</span></code></pre></figure>
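<p>As an aside, listings like these can be parsed into a child → parent mapping based on their indentation. A minimal sketch, assuming a fixed indentation width of two spaces per level (the exact input format here is an assumption):</p>

```python
def parse_hierarchy(text, indent=2):
    # Map each node to its parent; roots map to None
    hier, stack = {}, []  # stack holds (depth, name) of open ancestors
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        name = line.strip()
        while stack and stack[-1][0] >= depth:
            stack.pop()
        hier[name] = stack[-1][1] if stack else None
        stack.append((depth, name))
    return hier

parse_hierarchy("dairy\n  cheese\n    cream cheese\n  butter")
# {'dairy': None, 'cheese': 'dairy', 'cream cheese': 'cheese', 'butter': 'dairy'}
```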
<p>These hierarchies allow us to generalize some concepts and group them together.</p>
<p>We now want to count cuisine-ingredient combinations, i.e. how many recipes belong to a certain cuisine and contain a certain ingredient.
This aggregation should be done hierarchically: an Italian recipe is also South-European and European, and the ingredient “diced tomatoes”
also counts as “tomatoes” and “vegetables”.</p>
<h4 id="naive-solution-1">Naive solution</h4>
<p>Once again, the naive solution doesn’t involve explicitly representing the problem in graph terms.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">aggregate</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">):</span>
<span class="c1"># Initialize empty aggregations
</span> <span class="n">res</span> <span class="o">=</span> <span class="p">{</span><span class="n">cuisine</span><span class="p">:</span> <span class="p">{</span><span class="n">ingredient</span><span class="p">:</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">ingredient</span> <span class="ow">in</span> <span class="n">ingredient_hier</span><span class="p">.</span><span class="n">keys</span><span class="p">()}</span>
<span class="k">for</span> <span class="n">cuisine</span> <span class="ow">in</span> <span class="n">cuisine_hier</span><span class="p">.</span><span class="n">keys</span><span class="p">()}</span>
<span class="k">for</span> <span class="n">recipe</span> <span class="ow">in</span> <span class="n">recipes</span><span class="p">:</span>
<span class="n">aggregate_recipe</span><span class="p">(</span><span class="n">recipe</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span>
<span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="n">cuisine</span><span class="p">,</span> <span class="n">ingredient</span><span class="p">):</span>
<span class="k">return</span> <span class="n">res</span><span class="p">[</span><span class="n">cuisine</span><span class="p">][</span><span class="n">ingredient</span><span class="p">]</span>
<span class="k">return</span> <span class="n">query</span>
<span class="k">def</span> <span class="nf">aggregate_recipe</span><span class="p">(</span><span class="n">recipe</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span> <span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">):</span>
<span class="n">cuisine</span> <span class="o">=</span> <span class="n">recipe</span><span class="p">[</span><span class="s">'cuisine'</span><span class="p">]</span>
<span class="k">for</span> <span class="n">ingredient</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">recipe</span><span class="p">[</span><span class="s">'ingredients'</span><span class="p">]):</span>
<span class="n">aggregate_ingredient</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="n">cuisine</span><span class="p">,</span> <span class="n">ingredient</span><span class="p">,</span>
<span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">aggregate_ingredient</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="n">cuisine</span><span class="p">,</span> <span class="n">ingredient</span><span class="p">,</span>
<span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">):</span>
<span class="k">if</span> <span class="n">ingredient</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ingredient_hier</span><span class="p">:</span>
<span class="k">return</span>
<span class="c1"># For every cuisine up the hierarchy, for every ingredient up
</span> <span class="c1"># the hierarchy, add 1 to the aggregated count
</span> <span class="n">curr_cuisine</span> <span class="o">=</span> <span class="n">cuisine</span>
<span class="k">while</span> <span class="n">curr_cuisine</span> <span class="ow">in</span> <span class="n">cuisine_hier</span><span class="p">:</span>
<span class="n">curr_ingredient</span> <span class="o">=</span> <span class="n">ingredient</span>
<span class="k">while</span> <span class="n">curr_ingredient</span> <span class="ow">in</span> <span class="n">ingredient_hier</span><span class="p">:</span>
<span class="n">res</span><span class="p">[</span><span class="n">curr_cuisine</span><span class="p">][</span><span class="n">curr_ingredient</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">curr_ingredient</span> <span class="o">=</span> <span class="n">ingredient_hier</span><span class="p">[</span><span class="n">curr_ingredient</span><span class="p">]</span>
<span class="n">curr_cuisine</span> <span class="o">=</span> <span class="n">cuisine_hier</span><span class="p">[</span><span class="n">curr_cuisine</span><span class="p">]</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>The solution is pretty straightforward: for each recipe, we walk up both hierarchies and increment the count for every (cuisine ancestor, ingredient ancestor) pair.</p>
<h4 id="graph-representation-1">Graph representation</h4>
<p>Before we look at the linear algebra approach, let’s see how this problem translates to graph terms.
The hierarchies are simply trees, with an edge from each node to its parent:</p>
<div>
<svg width="500px" height="500px">
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="0" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L0,6 L6,3 z" fill="#000" />
</marker>
</defs>
<line x1="321" y1="343" x2="297" y2="274" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="303" y="291" text-anchor="middle" stroke="black"></text>
<line x1="429" y1="366" x2="356" y2="350" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="375" y="354" text-anchor="middle" stroke="black"></text>
<line x1="349" y1="440" x2="330" y2="377" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="335" y="391" text-anchor="middle" stroke="black"></text>
<line x1="238" y1="411" x2="293" y2="365" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="279" y="377" text-anchor="middle" stroke="black"></text>
<line x1="174" y1="229" x2="250" y2="236" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="230" y="234" text-anchor="middle" stroke="black"></text>
<line x1="69" y1="219" x2="138" y2="225" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="121" y="224" text-anchor="middle" stroke="black"></text>
<line x1="343" y1="145" x2="304" y2="209" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="314" y="192" text-anchor="middle" stroke="black"></text>
<line x1="276" y1="62" x2="320" y2="116" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="309" y="103" text-anchor="middle" stroke="black"></text>
<line x1="391" y1="55" x2="359" y2="113" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="367" y="100" text-anchor="middle" stroke="black"></text>
<line x1="454" y1="143" x2="378" y2="144" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="398" y="144" text-anchor="middle" stroke="black"></text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="321" cy="343"></circle><text x="321" y="343" text-anchor="middle" stroke="black"><tspan x="321" dy="-.3em">east</tspan><tspan x="321" dy=".9em">asian</tspan></text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="286" cy="240"></circle><text x="286" y="240" text-anchor="middle" stroke="black">asian</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="429" cy="366"></circle><text x="429" y="366" text-anchor="middle" stroke="black">chinese</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="349" cy="440"></circle><text x="349" y="440" text-anchor="middle" stroke="black">japanese</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="238" cy="411"></circle><text x="238" y="411" text-anchor="middle" stroke="black">korean</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="174" cy="229"></circle><text x="174" y="229" text-anchor="middle" stroke="black"><tspan x="174" dy="-.3em">south</tspan><tspan x="174" dy=".9em">asian</tspan></text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="69" cy="219"></circle><text x="69" y="219" text-anchor="middle" stroke="black">indian</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="343" cy="145"></circle><text x="343" y="145" text-anchor="middle" stroke="black"><tspan x="343" dy="-.3em">southeast</tspan><tspan x="343" dy=".9em">asian</tspan></text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="276" cy="62"></circle><text x="276" y="62" text-anchor="middle" stroke="black">thai</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="391" cy="55"></circle><text x="391" y="55" text-anchor="middle" stroke="black">vietnamese</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="454" cy="143"></circle><text x="454" y="143" text-anchor="middle" stroke="black">filipino</text>
</svg>
</div>
<p>Each cuisine-ingredient pair is directly related to two nodes, and the aggregation involves all nodes
reachable from those two directly related nodes.</p>
<h4 id="ancestry-matrices">Ancestry matrices</h4>
<p>Let’s compute that “all reachable nodes” set for each node. In the context of hierarchies,
these paths can be interpreted as a node’s “ancestry lineage”, i.e. all nodes appearing on the path
from that node to the root of the hierarchy tree. This is another instance of a (binary) BFS, which means we can use matrix
exponentiation to find the ancestry matrix. The initial matrix will be the hierarchy matrix: a direct matrix representation
of the hierarchy tree, added to the identity matrix (as in the unit conversion case).</p>
<p>Given the following tree:</p>
<div>
<svg width="500px" height="500px">
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="0" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L0,6 L6,3 z" fill="#000" />
</marker>
</defs>
<line x1="117" y1="249" x2="266" y2="249" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="209" y="249" text-anchor="middle" stroke="black"></text>
<line x1="427" y1="249" x2="338" y2="249" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="364" y="249" text-anchor="middle" stroke="black"></text>
<line x1="43" y1="57" x2="104" y2="215" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="80" y="153" text-anchor="middle" stroke="black"></text>
<line x1="43" y1="442" x2="104" y2="282" stroke="black" stroke-width="1" marker-end="url(#arrow)"></line><text x="80" y="345" text-anchor="middle" stroke="black"></text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="117" cy="249"></circle><text x="117" y="249" text-anchor="middle" stroke="black">B</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="302" cy="249"></circle><text x="302" y="249" text-anchor="middle" stroke="black">A</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="427" cy="249"></circle><text x="427" y="249" text-anchor="middle" stroke="black">C</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="43" cy="57"></circle><text x="43" y="57" text-anchor="middle" stroke="black">D</text>
<circle stroke="black" fill="#AAEEBB" r="30" cx="43" cy="442"></circle><text x="43" y="442" text-anchor="middle" stroke="black">E</text>
</svg>
</div>
<p>The hierarchy matrix (using node order <code class="language-plaintext highlighter-rouge">A, B, C, D, E</code>) will look like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 \\
\end{pmatrix} %]]></script>
<p>Here the rows represent child nodes and the columns parent nodes: thus the element at row 4 (node <code class="language-plaintext highlighter-rouge">D</code>), column 2 (node <code class="language-plaintext highlighter-rouge">B</code>) is <code class="language-plaintext highlighter-rouge">1</code>,
because node <code class="language-plaintext highlighter-rouge">D</code> is a child of node <code class="language-plaintext highlighter-rouge">B</code>.</p>
<p>Multiplying the hierarchy matrix by itself yields (with binary multiplication):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 \\
\end{pmatrix}^{2} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
\end{pmatrix} %]]></script>
<p>The resulting matrix additionally contains the information that node <code class="language-plaintext highlighter-rouge">A</code> is an ancestor of nodes <code class="language-plaintext highlighter-rouge">D</code> and <code class="language-plaintext highlighter-rouge">E</code>.
Since the longest path in the tree is of length 2, in this case we are done; in general, as before,
we continue until multiplications don’t cause further changes in the matrix.</p>
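<p>As a sanity check, here’s a minimal sketch (plain numpy, dense arrays) of computing the ancestry matrix for the tree above by repeated binary squaring until a fixed point is reached:</p>

```python
import numpy as np

# Hierarchy matrix for the A..E tree above (node order A, B, C, D, E):
# the identity plus a 1 at (child, parent).
H = np.array([
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
])

# Repeatedly square (with binary multiplication) until nothing changes.
ancestry = H
while True:
    squared = ((ancestry @ ancestry) > 0).astype(int)
    if (squared == ancestry).all():
        break
    ancestry = squared

print(ancestry)
```

<p>For this tree one squaring suffices; the loop exits as soon as a further multiplication leaves the matrix unchanged.</p>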
<h4 id="linear-algebra-solution-1">Linear algebra solution</h4>
<p>Our solution is going to take the following form:</p>
<ol>
<li>Convert the recipe representation to matrix form, generating two matrices: recipe cuisines and recipe ingredients.</li>
<li>Create ancestry matrices for the two hierarchies.</li>
<li>Use matrix multiplication to calculate the final aggregation.</li>
</ol>
<p>The conversion to matrix representation will create a matrix with each recipe represented as a row; in
the cuisine matrix, there will be a column for each cuisine and each recipe row will have a <code class="language-plaintext highlighter-rouge">1</code> in the relevant cuisine.
Similarly, in the ingredient matrix, there will be a column for each ingredient, and each recipe row will have <code class="language-plaintext highlighter-rouge">1</code>’s in all
relevant ingredients (there could be more than one).</p>
<p>In this case, since we expect the matrices to be sparse, we’ll use <a href="https://docs.scipy.org/doc/scipy/reference/sparse.html">scipy’s sparse matrices</a>.
For this reason we’ll use the <code class="language-plaintext highlighter-rouge">**</code> operator to take the matrix power instead of numpy’s <code class="language-plaintext highlighter-rouge">matrix_power</code>, as numpy functions often don’t work well with sparse matrices.</p>
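<p>For illustration, a minimal sketch of taking a sparse matrix power with <code class="language-plaintext highlighter-rouge">**</code>, using the hierarchy matrix of the small tree above:</p>

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hierarchy matrix for the A..E tree above, stored sparsely.
H = csr_matrix(np.array([
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
]))

# `**` computes the matrix power on a square sparse matrix;
# clip back to 0/1 to keep the binary semantics.
ancestry = (H ** 2 > 0).astype(int)
print(ancestry.toarray())
```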
<p>After converting the representation and creating the ancestry matrices, what’s left is a few final multiplications.
If we call the cuisine ancestry matrix <code class="language-plaintext highlighter-rouge">C</code>, ingredient ancestry matrix <code class="language-plaintext highlighter-rouge">I</code>, recipe cuisines matrix <code class="language-plaintext highlighter-rouge">Rc</code> and recipe ingredients <code class="language-plaintext highlighter-rouge">Ri</code>,
then we can make the following observations:</p>
<ul>
<li>Multiplying <code class="language-plaintext highlighter-rouge">Rc</code> by <code class="language-plaintext highlighter-rouge">C</code> will yield, for each recipe, all the cuisines it belongs to (including ancestors).</li>
<li>Multiplying <code class="language-plaintext highlighter-rouge">Ri</code> by <code class="language-plaintext highlighter-rouge">I</code> will yield, for each recipe, all the ingredients in the recipe (including ancestors).</li>
<li>Multiplying the above two matrices (transposing the first) yields, for each cuisine and ingredient pair, how many recipes belong
to that cuisine and contain that ingredient - which is exactly what we want!</li>
</ul>
<p>So in conclusion the final calculation is:</p>
<script type="math/tex; mode=display">(Rc \cdot C)^{T} \cdot (Ri \cdot I)</script>
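<p>To make the final formula concrete, here’s a tiny made-up example (dense numpy arrays for clarity; the cuisines, ingredients and recipes are hypothetical):</p>

```python
import numpy as np

# Hypothetical toy data: cuisine order (asian, east asian, chinese),
# ingredient order (vegetable, cabbage), and two recipes.
C = np.array([[1, 0, 0],    # cuisine ancestry matrix
              [1, 1, 0],    # east asian -> asian
              [1, 1, 1]])   # chinese -> east asian -> asian
I = np.array([[1, 0],       # ingredient ancestry matrix
              [1, 1]])      # cabbage -> vegetable
Rc = np.array([[0, 0, 1],   # recipe 0 is chinese
               [0, 1, 0]])  # recipe 1 is east asian
Ri = np.array([[0, 1],      # both recipes contain cabbage
               [0, 1]])

counts = (Rc @ C).T @ (Ri @ I)
print(counts)  # rows: cuisines, columns: ingredients
```

<p>Here <code class="language-plaintext highlighter-rouge">counts[0, 1]</code> is 2, since both recipes belong (transitively) to the asian cuisine and contain cabbage, while <code class="language-plaintext highlighter-rouge">counts[2, 0]</code> is 1, since only one recipe is chinese.</p>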
<p>And without further ado, here’s the full code:</p>
<p>(The <code class="language-plaintext highlighter-rouge">@</code> operator in Python 3 denotes matrix multiplication, though in this case it’s not strictly necessary as sparse matrices overload the <code class="language-plaintext highlighter-rouge">*</code> operator for matrix multiplication as well.)</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.sparse.lil</span> <span class="kn">import</span> <span class="n">lil_matrix</span>
<span class="kn">from</span> <span class="nn">scipy.sparse.csr</span> <span class="kn">import</span> <span class="n">csr_matrix</span>
<span class="k">def</span> <span class="nf">aggregate</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">ingredient_hier</span><span class="p">):</span>
<span class="c1"># Establish consistent node <-> index mappings
</span> <span class="c1"># for both hierarchies
</span> <span class="n">index2cuisine</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">cuisine_hier</span><span class="p">.</span><span class="n">keys</span><span class="p">()))</span>
<span class="n">cuisine2index</span> <span class="o">=</span> <span class="p">{</span><span class="n">v</span><span class="p">:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">index2cuisine</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">index2ingredient</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">ingredient_hier</span><span class="p">.</span><span class="n">keys</span><span class="p">()))</span>
<span class="n">ingredient2index</span> <span class="o">=</span> <span class="p">{</span><span class="n">v</span><span class="p">:</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">index2ingredient</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="c1"># Map recipes to cuisine matrix and ingredient matrix
</span> <span class="n">recipe2cuisine</span> <span class="o">=</span> \
<span class="n">recipe_cuisines</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">cuisine2index</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">recipe2ingredient</span> <span class="o">=</span> \
<span class="n">recipe_ingredients</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">ingredient2index</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1"># Create cuisine ancestry matrix
</span> <span class="n">cuisine_hier_mat</span> <span class="o">=</span> \
<span class="n">construct_hierarchy_matrix</span><span class="p">(</span><span class="n">cuisine_hier</span><span class="p">,</span> <span class="n">cuisine2index</span><span class="p">)</span>
<span class="n">cuisine_ancestry_mat</span> <span class="o">=</span> \
<span class="n">construct_ancestry_matrix</span><span class="p">(</span><span class="n">cuisine_hier_mat</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1"># Create ingredient ancestry matrix
</span> <span class="n">ingredient_hier_mat</span> <span class="o">=</span> \
<span class="n">construct_hierarchy_matrix</span><span class="p">(</span><span class="n">ingredient_hier</span><span class="p">,</span> <span class="n">ingredient2index</span><span class="p">)</span>
<span class="n">ingredient_ancestry_mat</span> <span class="o">=</span> \
<span class="n">construct_ancestry_matrix</span><span class="p">(</span><span class="n">ingredient_hier_mat</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1"># Aggregate
</span> <span class="n">counts</span> <span class="o">=</span> <span class="p">(</span><span class="n">recipe2cuisine</span> <span class="o">@</span> <span class="n">cuisine_ancestry_mat</span><span class="p">).</span><span class="n">T</span> <span class="o">@</span> \
<span class="p">(</span><span class="n">recipe2ingredient</span> <span class="o">@</span> <span class="n">ingredient_ancestry_mat</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="n">cuisine</span><span class="p">,</span> <span class="n">ingredient</span><span class="p">):</span>
<span class="k">return</span> <span class="n">counts</span><span class="p">[</span><span class="n">cuisine2index</span><span class="p">[</span><span class="n">cuisine</span><span class="p">],</span>
<span class="n">ingredient2index</span><span class="p">[</span><span class="n">ingredient</span><span class="p">]]</span>
<span class="k">return</span> <span class="n">query</span>
<span class="k">def</span> <span class="nf">construct_hierarchy_matrix</span><span class="p">(</span><span class="n">hierarchy</span><span class="p">,</span> <span class="n">node2index</span><span class="p">):</span>
<span class="n">N</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">hierarchy</span><span class="p">)</span>
<span class="n">hier_mat</span> <span class="o">=</span> <span class="n">lil_matrix</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">N</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">)</span>
<span class="k">for</span> <span class="n">child</span><span class="p">,</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">hierarchy</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="k">if</span> <span class="n">parent</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">hier_mat</span><span class="p">[</span><span class="n">node2index</span><span class="p">[</span><span class="n">child</span><span class="p">],</span> <span class="n">node2index</span><span class="p">[</span><span class="n">parent</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="k">return</span> <span class="n">csr_matrix</span><span class="p">(</span><span class="n">hier_mat</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">construct_ancestry_matrix</span><span class="p">(</span><span class="n">hierarchy_matrix</span><span class="p">):</span>
<span class="n">ancestry_matrix</span> <span class="o">=</span> <span class="n">hierarchy_matrix</span>
<span class="n">POWER_STEP</span> <span class="o">=</span> <span class="mi">5</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">new_ancestry_matrix</span> <span class="o">=</span> <span class="n">ancestry_matrix</span> <span class="o">**</span> <span class="n">POWER_STEP</span>
<span class="k">if</span> <span class="ow">not</span> <span class="p">(</span><span class="n">new_ancestry_matrix</span> <span class="o">!=</span> <span class="n">ancestry_matrix</span><span class="p">).</span><span class="nb">max</span><span class="p">():</span>
<span class="k">return</span> <span class="n">new_ancestry_matrix</span>
<span class="n">ancestry_matrix</span> <span class="o">=</span> <span class="n">new_ancestry_matrix</span>
<span class="k">def</span> <span class="nf">recipe_cuisines</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">cuisine2index</span><span class="p">):</span>
<span class="n">recipe2cuisine</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">recipes</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">cuisine2index</span><span class="p">)))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">recipe</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">recipes</span><span class="p">):</span>
<span class="n">recipe2cuisine</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">cuisine2index</span><span class="p">[</span><span class="n">recipe</span><span class="p">[</span><span class="s">'cuisine'</span><span class="p">]]]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="k">return</span> <span class="n">csr_matrix</span><span class="p">(</span><span class="n">recipe2cuisine</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">recipe_ingredients</span><span class="p">(</span><span class="n">recipes</span><span class="p">,</span> <span class="n">ingredient2index</span><span class="p">):</span>
<span class="n">recipe2ingredients</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">recipes</span><span class="p">),</span>
<span class="nb">len</span><span class="p">(</span><span class="n">ingredient2index</span><span class="p">)))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">recipe</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">recipes</span><span class="p">):</span>
<span class="k">for</span> <span class="n">ingredient</span> <span class="ow">in</span> <span class="n">recipe</span><span class="p">[</span><span class="s">'ingredients'</span><span class="p">]:</span>
<span class="k">if</span> <span class="n">ingredient</span> <span class="ow">in</span> <span class="n">ingredient2index</span><span class="p">:</span>
<span class="n">recipe2ingredients</span><span class="p">[</span><span class="n">i</span><span class="p">,</span>
<span class="n">ingredient2index</span><span class="p">[</span><span class="n">ingredient</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="k">return</span> <span class="n">csr_matrix</span><span class="p">(</span><span class="n">recipe2ingredients</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>
<h4 id="comparison-1">Comparison</h4>
<p>First, we see that in this instance the linear algebra solution requires considerably more code to implement,
and is again much harder to follow intuitively.</p>
<p>Regarding performance, on my machine, the naive approach (on the full training set available on Kaggle) takes about 0.9 seconds on average,
while the linear algebra approach takes about 0.48 seconds. An improvement indeed, but the factor is not very impressive.
However, I also separately timed the matrix operations (excluding the representation conversion), and they took only about 0.045 seconds on average.
So most of the overhead in the linear algebra approach can be eliminated if we maintain the data in the appropriate format,
to get, in this case, an improvement factor of about 20x. Neat!</p>
<h4 id="other-use-cases">Other use cases</h4>
<p>This method is useful in several other similar situations:</p>
<ul>
<li>When we already have a matrix with the aggregated amounts for leaf nodes, and we just want to aggregate to non-leaf nodes.</li>
<li>When we have a more complicated relationship graph which can be represented as a DAG (directed acyclic graph). In this case the initial hierarchy matrix (used for
calculating the ancestry matrix) should be the graph representation of the DAG, in a similar manner to the tree representation.</li>
</ul>
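<p>As a sketch of the first case (with hypothetical per-node amounts, reusing the ancestry matrix of the small tree above): rolling leaf amounts up to all ancestors is a single multiplication.</p>

```python
import numpy as np

# Ancestry matrix for the A..E tree (rows: descendants, columns: ancestors).
ancestry = np.array([
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
])

# Hypothetical amounts observed at nodes C, D and E only.
amounts = np.array([0, 0, 7, 3, 2])

# Each node's total is the sum over all its descendants (including itself).
totals = amounts @ ancestry
print(totals)  # A collects everything, B collects D and E
```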
<h3 id="conclusions">Conclusions</h3>
<p>In this post we explored how matrix multiplication can be used to perform BFS-like graph computations, and examined two cases where using matrix operations speeds up
computation (at the expense of clarity): unit conversion and hierarchical aggregation. In both cases, at large enough scales (and using the proper representation)
the speedup is an order of magnitude. In addition, algorithms based on matrix multiplication can be scaled further with hardware - utilizing GPUs and parallelizing computation.</p>
<h3 id="more-stuff">More stuff</h3>
<p>All the code for the examples can be found <a href="https://github.com/andersource/matrix-exponentiation-fun">here</a>.</p>
<p><a href="http://graphblas.org/index.php?title=Graph_BLAS_Forum">Graph BLAS</a> is a large-scale open effort at creating standardized primitives for graph algorithms in the language of linear algebra.</p>
<p><a href="https://bookstore.ams.org/stml-53">This book</a> details many applications of linear algebra in computer science and other areas of mathematics.
Interestingly, some of the algorithmic applications offer the best known polynomial runtime for the given tasks.</p>