<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Sagar's blog</title>
        <link>http://www.sagargv.com/blog/</link>
        <description>Recent content on Sagar's blog</description>
        <generator>feedrender.py -- www.sagargv.com</generator>
        <language>en-us</language>
        <lastBuildDate>Mon, 05 Jul 2021 20:35:56 +0000</lastBuildDate>

        <atom:link href="http://www.sagargv.com/blog/atom.xml" rel="self" type="application/rss+xml" />

        
        <item>
            <title>Reinforcement Learning with PyBullet in Colab</title>
            <link>http://www.sagargv.com/blog/rl-with-pybullet/</link>
            <pubDate>Tue, 06 Jul 2021 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/rl-with-pybullet/</guid>
            <description><![CDATA[
<p>Some tutorials on reinforcement learning in PyBullet using soft actor-critic with a simple custom robot that can move in only one plane:</p>
<ul>
<li><a href="https://colab.research.google.com/drive/1Ww-oKm-WTsH4PyBHGm5VMUCsD88bv9os?usp=sharing">2D Reacher - Reach a point on the table</a></li>
<li><a href="https://colab.research.google.com/drive/1N8JrLelcV1jtJq61KeE5Gc7otIH0tzuI?usp=sharing">2D Pusher - Push an object on the table</a></li>
<li><a href="https://colab.research.google.com/drive/1bumvdASL-GDqGirUngJjoZCtiw0yl2Kp?usp=sharing">2D Reacher from Pixels - Reach the marked point on the table</a></li>
<li><a href="https://colab.research.google.com/drive/1N8JrLelcV1jtJq61KeE5Gc7otIH0tzuI?usp=sharing">2D Pusher with Inverse Kinematics - Push an object on the table</a></li>
</ul>
<p>This can serve as a template to get started with a custom robot and might help in debugging some learning algorithms that directly control the actuators of a robot.</p>]]></description>
        </item>
        
        <item>
            <title>Tutorials on using PyBullet in Colab</title>
            <link>http://www.sagargv.com/blog/pybullet-colab-tutorials/</link>
            <pubDate>Tue, 06 Jul 2021 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/pybullet-colab-tutorials/</guid>
            <description><![CDATA[
<p>Some tutorials on using PyBullet in Google Colab:</p>
<ul>
<li><a href="https://colab.research.google.com/drive/1bLYj1s8oiOxtEc3T4UTgXk-wLpE_nDui?usp=sharing">Intro to inverse kinematics in PyBullet with the Kuka robot</a></li>
<li><a href="https://colab.research.google.com/drive/1Xiwda3c5c4-5xY6c4ghtnrlhoYm2FYKn?usp=sharing">Pick and Place using virtual suction gripper on Kuka robot</a></li>
<li><a href="https://colab.research.google.com/drive/1eXq-Tl3QKzmbXGSKU2hDk0u_EHdfKVd0?usp=sharing">Pick and Place using two fingered jaw gripper on Kuka robot</a></li>
<li><a href="https://colab.research.google.com/drive/1w9U_vbLk4vIKyQKqgHgSjwqk-hoyiiyv?usp=sharing">Build your own robot with only one joint</a></li>
<li><a href="https://colab.research.google.com/drive/1i6ITJn3aggzyJTCew1xXDRtyY9szvkr4?usp=sharing">How to build a simple custom robot that can move in only one plane</a></li>
</ul>]]></description>
        </item>
        
        <item>
            <title>Tutorials on Computer Vision using Keras in Colab</title>
            <link>http://www.sagargv.com/blog/cv-keras-tutorials/</link>
            <pubDate>Tue, 06 Jul 2021 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/cv-keras-tutorials/</guid>
            <description><![CDATA[
<p>Some tutorials on computer vision using Tensorflow / Keras in Colab:</p>
<ul>
<li><a href="https://colab.research.google.com/drive/1cAWKGPeFiXl4krIAEe24cVKeYNo_yAsW?usp=sharing">Plain old sliding window object detector</a></li>
<li><a href="https://colab.research.google.com/drive/1xizq12Kw0O4bVyf5zGLUszsRqWj6QwcE?usp=sharing">Siamese net for one-shot image classification</a></li>
<li><a href="https://colab.research.google.com/drive/1Wp-DZlgoNulS-sHneIX5kgSyVB26p_Mi?usp=sharing">One-shot sliding window object detector using Siamese net</a></li>
<li><a href="https://colab.research.google.com/drive/1AGsrzDP1XXDgQrmRlYwx5xX4dWCRIUUP?usp=sharing">Differentiable one-shot object localization using Siamese net</a></li>
<li><a href="https://colab.research.google.com/drive/1DSLi4Y7jWtl7YxO5hHSc-ojXRLm8stjm?usp=sharing">Differentiable one-shot object localization of object specified by visual cue using Siamese net</a></li>
<li><a href="https://colab.research.google.com/drive/1fuxmgMBrvsOcA4_7uKgHWxhPORMuxmoj?usp=sharing">Differentiable localization in the presence of multiple instances of objects</a></li>
<li><a href="https://colab.research.google.com/drive/1e_kISZeC21PDpNaMvEaYkZmjq4ZWF-xy?usp=sharing">Differentiable one-shot object localization of object specified by pointing at it using Siamese net</a></li>
</ul>]]></description>
        </item>
        
        <item>
            <title>How to read a query plan from Postgres EXPLAIN</title>
            <link>http://www.sagargv.com/blog/how-to-read-postgres-explain/</link>
            <pubDate>Thu, 31 Dec 2020 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/how-to-read-postgres-explain/</guid>
            <description><![CDATA[
<p>Got a web app responding too slowly because of a database query?
The query plan is what you'd want to look at.
Let's look at the execution plan that Postgres shows you when you
use the EXPLAIN statement and see how to interpret that.</p>
<p>The <a href="https://www.postgresql.org/docs/9.4/using-explain.html">Postgres documentation on using EXPLAIN</a>
is excellent, but I thought a concise version would serve as a note to my future self.
To start off, let's work with a simple schema of three tables: (a) Users,
(b) Articles, and (c) Orders.</p>
<pre><code>users:
 id |      email
----+------------------
  1 | jon@example.com
  2 | Jane@example.com
</code></pre>

<pre><code>articles:
 id |    name
----+-------------
  1 | Playstation
  2 | Xbox
</code></pre>

<pre><code>orders:
 id | user_id | article_id | qty
----+---------+------------+-----
  1 |       1 |          1 |   2
  2 |       1 |          2 |   1
  3 |       2 |          1 |   4
  4 |       2 |          2 |   3
</code></pre>

<p>Suppose we want to query the <code>users</code> table by email address. The plan that
Postgres generates is:</p>
<pre><code>EXPLAIN SELECT * FROM users WHERE email='jon@example.com';
---------------------------------------------------------------------------------
 Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
   Index Cond: ((email)::text = 'jon@example.com'::text)
</code></pre>

<p>Since we have an index on the email column, postgres has decided to use that
to search the users table. If we didn't have that index, it would do a sequential
scan over the entire table:</p>
<pre><code>EXPLAIN SELECT * FROM users WHERE email='jon@example.com';
----------------------------------------------------------
 Seq Scan on users  (cost=0.00..14.00 rows=2 width=222)
   Filter: ((email)::text = 'jon@example.com'::text)
</code></pre>

<p>To get all the orders of a user, the query plan is:</p>
<pre><code> EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..40.57 rows=6 width=238)
   -&gt;  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
         Index Cond: ((email)::text = 'jon@example.com'::text)
   -&gt;  Index Scan using orders_userid_index on orders  (cost=0.15..32.31 rows=9 width=16)
         Index Cond: (user_id = users.id)
</code></pre>

<p>What's going on here? First, Postgres uses the index on the email column
to search the users table for users with the email jon@example.com.
For each matching row (we expect a single matching row in this example),
an index scan is performed over the orders table to get all orders for that user id.
Translated to pseudocode:</p>
<pre><code>for user in users.filter(email='jon@example.com'): # uses the email index
    for order in orders.filter(user_id=user.id): # uses the userid index
        yield (user, order)
</code></pre>

<p>Now, let's try joining all the three tables to get the full details of all the orders of a user:</p>
<pre><code>EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    INNER JOIN articles
        ON orders.article_id=articles.id
    WHERE users.email='jon@example.com';
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.45..41.71 rows=6 width=460)
   -&gt;  Nested Loop  (cost=0.30..40.57 rows=6 width=238)
         -&gt;  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
               Index Cond: ((email)::text = 'jon@example.com'::text)
         -&gt;  Index Scan using orders_userid_index on orders  (cost=0.15..32.31 rows=9 width=16)
               Index Cond: (user_id = users.id)
   -&gt;  Index Scan using articles_pkey on articles  (cost=0.15..0.19 rows=1 width=222)
         Index Cond: (id = orders.article_id)
</code></pre>

<p>This is similar to joining the two tables, but we have another loop to join the third table.
In pseudocode, this is what's going on:</p>
<pre><code>for user in users.filter(email='jon@example.com'): # uses the email index
    for order in orders.filter(user_id=user.id): # uses the userid index
        for article in articles.filter(id=order.article_id): # uses the articles primary key index
            yield (user, order, article)
</code></pre>

<p>Instead of an index scan, you might come across a bitmap heap scan like this:</p>
<pre><code> EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
----------------------------------------------------------------------------------------
 Nested Loop  (cost=4.37..23.02 rows=6 width=238)
   -&gt;  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
         Index Cond: ((email)::text = 'jon@example.com'::text)
   -&gt;  Bitmap Heap Scan on orders  (cost=4.22..14.76 rows=9 width=16)
         Recheck Cond: (user_id = users.id)
         -&gt;  Bitmap Index Scan on orders_userid_index  (cost=0.00..4.22 rows=9 width=0)
               Index Cond: (user_id = users.id)
</code></pre>

<p>In a plain index scan, postgres fetches a row pointer from the index and immediately
fetches that row from the table. But with a bitmap index scan, all the tuple pointers
that match the filtering condition are gathered and an in-memory bitmap data structure
is created. From this, the actual tuples in the table are visited in physical location order.
This improves the locality of accesses to the table, which matters a lot for spinning-rust HDDs but
can also help with SSDs. What's that "recheck condition"? If the bitmap gets large,
postgres converts it to a lossy bitmap that stores only physical pages instead of individual
tuples. So when the tuples are read from the physical pages, postgres has to recheck which
tuples in that page match the filtering condition. Bitmap scans also do well when there
are multiple filtering conditions using ORs and ANDs since the bitmap data structure supports
these operations efficiently.</p>
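<p>The difference between the two access patterns can be sketched in a few lines of Python. This is a toy model and not how Postgres is implemented: the <code>heap</code> dict of pages, the <code>index</code> dict of tuple pointers, and both function names are made up for illustration.</p>

```python
# Toy model (all names hypothetical): the table ("heap") is a dict of pages,
# each page is a list of rows. A tuple pointer is (page_number, slot).
heap = {
    0: [{"id": 1, "user_id": 1}, {"id": 2, "user_id": 2}],
    1: [{"id": 3, "user_id": 1}, {"id": 4, "user_id": 3}],
    2: [{"id": 5, "user_id": 2}, {"id": 6, "user_id": 1}],
}
# Index on user_id: key -> tuple pointers, in index order (not heap order).
index = {1: [(2, 1), (0, 0), (1, 0)], 2: [(0, 1), (2, 0)], 3: [(1, 1)]}


def index_scan(user_id):
    """Follow each pointer as soon as it is read from the index."""
    rows, pages_visited = [], []
    for page, slot in index.get(user_id, []):
        pages_visited.append(page)
        rows.append(heap[page][slot])
    return rows, pages_visited


def bitmap_scan(user_id, lossy=False):
    """Collect all pointers first, then visit the heap in physical order."""
    pointers = index.get(user_id, [])
    rows, pages_visited = [], []
    if lossy:
        # Lossy bitmap: only page numbers are remembered, so every row on
        # each page must be rechecked against the condition.
        for page in sorted({p for p, _ in pointers}):
            pages_visited.append(page)
            rows.extend(r for r in heap[page] if r["user_id"] == user_id)
    else:
        for page, slot in sorted(pointers):
            pages_visited.append(page)
            rows.append(heap[page][slot])
    return rows, pages_visited


rows_a, order_a = index_scan(1)
rows_b, order_b = bitmap_scan(1)
rows_c, order_c = bitmap_scan(1, lossy=True)
print(order_a)  # pages visited in index order
print(order_b)  # pages visited in physical order
```

<p>Both scans return the same rows. Only the order in which pages are touched differs, and the lossy variant pays for its smaller bitmap with the recheck.</p>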
<p>Postgres collects statistics about the content of the tables. If it expects very few rows
to match the filtering condition, an index scan is preferred. If many rows are expected to
satisfy the filtering condition, a bitmap scan is preferred. If a substantial portion of the table
is likely to be fetched, the sequential scan wins. One of the authors of Postgres, Tom Lane, has
an email thread on <a href="https://www.postgresql.org/message-id/12553.1135634231@sss.pgh.pa.us">this topic</a>.</p>
<p>The <a href="https://stackoverflow.com/a/49024533">three types of joins</a> you're likely to come across in a
query plan are: (a) Nested loop join, (b) Hash join, and (c) Merge join. For example, here is a hash join:</p>
<pre><code> EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
--------------------------------------------------------------------
 Hash Join  (cost=14.03..47.45 rows=12 width=238)
   Hash Cond: (orders.user_id = users.id)
   -&gt;  Seq Scan on orders  (cost=0.00..28.50 rows=1850 width=16)
   -&gt;  Hash  (cost=14.00..14.00 rows=2 width=222)
         -&gt;  Seq Scan on users  (cost=0.00..14.00 rows=2 width=222)
               Filter: ((email)::text = 'jon@example.com'::text)
</code></pre>

<p>The hash join can be used only when the join condition is the equality operator.
Postgres constructs an in-memory hash table of the filtered users and scans the
orders table and retains those tuples which have a matching user id in the
constructed hash table.</p>
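<p>In the same spirit as the nested loop pseudocode, here is a rough runnable sketch of the hash join on the sample rows. The <code>hash_join</code> function and the list-of-dicts representation are assumptions for illustration only and say nothing about Postgres internals.</p>

```python
# Sample rows mirroring the users and orders tables from earlier in the post.
users = [
    {"id": 1, "email": "jon@example.com"},
    {"id": 2, "email": "Jane@example.com"},
]
orders = [
    {"id": 1, "user_id": 1, "article_id": 1, "qty": 2},
    {"id": 2, "user_id": 1, "article_id": 2, "qty": 1},
    {"id": 3, "user_id": 2, "article_id": 1, "qty": 4},
    {"id": 4, "user_id": 2, "article_id": 2, "qty": 3},
]


def hash_join(users, orders, email):
    # Build phase: hash the filtered users by the join key (users.id).
    table = {}
    for user in users:
        if user["email"] == email:
            table.setdefault(user["id"], []).append(user)
    # Probe phase: one sequential pass over orders, keeping the matches.
    joined = []
    for order in orders:
        for user in table.get(order["user_id"], []):
            joined.append((user, order))
    return joined


result = hash_join(users, orders, "jon@example.com")
print([order["id"] for _, order in result])
```

<p>The dictionary lookup in the probe phase is why the join condition has to be an equality: a hash table cannot answer "less than" or "between" questions.</p>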
<p>The nested loop is the preferred option when at least one side of the join has very
few matching tuples. Hash join is used when both sides of the join have a large number of tuples.
Merge join is preferred when both sides of the join are large but can be sorted on the joining
condition using an index.</p>
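<p>No merge join plan is shown above, so here is a generic textbook sketch of the idea, assuming both inputs already arrive sorted on the join key (for instance, by walking an index). The <code>merge_join</code> function and the sample rows are hypothetical.</p>

```python
def merge_join(left, right, left_key, right_key):
    """Join two lists that are already sorted on their join keys."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row sharing this key, then advance left.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                joined.append((left[i], right[k]))
                k += 1
            i += 1
    return joined


users_sorted = [{"id": 1}, {"id": 2}, {"id": 3}]
orders_sorted = [
    {"id": 1, "user_id": 1},
    {"id": 2, "user_id": 1},
    {"id": 3, "user_id": 2},
    {"id": 4, "user_id": 2},
]
pairs = [
    (u["id"], o["id"])
    for u, o in merge_join(users_sorted, orders_sorted, "id", "user_id")
]
print(pairs)
```

<p>Each input is walked once from start to end, which is why this strategy is attractive when both sides are large and a sorted order is cheap to obtain.</p>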
<p>All the SQL statements used in this article are <a href="explain.sql">here</a>.</p>]]></description>
        </item>
        
        <item>
            <title>Variational Video Prediction</title>
            <link>http://www.sagargv.com/blog/variational-video-prediction/</link>
            <pubDate>Sun, 01 Sep 2019 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/variational-video-prediction/</guid>
            <description><![CDATA[
<p>Just like how your smartphone's keyboard can predict the next word
you're likely to type based on the last few words you entered, one can
predict future frames of a video by looking at the current frame.
This is really useful in <a href="https://arxiv.org/abs/1605.07157">model-based reinforcement learning</a>
where it endows an agent with the ability to predict the future and plan a sequence of
actions based on those predictions.
It helps to dramatically cut down the number of samples needed for training.</p>
<p><img alt="Neural video predictor" src="det.png"></p>
<p>But there is a problem. What if the dynamics of the environment have some
randomness or things you cannot easily model? When you push a pen across
the table, it might move a little faster because you applied more force than you intended to.
Sometimes the pen moves by 1 cm, sometimes 1.5 cm and so on. If a deterministic neural
network is used to model this phenomenon (say, trained to minimize the squared error),
the randomness is modeled as blur. The network averages out the different possible
outcomes. This is problematic because the blur gets worse the further you predict into the future.</p>
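<p>A tiny numeric example makes the averaging concrete. It assumes a 1-D "frame" with one white pixel that moves right by 2 px or 3 px with equal probability. The <code>shift</code> and <code>expected_sq_error</code> helpers are made up for this illustration.</p>

```python
def shift(frame, px):
    """Move the contents of a 1-D frame right by px pixels (zero fill)."""
    return [0.0] * px + frame[:len(frame) - px]


frame = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
outcomes = [shift(frame, 2), shift(frame, 3)]  # two equally likely futures


def expected_sq_error(pred):
    """Squared error of a prediction, averaged over the possible futures."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, target))
        for target in outcomes
    ) / len(outcomes)


# The per-pixel mean of the outcomes minimizes the expected squared error.
blurry = [sum(px) / len(outcomes) for px in zip(*outcomes)]

print(blurry)
print(expected_sq_error(blurry), expected_sq_error(outcomes[0]))
```

<p>The smeared prediction, with half intensity at both candidate positions, scores better than committing to either sharp outcome, so that is what a deterministic network trained this way drifts toward.</p>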
<p><img alt="Variational Predictor" src="detvar.png"></p>
<p><a href="https://arxiv.org/abs/1710.11252">Variational inference</a> can address this problem.
Suppose that the white square in the picture can move by either 2 px or 3 px in one frame.
If we're told at training time whether the square moved by 2 px or 3 px (via a one-hot vector),
this can be an additional input to the network. With this, the neural network can
learn to move the white square by the right number of pixels without any blur.
During inference, the one-hot vector can be chosen randomly, which would result in the white square
moving by either 2 px or 3 px. But we don't actually know by how many pixels the white square
moved during training. Another neural network to the rescue! This encoder network looks at the input and
the label and predicts the probability of choosing either of the one-hot vectors as input to the video predictor.
The <a href="https://arxiv.org/abs/1611.01144">Gumbel-softmax reparametrization</a> can be used to sample from this
distribution during training.</p>
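<p>For reference, here is a minimal sketch of Gumbel-softmax sampling in plain Python. It is not taken from the linked Keras code. The function name, the temperature <code>tau=0.5</code>, and the example probabilities of 0.7 and 0.3 are assumptions chosen for illustration.</p>

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Return a relaxed one-hot sample; tau is the temperature."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u uniform in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical encoder output: 70% "moved 2 px", 30% "moved 3 px".
logits = [math.log(0.7), math.log(0.3)]
sample = gumbel_softmax_sample(logits, tau=0.5, rng=rng)

# The largest entry of each sample follows the encoder's distribution.
counts = [0, 0]
for _ in range(10000):
    y = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
    counts[y.index(max(y))] += 1
print(sample, counts)
```

<p>Because the sample is a smooth function of the logits, gradients can flow from the video predictor back into the encoder. Lowering <code>tau</code> pushes the samples closer to true one-hot vectors.</p>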
<p>The code for this network in Keras is <a href="var_translate_pred.py">here</a>.</p>]]></description>
        </item>
        
        <item>
            <title>Learning to play Pong using PPO in PyTorch</title>
            <link>http://www.sagargv.com/blog/pong-ppo/</link>
            <pubDate>Thu, 23 May 2019 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/pong-ppo/</guid>
            <description><![CDATA[
<p>The rules of Atari Pong are simple enough. You get a point if you put the ball past your opponent, and
your opponent gets a point if the ball goes past you. How do we train a neural network to look at the pixels
on the screen and decide whether to go up or down?</p>
<p><img alt="Atari Pong" src="atari-pong.jpg"></p>
<p>Unlike supervised learning, no labels are available. So, we turn to reinforcement learning.
Policy gradients are one way to update the weights of the network
to maximize the reward. The idea is to start with random initialization, i.e., the network predicts about 50% probability
for both up and down regardless of the observation and to roll out the policy (play the game).
At each time step, the network looks at the frame and predicts the probability of going up and down.
We sample from this distribution and take the sampled action. At the end of the episode, the weights of
the network are updated to increase the probability of taking a certain action if that action led to a positive reward
and decrease the probability of taking an action if it led to a negative reward. This is how plain policy gradient works.
It is similar to supervised learning, but with each sample in the cross entropy loss function weighted by the reward for that episode (the
labels are the actions that were sampled during the policy roll out). Here's the math:</p>
<p><img alt="Policy Gradient Equations" src="pg-eqn.png"></p>
<p>Policy gradients as described above suffer from the problem that the weight update after a policy roll out might
change the probability of taking a certain action by a large amount. This is undesirable because the gradients are noisy
and making large changes to the network after every policy roll out causes convergence problems. Why not reduce the step size?
This can work but if the step size is reduced too much, then learning will be hopelessly slow. So, plain policy gradients are
sensitive to the step size. One solution to this problem is to limit (constrain) the KL divergence between the probability of actions
before and after the weight update. That's what <a href="https://arxiv.org/pdf/1502.05477">Trust Region Policy Optimization (TRPO)</a> does, but it needs conjugate gradients. <a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization (PPO)</a> is a simplification that modifies the loss function to discourage large probability changes, here by clipping the ratio of new to old action probabilities. This has an effect similar to TRPO and works well in practice.</p>
<p><img alt="ratio" src="r.png">
<img alt="Clipped loss function" src="l-clip.png"></p>
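<p>To get a feel for the clipped objective, here it is for a single sample in plain Python. The <code>clipped_objective</code> helper is made up for illustration. The default <code>eps_clip=0.1</code> matches the value used in the PyTorch code below.</p>

```python
def clipped_objective(ratio, advantage, eps_clip=0.1):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = new_prob / old_prob for the action that was sampled
good_up = clipped_objective(1.5, advantage=1.0)     # capped at (1 + eps) * A
good_down = clipped_objective(0.5, advantage=1.0)   # not capped
bad_down = clipped_objective(0.5, advantage=-1.0)   # capped at (1 - eps) * A
bad_up = clipped_objective(1.5, advantage=-1.0)     # not capped
print(good_up, good_down, bad_down, bad_up)
```

<p>Once the ratio has moved past the clip range in the direction the advantage favors, the objective goes flat, so there is no gradient pushing it further. Moves in the wrong direction are never clipped and are still fully penalized.</p>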
<p>Here is code implementing PPO in PyTorch (also in this <a href="https://gist.github.com/s-gv/b13974f896c7baf81ea3a83cf1af4a66">Gist</a>).</p>
<pre><code>import random
import gym
import numpy as np
from PIL import Image
import torch
from torch.nn import functional as F
from torch import nn

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()

        self.gamma = 0.99
        self.eps_clip = 0.1

        self.layers = nn.Sequential(
            nn.Linear(6000, 512), nn.ReLU(),
            nn.Linear(512, 2),
        )

    def state_to_tensor(self, I):
        &quot;&quot;&quot; prepro 210x160x3 uint8 frame into 6000 (75x80) 1D float vector. See Karpathy's post: http://karpathy.github.io/2016/05/31/rl/ &quot;&quot;&quot;
        if I is None:
            return torch.zeros(1, 6000)
        I = I[35:185] # crop - remove 35px from the top &amp; 25px from the bottom of the image, to reduce redundant parts of image (i.e. after ball passes paddle)
        I = I[::2,::2,0] # downsample by factor of 2.
        I[I == 144] = 0 # erase background (background type 1)
        I[I == 109] = 0 # erase background (background type 2)
        I[I != 0] = 1 # everything else (paddles, ball) just set to 1. this makes the image grayscale effectively
        return torch.from_numpy(I.astype(np.float32).ravel()).unsqueeze(0)

    def pre_process(self, x, prev_x):
        return self.state_to_tensor(x) - self.state_to_tensor(prev_x)

    def convert_action(self, action):
        return action + 2

    def forward(self, d_obs, action=None, action_prob=None, advantage=None, deterministic=False):
        if action is None:
            with torch.no_grad():
                logits = self.layers(d_obs)
                if deterministic:
                    action = int(torch.argmax(logits[0]).detach().cpu().numpy())
                    action_prob = 1.0
                else:
                    c = torch.distributions.Categorical(logits=logits)
                    action = int(c.sample().cpu().numpy()[0])
                    action_prob = float(c.probs[0, action].detach().cpu().numpy())
                return action, action_prob
        '''
        # policy gradient (REINFORCE)
        logits = self.layers(d_obs)
        loss = F.cross_entropy(logits, action, reduction='none') * advantage
        return loss.mean()
        '''

        # PPO
        vs = np.array([[1., 0.], [0., 1.]])
        ts = torch.FloatTensor(vs[action.cpu().numpy()])

        logits = self.layers(d_obs)
        r = torch.sum(F.softmax(logits, dim=1) * ts, dim=1) / action_prob
        loss1 = r * advantage
        loss2 = torch.clamp(r, 1-self.eps_clip, 1+self.eps_clip) * advantage
        loss = -torch.min(loss1, loss2)
        loss = torch.mean(loss)

        return loss

env = gym.make('PongNoFrameskip-v4')
env.reset()

policy = Policy()

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

reward_sum_running_avg = None
for it in range(100000):
    d_obs_history, action_history, action_prob_history, reward_history = [], [], [], []
    for ep in range(10):
        obs, prev_obs = env.reset(), None
        for t in range(190000):
            #env.render()

            d_obs = policy.pre_process(obs, prev_obs)
            with torch.no_grad():
                action, action_prob = policy(d_obs)

            prev_obs = obs
            obs, reward, done, info = env.step(policy.convert_action(action))

            d_obs_history.append(d_obs)
            action_history.append(action)
            action_prob_history.append(action_prob)
            reward_history.append(reward)

            if done:
                reward_sum = sum(reward_history[-(t + 1):]) # this episode has t+1 steps because t starts at 0
                reward_sum_running_avg = 0.99*reward_sum_running_avg + 0.01*reward_sum if reward_sum_running_avg is not None else reward_sum
                print('Iteration %d, Episode %d (%d timesteps) - last_action: %d, last_action_prob: %.2f, reward_sum: %.2f, running_avg: %.2f' % (it, ep, t, action, action_prob, reward_sum, reward_sum_running_avg))
                break

    # compute advantage
    R = 0
    discounted_rewards = []

    for r in reward_history[::-1]:
        if r != 0: R = 0 # scored/lost a point in pong, so reset reward sum
        R = r + policy.gamma * R
        discounted_rewards.insert(0, R)

    discounted_rewards = torch.FloatTensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / discounted_rewards.std()

    # update policy
    for _ in range(5):
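        n_batch = min(24576, len(action_history)) # random.sample raises ValueError if asked for more items than are available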
        idxs = random.sample(range(len(action_history)), n_batch)
        d_obs_batch = torch.cat([d_obs_history[idx] for idx in idxs], 0)
        action_batch = torch.LongTensor([action_history[idx] for idx in idxs])
        action_prob_batch = torch.FloatTensor([action_prob_history[idx] for idx in idxs])
        advantage_batch = torch.FloatTensor([discounted_rewards[idx] for idx in idxs])

        opt.zero_grad()
        loss = policy(d_obs_batch, action_batch, action_prob_batch, advantage_batch)
        loss.backward()
        opt.step()

    if it % 5 == 0:
        torch.save(policy.state_dict(), 'params.ckpt')

env.close()
</code></pre>
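<p>The "compute advantage" step in the training loop is easy to check in isolation. Below is the same backward pass as a standalone function, run on a short made-up reward sequence. The function name and <code>gamma=0.5</code> are chosen only to keep the numbers readable.</p>

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted return per time step, restarting at every scored point."""
    R = 0.0
    out = []
    for r in reversed(rewards):
        if r != 0:
            R = 0.0  # a point was scored or lost, so start a new sum
        R = r + gamma * R
        out.append(R)
    out.reverse()
    return out


returns = discounted_returns([0, 0, 1, 0, -1], gamma=0.5)
print(returns)
```

<p>Frames leading up to a won point get positive returns that decay with distance from the point, and the reset keeps the later lost point from leaking back into them. The training loop then normalizes these values to zero mean and unit variance before using them as advantages.</p>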

<p>After training for 4000 episodes, the policy network consistently beat the "computer player" with an average reward of +14.
Here is a video of the agent playing (the agent controls the green paddle to the right).</p>
<p><img alt="Neural Network playing Pong" src="https://www.youtube.com/embed/qDVqNXrZRbo"></p>]]></description>
        </item>
        
        <item>
            <title>Meta Learning in PyTorch</title>
            <link>http://www.sagargv.com/blog/meta-learning-in-pytorch/</link>
            <pubDate>Wed, 07 Nov 2018 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/meta-learning-in-pytorch/</guid>
            <description><![CDATA[
<p>Got an image recognition problem? A pre-trained ResNet is
probably a good starting point. Transfer learning, where
the weights of a pre-trained network are fine tuned for the
task at hand, is widely used because it can drastically reduce
both the amount of data to be collected and the total time
spent training the network. But ResNet wasn't trained
with the intention of being a good starting point for transfer
learning. It just so happens that it works well. But what if a
network is trained specifically to obtain weights that are good
for generalizing to a new task? That's what meta learning aims to do.</p>
<p>The usual setting in meta learning involves a distribution of tasks.
During training, a large number of tasks, but with only a few
labeled examples per task, are available. At "test time", a new,
previously unseen, task is provided with a few examples. Using only
these few examples, the network must learn to generalize to new
examples of the same task. In meta learning, this is accomplished
by running a few steps of gradient descent on the examples of the new
task provided during test. So, the goal of the training process
is to discover similarities between tasks and find network weights that
serve as a good starting point for gradient descent at test time on a
new task.</p>
<h2>Model Agnostic Meta Learning (MAML)</h2>
<p><a href="https://arxiv.org/abs/1703.03400">MAML</a> differentiates through the
stochastic gradient descent (SGD) update steps and learns weights that
are a good starting point for SGD at test time, i.e., gradient descent-ception.
This is what the training loop looks like:</p>
<pre><code>- randomly initialize network weights W
for it in range(num_iterations):
    - Sample a task from the training set and get a few
      labeled examples for that task
    - Compute loss L using current weights W
    - Wn = W - inner_lr * dL/dW
    - Compute loss Ln using tuned weights Wn
    - Update W = W - outer_lr * dLn/dW
</code></pre>

<p>To compute the loss <code>Ln</code>, the tuned weights <code>Wn</code> are used.
But, notice that gradients of the loss with respect to the
original weights <code>dLn/dW</code> are needed. Computing this involves
finding higher-order derivatives of the loss with respect to the
original weights <code>W</code>.</p>
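<p>A scalar example shows where the higher-order term comes from. Assume, purely for illustration, an inner loss <code>L(W) = W**4 / 4</code> and an outer loss <code>Ln = 0.5 * (Wn - target)**2</code>. All function names below are made up, and the analytic gradient is checked against a finite difference.</p>

```python
def inner_grad(w):
    return w ** 3  # dL/dW for the assumed inner loss L(W) = W**4 / 4


def inner_hess(w):
    return 3 * w ** 2  # second derivative of the inner loss


def adapted(w, inner_lr):
    return w - inner_lr * inner_grad(w)  # Wn = W - inner_lr * dL/dW


def outer_loss(w, target, inner_lr):
    return 0.5 * (adapted(w, inner_lr) - target) ** 2


def maml_grad(w, target, inner_lr):
    # Chain rule through the update: dLn/dW = dLn/dWn * dWn/dW, where
    # dWn/dW = 1 - inner_lr * d2L/dW2 carries the second derivative.
    wn = adapted(w, inner_lr)
    return (wn - target) * (1.0 - inner_lr * inner_hess(w))


def first_order_grad(w, target, inner_lr):
    # Treats dWn/dW as 1, i.e. drops the second-derivative term.
    return adapted(w, inner_lr) - target


w, target, inner_lr, h = 1.5, 0.2, 0.1, 1e-6
numeric = (
    outer_loss(w + h, target, inner_lr) - outer_loss(w - h, target, inner_lr)
) / (2 * h)
print(maml_grad(w, target, inner_lr), numeric, first_order_grad(w, target, inner_lr))
```

<p>The full gradient agrees with the finite difference, while the version that ignores the second derivative is noticeably off. In the PyTorch code, <code>create_graph=True</code> is what lets autograd keep track of this extra factor.</p>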
<p>At test time:</p>
<pre><code>- Given trained weights W and a few examples of a new task
- Compute loss L using weights W
- Wn = W - inner_lr * dL/dW
- Use Wn to make predictions for that task
</code></pre>

<p>Let's try learning to generate a sine wave from only 4 data points.
To keep it simple, let's fix the amplitude and frequency but randomly
select the phase to be either 0 or 180 degrees. At test time, the model
must figure out what the phase is and generate the sine wave from
only 4 example data points.</p>
<pre><code>import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])
    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    b = 0 if random.choice([True, False]) else math.pi

    x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    y = torch.sin(x + b)

    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        f = net(x, new_params)
        loss = F.l1_loss(f, y)

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, loss.item()))

    v_f = net(v_x, new_params)
    loss2 = F.l1_loss(v_f, v_y)
    loss2.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss2.item()))

t_b = math.pi #0

t_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
t_y = torch.sin(t_x + t_b)

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f = net(t_x, t_params)
    t_loss = F.l1_loss(t_f, t_y)

    grads = torch.autograd.grad(t_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]


test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.plot(t_x.data.numpy(), t_y.data.numpy(), 'o', label='Examples')
plt.legend()
plt.savefig('maml-sine.png')
</code></pre>

<p>Here is the sine wave the network constructs after looking at
only 4 points at test time:</p>
<p><img alt="MAML Demo" src="maml-sine.png"></p>
<p>There's a variant of the MAML algorithm called FO-MAML (first-order MAML)
that ignores higher-order derivatives.
<a href="https://arxiv.org/abs/1803.02999">Reptile</a> is a similar algorithm
proposed by OpenAI that's simpler to implement. Check out their
<a href="https://blog.openai.com/reptile/">javascript demo</a>.</p>
<h2>Domain Adaptive Meta Learning (DAML)</h2>
<p><a href="https://arxiv.org/abs/1802.01557">DAML</a> uses meta learning to
tune the parameters of the network to accommodate large domain
shifts in the input. This method also doesn't need labels in
the source domain!</p>
<p>Consider a neural network that takes <code>x</code> as
input and produces <code>y = net(x)</code>. The source domain is a distribution
from which the input <code>x</code> may be drawn. Likewise, the target
domain is another distribution of inputs. Domain
adaptation is what has to be done to get the network to work
when the distribution of the input is changed from the source
domain to the target domain. The idea in DAML is to use meta learning
to tune the weights of the network based on examples in the source
domain so that the network can do well on examples drawn from the
target domain. During training, unlabeled examples from the source
domain and the corresponding examples with labels in the target domain
are available. This is the training loop of DAML:</p>
<pre><code>- randomly initialize network weights W and the adaptation
  loss network weights W_adap
for it in range(num_iterations):
    - Sample a task from the training set
    - Compute adaptation loss (L_adap) using (W, W_adap) and 
      unlabeled training data in the source domain
    - Wn = W - inner_lr * dL_adap/dW
    - Compute training loss (Ln) from labeled training data
      in the target domain using the tuned weights Wn
    - (W, W_adap) = (W, W_adap) - outer_lr * dLn/d(W, W_adap)
</code></pre>

<p>Since we don't have labeled data in the source domain,
we must also learn a loss function <code>L_adap</code> parameterized by <code>W_adap</code>.</p>
<p>At test time:</p>
<pre><code>- Given trained weights (W, W_adap) and a few unlabeled
  examples of a new task
- Compute adaptation loss (L_adap) using weights (W, W_adap) and
  unlabeled examples in the source domain
- Wn = W - inner_lr * dL_adap/dW
- Use Wn to make predictions for that task for new inputs in
  the target domain
</code></pre>

<p>Once again, let's try learning to generate sine waves.
In the target domain, the input, <code>x</code>, to the network is drawn uniformly
from <code>[-2*PI, 2*PI]</code>, and the network has to
predict <code>y = sin(x)</code> or <code>y = sin(x + PI)</code>. Which of the two the network must
predict has to be inferred from a single
unlabeled input in the source domain. In the source domain, an input <code>x</code>
drawn uniformly from <code>[PI/4, PI/2]</code> specifies that
zero phase is wanted, and an input drawn from <code>[-PI/2, -PI/4]</code>
specifies that a 180 degree phase is wanted. The source domain input is used
to compute gradients of the learnt adaptation loss with respect to the weights,
and a few steps of gradient descent tune the weights of the network. Once
we have the tuned weights, they can be used in the target domain to
predict a sine wave of the desired phase.</p>
<pre><code>import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x1 = F.relu(x)

    x = F.linear(x1, params[2], params[3])
    x2 = F.relu(x)

    y = F.linear(x2, params[4], params[5])

    return y, x2, x1

def adap_net(y, x2, x1, params):
    x = torch.cat([y, x2, x1], dim=1)

    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])

    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

adap_params = [
    torch.Tensor(32, 1+32+32).uniform_(-1./math.sqrt(65), 1./math.sqrt(65)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params + adap_params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    b = 0 if random.choice([True, False]) else math.pi

    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        f, f2, f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), new_params)
        h = adap_net(f, f2, f1, adap_params)
        adap_loss = F.l1_loss(h, torch.zeros(1, 1))

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(adap_loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, adap_loss.item()))

    v_f, _, _ = net(v_x, new_params)
    loss = F.l1_loss(v_f, v_y)
    loss.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss.item()))

t_b = math.pi # 0

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f, t_f2, t_f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if t_b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), t_params)
    t_h = adap_net(t_f, t_f2, t_f1, adap_params)
    t_adap_loss = F.l1_loss(t_h, torch.zeros(1, 1))

    grads = torch.autograd.grad(t_adap_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]

test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f, _, _ = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.legend()
plt.savefig('daml-sine.png')
</code></pre>

<p>This is the sine wave constructed by the network after domain adaptation:</p>
<p><img alt="DAML Demo" src="daml-sine.png"></p>]]></description>
        </item>
        
        <item>
            <title>How malloc gets memory from the OS</title>
            <link>http://www.sagargv.com/blog/how-malloc-gets-memory-from-os/</link>
            <pubDate>Sun, 22 Apr 2018 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/how-malloc-gets-memory-from-os/</guid>
            <description><![CDATA[
<p>In the old days of 8086, 16-bit programs accessed physical memory directly.
This would be valid code and would work:</p>
<pre><code>int main()
{
    int *p = (int *)0x02ad;
    return *p;
}
</code></pre>

<p>x86 processors still boot into 16-bit real mode where this is fine, but the
OS switches the processor into protected mode which enables virtual memory.
Once virtual memory is enabled, each process has its own virtual memory
that the OS has to map (to physical memory, files on the hard drive, device registers, etc.).
If the program tries to access unmapped memory, a segfault happens.</p>
<p>When Linux starts a process and loads the executable to memory, the layout of
the virtual address space looks something like this:</p>
<pre><code>---------------
|             |
|    stack    |
|             |
--------------- 0x7ffc725866b4
|             |
|             |
|             |
|   unmapped  |
|    space    |
|             |
|             |
|             |
--------------- 0x000001773000
|             |
| data (bss)  |
|             |
---------------
|             |
|    data     |
|             |
---------------
|             |
|    text     |
|             |
---------------
</code></pre>

<p>The <code>text</code> segment contains the binary code of the executable, the <code>data</code> segment
has initialized static variables, the <code>bss</code> segment has uninitialized static variables
(zeroed out before the main() function is called), and the <code>stack</code> segment contains the stack.
(There's also space for the environment variables, and the OS kernel space is also mapped
for performance reasons, but I've skipped these in the diagram.) The addresses of
these segments are randomized when the executable is loaded as a security measure (ASLR).</p>
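<p>On Linux, a process can inspect its own layout through <code>/proc/self/maps</code>. The Python sketch below is Linux-only, and the helper name <code>parse_maps</code> is made up for this example. It lists the <code>[heap]</code> and <code>[stack]</code> regions of the running interpreter; running it a few times should show the addresses moving because of ASLR.</p>

```python
import os

def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional name.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, fields[1], name))
    return regions

if __name__ == "__main__":
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as f:
            for start, end, perms, name in parse_maps(f.read()):
                if name in ("[heap]", "[stack]"):
                    print("%-8s 0x%012x-0x%012x %s" % (name, start, end, perms))
```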
<p>When <code>malloc()</code> is called, it tries to allocate memory from previously freed memory that
is still mapped to the process. But if there is insufficient free memory, <code>malloc()</code> must
make one of these system calls to request the OS to map additional memory:</p>
<ul>
<li>
<p>The <code>brk</code> / <code>sbrk</code> system calls enlarge the data segment. In the diagram above, calling <code>sbrk(8)</code>
would move the end of the data segment from <code>0x1773000</code> to <code>0x1773008</code>. If the process wants
to free the memory and return it to the OS, the data segment can be shrunk with the same syscalls.</p>
</li>
<li>
<p>The <code>mmap</code> syscall can map pages anywhere in the virtual address space (the equivalent syscall
in Windows is <code>VirtualAlloc</code>).</p>
</li>
</ul>
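<p>Python's standard <code>mmap</code> module exposes the second mechanism directly. The sketch below asks the OS for one anonymous page (passing <code>-1</code> as the file descriptor means the mapping is not backed by a file), writes to it, reads it back, and unmaps it. The function name is an assumption for illustration; this shows the mechanism only, not what glibc's <code>malloc</code> does internally.</p>

```python
import mmap

def anonymous_page_roundtrip(payload=b"hello"):
    # fileno=-1 requests an anonymous mapping (no backing file).
    region = mmap.mmap(-1, mmap.PAGESIZE)
    try:
        region[:len(payload)] = payload      # write into the mapped page
        return bytes(region[:len(payload)])  # read the same bytes back
    finally:
        region.close()                       # release the mapping

if __name__ == "__main__":
    print(anonymous_page_roundtrip())
```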
<p>The <code>malloc</code> implementation in glibc uses <code>sbrk</code> for small requests and <code>mmap</code>
for large ones (the cutoff, <code>M_MMAP_THRESHOLD</code>, defaults to 128 KiB). The reason <code>mmap</code> is preferred for large objects is to prevent
losing too much memory to fragmentation in the data segment: if a small object is allocated with <code>sbrk</code>
after a large object and the large object is then freed, that memory cannot be returned to the OS
until the small object is freed as well.</p>]]></description>
        </item>
        
        <item>
            <title>Host Your Own Private Git Repos</title>
            <link>http://www.sagargv.com/blog/host-your-own-private-git-repos/</link>
            <pubDate>Sat, 31 Mar 2018 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/host-your-own-private-git-repos/</guid>
            <description><![CDATA[
<p>Hosting git repos on your own server is actually quite easy.
Log in to the server, create a new directory, and initialize a bare repo:</p>
<pre><code>mkdir foo.git
cd foo.git
git init --bare
</code></pre>

<p>That's it! Now, from the client, clone this repo with:</p>
<pre><code>git clone username@example.com:path/to/foo.git
</code></pre>

<p>Having a dedicated user for git repos on the server makes it easier to share access to the repo.
Create a new user <code>git</code> with a login shell restricted to git commands:</p>
<pre><code>sudo adduser --shell $(which git-shell) git
</code></pre>

<p>Now create a repo in the home directory of the <code>git</code> user:</p>
<pre><code>cd /home/git
sudo -u git mkdir bar.git
cd bar.git
sudo -u git git init --bare
</code></pre>

<p>As before, clone the new repo from the client using:</p>
<pre><code>git clone git@example.com:bar
</code></pre>

<h2>Back up the repos</h2>
<p>This is my script to take daily backups of all the git repos on the server to Amazon S3.</p>
<pre><code>#!/bin/bash

set -e

GITDIR=/home/git
TMPDIR=/tmp/gitbackup

renice -n 15 $$

trap &quot;rm -f /tmp/gitbackup/*.git.tar.gz&quot; EXIT

mkdir -p ${TMPDIR}
cd ${TMPDIR}

for proj in ${GITDIR}/*.git; do
    base=$(basename $proj)
    tar -C $GITDIR -zcf ${base}.tar.gz $base
done

export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=yyyyy
export AWS_DEFAULT_REGION=us-west-2

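# aws s3 cp takes a single source, so copy the directory and filter by pattern
aws s3 cp ${TMPDIR} s3://mygitbucket/ --recursive --exclude &quot;*&quot; --include &quot;*.git.tar.gz&quot;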
</code></pre>

<p>If the repos are large, it might be worthwhile checking whether
the hash of the gzipped repo has changed before uploading.
It's also a good idea to use <code>envdir</code> to manage the access keys rather
than putting them in the backup script.</p>
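<p>One way to do the hash check, sketched in Python rather than bash: keep the SHA-256 digest of each archive in a small state file and upload only when the digest differs. The state directory layout and the function names are assumptions for this example, and the upload itself is left to the caller.</p>

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large archives need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def _state_file(archive, state_dir):
    return os.path.join(state_dir, os.path.basename(archive) + ".sha256")

def needs_upload(archive, state_dir):
    """Return True if the archive differs from the last recorded digest."""
    state_file = _state_file(archive, state_dir)
    if not os.path.exists(state_file):
        return True
    with open(state_file) as f:
        return f.read().strip() != sha256_of(archive)

def mark_uploaded(archive, state_dir):
    """Record the digest; call this only after the upload has succeeded."""
    os.makedirs(state_dir, exist_ok=True)
    with open(_state_file(archive, state_dir), "w") as f:
        f.write(sha256_of(archive) + "\n")
```

<p>The state directory has to live somewhere that survives between runs, so it should not be the temporary directory that the backup script cleans up.</p>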
<h2>Web front-end using cgit and nginx</h2>
<p>Sometimes it's useful to view source code and commits in a
web browser. <code>cgit</code> is an awesome lightweight webapp for this.
Unlike heavy apps like GitLab, <code>cgit</code> needs no database, which
reduces the administrative burden.</p>
<p>Install cgit, nginx, fcgiwrap, and apache2-utils (to create a <code>.htpasswd</code> file).</p>
<pre><code>sudo apt install cgit nginx fcgiwrap apache2-utils
</code></pre>

<p>Specify the location of the git repos and static assets in the 
<code>cgit</code> config at <code>/etc/cgitrc</code>.</p>
<pre><code>css=/cgit-static/cgit.css
logo=/cgit-static/cgit.png
favicon=/cgit-static/favicon.ico

#source-filter=/usr/lib/cgit/filters/syntax-highlighting.py

scan-path=/home/git/
</code></pre>

<p>To get syntax highlighting, install <code>python-pygments</code> and uncomment the source-filter option.</p>
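<p>If you'd like to password protect access to <code>www.example.com/cgit/</code>, create a <code>.htpasswd</code> file
(<code>-c</code> creates the file and overwrites an existing one, so omit it when adding more users later):</p>
<pre><code>sudo htpasswd -c /etc/nginx/.htpasswd &lt;username&gt;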
</code></pre>

<p>This is my <code>nginx</code> conf file to serve <code>cgit</code> from <code>www.example.com/cgit/</code>.</p>
<pre><code>server {
    listen 80;
    listen [::]:80;

    server_name www.example.com;

    location /.well-known/acme-challenge/ {
        root /var/www/www.example.com;
    }
    location / {
        return 301 https://www.example.com$request_uri;
    }
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name www.example.com;

    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;

    location /cgit-static/ {
        alias /usr/share/cgit/;
    }

    location /cgit/ {
        auth_basic &quot;Restricted&quot;;
        auth_basic_user_file /etc/nginx/.htpasswd;

        include fastcgi_params;
        fastcgi_split_path_info ^(/cgit)(.*)$;
        fastcgi_param   PATH_INFO        $fastcgi_path_info;
        fastcgi_param   SCRIPT_FILENAME  /usr/lib/cgit/cgit.cgi;
        fastcgi_param   QUERY_STRING     $args;
        fastcgi_param   HTTP_HOST        $server_name;
        fastcgi_pass    unix:/var/run/fcgiwrap.socket;
    }

    location / {
        root /var/www/www.example.com;
    }
}
</code></pre>

<p>You might also want to restrict repo access to only whitelisted IPs.</p>]]></description>
        </item>
        
        <item>
            <title>Pagination in SQL</title>
            <link>http://www.sagargv.com/blog/sql-pagination/</link>
            <pubDate>Sun, 27 Aug 2017 23:59:59 PST</pubDate>

            <guid>http://www.sagargv.com/blog/sql-pagination/</guid>
            <description><![CDATA[
<p>Here are two ways to paginate the results of a SQL query that work across most
of the popular SQL database systems (some spell <code>LIMIT</code> / <code>OFFSET</code> differently).</p>
<h2>Truncate the results</h2>
<p>Silly though it sounds, this might be a reasonable strategy. Suppose you want
to show 15 results per page. Then, show up to 20 pages, and stop there. This works
well when it's unlikely that anyone would want to see past the first few pages.
Incidentally, Google does something like this for web search results.</p>
<pre><code>SELECT * FROM users ORDER BY creation_date LIMIT 15 OFFSET 45;
</code></pre>

<p>This query is not efficient for large offsets because rows up to the offset have
to be read and discarded. But that's OK since the offset is limited to a few
hundred rows at most. It's a net win if only the first few pages are read most
of the time.</p>
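<p>Here is a small Python sketch of this strategy using the standard <code>sqlite3</code> module. The page cap is enforced before the query runs, so the offset can never grow large. Apart from <code>creation_date</code>, the <code>users</code> columns and the helper names are assumptions for this example.</p>

```python
import sqlite3

PAGE_SIZE = 15
MAX_PAGES = 20

def fetch_page(conn, page):
    """Return the rows of a 1-based page, refusing pages past the cap."""
    if not 1 <= page <= MAX_PAGES:
        raise ValueError("page must be between 1 and %d" % MAX_PAGES)
    offset = (page - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT id, name, creation_date FROM users "
        "ORDER BY creation_date LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()

def make_demo_db(n_users=100):
    # In-memory table with strictly increasing creation dates.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users "
        "(id INTEGER PRIMARY KEY, name TEXT, creation_date INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (name, creation_date) VALUES (?, ?)",
        [("user%d" % i, 1000 + i) for i in range(n_users)],
    )
    return conn

if __name__ == "__main__":
    for row in fetch_page(make_demo_db(), 4):  # page 4 means OFFSET 45
        print(row)
```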
<h2>Keep track of the first and last result in a page</h2>
<p>This is based on the idea that random access is not really needed and that it's often
enough to reach only the next page and the previous page from any given page.
When you're on the fourth page, accessing a random page, say page 3124, might
be inefficient. But accessing the third and fifth pages is efficient if the
right indexes have been set up. This is accomplished by keeping track of the first
and last values of the column on which the results are ordered.</p>
<pre><code>SELECT * FROM users WHERE creation_date &gt; ? ORDER BY creation_date LIMIT 15;
</code></pre>

<p>When the next page is requested, the query is executed with the <code>creation_date</code>
of the last user in the current page. For the previous page, the <code>creation_date</code>
of the first user in the current page is used:</p>
<pre><code>SELECT * FROM users WHERE creation_date &lt; ? ORDER BY creation_date DESC LIMIT 15;
</code></pre>

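<p>The previous-page query returns rows in descending order, so reverse them before displaying the page.
If the column by which the results are sorted is not unique, add additional columns or
the primary key to <code>ORDER BY</code> and keep track of the first and last values of those columns as well.</p>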
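<p>The following Python sketch (standard <code>sqlite3</code> module) puts both queries together and uses the primary key as a tie-breaker. The table layout, index name, and function names are assumptions for this example. The comparison is written out with <code>OR</code> / <code>AND</code> instead of a row-value comparison so that it does not depend on database support for the latter.</p>

```python
import sqlite3

PAGE_SIZE = 15

def next_page(conn, last_date, last_id):
    # Rows strictly after (last_date, last_id) in (creation_date, id) order.
    return conn.execute(
        "SELECT id, creation_date FROM users "
        "WHERE creation_date > ? OR (creation_date = ? AND id > ?) "
        "ORDER BY creation_date, id LIMIT ?",
        (last_date, last_date, last_id, PAGE_SIZE),
    ).fetchall()

def prev_page(conn, first_date, first_id):
    # Rows strictly before (first_date, first_id), fetched in descending
    # order and then reversed so the page reads in ascending order.
    rows = conn.execute(
        "SELECT id, creation_date FROM users "
        "WHERE creation_date < ? OR (creation_date = ? AND id < ?) "
        "ORDER BY creation_date DESC, id DESC LIMIT ?",
        (first_date, first_date, first_id, PAGE_SIZE),
    ).fetchall()
    return rows[::-1]

def make_demo_db(n_users=100):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, creation_date INTEGER)"
    )
    # Two users share each creation_date, so the id tie-breaker matters.
    conn.executemany(
        "INSERT INTO users (id, creation_date) VALUES (?, ?)",
        [(i, 1000 + (i - 1) // 2) for i in range(1, n_users + 1)],
    )
    conn.execute("CREATE INDEX users_date_id ON users (creation_date, id)")
    return conn

if __name__ == "__main__":
    conn = make_demo_db()
    page1 = next_page(conn, -1, -1)  # sentinel values smaller than any row
    last_id, last_date = page1[-1]
    page2 = next_page(conn, last_date, last_id)
    print(page2[0], page2[-1])
```

<p>In the demo data the boundary between the first two pages falls between two users with the same <code>creation_date</code>; filtering on <code>creation_date</code> alone would skip or repeat one of them.</p>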
<p>Another example of using this method for pagination is in the <a href="http://www.sqlite.org/cvstrac/wiki?p=ScrollingCursor">SQLite wiki</a>.</p>]]></description>
        </item>
        
    </channel>
</rss>