Sagar's blog

Reinforcement Learning with PyBullet in Colab

Tue, 06 Jul 2021 23:59:59 PST

Some tutorials on reinforcement learning in PyBullet using soft actor-critic with a simple custom robot that can move in only one plane:

This can serve as a template to get started with a custom robot and might help in debugging some learning algorithms that directly control the actuators of a robot.

Tutorials on using PyBullet in Colab

Tue, 06 Jul 2021 23:59:59 PST

Some tutorials on using PyBullet in Google Colab:

Tutorials on Computer Vision using Keras in Colab

Tue, 06 Jul 2021 23:59:59 PST

Some tutorials on computer vision using Tensorflow / Keras in Colab:

How to read a query plan from Postgres EXPLAIN

Thu, 31 Dec 2020 23:59:59 PST

Got a web app responding too slowly because of a database query? The query plan is what you'd want to look at. Let's look at the execution plan that Postgres shows you when you use the EXPLAIN statement and see how to interpret that.

The Postgres documentation on using EXPLAIN is excellent, but I thought writing a concise version would serve as a note to my future self. To start off, let's run with a simple schema of three tables: (a) Users, (b) Articles, and (c) Orders.

users:
 id |      email
----+------------------
  1 | jon@example.com
  2 | Jane@example.com

articles:
 id |    name
----+-------------
  1 | Playstation
  2 | Xbox

orders:
 id | user_id | article_id | qty
----+---------+------------+-----
  1 |       1 |          1 |   2
  2 |       1 |          2 |   1
  3 |       2 |          1 |   4
  4 |       2 |          2 |   3

Suppose we want to query the users table by email address. The plan that Postgres generates is:

EXPLAIN SELECT * FROM users WHERE email='jon@example.com';
---------------------------------------------------------------------------------
 Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
   Index Cond: ((email)::text = 'jon@example.com'::text)

Since we have an index on the email column, postgres has decided to use that to search the users table. If we didn't have that index, it would do a sequential scan over the entire table:

EXPLAIN SELECT * FROM users WHERE email='jon@example.com';
----------------------------------------------------------
 Seq Scan on users  (cost=0.00..14.00 rows=2 width=222)
   Filter: ((email)::text = 'jon@example.com'::text)
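
The numbers in parentheses are the planner's estimates: the startup cost and total cost (in arbitrary cost units), the number of rows it expects the node to return, and the average row width in bytes. As a small sketch (the helper name parse_plan_line is mine, not something Postgres provides), here is one way to pull those estimates out of a plan line so the two plans can be compared:

```python
import re

# Matches the parenthesised estimate block that EXPLAIN prints for each plan node.
PLAN_RE = re.compile(
    r"^\s*(?:->\s*)?(?P<node>.+?)\s+"
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)

def parse_plan_line(line):
    """Split one EXPLAIN node line into its description and estimates."""
    m = PLAN_RE.match(line)
    if m is None:
        return None  # lines such as "Index Cond: ..." carry no estimates
    return {
        "node": m.group("node"),
        "startup_cost": float(m.group("startup")),
        "total_cost": float(m.group("total")),
        "rows": int(m.group("rows")),
        "width": int(m.group("width")),
    }

seq = parse_plan_line(" Seq Scan on users  (cost=0.00..14.00 rows=2 width=222)")
idx = parse_plan_line(" Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)")
print(seq["total_cost"], idx["total_cost"])  # 14.0 8.17
```

The index scan has the lower estimated total cost, which is why the planner picks it when the index exists.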

To get all the orders of a user, the query plan is:

 EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..40.57 rows=6 width=238)
   ->  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
         Index Cond: ((email)::text = 'jon@example.com'::text)
   ->  Index Scan using orders_userid_index on orders  (cost=0.15..32.31 rows=9 width=16)
         Index Cond: (user_id = users.id)

What's going on here? First, Postgres uses the index on the email column to search the users table for users with the email jon@example.com. For each matching row (we expect a single matching row in this example), an index scan is performed over the orders table to get all orders for that user id. Translated to pseudocode:

for user in users.filter(email='jon@example.com'): # uses the email index
    for order in orders.filter(user_id=user.id): # uses the userid index
        yield (user, order)

Now, let's try joining all the three tables to get the full details of all the orders of a user:

EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    INNER JOIN articles
        ON orders.article_id=articles.id
    WHERE users.email='jon@example.com';
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.45..41.71 rows=6 width=460)
   ->  Nested Loop  (cost=0.30..40.57 rows=6 width=238)
         ->  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
               Index Cond: ((email)::text = 'jon@example.com'::text)
         ->  Index Scan using orders_userid_index on orders  (cost=0.15..32.31 rows=9 width=16)
               Index Cond: (user_id = users.id)
   ->  Index Scan using articles_pkey on articles  (cost=0.15..0.19 rows=1 width=222)
         Index Cond: (id = orders.article_id)

This is similar to joining the two tables, but we have another loop to join the third table. In pseudocode, this is what's going on:

for user in users.filter(email='jon@example.com'): # uses the email index
    for order in orders.filter(user_id=user.id): # uses the userid index
        for article in articles.filter(id=order.article_id): # uses the articles primary key index
            yield (user, order, article)
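
For something you can actually run, here is a rough version of the same loops over the sample rows from the tables above. The dictionaries are only stand-ins for the indexes (a real Postgres index is a B-tree by default, not a hash map), and the function name is made up for this sketch:

```python
# Sample rows from the tables at the top of the article.
users = {1: "jon@example.com", 2: "Jane@example.com"}              # id -> email
articles = {1: "Playstation", 2: "Xbox"}                           # id -> name (primary key lookup)
orders = [(1, 1, 1, 2), (2, 1, 2, 1), (3, 2, 1, 4), (4, 2, 2, 3)]  # id, user_id, article_id, qty

# Stand-ins for users_email_index and orders_userid_index.
users_email_index = {email: uid for uid, email in users.items()}
orders_userid_index = {}
for order in orders:
    orders_userid_index.setdefault(order[1], []).append(order)

def orders_with_details(email):
    """Nested loops: users by email, then orders by user id, then article by id."""
    uid = users_email_index.get(email)
    if uid is None:
        return
    for order in orders_userid_index.get(uid, []):
        article_name = articles[order[2]]
        yield (email, order[0], article_name, order[3])

print(list(orders_with_details("jon@example.com")))
# [('jon@example.com', 1, 'Playstation', 2), ('jon@example.com', 2, 'Xbox', 1)]
```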

Instead of an index scan, you might come across a bitmap heap scan like this:

 EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
----------------------------------------------------------------------------------------
 Nested Loop  (cost=4.37..23.02 rows=6 width=238)
   ->  Index Scan using users_email_index on users  (cost=0.15..8.17 rows=1 width=222)
         Index Cond: ((email)::text = 'jon@example.com'::text)
   ->  Bitmap Heap Scan on orders  (cost=4.22..14.76 rows=9 width=16)
         Recheck Cond: (user_id = users.id)
         ->  Bitmap Index Scan on orders_userid_index  (cost=0.00..4.22 rows=9 width=0)
               Index Cond: (user_id = users.id)

In a plain index scan, Postgres fetches a row pointer from the index and immediately fetches that row from the table. With a bitmap index scan, all the tuple pointers that match the filtering condition are gathered first and an in-memory bitmap data structure is built from them. The actual tuples in the table are then visited in physical location order. This improves locality of accesses to the table, which matters a lot for spinning rust HDDs and can help with SSDs too.

What's that "recheck condition"? If the bitmap gets large, Postgres converts it to a lossy bitmap that stores only physical pages instead of individual tuples. So when the tuples are read from the physical pages, Postgres has to recheck which tuples in each page match the filtering condition. Bitmap scans also do well when there are multiple filtering conditions combined with ORs and ANDs, since the bitmap data structure supports these operations efficiently.
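
Here is a toy model of the difference in access order. Everything in it (the heap layout, the page numbers, the function names) is invented for illustration and has nothing to do with how Postgres stores data internally:

```python
# Toy heap: page number -> user_id stored in each tuple slot on that page.
heap = {1: [5, 9, 5], 3: [9, 5], 7: [5, 9, 5]}

# (page, offset) pointers for user_id = 5, in the order an index might return them.
index_matches = [(7, 2), (1, 0), (7, 0), (3, 1), (1, 2)]

def plain_index_scan(matches):
    """Pages are fetched in index order, so the same page may be visited repeatedly."""
    return [page for page, _ in matches]

def bitmap_heap_scan(matches):
    """Collect every pointer first, then visit each page once in physical order."""
    bitmap = {}
    for page, offset in matches:
        bitmap.setdefault(page, set()).add(offset)
    return sorted(bitmap)

def lossy_bitmap_scan(matches, wanted_user_id):
    """Keep only page numbers, then recheck the condition on every tuple in those pages."""
    pages = sorted({page for page, _ in matches})
    return [(page, offset)
            for page in pages
            for offset, user_id in enumerate(heap[page])
            if user_id == wanted_user_id]  # this is the recheck

print(plain_index_scan(index_matches))  # [7, 1, 7, 3, 1]
print(bitmap_heap_scan(index_matches))  # [1, 3, 7]
```

The lossy variant finds the same tuples as the exact one, at the price of re-evaluating the condition on every tuple of each visited page.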

Postgres collects statistics about the content of the tables. If it expects very few rows to match the filtering condition, an index scan is preferred. If many rows are expected to satisfy the filtering condition, a bitmap scan is preferred. If a substantial portion of the table is likely to be fetched, the sequential scan wins. One of the authors of Postgres, Tom Lane, has an email thread on this topic.

The three types of joins you're likely to come across in a query plan are: (a) Nested loop join, (b) Hash join, and (c) Merge join. For example, here is a hash join:

 EXPLAIN SELECT *
    FROM orders
    INNER JOIN users
        ON orders.user_id=users.id
    WHERE users.email='jon@example.com';
--------------------------------------------------------------------
 Hash Join  (cost=14.03..47.45 rows=12 width=238)
   Hash Cond: (orders.user_id = users.id)
   ->  Seq Scan on orders  (cost=0.00..28.50 rows=1850 width=16)
   ->  Hash  (cost=14.00..14.00 rows=2 width=222)
         ->  Seq Scan on users  (cost=0.00..14.00 rows=2 width=222)
               Filter: ((email)::text = 'jon@example.com'::text)

The hash join can be used only when the join condition is the equality operator. Postgres constructs an in-memory hash table of the filtered users and scans the orders table and retains those tuples which have a matching user id in the constructed hash table.
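
In the same pseudocode spirit as before, a hash join over the sample rows looks roughly like this (a sketch; the Python dict plays the role of the in-memory hash table):

```python
users = [(1, "jon@example.com"), (2, "Jane@example.com")]          # id, email
orders = [(1, 1, 1, 2), (2, 1, 2, 1), (3, 2, 1, 4), (4, 2, 2, 3)]  # id, user_id, article_id, qty

def hash_join(email):
    # Build phase: hash the filtered users on the join key (users.id).
    table = {uid: (uid, mail) for uid, mail in users if mail == email}
    # Probe phase: scan orders once and keep rows whose user_id is in the table.
    for order in orders:
        user = table.get(order[1])
        if user is not None:
            yield (order, user)

result = list(hash_join("jon@example.com"))
print(result)
# [((1, 1, 1, 2), (1, 'jon@example.com')), ((2, 1, 2, 1), (1, 'jon@example.com'))]
```

The probe is a dictionary lookup by key, which is why this strategy only works for equality conditions.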

The nested loop is the preferred option when at least one side of the join has very few matching tuples. Hash join is used when both sides of the join have a large number of tuples. Merge join is preferred when both sides of the join are large but can be sorted on the joining condition using an index.
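
For completeness, here is a minimal sketch of a merge join. It assumes both inputs are already sorted on the join key and that the key is unique on the users side (true for a primary key); handling duplicates on both sides needs extra bookkeeping that I've left out:

```python
def merge_join(users_sorted, orders_sorted):
    """users_sorted: (id, email) sorted by id, ids unique.
    orders_sorted: (id, user_id, article_id, qty) sorted by user_id."""
    out = []
    i = j = 0
    while i < len(users_sorted) and j < len(orders_sorted):
        user_id = users_sorted[i][0]
        order_user_id = orders_sorted[j][1]
        if user_id < order_user_id:
            i += 1            # this user has no more orders
        elif user_id > order_user_id:
            j += 1            # this order has no matching user
        else:
            out.append((orders_sorted[j], users_sorted[i]))
            j += 1            # stay on the same user: more orders may match
    return out

users_sorted = [(1, "jon@example.com"), (2, "Jane@example.com")]
orders_sorted = [(1, 1, 1, 2), (2, 1, 2, 1), (3, 2, 1, 4), (4, 2, 2, 3)]
joined = merge_join(users_sorted, orders_sorted)
print(len(joined))  # 4
```

Each input is walked exactly once, which is what makes this attractive when both sides are large and an index can supply the sorted order.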

All the SQL statements used in this article are here.

Variational Video Prediction

Sun, 01 Sep 2019 23:59:59 PST

Just like how your smartphone's keyboard can predict the next word you're likely to type based on the last few words you entered, one can predict future frames of a video by looking at the current frame. This is really useful in model-based reinforcement learning, where it endows an agent with the ability to predict the future and plan a sequence of actions based on those predictions. It helps to dramatically cut down the number of samples needed for training.

But there is a problem. What if the dynamics of the environment have some randomness or things you cannot easily model? When you push a pen across the table, it might move a little faster because you applied more force than you intended to. Sometimes the pen moves by 1 cm, sometimes 1.5 cm and so on. If a deterministic neural network is used to model this phenomenon (i.e., you minimize the least squares error), the randomness is modeled as blur. The network averages out the different possible outcomes. This is problematic because the blur gets worse the further you predict into the future.
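
A tiny numeric illustration of that averaging, assuming the two outcomes (1 cm and 1.5 cm) are equally likely: the single prediction with the lowest mean squared error is their mean, which is a distance the pen never actually moves.

```python
outcomes = [1.0, 1.5]  # cm, assumed equally likely

def mse(prediction):
    """Mean squared error of one fixed prediction against all outcomes."""
    return sum((prediction - o) ** 2 for o in outcomes) / len(outcomes)

# Try predictions from 1.00 to 1.50 in steps of 0.05 and keep the best one.
candidates = [1.0 + 0.05 * k for k in range(11)]
best = min(candidates, key=mse)
print(round(best, 2))  # 1.25
```

For images, the same effect shows up as a smeared object covering all the plausible positions at once.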

Variational inference can address this problem. Suppose that the white square in the picture can move by either 2 px or 3 px in one frame. If we're told at training time whether the square moved by 2 px or 3 px (via a one-hot vector), this can be an additional input to the network. With this, the neural network can learn to move the white square by the right number of pixels without any blur. During inference, the one-hot vector can be chosen randomly, which would result in the white square moving by either 2 px or 3 px. But we don't actually know by how many pixels the white square moved during training. Another neural network to the rescue! This encoder network looks at the input and the label and predicts the probability of choosing either of the one-hot vectors as input to the video predictor. The Gumbel-softmax reparametrization can be used to sample from this distribution during training.
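
To make the last step concrete, here is a plain-Python sketch of Gumbel-softmax sampling. It is not the Keras code from this post; the function name, the temperature value and the example probabilities are assumptions for illustration. In a real model the same arithmetic would be done with framework tensors so gradients can flow back to the encoder's logits.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng=random):
    """Draw one relaxed (soft) one-hot sample from the categorical given by logits."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u clamped away from 0.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output: 80% "moved 2 px", 20% "moved 3 px".
logits = [math.log(0.8), math.log(0.2)]
sample = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
print(sample)  # two non-negative numbers that sum to 1, usually close to one-hot
```

Lower temperatures push the samples closer to exact one-hot vectors, and the index of the largest entry follows the original probabilities.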

The code for this network in Keras is here.

Learning to play Pong using PPO in PyTorch

Thu, 23 May 2019 23:59:59 PST

The rules of Atari Pong are simple enough. You get a point if you put the ball past your opponent, and your opponent gets a point if the ball goes past you. How do we train a neural network to look at the pixels on the screen and decide whether to go up or down?

Unlike supervised learning, no labels are available. So, we turn to reinforcement learning. Policy gradients are one way to update the weights of the network to maximize the reward. The idea is to start with random initialization, i.e., the network predicts about 50% probability for both up and down regardless of the observation and to roll out the policy (play the game). At each time step, the network looks at the frame and predicts the probability of going up and down. We sample from this distribution and take the sampled action. At the end of the episode, the weights of the network are updated to increase the probability of taking a certain action if that action led to a positive reward and decrease the probability of taking an action if it led to a negative reward. This is how plain policy gradient works. It is similar to supervised learning, but with each sample in the cross entropy loss function weighted by the reward for that episode (the labels are the actions that were sampled during the policy roll out). Here's the math:
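
One standard way to write the reward-weighted cross entropy described above (the notation is my choice for this sketch: \pi_\theta is the policy network, s_t the observation, a_t the sampled action, and A_t the discounted, normalized reward used as the weight for time step t):

```latex
L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} A_t \, \log \pi_\theta(a_t \mid s_t),
\qquad
\nabla_\theta L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

Minimizing L raises the log-probability of actions with positive A_t and lowers it for actions with negative A_t.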

Policy gradients as described above suffer from the problem that the weight update after a policy roll out might change the probability of taking a certain action by a large amount. This is undesirable because the gradients are noisy, and making large changes to the network after every policy roll out causes convergence problems. Why not reduce the step size? This can work, but if the step size is reduced too much, learning will be hopelessly slow. So, plain policy gradients are sensitive to the step size. One solution to this problem is to limit (constrain) the KL divergence between the probability of actions before and after the weight update. That's what Trust Region Policy Optimization (TRPO) does, but it needs conjugate gradients. Proximal Policy Optimization (PPO) is a simplification that either adds a KL penalty to the loss function or, as in the code below, clips the ratio of new to old action probabilities so that large probability changes earn no extra credit. This has an effect similar to TRPO and works well in practice.
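
The listing in this post uses the clipped form of PPO (see eps_clip in the Policy class). As a minimal scalar sketch of that objective for a single sample (the function name and the example probabilities are made up for illustration):

```python
def ppo_clip_objective(new_prob, old_prob, advantage, eps_clip=0.1):
    """Clipped surrogate objective for one sample; training maximizes this value."""
    ratio = new_prob / old_prob
    clipped = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    return min(ratio * advantage, clipped * advantage)

# Good action whose probability went from 0.5 to 0.75: credit is capped at 1.1 * advantage.
print(ppo_clip_objective(0.75, 0.5, advantage=1.0))   # 1.1
# Bad action whose probability dropped from 0.5 to 0.25: no extra credit below 0.9.
print(ppo_clip_objective(0.25, 0.5, advantage=-1.0))  # -0.9
```

Once the ratio leaves the interval [1 - eps_clip, 1 + eps_clip] in the direction the advantage favours, the objective is flat, so the gradient for that sample vanishes and the update stops pushing further.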

Here is code implementing PPO in PyTorch (also in this Gist).

import random
import gym
import numpy as np
from PIL import Image
import torch
from torch.nn import functional as F
from torch import nn

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()

        self.gamma = 0.99
        self.eps_clip = 0.1

        self.layers = nn.Sequential(
            nn.Linear(6000, 512), nn.ReLU(),
            nn.Linear(512, 2),
        )

    def state_to_tensor(self, I):
        """ prepro 210x160x3 uint8 frame into 6000 (75x80) 1D float vector. See Karpathy's post: http://karpathy.github.io/2016/05/31/rl/ """
        if I is None:
            return torch.zeros(1, 6000)
        I = I[35:185] # crop - remove 35px from start & 25px from end of image in x, to reduce redundant parts of image (i.e. after ball passes paddle)
        I = I[::2,::2,0] # downsample by factor of 2.
        I[I == 144] = 0 # erase background (background type 1)
        I[I == 109] = 0 # erase background (background type 2)
        I[I != 0] = 1 # everything else (paddles, ball) just set to 1. this makes the image grayscale effectively
        return torch.from_numpy(I.astype(np.float32).ravel()).unsqueeze(0)

    def pre_process(self, x, prev_x):
        return self.state_to_tensor(x) - self.state_to_tensor(prev_x)

    def convert_action(self, action):
        return action + 2

    def forward(self, d_obs, action=None, action_prob=None, advantage=None, deterministic=False):
        if action is None:
            with torch.no_grad():
                logits = self.layers(d_obs)
                if deterministic:
                    action = int(torch.argmax(logits[0]).detach().cpu().numpy())
                    action_prob = 1.0
                else:
                    c = torch.distributions.Categorical(logits=logits)
                    action = int(c.sample().cpu().numpy()[0])
                    action_prob = float(c.probs[0, action].detach().cpu().numpy())
                return action, action_prob
        '''
        # policy gradient (REINFORCE)
        logits = self.layers(d_obs)
        loss = F.cross_entropy(logits, action, reduction='none') * advantage
        return loss.mean()
        '''

        # PPO
        vs = np.array([[1., 0.], [0., 1.]])
        ts = torch.FloatTensor(vs[action.cpu().numpy()])

        logits = self.layers(d_obs)
        r = torch.sum(F.softmax(logits, dim=1) * ts, dim=1) / action_prob
        loss1 = r * advantage
        loss2 = torch.clamp(r, 1-self.eps_clip, 1+self.eps_clip) * advantage
        loss = -torch.min(loss1, loss2)
        loss = torch.mean(loss)

        return loss

env = gym.make('PongNoFrameskip-v4')
env.reset()

policy = Policy()

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

reward_sum_running_avg = None
for it in range(100000):
    d_obs_history, action_history, action_prob_history, reward_history = [], [], [], []
    for ep in range(10):
        obs, prev_obs = env.reset(), None
        for t in range(190000):
            #env.render()

            d_obs = policy.pre_process(obs, prev_obs)
            with torch.no_grad():
                action, action_prob = policy(d_obs)

            prev_obs = obs
            obs, reward, done, info = env.step(policy.convert_action(action))

            d_obs_history.append(d_obs)
            action_history.append(action)
            action_prob_history.append(action_prob)
            reward_history.append(reward)

            if done:
                reward_sum = sum(reward_history[-(t + 1):])  # this episode ran for t+1 steps
                reward_sum_running_avg = reward_sum if reward_sum_running_avg is None else 0.99*reward_sum_running_avg + 0.01*reward_sum
                print('Iteration %d, Episode %d (%d timesteps) - last_action: %d, last_action_prob: %.2f, reward_sum: %.2f, running_avg: %.2f' % (it, ep, t, action, action_prob, reward_sum, reward_sum_running_avg))
                break

    # compute advantage
    R = 0
    discounted_rewards = []

    for r in reward_history[::-1]:
        if r != 0: R = 0 # scored/lost a point in pong, so reset reward sum
        R = r + policy.gamma * R
        discounted_rewards.insert(0, R)

    discounted_rewards = torch.FloatTensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / discounted_rewards.std()

    # update policy
    for _ in range(5):
        n_batch = 24576
        idxs = random.sample(range(len(action_history)), n_batch)
        d_obs_batch = torch.cat([d_obs_history[idx] for idx in idxs], 0)
        action_batch = torch.LongTensor([action_history[idx] for idx in idxs])
        action_prob_batch = torch.FloatTensor([action_prob_history[idx] for idx in idxs])
        advantage_batch = torch.FloatTensor([discounted_rewards[idx] for idx in idxs])

        opt.zero_grad()
        loss = policy(d_obs_batch, action_batch, action_prob_batch, advantage_batch)
        loss.backward()
        opt.step()

    if it % 5 == 0:
        torch.save(policy.state_dict(), 'params.ckpt')

env.close()

After training for 4000 episodes, the policy network consistently beat the "computer player" with an average reward of +14. Here is a video of the agent playing (the agent controls the green paddle on the right).

Meta Learning in PyTorch

Wed, 07 Nov 2018 23:59:59 PST

Got an image recognition problem? A pre-trained ResNet is probably a good starting point. Transfer learning, where the weights of a pre-trained network are fine tuned for the task at hand, is widely used because it can drastically reduce both the amount of data to be collected and the total time spent training the network. But ResNet wasn't trained with the intention of being a good starting point for transfer learning. It just so happens that it works well. But what if a network is trained specifically to obtain weights that are good for generalizing to a new task? That's what meta learning aims to do.

The usual setting in meta learning involves a distribution of tasks. During training, a large number of tasks, but with only a few labeled examples per task, are available. At "test time", a new, previously unseen, task is provided with a few examples. Using only these few examples, the network must learn to generalize to new examples of the same task. In meta learning, this is accomplished by running a few steps of gradient descent on the examples of the new task provided during test. So, the goal of the training process is to discover similarities between tasks and find network weights that serve as a good starting point for gradient descent at test time on a new task.

Model Agnostic Meta Learning (MAML)

MAML differentiates through the stochastic gradient descent (SGD) update steps and learns weights that are a good starting point for SGD at test time, i.e., gradient descent-ception. This is what the training loop looks like:

- randomly initialize network weights W
for it in range(num_iterations):
    - Sample a task from the training set and get a few
      labeled examples for that task
    - Compute loss L using current weights W
    - Wn = W - inner_lr * dL/dW
    - Compute loss Ln using tuned weights Wn
    - Update W = W - outer_lr * dLn/dW

To compute the loss Ln, the tuned weights Wn are used. But, notice that gradients of the loss with respect to the original weights dLn/dW are needed. Computing this involves finding higher-order derivatives of the loss with respect to the original weights W.
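
A one-weight example makes the higher-order term visible. This sketch assumes a made-up task with inner loss L(W) = (W - a)^2 and outer loss Ln = (Wn - b)^2; the constants and function names are mine. The chain rule gives dLn/dW = dLn/dWn * (1 - inner_lr * d2L/dW2), and dropping the second factor gives a first-order approximation.

```python
inner_lr = 0.1
a, b = 2.0, 3.0  # hypothetical task: inner loss (W - a)^2, outer loss (Wn - b)^2

def adapt(w):
    """One inner SGD step: Wn = W - inner_lr * dL/dW."""
    grad = 2.0 * (w - a)
    return w - inner_lr * grad

def outer_loss(w):
    """Ln, viewed as a function of the ORIGINAL weight W."""
    return (adapt(w) - b) ** 2

def outer_grad_full(w):
    # dLn/dWn times dWn/dW, where dWn/dW = 1 - inner_lr * d2L/dW2 = 1 - 2 * inner_lr
    return 2.0 * (adapt(w) - b) * (1.0 - 2.0 * inner_lr)

def outer_grad_first_order(w):
    # Treats dWn/dW as 1, i.e. ignores the second derivative of the inner loss.
    return 2.0 * (adapt(w) - b)

w = 0.0
h = 1e-6
numeric = (outer_loss(w + h) - outer_loss(w - h)) / (2 * h)  # finite-difference check
print(outer_grad_full(w), numeric, outer_grad_first_order(w))
```

The full gradient agrees with the finite-difference estimate, while the first-order version is off by the factor (1 - 2 * inner_lr). In the PyTorch code below, create_graph=True is what keeps that extra factor in the computation graph.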

At test time:

- Given trained weights W and a few examples of a new task
- Compute loss L using weights W
- Wn = W - inner_lr * dL/dW
- Use Wn to make predictions for that task

Let's try learning to generate a sine wave from only 4 data points. To keep it simple, let's fix the amplitude and frequency but randomly pick the phase to be either 0 or 180 degrees. At test time, the model must figure out what the phase is and generate the sine wave from only 4 example data points.

import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])
    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    b = 0 if random.choice([True, False]) else math.pi

    x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    y = torch.sin(x + b)

    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        f = net(x, new_params)
        loss = F.l1_loss(f, y)

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, loss.item()))

    v_f = net(v_x, new_params)
    loss2 = F.l1_loss(v_f, v_y)
    loss2.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss2.item()))

t_b = math.pi #0

t_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
t_y = torch.sin(t_x + t_b)

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f = net(t_x, t_params)
    t_loss = F.l1_loss(t_f, t_y)

    grads = torch.autograd.grad(t_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]


test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.plot(t_x.data.numpy(), t_y.data.numpy(), 'o', label='Examples')
plt.legend()
plt.savefig('maml-sine.png')

Here is the sine wave the network constructs after looking at only 4 points at test time:

There's a variant of the MAML algorithm called FO-MAML (first-order MAML) that ignores higher-order derivatives. Reptile is a similar algorithm proposed by OpenAI that's simpler to implement. Check out their javascript demo.
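
To show why Reptile is simpler, here is a toy one-weight sketch of its update: run a few plain SGD steps on a sampled task, then move the meta-weight a fraction of the way toward the adapted weight. The task (a quadratic loss with its minimum at +1 or -1), the learning rates and the function names are all assumptions for illustration; this is not OpenAI's code.

```python
import random

def sgd_steps(w, target, lr=0.1, k=5):
    """Run k SGD steps on the task loss (w - target)^2 and return the adapted weight."""
    for _ in range(k):
        w = w - lr * 2.0 * (w - target)
    return w

def reptile(w, num_iterations=200, outer_lr=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(num_iterations):
        target = rng.choice([-1.0, 1.0])  # sample a task
        w_adapted = sgd_steps(w, target)  # inner loop, no second derivatives needed
        w = w + outer_lr * (w_adapted - w)  # move toward the adapted weight
    return w

w_meta = reptile(5.0)
print(w_meta)  # ends up between the two task optima
```

No gradient ever flows through the inner loop, so there is nothing like create_graph=True to manage.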

Domain Adaptive Meta Learning (DAML)

DAML uses meta learning to tune the parameters of the network to accommodate large domain shifts in the input. This method also doesn't need labels in the source domain!

Consider a neural network that takes x as input and produces y = net(x). The source domain is a distribution from which the input x may be drawn. Likewise, the target domain is another distribution of inputs. Domain adaptation is what has to be done to get the network to work when the distribution of the input is changed from the source domain to the target domain. The idea in DAML is to use meta learning to tune the weights of the network based on examples in the source domain so that the network can do well on examples drawn from the target domain. During training, unlabeled examples from the source domain and the corresponding examples with labels in the target domain are available. This is the training loop of DAML:

- randomly initialize network weights W and the adaptation
  loss network weights W_adap
for it in range(num_iterations):
    - Sample a task from the training set
    - Compute adaptation loss (L_adap) using (W, W_adap) and 
      unlabeled training data in the source domain
    - Wn = W - inner_lr * dL_adap/dW
    - Compute training loss (Ln) from labeled training data
      in the target domain using the tuned weights Wn
    - (W, W_adap) = (W, W_adap) - outer_lr * dLn/d(W, W_adap)

Since we don't have labeled data in the source domain, we must also learn a loss function L_adap parameterized by W_adap.

At test time:

- Given trained weights (W, W_adap) and a few unlabeled
  examples of a new task
- Compute adaptation loss (L_adap) using weights (W, W_adap) and
  unlabeled examples in the source domain
- Wn = W - inner_lr * dL_adap/dW
- Use Wn to make predictions for that task for new inputs in
  the target domain

Once again, let's try learning to generate sine waves. In the target domain, the input, x, to the network is drawn from a uniform distribution over [-2*PI, 2*PI], and the network has to predict y = sin(x) or y = sin(x + PI). Whether the network must predict y = sin(x) or y = sin(x + PI) has to be inferred from a single unlabeled input in the source domain. In the source domain, an input drawn uniformly from [PI/4, PI/2] specifies that zero phase is what we want, and an input drawn from [-PI/2, -PI/4] specifies that a 180 degree phase is desired. The source domain input is used to find gradients of the learnt adaptation loss with respect to the weights, and a few steps of gradient descent tune the weights of the network. Once we have the tuned weights, they can be used in the target domain to predict a sine wave of the desired phase.

import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x1 = F.relu(x)

    x = F.linear(x1, params[2], params[3])
    x2 = F.relu(x)

    y = F.linear(x2, params[4], params[5])

    return y, x2, x1

def adap_net(y, x2, x1, params):
    x = torch.cat([y, x2, x1], dim=1)

    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])

    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

adap_params = [
    torch.Tensor(32, 1+32+32).uniform_(-1./math.sqrt(65), 1./math.sqrt(65)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params + adap_params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    b = 0 if random.choice([True, False]) else math.pi

    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        f, f2, f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), new_params)
        h = adap_net(f, f2, f1, adap_params)
        adap_loss = F.l1_loss(h, torch.zeros(1, 1))

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(adap_loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, adap_loss.item()))

    v_f, _, _ = net(v_x, new_params)
    loss = F.l1_loss(v_f, v_y)
    loss.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss.item()))

t_b = math.pi # 0

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f, t_f2, t_f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if t_b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), t_params)
    t_h = adap_net(t_f, t_f2, t_f1, adap_params)
    t_adap_loss = F.l1_loss(t_h, torch.zeros(1, 1))

    grads = torch.autograd.grad(t_adap_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]

test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f, _, _ = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.legend()
plt.savefig('daml-sine.png')

This is the sine wave constructed by the network after domain adaptation:

How malloc gets memory from the OS

Sun, 22 Apr 2018 23:59:59 PST

In the old days of 8086, 16-bit programs accessed physical memory directly. This would be valid code and would work:

int main()
{
    int *p = (int *)0x02ad;
    return *p;
}

x86 processors still boot into 16-bit real mode where this is fine, but the OS switches the processor into protected mode which enables virtual memory. Once virtual memory is enabled, each process has its own virtual memory that the OS has to map (to physical memory, files on the hard drive, device registers, etc.). If the program tries to access unmapped memory, a segfault happens.

When Linux starts a process and loads the executable to memory, the layout of the virtual address space looks something like this:

---------------
|             |
|    stack    |
|             |
--------------- 0x7ffc725866b4
|             |
|             |
|             |
|   unmapped  |
|    space    |
|             |
|             |
|             |
--------------- 0x000001773000
|             |
| data (bss)  |
|             |
---------------
|             |
|    data     |
|             |
---------------
|             |
|    text     |
|             |
---------------

The text segment contains the binary code of the executable, the data segment has initialized static variables, the bss segment has uninitialized static variables (zeroed out before the main() function is called), and the stack segment contains the stack. (There's also space for the environment variables, and the OS kernel space is also mapped for performance reasons, but I've skipped these in the diagram.) The addresses of these segments are randomized when the executable is loaded as a security measure (ASLR).

When malloc() is called, it tries to allocate memory from previously freed memory that is still mapped to the process. But if there is insufficient free memory, malloc() must make one of these system calls to request the OS to map additional memory:

The brk / sbrk system calls enlarge the data segment. In the diagram above, calling sbrk(8) would move the end of the data segment from 0x1773000 to 0x1773008. If the process wants to free the memory and return it to the OS, the data segment can be shrunk with the same syscalls.
The mmap syscall can map pages anywhere in the virtual address space (the equivalent syscall in Windows is VirtualAlloc).
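To make the mmap route concrete, here is a minimal sketch using Python's standard mmap module, which can create an anonymous mapping by passing -1 as the file descriptor. It only illustrates mapping and unmapping pages; it is not how malloc() itself is implemented.

```python
import mmap

# Ask the OS for one page of anonymous memory (fd -1 means "no backing file").
PAGE = mmap.PAGESIZE
region = mmap.mmap(-1, PAGE)

# Freshly mapped anonymous pages start out zero-filled.
assert region[:4] == b"\x00\x00\x00\x00"

# The mapping behaves like a mutable byte buffer.
region[:5] = b"hello"
first_bytes = region[:5]

# Closing the object unmaps the pages and hands them back to the OS.
region.close()
```

Unlike memory obtained with sbrk, each mapping can be released on its own, regardless of what was allocated before or after it.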

The malloc implementation in glibc uses sbrk for small requests and mmap for large ones (the cutoff, M_MMAP_THRESHOLD, defaults to 128 KiB). The reason mmap is preferred for large objects is to avoid losing too much memory to fragmentation in the data segment: if a small object is allocated with sbrk after a large object and the large object is then freed, its memory cannot be returned to the OS until the small object is freed as well.
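The fragmentation problem is easier to see with a toy model. The ToyBrkHeap class below is hypothetical and deliberately simplistic: it only grows and shrinks at the top, like a segment managed with sbrk, and it never reuses holes the way a real allocator would. It is a sketch of the idea, not of glibc's algorithm.

```python
class ToyBrkHeap:
    """Toy heap that can only grow or shrink at its top end, like sbrk."""

    def __init__(self):
        self.blocks = []  # blocks in address order, lowest address first

    def alloc(self, size):
        # New blocks always go on top of the heap in this simplified model.
        self.blocks.append({"size": size, "live": True})
        return len(self.blocks) - 1  # handle = position in the heap

    def free(self, handle):
        self.blocks[handle]["live"] = False
        # Only free blocks at the very top can be returned to the OS.
        while self.blocks and not self.blocks[-1]["live"]:
            self.blocks.pop()

    def mapped_bytes(self):
        # Bytes still mapped into the process, whether live or free.
        return sum(block["size"] for block in self.blocks)


heap = ToyBrkHeap()
large = heap.alloc(1_000_000)
small = heap.alloc(16)

heap.free(large)
stuck = heap.mapped_bytes()     # the large block is free but still mapped

heap.free(small)
released = heap.mapped_bytes()  # now the whole heap can shrink
```

After freeing only the large block, the 16-byte block on top pins the full 1,000,016 bytes; the heap can shrink only once the small block is freed too.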

Host Your Own Private Git Repos

Sat, 31 Mar 2018 23:59:59 PST

Hosting git repos on your own server is actually quite easy. Log in to the server, create a new directory, and initialize a bare repo:

mkdir foo.git
cd foo.git
git init --bare

That's it! Now, from the client, clone this repo with:

git clone username@example.com:path/to/foo.git

Having a dedicated user for git repos on the server makes it easier to share access to the repos. Create a new user git with a login shell restricted to git commands:

sudo adduser --shell $(which git-shell) git

Now create a repo in the home directory of the git user:

cd /home/git
sudo -u git mkdir bar.git
cd bar.git
sudo -u git git init --bare

As before, clone the new repo from the client using:

git clone git@example.com:bar

Back up the repos

This is my script to take daily backups of all the git repos on the server to Amazon S3.

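#!/bin/bash

# Abort on the first failing command.
set -e

GITDIR=/home/git
TMPDIR=/tmp/gitbackup

# Lower the priority of this script so it doesn't starve other processes.
renice -n 15 $$

# Remove the temporary archives when the script exits, even on failure.
trap 'rm -f "${TMPDIR}"/*.git.tar.gz' EXIT

mkdir -p "${TMPDIR}"
cd "${TMPDIR}"

# Create one gzipped tarball per bare repo.
for proj in "${GITDIR}"/*.git; do
    base=$(basename "$proj")
    tar -C "$GITDIR" -zcf "${base}.tar.gz" "$base"
done

export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=yyyyy
export AWS_DEFAULT_REGION=us-west-2

# "aws s3 cp" takes a single source, so copy the directory recursively
# and filter for the archives.
aws s3 cp "${TMPDIR}" s3://mygitbucket/ --recursive --exclude "*" --include "*.git.tar.gz"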

If the repos are large, it might be worthwhile checking whether the hash of the gzipped repo has changed before uploading. It's also a good idea to use envdir to manage the access keys rather than putting them in the backup script.
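The hash check could look something like the sketch below. The function names and the JSON state file are my own assumptions, not part of the backup script above. It also assumes the archive bytes stay identical when the repo is unchanged; if the archive embeds timestamps, the only harm is that it uploads more often than strictly necessary.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_state(state_file: Path) -> dict:
    """Load the {archive name: digest} map, or start empty on the first run."""
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def needs_upload(archive: Path, state: dict) -> bool:
    """True if the archive is new or its digest differs from the recorded one."""
    return state.get(archive.name) != sha256_of(archive)


def mark_uploaded(archive: Path, state: dict, state_file: Path) -> None:
    """Record the digest; call this only after the upload has succeeded."""
    state[archive.name] = sha256_of(archive)
    state_file.write_text(json.dumps(state, indent=2))
```

The backup job would call needs_upload() for each archive, upload only the ones that return True, and then call mark_uploaded() so the next run can skip them.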

Web front-end using cgit and nginx

Sometimes it's useful to view source code and commits in a web browser. cgit is an awesome lightweight webapp for this. Unlike heavy apps like GitLab, cgit needs no database, which reduces the administrative burden.

Install cgit, nginx, fcgiwrap, and apache2-utils (to create a .htpasswd file).

sudo apt install cgit nginx fcgiwrap apache2-utils

Specify the location of the git repos and static assets in the cgit config at /etc/cgitrc.

css=/cgit-static/cgit.css
logo=/cgit-static/cgit.png
favicon=/cgit-static/favicon.ico

#source-filter=/usr/lib/cgit/filters/syntax-highlighting.py

scan-path=/home/git/

To get syntax highlighting, install python-pygments and uncomment the source-filter option.

If you'd like to password protect access to www.example.com/cgit/, create a .htpasswd file (-c creates the file; replace username with the login you want):

sudo htpasswd -c /etc/nginx/.htpasswd username

This is my nginx conf file to serve cgit from www.example.com/cgit/.

server {
    listen 80;
    listen [::]:80;

    server_name www.example.com;

    location /.well-known/acme-challenge/ {
        root /var/www/www.example.com;
    }
    location / {
        return 301 https://www.example.com$request_uri;
    }
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name www.example.com;

    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;

    location /cgit-static/ {
        alias /usr/share/cgit/;
    }

    location /cgit/ {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        include fastcgi_params;
        fastcgi_split_path_info ^(/cgit)(.*)$;
        fastcgi_param   PATH_INFO        $fastcgi_path_info;
        fastcgi_param   SCRIPT_FILENAME  /usr/lib/cgit/cgit.cgi;
        fastcgi_param   QUERY_STRING     $args;
        fastcgi_param   HTTP_HOST        $server_name;
        fastcgi_pass    unix:/var/run/fcgiwrap.socket;
    }

    location / {
        root /var/www/www.example.com;
    }
}

You might also want to restrict repo access to only whitelisted IPs.

Pagination in SQL

Sun, 27 Aug 2017 23:59:59 PST

Here are two ways to paginate the results of a SQL query that work across all the popular SQL database systems.

Truncate the results

Silly though it sounds, this might be a reasonable strategy. Suppose you want to show 15 results per page. Then, show up to 20 pages, and stop there. This works well when it's unlikely that anyone would want to see past the first few pages. Incidentally, Google does something like this for web search results.

SELECT * FROM users ORDER BY creation_date LIMIT 15 OFFSET 45;

This query is not efficient for large offsets because rows up to the offset have to be read and discarded. But that's OK since the offset is limited to a few hundred rows at most. It's a net win if only the first few pages are read most of the time.
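Here's a minimal sketch of the truncation strategy using Python's built-in sqlite3 module. The table contents, the MAX_PAGES cap, and the fetch_page helper are assumptions for illustration; creation_date is stored as a plain integer to keep the example short.

```python
import sqlite3

PAGE_SIZE = 15
MAX_PAGES = 20  # hard cap: nobody gets past page 20


def fetch_page(conn, page):
    """Return the rows for a 1-based page number, or [] outside the cap."""
    if page < 1 or page > MAX_PAGES:
        return []
    offset = (page - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT id, creation_date FROM users "
        "ORDER BY creation_date LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


# Throwaway in-memory table with 400 users; user i was created at "time" i.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, creation_date INTEGER)")
conn.executemany(
    "INSERT INTO users (id, creation_date) VALUES (?, ?)",
    [(i, i) for i in range(1, 401)],
)

page_4 = fetch_page(conn, 4)    # same as LIMIT 15 OFFSET 45
page_21 = fetch_page(conn, 21)  # beyond the cap, so nothing is queried
```

With the cap in place the offset never exceeds 285 rows, so the cost of reading and discarding rows stays bounded no matter how large the table gets.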

Keep track of the first and last result in a page

This is based on the idea that random access is not really needed and that it's often sufficient to access only the next page and the previous page from any given page. When you're on the fourth page, accessing a random page, say page 3124, might be inefficient. But accessing the third and fifth pages is efficient if the right indexes have been set up. This is accomplished by keeping track of the first and last values of the column on which the results are ordered.

SELECT * FROM users WHERE creation_date > ? ORDER BY creation_date LIMIT 15;

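When the next page is requested, the query is executed with the creation_date of the last user in the current page. For the previous page, the creation_date of the first user in the current page is used, and since the rows come back in descending order, they need to be reversed before they are displayed: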

SELECT * FROM users WHERE creation_date < ? ORDER BY creation_date DESC LIMIT 15;

If the column by which the results are sorted is not unique, add additional columns or the primary key to ORDER BY and keep track of the first and last values of those columns as well.
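As a sketch of the tie-breaker idea, the example below paginates on (creation_date, id) using Python's sqlite3 module. The helper names and the sample data are assumptions; several users deliberately share the same creation_date to show that no rows get skipped or repeated.

```python
import sqlite3

PAGE_SIZE = 15


def next_page(conn, last_date, last_id):
    """Rows strictly after (last_date, last_id) in (creation_date, id) order."""
    return conn.execute(
        "SELECT id, creation_date FROM users "
        "WHERE creation_date > ? OR (creation_date = ? AND id > ?) "
        "ORDER BY creation_date, id LIMIT ?",
        (last_date, last_date, last_id, PAGE_SIZE),
    ).fetchall()


def prev_page(conn, first_date, first_id):
    """Rows strictly before (first_date, first_id), returned in display order."""
    rows = conn.execute(
        "SELECT id, creation_date FROM users "
        "WHERE creation_date < ? OR (creation_date = ? AND id < ?) "
        "ORDER BY creation_date DESC, id DESC LIMIT ?",
        (first_date, first_date, first_id, PAGE_SIZE),
    ).fetchall()
    rows.reverse()  # the query walks backwards, so flip the rows for display
    return rows


# 40 users; every group of four shares the same creation_date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, creation_date INTEGER)")
conn.executemany(
    "INSERT INTO users (id, creation_date) VALUES (?, ?)",
    [(i, (i - 1) // 4) for i in range(1, 41)],
)
conn.execute("CREATE INDEX users_date_id ON users (creation_date, id)")

page_1 = next_page(conn, -1, 0)             # sentinel values below every row
page_2 = next_page(conn, page_1[-1][1], page_1[-1][0])
back_to_1 = prev_page(conn, page_2[0][1], page_2[0][0])
```

The first page ends in the middle of a group of users with the same creation_date, and the id comparison is what lets the second page pick up exactly where the first one stopped.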

Another example of using this method for pagination is in the SQLite wiki.