The usual setting in meta learning involves a distribution of tasks. During training, a large number of tasks, but with only a few labeled examples per task, are available. At "test time", a new, previously unseen, task is provided with a few examples. Using only these few examples, the network must learn to generalize to new examples of the same task. In meta learning, this is accomplished by running a few steps of gradient descent on the examples of the new task provided during test. So, the goal of the training process is to discover similarities between tasks and find network weights that serve as a good starting point for gradient descent at test time on a new task.
MAML differentiates through the stochastic gradient descent (SGD) update steps and learns weights that are a good starting point for SGD at test time, i.e., gradient descent-ception. This is what the training loop looks like:
- randomly initialize network weights W
for it in range(num_iterations):
    - Sample a task from the training set and get a few
      labeled examples for that task
    - Compute loss L using current weights W
    - Wn = W - inner_lr * dL/dW
    - Compute loss Ln using tuned weights Wn
    - Update W = W - outer_lr * dLn/dW
To compute the loss Ln, the tuned weights Wn are used. But notice that gradients of the loss with respect to the original weights, dLn/dW, are needed. Computing this involves finding higher-order derivatives of the loss with respect to the original weights W.
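To make the higher-order term concrete, here is a minimal sketch on a one-parameter model f(x) = w*x with a squared loss, worked by hand instead of with autograd. The function names and numbers are illustrative assumptions and are not part of the original code. The full gradient dLn/dW carries the extra factor dWn/dW = 1 - inner_lr * d²L/dW², which is where the second derivative enters.

```python
def inner_step(w, x, y, inner_lr):
    """One SGD step on L = (w*x - y)**2; returns the tuned weight Wn."""
    grad = 2.0 * x * (w * x - y)               # dL/dW
    return w - inner_lr * grad

def outer_loss(w, train, val, inner_lr):
    """Loss Ln on the held-out example, computed with the tuned weight."""
    wn = inner_step(w, train[0], train[1], inner_lr)
    return (wn * val[0] - val[1]) ** 2

def outer_grad(w, train, val, inner_lr):
    """Analytic dLn/dW, differentiating through the inner SGD step."""
    x, y = train
    vx, vy = val
    wn = inner_step(w, x, y, inner_lr)
    dln_dwn = 2.0 * vx * (wn * vx - vy)        # gradient w.r.t. the tuned weight
    dwn_dw = 1.0 - inner_lr * 2.0 * x * x      # contains d2L/dW2 = 2*x**2
    return dln_dwn * dwn_dw, dln_dwn           # (full gradient, first-order approximation)

# Made-up numbers for illustration.
w, train, val, inner_lr = 0.5, (1.5, 2.0), (-0.7, -1.0), 0.1
full, first_order = outer_grad(w, train, val, inner_lr)

# Check the analytic gradient against a central finite difference.
eps = 1e-6
numeric = (outer_loss(w + eps, train, val, inner_lr)
           - outer_loss(w - eps, train, val, inner_lr)) / (2 * eps)
print(full, first_order, numeric)
```

Dropping the dWn/dW factor gives the first-order approximation, which is cheaper but no longer the exact gradient of Ln with respect to W.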
At test time:
- Given trained weights W and a few examples of a new task
- Compute loss L using weights W
- Wn = W - inner_lr * dL/dW
- Use Wn to make predictions for that task
Let's try learning to generate a sine wave from only 4 data points. To keep it simple, let's fix the amplitude and frequency but randomly select the phase to be either 0 or 180 degrees. At test time, the model must figure out what the phase is and generate the sine wave from only 4 example data points.
import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])
    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    # Sample a task: phase is either 0 or pi.
    b = 0 if random.choice([True, False]) else math.pi

    # 4 labeled examples for the inner loop and 4 held-out examples for the outer loss.
    x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    y = torch.sin(x + b)

    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        f = net(x, new_params)
        loss = F.l1_loss(f, y)

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, loss.item()))

    v_f = net(v_x, new_params)
    loss2 = F.l1_loss(v_f, v_y)
    loss2.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss2.item()))

# Test time: adapt to a new task from 4 examples, then plot the prediction.
t_b = math.pi #0

t_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
t_y = torch.sin(t_x + t_b)

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f = net(t_x, t_params)
    t_loss = F.l1_loss(t_f, t_y)

    grads = torch.autograd.grad(t_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]

test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.plot(t_x.data.numpy(), t_y.data.numpy(), 'o', label='Examples')
plt.legend()
plt.savefig('maml-sine.png')
Here is the sine wave the network constructs after looking at only 4 points at test time:
There's a variant of the MAML algorithm called FO-MAML (first-order MAML) that ignores higher-order derivatives. Reptile is a similar algorithm proposed by OpenAI that's simpler to implement. Check out their JavaScript demo.
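As a rough illustration of why Reptile is simpler, here is a minimal sketch of its update on a toy problem: run a few SGD steps on a sampled task to get Wn, then move W a small step toward Wn, with no backpropagation through the inner loop. The toy tasks (fit y = a*x with a slope of 1 or 3), the function names, and the hyperparameters are all assumptions made for this example.

```python
import random

def sgd_steps(w, xs, a, inner_lr, k):
    """Run k SGD steps on the mean squared error of f(x) = w*x against y = a*x."""
    for _ in range(k):
        grad = sum(2.0 * x * (w * x - a * x) for x in xs) / len(xs)
        w = w - inner_lr * grad
    return w

def reptile(num_iterations=500, inner_lr=0.1, outer_lr=0.1, k=5, seed=0):
    rng = random.Random(seed)
    w = -5.0                                    # deliberately poor initialization
    for _ in range(num_iterations):
        a = rng.choice([1.0, 3.0])              # sample a (hypothetical) task
        xs = [rng.uniform(-1.0, 1.0) for _ in range(4)]
        wn = sgd_steps(w, xs, a, inner_lr, k)   # tuned weight for this task
        w = w + outer_lr * (wn - w)             # Reptile update: move W toward Wn
    return w

w_meta = reptile()
print(w_meta)
```

The learned initialization ends up between the two task slopes, so a few SGD steps can reach either one.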
DAML uses meta learning to tune the parameters of the network to accommodate large domain shifts in the input. This method also doesn't need labels in the source domain!
Consider a neural network that takes x as input and produces y = net(x). The source domain is a distribution from which the input x may be drawn. Likewise, the target domain is another distribution of inputs. Domain adaptation is what has to be done to get the network to work when the distribution of the input changes from the source domain to the target domain. The idea in DAML is to use meta learning to tune the weights of the network based on examples in the source domain so that the network can do well on examples drawn from the target domain. During training, unlabeled examples from the source domain and the corresponding labeled examples from the target domain are available. This is the training loop of DAML:
- randomly initialize network weights W and the adaptation
  loss network weights W_adap
for it in range(num_iterations):
    - Sample a task from the training set
    - Compute adaptation loss (L_adap) using (W, W_adap) and
      unlabeled training data in the source domain
    - Wn = W - inner_lr * dL_adap/dW
    - Compute training loss (Ln) from labeled training data
      in the target domain using the tuned weights Wn
    - (W, W_adap) = (W, W_adap) - outer_lr * dLn/d(W, W_adap)
Since we don't have labeled data in the source domain, we must also learn a loss function L_adap parameterized by W_adap.
At test time:
- Given trained weights (W, W_adap) and a few unlabeled
  examples of a new task
- Compute adaptation loss (L_adap) using weights (W, W_adap) and
  unlabeled examples in the source domain
- Wn = W - inner_lr * dL_adap/dW
- Use Wn to make predictions for that task for new inputs in
  the target domain
Once again, let's try learning to generate sine waves. In the target domain, the input x to the network is drawn from a uniform distribution over [-2*PI, 2*PI], and the network has to predict y = sin(x) or y = sin(x + PI). Which of the two the network must predict has to be inferred from a single unlabeled input in the source domain. In the source domain, an input x drawn uniformly from [PI/4, PI/2] specifies that zero phase is wanted, and an input drawn from [-PI/2, -PI/4] specifies that a 180 degree phase is desired. The source domain input is used to find gradients of the learnt adaptation loss with respect to the weights, and a few steps of gradient descent tune the weights of the network. Once we have the tuned weights, they can be used in the target domain to predict a sine wave of the desired phase.
import math
import random
import torch # v0.4.1
from torch import nn
from torch.nn import functional as F
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

def net(x, params):
    x = F.linear(x, params[0], params[1])
    x1 = F.relu(x)

    x = F.linear(x1, params[2], params[3])
    x2 = F.relu(x)

    y = F.linear(x2, params[4], params[5])

    # Also return the hidden activations; the adaptation loss network looks at them.
    return y, x2, x1

def adap_net(y, x2, x1, params):
    x = torch.cat([y, x2, x1], dim=1)

    x = F.linear(x, params[0], params[1])
    x = F.relu(x)

    x = F.linear(x, params[2], params[3])
    x = F.relu(x)

    x = F.linear(x, params[4], params[5])

    return x

params = [
    torch.Tensor(32, 1).uniform_(-1., 1.).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

adap_params = [
    torch.Tensor(32, 1+32+32).uniform_(-1./math.sqrt(65), 1./math.sqrt(65)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(32, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(32).zero_().requires_grad_(),

    torch.Tensor(1, 32).uniform_(-1./math.sqrt(32), 1./math.sqrt(32)).requires_grad_(),
    torch.Tensor(1).zero_().requires_grad_(),
]

opt = torch.optim.SGD(params + adap_params, lr=1e-2)
n_inner_loop = 5
alpha = 3e-2

for it in range(275000):
    b = 0 if random.choice([True, False]) else math.pi

    # Labeled examples in the target domain, used only for the outer loss.
    v_x = torch.rand(4, 1)*4*math.pi - 2*math.pi
    v_y = torch.sin(v_x + b)

    opt.zero_grad()

    new_params = params
    for k in range(n_inner_loop):
        # Unlabeled source-domain input whose range encodes the desired phase.
        f, f2, f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), new_params)
        h = adap_net(f, f2, f1, adap_params)
        adap_loss = F.l1_loss(h, torch.zeros(1, 1))

        # create_graph=True because computing grads here is part of the forward pass.
        # We want to differentiate through the SGD update steps and get higher order
        # derivatives in the backward pass.
        grads = torch.autograd.grad(adap_loss, new_params, create_graph=True)
        new_params = [(new_params[i] - alpha*grads[i]) for i in range(len(params))]

        if it % 100 == 0: print('Iteration %d -- Inner loop %d -- Loss: %.4f' % (it, k, adap_loss.item()))

    v_f, _, _ = net(v_x, new_params)
    loss = F.l1_loss(v_f, v_y)
    loss.backward()

    opt.step()

    if it % 100 == 0: print('Iteration %d -- Outer Loss: %.4f' % (it, loss.item()))

# Test time: adapt using only unlabeled source-domain inputs, then plot the prediction.
t_b = math.pi # 0

opt.zero_grad()

t_params = params
for k in range(n_inner_loop):
    t_f, t_f2, t_f1 = net(torch.FloatTensor([[random.uniform(math.pi/4, math.pi/2) if t_b == 0 else random.uniform(-math.pi/2, -math.pi/4)]]), t_params)
    t_h = adap_net(t_f, t_f2, t_f1, adap_params)
    t_adap_loss = F.l1_loss(t_h, torch.zeros(1, 1))

    grads = torch.autograd.grad(t_adap_loss, t_params, create_graph=True)
    t_params = [(t_params[i] - alpha*grads[i]) for i in range(len(params))]

test_x = torch.arange(-2*math.pi, 2*math.pi, step=0.01).unsqueeze(1)
test_y = torch.sin(test_x + t_b)

test_f, _, _ = net(test_x, t_params)

plt.plot(test_x.data.numpy(), test_y.data.numpy(), label='sin(x)')
plt.plot(test_x.data.numpy(), test_f.data.numpy(), label='net(x)')
plt.legend()
plt.savefig('daml-sine.png')
This is the sine wave constructed by the network after domain adaptation:
Unlike supervised learning, no labels are available. So, we turn to reinforcement learning. Policy gradients are one way to update the weights of the network to maximize the reward. The idea is to start with random initialization, i.e., the network predicts 50% probability for both up and down regardless of the observation and to roll out the policy (play the game). At each time step, the network looks at the frame and predicts the probability of going up and down. We sample from this distribution and take the sampled action. At the end of the episode, the weights of the network are updated to increase the probability of taking a certain action if that action led to a positive reward and decrease the probability of taking an action if it led to a negative reward. This is how plain policy gradient works. It is similar to supervised learning, but with each sample in the loss function weighted by the reward for that episode (the labels are the actions that were sampled during the policy roll out).
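A minimal sketch of that reward-weighted loss, for a single time step with two actions, might look like this. To keep it dependency-free, the gradient step is taken directly on the two logits rather than on network weights, which is a simplifying assumption; the function names are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pg_loss(logits, action, reward):
    """Cross-entropy with the sampled action as the label, weighted by the reward."""
    return -math.log(softmax(logits)[action]) * reward

def pg_step(logits, action, reward, lr=0.5):
    """One gradient-descent step on pg_loss, taken directly on the logits."""
    probs = softmax(logits)
    # d(-log p[action]) / d logit_j = p_j - 1[j == action]
    grads = [(p - (1.0 if j == action else 0.0)) * reward for j, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]

UP, DOWN = 0, 1
start = [0.0, 0.0]                                # 50% up, 50% down
after_win = pg_step(start, UP, reward=+1.0)       # sampled UP, then won the point
after_loss = pg_step(start, UP, reward=-1.0)      # sampled UP, then lost the point
print(softmax(after_win)[UP], softmax(after_loss)[UP])
```

A positive reward pushes the probability of the sampled action above 50%, and a negative reward pushes it below.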
Policy gradients as described above suffer from the problem that the weight update after a policy roll out might change the probability of taking a certain action by a large amount. This is undesirable because the gradients are noisy and making large changes to the network after every policy roll out causes convergence problems. Why not reduce the step size? This can work but if the step size is reduced too much, then learning will be hopelessly slow. So, plain policy gradients are sensitive to the step size. One solution to this problem is to limit (constrain) the KL divergence between the probability of actions before and after the weight update. That's what Trust Region Policy Optimization (TRPO) does, but it needs conjugate gradients. Proximal Policy Optimization (PPO) is a simplification that modifies the loss function to discourage large probability changes, either with a penalty term or by clipping the ratio of the new to the old action probability. This has an effect similar to TRPO and works well in practice.
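For a single sample, the clipped form of the PPO loss can be sketched as follows. This mirrors the clipping used in the PyTorch code in this post (same eps_clip of 0.1), but the standalone function and the example numbers are assumptions for illustration.

```python
def ppo_clip_loss(new_prob, old_prob, advantage, eps_clip=0.1):
    """Clipped surrogate loss for one sample (to be minimized)."""
    r = new_prob / old_prob                                   # probability ratio
    clipped_r = max(1.0 - eps_clip, min(r, 1.0 + eps_clip))   # keep r within [1-eps, 1+eps]
    # Take the more pessimistic of the two objectives, then negate to get a loss.
    return -min(r * advantage, clipped_r * advantage)

# Good action whose probability already rose from 0.5 to 0.9: the objective
# is capped at (1 + eps_clip) * advantage, so there is no reward for going further.
capped = ppo_clip_loss(0.9, 0.5, advantage=1.0)

# Bad action whose probability rose from 0.5 to 0.9: not clipped, fully penalized.
penalized = ppo_clip_loss(0.9, 0.5, advantage=-1.0)
print(capped, penalized)
```

Because the capped case has zero gradient with respect to the new probability, a single roll out cannot push the policy arbitrarily far from the one that collected the data.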
Here is code implementing PPO in PyTorch (also in this Gist).
import random
import gym
import numpy as np
from PIL import Image
import torch
from torch.nn import functional as F
from torch import nn

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()

        self.gamma = 0.99
        self.eps_clip = 0.1

        self.layers = nn.Sequential(
            nn.Linear(6000, 512), nn.ReLU(),
            nn.Linear(512, 2),
        )

    def state_to_tensor(self, I):
        """ prepro 210x160x3 uint8 frame into 6000 (75x80) 1D float vector. See Karpathy's post: http://karpathy.github.io/2016/05/31/rl/ """
        if I is None:
            return torch.zeros(1, 6000)
        I = I[35:185] # crop - remove 35px from start & 25px from end of image in x, to reduce redundant parts of image (i.e. after ball passes paddle)
        I = I[::2,::2,0] # downsample by factor of 2.
        I[I == 144] = 0 # erase background (background type 1)
        I[I == 109] = 0 # erase background (background type 2)
        I[I != 0] = 1 # everything else (paddles, ball) just set to 1. this makes the image grayscale effectively
        return torch.from_numpy(I.astype(np.float32).ravel()).unsqueeze(0)

    def pre_process(self, x, prev_x):
        # difference of consecutive frames, so the input carries motion information
        return self.state_to_tensor(x) - self.state_to_tensor(prev_x)

    def convert_action(self, action):
        # map network outputs {0, 1} to the gym action ids {2, 3}
        return action + 2

    def forward(self, d_obs, action=None, action_prob=None, advantage=None, deterministic=False):
        if action is None:
            # roll out mode: pick an action for a single observation
            with torch.no_grad():
                logits = self.layers(d_obs)
                if deterministic:
                    action = int(torch.argmax(logits[0]).detach().cpu().numpy())
                    action_prob = 1.0
                else:
                    c = torch.distributions.Categorical(logits=logits)
                    action = int(c.sample().cpu().numpy()[0])
                    action_prob = float(c.probs[0, action].detach().cpu().numpy())
                return action, action_prob

        '''
        # policy gradient (REINFORCE)
        logits = self.layers(d_obs)
        loss = F.cross_entropy(logits, action, reduction='none') * advantage
        return loss.mean()
        '''

        # PPO
        vs = np.array([[1., 0.], [0., 1.]])
        ts = torch.FloatTensor(vs[action.cpu().numpy()])

        logits = self.layers(d_obs)
        r = torch.sum(F.softmax(logits, dim=1) * ts, dim=1) / action_prob
        loss1 = r * advantage
        loss2 = torch.clamp(r, 1-self.eps_clip, 1+self.eps_clip) * advantage
        loss = -torch.min(loss1, loss2)
        loss = torch.mean(loss)

        return loss

env = gym.make('PongNoFrameskip-v4')
env.reset()

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

reward_sum_running_avg = None
for it in range(100000):
    d_obs_history, action_history, action_prob_history, reward_history = [], [], [], []
    for ep in range(10):
        obs, prev_obs = env.reset(), None
        for t in range(190000):
            #env.render()

            d_obs = policy.pre_process(obs, prev_obs)
            with torch.no_grad():
                action, action_prob = policy(d_obs)

            prev_obs = obs
            obs, reward, done, info = env.step(policy.convert_action(action))

            d_obs_history.append(d_obs)
            action_history.append(action)
            action_prob_history.append(action_prob)
            reward_history.append(reward)

            if done:
                # this episode contributed t+1 entries to reward_history
                reward_sum = sum(reward_history[-(t+1):])
                reward_sum_running_avg = 0.99*reward_sum_running_avg + 0.01*reward_sum if reward_sum_running_avg is not None else reward_sum
                print('Iteration %d, Episode %d (%d timesteps) - last_action: %d, last_action_prob: %.2f, reward_sum: %.2f, running_avg: %.2f' % (it, ep, t, action, action_prob, reward_sum, reward_sum_running_avg))
                break

    # compute advantage
    R = 0
    discounted_rewards = []

    for r in reward_history[::-1]:
        if r != 0: R = 0 # scored/lost a point in pong, so reset reward sum
        R = r + policy.gamma * R
        discounted_rewards.insert(0, R)

    discounted_rewards = torch.FloatTensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / discounted_rewards.std()

    # update policy
    for _ in range(5):
        # random.sample raises ValueError if asked for more items than are available
        n_batch = min(24576, len(action_history))
        idxs = random.sample(range(len(action_history)), n_batch)

        d_obs_batch = torch.cat([d_obs_history[idx] for idx in idxs], 0)
        action_batch = torch.LongTensor([action_history[idx] for idx in idxs])
        action_prob_batch = torch.FloatTensor([action_prob_history[idx] for idx in idxs])
        advantage_batch = torch.FloatTensor([discounted_rewards[idx] for idx in idxs])

        opt.zero_grad()
        loss = policy(d_obs_batch, action_batch, action_prob_batch, advantage_batch)
        loss.backward()
        opt.step()

    if it % 5 == 0:
        torch.save(policy.state_dict(), 'params.ckpt')

env.close()
After training for 4000 episodes, the policy network consistently beat the "computer player" with an average reward of +14. Here is a video of the agent playing (the agent controls the green paddle to the right).
int main()
{
    int *p = (int *)0x02ad;
    return *p;
}
x86 processors still boot into 16-bit real mode where this is fine, but the OS switches the processor into protected mode which enables virtual memory. Once virtual memory is enabled, each process has its own virtual memory that the OS has to map (to physical memory, files on the hard drive, device registers, etc.). If the program tries to access unmapped memory, a segfault happens.
When Linux starts a process and loads the executable to memory, the layout of the virtual address space looks something like this:
---------------
|             |
|    stack    |
|             |
--------------- 0x7ffc725866b4
|             |
|             |
|             |
|  unmapped   |
|    space    |
|             |
|             |
|             |
--------------- 0x000001773000
|             |
| data (bss)  |
|             |
---------------
|             |
|    data     |
|             |
---------------
|             |
|    text     |
|             |
---------------
The text segment contains the binary code of the executable, the data segment has initialized static variables, the bss segment has uninitialized static variables (zeroed out before the main() function is called), and the stack segment contains the stack. (There's also space for the environment variables, and the OS kernel space is also mapped for performance reasons, but I've skipped these in the diagram.) The addresses of these segments are randomized when the executable is loaded as a security measure (ASLR).
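On Linux, the actual mappings of a process can be inspected through /proc/<pid>/maps. Here is a small sketch that parses that file's line format. The helper names are made up for this example, the sample line is invented rather than taken from a real process, and read_own_maps() only works where /proc is available.

```python
def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a dict."""
    fields = line.split(None, 5)          # the pathname column may be missing
    start, end = (int(a, 16) for a in fields[0].split('-'))
    return {
        'start': start,
        'end': end,
        'size': end - start,
        'perms': fields[1],
        'path': fields[5].strip() if len(fields) > 5 else '',
    }

def read_own_maps(path='/proc/self/maps'):
    """Return the parsed mappings of the current process (Linux only)."""
    with open(path) as f:
        return [parse_maps_line(line) for line in f if line.strip()]

# A made-up sample line in the /proc/<pid>/maps format, for illustration only.
sample = '7ffc72565000-7ffc72587000 rw-p 00000000 00:00 0                          [stack]'
mapping = parse_maps_line(sample)
print(mapping['path'], hex(mapping['start']), mapping['size'])
```

Running read_own_maps() twice in separate processes should show the stack and heap at different addresses each time when ASLR is enabled.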
When malloc() is called, it tries to allocate memory from previously freed memory that is still mapped to the process. But if there is insufficient free memory, malloc() must make one of these system calls to request the OS to map additional memory:
- The brk / sbrk system calls enlarge the data segment. In the diagram above, calling sbrk(8) would move the end of the data segment from 0x1773000 to 0x1773008. If the process wants to free the memory and return it to the OS, the data segment can be shrunk with the same syscalls.
- The mmap syscall can map pages anywhere in the virtual address space (the equivalent syscall in Windows is VirtualAlloc).
The malloc implementation in glibc uses sbrk when it needs small amounts of memory and mmap when it needs large amounts (the cutoff, M_MMAP_THRESHOLD, is 128 KiB by default). The reason mmap is preferred for large objects is to prevent losing too much memory to fragmentation in the data segment: if a small object is allocated with sbrk after a large object and the large object is then freed, that memory cannot be returned to the OS until the small object is freed as well.
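The fragmentation problem can be sketched with a toy model of a heap that, like the data segment, can only grow or shrink at its top. This is a deliberately simplified illustration and not how glibc is implemented: the class is hypothetical and it never reuses freed blocks, which a real allocator would.

```python
class ToyBreakHeap:
    """Toy heap that can only grow or shrink at its top, like the program break."""

    def __init__(self):
        self.blocks = []                  # [size, in_use], ordered from low to high address

    @property
    def break_offset(self):
        """Total bytes currently held between the heap start and the break."""
        return sum(size for size, _ in self.blocks)

    def alloc(self, size):
        """Grow the heap at the top, as sbrk(size) would; returns a block handle."""
        self.blocks.append([size, True])
        return len(self.blocks) - 1

    def free(self, handle):
        """Mark a block free, then give memory back only from the top down."""
        self.blocks[handle][1] = False
        while self.blocks and not self.blocks[-1][1]:
            self.blocks.pop()             # equivalent to sbrk(-size)

heap = ToyBreakHeap()
large = heap.alloc(1_000_000)
small = heap.alloc(8)
heap.free(large)
stuck = heap.break_offset                 # large block is free but trapped under the small one
heap.free(small)
released = heap.break_offset              # now both blocks can be returned
print(stuck, released)
```

A large block obtained with its own mmap call does not have this problem, because it can be unmapped independently of everything else.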
mkdir foo.git
cd foo.git
git init --bare
That's it! Now, from the client, clone this repo with:
git clone username@example.com:path/to/foo.git
Having a dedicated user for git repos on the server makes it easier to share access to the repo.
Create a new user git with a login shell restricted to git commands:
sudo adduser --shell $(which git-shell) git
Now create a repo in the home directory of the git user:
cd /home/git
sudo -u git mkdir bar.git
cd bar.git
sudo -u git git init --bare
As before, clone the new repo from the client using:
git clone git@example.com:bar
This is my script to take daily backups of all the git repos on the server to Amazon S3.
#!/bin/bash
set -e

GITDIR=/home/git
TMPDIR=/tmp/gitbackup

# run at low priority
renice -n 15 $$

# remove the temporary archives when the script exits
trap "rm -f ${TMPDIR}/*.git.tar.gz" EXIT

mkdir -p ${TMPDIR}
cd ${TMPDIR}

for proj in ${GITDIR}/*.git; do
    base=$(basename "$proj")
    tar -C "$GITDIR" -zcf "${base}.tar.gz" "$base"
done

export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=yyyyy
export AWS_DEFAULT_REGION=us-west-2

# aws s3 cp takes a single source, so upload the archives one at a time
for archive in ${TMPDIR}/*.git.tar.gz; do
    aws s3 cp "$archive" s3://mygitbucket/
done
If the repos are large, it might be worthwhile checking whether the hash of the gzipped repo has changed before uploading. It's also a good idea to use envdir to manage the access keys rather than putting them in the backup script.
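The hash check could be sketched like this. The helper names and the in-memory dict of previous hashes are assumptions for the example; a real script would persist the hashes between runs. Note also that the gzip header has a timestamp field, so identical repos may not always produce byte-identical archives, in which case hashing the uncompressed tar might be the safer comparison.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large archives don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def needs_upload(path, previous_hashes):
    """Return True (and record the new hash) if the file differs from the last run."""
    digest = file_sha256(path)
    if previous_hashes.get(path) == digest:
        return False
    previous_hashes[path] = digest
    return True

# Demo with a stand-in file instead of a real archive.
with tempfile.TemporaryDirectory() as d:
    archive = os.path.join(d, 'foo.git.tar.gz')
    with open(archive, 'wb') as f:
        f.write(b'first version')
    seen = {}
    first = needs_upload(archive, seen)     # never seen before
    second = needs_upload(archive, seen)    # unchanged since the last check
    with open(archive, 'wb') as f:
        f.write(b'second version')
    third = needs_upload(archive, seen)     # contents changed
print(first, second, third)
```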
Sometimes it's useful to view source code and commits in a web browser. cgit is an awesome light-weight webapp for this. Unlike heavy apps like GitLab, cgit needs no database, which reduces the administrative burden.
Install cgit, nginx, fcgiwrap, and apache2-utils (to create a .htpasswd file).
sudo apt install cgit nginx fcgiwrap apache2-utils
Specify the location of the git repos and static assets in the cgit config at /etc/cgitrc.
css=/cgit-static/cgit.css
logo=/cgit-static/cgit.png
favicon=/cgit-static/favicon.ico
#source-filter=/usr/lib/cgit/filters/syntax-highlighting.py
scan-path=/home/git/
To get syntax highlighting, install python-pygments and uncomment the source-filter option.
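If you'd like to password protect access to www.example.com/cgit/, create a .htpasswd file (the -c flag creates the file; omit it when adding more users, since -c overwrites an existing file):
sudo htpasswd -c /etc/nginx/.htpasswd <username>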
This is my nginx conf file to serve cgit from www.example.com/cgit/.
server {
    listen 80;
    listen [::]:80;
    server_name www.example.com;

    location /.well-known/acme-challenge/ {
        root /var/www/www.example.com;
    }

    location / {
        return 301 https://www.example.com$request_uri;
    }
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name www.example.com;

    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;

    location /cgit-static/ {
        alias /usr/share/cgit/;
    }

    location /cgit/ {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        include fastcgi_params;
        fastcgi_split_path_info ^(/cgit)(.*)$;
        fastcgi_param PATH_INFO $fastcgi_path_info;
        fastcgi_param SCRIPT_FILENAME /usr/lib/cgit/cgit.cgi;
        fastcgi_param QUERY_STRING $args;
        fastcgi_param HTTP_HOST $server_name;
        fastcgi_pass unix:/var/run/fcgiwrap.socket;
    }

    location / {
        root /var/www/www.example.com;
    }
}
You might also want to restrict repo access to only whitelisted IPs.
Silly though it sounds, this might be a reasonable strategy. Suppose you want to show 15 results per page. Then, show up to 20 pages, and stop there. This works well when it's unlikely that anyone would want to see past the first few pages. Incidentally, Google does something like this for web search results.
SELECT * FROM users ORDER BY creation_date LIMIT 15 OFFSET 45;
This query is not efficient for large offsets because rows up to the offset have to be read and discarded. But that's OK since the offset is limited to a few hundred rows at most. It's a net win if only the first few pages are read most of the time.
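Here is a minimal sketch of the capped LIMIT/OFFSET approach using Python's built-in sqlite3 module. The in-memory table, its made-up rows, and the fetch_page helper are assumptions for this example; the page size and page cap follow the numbers mentioned earlier.

```python
import sqlite3

PAGE_SIZE = 15
MAX_PAGES = 20

def make_demo_db(n_users=400):
    """In-memory table with made-up rows, standing in for the real users table."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, creation_date INTEGER)')
    conn.executemany('INSERT INTO users (id, creation_date) VALUES (?, ?)',
                     [(i, 1000 + i) for i in range(n_users)])
    return conn

def fetch_page(conn, page):
    """Return one page of users (1-based), refusing pages past MAX_PAGES."""
    if not 1 <= page <= MAX_PAGES:
        raise ValueError('page must be between 1 and %d' % MAX_PAGES)
    offset = (page - 1) * PAGE_SIZE       # at most 285 rows are read and discarded
    return conn.execute(
        'SELECT id, creation_date FROM users ORDER BY creation_date LIMIT ? OFFSET ?',
        (PAGE_SIZE, offset)).fetchall()

conn = make_demo_db()
page4 = fetch_page(conn, 4)               # same rows as LIMIT 15 OFFSET 45
print([row[0] for row in page4])
```

Requests past the cap are rejected up front, so the database never has to scan a large offset.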
This is based on the idea that random access is not really needed and that it's often necessary to only access the next page and the previous page from any given page. When you're on the fourth page, accessing a random page, say page 3124, might be inefficient. But accessing the third and fifth pages is efficient if the right indexes have been set up. This is accomplished by keeping track of the first and last values of the column on which the results are ordered.
SELECT * FROM users WHERE creation_date > ? ORDER BY creation_date LIMIT 15;
When the next page is requested, the query is executed with the creation_date of the last user in the current page. For the previous page, the creation_date of the first user in the current page is used:
SELECT * FROM users WHERE creation_date < ? ORDER BY creation_date DESC LIMIT 15;
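Note that the previous-page query returns its rows in descending order, so they have to be reversed before being displayed. If the column by which the results are sorted is not unique, add additional columns or the primary key to ORDER BY and keep track of the first and last values of those columns as well.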
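The next/previous queries with a tie-breaking primary key can be sketched with sqlite3 as follows. The table contents, the index name, and the helper functions are assumptions for this example; the dates are deliberately non-unique so that the id column is needed to break ties.

```python
import sqlite3

PAGE_SIZE = 15

def make_demo_db():
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, creation_date INTEGER)')
    # Made-up data: every creation_date is shared by 4 users.
    conn.executemany('INSERT INTO users VALUES (?, ?)', [(i, i // 4) for i in range(50)])
    conn.execute('CREATE INDEX users_date_id ON users (creation_date, id)')
    return conn

def next_page(conn, last=None):
    """Rows after the (id, creation_date) row that ends the current page."""
    if last is None:
        return conn.execute('SELECT id, creation_date FROM users '
                            'ORDER BY creation_date, id LIMIT ?', (PAGE_SIZE,)).fetchall()
    uid, date = last
    return conn.execute(
        'SELECT id, creation_date FROM users '
        'WHERE creation_date > ? OR (creation_date = ? AND id > ?) '
        'ORDER BY creation_date, id LIMIT ?', (date, date, uid, PAGE_SIZE)).fetchall()

def prev_page(conn, first):
    """Rows before the first row of the current page, returned in ascending order."""
    uid, date = first
    rows = conn.execute(
        'SELECT id, creation_date FROM users '
        'WHERE creation_date < ? OR (creation_date = ? AND id < ?) '
        'ORDER BY creation_date DESC, id DESC LIMIT ?', (date, date, uid, PAGE_SIZE)).fetchall()
    return rows[::-1]                     # the DESC query yields the page reversed

conn = make_demo_db()
page1 = next_page(conn)
page2 = next_page(conn, page1[-1])
back = prev_page(conn, page2[0])
print([row[0] for row in page2])
```

Going forward one page and back again lands on the same rows, even though page boundaries fall in the middle of a group of users with the same creation_date.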
Another example of using this method for pagination is in the SQLite wiki.