Luke Yeager 82e318cf8b Optimizer: one LR op per (device, optimizer)
Summary:
Try running this script through `nvprof`:
```py
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import brew, core, optimizer, workspace
from caffe2.python.model_helper import ModelHelper

do = core.DeviceOption(caffe2_pb2.CUDA, 0)
with core.DeviceScope(do):
    model = ModelHelper(arg_scope={'order': 'NCHW'})
    conv1 = brew.conv(model, 'data', 'conv1', 1, 20, 5)
    pool1 = brew.max_pool(model, conv1, 'pool1', kernel=2, stride=2)
    conv2 = brew.conv(model, pool1, 'conv2', 20, 50, 5)
    pool2 = brew.max_pool(model, conv2, 'pool2', kernel=2, stride=2)
    fc3 = brew.fc(model, pool2, 'fc3', 50 * 4 * 4, 500)
    fc3 = brew.relu(model, fc3, fc3)
    pred = brew.fc(model, fc3, 'pred', 500, 10)
    softmax, loss = model.SoftmaxWithLoss([pred, 'label'], ['softmax', 'loss'])
    model.AddGradientOperators([loss])
    optimizer.build_sgd(model, 0.01,
                        policy='step', stepsize=1, gamma=0.999,
                        momentum=0.9, nesterov=False)
    workspace.FeedBlob('data', np.zeros((1, 1, 28, 28), dtype=np.float32))
    workspace.FeedBlob('label', np.zeros((1, 1), dtype=np.int32))

workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)

for _ in range(100):
    workspace.RunNet(model.net)
```
Before this change:
```
                    1.55%  1.4185ms       837  1.6940us  1.6630us  2.4000us  [CUDA memcpy HtoD]
                    0.72%  656.03us       200  3.2800us  3.1350us  3.5840us  [CUDA memcpy DtoD]
                    0.39%  7.1574ms      1034  6.9220us  3.8300us  18.677us  cudaMemcpyAsync
                    0.00%  34.180us         3  11.393us  9.0960us  12.910us  cudaMemcpy
```
And after it (look at the Calls column — host-to-device copies drop from 837 to 137, and `cudaMemcpyAsync` calls from 1034 to 334):
```
                    0.73%  657.15us       200  3.2850us  3.1040us  3.6160us  [CUDA memcpy DtoD]
                    0.26%  235.07us       137  1.7150us  1.6640us  2.3680us  [CUDA memcpy HtoD]
                    0.20%  3.4493ms       334  10.327us  6.4220us  16.958us  cudaMemcpyAsync
                    0.00%  37.376us         3  12.458us  9.4120us  15.412us  cudaMemcpy
```
That makes a pretty big difference in performance: the 700-copy drop in both rows matches this net exactly — it has eight trainable blobs (weights and biases for `conv1`, `conv2`, `fc3`, `pred`), so collapsing eight per-parameter `LearningRate` ops into one removes seven copies per iteration, times 100 iterations. Is there any particular reason you decided to have a separate `LearningRate` op for every parameter in 1317e3498c?
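The caching idea behind the change can be sketched roughly as follows. This is a simplified stand-in, not the actual caffe2 `optimizer.py` code: the class and blob names here are hypothetical, and the real implementation builds a `LearningRate` operator into the net rather than returning a string.

```py
# Sketch of "one LR op per (device, optimizer)": the optimizer memoizes the
# learning-rate blob it builds for each device, so every parameter on that
# device reuses the same blob instead of getting its own LearningRate op.
class SgdOptimizerSketch:
    def __init__(self, base_lr):
        self.base_lr = base_lr
        self._lr_blobs = {}   # device -> shared learning-rate blob name
        self.ops_built = 0    # how many LearningRate ops were actually created

    def get_lr_blob(self, device):
        # Build the LR op only the first time this (device, optimizer)
        # pair is seen; afterwards, hand back the cached blob.
        if device not in self._lr_blobs:
            self._lr_blobs[device] = 'lr_%s' % device
            self.ops_built += 1
        return self._lr_blobs[device]

opt = SgdOptimizerSketch(0.01)
# Ten parameters on one GPU share a single learning-rate blob:
blobs = [opt.get_lr_blob('cuda_0') for _ in range(10)]
assert opt.ops_built == 1
assert len(set(blobs)) == 1
```

A second device (or a second optimizer instance) still gets its own blob, so per-device learning-rate schedules keep working.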
Closes https://github.com/caffe2/caffe2/pull/893

Reviewed By: kennyhorror

Differential Revision: D5372541

Pulled By: asaadaldien

fbshipit-source-id: 57357e1be2d58ce294058e9422fb3b1eddfca24d
2017-07-12 21:17:49 -07:00