Commit Graph

44 Commits

Author SHA1 Message Date
Aapo Kyrola
26645154bb warn about using test/val model with init_params=True + fixed some cases
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it overwrites the training model's params, and with DPM those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is OK to have init_params=True if you never run the param_init_net...).
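A minimal sketch of the safe pattern described above, assuming a ModelHelper-based trainer (model names are made up for illustration):

```python
from caffe2.python import model_helper

# The training model owns parameter initialization.
train_model = model_helper.ModelHelper(name="resnet50_train", init_params=True)

# The test/validation model must not re-initialize params: running its
# param_init_net would overwrite the trained (and GPU-synchronized) weights.
test_model = model_helper.ModelHelper(name="resnet50_test", init_params=False)
```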

Reviewed By: asaadaldien

Differential Revision: D5509963

fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
2017-07-27 13:20:27 -07:00
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Hyungsuk Kang
ca2b608f83 Fixed typo
Summary:
peaces -> pieces, peace -> piece
Closes https://github.com/caffe2/caffe2/pull/819

Differential Revision: D5312417

Pulled By: aaronmarkham

fbshipit-source-id: 59d2c3f475197a5f29dc7cf3ecaf675a242d3cdf
2017-06-23 14:02:40 -07:00
Aapo Kyrola
34eaa19d27 CPU data parallel model
Summary:
CPU version of data_parallel_model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).

Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and out of a bit of laziness). This can be improved later.

Reviewed By: wesolwsk

Differential Revision: D5277350

fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
2017-06-20 23:19:08 -07:00
Xiaoti Hu
969831ea33 Deprecate CNNModelHelper in lmdb_create_example
Reviewed By: akyrola

Differential Revision: D5233793

fbshipit-source-id: bae745791f071bc36fd45bd81145ce86c8ba9ed0
2017-06-19 13:04:02 -07:00
Aapo Kyrola
feba1eed00 resnet50: fetch right lr
Summary: I broke resnet50 when switching to use the optimizer, which uses an LR per parameter. This only happens after each epoch, and I did not test patiently enough. As a stop-gap, while asaadaldien works on a better solution, just fetch the LR of the conv1_w param.

Reviewed By: asaadaldien

Differential Revision: D5207552

fbshipit-source-id: f3474cd5eb0e291a59880e2834375491883fddfc
2017-06-07 21:46:35 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Xiangyu Wang
c9c862fa8f 16117716 [Caffe2 OSS] make char-rnn example use build_sgd
Summary: Replace hand-made SGD with build_sgd.

Reviewed By: salexspb

Differential Revision: D5186331

fbshipit-source-id: 3c7b4b370e29a1344b95819766463bae3812c9a6
2017-06-06 13:54:59 -07:00
Aapo Kyrola
401908d570 add_weight_decay + restore weight decay to resnet50_trainer
Summary:
Add add_weight_decay to optimizer + test.

In D5142973 I accidentally removed weight decay from resnet50 trainer, so this restores it.

Reviewed By: asaadaldien

Differential Revision: D5173594

fbshipit-source-id: c736d8955eddff151632ae6be11afde0883f7531
2017-06-02 14:16:56 -07:00
Aapo Kyrola
cdb50fbf2b add optimizer support to data_parallel_model; Use MomentumSGDUpdate
Summary:
This diff does two things:
- add support for optimizers to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with the proper namescope and devicescope, while the optimizer builder is called only once and adds the optimizer to the whole model.

- use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.

Changes the resnet50 trainer to use the optimizer.

This relies on D5133652.
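A rough sketch of how the new hook could be wired up from a trainer; the exact keyword names follow my reading of data_parallel_model at the time and may differ, and the input/forward builders here are trivial placeholders:

```python
from caffe2.python import brew, data_parallel_model, model_helper, optimizer

model = model_helper.ModelHelper(name="dpm_optimizer_example")

def add_input(model):
    # Placeholder input: a constant blob standing in for real data loading.
    model.param_init_net.ConstantFill([], "data", shape=[32, 16], value=1.0)

def add_forward_pass(model, loss_scale):
    fc = brew.fc(model, "data", "fc", dim_in=16, dim_out=1)
    loss = model.net.AveragedLoss(fc, "loss")
    # Scale the loss rather than the LR (see the lr_scale -> loss_scale commit).
    return [model.net.Scale(loss, "scaled_loss", scale=loss_scale)]

def add_optimizer(model):
    # Called only once for the whole model, not once per GPU.
    return optimizer.build_sgd(model, base_learning_rate=0.1, momentum=0.9)

data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=add_forward_pass,
    optimizer_builder_fun=add_optimizer,   # replaces param_update_builder_fun
    devices=[0, 1],
)
```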

Reviewed By: dzhulgakov

Differential Revision: D5142973

fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
2017-05-30 12:49:57 -07:00
Aapo Kyrola
0af0cba2b7 Refactor data_parallel_model initial sync and checkpointing
Summary:
Major improvements. Before, we only synced the "params" and "computed params" of the model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example, the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.

I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test.

Reviewed By: andrewwdye

Differential Revision: D5093689

fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
2017-05-19 12:48:06 -07:00
Yiming Wu
a28b01c155 rnn with brew
Summary:
Update the rnn_cell.py and char_rnn.py examples with the new `brew` model (see the sketch after this list).

- Deprecate CNNModelHelper
- Replace all helper functions with brew helper functions
- Use the `model.net.<SingleOp>` format to create bare-bone operators for better clarity.
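A minimal sketch of the style this moves to (layer sizes and blob names are made up for illustration):

```python
from caffe2.python import brew, model_helper

model = model_helper.ModelHelper(name="char_rnn_example")

# Helper functions now go through brew instead of CNNModelHelper methods ...
hidden = brew.fc(model, "input", "hidden", dim_in=256, dim_out=128)
hidden = brew.relu(model, hidden, "hidden_relu")

# ... while bare-bone operators are created directly on the net for clarity.
softmax = model.net.Softmax(hidden, "softmax")
```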

Reviewed By: salexspb

Differential Revision: D5062963

fbshipit-source-id: 254f7b9059a29621027d2b09e932f3f81db2e0ce
2017-05-16 13:33:44 -07:00
Yiming Wu
64d43dbb6e new resnet building with brew
Summary: new resnet building with brew

Reviewed By: akyrola

Differential Revision: D4945418

fbshipit-source-id: d90463834cbba2c35d625053ba8812e192df0adf
2017-05-15 22:47:24 -07:00
Heng Wang
8a2433eacb Add model saving and loading to resnet50_trainer.py
Summary:
The script caffe2/caffe2/python/examples/resnet50_trainer.py can be used to train a ResNet-50 model with ImageNet data (or similar).

However, currently the script does not actually save the model, so it is kind of useless.

Task 1: After each epoch, save the model in a file "<filename>_X.mdl", where X is the epoch number and <filename> is given as a command line parameter. By default, use "resnet50_model" as the filename.

Task 2: Add functionality to restore the model from a previous file (a simplified save/load sketch follows this list):
 - add a command line parameter "load_model", which the user can use to specify a filename.
 - if this parameter is set, load the model parameters from that file
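A simplified sketch of the two tasks, not necessarily the mechanism the commit itself uses: a plain pickle of the parameter blobs, with hypothetical helper names.

```python
import pickle
from caffe2.python import workspace

def save_model(model, file_name, epoch):
    # Task 1: dump every parameter blob to "<file_name>_<epoch>.mdl".
    params = {str(p): workspace.FetchBlob(str(p)) for p in model.GetParams()}
    with open("{}_{}.mdl".format(file_name, epoch), "wb") as f:
        pickle.dump(params, f)

def load_model(path):
    # Task 2: feed the saved blobs back into the workspace before training.
    with open(path, "rb") as f:
        for name, value in pickle.load(f).items():
            workspace.FeedBlob(name, value)
```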

Reviewed By: prigoyal

Differential Revision: D4984340

fbshipit-source-id: 333e92679ba52a7effe9917fdfc2d55d652b868f
2017-05-05 10:08:37 -07:00
Yury Zemlyanskiy
31643d5ecb Inference code for seq2seq model
Summary: Beam search implementation

Differential Revision: D4975939

fbshipit-source-id: 67d8b73390221583f36b4367f23626a2aa80f4b4
2017-05-02 22:47:28 -07:00
Yiming Wu
885f906e67 resnet train print loss and accuracy
Summary: Print the ResNet training loss and accuracy for each batch so that people have a better idea of what is going on.

Reviewed By: pietern

Differential Revision: D4945390

fbshipit-source-id: 0fcd60f4735e81641355aba6e6cbf0e57e886e38
2017-04-25 16:03:58 -07:00
Jay Mahadeokar
4dafb608e7 Fix char_rnn LSTM import
Summary:
Fix for char_rnn.py with the latest LSTM changes in rFBS779c69758cee8caca6f36bc507e3ea0566f7652a.
Also fixed some linting issues.

Reviewed By: salexspb

Differential Revision: D4927018

fbshipit-source-id: cda760a170056b8bc237b4c565cc34800992c8e0
2017-04-20 22:46:19 -07:00
inspire99
f750a2d2df fix a few typos
Summary:
fix typo: Dimention, probablity
Closes https://github.com/caffe2/caffe2/pull/310

Differential Revision: D4915798

Pulled By: Yangqing

fbshipit-source-id: 3a16d3adc469c9930ce0dad8584c4678b3c3b5c0
2017-04-19 13:31:33 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a nice way to re-use RNN layers for training and for inference.

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Pieter Noordhuis
8c9f4d8c3b Add throughput information to resnet50_trainer
Summary:
TSIA

Makes it easier for throughput debugging.

Differential Revision: D4879634

fbshipit-source-id: 8d479d51b0ec51ad3d86ad5500fc3095400cf095
2017-04-12 17:46:14 -07:00
Pieter Noordhuis
c907c7c7dc Update resnet50_trainer example
Summary:
A few fixes in this commit: the epoch size is now rounded
down to the closest integer multiple of the global batch size (batch
per GPU * GPUs per host * hosts per run). The num_shards and shard_id
parameters are now passed to CreateDB so multiple processes actually
train on different subsets of data. The LR step size is scaled by the
number of hosts in the run. The test accuracy is only determined after
each epoch instead of after every so many iterations.
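The rounding described above amounts to the following (a sketch with hypothetical argument names):

```python
def round_epoch_size(epoch_size, batch_per_gpu, gpus_per_host, num_hosts):
    # Global batch size = batch per GPU * GPUs per host * hosts per run.
    global_batch = batch_per_gpu * gpus_per_host * num_hosts
    # Round the epoch size down to the closest integer multiple of it.
    return (epoch_size // global_batch) * global_batch

# e.g. round_epoch_size(1281167, 32, 8, 2) == 1281024
```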

Differential Revision: D4871505

fbshipit-source-id: d2703dc7cf1e0f76710d9d7c09cd362a42fe0598
2017-04-12 14:03:51 -07:00
Pieter Noordhuis
26d301fbe4 Configurable CuDNN workspace limit in resnet50_trainer
Summary: TSIA

Reviewed By: Yangqing, bwasti

Differential Revision: D4835477

fbshipit-source-id: a0083188fe91a56c5f910c7dda46412e38632d7e
2017-04-05 10:50:00 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Yury Zemlyanskiy
0c47d345df Multi-gpu training for OSS seq2seq
Summary:
Use data_parallel_model for seq2seq multi-gpu training. The main reason for complexity here is that GatherOp hasn't yet been implemented on GPU.

This diff also adds a better clipping procedure: clip by global norm rather than by absolute value.
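For illustration, clipping by global norm looks roughly like this in NumPy (this is not the diff's Caffe2 operator graph):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Scale every gradient by a single factor so that their joint L2 norm does
    # not exceed max_norm, instead of clipping each element at a fixed value.
    global_norm = np.sqrt(sum(float(np.sum(g.astype(np.float64) ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-8))
    return [g * scale for g in grads]

# Example: two gradients with joint norm 5.0 scaled down to norm 1.0.
clipped = clip_by_global_norm([np.array([3.0, 0.0]), np.array([0.0, 4.0])], 1.0)
```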

Differential Revision: D4778691

fbshipit-source-id: bff184dae02ecc227413fef51f48a4726e5d3825
2017-03-27 17:32:39 -07:00
James Reed
33f41c06c0 Remove more instances of batch_size
Summary: D4734505 part 2. Remove more instances of the batch_size parameter

Reviewed By: urikz

Differential Revision: D4736906

fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
2017-03-19 22:31:30 -07:00
Pieter Noordhuis
92101aa87a Update resnet50 example
Summary:
Make it use Gloo and optionally use Redis for rendezvous (where a
shared filesystem is not available).

Differential Revision: D4709943

fbshipit-source-id: 59cc7a14316c7b634417ea5161a75fab3c19f2fa
2017-03-15 08:18:50 -07:00
Deepak Gopinath
a1d63da6af Adding UNK to vocab | Changing default params
Summary: UNK needs to be indexed in the vocabulary for validation to work. The default args now result in the training loss decreasing.

Reviewed By: urikz

Differential Revision: D4703393

fbshipit-source-id: e4d6ad100daf8392f8ba1e502f9ecf39bb8ce24a
2017-03-13 22:17:48 -07:00
Deepak Gopinath
001ac5d751 Fix to use appropriate corpus and vocab in eval
Summary: We should be using the vocabulary built on the training data, and corpus_eval as data for the evaluation phase.

Reviewed By: urikz

Differential Revision: D4700382

fbshipit-source-id: ca1dd043a28f9bb585faad050c82fb12c1cdf6cc
2017-03-13 14:31:27 -07:00
Pieter Noordhuis
6729d81418 Specify which GPUs to use in resnet50 example
Summary:
TSIA

This change also fixes an undefined attribute error after running 20
iterations of the resnet50 example trainer.

Differential Revision: D4692794

fbshipit-source-id: b98efdfeb078c5ba89d2a86837f3c672e1eade5f
2017-03-12 22:33:15 -07:00
Deepak Gopinath
57ecd20197 seq2seq open source implementation
Summary:
OSS implementation of the seq2seq model in Caffe2. The script uses the Seq2SeqModelCaffe2 class to build and run the model. It takes in training data in the form of a text file with one sentence per line, builds a vocabulary, generates batches based on the batch size, and runs the net for a configurable number of epochs. It prints the total scalar loss at the end of each epoch.

All FBLearner and neural_mt type system dependencies have been removed. Unimplemented and unnecessary methods have been removed to make the script simpler.
fblearner/flow/projects/langtech/translation/neural_mt/model_util_caffe2.py has been moved to caffe2/caffe2/python/examples/seq2seq_util.py and remains unchanged

Potential TODOs (a device-placement sketch for the first item follows this list):
  - Get the model running on GPU. Only GatherOp does not have a corresponding GPU implementation. Try adding CopyGPUToCPU before and CopyCPUToGPU after Gather, and use a CUDA DeviceOption.
  - Add evaluation on test data with a suitable metric (perplexity? BLEU?)
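A sketch of the device-placement workaround suggested in the first TODO; the blob names are hypothetical and the surrounding model is omitted:

```python
from caffe2.proto import caffe2_pb2
from caffe2.python import core

net = core.Net("gather_workaround")
gpu = core.DeviceOption(caffe2_pb2.CUDA, 0)
cpu = core.DeviceOption(caffe2_pb2.CPU)

with core.DeviceScope(gpu):
    # Copy Gather's inputs from GPU memory back to CPU ...
    data_cpu = net.CopyGPUToCPU("data", "data_cpu")
    indices_cpu = net.CopyGPUToCPU("indices", "indices_cpu")

with core.DeviceScope(cpu):
    # ... run Gather on CPU, since it has no GPU implementation yet ...
    gathered_cpu = net.Gather([data_cpu, indices_cpu], "gathered_cpu")

with core.DeviceScope(gpu):
    # ... then move the result back to GPU for the rest of the model.
    gathered = net.CopyCPUToGPU(gathered_cpu, "gathered")
```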

Reviewed By: urikz

Differential Revision: D4653333

fbshipit-source-id: 1c7d970ebc86afe23fad4d48854296bf54eb0f77
2017-03-09 16:18:08 -08:00
Ahmed Taei
4f0e7730a9 Distributed Multi-GPU resnet50
Summary: Use filesystem rendezvous for distributed multi-GPU training.

Differential Revision: D4664945

fbshipit-source-id: 7b6767323e94bc4e7fa25ef3eba65b38abb79341
2017-03-08 11:39:29 -08:00
Alexander Sidorov
95262032d8 Char RNN bug fix for batching
Summary:
It could be that only the first item
in the batch was really used, in the case where the rest of the memory was 0. Or if
the memory there had a big positive integer, then the whole sequence was used. So whether the rest of the batch was used depended on our luck :)

Reviewed By: Yangqing

Differential Revision: D4599569

fbshipit-source-id: ae89cee796bbcbc232e4abcab71dee360b0d8bc6
2017-02-22 17:34:30 -08:00
Alexander Sidorov
2727317384 char-rnn: add comments
Summary: Just some comments

Reviewed By: pietern

Differential Revision: D4544518

fbshipit-source-id: b517023bf5e9712a2bf96ae15a709553e5ee6032
2017-02-10 12:20:58 -08:00
Alexander Sidorov
98f66fd282 Char-rnn : fix batching
Summary:
Inputs have to be arranged in such a way that the j-th example of
batch i comes right before the j-th example in batch i+1 in the text.
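One way to produce that layout, as a NumPy sketch (function and variable names are made up):

```python
import numpy as np

def make_batches(text, batch_size, seq_length):
    # Split the text into batch_size contiguous streams. Batch i is then the
    # i-th seq_length slice of every stream, so the j-th row of batch i is
    # continued directly by the j-th row of batch i+1.
    usable = (len(text) // (batch_size * seq_length)) * batch_size * seq_length
    streams = np.array(list(text[:usable])).reshape(batch_size, -1)
    num_batches = streams.shape[1] // seq_length
    return [streams[:, i * seq_length:(i + 1) * seq_length]
            for i in range(num_batches)]
```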

Reviewed By: urikz

Differential Revision: D4519553

fbshipit-source-id: 9dd80658e0c4d9ff0f97a7904cbb164f267fe39f
2017-02-10 10:07:32 -08:00
Alexander Sidorov
e676f4411b GPU support for RecurrentOp + Char RNN example
Summary: With a batch size of 32 and otherwise default parameters I get 70 iterations per second vs. 40 on CPU. Batching still doesn't produce a good loss; I am going to work on this in a separate diff.

Reviewed By: urikz

Differential Revision: D4516566

fbshipit-source-id: d0611534747beb2cd935a8607a283369378e4a6c
2017-02-09 22:54:53 -08:00
Aapo Kyrola
1c7886701e lr_scale to loss_scale
Summary:
As per the discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling will also affect the weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced then :/.

So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not only a bad thing, since it will bring awareness to this change. I will announce this in the FB groups.

In this diff I modified all my models to work correctly.
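A small numeric sketch of the difference (plain Python, numbers chosen arbitrarily): with weight decay folded into the gradient, scaling the LR by 1/N also shrinks the decay term, while scaling the loss leaves it at full strength.

```python
g, w = 0.8, 2.0           # summed gradient across N devices, current weight
lr, wd, N = 0.1, 1e-4, 4  # learning rate, weight decay, number of devices

# LR scaling: the weight-decay contribution is also divided by N.
step_lr_scaled = (lr / N) * (g + wd * w)      # 0.025 * 0.8002 = 0.020005

# Loss scaling: only the data gradient is divided by N; decay stays intact.
step_loss_scaled = lr * (g / N + wd * w)      # 0.1 * 0.2002 = 0.02002
```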

Reviewed By: Yangqing

Differential Revision: D4507002

fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
2017-02-03 07:44:40 -08:00
Alexander Sidorov
2ce3cfefe1 Char-RNN Tutorial
Summary:
This learns Shakespeare and then generates samples one character at a time. We want this to be an example of using our LSTM and RNNs in general.

Now it takes 4ms to run the training net with the current parameters (with batch size = 1). I don't have data on how much each operator takes yet. But the overall Python loop doesn't seem to have much influence: with 1000 fake iterations in run_net it took 4s per iteration, as expected.

Future work:

* fixing convergence for batching
* profiling on operator level
* trying it out with GPUs
* benchmarking against existing char-rnn implementations
* stacking LSTMs (one LSTM is different from two; one needs to take care of scoping)

Reviewed By: urikz

Differential Revision: D4430612

fbshipit-source-id: b36644fed9844683f670717d57f8527c25ad285c
2017-02-02 15:44:32 -08:00
Aapo Kyrola
95b3309a87 Gradient Input memory sharing using memonger blob sharing
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. At batch size 32, ResNet-50 took 7497 MiB before and 5010 MiB after this change. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.

In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations in the first iteration's backward pass due to gradient sharing, which would otherwise cause NCCL to deadlock.

The sharing of gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code uses a topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on ResNet-50, so it is clearly fast enough.

Module data_parallel_model supports this feature natively.
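A rough sketch of how the blob-sharing pass might be invoked directly via memonger, assuming `model` is an already-built ModelHelper with gradients added; the argument names and return value are from my reading of the module and may differ, and data_parallel_model can enable this for you per the summary:

```python
from caffe2.python import memonger

# Rewrite the training net so that gradient blobs which are never alive at the
# same time reuse the same underlying memory (assuming an optimized NetDef is
# returned rather than the net being modified in place).
optimized_proto = memonger.share_grad_blobs(
    model.net,
    losses=["loss"],
    param_grads=set(str(grad) for grad in model.param_to_grad.values()),
    namescope="gpu_0",
)
model.net.Proto().CopyFrom(optimized_proto)
```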

Reviewed By: prigoyal

Differential Revision: D4363209

fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
2017-01-09 19:44:23 -08:00
Aapo Kyrola
e8dc09064e exhaustive_search=True
Summary: For some reason I had been disabling the exhaustive search heuristic for cuDNN in the xray/resnet trainers. On BigBasin, this gives a 10% perf boost; on BigSur, maybe 5%.

Reviewed By: prigoyal

Differential Revision: D4338654

fbshipit-source-id: 3974dd612f5d4f4dc8b2febccb59664d3f276c3e
2016-12-15 22:59:27 -08:00
Aapo Kyrola
68cfc52452 MomentumSGDUpdate -- version of MomentumSGD with update.
Summary:
It gives a significant perf boost to do the parameter update inside MomentumSGD instead of with a separate WeightedSum op.
To ensure backwards compatibility, I made it a separate op.

Also added a unit test.
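A sketch of the two patterns side by side; blob names such as ONE and NEG_ONE are hypothetical constant blobs holding 1.0 and -1.0, and the exact old update arrangement may have differed:

```python
from caffe2.python import core

net = core.Net("momentum_sgd_variants")

# Old pattern: MomentumSGD produces an adjusted gradient, and a separate
# WeightedSum op applies it to the parameter.
net.MomentumSGD(["w_grad", "w_momentum", "lr"], ["w_grad", "w_momentum"],
                momentum=0.9, nesterov=1)
net.WeightedSum(["w", "ONE", "w_grad", "NEG_ONE"], ["w"])

# New fused op: the parameter update happens inside the operator itself,
# which avoids the extra pass over the parameter blob.
net.MomentumSGDUpdate(["w_grad", "w_momentum", "lr", "w"],
                      ["w_grad", "w_momentum", "w"],
                      momentum=0.9, nesterov=1)
```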

Reviewed By: prigoyal

Differential Revision: D4262446

fbshipit-source-id: 38e7ee6d7677b398658ac7fe9b7a59b569e033f4
2016-12-15 12:01:29 -08:00
Aapo Kyrola
e65eeff665 LMDB example
Summary:
This example writes an LMDB database of (random) image data and labels. It then reads them back using Caffe2's TensorProtosDBInput and validates that the checksums match. This example shows how to coerce image data into TensorProtos and be happy.

Before, there was no clear example of how to create databases for Caffe2.
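A condensed sketch of the writing side, with random data and a hypothetical DB name; the reading side would then use TensorProtosDBInput with db_type="lmdb":

```python
import lmdb
import numpy as np
from caffe2.proto import caffe2_pb2

env = lmdb.open("image_db", map_size=1 << 30)
with env.begin(write=True) as txn:
    for idx in range(16):
        img = np.random.rand(3, 32, 32).astype(np.float32)
        label = int(np.random.randint(10))

        # Pack one (image, label) pair as a TensorProtos record.
        protos = caffe2_pb2.TensorProtos()
        img_proto = protos.protos.add()
        img_proto.dims.extend(img.shape)
        img_proto.data_type = caffe2_pb2.TensorProto.FLOAT
        img_proto.float_data.extend(img.flatten().tolist())

        label_proto = protos.protos.add()
        label_proto.data_type = caffe2_pb2.TensorProto.INT32
        label_proto.int32_data.append(label)

        txn.put("{:08d}".format(idx).encode("ascii"), protos.SerializeToString())
```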

Differential Revision: D4263614

fbshipit-source-id: 21e08066899095b4efcc2d23dbc3ede81e75914a
2016-12-05 11:53:26 -08:00
Aapo Kyrola
3410939459 pass learning rate scaling factor to parameter update builder function
Summary:
When refactoring data_parallel_model, the division of the LR by the number of devices was dropped, and thus we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.

Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.

Reviewed By: prigoyal

Differential Revision: D4248907

fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
2016-12-05 11:53:26 -08:00
Aapo Kyrola
b9f1555b6a remove unused function from resnet50_trainer
Summary: Just noticed that I had duplicate code in the example imagenet trainer. Removed the function.

Differential Revision: D4223070

fbshipit-source-id: 443a9401bf7e425f7a3a13a44c9d0f7e21e72303
2016-11-29 15:18:37 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00