Files
pytorch/binaries/record_function_benchmark.cc

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

128 lines
3.8 KiB
C++
Raw Normal View History

#include <torch/torch.h>
#include <ATen/record_function.h>
#include "c10/util/Flags.h"
#include <chrono>
#include <iostream>
#include <ctime>
C10_DEFINE_int(iter, 10000, "Number of iterations");
C10_DEFINE_int(sampled_iter, 10e6,
"Number of iterations for the sampled observer benchmark");
namespace {
const int kTensorSize = 16;
const int kSmallTensorSize = 1;
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
const float kLowSamplingProb = 0.0001;
}
void addTestCallback(
double sampling_prob = 1.0,
at::RecordFunctionCallback::StartCallback fn =
[](const at::RecordFunction&) -> std::unique_ptr<at::ObserverContext> { return nullptr; }) {
auto cb = at::RecordFunctionCallback(
fn,
[](const at::RecordFunction&, at::ObserverContext*) {})
.needsInputs(false);
if (sampling_prob < 1.0) {
cb.samplingProb(sampling_prob);
}
at::addGlobalCallback(cb);
}
float runTensorGEMMBench(int tensor_size, int iter) {
typedef std::chrono::high_resolution_clock clock;
typedef std::chrono::microseconds us;
std::chrono::time_point<clock> start_time = clock::now();
auto inp = torch::randn({tensor_size, tensor_size});
for (auto idx = 0; idx < iter; ++idx) {
torch::mm(inp, inp);
}
auto duration = static_cast<float>(
std::chrono::duration_cast<us>(clock::now() - start_time).count());
return duration;
}
float runPureRecordFunctionBench(int iter) {
typedef std::chrono::high_resolution_clock clock;
typedef std::chrono::microseconds us;
std::chrono::time_point<clock> start_time = clock::now();
for (auto idx = 0; idx < iter; ++idx) {
auto step_callbacks = at::getStepCallbacksUnlessEmpty(at::RecordScope::USER_SCOPE);
if (step_callbacks.has_value()) {
at::RecordFunction guard(std::move(*step_callbacks));
[RecordFunction] More effecient machinery to determine which callbacks to run. (#75807) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807 There is a tension in RecordFunction between two use cases: 1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead. 2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead. The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback *might* run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects: * It forces us to rebuild the set of callbacks to run on every step when profiling * It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forces RecordFunction to lazily initialize. This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks. It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks which is simply a SmallVector of pointer pairs and a couple POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the SmallVector logic; we could consider adopting a hard limit on active callbacks; more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so debatable if it's worth it to switch to `std::array`. The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites. Future work: * RecordFunction no longer needs to be lazily initialized. * We can deprecate the disable/reenable APIs, since we can not safely add and remove global callbacks. Test Plan: I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery. I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested. Reviewed By: swolchok, davidberard98 Differential Revision: D35276158 fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd (cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
2022-04-19 13:40:00 -07:00
guard.before("Test", -1);
}
}
auto duration = static_cast<float>(
std::chrono::duration_cast<us>(clock::now() - start_time).count());
return duration;
}
void runBenchmark() {
float duration = 0;
for (auto tensor_size : std::set<int>({kSmallTensorSize, kTensorSize})) {
duration = runTensorGEMMBench(tensor_size, FLAGS_iter);
std::cout << "Tensor GEMM benchmark ("
<< tensor_size
<< "x"
<< tensor_size
<< ", " << FLAGS_iter << "): " << duration
<< " us." << std::endl;
}
duration = runPureRecordFunctionBench(FLAGS_iter);
std::cout << "Pure RecordFunction benchmark ("
<< FLAGS_iter << "): "
<< duration
<< " us." << std::endl;
}
int main(int argc, char** argv) {
if (!c10::ParseCommandLineFlags(&argc, &argv)) {
std::cout << "Failed to parse command line flags" << std::endl;
return -1;
}
at::enableRecordFunction();
at::clearCallbacks();
std::cout << "Warm up" << std::endl;
runBenchmark();
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
std::cout << "Running without observers" << std::endl;
runBenchmark();
addTestCallback();
std::cout << "Running with empty non-sampled observer" << std::endl;
runBenchmark();
at::clearCallbacks();
addTestCallback(kLowSamplingProb);
std::cout << "Running with empty sampled observer" << std::endl;
runBenchmark();
at::clearCallbacks();
std::cout << "Checking number of sampled observer invocations" << std::endl;
static int cb_count = 0;
addTestCallback(
kLowSamplingProb,
[](const at::RecordFunction&) -> std::unique_ptr<at::ObserverContext> {
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
++cb_count;
return nullptr;
}
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
);
auto duration = runPureRecordFunctionBench(FLAGS_sampled_iter);
std::cout << "Pure RecordFunction runtime of " << FLAGS_sampled_iter
<< " iterations: " << duration
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
<< " us, number of callback invocations: " << cb_count
<< ", expected number: ~" << (int)(FLAGS_sampled_iter * kLowSamplingProb)
[rfc] Reduce number of coin flips in RecordFunction (#40758) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:21:26 -07:00
<< " invocations" << std::endl;
at::clearCallbacks();
return 0;
}