
Why is my CPU doing matrix operations faster than my GPU?

When I tried to verify that the GPU performs matrix operations faster than the CPU, I got unexpected results: the CPU outperformed the GPU in my experiment, which left me confused.

I ran matrix multiplication on the CPU and on the GPU respectively. The programming environment is MXNet with CUDA 10.1.

with gpu:

import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
y = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
%timeit nd.dot(x,y)

50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

with cpu:

x1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
%timeit nd.dot(x1,y1)

33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Why is the CPU faster? My CPU is an Intel i5-6300HQ and my GPU is an NVIDIA GTX 950M.

Comments

TLDR: Your matrix multiplication is actually not running :)

MXNet is an asynchronous framework: work requests are piled into a queue and processed on a need-to-run basis by its execution engine. So what you're measuring is only the time it takes to enqueue the request, not to execute it. That's why the timing is so small (microseconds for a 100k×100k matrix multiply would be surprisingly fast) and roughly equal for CPU and GPU. To force execution, you need to add a call that forces production of a result, for example a print or a nd.dot(x, y).wait_to_read(). See here a code example very similar to your benchmark: https://github.com/ThomasDelteil/MXNetParisWorkshop/blob/master/FromNDArrayToTrainedModel.ipynb
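The same measurement pitfall can be illustrated without MXNet (a sketch using Python's standard thread pool as a stand-in for MXNet's execution engine; the `slow_square` function and its 0.1 s delay are made up for the demo): timing the submission of asynchronous work captures almost none of the actual compute time, which only shows up when you wait for the result.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    """Stand-in for a heavy kernel, e.g. a large matrix multiply."""
    time.sleep(0.1)
    return n * n

pool = ThreadPoolExecutor(max_workers=1)

# "Async" measurement: only times enqueueing the work,
# analogous to %timeit nd.dot(x, y) with no synchronization.
start = time.perf_counter()
future = pool.submit(slow_square, 7)
submit_time = time.perf_counter() - start

# Correct measurement: block until the result exists,
# analogous to calling nd.dot(x, y).wait_to_read().
start = time.perf_counter()
result = future.result()
compute_time = time.perf_counter() - start

print(f"submit: {submit_time * 1e6:.0f} µs, wait: {compute_time * 1e3:.0f} ms")
```

The submit time is a few microseconds regardless of how heavy the work is, which matches the suspiciously small, device-independent numbers in the question.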

Extra comments:

  1. The gain from using a GPU over a CPU scales with the amount of parallelism the task offers. On simple tasks, that gain can be small to non-existent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ runs at 2.3 GHz with a 3.2 GHz boost, while your GTX 950M runs at 0.9 GHz with a 1.1 GHz boost).

  2. MXNet's ndarray is very fast at matrix algebra on the CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel, and (3) I believe the default MXNet build comes with Intel MKL, which significantly boosts the linear-algebra performance of Intel CPUs (https://medium.com/apache-mxnet/mxnet-boosts-cpu-performance-with-mkl-dnn-b4b7c8400f98). Its ability to run compute on the GPU within the same API is also a big strength over, for example, NumPy.

  3. I don't think your test would actually run on the GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16 GB of memory, 4x more than a GTX 950M) fails with a "large tensor size" error.
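A quick back-of-the-envelope calculation shows why point 3 holds (plain arithmetic, assuming MXNet's default float32 dtype): a single 100000×100000 matrix needs roughly 37 GiB, and the benchmark would need at least three of them (x, y, and the product), far beyond a 4 GB GTX 950M.

```python
# Memory footprint of one 100000 x 100000 matrix of float32 values
rows = cols = 100_000
bytes_per_float32 = 4

matrix_bytes = rows * cols * bytes_per_float32   # 40,000,000,000 bytes
matrix_gib = matrix_bytes / 2**30                # convert to GiB

print(f"{matrix_gib:.1f} GiB per matrix, x3 for x, y and the result")
```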

I don't know the module you're using, but your CPU can access main memory much faster and also keeps a lot of data in cache. The GPU takes longer to load data into GPU memory and also incurs overhead on every call from the CPU. That's always the downside of GPU computation: when you can load a large chunk of data into GPU memory at once, there's a good chance of being faster. By the way, that's why deep learning frameworks work in batches. When you can't work in batches, I'd always use the CPU. You also have some potential for performance improvement via multiprocessing.
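The batching argument above can be sketched with a toy cost model (the overhead and per-item numbers here are hypothetical, chosen only to illustrate the shape of the effect, not measured on any real GPU): each GPU call pays a fixed launch/transfer overhead, which batching amortizes across many items.

```python
def per_item_latency(batch_size, launch_overhead_us=50.0, us_per_item=0.01):
    """Toy model: one fixed per-call overhead (launch + transfer),
    shared by every item in the batch."""
    return (launch_overhead_us + batch_size * us_per_item) / batch_size

# Per-item cost drops sharply as the fixed overhead is amortized.
for b in (1, 32, 1024):
    print(f"batch={b:5d}: {per_item_latency(b):8.3f} µs per item")
```

With a batch of one, the call is almost pure overhead; with large batches, the per-item cost approaches the raw compute cost, which is why batching (when possible) is the standard way to make GPU work pay off.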

