I am running some benchmarks with CUDA, C++, C#, and Java, using MATLAB for verification and matrix generation. But when I multiply with MATLAB, 2048x2048 and even bigger matrices are multiplied almost instantly.
              1024x1024     2048x2048     4096x4096
              ---------     ---------     ---------
CUDA C (ms)       43.11        391.05       3407.99
C++ (ms)        6137.10      64369.29     551390.93
C# (ms)        10509.00     300684.00    2527250.00
Java (ms)       9149.90      92562.28     838357.94
MATLAB (ms)       75.01        423.10       3133.90
Only CUDA is competitive, but I thought that at least C++ would be somewhat close, not 60x slower.
So my question is: how is MATLAB doing it that fast?
C++ Code:
float temp = 0;
timer.start();
// Naive O(n^3) triple loop: matice3 = matice1 * matice2,
// where rozmer is the matrix dimension.
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // Note: this strides down a column of matice2,
            // which is cache-unfriendly for row-major storage.
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
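As a point of comparison (my sketch, not part of the original benchmark): much of the C++ gap comes from that column-wise access. Swapping the two inner loops makes the hot loop read and write memory sequentially, which on typical hardware recovers a large constant factor without changing the algorithm. A minimal sketch reusing the question's variables:

// Same O(n^3) multiplication, reordered j-m-k so the inner loop
// walks matice2 and matice3 along rows (sequential memory access).
for (int j = 0; j < rozmer; j++)
    for (int k = 0; k < rozmer; k++)
        matice3[j][k] = 0;               // clear the output first

for (int j = 0; j < rozmer; j++)
    for (int m = 0; m < rozmer; m++)
    {
        float a = matice1[j][m];         // loaded once per inner loop
        for (int k = 0; k < rozmer; k++)
            matice3[j][k] += a * matice2[m][k];
    }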
Edit:
I also don't know what to think about the C# results. The algorithm is the same as in C++ and Java, but there's a giant jump from 1024 to 2048?
Edit 2:
Updated MATLAB and 4096x4096 results.
Answer
Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
MATLAB uses highly optimized libraries for matrix multiplication, which is why plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
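To see the library effect from C++ directly, here is a sketch (mine, not from the original answer) that multiplies two 1024x1024 single-precision matrices through the standard CBLAS interface; link against any optimized BLAS such as OpenBLAS or MKL. This is essentially the kind of routine MATLAB dispatches to for A * A.

#include <cblas.h>   // CBLAS interface to an optimized BLAS
#include <vector>

int main()
{
    const int n = 1024;
    std::vector<float> A(n * n, 1.0f);   // row-major n x n matrix
    std::vector<float> C(n * n, 0.0f);

    // C = 1.0 * A * A + 0.0 * C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A.data(), n,
                      A.data(), n,
                0.0f, C.data(), n);
    return 0;
}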
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022
Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:
>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04