Benchmarks
Compiler: g++
Compilation flags: -O3 -Wno-deprecated -march=pentium4 -ffast-math
Libs: -lm -llapack -lblas -lg2c
Computer architecture: Intel(R) Pentium(R) 4 CPU 3.00GHz, L2 cache 2048 KB
Complex numbers addition
Consider the next simple lines of code:
#include <complex>
#include <string.h>
#include <fstream>
using namespace std;
typedef complex<double> dcomp;
int main(){
dcomp x=(2,1), y=(1,1), z;
double a=2, b=1, c=1, d=1, g, zre, zim;
for(int i = 0; i < 250; i++){
for(int j = 0; j < 10000000; j++){
z = x + y;
}
}
}
This takes 10.13 seconds. The loop overhead (the loop executed without the z = x + y; statement) is 1.27 seconds.
If we run the same loop with the expression z = x + y
replaced with zre = a + b; zim = c + d;
the loop will take 1.27 seconds,
showing that most of the time is spent evaluating the loop, thus leaving us little room for a quantitative evaluation, but still wondering about the lame performance
of the complex numbers addition.
The same trick performed with arrays:
dcomp *xac = new dcomp [1];
dcomp *yac = new dcomp [1];
dcomp *zac = new dcomp [1];
xac[0] = (2,1);
yac[0] = (1,1);
for(int i = 0; i < 250; i++){
for(int j = 0; j < 10000000; j++){
zac[0] = xac[0] + yac[0];
}
}
delete [] xac;
delete [] yac;
delete [] zac;
yields 16.61 seconds. Replacing this with
double *xac = new double [2];
double *yac = new double [2];
double *zac = new double [2];
xac[0] = 2.;
xac[1] = 1.;
yac[0] = 1.;
yac[1] = 1.;
for(int i = 0; i < 250; i++)
for(int j = 0; j < 10000000; j++){
zac[0] = xac[0] + yac[0];
zac[1] = xac[1] + yac[1];
}
}
yields 5.93 seconds. The loop overhead is 1.60 seconds for both cases, hence we'll get 15 seconds, respectively 4,33 seconds. In this case, adding the real and imaginary
parts separately was 3,46 times faster. Using the following trick:
double dummy_re, dummy_im;
for(int i = 0; i < 250; i++)
for(int j = 0; j < 10000000; j++){
dummy_re = xac[0] + yac[0];
dummy_im = xac[1] + yac[1]
}
zac[0] = dummy_re;
zac[1] = dummy_im;
we get an execution time smaller than the loop overhead, aproximately 1.26 seconds.
Complex numbers multiplication
The same loop with:
x = (2,1);
y = (1,1);
z = x*y;
takes 10.7 seconds, with a loop overhead of 1.4 seconds, final call - 9.3 seconds.
Whilst explicitely using the real and imaginary parts
double xre = 2., xim = 1., yre = 1., yim = 1., zre, zim;
zre= xre*yre - xim*yim;
zim = xre*yim + xim*yre;
takes 1.72 seconds, with a loop overhead of 1.6 seconds, final call - 0.12 seconds. The performance gain is ridiculously large, 77.5. Unfortunately we don't expect such
a lavish speed gain when using arrays.
Indeed, for:
dcomp *x = new dcomp [1];
dcomp *y = new dcomp [1];
dcomp *z = new dcomp [1];
x[0] = (2,1);
y[0] = (1,1);
z[0] = x[0]*y[0];
we get 18.97 seconds, with a loop overhead of 1.34 seconds. As for
double *x = new double [2];
double *y = new double [2];
double *z = new double [2];
x[0] = 2.;
x[1] = 1.;
y[0] = 1.;
y[1] = 1.;
z[0] = x[0]*y[0] - x[1]*y[1];
z[1] = x[0]*y[1] + x[1]*y[0];
we get only 12.05 seconds, with a loop overhead of 1.36 seconds. The performance gain is aproximately 1.65. A dramatic improvement is obtained if one uses the
known trick:
double dummy_re, dummy_im;
for(int i = 0; i < 250; i++)
for(int j = 0; j < 10000000; j++){
dummy_re = x[0]*y[0] - x[1]*y[1];
dummy_im = x[0]*y[1] + x[1]*y[0];
}
z[0] = dummy_re;
z[1] = dummy_im;
with the execution time arround (in most of the cases even smaller than) the loop overhead ~ 1.3 seconds. This result, identical for addition and multiplication
shows that writing array elements is an expensive operation.
Memory benchmarks
The code used for this benchmark looks as follows:
int m = 128;
double *zet = new double [m];
int steps = 0;
i = 1;
int naccess = 4096;
int s = 4;
for(int j = 0; j < 100000; j++){
steps = 0;
while (steps < naccess){
zet[i] = 0.;
steps = steps +1;
i = i +s;
if(i > m)
i = 1;
}
}
The number of operations the loop performs is constant 4096 * 10e5, the loop overhead is 0.21 seconds.
We intialize an array of dimension 'm' and we step trough it's elements 'nacces' times with steps of size 's'.
For various sizes of s and m we measure the time needed to perform the operation zet[i] = 0.;. Thus we hope to emphasize the cache miss penalty, to estimate the
L1 cache size and number of cache lines.
s |
m=64 |
m=128 |
m=256 |
m=512 |
m=1024 |
m=2048 |
m=4096 |
m=8192 |
m=16384 |
m=32768 |
m=65536 |
m=131072 |
m=163840 |
m=262144 |
1 |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
2 |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.51 s |
2.19 s |
4 |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.31s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
1.77 s |
4.01 s |
8 |
1.3 s |
2.06 s |
2.04 s |
2.11 s |
2.03 s |
2.09 s |
2.09 s |
2.11 s |
2.19 s |
2.20 s |
2.05 s |
2.35 s |
2.70 s |
8.31 s |
16 |
1.3 s |
1.3 s |
1.98 s |
1.99 s |
1.99 s |
1.98 s |
1.99 s |
1.97 s |
1.98 s |
1.98 s |
1.98 s |
1.99 s |
3.27 s |
8.31 s |
32 |
1.3 s |
1.3 s |
1.3 s |
1.96 s |
1.96 s |
1.97 s |
1.95 s |
1.94 s |
1.94 s |
1.98 s |
1.95 s |
1.96 s |
4.19 s |
12.73 s |
64 |
1.3 s |
1.3 s |
1.3 s |
1.32 s |
2.33 s |
2.33 s |
2.33 s |
2.33 s |
2.33 s |
2.32 s |
2.34 s |
2.93 s |
4.94 s |
12.91 s |
128 |
- |
1.3 s |
1.3 s |
1.3 s |
1.3 s |
2.33 s |
2.33 s |
2.33 s |
2.32 s |
2.32 s |
2.32 s |
2.34 s |
4.48 s |
11.56 s |
256 |
- |
- |
1.3 s |
1.3 s |
1.3 s |
1.31s |
1.92 s |
2.02 s |
1.98 s |
1.98 s |
2.11 s |
2.14 s |
3.30 s |
11.39 s |
512 |
- |
- |
- |
1.3 s |
1.3 s |
1.32 s |
1.34 |
1.95 s |
1.91 s |
1.91 s |
3.25 s |
3.36 s |
3.53 s |
14.92 s |
What it's already obvious from these results is that the main memory is very expensive in terms of access time: when our array doesn't 'fit' in the
L2 cache anymore, we get slow-downs by factors of almost 10, closely mimicking the speed ratio of the main memory versus cache on the
P4 architecture.
The 16KB of L1 cache of the P4 couldn't be evidentiated with this method, being practically indistinguishable from the 2MB of L2 cache.
The interesting result of this test is that even big arrays can be accessed fast if the elements of the array are accessed one by one, in order.
Addressing multidimensional arrays
2D arrays
Consider the following lines of code:
for(int j = 0; j < 100000; j++){
for(int k = 0; k < m; k++){
for(int l = 0; l < m; l++){
zet[k][l] = 0.;
}
}
}
The loop overhead is around 0.27 seconds, the loop itself roughly 0.5 seconds, for m=64. But if we exchange k with l, the loop will take
2.5 seconds. If we substract the overhead, the operation is 10 times slower, only 5 times with the overhead.
replacing the 2D array with a 1D array, and addressing the elements in order, we get an execution time of 0.4 seconds, an improvement of 20%.
Addressing the elements in the wrong order,
zet[l*m +k] = 0.;
we get an execution time of 2.53 seconds, slighltly worse than for the matrix.
3D arrays
For the case of 3D arrays, the loop considered has an overhead of about 0.2 seconds. The operation
ket[k][l][n] = 1.;
takes roughly 0.72 seconds, whilst
zet[m*m*k + m*l + n] = 1.;
takes 0.65 seconds, still an improvement, but not very dramatic. Reshuffling the addressing order, by exchanging l with n, and one
gets an execution time of 1.85 seconds in case of zet[m*m*k + m*n + l] and 1.95 seconds with ket[k][n][l], roughly a factor 3 down.