Home

Benchmarks



Compiler: g++

Compilation flags: -O3 -Wno-deprecated -march=pentium4 -ffast-math

Libs: -lm -llapack -lblas -lg2c

Computer architecture: Intel(R) Pentium(R) 4 CPU 3.00GHz, L2 cache 2048 KB



Complex numbers addition

Consider the next simple lines of code:
	#include <complex>
	#include <string.h>
	#include <fstream>
	
	using namespace std;
	typedef complex<double> dcomp;

	int main(){
	    
	    dcomp x=(2,1), y=(1,1), z;
	    double a=2, b=1, c=1, d=1, g, zre, zim;

		for(int i = 0; i < 250; i++){
		for(int j = 0; j < 10000000; j++){
			 z = x + y;
		}
		}
	}

This takes 10.13 seconds. The loop overhead (the loop executed without the z = x + y; statement) is 1.27 seconds. If we run the same loop with the expression
z = x + y
replaced with
zre = a + b; zim = c + d;
the loop will take 1.27 seconds, showing that most of the time is spent evaluating the loop, thus leaving us little room for a quantitative evaluation, but still wondering about the lame performance of the complex numbers addition.

The same trick performed with arrays:
	dcomp *xac = new dcomp [1];
	dcomp *yac = new dcomp [1];
	dcomp *zac = new dcomp [1];
		
	xac[0] = (2,1);
	yac[0] = (1,1);
		
	for(int i = 0; i < 250; i++){
	for(int j = 0; j < 10000000; j++){
	    zac[0] = xac[0] + yac[0];
	}
	}

	delete [] xac;
	delete [] yac;
	delete [] zac;
	
yields 16.61 seconds. Replacing this with
	double *xac = new double [2];
	double *yac = new double [2];
	double *zac = new double [2];

	xac[0] = 2.;
	xac[1] = 1.;
	yac[0] = 1.;
	yac[1] = 1.;

	for(int i = 0; i < 250; i++)
	for(int j = 0; j < 10000000; j++){
		zac[0] = xac[0] + yac[0];
		zac[1] = xac[1] + yac[1];
	}
	}
	 
yields 5.93 seconds. The loop overhead is 1.60 seconds for both cases, hence we'll get 15 seconds, respectively 4,33 seconds. In this case, adding the real and imaginary parts separately was 3,46 times faster. Using the following trick:
	  double dummy_re, dummy_im;

		for(int i = 0; i < 250; i++)
		for(int j = 0; j < 10000000; j++){
			dummy_re = xac[0] + yac[0];
			dummy_im = xac[1] + yac[1]
		}

		zac[0] = dummy_re;
		zac[1] = dummy_im;
	 
we get an execution time smaller than the loop overhead, aproximately 1.26 seconds.

Complex numbers multiplication

The same loop with:
	x = (2,1);
	y = (1,1);
	
	z = x*y;
	 
takes 10.7 seconds, with a loop overhead of 1.4 seconds, final call - 9.3 seconds. Whilst explicitely using the real and imaginary parts
	  double xre = 2., xim = 1., yre = 1., yim = 1., zre, zim;

	  zre= xre*yre - xim*yim;
	  zim = xre*yim + xim*yre;
	  
takes 1.72 seconds, with a loop overhead of 1.6 seconds, final call - 0.12 seconds. The performance gain is ridiculously large, 77.5. Unfortunately we don't expect such a lavish speed gain when using arrays. Indeed, for:
	  dcomp *x = new dcomp [1];
	  dcomp *y = new dcomp [1];
	  dcomp *z = new dcomp [1];

	  x[0] = (2,1);
	  y[0] = (1,1);

	  z[0] =  x[0]*y[0];
	   
we get 18.97 seconds, with a loop overhead of 1.34 seconds. As for
	   double *x = new double [2];
	   double *y = new double [2];
	   double *z = new double [2];

	    x[0] = 2.;
	    x[1] = 1.;
	    y[0] = 1.;
	    y[1] = 1.;

	    z[0] = x[0]*y[0] - x[1]*y[1];
	    z[1] = x[0]*y[1] + x[1]*y[0];
	    
we get only 12.05 seconds, with a loop overhead of 1.36 seconds. The performance gain is aproximately 1.65. A dramatic improvement is obtained if one uses the known trick:
	double dummy_re, dummy_im;
	
	for(int i = 0; i < 250; i++)
	for(int j = 0; j < 10000000; j++){
	    dummy_re = x[0]*y[0] - x[1]*y[1];
	    dummy_im = x[0]*y[1] + x[1]*y[0];
	}

	z[0] = dummy_re;
	z[1] = dummy_im;
	
with the execution time arround (in most of the cases even smaller than) the loop overhead ~ 1.3 seconds. This result, identical for addition and multiplication shows that writing array elements is an expensive operation.

Memory benchmarks

The code used for this benchmark looks as follows:
	int m = 128;
	double *zet = new double [m];

	int steps = 0;
	i = 1;
	int naccess = 4096;
	int s = 4;

	for(int j = 0; j < 100000; j++){
		 steps = 0;
		 while (steps < naccess){
			zet[i] = 0.;
			steps = steps +1;
			i = i +s;
			if(i > m)
			i = 1;
		}
	}
	 
The number of operations the loop performs is constant 4096 * 10e5, the loop overhead is 0.21 seconds. We intialize an array of dimension 'm' and we step trough it's elements 'nacces' times with steps of size 's'. For various sizes of s and m we measure the time needed to perform the operation zet[i] = 0.;. Thus we hope to emphasize the cache miss penalty, to estimate the L1 cache size and number of cache lines.



  s     
 m=64      
 m=128     
 m=256     
 m=512     
 m=1024    
 m=2048    
 m=4096    
 m=8192    
 m=16384   
 m=32768   
 m=65536   
 m=131072  
 m=163840  
 m=262144  
  1    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
  2    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.51 s    
 2.19 s 
  4    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.31s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 1.77 s 
 4.01 s 
  8    
 1.3 s    
 2.06 s   
 2.04 s   
 2.11 s   
 2.03 s   
 2.09 s   
 2.09 s   
 2.11 s   
 2.19 s   
 2.20 s   
 2.05 s   
 2.35 s   
 2.70 s   
 8.31 s 
  16   
 1.3 s    
 1.3 s    
 1.98 s   
 1.99 s   
 1.99 s   
 1.98 s   
 1.99 s   
 1.97 s   
 1.98 s   
 1.98 s   
 1.98 s   
 1.99 s   
 3.27 s    
 8.31 s  
  32   
 1.3 s    
 1.3 s    
 1.3 s    
 1.96 s   
 1.96 s   
 1.97 s   
 1.95 s   
 1.94 s   
 1.94 s   
 1.98 s   
 1.95 s   
 1.96 s   
 4.19 s    
 12.73 s 
  64   
 1.3 s    
 1.3 s    
 1.3 s    
 1.32 s   
 2.33 s   
 2.33 s   
 2.33 s   
 2.33 s   
 2.33 s   
 2.32 s   
 2.34 s   
 2.93 s   
 4.94 s    
 12.91 s 
  128  
   -      
 1.3 s    
 1.3 s    
 1.3 s    
 1.3 s    
 2.33 s    
 2.33 s    
 2.33 s    
 2.32 s    
 2.32 s    
 2.32 s    
 2.34 s    
 4.48 s    
 11.56 s 
  256  
   -      
   -      
 1.3 s    
 1.3 s    
 1.3 s    
 1.31s    
 1.92 s    
 2.02 s    
 1.98 s    
 1.98 s    
 2.11 s    
 2.14 s    
 3.30 s    
 11.39 s 
  512  
   -      
   -      
   -      
 1.3 s  
 1.3 s  
 1.32 s 
 1.34     
 1.95 s  
 1.91 s  
 1.91 s  
 3.25 s  
 3.36 s  
 3.53 s  
 14.92 s 


What it's already obvious from these results is that the main memory is very expensive in terms of access time: when our array doesn't 'fit' in the L2 cache anymore, we get slow-downs by factors of almost 10, closely mimicking the speed ratio of the main memory versus cache on the P4 architecture. The 16KB of L1 cache of the P4 couldn't be evidentiated with this method, being practically indistinguishable from the 2MB of L2 cache. The interesting result of this test is that even big arrays can be accessed fast if the elements of the array are accessed one by one, in order.

Addressing multidimensional arrays


2D arrays
Consider the following lines of code:
	for(int j = 0; j < 100000; j++){
		for(int k = 0; k < m; k++){
		for(int l = 0; l < m; l++){
			zet[k][l] = 0.;
		}
		}
	}
	
The loop overhead is around 0.27 seconds, the loop itself roughly 0.5 seconds, for m=64. But if we exchange k with l, the loop will take 2.5 seconds. If we substract the overhead, the operation is 10 times slower, only 5 times with the overhead. replacing the 2D array with a 1D array, and addressing the elements in order, we get an execution time of 0.4 seconds, an improvement of 20%. Addressing the elements in the wrong order,
	zet[l*m +k] = 0.; 
we get an execution time of 2.53 seconds, slighltly worse than for the matrix.

3D arrays
For the case of 3D arrays, the loop considered has an overhead of about 0.2 seconds. The operation
	ket[k][l][n] = 1.;
	
takes roughly 0.72 seconds, whilst
	zet[m*m*k + m*l + n] = 1.;
	
takes 0.65 seconds, still an improvement, but not very dramatic. Reshuffling the addressing order, by exchanging l with n, and one gets an execution time of 1.85 seconds in case of zet[m*m*k + m*n + l] and 1.95 seconds with ket[k][n][l], roughly a factor 3 down.