The Algorithm Mergesort, Working with MPI

Febrero 1, 2009

We work with the algorithm of merge sort. Merge sort or merge_sort is an comparison-based sorting algorithm. In most implementations it is stable, meaning that it preserves the input order of equal elements in the sorted output. It is an example of the divide and conquer algorithmic paradigm.

Conceptually, a merge sort works as follows:

1.If the list is of length 0 or 1, then it is already sorted. Otherwise:
2.Divide the unsorted list into two sublists of about half the size.
3.Sort each sublist recursively by re-applying merge sort.
4.Merge the two sublists back into one sorted list.

The program works with one array size of 1000000 elements, in disorder, all the numbers are obtain with randomized method. The master task distributes an array to the workers in chunks, zero pads for equal load balancing the workers sort and return to the master, which does a final merge.
Our experiment contained a series of test of a C computing the sort of the array in different modes, with 1,2,3,…n processors . We ran this code on Tarzan.
Tarzan has two different types of nodes:

Tarzan n6s – n5s : SMP Thin 4-way PowerPC 604e 332 Mhz
Tarzan n1s..n4s: Power2 SC Thin 160 Mhz

More references you can find here: http://www.fi.upm.es/?pagina=183.

In our experiments we worked with tarzan n6 and n5, running the program with 1-8 processors.

To compile the code, you need to have a C compiler. Compiling command is:

Mpcc mergesort.c

<pre>

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

#define N 100000
#define MASTER 0		/* taskid of first task */

void showVector(int *v, int n, int id);
int * merge(int *A, int asize, int *B, int bsize);
void swap(int *v, int i, int j);
void m_sort(int *A, int min, int max);

double startT, stopT;

double startTime;

/*function to print a vector*/
void showVector(int *v, int n, int id)
{
	int i;
	printf("%d: ",id);
	for(i=0;i<n;i++)
		printf("%d ",v[i]);
	putchar('\n');
}

/*function to merge vectors*/
int * merge(int *A, int asize, int *B, int bsize) {
	int ai, bi, ci, i;
	int *C;
	int csize = asize+bsize;

	ai = 0;
	bi = 0;
	ci = 0;

	/* printf("asize=%d bsize=%d\n", asize, bsize); */

	C = (int *)malloc(csize*sizeof(int));	/*the array can be statically allocated too*/
	while ((ai < asize) && (bi < bsize)) {
		if (A[ai] <= B[bi]) {
			C[ci] = A[ai];
			ci++; ai++;
		} else {
			C[ci] = B[bi];
			ci++; bi++;
		}
	}

	if (ai >= asize)						/*if A is shorter*/
		for (i = ci; i < csize; i++, bi++)
			C[i] = B[bi];
	else if (bi >= bsize)					/*if B is shorter*/
		for (i = ci; i < csize; i++, ai++)
			C[i] = A[ai];

	for (i = 0; i < asize; i++)
		A[i] = C[i];
	for (i = 0; i < bsize; i++)
		B[i] = C[asize+i];

	/* showVector(C, csize, 0); */
	return C;
}

void swap(int *v, int i, int j)
{
	int t;
	t = v[i];
	v[i] = v[j];
	v[j] = t;
}

void m_sort(int *A, int min, int max)
{
	int *C;		/* dummy, just to fit the function */
	int mid = (min+max)/2;
	int lowerCount = mid - min + 1;
	int upperCount = max - mid;

	/* If the range consists of a single element, it's already sorted */
	if (max == min) {
		return;
	} else {
		/* Otherwise, sort the first half */
		m_sort(A, min, mid);
		/* Now sort the second half */
		m_sort(A, mid+1, max);
		/* Now merge the two halves */
		C = merge(A + min, lowerCount, A + mid + 1, upperCount);
	}
}

main(int argc, char **argv)
{
	int * data;
	int * chunk;
	int * other;
	int m,n=N;
	int id,p;
	int s = 0;
	int i;
	int step;
	MPI_Status status;

	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&id);
	MPI_Comm_size(MPI_COMM_WORLD,&p);

	startT = MPI_Wtime();

/**************************** master task ************************************/
	if(id == MASTER)
	{
		int r;
		srandom(MPI_Wtime());
		s = n/p;
		r = n%p;
		data = (int *)malloc((n+s-r)*sizeof(int));
		for(i=0;i<n;i++)
			data[i] = random();
		if(r!=0)
		{
			for(i=n;i<n+s-r;i++)
				data[i]=0;
			s=s+1;
		}

		MPI_Bcast(&s,1,MPI_INT,0,MPI_COMM_WORLD);
		chunk = (int *)malloc(s*sizeof(int));
		MPI_Scatter(data,s,MPI_INT,chunk,s,MPI_INT,0,MPI_COMM_WORLD);
		m_sort(chunk, 0, s-1);
		/* showVector(chunk, s, id); */
	}

/**************************** worker task ************************************/
	else
	{
		MPI_Bcast(&s,1,MPI_INT,0,MPI_COMM_WORLD);
		chunk = (int *)malloc(s*sizeof(int));
		MPI_Scatter(data,s,MPI_INT,chunk,s,MPI_INT,0,MPI_COMM_WORLD);
		m_sort(chunk, 0, s-1);
		/* showVector(chunk, s, id);*/
	}

      /*data propagation in a tree fashion*/
	step = 1;
	while(step<p)
	{
		if(id%(2*step)==0)
		{
			if(id+step<p)
			{
				MPI_Recv(&m,1,MPI_INT,id+step,0,MPI_COMM_WORLD,&status);
				other = (int *)malloc(m*sizeof(int));
				MPI_Recv(other,m,MPI_INT,id+step,0,MPI_COMM_WORLD,&status);
				chunk = merge(chunk,s,other,m);
				s = s+m;
			}
		}
		else
		{
			int near = id-step;
			MPI_Send(&s,1,MPI_INT,near,0,MPI_COMM_WORLD);
			MPI_Send(chunk,s,MPI_INT,near,0,MPI_COMM_WORLD);
			break;
		}
		step = step*2;
	}

	stopT = MPI_Wtime();
	if(id==0)
	{
		FILE * fout;

		printf("%d; %d processors; %f secs\n", s, p, (stopT-startT));

		fout = fopen("result","w");
		for(i=0;i<s;i++)
			fprintf(fout,"%d\n",chunk[i]);
		fclose(fout);
	}
	MPI_Finalize();
}
</pre>

$ cat script.sh
#!/bin/sh

poe ./a.out -hfile 1.txt -procs 1
poe ./a.out -hfile 2.txt -procs 2
poe ./a.out -hfile 3.txt -procs 3
poe ./a.out -hfile 4.txt -procs 4
poe ./a.out -hfile 5.txt -procs 5
poe ./a.out -hfile 6.txt -procs 6
poe ./a.out -hfile 7.txt -procs 7
poe ./a.out -hfile 8.txt -procs 8

result1


The code & Gprof

Enero 29, 2009

Installation

You can download the whole package with source codes, Makefile, executable and this report
on this address:

http //eibi.utalca.cl/~zangetsu/Madrid/tcc/

The program is written according to C99 standard and should be executable on any UNIX/Linux
distribution with SDL library. For compilation you will need dev libraries as well.

Required libraries (Linux):

libsdl1.2debian libsdl1.2dev
packages with SDL library
libsdlimage1.2
libsdlimage1.2dev
packages for image manipulation
libsdlttf1.2
libsdlttf1.2dev

packages for manipulation with TrueType fonts

You can download SDL from http://www.libsdl.org/ .
Program has a preprepared Makefile, that will do the compiling for you. To install the program,
simply type make and then make clean to clean the workspace. To run the basic sequential version of
the program execute seq. To run the parallelized one, execute par. The parallelized version requires
parameters that will be described later on.

Profiling


Profiling is a technique that, with a use of a special analysis tool (profiler), creates an
performance overview of a program as a whole and of its subprograms. This process is especially
focused on the frequency and duration of function calls from their invocation to termination. The
possible outputs may be a sequence of recorded events (a trace) or a statistical summary of the events
observed (a profile). Profilers use a wide variety of techniques to collect data, including hardware
interrupts, code instrumentation, instruction set simulation, operating system hooks, and performance
counters.

In our work, we used qprof and kprof. Lets start with gprof.

gprof

Gprof, part of the package called qprof, is a set of profiling utilities for Linux. It includes a
simple command line profiling tool, with the following characteristics (taken from manual):

● It is intended to be easy to install and use. No kernel modules or changes are required for basic
use.

● It supports profiling of dynamically linked code and includes information on time spent in
dynamic libraries.

● It supports profiling of multithreaded
applications.

● It generates profiles for all subprocesses started from a shell. Thus it easily can be used to
profile application with multiple processes.
It tries to generate symbolic output. This is usually successful for the main program, if that has
debug information, i.e. was compiled with g.
If not, you may need a debugger to fully interpret the
results. However the raw output will often give you a rough idea of where processor time is spent.

How to do the profiling?

1. compile the code with a new flag pg:
$>gcc -pg sec.c -o sec -lSDL

2. executed the program:
$>./sec

when the program finish to work, automatic is generated one file called gmon.out

3. execute the profiler:
$>gprof sec

Program will generate a report.

For our program, we got this results:

table 1

● Time [%]: the percentage of the total running time of the program used by this function.
cumulative a running sum of the number of seconds accounted seconds for by this function and
those listed above it.

● Self [s]: the number of seconds accounted for by this function alone. This is the major sort for
this listing.

● Cumulative [s]: the number of times this function was invoked, if this function is profiled, else
blank.

● Self [ms/call]: the average number of milliseconds spent in this function per call, if this function
is profiled, else blank.

● Total [ms/call]: the average number of milliseconds spent in this function and its descendents
per call, if this function is profiled, else blank.

● Name : the name of the function. This is the minor sort for this listing. The index shows the
location of the function in the gprof listing. If the index is in parenthesis it shows where it would
appear in the gprof listing if it were to be printed.


Profiling with Gprof & Kprof

Enero 27, 2009

In this lines we are going to demonstrate the process and outcomes of the performance analysis, sequential optimization and parallelization of a program for drawing fractals. In the beginning of this report, we use profiling to analyze the performance of the parts of the program, identify the parts that consume excessive amount of resources, identify and discuss parts suitable for optimization and parallelization and then create series of tests of the parallel-ability of the code.

For the performance analysis, we use the freely accessible tools gprof and its graphical interface kprof. For the time being, these are two mostly used tools for dynamical analysis of the program (profiling).
For our experimentation we chose program written earlier by Felipe, as we believe it has pretty high potential of optimization – and also its parallelization is possible to demonstrate the strength of the technique. Objective of the program we are going to use is simply to draw a fractal based on the given parameters. Program doesn’t interact with user through the GUI, so in all cases the most time is spent by computing the result image.

The sequential program drawing the Fractal

The sequential program drawing the Fractal

In the beginning program contains just a sequential algorithm to draw the gray-scale version of a fraction of the Mandelbrot fractal. That means that program basically contain few loops for counting a value pixel by pixel. Simplicity is its main advantage. However once we start to care about performance, we have to admit, that such a approach has proved itself to be highly inefficient on the modern computers with two or more processing units.

The parallel program

The parallel program

Later on we will come to the part, where the computation can be partially (or fully) parallelized – depending on the parameters. The sequential part will remain in gray tones, and the part parallelized with the use of a monitor and threads will be displayed in saturated colors.

In the next post we will work with GProf & Kprof for do one profiling for this Program.