#### Topic: OpenMP Algorithm Quandary

Not knowing where else to turn: can someone help with a problem I have trying to make the most efficient use of OpenMP? I apologize in advance because, as usual, I think I understand what is happening in OpenMP based on the documentation, but I have probably missed something important in my understanding, and thus in my implementation.

I have, say for discussion, n = 1000 object positions, and I need to calculate the distance between each pair of objects. Currently I use the brute-force method: in OpenMP I set the number of threads (NThreads) and declare the shared and private variables as needed. In the code below, r, i and j are private while x, y and z are shared. I set the OpenMP loop schedule to DYNAMIC with an appropriate chunk size (usually between 10 and 30, smaller when n is less than 1000).

```fortran
!$OMP PARALLEL SHARED(x,y,z) PRIVATE(i,j,r) NUM_THREADS(NThreads)
!$OMP DO SCHEDULE(DYNAMIC,20)
Do i=1,n-1
   Do j=i+1,n
      r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j))**2)
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
```

The above works reasonably well but is not very efficient as the number of threads increases. I assumed the inefficiency arises because the amount of work per iteration of the outer loop shrinks as it progresses (for i = 1 the inner loop runs n-1 times, but for i = n-1 it runs only once), so the threads end at different times.

I wanted to find a better, more efficient use for each of the threads, so I tried the following. I make the top loop run over the NThreads threads, and arrange the nested loop so that each thread performs an equal (or nearly equal) number of distance calculations. For n = 1000 there are n(n-1)/2 = 499500 object pairs, so for, say, 4 threads that is Nij = 124875 distance calculations per thread. I run through a quick counting pass and save what I call the break points (the ibrk and jbrk arrays), i.e. the (i,j) pair at which each thread's block of the loops starts.
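For reference, the counting pass looks roughly like this (recreated as a small standalone program for this post; in my real code n, NThreads and the arrays come from elsewhere, and I assume here that the pair count divides evenly among the threads):

```fortran
program breakpoints
   implicit none
   integer, parameter :: n = 1000, NThreads = 4
   integer :: ibrk(NThreads), jbrk(NThreads)
   integer :: i, j, k, icount, Npairs, Nij

   Npairs = n*(n-1)/2          ! 499500 pairs for n = 1000
   Nij    = Npairs/NThreads    ! pair calculations per thread (assumed to divide evenly)
   icount = 0
   k      = 1
   Do i = 1, n-1
      Do j = i+1, n
         if (k <= NThreads .and. icount == (k-1)*Nij) then
            ibrk(k) = i        ! thread k's block starts at pair (i,j)
            jbrk(k) = j
            k = k + 1
         endif
         icount = icount + 1
      enddo
   enddo

   Do k = 1, NThreads
      print *, 'thread', k, 'starts at (', ibrk(k), ',', jbrk(k), ')'
   enddo
end program breakpoints
```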

Now I use the OpenMP loop with SCHEDULE(STATIC) and a chunk size of 1, since all the calculations can now be done in order and are accounted for in the two loops. Note again that the shared and private data remain the same, except that ibrk and jbrk are now shared as well. So now the nested loops look like this (code recreated for this example):

```fortran
!$OMP PARALLEL SHARED(x,y,z,ibrk,jbrk) PRIVATE(i,j,l,k,r) NUM_THREADS(NThreads)
!$OMP DO SCHEDULE(STATIC,1)
Do k=1,NThreads
   i=ibrk(k)
   j=jbrk(k)
   Do l=1,Nij                  ! Nij = number of pair calculations per thread
      r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j))**2)
      j=j+1                    ! advance to the next (i,j) pair
      if (j > n) then
         i=i+1
         j=i+1
      endif
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
```
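As a sanity check on the increment step sketched above, this small serial program (recreated for this post, not my production code) walks the pairs from (1,2) with that rule and confirms it visits every pair exactly once:

```fortran
program checkwalk
   implicit none
   integer, parameter :: n = 1000
   integer :: i, j, icount

   i = 1                       ! start at the first pair (1,2)
   j = 2
   icount = 1
   Do
      if (i == n-1 .and. j == n) exit   ! reached the last pair
      j = j + 1                ! same increment rule as in the parallel loop
      if (j > n) then
         i = i + 1
         j = i + 1
      endif
      icount = icount + 1
   enddo

   print *, 'pairs visited:', icount, '  expected:', n*(n-1)/2
end program checkwalk
```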

My assumption in this exercise was that the second method would be more efficient and run more quickly; however, that is not the case. I can see that all the threads now have the same amount of work to do, yet the total execution time for this method is actually longer than for the first. I cannot figure out why this method is worse. If someone can shed some light on this, it would be much appreciated.

Thanks,

Rod