Topic: OpenMP Algorithm Quandary
Not knowing where else to turn: can someone help with a problem I am having trying to make the most efficient use of OpenMP? I apologize in advance because, as usual, I think I understand what is happening in OpenMP based on the documentation, but I have probably missed something important in my understanding and thus in my implementation.
I have the following situation: say, for discussion, n=1000 object positions, and I
need to calculate the distance between each pair of objects. Currently I use the brute-force
method, where in OpenMP I set up the number of threads (NThreads) and declare the shared and
private variables as needed. In the case below r, i and j are private while x, y and z are
shared. I set the OpenMP loop schedule to DYNAMIC with an appropriate chunk size (usually between
10 and 30, smaller when n is smaller than 1000).
!$OMP PARALLEL SHARED (x,y,z) PRIVATE(i,j,r) NUM_THREADS(NThreads)
!$OMP DO SCHEDULE(DYNAMIC,20)
Do i=1,n-1
Do j=i+1,n
r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j))**2)
enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
The above works reasonably well but becomes less efficient as the number of threads increases.
I assumed the inefficiency arises because, as the outer loop progresses, the number of
calculations per iteration decreases, so the threads end up finishing at different
times.
I wanted to make better, more efficient use of each thread, so I tried the
following. I create an outer loop over NThreads and a nested loop in which each thread performs
an equal (or nearly equal) number of distance calculations. For n=1000
there are 499500 object pairs, so for, say, 4 threads that is Nij=124875 distance
calculations each. I run through a quick counting pass and save what I call the break points (the
ibrk and jbrk arrays), i.e. the (i,j) pair at which each thread's internal loop starts.
Now I use a STATIC OpenMP schedule with a chunk size of 1, since all the calculations can now be
done in order and are accounted for in the two loops. Note that the shared and private data
remain the same, except that ibrk and jbrk are now also shared.
So now the nested loops look like this (code recreated for this example):
!$OMP PARALLEL SHARED (x,y,z,ibrk,jbrk) PRIVATE(i,j,l,k,r) NUM_THREADS(NThreads)
!$OMP DO SCHEDULE(STATIC,1)
Do k=1,NThreads
  i=ibrk(k)
  j=jbrk(k)
  Do l=1,Nij
    r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j))**2)
    ! advance to the next (i,j) pair in triangular order
    j=j+1
    if (j .gt. n) then
      i=i+1
      j=i+1
    endif
  enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
My assumption in this exercise was that the second method would be more efficient and run more
quickly; however, that is not the case. I can see that all the threads now have the
same amount of work to do, yet the total execution time for this method is actually
longer than for the first method. I cannot figure out why this method is worse.
If someone can shed some light on this, it would be much appreciated.
Thanks,
Rod