<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Approximatrix Forums — OpenMP Algorithm Quandry]]></title>
		<link>http://forums.approximatrix.com/viewtopic.php?id=662</link>
		<atom:link href="http://forums.approximatrix.com/extern.php?action=feed&amp;tid=662&amp;type=rss" rel="self" type="application/rss+xml" />
		<description><![CDATA[The most recent posts in OpenMP Algorithm Quandry.]]></description>
		<lastBuildDate>Thu, 16 Nov 2017 17:26:26 +0000</lastBuildDate>
		<generator>PunBB</generator>
		<item>
			<title><![CDATA[Re: OpenMP Algorithm Quandry]]></title>
			<link>http://forums.approximatrix.com/viewtopic.php?pid=3075#p3075</link>
			<description><![CDATA[<p>try posting this question at the comp.lang.fortran newsgroup.</p>]]></description>
			<author><![CDATA[null@example.com (baf1)]]></author>
			<pubDate>Thu, 16 Nov 2017 17:26:26 +0000</pubDate>
			<guid>http://forums.approximatrix.com/viewtopic.php?pid=3075#p3075</guid>
		</item>
		<item>
			<title><![CDATA[OpenMP Algorithm Quandry]]></title>
			<link>http://forums.approximatrix.com/viewtopic.php?pid=3074#p3074</link>
			<description><![CDATA[<p>Not knowing where else to turn, is there someone who can help with a problem I have trying to get the most efficient use of OpenMP formalism. I apologize in advance because as usual, I think I understand what is happening in OpenMP based on documentation but I probably have missed something important in my understanding and thus my implementation.</p><p>I have the following where I have say for discussion n=1000 object positions and I<br />need to calculate the distance between each pair of objects. Currently, I use the brute<br />force method where in OpenMP, I set the up the number of threads (NThreads) and set the shared and<br />private variables as needed. In the case below r,i and j are private but x,y and z are<br />shared. I set the OpenMP loop as &quot;DYNAMIC&quot; with an appropriate chunk size (usually between<br />10 and 30, smaller when n is smaller that 1000).</p><p>!$OMP PARALLEL SHARED (x,y,z) PRIVATE(i,j,r) NUM_THREADS(NThreads)<br />!$OMP DO SCHEDULE(DYNAMIC,20)<br />Do i=1,n-1<br /> Do j=i+1,n</p><p>&nbsp; r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j)))</p><p> enddo<br />enddo<br />!$OMP END DO <br />!$OMP END PARALLEL</p><p>The above works reasonably well but is not very efficient as the number of threads increases.<br />I assumed the inefficiencies resulted because as the top loop progresses, the number of<br />calculations per thread decreases and ending with some threads ending at different<br />times.</p><p>I wanted to find a better more efficient use for each of the threads. So I tried the<br />following. I create the top loop with NThreads and the nested loop where there are equal<br />number (or nearly equal) of distance calculations for each of the threads. So for n=1000<br />there are 249500 object pairs and say for 4 threads, that is Nij=62375 distance<br />calculations. I run through a quick count and save what I call the break points (ibrk and<br />jbrk arrays) where each of the internal i,j pair loops points would start.</p><p>Now I use the OpenMP loop as STATIC with a Chunck of 1 since now all calculations can be<br />done in order and are accounted for in the two loops. Note again the shared and private data<br />remains the same but now ibrk and jbrk are shared. </p><p>So now the nested loops look like (recreated code for this example):</p><p>!$OMP PARALLEL SHARED (x,y,z,ibrk,jbrk) PRIVATE(i,j,l,k,r) NUM_THREADS(NThreads)<br />!$OMP DO SCHEDULE(STATIC,1)<br />Do k=1,NThreads<br />&nbsp; i=ibrk(k)<br />&nbsp; j=jbrk(k)<br />&nbsp; Do l=1,62375</p><p>&nbsp; r=sqrt((x(i)-x(j))**2+(y(i)-y(j))**2+(z(i)-z(j))**2)</p><p>** Increment(i and J as necessary)**</p><p>&nbsp; enddo<br />enddo<br />!$OMP END DO <br />!$OMP END PARALLEL </p><p>My assumption in this exercise was that the second method be more efficient and run more<br />quickly however that is not the case. In this case, I can see that all the threads have the<br />same amount of work to do. However, the total execution time for this method is actually<br />longer that the first method. I cannot seem to find what is why with this method is worse.</p><p>If someone can shed some light on this, it would be much appreciated. <br />Thanks,<br />Rod</p>]]></description>
			<author><![CDATA[null@example.com (grogley)]]></author>
			<pubDate>Thu, 16 Nov 2017 10:46:40 +0000</pubDate>
			<guid>http://forums.approximatrix.com/viewtopic.php?pid=3074#p3074</guid>
		</item>
	</channel>
</rss>
