Topic: Program randomly (it seems) halts execution

I am struggling with an issue in my simulation program where it just halts execution at random points. Sometimes when restarted, it will halt at the same point, but other times it will continue to some other point. Note too that when I say halts, I mean the console that the code runs in just stops: no error messages, segmentation faults or anything. CPU usage just drops (I have other simulations running simultaneously on my development system), and that is how I know it has halted.

Understand that this code is "stable" in the sense that I am currently running about 8 simulations on 4 different machines (Win 7 Pro and Win10 Pro). These instances may run for weeks or months without being stopped unless there is a power outage that forces the issue.

I saw in some previous posts that if I run the code in debug mode, it will stop at the offending part of the code. I am trying this, but the code refuses to halt in this mode. I am also running the debug, non-optimized version outside the debugger, and so far it has not halted. As time goes on, I wonder if the issue has something to do with the optimization I am using, but again, all the other instances currently running were built with full optimization and do not show this problem. I have seen this on these other machines, but usually restarting the sim will make the problem go away.

I am using SF 2.41 build 2559, developing the F90 executable as a Win64 console application on a Win10 Pro 64 OS. The program seems memory stable and the instance I am testing is using about 32 Mbytes.

At this point I am just grasping for help from anyone who might have seen this sort of thing.
Thanks in advance.

Rod

Oh, here is the latest make file:
#
# Automagically generated by Approximatrix Simply Fortran 2.41
#
FC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gfortran.exe"
CC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gcc.exe"
AR="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\ar.exe"
WRC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\windres.exe"
RM=rm -f


OPTFLAGS= -g -mtune=broadwell

SPECIALFLAGS=$(IDIR)

RCFLAGS=-O coff

PRJ_FFLAGS= -fopenmp

PRJ_CFLAGS=

PRJ_LFLAGS=-Wl,--stack,150000000 -lgomp

FFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_FFLAGS) -Jmodules

CFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_CFLAGS)

"build\riod.o": ".\riod.f90"
    @echo Compiling .\riod.f90
    @$(FC) -c -o "build\riod.o" $(FFLAGS) ".\riod.f90"

clean: .SYMBOLIC
    @echo Deleting build\riod.o and related files
    @$(RM) "build\riod.o"
    @echo Deleting default icon resource
    @$(RM) "build\sf_default_resource.res"
    @echo Deleting riod.exe
    @$(RM) "riod.exe"

"riod.exe":  "build\riod.o" "build\Riod-MP-F90.prj.target"
    @echo Generating riod.exe
    @$(FC) -o "riod.exe" -static -fopenmp "build\riod.o" $(LDIR) $(PRJ_LFLAGS)

all: "riod.exe" .SYMBOLIC

Re: Program randomly (it seems) halts execution

I would suggest enabling runtime diagnostics (In Project Options under "Fortran").  This mode is different from debugging and does not require debugging to be enabled.  It sounds very much like there is some sort of memory violation occurring, especially if everything works when debugging is enabled.  With runtime diagnostics, the program should halt if there are any Fortran-specific array bounds violations or similar issues.

Also, I noticed you're tuning for "Broadwell."  Is there any reason you haven't just used -mtune=native?  It lets the compiler decide the best optimizations to enable.

Jeff Armstrong
Approximatrix, LLC

Re: Program randomly (it seems) halts execution

Thanks Jeff, I will try the run time diagnostics option. Since I have no idea what to expect with this option, is there something I can read that will clue me in as to possible outcomes?

I only have a vague idea why I am using the Broadwell tuning; I think I tested a bunch of different CPU tunings and this one seemed to work the best. What will the Native setting do on different CPU types, like AMD versus Intel? Currently I am only using Intel chips but that could change.

Rod

Here is the latest makefile:

#
# Automagically generated by Approximatrix Simply Fortran 2.41
#
FC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gfortran.exe"
CC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gcc.exe"
AR="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\ar.exe"
WRC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\windres.exe"
RM=rm -f


OPTFLAGS= -O3 -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -floop-parallelize-all -mtune=native

SPECIALFLAGS=$(IDIR)

RCFLAGS=-O coff

PRJ_FFLAGS= -fcheck=all -fopenmp

PRJ_CFLAGS=

PRJ_LFLAGS=-Wl,--stack,150000000 -lgomp

FFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_FFLAGS) -Jmodules

CFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_CFLAGS)

"build\riod.o": ".\riod.f90"
    @echo Compiling .\riod.f90
    @$(FC) -c -o "build\riod.o" $(FFLAGS) ".\riod.f90"

clean: .SYMBOLIC
    @echo Deleting build\riod.o and related files
    @$(RM) "build\riod.o"
    @echo Deleting default icon resource
    @$(RM) "build\sf_default_resource.res"
    @echo Deleting riod.exe
    @$(RM) "riod.exe"

"riod.exe":  "build\riod.o" "build\Riod-MP-F90.prj.target"
    @echo Generating riod.exe
    @$(FC) -o "riod.exe" -static -fopenmp "build\riod.o" $(LDIR) $(PRJ_LFLAGS)

all: "riod.exe" .SYMBOLIC

Re: Program randomly (it seems) halts execution

When using runtime diagnostics, your program will terminate immediately if any bounds errors occur.  You can look at the -fcheck=all option to see what is actually being checked.  If an error occurs, your program will simply terminate with a location of the error regardless of whether you're compiling and running with debugging.
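
As a minimal sketch of the kind of error these checks catch (the module and routine names here are made up for illustration and are not from the simulation), consider a routine that writes the first n elements of an array:

```fortran
module bounds_demo
  implicit none
contains
  ! Hypothetical example routine: writes a(1) .. a(n).
  ! If the caller passes n larger than size(a), the loop writes past
  ! the end of the array.
  subroutine fill_first(a, n)
    real, intent(inout) :: a(:)   ! assumed-shape, so the bounds are known at run time
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      a(i) = real(i)
    end do
  end subroutine fill_first
end module bounds_demo
```

Built with -fcheck=all, a call such as fill_first(a, size(a) + 1) should stop with a Fortran runtime error that names the source line and the offending index. Built without it, the same call writes past the end of the array silently. One caveat: the check relies on the compiler knowing the array bounds, so an assumed-size dummy argument declared as a(*) is not checked against its upper bound.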

Broadwell is just a (relatively dated) design from Intel.  Using "native" as your preferred architecture will generate an executable that is tuned specifically for the current host chip, but the executable should still run on any 64-bit x86 CPU.  Using Broadwell is already creating an executable that isn't particularly tuned for AMD CPUs, so there's no harm in switching to "native" for now.

Jeff Armstrong
Approximatrix, LLC

Re: Program randomly (it seems) halts execution

Thanks Jeff.

Just as an update, I have been running the offending instance with the runtime diagnostics checked for nearly a day without any halts. The running executable is compiled with full optimization just as I normally do but has not shown the problem yet. I will continue running to see if it happens but it is frustrating trying to get to the bottom of this behavior.

I may switch back to my older executable to see if it will halt again.
Rod

6 (edited by grogley 2019-10-10 21:28:13)

Re: Program randomly (it seems) halts execution

Jeff,

So I tested older builds and they continued to halt. I went back and rebuilt the code with the runtime diagnostics enabled and started it up again. Now it has halted, but there is nothing different about the halted program: no diagnostics are present, no clues as to what is going on. I checked the build file and, as you mentioned, the -fcheck=all is in the makefile.

I was running the code in a "Start" mode in a Windows CMD environment that closes when Ctrl-C is pressed in the window. I am now just running it in a standard CMD window that will stay open after a Ctrl-C to see if that helps.

See below for the makefile:

#
# Automagically generated by Approximatrix Simply Fortran 2.41
#
FC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gfortran.exe"
CC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\gcc.exe"
AR="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\ar.exe"
WRC="C:\Program Files (x86)\Simply Fortran 2\mingw-w64\bin\windres.exe"
RM=rm -f


OPTFLAGS= -O3 -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -floop-parallelize-all -mtune=native

SPECIALFLAGS=$(IDIR)

RCFLAGS=-O coff

PRJ_FFLAGS= -fcheck=all -fopenmp

PRJ_CFLAGS=

PRJ_LFLAGS=-Wl,--stack,150000000 -lgomp

FFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_FFLAGS) -Jmodules

CFLAGS=$(SPECIALFLAGS) $(OPTFLAGS) $(PRJ_CFLAGS)

"build\riod.o": ".\riod.f90"
    @echo Compiling .\riod.f90
    @$(FC) -c -o "build\riod.o" $(FFLAGS) ".\riod.f90"

clean: .SYMBOLIC
    @echo Deleting build\riod.o and related files
    @$(RM) "build\riod.o"
    @echo Deleting default icon resource
    @$(RM) "build\sf_default_resource.res"
    @echo Deleting riod.exe
    @$(RM) "riod.exe"

"riod.exe":  "build\riod.o" "build\Riod-MP-F90.prj.target"
    @echo Generating riod.exe
    @$(FC) -o "riod.exe" -static -fopenmp "build\riod.o" $(LDIR) $(PRJ_LFLAGS)

all: "riod.exe" .SYMBOLIC

Re: Program randomly (it seems) halts execution

Does the code run properly if you disable OpenMP?

Jeff Armstrong
Approximatrix, LLC

8 (edited by grogley 2019-10-11 13:33:48)

Re: Program randomly (it seems) halts execution

Jeff,

Do I have to recompile without the OpenMP options, or is running the code with a single thread the same thing? I can set the code to run with just a single thread or as many as the OS will allow; I am running one instance with a single thread now, along with another instance that has 6 threads rather than the 8 it was running with originally.

The 6 thread instance has been running for several hours without halting.

My instance running overnight with 8 threads halted with no diagnostics.

EDIT: I terminated the single thread test and just recompiled without the OpenMP option. My question above was dumb!
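
For reference, the two cases are not quite the same thing. A rough sketch, assuming the usual handling of the !$ conditional-compilation sentinel (the module and function names are hypothetical): built with -fopenmp, the lines starting with !$ are compiled and the parallel region still goes through the OpenMP runtime even when limited to one thread; built without -fopenmp, those lines are plain comments and no OpenMP code is involved at all.

```fortran
module build_mode
  !$ use omp_lib
  implicit none
contains
  ! Returns .true. only when the file was compiled with OpenMP enabled.
  function openmp_enabled() result(flag)
    logical :: flag
    flag = .false.
    !$ flag = .true.
  end function openmp_enabled

  ! Returns the number of threads that actually form a parallel region.
  ! Without OpenMP the directives are comments and the result stays 1.
  function team_size() result(n)
    integer :: n
    n = 1
    !$omp parallel
    !$omp single
    !$ n = omp_get_num_threads()
    !$omp end single
    !$omp end parallel
  end function team_size
end module build_mode
```

A single-thread run of an OpenMP build reports openmp_enabled() as true with team_size() equal to 1, while a build without the OpenMP option reports it as false, so the two tests exercise different code paths.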

Rod

Re: Program randomly (it seems) halts execution

Jeff,

I will be away for the next 10 days or so, and just to follow up on this topic, I ran the code for 5-6 hours with OpenMP disabled without any halting. I also ran the 6 thread instance for about the same amount of time, also without halting. So to answer your last question, it appears that my OpenMP code may be the issue.

While this was going on, I did some code review to be sure that the number of threads expected to be used was not arbitrarily being changed or used incorrectly. I did find a bug in a rarely called subroutine that I recently made multi-threaded, where the dimensions of the threaded arrays were one element too small. This was an easy fix; I rebuilt the code, and it has been running nearly 11 hours as I write this without incident.
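
A sketch of the sort of sizing mistake described above, with made-up names and a trivial workload (the actual subroutine is not shown in this thread, so this is only an assumed shape of the bug). omp_get_thread_num() counts from 0, so a per-thread array indexed by thread number needs either bounds 0:nthreads-1 or, as here, bounds 1:nthreads with 1 added to the thread number:

```fortran
module per_thread_sum
  !$ use omp_lib
  implicit none
contains
  ! Hypothetical example: sums 1..n using one accumulator slot per thread.
  function sum_to(n) result(total)
    integer, intent(in) :: n
    integer :: total
    integer, allocatable :: partial(:)
    integer :: nthreads, slot, i

    nthreads = 1
    !$ nthreads = omp_get_max_threads()

    ! One slot per thread. An array one element shorter than this would
    ! let the highest-numbered thread write outside the array.
    allocate(partial(nthreads))
    partial = 0

    !$omp parallel private(slot, i)
    slot = 1
    !$ slot = omp_get_thread_num() + 1   ! thread numbers start at 0
    !$omp do
    do i = 1, n
      partial(slot) = partial(slot) + i
    end do
    !$omp end do
    !$omp end parallel

    total = sum(partial)
  end function sum_to
end module per_thread_sum
```

Sizing the array from the thread count at run time, rather than from a fixed constant, keeps the storage in step with however many threads the instance is started with.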

I am hopeful that this was the issue, but I am skeptical because the offending subroutine is not called often in this simulation scenario, and the output sequence leading up to each halt does not suggest that this routine was called before the halt.

Before I could finish this note, I made one more change to the code to print out to the console when the fixed subroutine is called. I rebuilt and started the code up in the usual mode. Understand that the usual start method is to use the Windows start command with an affinity mask to efficiently utilize the system threads that are available. So this instance uses 8 threads and has a hex mask of 0xff00. My development system has 16 threads available and has another instance running using 8 threads in a mask of 0x00ff (note that this other instance has never halted, nor have any of the other instances currently running on other machines).
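
The mask arithmetic can be checked with a few lines of Fortran. This is only an illustrative sketch with hypothetical function names; each set bit in the mask stands for one logical processor the instance may run on:

```fortran
module affinity_masks
  implicit none
contains
  ! Number of logical processors selected by an affinity mask
  ! (one per set bit).
  pure function cpus_in_mask(mask) result(n)
    integer, intent(in) :: mask
    integer :: n
    n = popcnt(mask)
  end function cpus_in_mask

  ! .true. if two masks share at least one logical processor.
  pure function masks_overlap(a, b) result(overlap)
    integer, intent(in) :: a, b
    logical :: overlap
    overlap = iand(a, b) /= 0
  end function masks_overlap
end module affinity_masks
```

With the two masks above, cpus_in_mask(int(z'FF00')) and cpus_in_mask(int(z'00FF')) both give 8, and masks_overlap reports .false., so the two instances are confined to separate halves of the 16 logical processors.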

This rebuilt code halted soon after it was started. There was one difference from the overnight run: I had run that one in a standard CMD window, without the affinity mask set. I am trying a plain CMD window again.

So I am back to where this all started. The only thing that seems certain is that this is probably an OpenMP coding issue. But whatever that may be, I am at a loss to find it. This code runs forever in other instances, using the same run characteristics and on several different hardware platforms. I have no doubt that the 7 simulations I have running on 3 other machines will still be happily running (power outages notwithstanding) when I return.

Rod

Re: Program randomly (it seems) halts execution

Couple of comments. 

Jeff recommended runtime diagnostics, which is appropriate for your situation.  However, the executable will likely run several times slower than "normal", and you may be fooled into thinking it has run further into the simulation than it actually has.

The gfortran team is always making improvements in the compiler, and the version of SF you are using now ships with a somewhat dated version of the compiler.  You might try installing the latest version of SF to gain access to a more current version of gfortran (9.1 at the moment) which may be helpful.

11 (edited by grogley 2019-10-22 14:27:48)

Re: Program randomly (it seems) halts execution

I have returned from my trip and want to follow up on the current status.

The troublesome simulation instance ran to completion (and beyond) while I was gone, without the halting it had been showing. Based on what I think was happening when I left, I expected this to run without incident. The reason is that I was running it in a standard CMD window, which presents all possible thread options to the executable. As I noted earlier, the program would halt when run in the "Start" mode with the affinity mask set to 8 threads. This suggests a threading issue, but I am unsure where it is happening; as noted before, I use this "Start" method for running many simulations and see no halting from them. Also, when I started this simulation again using "Start" with the affinity mask set, it halted within the hour; it ran for 10 days while I was gone...
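
One way to gather more evidence (a hedged sketch, not a diagnosis): have the program print what the OpenMP runtime reports at startup, so a run launched through "Start" with an affinity mask can be compared against a run in a plain CMD window. The routine names are hypothetical; omp_get_num_procs and omp_get_max_threads are the standard OpenMP library calls:

```fortran
module thread_report
  !$ use omp_lib
  implicit none
contains
  ! Returns what the OpenMP runtime reports; both values are 1 when the
  ! file is compiled without OpenMP.
  subroutine thread_environment(nprocs, maxthreads)
    integer, intent(out) :: nprocs, maxthreads
    nprocs = 1
    maxthreads = 1
    !$ nprocs = omp_get_num_procs()
    !$ maxthreads = omp_get_max_threads()
  end subroutine thread_environment

  ! Prints the values and flags the case where more threads would be
  ! used than processors are reported as available.
  subroutine log_thread_environment()
    integer :: nprocs, maxthreads
    call thread_environment(nprocs, maxthreads)
    write(*, '(a,i0,a,i0)') 'processors reported: ', nprocs, &
                            '  max threads: ', maxthreads
    if (maxthreads > nprocs) then
      write(*, '(a)') 'note: more threads than processors reported'
    end if
  end subroutine log_thread_environment
end module thread_report
```

Calling log_thread_environment once at the top of the main program would show whether the number of threads the code asks for ever exceeds the processors the runtime believes are available under the mask.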

Responding to Baf1, thank you for your suggestions. I was aware that the code would run much slower and did let it run (if I remember correctly) many hours; be aware that the halting would happen randomly so it is possible that I didn't wait long enough but I was approaching diminishing returns and given the above discussion, I felt that it ran enough. Also, the diagnostics showed no information when it did halt.

Finally, I agree that the new compiler would probably help. However, I am retired now and have limited budget to upgrade. Upgrading is something I would have to discuss with the CFO (read wife, LOL!).

Thanks again,
Rod