sasomao

News...

Commented: sasomao 622

August 27 2010

Hi guys,

I reran 5 jobs on the same machine, and once again they all crashed at the same time. This time 3 output files ended with

Execution Stopped: Unhandled signal caught (UNKNOWN: 1)

while the other two output files were not written at that moment; instead, two core* files were created at the same time:

212976740 -rw------- 1 me invites 21968 août 27 11:59 out14.out
212976738 -rw------- 1 me invites 21968 août 27 11:59 out13.out
212976736 -rw------- 1 me invites 21968 août 27 11:59 out12.out
212976744 -rw------- 1 me invites 93229056 août 27 11:59 core.11163
212976743 -rw------- 1 me invites 109871104 août 27 11:59 core.11084

So the total is still 5. It really looks like a problem with multiple jobs running at the same time.

also...

Commented: sasomao 622

August 27 2010

Also, files like this one:

212976737 -rw------- 1 me invites 73911 août 27 10:42 .nfs00000000cb1c461000000e7

appear while the jobs are running.
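
Files named `.nfs…` are what an NFS client leaves behind when a file is removed while some process still has it open, so their presence suggests the job directory lives on an NFS mount. Whether that has anything to do with the crashes is not established here. A minimal sketch for checking both points, assuming GNU coreutils and findutils; the function names `fs_type` and `list_nfs_leftovers` are hypothetical helpers, not part of any tool:

```shell
# Print the filesystem type of a directory (default: current directory).
# Assumes GNU coreutils stat; on an NFS mount this typically prints "nfs".
fs_type() {
  stat -f -c %T "${1:-.}"
}

# List leftover .nfs* files sitting directly inside a directory.
list_nfs_leftovers() {
  find "${1:-.}" -maxdepth 1 -type f -name '.nfs*'
}
```

Running `fs_type .` and `list_nfs_leftovers .` from the job directory while the jobs are active shows whether the output files are being written over the network.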

Evidences...

Commented: sasomao 622

August 27 2010

Hi all,

I may have some new evidence.
Yesterday I ran the same file 5 times on the same computer (usually I use different computers), and all the runs crashed with a core dump and unknown error 1.

Here are the logs:

tail -n 1 psiPi4epsilonPi2M2.8eta0.25*/log.txt
==> psiPi4epsilonPi2M2.8eta0.253/log.txt <==
Beginning h21 and v4

==> psiPi4epsilonPi2M2.8eta0.25-4/log.txt <==
Beginning h21 and v4

==> psiPi4epsilonPi2M2.8eta0.25-5/log.txt <==
Beginning h22

==> psiPi4epsilonPi2M2.8eta0.25-6/log.txt <==
Beginning h31

==> psiPi4epsilonPi2M2.8eta0.25-first/log.txt <==
Beginning h31

==> psiPi4epsilonPi2M2.8eta0.25/log.txt <==
Beginning h22

So there does not actually seem to be one specific function that breaks the code. The cause must be something else.
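
To compare the crash points at a glance, the last line of every log can be tallied instead of read one by one. This is only a sketch built on the same `tail -n 1` idea as above; the function name `tally_last_lines` is hypothetical:

```shell
# Count how many logs end with each distinct last line,
# most frequent first. Arguments: one or more log files.
tally_last_lines() {
  for f in "$@"; do
    # $(...) strips a trailing newline, printf puts exactly one back,
    # so logs that end without a newline are handled too.
    printf '%s\n' "$(tail -n 1 "$f")"
  done | sort | uniq -c | sort -rn
}
```

Called as `tally_last_lines psiPi4epsilonPi2M2.8eta0.25*/log.txt`, it prints one line per distinct final log entry with its count. If the counts are spread over several procedures, that supports the reading that no single procedure is responsible.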

What is interesting is that all these jobs failed in the same minute:

212976735 -rw------- 1 me invites 21963 aoû 27 01:39 out5.out
212976225 -rw------- 1 me invites 21963 aoû 27 01:39 out4.out
212976224 -rw------- 1 me invites 21963 aoû 27 01:39 out3.out
212976223 -rw------- 1 me invites 21962 aoû 27 01:39 out2.out
212976221 -rw------- 1 me invites 21961 aoû 27 01:39 out1.out

So it seems that the problem is not with a single job, but with the very fact of running multiple jobs at the same time. I will run tests today to check whether there is a systematic failure when more than one job is launched on the same machine.
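
The "same minute" observation can be checked mechanically by grouping files on their modification time. A small sketch, assuming GNU `date` (whose `-r FILE` option prints a file's last modification time); the function name `count_by_minute` is hypothetical:

```shell
# For the given files, count how many were last modified in each minute.
count_by_minute() {
  for f in "$@"; do
    date -r "$f" '+%Y-%m-%d %H:%M'
  done | sort | uniq -c
}
```

For example, `count_by_minute out*.out core.*` prints one line per minute with the number of files last written in it. A single line covering every file means all the jobs stopped together.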

Is there anything that forbids launching (say) five jobs on the same PC and that could explain the crashes (let me remind you that the programs crashed after 5 hours)?
Should I set a particular option in my code for these multiple runs?

Thanks
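
One way to test the "multiple jobs" hypothesis systematically is to launch the same command several times in parallel and record each exit status. The sketch below is an illustration under stated assumptions, not a known fix: `run_parallel` is a hypothetical helper, and the command passed to it stands for however the .mpl file is normally launched.

```shell
# Usage: run_parallel N OUTDIR command [args...]
# Starts N copies of the command in the background, sends each copy's
# output to OUTDIR/run<i>.out, then reports every exit status.
# Returns success only if all copies exited with status 0.
run_parallel() {
  n=$1
  outdir=$2
  shift 2
  mkdir -p "$outdir"
  pids=""
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" > "$outdir/run$i.out" 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  i=1
  for pid in $pids; do
    if wait "$pid"; then status=0; else status=$?; fi
    echo "run $i: exit status $status"
    [ "$status" -eq 0 ] || failed=$((failed + 1))
    i=$((i + 1))
  done
  [ "$failed" -eq 0 ]
}
```

In a POSIX shell, a status above 128 conventionally means the process was killed by a signal numbered status minus 128. If the 1 in "UNKNOWN: 1" is a signal number (an assumption, nothing in this thread confirms it), the matching status would be 129, and signal 1 on Linux is SIGHUP, which is sent when the controlling terminal or session goes away.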

Salvo

Then why not every time?...

Answered: sasomao 622

August 26 2010

Hi Duncan,

All these jobs are run on a Fedora 13 x86_64 machine. Also, what does "compiled" mean in this context?
The system administrator of our lab installed Maple from the DVD (I guess it was a bash installer), so no real compilation was performed.

I didn't build any library of my own. All the procedures I use are defined in the file itself.

And if the problem were the architecture, wouldn't it show up at EVERY run, rather than randomly like this?

Salvo

random?...

Answered: sasomao 622

August 26 2010

Hi Robert, all

I found your suggestion useful and seeded my mpl file with log entries. Last night I left three jobs running, and two stopped with unknown error 1 etc.

Basically my program has two parts: the first, where I define a lot of procedures, and the second, where two nested for loops run the procedures defined above and append output to some txt files.

Both crashed jobs were running the same procedure:

Beginning the for[27,2] #### 27 and 2 are the values of the FORs indexes
Beginning p_network
Letting p_network
Beginning h1_network
Letting p_network
Beginning snr2
Letting snr2
Beginning low_fisher
Letting low_fisher
Beginning up_fisher
Letting up_fisher
Beginning crlb's calculations
Letting crlbs
Beginning h2 network
Letting h2 network
Beginning h3 network
Letting h3 network
Beginning h31
Letting h31
Beginning h22

for one, and

Beginning the for[0,5]
Beginning p_network
Letting p_network
Beginning h1_network
Letting p_network
Beginning snr2
Letting snr2
Beginning low_fisher
Letting low_fisher
Beginning up_fisher
Letting up_fisher
Beginning crlb's calculations
Letting crlbs
Beginning h2 network
Letting h2 network
Beginning h3 network
Letting h3 network
Beginning h31
Letting h31
Beginning h22

for the other. So h22 seems to be the problem. I was very happy with that, but only for a few seconds.

I tried to reproduce the error in the graphical interface, xmaple (on a different PC), copying the code and starting the for loops at the guilty values, 27 and 2, from the first run. But it simply works, without any problem. h22 is calculated and all is fine... The problem seems to happen randomly. How am I supposed to fix it if I cannot make it happen?

very long...

Commented: sasomao 622

August 25 2010

Hi Acer,

the code is some 2000 lines long now (although I guess I could remove all the comments and some indexing functions, getting below 1500 lines).

Would it be worth posting it?

Salvo

Help welcomed...

Commented: sasomao 622

August 24 2010

Hi all,

the problem is still there: today two jobs stopped after 7-8 hours, with this mysterious unknown signal and the core file.

I tried to take a look at a core file. Most of it is binary garbage, but some lines are in English. Among others, there was this strange warning:

machine is big endian but maple was not compiled so

buried in a lot of binary symbols. I don't know if this is related to the problem.
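
Rather than paging through the binary, the readable text in a core file can be extracted and searched. A sketch, assuming either the `strings` utility or GNU `grep` is installed; `core_strings` is a hypothetical helper name:

```shell
# Usage: core_strings FILE PATTERN
# Print the readable runs of at least 8 characters in FILE that match
# PATTERN (case-insensitive). Uses strings when available, otherwise
# falls back to GNU grep on the raw bytes.
core_strings() {
  file=$1
  pattern=$2
  if command -v strings >/dev/null 2>&1; then
    strings -n 8 "$file"
  else
    LC_ALL=C grep -a -o '[[:print:]]\{8,\}' "$file"
  fi | grep -i -- "$pattern"
}
```

For instance, `core_strings core.11163 endian` would show the line quoted above along with any similar ones. Keep in mind that a core file is a copy of the process memory, so message text found in it may simply be a constant stored in the program and not something that was actually printed.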

By the way, I'm trying to run the program with printlevel=10, but this messes up all my output files, which are created in the mpl file using commands like:

appendto(cat(new_dir_name,"/crlb-seq-theta.txt")):
printf("crlb_theta=[ ");

Unfortunately, with printlevel set, the created file is not as it should be: it contains a lot of printlevel lines. Is there a way to avoid that?

Thanks

Salvo

Hi, ok for the delay, I understand. I'm...

Answered: sasomao 622

August 18 2010

Hi,

OK about the delay, I understand. I'm a Linux user, and I'm using Chrome to browse.

I cleared the cache, but I still don't see my answer in my older thread. I did not understand what you said about replies not going to the top of the stack. I went to my older thread, pressed "reply" on the last message there, and replied. If it doesn't go to the top of the latest questions, how can people reply to it?

Thanks

Salvo

still there...

Commented: sasomao 622

August 18 2010

Hi All,

I'm digging out this thread of mine, because I still have the same problem with this UNKNOWN 1.

I admit that I have not understood what assertlevel does. I tried setting it to 2, but I don't see any difference in a dummy program I created ad hoc.

The problem with "trace" is that I don't know which function is crashing; the error message doesn't say. I cannot trace all the functions, there are a lot of them (the program is 1800 lines long).
The same goes for infolevel: I would have to modify all the procedures.

The problem with printlevel is that it prints a huge amount of output, and my garbage file becomes too big to read.

Is there a way to find out in which procedure the error begins, before starting the real debugging?

Thanks

Salvo

Hi Axel, hi allhow should I modify the i...

Answered: sasomao 622

June 11 2010

Hi Axel, hi all

How should I modify the integration routine to get an even more precise answer?

Thanks

Salvo

Thanks!...

Answered: sasomao 622

June 02 2010

Thanks!

622 Reputation

7 Badges

MaplePrimes Activity

These are replies submitted by sasomao
