Devon & Cornwall Linux Users' Group


RE: [LUG] Error message



Mike,

I don't know this program, but I suspect your guess of SIGPIPE is correct.
As you know, in Perl you have to use $? >> 8 to get the exit code.  This is
not just Perl being obscure; the C interface (wait/waitpid) works the same
way.  It is possible (although bad programming) that the program is just
returning the full status value from waitpid without decoding it.
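
For what it's worth, the 141 in the log comes from the shell's $?, and POSIX shells report a command killed by signal N as a status above 128 (bash and dash use 128 + N). A minimal sketch of the decoding, assuming Linux signal numbering; decode_status is a made-up helper name, not part of x25dl1 or the wrapper script:

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell exit status the way $? reports it.
# Assumption: a status above 128 means "killed by signal (status - 128)".
# A program can also call exit() with such a value itself, so this is a
# strong hint rather than proof.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the signal name without the SIG prefix.
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

decode_status 141
```

On Linux signal 13 is SIGPIPE, so a status of 141 fits the SIGPIPE theory.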

When a child dies, the parent is sent a SIGCHLD, which at worst will
produce a zombie, not kill the parent.  However, if a pipe/Unix domain socket
writer loses its reader, then SIGPIPE is raised in the writer.
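
The writer-loses-its-reader case is easy to reproduce from the shell. This is only a sketch using yes and head as stand-ins for the real processes; writer_status is a made-up name, and it assumes SIGPIPE has its default action (not ignored) in the shell that runs it:

```shell
#!/bin/sh
# Hypothetical demo: report the exit status of a writer whose reader goes away.
writer_status() {
    tmp=$(mktemp)
    # yes writes forever; head reads one line and exits, closing the pipe.
    # The next write by yes raises SIGPIPE, which kills it. The subshell
    # survives and records the status it saw for yes.
    ( yes; echo $? > "$tmp" ) | head -n 1 > /dev/null
    cat "$tmp"
    rm -f "$tmp"
}

writer_status
```

On Linux with the default SIGPIPE action this should print 141, the same status as in the log.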

Might I suggest that you run the parent under strace?  Try:

    strace -o strace.out -f x25dl1 >> $log 2>&1

The -o will write all system calls to strace.out (might be large!) and the
-f means "follow forks", i.e. include child processes in the trace.  You
should then be able to see the SIGPIPE, or whatever is raised, in the
strace.out file.
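
Since strace.out may be large, it can help to pull out just the signal lines. A rough sketch, assuming strace's usual markers ("--- SIG... ---" for a delivered signal and "+++ killed by ... +++" when a process dies from one); find_signals is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: show only signal deliveries and signal deaths
# from an strace output file, with line numbers.
find_signals() {
    # -e keeps grep from treating the leading dashes as an option.
    grep -n -E -e '--- SIG' -e '\+\+\+ killed by' "$1"
}

# Only run against strace.out if a trace has actually been captured.
if [ -f strace.out ]; then
    find_signals strace.out
fi
```

With -f each line starts with the PID, so the matches should also show which process (parent or child) received the signal.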

Clive

-----Original Message-----
From: mike@xxxxxxxxxxxxx [mailto:mike@xxxxxxxxxxxxx]
Sent: 23 November 2003 21:02
To: list@xxxxxxxxxxxx
Subject: Re: [LUG] Error message


On 23.11.2003 13:04 mike@xxxxxxxxxxxxx wrote:
> G'day all,
> 
> I have a bit of code...
> 
> log="/log/x25dl1.log"
> while true
> do
>   # redirect stdout first, then point stderr at the same file
>   x25dl1 >> "$log" 2>&1
>   status=$?
>   dd=`date +"%d-%b-%Y %H:%M:%S" | tr '[:lower:]' '[:upper:]'`
>   echo "$dd [$$] FATAL: x25dl1 has died. Status = $status" >> "$log"
>   sleep 5
> done
> 
> In the log file I get...
> 22-NOV-2003 19:00:02 [688] FATAL: x25dl1 has died. Status = 141
> 
> I can't find what the status of 141 is. Any ideas?
> 
> x25dl1 is a C program.
> If it were Perl I would have said it was a SIGPIPE error, which would make
> sense since it uses sockets to communicate.


OK, more on this.

x25dl1 is a forking process (in more ways than one!!)

The process that dies is the parent; when it dies, it takes all the children
with it.
This could be bad, in that the children might actually be doing something at
the time.

I think there might be a problem with a child that is causing the parent to
die. Does that help/make sense?

There is another problem.
The children send and receive via sockets: each child sends data to another
process and then waits for that process to reply.
Sometimes the other process never answers and the x25dl1 child sits there
forever, waiting for data that is never going to appear (why, I am not sure
yet, but I am working on that).
In the meantime I have a process that runs from cron every 15 minutes and
kills off x25dl1 children that have been running for more than 15 minutes.
This is just to reduce the clutter; if a child is still running after
about 15 seconds, something has gone wrong.

Could this kill process be causing a problem, i.e. causing the parent to
die? I don't see a link as yet, but I am beginning to wonder.
The log file shows that the parent dies at 00, 15, 30 or 45 minutes past
the hour, which is when the kill process runs.
But it also dies with the same error outside these times.

 --
'ooroo

Mike...(:)-)
---------------------------------------------------
Email: mike@xxxxxxxxxxxxx        o
You need only two tools.        o /////
A hammer and duct tape. If it    /@   `\  /) ~
doesn't move and it should use  >  (O)  X<  ~  Fish!!
the hammer. If it moves and      `\___/'  \) ~
shouldn't, use the tape.           \\\
---------------------------------------------------

--
The Mailing List for the Devon & Cornwall LUG
Mail majordomo@xxxxxxxxxxxx with "unsubscribe list" in the
message body to unsubscribe.

________________________________________________________________________
This e-mail has been scanned for all viruses by Star Internet. The
service is powered by MessageLabs. For more information on a proactive
anti-virus service working around the clock, around the globe, visit:
http://www.star.net.uk
________________________________________________________________________
