Devon & Cornwall Linux Users' Group


RE: [LUG] Error message 141



Mike,

I'm guessing of course, but I think it might be the parent dying with the
SIGPIPE.  If the child is being killed (with a -KILL) then really the parent
should detect the EPIPE.
You might have two processes reading from the same pipe/socket; it could be
that the size of the message blocks being sent is not consistent, or there
is a timing issue.  It could also be that the PIPE_BUF value is being
exceeded, so the reads/writes are not atomic.  I can't really say without
trawling through the source code.  Can't you get the developer on board?
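
A minimal sketch of checking the PIPE_BUF point, assuming a POSIX shell with
getconf available.  POSIX only guarantees atomic writes of up to PIPE_BUF
bytes for pipes and FIFOs, so this says nothing certain about sockets.
fits_pipe_buf is a made-up helper name, and the 8192-byte message size is
only an illustration, not a value taken from x25dl1.

```shell
#!/bin/sh
# Report whether a message of a given size fits within PIPE_BUF.
# fits_pipe_buf is a hypothetical helper; "/" is just an existing
# directory for getconf to query.
fits_pipe_buf() {
    # Fall back to 512, the minimum PIPE_BUF that POSIX allows,
    # if getconf is missing or fails.
    limit=$(getconf PIPE_BUF / 2>/dev/null) || limit=512
    [ "$1" -le "$limit" ]
}

# 8192 is an assumed message size, purely for illustration.
if fits_pipe_buf 8192; then
    echo "8192-byte writes to a pipe are atomic here"
else
    echo "8192-byte writes to a pipe may be interleaved here"
fi
```

If the real message blocks are larger than the reported limit, two writers
sharing one pipe could see their data interleaved.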

(I would still strace, and grep for ERR)
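
Since the trace is likely to be large, a narrower grep may help.  This is
only a sketch: find_pipe_errors is a hypothetical helper name, strace.out is
the output file name suggested in this thread, and the exact wording of
strace's signal lines varies between versions, so the pattern is kept loose.

```shell
#!/bin/sh
# List the lines of an strace log that mention a broken pipe or a
# fatal signal.  find_pipe_errors is a hypothetical helper name.
find_pipe_errors() {
    # -n prints line numbers so matches can be located in a large trace.
    # EPIPE appears on failed write/send calls, SIGPIPE on signal delivery.
    grep -nE 'EPIPE|SIGPIPE|killed by SIG' "$1"
}

# Usage, once a trace exists:
#   find_pipe_errors strace.out
```

With strace -f each line of the trace starts with a process ID, which
should show whether the parent or a child received the signal.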

Clive

-----Original Message-----
From: mike@xxxxxxxxxxxxx [mailto:mike@xxxxxxxxxxxxx]
Sent: 24 November 2003 11:19
To: list@xxxxxxxxxxxx
Subject: Re: [LUG] Error message 141


Clive,

Thanks for that info.
(I am fairly new to socket type programming and have picked this up from
someone else)
Probably bad coding, it is written in house :-)

I think strace is going to be a last ditch effort, the amount of output it
generates will be huge....

A question.

Parent runs.
 parent forks child.
  child does stuff
  child gets SIGPIPE
  child dies with SIGPIPE
 
Will this cause the parent to die as well?

I am not sure if the child is generating the error or the parent.

It would make more sense for the child to generate the error.

If I read your email correctly...
the child dies with a SIGPIPE
the parent detects the error, exits with a SIGPIPE error but is returning
the full return value, 141, instead of decoding it.

What should happen of course is that the child should detect the error and
do something useful, like log an error and die gracefully rather than
falling over.

Does that make sense???
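
As a rough illustration of what 141 can mean when it comes from a shell's
$?: POSIX-style shells report a process killed by signal N as 128 + N, and
SIGPIPE is 13 on Linux, so 141 is consistent with death by SIGPIPE.  A
program could also call exit(141) itself, so treat this as a hint rather
than proof.  decode_status is a hypothetical helper, not part of x25dl1 or
of the wrapper script.

```shell
#!/bin/sh
# Decode a shell exit status ($?) into a readable description.
# decode_status is a hypothetical helper name.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Shells report death-by-signal as 128 + signal number.
        sig=$((status - 128))
        # kill -l NUMBER prints the signal name without the SIG prefix.
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "killed by signal $sig ($name)"
    else
        echo "exited with code $status"
    fi
}

decode_status 141
```

In the wrapper loop this could be called as decode_status "$status" just
before the FATAL line is written to the log.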

 --
'ooroo

Mike...(:)-)
---------------------------------------------------
Email: mike@xxxxxxxxxxxxx        o
You need only two tools.        o /////
A hammer and duct tape. If it    /@   `\  /) ~
doesn't move and it should use  >  (O)  X<  ~  Fish!!
the hammer. If it moves and      `\___/'  \) ~
shouldn't, use the tape.           \\\
---------------------------------------------------


On 24.11.2003 09:55 "Darke, Clive" wrote:
> Mike,
> 
> I don't know this program, but I suspect your guess with SIGPIPE is
> correct.
> As you know in Perl you have to $? >> 8 to get the exit code.  This is not
> just being obscure, the C interface (wait/waitpid) is the same.  It is
> possible (although bad programming) that it is just returning the full
> return value from waitpid without decoding it.
> 
> A child sends a SIGCHLD to the parent when it dies, which at worst will
> produce a zombie, not kill the parent.  However if a pipe/unix domain
> socket writer loses its reader, then SIGPIPE is raised.
> 
> Might I suggest that you run the parent under strace?  Try:  strace -o
> strace.out -f x25dl1 >> $log 2>&1.  The -o will write all system calls to
> strace.out (might be large!) and the -f means "follow", i.e. include child
> processes in the trace.  You should then be able to see the SIGPIPE, or
> whatever is raised, in the strace.out file.
> 
> Clive
> 
> -----Original Message-----
> From: mike@xxxxxxxxxxxxx [mailto:mike@xxxxxxxxxxxxx]
> Sent: 23 November 2003 21:02
> To: list@xxxxxxxxxxxx
> Subject: Re: [LUG] Error message
> 
> 
> On 23.11.2003 13:04 mike@xxxxxxxxxxxxx wrote:
> > G'day all,
> > 
> > I have a bit of code...
> > 
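> > log="/log/x25dl1.log"
> > while true
> > do
> >   # Redirect stdout first, then point stderr at it, so both reach the log.
> >   x25dl1 >> "$log" 2>&1
> >   status=$?
> >   # Quote the character classes so the shell cannot glob-expand them.
> >   dd=$(date +"%d-%b-%Y %H:%M:%S" | tr '[:lower:]' '[:upper:]')
> >   echo "$dd [$$] FATAL: x25dl1 has died. Status = $status" >> "$log"
> >   sleep 5
> > done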
> > 
> > In the log file I get...
> > 22-NOV-2003 19:00:02 [688] FATAL: x25dl1 has died. Status = 141
> > 
> > I can't find what the status of 141 is. Any ideas?
> > 
> > x25dl1 is a C program.
> > If it was Perl I would have said it was a SIGPIPE error, which would make
> > sense since it uses sockets to communicate.
> 
> 
> OK, more on this.
> 
> x25dl1 is a forking process (In more ways than one!!)
> 
> The process that dies is the parent, when it dies it takes all the children
> with it.
> This could be bad, in that the children might actually be doing something
> at the time.
> 
> I think there might be a problem with a child that is causing the parent to
> die, does that help/make sense?
> 
> There is another problem.
> The children send and receive via sockets: each child sends data to another
> process and then waits for the other process to reply.
> Sometimes the other process never answers and the x25dl1 child sits there
> for ever waiting for data that is never going to appear (why I am not sure
> yet, but I am working on that).
> In the meantime I have a process that runs from cron every 15 minutes and
> kills off x25dl1 children that have been running for more than 15 minutes.
> This is just to reduce the clutter; if the child is still running after
> about 15 seconds something has gone wrong.
> 
> Could this kill process be causing a problem, i.e. causing the parent to
> die?  I don't see a link as yet, but I am beginning to wonder.
> The log file shows that the parent dies at either 00, 15, 30 or 45 minutes
> past the hour, which is when the kill process runs.
> But it also dies with the same error outside these times as well.
> 
>  --
> 'ooroo
> 
> Mike...(:)-)

--
The Mailing List for the Devon & Cornwall LUG
Mail majordomo@xxxxxxxxxxxx with "unsubscribe list" in the
message body to unsubscribe.

________________________________________________________________________
This e-mail has been scanned for all viruses by Star Internet. The
service is powered by MessageLabs. For more information on a proactive
anti-virus service working around the clock, around the globe, visit:
http://www.star.net.uk
________________________________________________________________________
