Devon & Cornwall Linux Users' Group


Re: [LUG] Error message 141



Clive,

Thanks for that info.
(I am fairly new to socket type programming and have picked this up from someone else)
Probably bad coding; it is written in-house :-)

I think strace is going to be a last-ditch effort; the amount of output it generates will be huge.

A question.

Parent runs.
 parent forks child.
  child does stuff
  child gets SIGPIPE
  child dies with SIGPIPE
 
Will this cause the parent to die as well?
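
A quick way to check this from a shell, as a rough sketch only. It assumes sh is bash or dash on Linux and that SIGPIPE is not already being ignored; the sh -c child is just a stand-in for a real x25dl1 child, and the variable names are made up for the example.

```shell
#!/bin/sh
# Stand-in child: a separate shell process that kills itself with SIGPIPE.
# (Inside sh -c, $$ is the child's own PID.)
sh -c 'kill -PIPE $$' &
child_pid=$!

# The parent reaps the child and reads its status; the signal is not passed on.
wait "$child_pid"
child_status=$?

echo "parent $$ is still running; child $child_pid ended with status $child_status"
```

If the final echo runs at all, the parent has survived the child's death. Under the assumptions above I would expect the reported status to be 128 + 13.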

I am not sure if the child is generating the error or the parent.

It would make more sense for the child to generate the error.

If I read your email correctly...
the child dies with a SIGPIPE
the parent detects the error, exits with a SIGPIPE error but is returning the full return value, 141, instead of decoding it.
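
Another reading worth ruling out, assuming the wrapper script runs under bash or a similar shell: when a command is killed by a signal, the shell sets $? to 128 plus the signal number. SIGPIPE is 13 on Linux, so 141 would also fit the parent itself being killed by SIGPIPE. A small sketch of decoding a status that way (decode_status is a made-up helper name, not something from x25dl1):

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell exit status using the 128+N convention.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Above 128 the shell is usually reporting death by signal (status - 128).
        sig=$((status - 128))
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

decode_status 141
```

This cannot tell a real signal death apart from a program that calls exit(141) itself, so it is only a hint.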

What should happen, of course, is that the child should detect the error and do something useful, like log an error and die gracefully rather than falling over.
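
Something along these lines, sketched in shell rather than C because that is easier to try out. The worker function, its log message and exit code 3 are all invented for the example, and it assumes a shell whose echo reports a failed write (bash and current dash do). The idea is to ignore SIGPIPE so that a write to a dead peer fails with an error the code can see, instead of killing the process.

```shell
#!/bin/sh
# Hypothetical worker: keeps writing until the reader disappears,
# then logs the problem and exits with an ordinary exit code.
worker() {
    log=$1
    trap '' PIPE                    # ignore SIGPIPE: a failed write now returns an error
    while echo "payload" 2>/dev/null; do
        :                           # keep sending while the reader is alive
    done
    echo "worker: reader went away, exiting cleanly" >> "$log"
    exit 3                          # plain exit code instead of death by signal
}

log=$(mktemp)

# head quits after one line, so the worker's next write hits a closed pipe.
{ ( worker "$log" ); echo "worker exit status: $?" >> "$log"; } | head -n 1 > /dev/null

cat "$log"
```

In C the equivalent would be ignoring SIGPIPE and checking the return value of every write or send, but I have not looked at how x25dl1 does its writes.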

Does that make sense???

 --
'ooroo

Mike...(:)-)
---------------------------------------------------
Email: mike@xxxxxxxxxxxxx        o
You need only two tools.        o /////
A hammer and duct tape. If it    /@   `\  /) ~
doesn't move and it should use  >  (O)  X<  ~  Fish!!
the hammer. If it moves and      `\___/'  \) ~
shouldn't, use the tape.           \\\
---------------------------------------------------


On 24.11.2003 09:55 "Darke, Clive" wrote:
> Mike,
> 
> I don't know this program, but I suspect your guess with SIGPIPE is correct.
> As you know in Perl you have to $? >> 8 to get the exit code.  This is not
> just being obscure, the C interface (wait/waitpid) is the same.  It is
> possible (although bad programming) that it is just returning the full
> return value from waitpid without decoding it.
> 
> A child sends a SIGCHLD to the parent when it dies, which at worst will
> produce a zombie, not kill the parent.  However if a pipe/unix domain socket
> writer loses its reader, then SIGPIPE is raised.
> 
> Might I suggest that you run the parent under strace?  Try:  strace -o
> strace.out -f x25dl1 2>&1 >> $log.  The -o will dump all kernel calls to
> strace.out (might be large!) and the -f means "follow", ie. include child
> processes in the trace.  You should then be able to see the SIGPIPE, or
> whatever is raised, in the strace.out file.
> 
> Clive
> 
> -----Original Message-----
> From: mike@xxxxxxxxxxxxx [mailto:mike@xxxxxxxxxxxxx]
> Sent: 23 November 2003 21:02
> To: list@xxxxxxxxxxxx
> Subject: Re: [LUG] Error message
> 
> 
> On 23.11.2003 13:04 mike@xxxxxxxxxxxxx wrote:
> > G'day all,
> > 
> > I have a bit of code...
> > 
> > log="/log/x25dl1.log";
> > while true
> > do
> >   x25dl1 2>&1 >> $log
> >   status=$?
> >   dd=`date +"%d-%b-%Y %H:%M:%S"|tr [:lower:] [:upper:]`
> >   echo "$dd [$$] FATAL: x25dl1 has died. Status = $status">>$log
> >   sleep 5
> > done
> > 
> > In the log file I get...
> > 22-NOV-2003 19:00:02 [688] FATAL: x25dl1 has died. Status = 141
> > 
> > I can't find what the status of 141 is. Any ideas?
> > 
> > x25dl1 is a c program.
> > If it was perl I would have said it was a SIGPIPE error, which would make
> > sense since it uses sockets to communicate.
> 
> 
> OK, more on this.
> 
> x25dl1 is a forking process (In more ways than one!!)
> 
> The process that dies is the parent, when it dies it takes all the children
> with it.
> This could be bad, in that the children might actually be doing something at
> the time.
> 
> I think there might be a problem with a child that is causing the parent to
> die, does that help/make sense?
> 
> There is another problem.
> The children send and receive via sockets: each one sends data to another
> process and then waits for the other process to reply.
> Sometimes the other process never answers and the x25dl1 child sits there for
> ever waiting for data that is never going to appear (why I am not sure yet,
> but I am working on that)
> In the mean time I have a process that runs from cron every 15 minutes and
> kills off x25dl1 children that have been running for more than 15 minutes.
> This is just to reduce the clutter, if the child is still running after
> about 15 seconds something has gone wrong.
> 
> Could this kill process be causing a problem? I.E. causing the parent to
> die, I don't see a link as yet, but I am beginning to wonder.
> The log file shows that the parent dies at either 00,15,30,45 minutes past
> the hour which is when the kill process runs.
> But it also dies with the same error outside these times as well.
> 
>  --
> 'ooroo
> 
> Mike...(:)-)


