Discussion:
Marking proxy servers as zombie - odd behaviour
John Horne
2010-06-17 10:26:37 UTC
Permalink
Hello,

We have 3 backend servers which are used in a client-balance mode from
our local proxy server. We are running FR 2.1.10 (from git), but have
seen the following behaviour when we were running 2.1.7 and 2.1.9 for a
short time. Our logs are showing that FR marks the backend servers as
zombie every few minutes. Not all of them at once, just one or maybe
two, and then one will be marked as alive and another marked as zombie,
and so on. As far as we can see there is no general network problem, and
neither the radius nor backend servers are under any particular load.

Running tcpdump for a short time seemed to show that something odd was
going on (I have included the actual date/time of the packets, but
removed the contents. I think the packet header lines should show what I
mean):

=============================================================
2010-06-17 10:47:35 IP 141.163.66.101.1812 > 141.163.195.250.1814:
RADIUS, Access Accept (2), id: 0xdd length: 278

2010-06-17 10:47:46 IP 141.163.66.101.1812 > 141.163.195.250.1814:
RADIUS, Access Reject (3), id: 0xb2 length: 20
=============================================================


At this time, our radius log showed:

=============================================================
Thu Jun 17 10:47:35 2010 : Auth: Login OK: [xxxxxx] (from client
GRN-DVB-WISM-1 port 29 cli 7C-6D-62-63-03-8D)
Thu Jun 17 10:47:46 2010 : Proxy: Marking home server 141.163.66.101
port 1812 as zombie (it looks like it is dead).
Thu Jun 17 10:47:46 2010 : Proxy: Received response to status check
10959 (1 in current sequence)
=============================================================


So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.

So I think the question is why did FR think the server was zombie when
it had received an access-accept just a few seconds before? Why does it
think it looks like it is dead? The tcpdump shows no other packets being
sent to/from the server for FR to think that. And why is it that when FR
thinks the backend server is dead it receives a status check reply at
the same time? Did it send the status check query (and if so why didn't
tcpdump show it?) but for whatever reason immediately decide that the
server had not replied?

Maybe I am wrong but I would not have expected FR to even consider the
backend server as zombie/dead given that it had received a packet from
it 10 seconds before.



Thanks,

John.
--
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287 Fax: +44 (0)1752 587001

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Josip Rodin
2010-06-17 12:16:10 UTC
Permalink
Post by John Horne
So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.
Can you paste the exact home_server settings you used? That might shed
some light...
--
2. That which causes joy or happiness.
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Alan Buxey
2010-06-17 13:09:44 UTC
Permalink
Hi,
Post by Josip Rodin
Post by John Horne
So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.
Can you paste the exact home_server settings you used? That might shed
some light...
expected response time? what are your timers? 5 seconds? whats the cleanup_delay
for example?

alan
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
John Horne
2010-06-17 13:29:56 UTC
Permalink
Post by Alan Buxey
Hi,
Post by Josip Rodin
Post by John Horne
So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.
Can you paste the exact home_server settings you used? That might shed
some light...
expected response time?
A few seconds at most.
Post by Alan Buxey
what are your timers? 5 seconds?
We are currently letting them default.

Running 'radiusd -XC' shows:

===========================================
home_server IAS {
ipaddr = ias.plymouth.ac.uk IP address [141.163.66.101]
port = 1812
type = "auth"
secret = xxxxxxxxx
response_window = 30
max_outstanding = 65536
zombie_period = 40
status_check = "request"
ping_interval = 30
check_interval = 30
num_answers_to_alive = 3
num_pings_to_alive = 3
revive_interval = 300
status_check_timeout = 4
username = xxxxxxx
password = xxxxxxx
irt = 2
mrt = 16
mrc = 5
mrd = 30
}
===========================================
Post by Alan Buxey
whats the cleanup_delay for example?
max_request_time = 30
cleanup_delay = 5
max_requests = 262144




John.
--
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287 Fax: +44 (0)1752 587001

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
John Horne
2010-06-17 13:10:26 UTC
Permalink
Post by Josip Rodin
Post by John Horne
So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.
Can you paste the exact home_server settings you used? That might shed
some light...
For this particular server:

home_server IAS {
type = auth
ipaddr = ias.plymouth.ac.uk
port = 1812
secret = xxxxxx
# status_check = status-server
status_check = request
username = xxxxxx
password = xxxxxx
}




John.
--
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287 Fax: +44 (0)1752 587001

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Alan DeKok
2010-06-17 15:54:16 UTC
Permalink
Post by John Horne
So what is being seen is that backend server 141.163.66.101 has sent an
accept accept packet (to the local proxy server 195.250) and the log
shows a user as having authenticated. About 10 seconds later, the server
is marked as zombie, but tcpdump shows that a packet (access reject - we
have status-server set up with an invalid userid, so the reject is
correct) is received from that server.
Yes, that can happen.
Post by John Horne
So I think the question is why did FR think the server was zombie when
it had received an access-accept just a few seconds before? Why does it
think it looks like it is dead?
Because the home server didn't respond to *another* request.

Each request has a timer. If the home server doesn't respond within
that time, then it is marked "zombie".
Post by John Horne
The tcpdump shows no other packets being
sent to/from the server for FR to think that. And why is it that when FR
thinks the backend server is dead it receives a status check reply at
the same time?
Because the failure to respond gets it marked as zombie. When it gets
marked zombie, FreeRADIUS starts pinging it. It responds to the ping,
but that doesn't mean it's responsive. It takes 3 pings before it's
marked "alive" again.
Post by John Horne
Maybe I am wrong but I would not have expected FR to even consider the
backend server as zombie/dead given that it had received a packet from
it 10 seconds before.
The "mark as zombie" code could be less aggressive, yes.

Alan DeKok.

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
John Horne
2010-06-17 16:19:31 UTC
Permalink
Post by Alan DeKok
Why does it think it looks like it is dead?
Because the home server didn't respond to *another* request.
Each request has a timer. If the home server doesn't respond within
that time, then it is marked "zombie".
Hmm. Given that the servers are lightly loaded, I guess we are looking
at packet loss over the network? I have no idea if our networking people
can 'detect' something like that, but I guess monitoring traffic coming
out of the proxy and going to the home server should show if a packet is
lost somewhere.

Many thanks for all the replies. We always have at least one home server
'alive' according to FR so this has not been a problem for our users. It
just seemed a bit odd behaviour when we looked into the log messages
more closely.


Thanks,

John.
--
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287 Fax: +44 (0)1752 587001

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Alan DeKok
2010-06-17 16:29:34 UTC
Permalink
Post by John Horne
Hmm. Given that the servers are lightly loaded, I guess we are looking
at packet loss over the network?
Yes. Many packets lost. The NAS re-transmits, FR re-transmits, and
the home server doesn't respond.

The default timeout before marking a home server zombie is 30s. If
there aren't *any* responses to retransmits in 30s, something is very wrong.
Post by John Horne
Many thanks for all the replies. We always have at least one home server
'alive' according to FR so this has not been a problem for our users. It
just seemed a bit odd behaviour when we looked into the log messages
more closely.
I'll take a look at making the zombie code less aggressive.

Alan DeKok.
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Bjørn Mork
2010-06-17 18:08:19 UTC
Permalink
Post by Alan DeKok
Post by John Horne
Hmm. Given that the servers are lightly loaded, I guess we are looking
at packet loss over the network?
Yes. Many packets lost. The NAS re-transmits, FR re-transmits, and
the home server doesn't respond.
The default timeout before marking a home server zombie is 30s. If
there aren't *any* responses to retransmits in 30s, something is very wrong.
Given these conditions, I believe it's much more likely that the home
servers *choose* not to respond to some of the requests. There are lots
of reasons why this might happen. Some are even default FreeRADIUS
behaviour (if no module handle an accounting request, then it isn't
acked).

I would start by looking for any such deliberately ignored request. As
you say, packet loss is a very unlikely explanation.



Bjørn

-
List info/subscribe/unsubscribe? See http://
John Horne
2010-06-17 18:55:03 UTC
Permalink
Post by Bjørn Mork
I would start by looking for any such deliberately ignored request.
I am told that the home server logs show nothing suspicious. I have no
direct access to those servers so I cannot say for myself. However, I
have asked that the logs are monitored and if some part of it can be
provided to me, then I will try and correlate what I see on the proxy
server with the home server logs.




John.
--
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287 Fax: +44 (0)1752 587001

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
Bjørn Mork
2010-06-17 16:38:47 UTC
Permalink
Post by John Horne
Post by Alan DeKok
Why does it think it looks like it is dead?
Because the home server didn't respond to *another* request.
Each request has a timer. If the home server doesn't respond within
that time, then it is marked "zombie".
Hmm. Given that the servers are lightly loaded, I guess we are looking
at packet loss over the network?
Or maybe some odd request the home servers refuse to answer?


Bjørn

-
List info/subscribe/u
Loading...