Discussion:
Proxy realms and home_server_pool fallback not working
Peter Lambrechtsen
2016-03-06 23:54:41 UTC
Hi

I'm looking to add more robustness to my proxy architecture and noticed
that in the home_server_pool there is an option, "fallback = virtualrealm",
so that if all home servers fail, a last-resort home_server is used with
some local config to always accept / reject customers based on the realm
they are coming from. I'm not using status_check, as some of the
downstream clients don't support Status-Server, but I will look into that
to see if it makes a difference. However, for this situation I would
expect that whether or not you are using Status-Server checks shouldn't
have any impact on how the fallback server works.

In the proxy.conf I have configured:

home_server ProxyDest {
    type = auth+acct
    ipaddr = 192.168.1.113
    port = 1812
    secret = password
    response_window = 1
    require_message_authenticator = no
    zombie_period = 5
    revive_interval = 10
    status_check = none
    #status_check = status-server
    # username = "test_user_please_reject_me"
    # password = "this is really secret"
    check_interval = 10
    num_answers_to_alive = 3
    max_outstanding = 65536
}

home_server cacheuser {
    virtual_server = cacheuser
}

# Main server pool
#
home_server_pool ProxyDestPool {
    type = fail-over
    home_server = ProxyDest
    # home_server = cacheuser
    fallback = cacheuser
}

Then in my virtual server I have configured:

server cacheuser {
    authorize {
        accept
    }
}

So when the destination server is up, life is good.

(0) Proxying request to home server 192.168.1.113 port 1812 timeout 1.000000
(0) Sent Access-Request Id 26 from 0.0.0.0:58512 to 192.168.1.113:1812
length 337
...
Waking up in 0.3 seconds.
(0) Marking home server 192.168.1.113 port 1812 alive
(0) Clearing existing &reply: attributes
(0) Received Access-Accept Id 26 from 192.168.1.113:1812 to
192.168.1.116:58512 length 55

But if the server is down, the first request gets a reject, as expected,
due to the home server being down.

(2) Proxying request to home server 192.168.1.113 port 1812 timeout 1.000000
(2) Sent Access-Request Id 17 from 0.0.0.0:47755 to 192.168.1.113:1812
length 337
...
Waking up in 0.3 seconds.
(2) Expecting proxy response no later than 0.669753 seconds from now
Waking up in 0.4 seconds.
(2) No proxy response, giving up on request and marking it done
Marking home server 192.168.1.113 port 1812 as zombie (it has not responded
in 1.000000 seconds).
(2) ERROR: Failing proxied request for user "peter", due to lack of any
response from home server 192.168.1.113 port 1812
(2) Clearing existing &reply: attributes

But I would expect the second and subsequent requests to be proxied to the
local fallback virtual server, since the home_server has been marked as
zombie. That never seems to happen: the requests keep being rejected, and
the fallback never seems to be used.

If I configure a second home server in the pool:

home_server_pool ProxyDestPool {
    type = fail-over
    home_server = ProxyDest
    home_server = cacheuser
    fallback = cacheuser
}

Then the second server is failed over to when the first fails. That's all
good if I want to use type fail-over, but if I want to use load-balance
then I can't have my fallback server as a home server, otherwise a
percentage of requests will always be handled locally, which isn't ideal.

The other interesting thing with the failover is that I set check_interval
to 10 seconds, or 30 seconds, but the home server only seems to be
re-checked after 60 seconds and then assumed to be back up.

Waking up in 0.2 seconds.
Marking home server 192.168.1.113 port 1812 alive again... we have no idea
if it really is alive or not.
Waking up in 1.0 seconds.

I would have thought that

zombie_period = 5
revive_interval = 10
check_interval = 10

would mean that the home server would be re-checked after 10 seconds.

Am I misunderstanding how fallback is supposed to work?

Cheers

Peter
Alan DeKok
2016-03-07 01:55:44 UTC
Post by Peter Lambrechtsen
I'm looking to add more robustness to my proxy architecture and noticed
that in the home_server_pool there is an option, "fallback = virtualrealm",
so that if all home servers fail, a last-resort home_server is used with
some local config to always accept / reject customers based on the realm
they are coming from. I'm not using status_check
Then you can do "status_check = request". An Access-Accept or Access-Reject response will be accepted as an indication that the home server is alive.
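
For example, something like this (a sketch only; the timer values are
illustrative, and the username/password lines are the ones you already
have commented out):

home_server ProxyDest {
    ...
    # Any response to this test request counts as proof of life, so
    # home servers that don't speak Status-Server can still be checked.
    status_check = request
    username = "test_user_please_reject_me"
    password = "this is really secret"
    check_interval = 30
    num_answers_to_alive = 3
}
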
Post by Peter Lambrechtsen
as some of the
downstream clients don't support status-server, but I will look into that
to see if it makes a difference.
It should.
Post by Peter Lambrechtsen
However, for this situation I would expect that whether or not you are
using Status-Server checks shouldn't have any impact on how the fallback
server works.
It does. A lot.

The problem is that without Status-Server, FreeRADIUS has to *guess* when the home server is alive. And the guess is usually wrong. Because most guesses are wrong.
Post by Peter Lambrechtsen
home_server ProxyDest {
    type = auth+acct
    ipaddr = 192.168.1.113
    port = 1812
    secret = password
    response_window = 1
    require_message_authenticator = no
    zombie_period = 5
    revive_interval = 10
That's really low. After 10s, just mark the home server alive?

It should be 60s at the minimum. Maybe 5min.
Post by Peter Lambrechtsen
But if the server is down, the first request gets a reject, as expected,
due to the home server being down.
That's good.
Post by Peter Lambrechtsen
But I would expect the second and subsequent requests to be proxied to the
local fallback virtual server, since the home_server has been marked as
zombie. That never seems to happen: the requests keep being rejected, and
the fallback never seems to be used.
Hmm... I'll take a look.
Post by Peter Lambrechtsen
If I configure a second home server in the pool.
...
Post by Peter Lambrechtsen
Then the second server is failed over to when the first fails. That's all
good if I want to use type fail-over, but if I want to use load-balance
then I can't have my fallback server as a home server, otherwise a
percentage of requests will always be handled locally, which isn't ideal.
Yes. You can't do load-balance and fallback.

You *can* put something into Post-Proxy-Type Fail. Which is probably what we should do. And remove the fallback virtual server.

This allows the same behaviour for all packets, and simplifies the proxy code.
Post by Peter Lambrechtsen
The other interesting thing with the failover is that I set check_interval
to 10 seconds, or 30 seconds, but the home server only seems to be
re-checked after 60 seconds and then assumed to be back up.
Because you have revive_interval set.
Post by Peter Lambrechtsen
Waking up in 0.2 seconds.
Marking home server 192.168.1.113 port 1812 alive again... we have no idea
if it really is alive or not.
And that message is printed only when you have revive_interval set.

The solution is to *not* set revive_interval. And use Status-Server exclusively.
Post by Peter Lambrechtsen
Waking up in 1.0 seconds.
I would have thought that
zombie_period = 5
revive_interval = 10
check_interval = 10
would mean that the home server would be re-checked after 10 seconds.
check_interval and revive_interval should be mutually exclusive. It just doesn't make sense to both check that the home server is alive every 10s, and then *always* mark it as alive after 10s.
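
In other words, pick one strategy and set only its timers. A sketch of
the Status-Server-only shape (the timer values are illustrative):

home_server ProxyDest {
    ...
    zombie_period = 60
    status_check = status-server
    check_interval = 30
    num_answers_to_alive = 3
    # No revive_interval: the status checks decide when it's alive.
}
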
Post by Peter Lambrechtsen
Am I misunderstanding how fallback is supposed to work?
A bit.

But the fallback virtual server should work. Though I'm inclined to remove it in 3.1, as it makes everything more complicated.

Alan DeKok.


Peter Lambrechtsen
2016-03-07 08:22:36 UTC
Post by Alan DeKok
Post by Peter Lambrechtsen
I'm looking to add more robustness to my proxy architecture and noticed
that in the home_server_pool there is an option, "fallback = virtualrealm",
so that if all home servers fail, a last-resort home_server is used with
some local config to always accept / reject customers based on the realm
they are coming from. I'm not using status_check
Then you can do "status_check = request". An Access-Accept or
Access-Reject response will be accepted as an indication that the home
server is alive.
Post by Peter Lambrechtsen
as some of the
downstream clients don't support status-server, but I will look into that
to see if it makes a difference.
It should.
Post by Peter Lambrechtsen
However, for this situation I would expect that whether or not you are
using Status-Server checks shouldn't have any impact on how the fallback
server works.
It does. A lot.
The problem is that without Status-Server, FreeRADIUS has to *guess*
when the home server is alive. And the guess is usually wrong. Because
most guesses are wrong.
Yes, I'd figured that out. I'm now pinging all our downstream RADIUS
clients to see which respond with something sane when sent a Status-Server
request, and then turning on status checks for them.
Post by Alan DeKok
Post by Peter Lambrechtsen
home_server ProxyDest {
    type = auth+acct
    ipaddr = 192.168.1.113
    port = 1812
    secret = password
    response_window = 1
    require_message_authenticator = no
    zombie_period = 5
    revive_interval = 10
That's really low. After 10s, just mark the home server alive?
It should be 60s at the minimum. Maybe 5min.
It was purely for testing, as waiting around for 10 seconds is much better
than waiting around for 2 minutes. Now, with check_interval and status
checks turned on, things are making more sense.
Post by Alan DeKok
Post by Peter Lambrechtsen
But I would expect the second and subsequent requests to be proxied to the
local fallback virtual server, since the home_server has been marked as
zombie. That never seems to happen: the requests keep being rejected, and
the fallback never seems to be used.
Hmm... I'll take a look.
Post by Peter Lambrechtsen
If I configure a second home server in the pool.
...
Post by Peter Lambrechtsen
Then the second server is failed over to when the first fails. That's all
good if I want to use type fail-over, but if I want to use load-balance
then I can't have my fallback server as a home server, otherwise a
percentage of requests will always be handled locally, which isn't ideal.
Yes. You can't do load-balance and fallback.
You *can* put something into Post-Proxy-Type Fail. Which is probably
what we should do. And remove the fallback virtual server.
What could I do in Post-Proxy-Type? I can't call the virtual server, and
Proxy-To-Realm doesn't proxy to a new destination, nor does setting the
control to accept. There doesn't seem to be a way to turn a Reject from a
failed proxy request back into an Accept.

(0) ERROR: Failing proxied request for user "peter", due to lack of any
response from home server 192.168.1.113 port 1812
(0) Clearing existing &reply: attributes
(0) Found Post-Proxy-Type Fail-Authentication
(0) # Executing group from file ./sites-enabled/default
(0) Post-Proxy-Type Fail-Authentication {
(0) policy accept {
(0) update control {
(0) &Response-Packet-Type = Access-Accept
(0) } # update control = noop
(0) [handled] = handled
(0) } # policy accept = handled
(0) } # Post-Proxy-Type Fail-Authentication = handled
(0) There was no response configured: rejecting request
(0) Using Post-Auth-Type Reject
Post by Alan DeKok
This allows the same behaviour for all packets, and simplifies the proxy code.
Post by Peter Lambrechtsen
The other interesting thing with the failover is that I set check_interval
to 10 seconds, or 30 seconds, but the home server only seems to be
re-checked after 60 seconds and then assumed to be back up.
Because you have revive_interval set.
Post by Peter Lambrechtsen
Waking up in 0.2 seconds.
Marking home server 192.168.1.113 port 1812 alive again... we have no idea
if it really is alive or not.
And that message is printed only when you have revive_interval set.
The solution is to *not* set revive_interval. And use Status-Server exclusively.
Post by Peter Lambrechtsen
Waking up in 1.0 seconds.
I would have thought that
zombie_period = 5
revive_interval = 10
check_interval = 10
would mean that the home server would be re-checked after 10 seconds.
check_interval and revive_interval should be mutually exclusive. It
just doesn't make sense to both check that the home server is alive every
10s, and then *always* mark it as alive after 10s.
Post by Peter Lambrechtsen
Am I misunderstanding how fallback is supposed to work?
A bit.
But the fallback virtual server should work. Though I'm inclined to remove
it in 3.1, as it makes everything more complicated.
Thanks for all your help on this. The fail-over with the second server
being the virtual one seems to work well; it just means I am restricted to
a single server and can't use load-balance. But this config would be my
ideal:

home_server_pool ProxyDestPool {
    type = load-balance
    home_server = ProxyDest1
    home_server = ProxyDest2
    home_server = ProxyDest3
    fallback = cacheuser
}

Where, if all the home servers go AWOL, I use the local virtual server
cacheuser.

Many thanks

Peter
Alan DeKok
2016-03-07 14:04:43 UTC
Post by Peter Lambrechtsen
Yes, I'd figured that out. I'm now pinging all our downstream RADIUS
clients to see which respond with something sane when sent a Status-Server
request, and then turning on status checks for them.
Or just send Access-Request with a fake username "thisismejusttesting". They'll respond with an Access-Reject, which is good enough to determine that they're alive.
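
You can do that test by hand with radclient, e.g. (a sketch; adjust the
address and shared secret to match your home server):

$ echo "User-Name = thisismejusttesting, User-Password = whatever" | \
    radclient -x 192.168.1.113:1812 auth password

An Access-Reject coming back is enough to show that the server is up and
processing packets.
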
Post by Peter Lambrechtsen
What could I do in Post-Proxy-Type?
Anything you can do anywhere else.

The fallback virtual server is just there for ease of use. But... it complicates the proxy handling, as you've seen. A simpler approach is to put all of the "unlang" handling into... unlang. And not into the proxy code.
Post by Peter Lambrechtsen
As I can't call the virtual server,
We'll fix that for 3.2.
Post by Peter Lambrechtsen
and
Proxy-To-Realm doesn't proxy to a new destination nor does setting the
control to accept.
The home server pools should take care of fail-over to another home server. But yes, once the whole pool has failed... you can't send the packet to a different destination. That's what home server pools are for...
Post by Peter Lambrechtsen
There doesn't seem to be a way to turn a Reject from a
failed proxy request back into an Accept.
A failed proxy request is not really a reject... it's just a failed request. And you can force it to be an Access-Accept via Post-Proxy-Type Fail:

post-proxy {
    ...

    Post-Proxy-Type Fail-Authentication {
        update control {
            Response-Packet-Type := Access-Accept
        }
    }
    ...
}

We'll work on simplifying that for 3.2, also.
Post by Peter Lambrechtsen
Thanks for all your help on this. The fail-over with the second server
being the virtual one seems to work well; it just means I am restricted to
a single server and can't use load-balance. But this config would be my ideal:
home_server_pool ProxyDestPool {
    type = load-balance
    home_server = ProxyDest1
    home_server = ProxyDest2
    home_server = ProxyDest3
    fallback = cacheuser
}
That works for me. When all home servers in a "load-balance" pool are down, it uses the fallback virtual server:

(0) } # authorize = updated
Home server pool example.net failing over to fallback example.net
Proxying to virtual server example.net

Alan DeKok.


Peter Lambrechtsen
2016-03-08 09:24:08 UTC
Post by Alan DeKok
Post by Peter Lambrechtsen
Yes, I'd figured that out. I'm now pinging all our downstream RADIUS
clients to see which respond with something sane when sent a Status-Server
request, and then turning on status checks for them.
Or just send Access-Request with a fake username "thisismejusttesting".
They'll respond with an Access-Reject, which is good enough to determine
that they're alive.
Post by Peter Lambrechtsen
What could I do in Post-Proxy-Type?
Anything you can do anywhere else.
The fallback virtual server is just there for ease of use. But... it
complicates the proxy handling, as you've seen. A simpler approach is to
put all of the "unlang" handling into... unlang. And not into the proxy
code.
Post by Peter Lambrechtsen
As I can't call the virtual server,
We'll fix that for 3.2.
Post by Peter Lambrechtsen
and
Proxy-To-Realm doesn't proxy to a new destination nor does setting the
control to accept.
The home server pools should take care of fail-over to another home
server. But yes, once the whole pool has failed... you can't send the
packet to a different destination. That's what home server pools are for...
Post by Peter Lambrechtsen
There doesn't seem to be a way to turn a Reject from a
failed proxy request back into an Accept.
A failed proxy request is not really a reject... it's just a failed
request. And you can force it to be an Access-Accept via Post-Proxy-Type
Fail:
post-proxy {
    ...
    Post-Proxy-Type Fail-Authentication {
        update control {
            Response-Packet-Type := Access-Accept
        }
    }
    ...
}
This doesn't seem to work in 3.0.x head; I will test it on 3.1.x tomorrow.

(0) ERROR: Failing proxied request for user "peter", due to lack of any
response from home server 192.168.1.113 port 1812
(0) Clearing existing &reply: attributes
(0) Found Post-Proxy-Type Fail-Authentication
(0) # Executing group from file ./sites-enabled/default
(0) Post-Proxy-Type Fail-Authentication {
(0) update control {
(0) Response-Packet-Type := Access-Accept
(0) } # update control = noop
(0) policy accept {
(0) update control {
(0) &Response-Packet-Type = Access-Accept
(0) } # update control = noop
(0) [handled] = handled
(0) } # policy accept = handled
(0) } # Post-Proxy-Type Fail-Authentication = handled
(0) There was no response configured: rejecting request   <- How do I configure a response?
(0) Using Post-Auth-Type Reject
(0) # Executing group from file ./sites-enabled/default
(0) Post-Auth-Type REJECT {


Post by Alan DeKok
We'll work on simplifying that for 3.2, also.
Post by Peter Lambrechtsen
Thanks for all your help on this. The fail-over with the second server
being the virtual one seems to work well; it just means I am restricted to
a single server and can't use load-balance. But this config would be my ideal:
home_server_pool ProxyDestPool {
    type = load-balance
    home_server = ProxyDest1
    home_server = ProxyDest2
    home_server = ProxyDest3
    fallback = cacheuser
}
That works for me. When all home servers in a "load-balance" pool are
down, it uses the fallback virtual server:
(0) } # authorize = updated
Home server pool example.net failing over to fallback example.net
Proxying to virtual server example.net
I think this must only work in 3.1, as it doesn't work for me in 3.0.x
head from last week; I just tried this and fallback didn't seem to get
applied at all.

I'll test 3.1 head tomorrow morning.
Alan DeKok
2016-03-08 16:19:20 UTC
Post by Peter Lambrechtsen
This doesn't seem to work in 3.0.x head; I will test it on 3.1.x tomorrow.
I've pushed a fix.
Post by Peter Lambrechtsen
I think this must only work in 3.1, as it doesn't work for me in 3.0.x
head from last week; I just tried this and fallback didn't seem to get
applied at all.
v3.0.x head worked for me yesterday when I tried that.

The "fallback" code for home_server_pools is independent of the type of the home_server_pool.

You may be running into a timer issue... i.e. if the timers are short, the home_server is marked alive, and the fallback is never used.

I used "radmin" to forcibly set the home_server state to "dead". That avoids the timer issues, and the fallback works correctly.

Alan DeKok.


Peter Lambrechtsen
2016-03-08 21:54:29 UTC
Post by Alan DeKok
Post by Peter Lambrechtsen
This doesn't seem to work in 3.0.x head; I will test it on 3.1.x tomorrow.
I've pushed a fix.
That's fixed it... Brilliant :)
Post by Alan DeKok
Post by Peter Lambrechtsen
I think this must only work in 3.1, as it doesn't work for me in 3.0.x
head from last week; I just tried this and fallback didn't seem to get
applied at all.
v3.0.x head worked for me yesterday when I tried that.
The "fallback" code for home_server_pools is independent of the type of
the home_server_pool.
You may be running into a timer issue... i.e. if the timers are short,
the home_server is marked alive, and the fallback is never used.
I used "radmin" to forcibly set the home_server state to "dead". That
avoids the timer issues, and the fallback works correctly.
I think that was my issue: as I was using a second VM on the network as
the proxy destination, I was shutting down the destination server and not
waiting for the zombie period to expire.

(9) } # authorize = updated
Home server pool ProxyDestPool failing over to fallback cacheuser
(9) # Executing section pre-proxy from file ./sites-enabled/default

That seems to be my issue. I've just re-tested with 3.0.x head and had
zombie_period set too high. After I wound that number down to the same
value as check_interval, the server went zombie and then the fallback
occurred.

zombie_period = 10
check_interval = 10
num_answers_to_alive = 2

This way, once the server has been offline for 10 seconds it's marked
zombie and fallback occurs. Then the check_interval of 10 seconds with
num_answers_to_alive = 2 means that once the server is back up and has
responded to two status checks, requests go back to the remote proxy
server.

Granted, I won't have the values set this low in production, since this
will be a high-volume server with some critical services on it. I suspect
I will stick with 30 seconds or 1 minute for the check interval, but keep
the zombie value at 20 seconds. That way, if a RADIUS server dies or
becomes unresponsive, we don't wait around too long before marking it
zombie and authing everyone locally, and we have a reasonable backoff
before we attempt to start proxying again.
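
So the production shape would be roughly this (a sketch using the values
described above, untested):

home_server ProxyDest {
    ...
    zombie_period = 20
    status_check = status-server
    check_interval = 30
    num_answers_to_alive = 3
}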

Many thanks again.

Peter
Alan DeKok
2016-03-08 23:40:45 UTC
Post by Peter Lambrechtsen
I think that was my issue: as I was using a second VM on the network as
the proxy destination, I was shutting down the destination server and not
waiting for the zombie period to expire.
Yeah. It's documented, but it's not immediately obvious.
Post by Peter Lambrechtsen
That seems to be my issue. I've just re-tested with 3.0.x head and had
zombie_period set too high. After I wound that number down to the same
value as check_interval, the server went zombie and then the fallback
occurred.
Good.
Post by Peter Lambrechtsen
Granted, I won't have the values set this low in production, since this
will be a high-volume server with some critical services on it. I suspect
I will stick with 30 seconds or 1 minute for the check interval, but keep
the zombie value at 20 seconds.
The check interval can be set lower without any problem. It's only one RADIUS packet.
Post by Peter Lambrechtsen
That way, if a RADIUS server dies or becomes unresponsive, we don't wait
around too long before marking it zombie and authing everyone locally,
and we have a reasonable backoff before we attempt to start proxying
again.
Yes. It's OK to set zombie_period to a low value. In any normal deployment, you'll be proxying many packets a second to a home server. If it doesn't respond to *any* packets for 10 seconds, you're pretty sure it's dead.

1 to 2 seconds is probably too low, as there may be transient network issues that last that long.
Post by Peter Lambrechtsen
Many thanks again.
You're welcome.

Alan DeKok.

