
RE: Heartbeats Straw Poll (part 1)


My apologies for starting this thread and then not participating in the
discussion, but I've been having mail server problems all week and I've only
just fixed the problem now.

And since this thread has generated over 2^6 replies ;-), I'm going to
batch up all my responses into a couple of messages.

Part 1: outline of the problem
Part 2: alternate proposals
Part 3: conclusions


> Speaking as a free-software developer (technical lead of FreeS/WAN),
> our current analysis says that an IKE-level heartbeat is necessary and
> important.

I was going by what Hugh Daniel said at the meeting. You don't appear to
have an internal consensus on the issue.


> When the group was asked "how many people understand this proposal",
> I saw lots of people who I would have hoped would have raised their
> hands not doing so (and not voting in the next two questions,
> thankfully).

I was a bit surprised at this as well. Certainly, I did not try to explain
the proposal from scratch at the meeting; I only went over a few points of
interest. However, I thought more people had read and understood the drafts
and subsequent discussions on the list.

Andrew (me):

> To those who voted against the idea of a keep-alive protocol, how do
> you propose that we ensure high availability IPsec for our customers
> who demand it?


> As several people brought up in the meeting, "keepalives" under the
> wrong circumstances tend to turn into "make-deads".  IKE and IPSEC
> implementations should not delete SA's prior to their normal
> expiration merely because they haven't heard from the other end in a
> while.

There's certainly a case for and against keep-alives, which is why I phrased
the question that way. We have a specific problem to solve, which is
ensuring that IPsec hosts can recover from black holes in a secure and
timely manner. If there is a better solution which meets these requirements,
I'm open to standardizing that.


> There appear to be two different properties people are looking for
> from heartbeats/keepalives:
> First, rapid recovery from loss of state on one end of a security
> association (due to power loss/reboot/reset), so a new IKE SA can be
> initiated on one end or the other.  Once this happens, the half-dead
> state on one end can be garbage collected as a result of an
> affirmative indication (IKE INITIAL-CONTACT) that the other side lost
> state.
> Second, detection of loss of connectivity between two security
> gateways so that traffic can be rerouted through an alternate gateway.
> This is really a dynamic routing problem and could (and probably
> should) be done without prematurely tearing down IKE SA's and IPSEC
> SA's which may still exist and may still be useful once the
> connectivity comes back.

As far as I'm concerned, the real problem is correcting the black hole.
Other people mentioned other requirements, so we tried to accommodate those
as well. Why not... after all, as in any 12-step recovery program, the first
step is admitting you have a problem. Once you detect the problem, you can
take whatever other corrective action you want (stop accounting, return the
DHCP address to the pool, throw your arms up in despair, etc.).

If we can distinguish between the two cases you mentioned in a sensible
manner (e.g. by also sending clear pings to the public IP) then that would
be a good idea as well.
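
As a rough illustration of how the two cases might be told apart, here is a
minimal Python sketch. The names (Diagnosis, diagnose) are hypothetical and
not taken from any draft. It assumes each probe round yields two booleans:
whether an authenticated heartbeat reply arrived, and whether a clear ping
to the peer's public IP was answered. The clear ping is unauthenticated, so
its result should be treated as a hint only.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "healthy"
    PEER_LOST_STATE = "peer reachable but not answering under the SA"
    PATH_DOWN = "no connectivity to the peer"


def diagnose(protected_reply: bool, clear_ping_reply: bool) -> Diagnosis:
    """Classify one probe round (hypothetical helper, not from any draft)."""
    if protected_reply:
        # An authenticated reply proves the SA is still alive on both ends.
        return Diagnosis.HEALTHY
    if clear_ping_reply:
        # The host answers in the clear but not under the SA: this points
        # to lost state (reboot, lost delete), i.e. case 1.
        return Diagnosis.PEER_LOST_STATE
    # Nothing comes back at all: most likely a connectivity loss (case 2),
    # so keeping the SAs and rerouting may be the better reaction.
    return Diagnosis.PATH_DOWN
```

Only the first outcome rests on authenticated data, so anything drastic
(such as deleting SAs) should probably not hinge on the clear ping alone.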

BTW, I think it's also important to include the problem of loss of
synchronization due to lost deletes. Note that this problem is compounded
by case 2 above: if deletes get sent during the temporary connection loss,
then there is the problem of resynchronizing the phase 1 & 2 SADBs when
the connection comes back up.


> Sounds like we might want a *short*, concise statement
> of the problem to the list before the straw poll is taken next. Maybe
> start with a neutral description of the problem, followed by two
> paragraphs in favor and two opposed.

I am proposing that we need to solve the problem of state desynchronization
(due primarily to reboots, but also due to lost deletes) in a manner that
can't be exploited by an adversary.

Feature requirements:
- The detection should occur within a predictable (and tweakable) time.
- We don't want to defeat idle timers, which would keep SAs up (and
rekeying) indefinitely.

The vulnerability to exploitation should be limited as follows:

An unprivileged adversary:
- should not be able to spoof that the connection is still up.
- should not be able to take the connection down.
- should not be able to force a greater than O(n) response to spoofed
- and the coefficient of the O(n) response should be reasonable.

An intermediate router:
- should not be able to spoof that the connection is still up.
- should be able to take the connection down (because you can't prevent
- but the effects of this attack should be rate-limited.
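
To make a few of these requirements concrete, here is a minimal Python
sketch. Everything in it is an assumption for illustration: the class
names, the parameters, and the choice of a token bucket are mine and do not
come from any of the drafts. PeerMonitor keeps the liveness clock separate
from the idle clock, so heartbeats never refresh the idle timer, and the
detection time is bounded by two tunable values. ReplyLimiter caps how many
replies are sent to unauthenticated packets, which keeps the response to n
spoofed packets at no more than n replies, with a configurable ceiling on
the rate.

```python
class PeerMonitor:
    """Tracks liveness and idleness of one peer on separate clocks."""

    def __init__(self, interval: float, max_missed: int,
                 idle_timeout: float, now: float = 0.0):
        self.interval = interval          # seconds between heartbeats
        self.max_missed = max_missed      # missed beats tolerated
        self.idle_timeout = idle_timeout  # seconds without user traffic
        self.last_heard = now             # any authenticated packet
        self.last_traffic = now           # user traffic only

    def on_heartbeat(self, now: float) -> None:
        # A heartbeat proves liveness but must not refresh the idle timer.
        self.last_heard = now

    def on_user_traffic(self, now: float) -> None:
        self.last_heard = now
        self.last_traffic = now

    def detection_bound(self) -> float:
        # Predictable and tweakable: silence longer than this means dead.
        return self.interval * self.max_missed

    def peer_dead(self, now: float) -> bool:
        return now - self.last_heard > self.detection_bound()

    def idle_expired(self, now: float) -> bool:
        return now - self.last_traffic >= self.idle_timeout


class ReplyLimiter:
    """Token bucket limiting replies to unauthenticated packets."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum stored tokens
        self.tokens = float(burst)
        self.stamp = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.stamp
        self.tokens = min(float(self.burst),
                          self.tokens + elapsed * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With interval=10 and max_missed=3, for example, a dead peer is noticed
after just over 30 seconds of silence, while an SA that carries only
heartbeats still idles out on schedule.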

Beauty without truth is insubstantial.
Truth without beauty is unbearable.