
a lengthy analysis of counter mode and ESP



We have been discussing how to make use of AES in counter mode with ESP.  Several folks have told me that the discussion is not generally understandable, because of the many subtle issues involved and because when the discussion moved to the general IPsec list, from the design team list, we failed to do an adequate job of establishing context. So, this very lengthy message is an attempt to provide background and a more thorough discussion of the various technical issues associated with this complex topic.  Hopefully, the WG as a whole will better understand the issues and be better positioned to make choices as a result of this analysis.

Steve
----------



What is Counter Mode?

To begin, we need to understand what counter mode (CM) is, at least relative to the process of encrypting packets. CM is not new (I believe it dates from about the mid 80's), although it is newer than the other modes commonly defined and used for block ciphers (which go back to the late 70's and very early 80's). To the best of my knowledge, CM was first used in a security protocol standard by the ATM Forum, in the early 90's, when they adopted it for use with DES.

Note that CM can be used with any block cipher. It is generic, not unique to AES, although the size of the "counter" is a function of the block size and thus is somewhat a function of the specific block cipher employed. In CM a sequence of values is generated based on an algorithm (the choice of algorithm is one focus of our discussion), and then encrypted with a key using the basic block (sometimes referred to as electronic code book, or ECB) mode of the cipher. The output from this encryption process usually is referred to as the "key stream." The underlying block cipher is being used to generate a pseudorandom key stream, which is XORed with the plaintext to produce ciphertext. Decryption involves encrypting the same sequence of inputs to produce the same key stream, which is then XORed with the received ciphertext, to recover the transmitted plaintext.
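The black-box behavior just described can be sketched in a few lines of Python. Here the `prf` function is a toy keyed function standing in for single-block AES-ECB encryption (a truncated hash, purely for illustration and NOT real AES); everything else is the mode itself:

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def prf(key: bytes, block_input: bytes) -> bytes:
    # Toy keyed function standing in for AES-ECB encryption of one block;
    # any pseudorandom keyed function illustrates the mode (NOT real AES).
    return hashlib.sha256(key + block_input).digest()[:BLOCK]

def ctr_keystream(key: bytes, nblocks: int) -> bytes:
    # The key stream is just the encryption of a sequence of counter values.
    return b"".join(prf(key, i.to_bytes(BLOCK, "big")) for i in range(nblocks))

def ctr_xor(key: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with key stream.
    nblocks = (len(data) + BLOCK - 1) // BLOCK
    return bytes(d ^ k for d, k in zip(data, ctr_keystream(key, nblocks)))

key = b"0123456789abcdef"
ciphertext = ctr_xor(key, b"attack at dawn")
assert ctr_xor(key, ciphertext) == b"attack at dawn"  # decryption = re-encryption
```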

By the way, CM is not the only block cipher mode that operates this way. Output feedback mode (OFB) does the same thing, from a black box perspective, i.e., it acts as a pseudorandom number generator (PRNG) to produce key stream for XORing with data, to effect encryption and decryption. In both modes, the keystream is completely independent of the data being encrypted, an important feature that can significantly improve performance in some contexts. The primary reason for considering CM vs. OFB, from an engineering perspective, is the additional performance enhancing opportunities provided by CM, as described later.

An essential requirement for CM (and OFB) is that the key stream generated using a specific key not be reused. Stripping off the key stream is easy for a wiretapper (just XORing), if he knows that reuse has occurred and has access to ciphertext that was encrypted under the reused keystream. (It's especially easy if the attacker knows which pairs of ciphertext outputs reflect the reuse.)

Remember that the keystream is just the result of encrypting a sequence of inputs provided to CM, one block at a time. Block ciphers are 1-to-1 and onto mappings. Thus, so long as the sequence of inputs is unique, under a selected key, the sequence of outputs will be unique as well. In the IPsec context, this translates into a requirement to generate unique input value sequences for each SA, since each SA uses a unique key. (As Scott pointed out, it is important to note in the CM spec that the key management procedure must ensure uniqueness for each SA, including the common case of the two SAs that make up a full-duplex connection, since the adverse effects are so severe under CM.)


Using CM with Packets (and cells)

For use in a packet-oriented environment like IP, each input to the block cipher algorithm can be divided into two parts: a per-packet part and an intra-packet part. (It may be convenient, though not necessary, to think of the per-packet part as a set of high order bits for the input, and the intra-packet part as the low order bits.) The per-packet part of the input needs to be unique for each packet, much the same way as we use an IV for CBC or CFB mode. The intra-packet part of the input is used to generate keystream for the packet, a block at a time, for as many blocks as are needed to encrypt the packet. So long as the per-packet part of the input is unique for each packet, the intra-packet input can repeat, from one packet to the next (but not within a packet). This is OK because keystream uniqueness is guaranteed so long as the whole input to the block cipher is unique; even if two inputs differ by only one bit, that is enough to guarantee that the outputs (keystream in this case) will be distinct.

For example, the intra-packet part of the input could start at 0 and count up by 1 for each block in a packet, until enough keystream is generated to encrypt the packet. This means that the intra-packet part of the input would repeat for each packet, but so long as the per-packet input was unique, the keystream generated for each packet will be unique and thus not pose a security problem.
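The two-part input structure can be sketched as follows (again with a toy keyed function standing in for AES-ECB; the particular 64-bit/32-bit field layout here is illustrative, not a proposed format):

```python
import hashlib

BLOCK = 16

def prf(key: bytes, cm_input: bytes) -> bytes:
    # Toy stand-in for AES-ECB encryption of one block (illustration only).
    return hashlib.sha256(key + cm_input).digest()[:BLOCK]

def packet_keystream(key: bytes, per_packet: int, nbytes: int) -> bytes:
    # Block-cipher input = a 64-bit per-packet part in the high-order bytes
    # and a 32-bit intra-packet counter in the low-order bytes (the middle
    # bits are left zero).  The counter restarts at 0 for every packet.
    ks = b""
    block = 0
    while len(ks) < nbytes:
        cm_input = per_packet.to_bytes(8, "big") + bytes(4) + block.to_bytes(4, "big")
        ks += prf(key, cm_input)
        block += 1
    return ks[:nbytes]

key = b"0123456789abcdef"
# The intra-packet counter values repeat from packet to packet, but distinct
# per-packet values keep the whole input -- and hence the keystream -- unique.
assert packet_keystream(key, 1, 48) != packet_keystream(key, 2, 48)
```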

For a receiver to decrypt a received packet, it must generate the same keystream as the one that the transmitter used to encrypt the packet. Thus the receiver must know the key being used, and it must be able to generate the same sequence of inputs for the block cipher. There are many ways to make this happen, depending on the context.

When the ATM Forum selected CM for encrypting virtual circuits, it adopted a design that assumes that, most of the time, all cells on a VC are delivered in order, without loss. This means that if the transmitter and receiver know the starting value for the input to CM, and know the algorithm used to generate the sequence of inputs (and of course, both use the same key), each can generate the same input sequence to CM, and thus generate the same keystream, without transmitting any additional data. Moreover, in the ATM context, each cell is a fixed size (48 bytes plus header), so the amount of keystream needed to encrypt each cell is constant. This makes it especially easy for the transmitter and receiver to generate keystream for each cell in a synchronous fashion. ATM cells have no place to carry any data for synchronizing the receiver's crypto to the transmitter, but if one assumes perfect, ordered delivery, this is not a problem for CM.

One might assume from the ATM example that CM was chosen because it requires no IV to be carried with each cell. But there are many other ways to encrypt cells or packets without carrying an IV, so long as the communication path generally delivers traffic in order and without loss. For example, some X.25 encryption devices developed in the 80's used CBC mode, and did not carry a per-packet IV. Each end transmitted an IV to the other when a call was established and ran CBC mode continuously on the data portion of each packet in the call. Thus CM is not uniquely attractive because of an ability to operate without transmission of explicit, per-packet crypto synch data. If packets can arrive out of order, then generally there is a need to carry data for per-packet crypto synch, or to have some other way for a receiver to know what keystream to create (for a mode like CM).

At this point it is worth noting that the CM design adopted by the ATM Forum is not primarily based on a counter, per se. Rather it uses a linear feedback shift register (LFSR) to generate sequences of up to 2^21 inputs to the block cipher (enough for over 2^19 cells when using DES or 3DES). It also uses a counter as part of the input, to be able to "jump" forward in the input space, to facilitate resynchronization. I view this latter feature as a sort of "fix" for the fact that ATM VCs are not perfect after all. (Later we'll discuss why input generation using other than simple counters may be advantageous in some contexts.)

The ATM use of CM is illustrative in that it represents a design for a specific network context. In the case of IP, we have a different context and thus a different set of requirements that influence the design. Specifically, in the IP context, we cannot assume ordered delivery of packets, nor can we assume no packet loss will occur. This implies that each packet must carry data to allow the receiver to synchronize the CM input, to generate the correct keystream for decrypting received packets.

If we assume that the input consists of two parts, a per-packet part and an intra-packet part, it is obvious that we do not need to transmit the intra-packet part with each packet, so long as there is agreement on the algorithm used to generate the sequence of intra-packet values. In principle, one could negotiate the intra-packet sequence generation algorithm for each SA, and performance advantages could accrue from adopting that approach. For IPsec, I suggest that we not go all out for increased performance at the cost of increased complexity. If one chose to allow negotiation of per-SA, intra-packet sequence generation algorithms, the resulting complexity would be imposed on all implementations, because of the need to support a well-defined set of these algorithms at both transmitter and receiver. Thus I suggest that we stick with the simpler approach of defining exactly one method for generating the intra-packet input sequences, to ensure interoperability with minimal complexity.

Given the above assumption about a pre-defined, standard algorithm for generating sequences for the intra-packet part of the CM input, we need to carry only the per-packet part of the CM input in each packet, much as we carry an IV with each packet in CBC mode. Note that, since we must carry this value in each packet, the transmitter and receiver need not agree on how the value is generated. All we require is that the algorithm used by the transmitter generate a unique sequence of values under a single key, and that this sequence supports the maximum number of packets that we can send on an SA (2^32 or 2^64, depending on whether the ESN option is employed). The receiver need not be aware of the algorithm used by the transmitter; it is a local matter for the transmitter to generate the per-packet values and to ensure that they are unique.

If the transmitter makes an error and duplicates a per-packet CM input value, the damage is done as soon as the packet with the duplicate counter input is sent. Even if a receiver could detect that a duplicate per-packet value had been used, e.g., because the receiver knew what algorithm the transmitter used to generate these values, it is not clear what the receiver would do. IPsec has adopted a design philosophy of not having receivers respond with error messages to what appear to be problems in received packets. (This is part of the design philosophy to avoid creating DoS opportunities.) Also, there are many ways in which such duplicates might escape detection by a receiver. For example, a transmitter might send each packet twice, once encrypted using the previous per-packet counter input, and once with a new per-packet input. This would result in duplicate keystream for every packet, a passive wiretapper's dream come true! Every packet would arrive and be decrypted properly by the receiver. Depending on how the per-packet counter input was managed, this might go completely unnoticed by the IPsec receiver, or the receiver might see the replicated packets as duplicates, and just discard them as part of the usual anti-replay mechanism. Failure of this sort might well go undetected for some time.
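The severity of such a duplication is easy to demonstrate. In the sketch below (with a toy keyed function in place of AES), a wiretapper who captures two packets encrypted under the same key and the same per-packet value recovers the XOR of the two plaintexts without ever touching the key:

```python
import hashlib

BLOCK = 16

def prf(key: bytes, cm_input: bytes) -> bytes:
    # Toy stand-in for AES-ECB encryption of one block (illustration only).
    return hashlib.sha256(key + cm_input).digest()[:BLOCK]

def keystream(key: bytes, per_packet: int, nbytes: int) -> bytes:
    blocks = (nbytes + BLOCK - 1) // BLOCK
    ks = b"".join(prf(key, per_packet.to_bytes(8, "big") + i.to_bytes(8, "big"))
                  for i in range(blocks))
    return ks[:nbytes]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"0123456789abcdef"
p1 = b"the first secret"
p2 = b"another  message"
# Transmitter error: the same per-packet value (7) is used for two packets,
# so both packets are XORed with identical keystream.
c1 = xor(p1, keystream(key, 7, len(p1)))
c2 = xor(p2, keystream(key, 7, len(p2)))
# The wiretapper XORs the two ciphertexts: the keystream cancels out,
# exposing the XOR of the plaintexts, with no knowledge of the key.
assert xor(c1, c2) == xor(p1, p2)
```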

Note that, unlike the IV's used with CBC or CFB modes, there is not a requirement that the per-packet value used for CM be pseudorandom. An obvious candidate for such a value is a counter that increments by 1 for each newly transmitted packet. Since the preceding analysis indicated that generation of the per-packet values can be a purely local matter for the transmitter, there is not a requirement to mandate a standard mechanism for generating these values, although the standard could suggest some good ways of doing this. Also, if the implementation undergoes security evaluation, as described later, that evaluation will review the means used by the transmitter to ensure CM input uniqueness, irrespective of anything we state in an IETF standard.


CM and Performance

Performance considerations are the primary motivations for using a counter mode with a block cipher such as AES. The primary reason for the performance advantage of CM relative to other encryption modes derives from the fact that it has no feedback loop, e.g., in contrast to CFB, CBC or even OFB modes. In a mode that has a feedback loop, one must complete encryption of a block of data before using the result as an input to encrypting the next block of data. Even when a crypto mode does not directly encrypt data, e.g., OFB mode, the presence of a feedback loop in the mode requires that a block of data be processed all the way through the algorithm before the next block can be processed.

Modern block ciphers involve multiple "rounds" of computation. In DES there were 16 rounds; in 3DES, 48 rounds; and for AES (used with 128-bit keys) there are 10 rounds. As one tries to maximize performance in hardware or software, it is common to "unwind" the loop implied by the round structure. In hardware, this creates an opportunity to pipeline the encryption process. For example, in AES, after a 16-byte block of data has been input to a chip for processing, one round later another 16-byte block could be input, if there is no dependence on the encryption of the previous input. Thus, even though the first block is still being processed and will not be output until 9 more rounds have been completed, one can begin processing the next input block, if the mode permits.
Thus an encryption mode that allows easy use of pipelining for encryption (and decryption) permits hardware implementations to speed up processing by a factor that reflects the number of rounds used by the basic block cipher, e.g., by a factor of 10 for AES.
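A back-of-the-envelope comparison makes the pipelining gain concrete. Counting time in "round-times," a feedback mode must process the blocks of one packet serially, while a fully pipelined CM implementation can accept a new block every round (the figures below assume AES-128's 10 rounds and a roughly 1500-byte packet, i.e., 94 blocks):

```python
def serial_round_times(blocks: int, rounds: int) -> int:
    # CBC/CFB within one packet: each block must complete every round
    # before the next block (which depends on its output) can begin.
    return blocks * rounds

def pipelined_round_times(blocks: int, rounds: int) -> int:
    # CM: a new block enters the pipeline each round; the final block
    # emerges `rounds` round-times after it enters.
    return rounds + (blocks - 1)

blocks, rounds = 94, 10   # 1500-byte packet in 16-byte blocks; AES-128
print(serial_round_times(blocks, rounds))     # 940
print(pipelined_round_times(blocks, rounds))  # 103
```

For a full pipeline the advantage approaches the round count (here, roughly a factor of 10), as stated above.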

A feedback mode like CBC or CFB does not permit one to easily take advantage of pipelining, at least not for encrypting a single packet. In a straightforward implementation, one would have to wait until all the rounds were completed for encrypting the first block of input, before starting to encrypt the second block of data from the same packet, because the encrypted output from the first block is combined with the next input, to counter some forms of attacks.

There are ways to try to overcome this limitation, e.g., by viewing a packet not as a single series of blocks of data, but rather viewing it as several series of blocks, interleaved in a fashion to account for the number of rounds in the algorithm.
Techniques of this sort make use of multiple algorithm "cores," each a complete implementation of the block cipher, and thus make use of parallelism as well as pipelining. Modern, high speed crypto chips often contain several algorithm cores, so this is not an unrealistic notion. For example, in the ATM case, since each cell contains a 48-byte payload, one could view the payload as composed of two interleaved sequences, each consisting of three 8-byte blocks. One could use two algorithm implementations in parallel, to process each sequence, thus doubling the effective throughput. Or, the same payload could be viewed as 4 interleaved sequences of 2 blocks each, for even greater parallelism and improved throughput. However, this combination of pipelining and parallelism can get very complicated in a hurry.

More importantly from a protocol standards and interoperability perspective, a receiver needs to know how the transmitter chose to "slice" the packet, in order to perform the decryption correctly. That is because the way the interleaving is performed must be duplicated by the receiver, in order to effect decryption. Interleaving is not a transparent way to increase performance; a receiver must know exactly what interleaving strategy was used by every transmitter from which it receives packets. This makes it less desirable in an open system environment, where we strive to remove barriers to interoperability. As a result, complicated ways of trying to speed up encryption through interleaving of data while using feedback modes are not usually employed in standard protocols. (IPsec is striving to reduce the complexity mandated for all implementations, as evidenced in the new version of IKE, and the complexity of interleaving would be counter to this effort.) Thus it is common for a chip with multiple cores to be used to encrypt multiple packets in parallel, rather than trying to encrypt one packet in parallel.

Note that if multiple packets from the same SA are encrypted in parallel, e.g., using multiple algorithm cores within a crypto chip, this internal parallelism can cause system-level problems.
Consider the case where several short packets arrive for transmission on an SA after encryption has begun on a large packet. Encrypting multiple packets for the same SA in parallel could cause the short packets to be encrypted, transmitted, received and decrypted before the large packet was received at the destination IPsec implementation. IPsec can tolerate this sort of self-imposed reordering for small numbers of packets, nominally 64 per SA under the current recommendations. One can imagine cases where an IPsec receiver might start discarding large packets that arrive too late on an SA, because of using packet parallelism on a per-SA basis to improve performance. Also, TCP performance suffers when packets arrive out of order, so there is the potential that an attempt to improve IPsec performance, through internal parallelism on a per-SA basis, could ultimately degrade performance! Thus developers must exercise care in designing high speed IPsec devices, and they must understand how crypto chips work to achieve high speeds.

In CM, taking advantage of pipelining and parallelism for an individual packet is much easier because of the lack of feedback. One can load the per-packet value into a core, and begin generating the keystream for a series of blocks. After the first round has completed, a new input can be loaded into the same core, to begin generating another part of the keystream. In the simplest case, one could pipeline 10 blocks worth of keystream generation for a single packet, using a single core, taking advantage of the 10-round depth of AES. The core could generate the sequence of intra-packet inputs, using whatever algorithm is defined for that purpose.

If we are to make maximum use of a core, at some point it will be necessary to load a new input into a core for a different packet, while the core is still generating keystream for another packet, maybe even for a packet for another SA. Otherwise the core will be idle while we wait for up to 9 rounds (for AES) for the keystream for the previous packet to exit. This suggests that a crypto chip has to maintain some state for each core it contains, to remember which keystream outputs go with which packets on which SAs, in order to be able to get the most out of each core. So long as that state is completely local to the chip, this is feasible.

Since crypto chips often contain multiple cores, it also is possible to improve throughput for a single packet by using parallelism within the chip. One might begin generating keystream in two or more cores, in parallel, for the same packet. This is not too hard for CM, unlike CBC or CFB mode, because we know what inputs to load into each core to generate keystream for different blocks within a packet. Also, the way we choose to do this at a transmitter will be invisible to a receiver, so long as we follow the same algorithm for generating the intra-packet input sequences. Thus transmitters and receivers can employ different numbers of cores, even different numbers of chips, to generate keystream in a deterministic fashion and still have the process work properly, with no additional coordination. This is an important feature: any approach we adopt for using CM in ESP should allow a transmitter maximum flexibility in optimizing performance, while ensuring that receivers need not be aware of the optimizations employed by the transmitter.

For example, if a packet to be encrypted is 64 bytes long (more properly, 49-64 bytes in length), one could assign two crypto cores to the task of generating two blocks of keystream each, or assign four cores the task of generating one block each. In the former case the whole packet can be encrypted in the time it takes for two rounds of AES, in the latter case, in the time for one round of AES. All that is required to make this work is to load the same per-packet value into each core, and then to direct each core to use the appropriate offset into the packet for generating the right portion of the keystream. This use of both pipelining and parallelism makes CM very attractive as we try to create IPsec implementations that can operate at higher and higher speeds. At the same time that we try to preserve maximum flexibility for hardware implementations re performance, we don't want to adopt a scheme that will impose burdens on software implementations that may be peers. I believe this should be an important principle in our decisions.
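The transparency claim, that a receiver need not know how the transmitter split the work among cores, can be checked directly. In this sketch (toy keyed function in place of AES-ECB), block i of the keystream depends only on the key, the per-packet value, and i, so any assignment of blocks to "cores" yields the same keystream:

```python
import hashlib

BLOCK = 16

def prf(key: bytes, cm_input: bytes) -> bytes:
    # Toy stand-in for AES-ECB encryption of one block (illustration only).
    return hashlib.sha256(key + cm_input).digest()[:BLOCK]

def keystream(key: bytes, per_packet: int, nblocks: int, ncores: int = 1) -> bytes:
    # Every "core" is loaded with the same per-packet value and a different
    # starting offset; core c generates blocks c, c + ncores, c + 2*ncores, ...
    out = [b""] * nblocks
    for core in range(ncores):
        for i in range(core, nblocks, ncores):
            out[i] = prf(key, per_packet.to_bytes(8, "big") + i.to_bytes(8, "big"))
    return b"".join(out)

key = b"0123456789abcdef"
one_core = keystream(key, 42, 4, ncores=1)
# The keystream is a pure function of (key, per-packet value, block index),
# so the receiver never needs to know how the transmitter divided the work.
assert keystream(key, 42, 4, ncores=2) == one_core
assert keystream(key, 42, 4, ncores=4) == one_core
```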

Finally, note that this sort of pipelining does not reduce the delay associated with multi-round encryption. It will still take 10 rounds to generate the keystream for a packet using AES with 128-bit keys. Parallelism helps with delay, and the ability to pre-compute keystream also can reduce delay. However, pre-computation of keystream in IPsec may be feasible only in some contexts, e.g., when the number of SAs is small and thus the burden of storing pre-computed keystream is not too great.


CM and Overhead

With this background, we can examine the issue that has generated considerable debate recently, namely how to manage the per-packet inputs for CM. There is agreement that the intra-packet input should just be a simple counter. To accommodate the largest possible jumbogram that IPv6 can represent, a 28-bit value will suffice, and we can represent it in a 32-bit field, as that is convenient for both hardware and software implementations.

The disagreement has centered around the form that the per-packet input to CM should take. As noted earlier, this value need not be pseudorandom, unlike the recommendation for CBC and CFB modes. Some concerns have been expressed about using a simple counter here, because this means that successive inputs to CM will have a small Hamming distance. Conceivably, this might facilitate some forms of differential cryptanalysis, but AES is generally believed to be sufficiently strong that this concern has been dismissed. So, there is agreement that a per-packet counter is sufficient.

If one adopts a very conservative model of encryption security, then at most 2^64 blocks should be encrypted under a single key, which says that no more than 64 bits are needed to identify each packet in an SA. This is exactly the same size as the ESP extended sequence number, so that value is an obvious candidate for use as the per-packet CM input. It has the advantage that it is already carried in the ESP header for each packet, and thus its use does not add any overhead to the ESP packet structure for crypto synch purposes. The extent of the savings is subject to interpretation.

CM has the property that one need not pad a packet to the block size in order to encrypt/decrypt it. Rather, one merely XORs as many bits as needed with the last block of a packet to encrypt/decrypt it. In the IPsec context we require that the ciphertext portion of a packet end on a 4-byte boundary, so this becomes the determining factor for padding in the context of CM encryption. So, already CM has an advantage over CBC, which requires padding to a block boundary. Today, using DES or 3DES, this CBC padding adds 4 bytes to every packet, on average. CBC mode for AES would add 8 bytes of overhead. If we use the ESP sequence number as the per-packet input to CM, we can gain another 8-byte savings relative to DES/3DES CBC.

There are several ways of looking at the overhead figures. Today we tolerate an average of 12 bytes of per-packet crypto overhead, using DES or 3DES CBC, an 8-byte IV plus the average 4-byte padding overhead.  Moving to CM will shave this overhead by an average of 4 bytes (by removing the padding), a 1/3 savings, even if we carry an explicit 8-byte per-packet CM input. This analysis is accurate, but simplistic, as it looks only at the overhead due to the crypto mode. It ignores the overhead imposed by ESP.

The ESP header is 8-bytes long (4-byte SPI and 4-byte explicit sequence number). Because of the 4-byte alignment requirement for the end of the ciphertext, and because the last two bytes of the ciphertext are part of the ESP trailer, an average of 4 bytes of overhead is added there as well. Then the ICV is attached, in the common (and recommended) case where integrity is employed. (Because CM provides NO protection of any form with regard to integrity, an ICV is critical when this mode is used.) The default ICV algorithms add an additional 12 bytes to the end of the packet. So, in aggregate, ESP typically adds 24 bytes to a packet, independent of the use of an explicit IV. In that context, an explicit 8-byte IV represents 1/4 of the total overhead added by ESP (24 bytes of pure ESP overhead, plus an 8-byte IV = 32 bytes.)

If there is no compression of an (outer) IPv4 header, and if there are no options, which is typical, that header is 20 bytes long. A TCP header, inside the ESP payload, is 20 bytes (again, assuming no options). A UDP header is 8 bytes. (IPsec does not accommodate compression of transport layer headers; they are used for SA selection at the transmitter and for SA checking at the receiver, so any compression of these headers would become an integral part of security critical IPsec processing, not a generally desirable outcome.) The protocol overhead for a packet is 20+20 (or 20+8 for UDP) = 40 (or 28 for UDP) bytes, independent of any application data and prior to applying ESP. Other than VoIP, for which we have heard varying estimates of application data size, the typical minimum size packet is a TCP ACK, which is 40 bytes. The ESP overhead, at 24 bytes, makes for a minimum encrypted packet size of 64 bytes (sans IV). In that common context, the 8 added bytes for an explicit IV represent a 12.5% increase in packet size. Typical packet sizes tend to be much larger, with an attendant decrease in the percentage overhead attributable to ESP and to an explicit IV.
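The arithmetic for the minimum-packet case can be tabulated in a few lines, working directly from the per-header sizes given above (20-byte IPv4 header, 20-byte TCP header, 24 bytes of aggregate ESP overhead, 8-byte explicit IV):

```python
ESP_HEADER = 8      # 4-byte SPI + 4-byte sequence number
ESP_PAD_AVG = 4     # average padding to the 4-byte ciphertext alignment
ICV = 12            # default integrity check value
ESP_OVERHEAD = ESP_HEADER + ESP_PAD_AVG + ICV    # 24 bytes, sans IV

IP_HDR, TCP_HDR = 20, 20
min_packet = IP_HDR + TCP_HDR                    # a bare TCP ACK
encrypted_min = min_packet + ESP_OVERHEAD        # minimum ESP packet, no IV
EXPLICIT_IV = 8

print(ESP_OVERHEAD)                                            # aggregate ESP overhead
print(round(100 * EXPLICIT_IV / encrypted_min, 1))             # transport-mode IV %
print(round(100 * EXPLICIT_IV / (encrypted_min + IP_HDR), 1))  # tunnel-mode IV %
```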

In practice it seems that most IPsec traffic is carried in tunnel mode, which adds another 20-byte inner IP header, which would bring the IV overhead down to under 10%. (The use of tunnel mode is required for the common case of remote user access, and for inter-site VPNs. It also seems to be used as a default in many implementations, where transport mode would suffice.) I leave it to others to provide analogous estimates for the VoIP case, with appropriate assumptions about what compression is applied where, and under what circumstances, and whether this requires use of any compression algorithms that must be licensed. But, it seems clear that in a very large number of common IPsec uses, the overhead associated with carriage of an explicit IV is fairly small.

None of this analysis is intended to suggest that we should be cavalier about the overhead associated with IPsec. When I developed the extended sequence number (ESN) option for AH and ESP, I chose to not carry an extra 4 bytes of sequence number for several reasons. There is an obvious desire to maintain backward compatibility for the on-the-wire format, and to minimize overhead. Perhaps the most important reason for making the added 4 bytes for the ESN be implicit rather than explicit is to preserve the 8-byte alignment for the beginning of the payload, consistent with IPv6 requirements. But the overhead imposed by these 4 bytes would not, by itself, justify the added complexity that results from making them implicit. In the same vein, I suggest that we not introduce undue complexity to IPsec in an effort to save 8 more bytes per packet, since it represents a fairly small overhead in the vast majority of circumstances.


CM & System Security

When we consider the use of an explicit vs. an implicit IV for CM, we also need to examine system security issues. Some of these issues are rather subtle and thus require a detailed explanation, much like the description of CM and how it can be used to achieve higher performance.

Management of the ESP sequence number is well defined by RFCs 2402 and 2406, and management of the ESN is defined in 2402/2406bis. A separate counter must be maintained for each SA; recall that an SA is a unidirectional connection. The sequence number begins at 0, the first transmitted packet carries a value of 1, and the value is not allowed to cycle for an SA. Thus a new SA must be created prior to transmitting a packet with the sequence number value of 2^32 (or with the implied value of 2^64 in the case of the ESN option.)
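The transmit-side rules just described amount to a small piece of per-SA state. A minimal sketch (the class name and API here are hypothetical, purely illustrative, not anything from the RFCs):

```python
class EspSequence:
    # Sketch of per-SA transmit sequence number management; the rules
    # encoded here are those described above (hypothetical API).
    def __init__(self, extended: bool = False):
        self.counter = 0                        # the counter begins at 0 ...
        self.limit = 2**64 if extended else 2**32

    def next(self) -> int:
        if self.counter + 1 >= self.limit:      # the value must not cycle:
            raise OverflowError("sequence space exhausted: establish a new SA")
        self.counter += 1                       # ... first packet carries 1
        return self.counter

tx = EspSequence()
assert tx.next() == 1   # first transmitted packet carries the value 1
assert tx.next() == 2
```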

The purpose of the sequence number is to allow a receiver to detect and reject replayed packets, e.g., as a countermeasure to DoS attacks. As currently defined and used, e.g., with CBC mode, it has no implications for the security of the cryptography employed with IPsec. Thus a failure by a transmitter to properly manage the sequence number will not result in a violation of the confidentiality, integrity, or authentication guarantees associated with use of ESP. In this sense, the anti-replay feature of ESP is a secondary security function.

As a result of this well-defined and limited security use for the sequence number, its management is outside of the scope of the FIPS 140-1/2 crypto module evaluation criteria. These criteria focus on security functionality and assurance directly related to the management of crypto keys, on the correct implementation and operation of crypto algorithms, and on related aspects of crypto module security. These evaluation criteria do NOT address security issues associated with a protocol like IPsec, even though a number of IPsec implementations make use of crypto modules that have been evaluated under FIPS 140-1/2. It is appropriate that these IPsec products cite FIPS 140 evaluation ratings in product brochures even though this evaluation does not address the other security features of IPsec. (The evaluation criteria apply to both hardware and software modules and thus are widely applicable.) Keep this in mind as we discuss implementation options for management of ESP sequence numbers.

Some crypto chips that have been designed expressly for IPsec manage the ESP sequence number on chip. This is a good security design approach which makes it unlikely that the sequence number will be mismanaged, e.g., that the same sequence number will be assigned to multiple packets. However, it is not always possible to manage the sequence number for an SA on a chip. Vendors developing IPsec products have sometimes found it necessary to use multiple chips, perhaps spread over more than one (printed circuit) board, to provide crypto support for traffic from one SA. An obvious context in which this has arisen is in PPVPN products, where many subscriber data streams come together in a product and the goal is to multiplex these data streams over a pool of crypto chips. The motivations here are the same ones that cause ISPs to aggregate traffic at a POP using a hierarchy of routers and switches. In such contexts it is uneconomical to devote a crypto chip/module to each subscriber.

Sharing of a pool of crypto resources allows traffic for any SA to be assigned to any available crypto chip. This requires that the keys for each SA be available to each crypto chip. Because keys are constant for the life of an SA, this is easily accomplished by replicating the key material on each chip. Management of sequence numbers poses a problem in this context if one tries to manage them on the crypto chips. This is because this per-SA data changes every time a packet is transmitted on an SA, and it is generally impractical to update the data and make it available to all the crypto chips in a timely fashion.

Although the PPVPN context is the one that first brought this issue to our attention, it is anticipated that there may be other contexts in which traffic for an SA may be directed to different chips at different times. It is a generic design strategy for scaling a system to accommodate higher traffic volumes without dedicating crypto chips to each possible traffic source.

When this issue was raised on the IPsec mailing list over a year ago, the advice provided by the WG to vendors who encountered this problem was to move management of the sequence numbers out of the crypto chips. For each SA there is a point in the system where the traffic is mapped to the SA and at that point one can assign the sequence number to what will become an ESP packet, without any synchronization problems. This approach does reduce the assurance with which sequence numbers are managed, relative to on-chip management, but because the security function offered by sequence numbers is quite limited, this is not viewed as a problem.

If we make use of the ESP sequence number as the per-packet input to CM, the situation changes. In this case, the sequence number is a critical security parameter for the encryption algorithm, as noted earlier. This is because encrypting two packets under the same key with the same per-packet CM input results in duplicate keystream and a serious compromise of confidentiality. Moreover, because the sequence number is clearly visible in the packet, a wiretapper would know precisely when this occurred, if it happened. If the ESP sequence number is used as the per-packet input in CM, then management of the sequence number falls within the scope of a FIPS 140-1/2 evaluation. This means that the security boundary for evaluation of a product must encompass management of the sequence number.
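To make the consequence of duplicate keystream concrete, the following sketch (with a random byte string standing in for real AES-generated keystream) shows what a wiretapper learns when two packets are encrypted under the same key with the same per-packet CM input: XORing the two ciphertexts cancels the keystream and exposes the XOR of the plaintexts.

```python
# Illustration only: 'keystream' stands in for the AES-ECB output that
# counter mode would produce from one key and one counter-block sequence.
import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

keystream = os.urandom(16)   # same key + same CM input -> same keystream
p1 = b"attack at dawn!!"
p2 = b"retreat at nine!"

c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# A wiretapper holding both ciphertexts recovers p1 XOR p2, because the
# identical keystreams cancel: (p1^k) ^ (p2^k) = p1 ^ p2.
leak = xor_bytes(c1, c2)
assert leak == xor_bytes(p1, p2)
```

From p1 XOR p2, standard plaintext-redundancy techniques typically recover both plaintexts, which is why CM input uniqueness is a critical security parameter rather than a secondary concern.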

If sequence numbers are managed on crypto chips, this is not a problem. But, if a product wants to manage sequence numbers off chip, e.g., to better support scaling in the ways described above, then the security evaluation boundary will have to encompass a much larger portion of the IPsec product. In many instances, this will likely limit the level of assurance at which a product can be evaluated. In pragmatic terms, the security of the encryption process will almost certainly be lower, because the per-packet CM input is now managed in a less secure fashion. Some products might even forgo FIPS evaluation entirely, because if the security evaluation boundary grows to encompass most of the product software, any change in that software voids the evaluation!

This also creates a potentially very odd situation: if an IPsec product supports both CM and CBC modes, the security of the product might vary depending on which mode was used. This could arise if the security critical inputs to the encryption mode were managed in different places. For example, for CBC mode on-chip IV generation could be very secure, but off-chip sequence number management would result in lower assurance for CM, if the sequence number were used as the per-packet input.

This discussion of security evaluation boundaries is not based on arbitrary evaluation criteria. It has a very pragmatic basis. Experience has shown that many, if not most, of the security vulnerabilities that arise in fielded products are a result of implementation errors. Increasing the size of the security boundary either reduces the likely level of assurance that is achieved, or increases the costs associated with engineering the product (for a fixed level of assurance). In general, architectural approaches that make it easy to do a good job of ensuring keystream uniqueness make it more likely that a product will not have vulnerabilities due to keystream generation failures.

To put this issue in perspective, the current ID warns implementers against generating keystream for more than 2^64 blocks (not packets), even though we do not know of any specific vulnerabilities that could be exploited if this limit is exceeded. Yet we understand precisely what attacks could result if keystream is reused. Thus it makes sense to adopt approaches to keystream management that do not make it harder to do this task in a very secure fashion.

This analysis argues for adopting an approach to generating the per-packet CM input in a fashion that can be uniformly secure, irrespective of whether the ESP sequence number is managed on chip or off. For performance reasons, one would like an approach that allows multiple chips to generate keystream for packets transmitted on the same SA, while ensuring no keystream reuse and without having to synchronize values (across chips) that change on a per-packet basis. If one carries an explicit, per-packet IV, these goals can be achieved.

There are several ways one can generate the per-packet IV in a secure fashion across multiple chips. For example, one can assign a distinct, small ID number to each chip (or each core on a chip), and include that number as part of a 64-bit IV. This ensures that each chip can generate a unique IV sequence in a purely local fashion, using the per-chip (or per-core) ID as part of the IV. There is no need to specify in a standard what portion of the IV is devoted to a per-chip ID, since to a receiver the whole IV is opaque. This means that an IPsec implementation that has just one chip can use the whole IV as a counter or an LFSR, while a multi-chip implementation can assign as many bits as are needed to allow for maximum, per-SA parallelism. This is precisely the sort of opportunity for local optimization, with no adverse implications for receiver implementation choices, that one should strive for in a standard.
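The chip-ID partitioning described above can be sketched as follows. The split chosen here (4 ID bits, 60 counter bits) is purely illustrative, as are the function and variable names; the point is that each chip generates unique 64-bit IVs with no cross-chip synchronization, and a receiver treats all 64 bits as opaque.

```python
# Illustrative partition of an opaque 64-bit IV into a per-chip ID and a
# purely local counter. A single-chip implementation could instead use
# all 64 bits as one counter (or an LFSR); the receiver cannot tell.
CHIP_ID_BITS = 4                   # assumed split, not from any standard
COUNTER_BITS = 64 - CHIP_ID_BITS

def make_iv(chip_id: int, local_counter: int) -> int:
    assert chip_id < (1 << CHIP_ID_BITS)
    assert local_counter < (1 << COUNTER_BITS)
    return (chip_id << COUNTER_BITS) | local_counter

# Two chips emitting the same local counter value still produce
# distinct IVs, so no keystream reuse occurs across the pool:
assert make_iv(0, 7) != make_iv(1, 7)
```

With this scheme, adding chips to the pool requires no per-packet coordination at all: uniqueness is guaranteed by construction, per chip.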


Counters vs. Linear Feedback Shift Registers (LFSRs)

Finally, there has been discussion of the merits of using counters vs. LFSR to generate CM input sequences. LFSRs were used in the ATM Forum standard for CM encryption. They exhibit advantages, relative to adders, for generating a sequence of inputs in very high performance hardware. The time required by an LFSR to generate the next value in the sequence is constant, very small, and the hardware required is also minimal. In contrast, if one generates a sequence of inputs using a counter, the counter must allow for time to propagate carries, and the counter is bigger than an LFSR, for comparable input sizes. (One can diminish the delay attributable to carry propagation by devoting more gates to the adder, but it is not likely to be faster than an LFSR generating the same size sequence, and it will certainly be bigger in gate count.)

Note that these arguments do not apply to software implementations, especially ones making use of general purpose processors. In those implementations, an LFSR would generally be slower than a counter, given the availability of a 32- or 64-bit adder in the CPU. Software implementations are not capable of very high performance, so there is no motivation to optimize them by using an LFSR. But it is important not to burden software implementations that communicate with very high performance hardware.

In the context of ESP, we argued that it is important to use a well-defined, standard way to generate the intra-packet input sequences. We also argued that it is preferable to have just one way to do this. Using an LFSR here would yield performance benefits, but these benefits may not be great, since the intra-packet sequences need only be 28-bit values. Also, we want to preserve flexibility for transmitters with regard to keystream generation. It does not seem easy to standardize a means for using LFSRs for intra-packet keystream generation. This leads to the suggestion to use a (28-bit) counter for this part of the CM input.
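The division of labor between the opaque 64-bit per-packet IV and the standardized 28-bit intra-packet counter can be sketched as below. The exact field layout of the 128-bit AES input block (including what occupies the bits not used by the IV and counter, zero-padded here) is whatever the draft specifies; this sketch only illustrates the structure being discussed.

```python
# Illustrative assembly of a 128-bit CM input block for AES: a 64-bit
# per-packet IV in the high-order bits, a 28-bit intra-packet block
# counter in the low-order bits, and the remaining 36 bits zero-padded
# purely for illustration (the real layout is defined by the draft).
IV_BITS, CTR_BITS, BLOCK_BITS = 64, 28, 128
PAD_BITS = BLOCK_BITS - IV_BITS - CTR_BITS

def cm_input_block(iv: int, block_index: int) -> bytes:
    assert iv < (1 << IV_BITS) and block_index < (1 << CTR_BITS)
    value = (iv << (PAD_BITS + CTR_BITS)) | block_index
    return value.to_bytes(BLOCK_BITS // 8, "big")

# Successive blocks of one packet share the opaque IV and differ only
# in the standardized low-order counter:
b0 = cm_input_block(0xDEADBEEF, 0)
b1 = cm_input_block(0xDEADBEEF, 1)
assert b0[:8] == b1[:8] and b0 != b1
```

Only the counter portion needs a standard definition, because both ends must generate the same intra-packet sequence; the IV is carried explicitly, so the receiver never needs to know how it was generated.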

However, generation of per-packet (vs. intra-packet) inputs to CM does not have to be standardized, as explained in earlier analysis. We also have seen how some implementations might need to generate a new per-packet input as fast as an AES implementation can perform one round, the same speed at which the intra-packet sequences need to be generated. Thus it is attractive to allow for use of LFSRs to generate per-packet inputs, i.e., IVs. Note too that because the per-packet values are bigger (64-bits vs. 28-bits), the performance and gate count advantages for an LFSR vs. a counter can be greater, at least in principle, for hardware.

By allowing an implementer the freedom to choose how to generate these IVs, no constraints (other than uniqueness) are imposed, and there is no adverse effect on receivers, to whom these values are opaque. Thus a software implementation that receives packets from a hardware implementation using an LFSR for IVs is not affected by the transmitter's implementation choice. The same is true in the other direction, i.e., use of a counter for IV generation does not affect a receiver.


The Bottom Line

The preceding, lengthy analysis has indicated why it is preferable to carry an explicit IV for per-packet synchronization for CM, in a range of contexts. Using explicit IVs allows product designers flexibility for improving encryption performance, while maintaining or improving security, relative to reusing the ESP sequence number as an IV. Reusing the ESP sequence number in this fashion precludes some designs that increase performance, as illustrated in the analysis. In some contexts, e.g., PPVPN designs, reuse of the sequence number will either make FIPS 140 evaluation harder and more expensive, or will reduce the security assurance level of products, when this mode is employed.

The only downside to this choice is the overhead associated with the IV. This overhead is, on average, less than what most IPsec implementations currently experience with DES/3DES CBC, because CM does not require block size padding. On a percentage basis, the overhead of an explicit IV is typically quite small. We have been told that in the VoIP context even this small overhead is a concern, but we have no detailed analysis of the extent of the overhead relative to overall packet size.  We also have been told that these concerns have led to development of an alternative security protocol, SRTP, which is optimized for VoIP. I believe IPsec should not skew its use of CM to better support any one application, even an important one like VoIP, especially given the possibility of using an alternative security protocol.