Subject: Re: [pqc-forum] Question on Classic McEliece definition of FixedWeight
From: "D. J. Bernstein"
Date: Fri, 24 Feb 2023 08:51:06 +0100
To: pqc-forum
Message-ID: <20230224075106.560655.qmail@cr.yp.to>

The general context here is the Classic McEliece specification defining exactly how random input is used, including exactly how much randomness is used. https://classic.mceliece.org/mceliece-rationale-20221023.pdf includes the security rationale for this in Section 4.

In particular, pure rejection-sampling loops such as "flip a coin until it's heads" can consume an arbitrary amount of randomness, although the chance of this happening drops exponentially with the amount. For careful testing and verification, it's important to note that the possible coin sequences used are heads, tails-heads, tails-tails-heads, etc., and not, e.g., heads-tails with the tails being ignored.

Wrenna Robson writes:
> However, what then is the domain of the FixedWeight mathematical
> function?

The specification "defines each mathematical function F by presenting an algorithm to compute F" (page 3). For randomized functions such as FixedWeight, the set of strings of random bits allowed as input "is defined as the set of strings of random bits consumed by these algorithms" (page 3).

In particular, the specification presents a FixedWeight algorithm (page 10). The FixedWeight mathematical function is the function computed by this algorithm. The domain of the function is the set of strings of random bits consumed by this algorithm. The output of the mathematical function, for any particular string in the domain, is the output of this algorithm for that string of random bits.

FixedWeight begins by generating sigma_1 t random bits (Step 1). It then performs a specific computation (Steps 2, 3, 4), which may lead the algorithm to restart; otherwise the algorithm produces an output (Steps 5 and 6). Consequently, the domain consists of

* all accepted strings R_0 of sigma_1 t bits, where "accepted" means that the algorithm doesn't restart;

* all strings R_0+R_1 (with + for concatenation) where R_0 is a rejected string of sigma_1 t bits ("rejected" means non-accepted) and R_1 is an accepted string of sigma_1 t bits;

* all strings R_0+R_1+R_2 where R_0 and R_1 are rejected and R_2 is accepted;

etc.

Note that, in general, algorithms do not necessarily halt. If you're formalizing the definition of the output of an algorithm, you have to include an extra output possibility meaning non-halting. See, e.g., the definition of "M(x) = ↗" on page 20 of Papadimitriou's 1993 Computational Complexity textbook.

In particular, pure rejection-sampling algorithms such as FixedWeight can consume an infinite string R_0+R_1+... rather than halting---which doesn't matter in reality for any well-designed algorithm since it happens with probability 0, but does matter for rigorous proofs. The domain includes all R_0+R_1+... such that each R_i is rejected.

Something else to watch out for in formalization is that elementary definitions of randomness for finite sets can't formalize the infinite RNG-output string as a random object, can't formalize statements such as "the average number of coin flips until heads is 2", and can't formalize statements about average run times of typical algorithms. The necessary definitions for rigorous analysis of probabilistic algorithms come from measure theory. See https://gilith.com/papers/termination.pdf for an example of formally verifying the behavior of probabilistic algorithms.
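To make the shape of that domain concrete, here is a minimal sketch (Python, purely illustrative: the block size and the accept() test are placeholders, not the real Steps 2-4 of the specification) of a pure rejection-sampling loop that records exactly which random bits it consumes:

   # Illustrative sketch only, not the specification's FixedWeight:
   # a pure rejection-sampling loop that restarts on rejection and
   # records exactly which random bits it consumes.

   def rejection_sample(random_bits, block_bits, accept):
       consumed = []
       while True:                    # runs forever on an all-rejected stream
           block = [next(random_bits) for _ in range(block_bits)]
           consumed.extend(block)     # no bits are ever skipped or reused
           output = accept(block)
           if output is not None:     # accepted: halt
               return output, consumed
           # rejected: loop back and consume a fresh block

   # Toy usage: accept a block exactly when it has weight 4.
   import random
   bits = iter(lambda: random.getrandbits(1), 2)    # endless stream of bits
   out, consumed = rejection_sample(bits, 8, lambda b: b if sum(b) == 4 else None)

The string consumed is then one element of the domain in the sense above: zero or more rejected blocks followed by one accepted block.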
> As written, steps 3 and 4 can cause
> a restart of the algorithm back to 1, which then requires taking more
> random bits. So in theory FixedWeight could need an arbitrary number
> of random bits before it successfully outputs,

Right. Pure rejection-sampling algorithms can consume arbitrarily long finite strings, or infinite strings in the non-halting case.

> and it isn't deterministic.

Correct. That isn't tied to the number of random bits used or to the internal rejection-sampling structure: encapsulation algorithms (and keygen algorithms) always have to be randomized for security.

It's normal for larger systems to make their own decisions about RNGs as separate modules from cryptosystem specifications. Security review is partitioned the same way; see Section 4 of the rationale.

> This is in contrast to the approach with KeyGen, where KeyGen is
> described as a no-input function which generates a random string l,
> and then SeededKeyGen uses the pseudorandom bit generator G in order
> to generate the required bits: if there is a restart in SeededKeyGen,
> the output of G is used to produce a new seed. So once you decide the
> initial seed, it is deterministic.

Right. You can think of key generation as being constructed in three layers: first there's a pure rejection-sampling algorithm; then this is composed with a stream cipher generating randomness for that algorithm; then this is composed with the external RNG to generate the cipher key.

The reason for this complication is explained in Section 4.2 of the rationale. In short, the details support a spectrum of interoperable compression options for the private key.

For formalization, one disadvantage of merging symmetric cryptography into rejection-sampling loops is that suddenly it's no longer clear that the composed algorithm has probability 0 of running forever. However, suitable PRF assumptions imply that the probability is extremely close to 0.

> Is there an intent that FixedWeight be given a definition which makes
> it deterministic in an analogous way?

Any test framework that provides a deterministic randombytes() for test vectors (e.g., SUPERCOP) is already doing this in a modular way, without complicating individual cryptosystem specifications.

McTiny needed to compress the encapsulation state to meet extreme performance requirements, and simply took the modular approach to compressing this state. McEliece-specific compression options provide interesting performance tradeoffs for private keys but not for the encapsulation state. See Section 4.5 of the rationale: "Applications might plug in an RNG that generates output from a series of seeds as in key generation, allowing a ciphertext to be compressed to the final seed. See, e.g., [8]. The same compression approach works for rejection sampling in much more generality."

---D. J. Bernstein (on behalf of the Classic McEliece team)
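For concreteness, here is a minimal sketch of the three-layer construction described above (Python, purely illustrative: hashlib.shake_256 merely stands in for the pseudorandom bit generator G, and the seed length, output layout, and acceptance test are placeholders, not the specification's SeededKeyGen):

   # Illustrative sketch only, not the specification's SeededKeyGen.
   import hashlib, secrets

   SEED_BYTES = 32        # placeholder seed length
   BLOCK_BYTES = 64       # placeholder amount of randomness per attempt

   def G(seed):
       # Stand-in PRG: expand the seed into randomness for one attempt
       # plus a fresh seed to use if that attempt is rejected.
       out = hashlib.shake_256(seed).digest(BLOCK_BYTES + SEED_BYTES)
       return out[:BLOCK_BYTES], out[BLOCK_BYTES:]

   def try_keygen(randomness):
       # Layer 1: pure rejection-sampling core. Toy acceptance test:
       # succeed exactly when the first byte is nonzero.
       return randomness if randomness[0] != 0 else None

   def seeded_keygen(seed):
       # Layer 2: deterministic given the seed; a restart derives a new
       # seed via G rather than asking the caller for more randomness.
       while True:
           randomness, seed = G(seed)
           key = try_keygen(randomness)
           if key is not None:
               return key

   def keygen():
       # Layer 3: the external RNG is consulted only for the initial seed.
       return seeded_keygen(secrets.token_bytes(SEED_BYTES))

Given the same initial seed, seeded_keygen always returns the same key, which is the determinism described in the quoted text; a modular deterministic randombytes() achieves the same effect for FixedWeight without adding such layering to the specification.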