Balanced Sampling II
safe landing
Last week I defined balanced sampling and discussed the first part of the two-part cube method for obtaining a balanced sample. In that post we identified samples with vertices of the unit cube, and we discussed the flight phase of the sampling algorithm, which converges to a point of the cube satisfying the given balance equations and inclusion probabilities. We also noted that in many cases no sample satisfies the balance equations exactly, so the flight phase cannot converge to a vertex. This week I’ll describe the landing phase, the second part of the cube method, which we use to obtain the final sample.
Throughout this post I assume the context of the previous entry on balanced sampling.
The flight phase is a martingale
First, a bit of an aside. In the last installment I defined the flight phase as a random walk \((\mathbf S(t) : t = 1, 2, \ldots)\), and while I didn’t say so explicitly, \(\mathbf S(t)\) has the property that

\[\mathbb E\left[\mathbf S(t+1) \mid \mathbf S(t), \ldots, \mathbf S(1)\right] = \mathbf S(t)\]

for all \(t\). In probability theory, a stochastic process with this property is called a martingale. We can think of a martingale as a fair game: at each step, your expected wealth after playing one more round of the game is equal to your current wealth. That is, the game is not biased toward making you richer or poorer.
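To make the martingale property concrete, here is a minimal Python sketch of a single flight-phase step. The helper name `flight_step` and the direction `u` are illustrative; in the actual algorithm, `u` is drawn from the null space of the balance matrix, which I take as given here.

```python
import numpy as np

def flight_step(S, u, rng):
    """One flight-phase step: move from S along +u or -u as far as the unit
    cube allows, with probabilities chosen so that E[S(t+1) | S(t)] = S(t)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        # Largest lam1, lam2 >= 0 keeping S + lam1*u and S - lam2*u in [0, 1]^N.
        lam1 = np.min(np.where(u > 0, (1 - S) / u, np.where(u < 0, -S / u, np.inf)))
        lam2 = np.min(np.where(u > 0, S / u, np.where(u < 0, (S - 1) / u, np.inf)))
    # Take the +lam1*u step with probability q chosen so q*lam1 - (1-q)*lam2 = 0.
    q = lam2 / (lam1 + lam2)
    return S + lam1 * u if rng.random() < q else S - lam2 * u

# Monte Carlo check of the martingale property: the average next state is ~ S.
rng = np.random.default_rng(0)
S = np.array([0.3, 0.5, 0.2])
u = np.array([0.5, -0.5, 0.5])  # illustrative direction (null space of A)
avg = np.mean([flight_step(S, u, rng) for _ in range(20000)], axis=0)
# avg is close to S = [0.3, 0.5, 0.2]
```

Note that each individual step lands on a face of the cube (some coordinate hits 0 or 1), yet the average over many trials recovers \(\mathbf S\) exactly; that is the fair-game property in action.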
In his discussion of the cube method, Tillé defines the notion of a balancing martingale, which is a martingale that satisfies some additional properties pertinent to the balanced sampling problem. He then gives the flight algorithm as an example of a balancing martingale. I left this discussion out of the last post because I didn’t think it was necessary to a basic understanding of the flight phase. Martingales are, however, interesting objects in their own right: they have useful convergence properties, they’re the subject of the powerful Optional Stopping Theorem, and they show up in applications of physics and financial mathematics. Let’s revisit this subject in the future.
Incidentally (as Krzysztof Burdzy is fond of pointing out), a martingale originally meant part of the leather strappy thing that you put over a horse’s head.
The landing phase
At the end of the flight phase, we’re left with an \(N\)-dimensional vector

\[\mathbf S = (S_1, \ldots, S_N)\]
that belongs to the unit cube. Some coordinates will be 0 or 1, and others will be strictly between 0 and 1. The purpose of the landing phase is to “round” \(\mathbf S\) to a cube vertex, i.e. an \(N\)-dimensional vector all of whose coordinates are 0 or 1.
The flight phase has already brought us as close as possible, in some sense, to a cube vertex while still conforming to both the balance equation and the given inclusion probabilities. In order to land, we’ll have to relax one of these constraints. Tillé suggests that we put a probability distribution on cube vertices that are compatible with \(\mathbf S\) in such a way that we maintain the correct inclusion probabilities and we minimize deviation from the balance equation. The landing phase simply chooses a vertex according to these computed probabilities. Tillé calls this the enumerative algorithm.
A cube vertex \(\mathbf s = (s_1, \ldots, s_N)\) is said to be compatible with \(\mathbf S\) if

\[s_k = S_k \quad \text{whenever } S_k \in \{0, 1\}.\]
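Enumerating the compatible vertices amounts to fixing every coordinate of \(\mathbf S\) that is already 0 or 1 and letting the remaining, fractional coordinates range over \(\{0, 1\}\). A naive Python sketch (the function name is mine):

```python
from itertools import product

def compatible_vertices(S, tol=1e-9):
    """All 0/1 vectors that agree with S on its already-integer coordinates."""
    choices = [(0,) if s < tol else (1,) if s > 1 - tol else (0, 1) for s in S]
    return [list(v) for v in product(*choices)]

# Two coordinates of S are still fractional, so |C(S)| = 2^2 = 4.
verts = compatible_vertices([1.0, 0.25, 0.0, 0.6])
# verts == [[1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0], [1, 1, 0, 1]]
```

The size of this set is \(2^q\), where \(q\) is the number of coordinates still strictly between 0 and 1, which is what makes the flight phase pay off later.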
Let \(C(\mathbf S)\) denote the set of all cube vertices compatible with \(\mathbf S\). For \(\mathbf s \in C(\mathbf S)\) we define the probabilities \(p^*(\mathbf s)\) as the solution to the following linear program:

\[\min_{p} \sum_{\mathbf s \in C(\mathbf S)} p(\mathbf s)\, \text{Cost}(\mathbf s)\]

subject to

\[\sum_{\mathbf s \in C(\mathbf S)} p(\mathbf s)\, \mathbf s = \mathbf S, \qquad \sum_{\mathbf s \in C(\mathbf S)} p(\mathbf s) = 1, \qquad p(\mathbf s) \ge 0 \text{ for all } \mathbf s \in C(\mathbf S).\]

Since \(\mathbb E[\mathbf S] = \pi\), the first constraint is exactly what preserves the inclusion probabilities.
As for cost functions, we might choose to minimize the sum of (normalized) squared deviations of the Horvitz–Thompson estimators from the true totals:

\[\text{Cost}(\mathbf s) = \sum_{j=1}^p \left( \frac{\hat X_j(\mathbf s) - X_j}{X_j} \right)^2, \qquad \hat X_j(\mathbf s) = \sum_{k=1}^N \frac{x_{kj}\, s_k}{\pi_k}.\]
Alternatively, we could take \(\text{Cost}(\mathbf s)\) to be the Euclidean distance from \(\mathbf s\) to \(\pi\). The point is that at this stage we can fall back on standard linear programming tools to obtain our final sample.
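As a sketch of what "fall back on standard linear programming tools" might look like, here is the landing LP solved with `scipy.optimize.linprog`. The tiny example (the vertices, the Euclidean-distance cost, and the function name) is hypothetical, not Tillé's code.

```python
import numpy as np
from scipy.optimize import linprog

def landing_lp(S, vertices, cost):
    """Landing phase as a linear program: find probabilities p*(s) over the
    compatible vertices that preserve E[s] = S while minimizing expected cost."""
    V = np.array(vertices, dtype=float)         # one row per compatible vertex
    c = np.array([cost(s) for s in vertices])   # objective: sum_s p(s) Cost(s)
    A_eq = np.vstack([V.T, np.ones(len(vertices))])  # constraints: E[s] = S, sum p = 1
    b_eq = np.append(S, 1.0)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return res.x

# Hypothetical example: S has one fractional coordinate, so |C(S)| = 2.
pi = np.array([0.5, 0.5, 0.5])
S = np.array([1.0, 0.5, 0.0])
verts = [[1, 0, 0], [1, 1, 0]]
p = landing_lp(S, verts, cost=lambda s: np.linalg.norm(np.array(s) - pi))
# p sums to 1 and reproduces S on average: p @ verts == S
```

In this toy case the constraints pin down \(p = (1/2, 1/2)\) regardless of the cost; with more fractional coordinates the cost function breaks the ties.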
Wrapping up
Now that we’ve defined the enumerative algorithm, you might be wondering why we bother with the flight phase at all. In principle, nothing stops us from solving the linear program over the set of all samples \(\mathbf s\) and then selecting a sample according to the computed probabilities. The reason is efficiency. If there are \(N\) individuals in our universe, then there are \(2^N\) possible samples, a prohibitively large number to optimize over. In contrast, arguments from linear algebra show that at the end of the flight phase at most \(p\) coordinates of \(\mathbf S\) will need to be rounded to 0 or 1, where \(p\) is the dimension of the auxiliary data vector \(\mathbf x\). This leaves us with a linear program over at most \(2^p\) vertices, regardless of the universe size.
Tillé and his collaborators have implemented the cube method, along with a number of other algorithms, in the sampling package, available from CRAN.