Towards large scale linear planning

In this post I will look at how large linear programs we can expect to be solvable on a modern computer cluster. I will only consider linear programs related to economic planning, meaning programs that are very sparse. I will be using the local supercomputer center HPC2N and its Kebnekaise cluster as my reference. I assume the reader is familiar with the Message Passing Interface (MPI). I use the term work here rather than time, and it is to be understood as the total amount of computation done by all nodes. The amount of time necessary depends on the number of nodes and how they are connected. Similarly applies for memory as I don't always bother to specify where everything is stored, but only concern myself with the total amount of memory used by the entire cluster.

In a previous post I laid down the rough shape of the linear programs in question. I will expand on it slightly. Each column in the constraint matrix now represents one production method. Each workplace could have many methods available. Each column could represent one particular machine. The average number of non-zeroes per column is bound by some constant $q$ .

Let $S$ denote the entire system matrix including non-negativity constraints and the constraint for the objective function. Finding a solution $x \in R^{n \times 1}$ that is within some given percentage of optimum involves repeatedly solving the following system

$H_{(i)} h_{(i)} = S^{T} Λ_{(i)}^{- 2} S h_{(i)} = - S^{T} Λ_{(i)}^{- 1} 1 = - g_{(i)}$

where $Λ_{(i)}$ is diagonal and changes between steps. Sizes are as follows:

$H_{(i)} \in R^{n \times n}, S \in R^{m \times n}, Λ_{(i)} \in R^{m \times m}, m > n$

The number of solves necessary for $L$ bits of precision and $m$ constraints is somewhere between $L / 2$ and $O (L \sqrt{m})$ . The lower bound is my own empirical result and the upper bound is due to Renegar.

Because the system is symmetric and positive definite, the conjugate gradient method can be used. The number of CG iterations necessary depends on how large global steps are taken. With a conservative $δ$ , few CG iterations are necessary per solve. If lots of extrapolation is used, like always stepping 75% of the way to the nearest constraint, then a lot more CG iterations are necessary per solve. A practical solver balances these two. A cluster implementation may choose to try different step sizes in parallel, assigning each one to a subset of machines and seeing which makes the most progress.

For an empirical lower bound using a diagonal preconditioner and large steps I have found 2000 CG iterations to be enough.

For an upper bound we need to first know the number of Newton steps per bit. Renegar tells us that each step with $δ = 1 / (13 \sqrt{m})$ shrinks the objective function by a factor $(1 - 1 / 28 \sqrt{m})$ . Together with the fact that we're solving both the primal and dual programs this means $56 \ln (2) \sqrt{m}$ Newton steps per bit. Each Newton step in turn requires some amount of CG iterations that depends on the condition number of $H_{(i)}$ and the quality of the preconditioner.

We can derive how many iterations are necessary to find the solution for each Newton step to within machine precision with single-precision floating point numbers using the following:

${(1 - \frac{2}{\sqrt{κ (M^{- 1} H_{(i)})}})}^{j} = 2^{- 23} \Rightarrow j \approx \frac{23}{2} \ln (2) \sqrt{κ (M^{- 1} H_{(i)})} \approx 8 \sqrt{κ (M^{- 1} H_{(i)})}$

$κ$ computes the condition number of a matrix and $M$ is whatever preconditioner we're using. It is assumed $| h_{(i)} |$ does not change too much between steps. This obviously does not hold for $h_{(0)}$ and therefore it likely involves more CG iterations. Due to the nature of floating point numbers it cannot involve more than 256 times the number of CG iterations than the average $h_{(i)}$ does, and therefore it amortizes.

All this taken together we expect the number of CG iterations $j_{max}$ to solve our linear program to $L$ bits of accuracy to be bracketed by

$2000 \leq j_{max} \leq 309 L \sqrt{κ (M^{- 1} H) m}$

where $κ$ is understood as the average condition number across all steps.

Preconditioner

I mentioned preconditioners earlier. Preconditioning is an important way to speed up the conjugate gradient method. A preconditioner is an operator that approximates the inverse of the system to be solved. Ways of maintaining a good preconditioner is where the algorithmic progress for interior point methods occurs today.

Diagonal preconditioner

A diagonal preconditioner is one of the simplest preconditioners available. It is simply the diagonal of $H_{(i)}$ which can be computed with $O (q n)$ work and then trivially applied. Because of its cheapness we compute it before every Newton step.

Incomplete Cholesky factorization

The incomplete (sparse) Cholesky factorization is a much better preconditioner than the diagonal, but it is also more costly to compute. First we will note that thanks to the structure of $S$ , forming $H_{(i)}$ explicitly is not too expensive. Assuming the non-zeroes of $S$ are distributed uniformly randomly then by the binomial distribution it can be shown that

$nnz (H_{(i)}) \approx n + (n^{2} - n) (1 - (1 - q / m)^{q}) \approx n + (n^{2} - n) q^{2} / m \approx n^{2} q^{2} / m$

when $q ≪ m$ . The following Octave program demonstrates this:

q = 30;
for n=2.^(10:18)
 m = 2*n;
 S = sprand(m,n,q/m);
 B = S'*S;
 est = n^2*q^2/m;
 printf("n=%6i ratio=%f\n", n, nnz(B)/est);
endfor

Which outputs the following:

n=  1024 ratio=0.812704
n=  2048 ratio=0.897969
n=  4096 ratio=0.948767
n=  8192 ratio=0.974487
n= 16384 ratio=0.988381
n= 32768 ratio=0.995698
n= 65536 ratio=0.998564
n=131072 ratio=1.000623
n=262144 ratio=1.001329

As $n$ (and $m$ ) grows the estimate improves. This means $H_{(i)}$ is quite sparse and memory use is $O (q^{2} n)$ when $n \propto m$ . Computing it involves $O (q n^{2})$ work. The sparsity pattern of $H_{(i)}$ does not change, which means computing $H_{0}$ amortizes and subsequent $H_{(i)}$ require only $O (q^{3} n)$ work ( $O (q)$ per non-zero).

It may be possible to speed up the computation of $H_{0}$ by exploiting its symmetry and the structure of $S$ . The columns and rows of $S$ may be ordered and indexed such that large swaths of $H_{0}$ may quickly be deduced to be zero. It might be possible to cluster $S$ so that most information in $H_{0}$ is close to the diagonal.

Once we have $H_{(i)}$ we can use it to compute the incomplete Cholesky factorization $L_{(i)} L_{(i)}^{T} \approx H_{(i)}$ with $O (nnz (H) n) = O (q^{2} n^{2})$ work. Such preconditioners remain useful only for so long however, and $L_{(i)}$ should be recomputed as the solver progresses. Since we're doing all this in parallel what we're really talking about is $L_{(i - τ)}$ , meaning Cholesky factorizations that arrive slightly "late". We can update these "late" preconditioners with low-rank adjustments using the Woodbury matrix identity, something that is often done in the literature on interior point methods.

I will only use the diagonal preconditioner for the rest of this post since it is much easier to analyze. I found the result for the density of $H$ useful enough to leave this section intact.

Parallel CG iteration

Assume a cluster of $P = M \times N$ nodes. Each CG iteration involves computing

$q_{(j)} = S^{T} Λ_{(i)}^{- 2} S p_{(j)}$

and applying the preconditioner

$z_{(j)} = D_{(i)}^{- 1} r_{(j)}$

where $D_{(i)}$ is diagonal. There are many ways to parallelize this. What I show here is merely one way. Note also that in addition to these computations there are global calls to MPI_Allreduce for scalars like $α_{(j)}$ and $β_{(j)}$ , and some trivial node-local vector arithmetic. This gets absorbed in the big-O notation.

Splitting of matrices

$S$ is split into blocks $S_{k, l}$ , one for each node, $1 \leq k \leq M$ and $1 \leq l \leq N$ . Each node also has $Λ_{{(i)}_{k}}^{- 2}$ , a block on the diagonal of $Λ_{(i)}^{- 2}$ corresponding to the horizontal strip $S_{k, *}$ .

Communicators

In addition to the global communicator MPI_COMM_WORLD there are two more sets of communicators: rowcomm and colcomm. Each rowcomm groups nodes belonging to one horizontal strip of $S$ , meaning ${rowcomm}_{k} \Leftrightarrow S_{k, *}$ . For colcomm it is similar, but with vertical strips: ${colcomm}_{l} \Leftrightarrow S_{*, l}$ .

$Communicators and S_{k,l}$

Computing D_(i)

Computing $D_{(i)}$ involves $O (q n / P)$ node-local work and a call to MPI_Allreduce along each colcomm. Because the amount of work and communication is trivial and amortizes it makes no difference big-O wise.

Computing q_(j)

For practical reasons the multiplication must be broken up into two parts that look as follows:

$t_{(j)} = Λ_{(i)}^{- 2} S p_{(j)}$

$q_{(j)} = S^{T} t_{(j)}$

Then each node performs the following series of operations:

$t_{k} = Λ_{{(i)}_{k}}^{- 2} S_{k, l} p_{l}$
$t_{k} = MPI_Allreduce (t_{k}, MPI_SUM, {rowcomm}_{k})$
$q_{l} = S_{k, l}^{T} t_{k}$
$q_{l} = MPI_Allreduce (q_{l}, MPI_SUM, colcomm)_{l})$

Note how each communicator only has to deal with a subset of $t$ and $q$ . This turns out to be important for being able to scale the system up.

Complexity

The amount of computation per node is $O (nnz (S) / P) = O (q n / P)$ assuming $S$ is perfectly split up among them. This can usually be accomplished by swapping rows and/or columns and adjsting the shape of the grid.

For communication we need to know how MPI_Allreduce works. A pessimistic estimate is that it is implemented by pairwise summing in a hypercube grid. In each rowcomm $t_{k}$ is summed in $\log_{2} N$ exchanges, each one involving $m / M$ elements. Similarly with colcomm. Let $K$ be the time necessary to transmit one value between two nodes in the cluster, in seconds. Then the time used by each pair of MPI_Allreduce calls is $K m / M \log_{2} N + K n / N \log_{2} M)$ .

If we take the above, let $M \approx N \approx \sqrt{P}$ then we get the following overall complexity:

$O (q n / P + K n / \sqrt{P} \log P) = \tilde{O} (q n / P + K n / \sqrt{P})$

We therefore expect linear speedup for small clusters, transitioning into $\sqrt{P}$ speedup for large clusters.

Strip based splitting

Another possibility of splitting $S$ is into horizontal and vertical strips, $S_{k, *}$ and $S_{*, l}$ respectively. This turns out to be worse than the method presented above except for very small clusters. I therefore only present it in short.

The per-node operations look as follows:

$t_{k} = Λ_{{(i)}_{k}}^{- 2} S_{k, *} p_{(j)}$
$t_{(j)} = MPI_Allgather (t_{k})$
$q_{l} = S_{*, l}^{T} t_{(j)}$
$q_{(j)} = MPI_Allgather (q_{l})$

This means $t_{k}$ and $q_{l}$ are computed at each node and then MPI_Allgather'd into $t_{(j)}$ and $q_{(j)}$ on all nodes. Allgather is assumed to be implemented by pairwise exchange of exponentially growing collections of $t_{k}$ 's. The resulting complexity is as follows:

$O (q n / P + K \sum_{k = 0}^{\log_{2} P} 2^{k} n / P) = O (q n / P + K n)$

This is clearly worse than the blocklevel splitting presented earlier, except when $P$ is very small and $\log_{2} P$ dominates. Another way to view the communication bound is that every node must receive $n + m$ elements per iteration which clearly takes $O (K n)$ time.

Results

The following table summarizes my estimates for how large systems can be solved on Kebnekaise for three of the node types in the cluster: Skylake, Compute-skylake and Large Memory. For those interested in the details the LibreOffice spreadsheet used can be downloaded here. I have chosen the average number of non-zeros per column to be $q = 160$ just like in previous posts, meaning the number of inputs and outputs per production method.

	Compute	Compute-skylake	Large Memory
M (#horizontal strips)	28	10	6
N (#vertical strips)	14	5	3
P (#nodes, M*N)	392	50	18
m (#constraints)	46*10⁹	12*10⁹	79*10⁹
n (#variables)	23*10⁹	6.2*10⁹	39*10⁹
Estimated CG iteration time (s)	9.25	3.28	29.45

Keep in mind these are estimates. To truly test this I would need to apply for a small project allocation via SNIC.

The most fruitful appears to be the Compute nodes which are capable of handling 23,000,000,000 variables in less than ten seconds per iteration. With the previously stated 2000 empirical iterations per solve this amounts to five hours of computation time. Very comfortable. Most of this is just sending data around. Only ~5% of the time is spent actually computing. This suggests that a cluster optimized for this kind of problem should have a much faster interconnect.

The upper bound is trickier since we don't know the condition number $κ (D^{- 1} H)$ . Let's guess it is 1000. In that case the number of iterations for 7 bits of accuracy is 15 billion, or 4300 years (!). Clearly we hope that $S$ is not structured so that predictor-corrector methods cannot be used or else plans of this size cannot be computed even with exascale clusters.

One thing to keep in mind is that we are likely to reuse previous solutions, and only perturbing the system in ways like tightening/slackening constraints or adding/removing production methods. Solving such a perturbed system is much cheaper than what has been presented here.

For those curious how long a post like this takes to write, I started writing this one on 2022-01-25. So about a week and a half off and on.

Addendum (2022-02-07)

After going through the literature some more I came across the paper A lower bound on the number of iterations of long-step primal-dual linear programming algorithms (doi:10.1007/BF02206818) by Michael J. Todd and Yinyu Ye which provides a worst-case bound on the number of predictor-corrector steps of $Ω (\sqrt[3]{m})$ . If I understand their results correctly then the duality gap shrinks by a factor of 3 every $K = ⌊ min (δ \sqrt[3]{m}, (1 - γ)^{2} n / (8 α)) ⌋$ steps where $δ = 25 / (32 α)$ , $α \geq 4$ and $1 / 2 < γ < 1$ . For sufficiently large linear programs this simplifies to $K \leq 25 / 128 \sqrt[3]{m}$ . Combining this with previous results in here we get:

$j_{m a x} \leq 25 / (16 \log_{2} 3) L \sqrt[3]{m} \sqrt{κ (M^{- 1} H)}$

Using the same $κ = 1000$ guess and $L = 7$ bits we get 780,000 iterations or 84 days. Not bad! Keep in mind this is an upper bound. In practice predictor-corrector does even better.

I've also been reading up on Katta Murty's sphere methods. The zero-radius version is identical to an idea I had that appeared to run in $O (n^{ω + 1})$ (strongly polynomial) time. I couldn't prove this, and by studying Murty's references I came across one paper that proves it takes exponential time for certain types of linear programs, just like the simplex method. Bummer.

Murty's more sophisticated sphere methods are based on a centering strategy that involves only projection and linear programs in two dimensions. They involve very few if any matrix inversions / linear system solves. But the number of steps appears to be up to $2 m$ which may be prohibitive communication-wise even if the steps themselves are cheap.