Feasibility is optimal

In this post I will critique some points I have made in favor of optimality. Examples can be seen here and here.

In the literature and present discourse on calculation in kind there is much debate around optimization. Leonid Kantorovich's "plan rays" is the start of this discussion as far as I can tell. They get picked up again by Paul Cockshott and Allin Cottrell in Towards a New Socialism, and more recently by Phillip Dapprich. I have suggested optimizing on labour, for example here. More recently in our Marx22 presentation me and Dave Zachariah have suggested that different regions might optimize on different things. But there are some problems with these notions of optimization.

Bad data

A major problem with pushing too hard for "optimal" plans is that they depend on the quality of the data. Whatever you have in your database is only ever an approximation of reality. The map is not the terrain.

In real systems with humans in the loop there will inevitably be incorrect data entered either by accident or deliberately in order to game the system. These situations can to some extent be detected using statistics, but it's never perfect.

Incorrect data directly results in incorrect constraints, which in turn may result in plans that cannot be satisfied in the real world. Assuming perfect data therefore threatens viability.

Unexpected things may also happen, and if the currently computed plan is too close to a constraint that is affected by such events then it will be more difficult to plan around them. A concrete example could be a solver that optimizes on labour choosing to centralize distribution in a continent to a single location because it is more economic in terms of labour to do so. As soon as such a distribution hub is hit by a disaster the problem with that approach will quickly become apparent.

Graphically these situations look something like this:

An ill-conditioned linear program represented by a very acute triangle with an X near the tip

We are optimizing in the direction of $c$ and x marks the current plan, quite close to optimum. The sides E and F of the triangle form an acute angle. In a situation like this the linear program that the triangle DEF represents is called ill-conditioned. A small change to the system may cause a large change in the solution.

Suppose that E is is disturbed for some reason such that its angle relative to D decreases. In that case we will get a situation like this:

The new linear program with the X moved a large distance

In order for the plan to still be viable it has to be updated so that it sits in the triangle DE'F. It should be apparent that this requires that the plan changes considerably.

Stay in the middle (or: Keep these constraints away from me!)

If the corners are dangerous then the thought arises that perhaps it is best if we stay as far away from them as possible? One way to achieve this is to stay as far away from any constraint as possible. This is an idea central to so-called barrier methods for linear programming.

The triangle is filled with an elliptic gradient. The X sits in the center of the gradient towards the short side D

A function is defined that for each point in the triangle expresses how far it is from all constraints. This function can be defined such that there is one and only one point in the system that is the "furthest" away from all constraints. This point is known as the centroid of the system. The mathematics of this is elaborated on further down.

Suppose the same thing happens to this system, namely that the angle between D and E decreases:

A similar situation as last time, except X has to move much less

The solution will have to move much less in this case in order to reach its desired position. Our ill-conditioned system is suddenly well-conditioned. In practical terms this may mean that the solver deliberately chooses to spread goods out a bit in order to have sufficient leeway. This leeway also has political implications. It allows for greater autonomy in workplaces without this imposing huge hanges to the overall plan. Local variety thereby gets attenuated.

The middle is good. The middle is safe.

Computational aspects

One practical reason to reconsider optimality is the computational cost of it. Due to Nesterov and Todd it is known that barrier methods for solving convex programming require at most $O (L \sqrt{m})$ Newton steps for $L$ bits of accuracy. Todd and Ye have shown that the problem requires $Ω (L \sqrt[3]{m})$ Newton steps for some (not all) programs when using predictor-corrector methods. In other words convex programming cannot be guaranteed to require fewer Newton steps than somewhere between the cube root and square root of the number of constraints, times some constant, times the desired digits of accuracy. In practice the problem is easier than this, but it is also much more difficult than the point I am getting to.

If we abandon optimality for a moment and take from barrier methods only the concept of centrality, then we'll have reduced the required computations immensely. This means we can offer faster feedback to users. To see why this is the case it is necessary to know that barrier methods rely on Newton iteration, which has quadratic convergence near the solution. This means that after each iteration the number of digits that are correct doubles.

One key question is what is meant by being "near" the solution. Renegar shows that it is possible to move a constraint at least $1 / 13 \sqrt{m}$ of the way toward the current center of the system and that one needs only a single Newton step to re-center in the new system. With $m = 46 * 10^{9}$ this is quite a short distance, less than one millionth. In practice it is possible to move the constraints much further at the cost of more Newton iterations. But it is also known that for some programs performing such aggressive updates will yield Newton steps that end up outside the feasible region.

In a draft version of this text I had an estimate what the required computations will have shrunk to (3¼ hours), but after re-reading Renegar I realized that estimate was too optimistic. For planning it may still be valid, but I prefer more guarded language.

In favor of optimality

Now that I have spilled some ink critiquing optimality in favor of centrality, why is the title of this section pro-optimality? The reason is because the above amounts to optimizing on feasibility.

Barrier methods approach finding the solution that maximizes $c^{T} x$ by defining a barrier function $f (x)$ that quantifies "how far" $x$ is from all constraints in the system $A x \geq b$ . It looks as follows:

$f (x) = \sum_{i} \log (a_{i}^{T} x - b_{i})$

where $a_{i}^{T}$ is each row in $A$ .

Renegar shows that $f (x)$ is upward convex (strictly concave), and because of this there exists a unique solution that maximizes $f (x)$ which Renegar calls $ξ$ .

If one has a measure of uncertainty associated with each constraint then the system can be rescaled such that the optimal solution $ξ$ is the one that maximizes certainty. I consider such certainty maximization to be equivalent to maximizing feasibility, which in turn is useful for ensuring viability.

Some readers may wonder what the point of defining $f (x)$ is when it comes to solving linear programs. The answer is that if one adds a constraint $c^{T} x \geq k^{(j)}$ and continually increases $k^{(j)}$ then the resulting $ξ^{(j)}$ will eventually maximize $c^{T} ξ^{(j)}$ . The increasing of $k$ acts as a "broom" that pushes $ξ$ towards the optimum along what is known as the central path. Predictor-corrector methods speed this process up via extrapolation. But we're of course not seeking to maximize $c^{T} x$ but to maximize peace of mind.

A pedagogical point struck me the other day which I will elaborate on. The definition of $f (x)$ looks somewhat arbitrary and doesn't lend itself to much intuition at first glance. But if we rearrange it a bit then we get the following:

$f (x) = \sum_{i} \log (a_{i}^{T} x - b_{i}) = \log \prod_{i} a_{i}^{T} x - b_{i}$

Maximizing $f (x)$ is equivalent to maximizing

$F (x) = e^{f (x)} = \prod_{i} a_{i}^{T} x - b_{i}$

In other words what we're doing is multiplying together the distances from the current solution to all constraints and looking for the solution that maximizes that product! This means that $F (x)$ is in units of the products of the units of all constraints. If we have two constraints in terms of volume (m³) and five constraints in terms of mass (kg) then $F (x)$ has units of m⁶kg⁵.

In a system with $m$ constraints doubling the distance to one constraint only requires reducing the distance to the others by roughly a factor $\sqrt[m - 1]{2} \approx 1 + 0.69 / m$ . This is the source of the "attenuation" I spoke of earlier. With a million workplaces, each workplace has to shoulder less than one millionth of the burden of a change to any other workplace.

What we have is effectively a blessing of dimensionality. The extreme points of the system get factorially further and further away from the central ellipsoid the higher the number of dimensions. When an extreme solution is sought, as is typically the case in linear programming, this fact is a huge problem and part of why solving LP is so difficult. If one instead seeks centrality this fact becomes a benefit. It no longer becomes possible that multiple disasters occurring at the same time should threaten the viability of the system, precisely because the extreme points are very far away from the center.

Finally all this does not exclude that it might be necessary to push for very "stiff" plans in certain locations, especially locations where rapid industrialization is deemed necessary. In other locations, for example here in Sweden, a more lax approach as I describe above may be called for. Both approaches can be accommodated within this formalism.

Another barrier?

I suspect that it is possible to pick another barrier function such that Newton convergence can be guaranteed for more aggresive updates. If this can be done then the required solver time drops from months to hours. One barrier function I have in mind looks like this:

$f (x) = \sum_{i} \log (a_{i}^{T} x - b_{i}) - g_{i} [d_{i} / 2 - (a_{i}^{T} x - b_{i})]^{2}$

where $d_{i}$ is a vector of "diameters" computed by tracing lines from the current solution in the direction towards and away from each constraint. The square term thus is likely (but not guaranteed) to have its maximum within the feasible region. What I suspect is that with appropriately chosen large-ish $g_{i}$ convergence will be rapid, because without the log term Newton iteration finds the maximum in a single step. The log term is there only to guarantee that this maximum is within the feasible region.

Whether this barrier function is worthwhile is something I would have to investigate.