Pandemic Uncovers the Limitations of Superforecasting

Apr 18, 2020

As near as I can reconstruct, sometime in the mid-80s Philip Tetlock decided to conduct a study on the accuracy of people who made their living “commenting or offering advice on political and economic trends”. The study lasted for around twenty years and involved 284 people. If you’re reading this blog you probably already know what the outcome of that study was, but just in case you don’t, or need a reminder, here’s a summary.

Over the course of those twenty years Tetlock collected 82,361 forecasts, and after comparing those forecasts to what actually happened he found:
The better known the expert the less reliable they were likely to be.
Their accuracy was inversely related to their self-confidence, and after a certain point their knowledge as well. (More actual knowledge about, say, Iran led them to make worse predictions about Iran than people who had less knowledge.)
Experts did no better at predicting than the average newspaper reader.
When asked to guess between three possible outcomes for a situation, status quo, getting better on some dimension, or getting worse, the actual expert predictions were less accurate than just naively assigning a ⅓ chance to each possibility.
Experts were largely rewarded for making bold and sensational predictions, rather than making predictions which later turned out to be true.
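
To make the ⅓ comparison concrete, here is a minimal sketch of how a confident forecaster can score worse than someone who just assigns ⅓ to everything. It assumes a Brier-style score (squared error between the stated probabilities and what happened); the summary above doesn't say exactly how the forecasts were scored, and the ten questions and the expert's picks below are made up purely for illustration.

```python
def brier_score(forecast, outcome_index):
    """Multi-outcome Brier score for one question: 0 is perfect, 2 is the worst possible."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def mean_brier(forecasts, outcomes):
    """Average Brier score over a set of questions (lower is better)."""
    return sum(brier_score(f, o) for f, o in zip(forecasts, outcomes)) / len(outcomes)


# Hypothetical data: 0 = status quo, 1 = gets better, 2 = gets worse.
outcomes = [0, 0, 0, 0, 0, 1, 1, 2, 2, 0]

# A made-up overconfident expert who puts 80% on a single outcome each time
# and happens to pick the right one on half of the questions.
expert_picks = [0, 0, 0, 1, 2, 1, 0, 2, 0, 1]
expert_forecasts = [
    [0.8 if i == pick else 0.1 for i in range(3)] for pick in expert_picks
]

# The naive baseline: one third on each outcome, every time.
uniform_forecasts = [[1 / 3, 1 / 3, 1 / 3] for _ in outcomes]

expert_score = mean_brier(expert_forecasts, outcomes)
uniform_score = mean_brier(uniform_forecasts, outcomes)

print(f"overconfident expert: {expert_score:.3f}")
print(f"naive one-third:      {uniform_score:.3f}")
```

Under these made-up numbers the expert is right half the time, which beats the one-in-three you would expect from guessing, and still scores worse than the naive baseline, because each confident miss is punished heavily.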

For those who had given any thought to the matter, Tetlock’s discovery that experts are frequently, or even usually, wrong was not all that surprising. Certainly he wasn’t the first to point it out, though the rigor of his study was impressive, and he definitely helped spread the idea with his book Expert Political Judgment: How Good Is It? How Can We Know?, which was published in 2005. Had he stopped there we might be forever in his debt, but from pointing out that the experts were frequently wrong, he went on to wonder: is there anyone out there who might do better? And thus began the superforecaster/Good Judgment Project.

Most people, when considering the quality of a prediction, only care about whether it was right or wrong, but in the initial study, and in the subsequent Good Judgment Project, Tetlock also asked people to assign a confidence level to each prediction. Thus someone might say that they’re 90% sure that Iran will not build a nuclear weapon in 2020 or that they’re 99% sure that the Korean Peninsula will not be reunited. When these predictions are graded, the ideal is for 90% of the 90% predictions to turn out to be true, not 95% or 85%. In the former case the forecaster was underconfident, and in the latter case overconfident. (For obvious reasons the latter is far more common.) Having thus defined a good forecast, Tetlock set out to see if he could find such people, people who were better than average at making predictions. He did. And it became the subject of his next book, Superforecasting: The Art and Science of Prediction.
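
The grading described above can be sketched in a few lines: group predictions by the confidence that was stated, then compare that number with the share that actually came true. This is only an illustration of the idea, not the project's actual scoring code; the function name, the 5-point tolerance, and the sample predictions are all assumptions of mine.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Map each stated confidence level to the fraction of those predictions that came true.

    `predictions` is a list of (stated_confidence, came_true) pairs.
    """
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(came_true)
    return {c: sum(results) / len(results) for c, results in sorted(buckets.items())}


# Hypothetical record: ten predictions made at 90% and ten made at 70%.
predictions = (
    [(0.9, True)] * 8 + [(0.9, False)] * 2
    + [(0.7, True)] * 7 + [(0.7, False)] * 3
)

table = calibration_table(predictions)
for stated, observed in table.items():
    gap = observed - stated
    if abs(gap) < 0.05:  # arbitrary tolerance chosen for this sketch
        verdict = "well calibrated"
    elif gap < 0:
        verdict = "overconfident"
    else:
        verdict = "underconfident"
    print(f"stated {stated:.0%}, observed {observed:.0%}: {verdict}")
```

Here the made-up forecaster is overconfident at the 90% level (only 80% came true) and well calibrated at the 70% level.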

The book’s primary purpose is to explain what makes a good forecaster and what makes a good forecast. As it turns out, one of the key features is that superforecasters are far more likely to predict that things will continue as they have, while the forecasters who appear on TV, and who were the subject of Tetlock’s initial study, are far more likely to predict some spectacular new development. The reason for this should be obvious: that’s how you get noticed. That’s what gets the ratings. But if you’re more interested in being correct (at least more often than not) then you predict that things will basically be the same next year as they were this year. And I am not disparaging that; we should all want to be more correct than not. But trying to maximize your correctness does have one major weakness. And that is why, despite Tetlock’s decades-long effort to improve forecasting, I am going to argue that Tetlock’s ideas and methodology have actually been a source of significant harm, and have made the world less prepared for future calamities rather than more.

II.

To illustrate what I mean, I need an example. This is not the first time I’ve written on this topic. I actually did a post on it back in January of 2017, and I’ll probably be borrowing from it fairly extensively, including re-using my example of a Tetlockian forecaster: Scott Alexander of Slate Star Codex.

Now before I get into it, I want to make it clear that I like and respect Alexander A LOT, so much so that up until recently, and largely for free (there was a small Patreon) I read and recorded every post from his blog and distributed it as a podcast. The reason Alexander can be used as an example is that he’s so punctilious about trying to adhere to the “best practices” of rationality, which is precisely the position Tetlock’s methods hold at the moment. This post is an argument against that position, but at the moment they’re firmly ensconced.

Accordingly, Alexander does a near perfect job of not only making predictions but assigning a confidence level to each of them. Also, as is so often the case he beat me to the punch on making a post about this topic, and while his post touches on some of the things I’m going to bring up, I don’t think it goes far enough, or offers its conclusion quite as distinctly as I intend to do.

As you might imagine, his post and mine were motivated by the pandemic, in particular the fact that traditional methods of prediction appeared to have been caught entirely flat footed, including the Superforecasters. Alexander mentions in his post that “On February 20th, Tetlock’s superforecasters predicted only a 3% chance that there would be 200,000+ coronavirus cases a month later (there were).” So by that metric the superforecasters failed, something both Alexander and I agree on, but I think it goes beyond just missing a single prediction. I think the pandemic illustrates a problem with this entire methodology.

What is that methodology? Well, the goal of the Good Judgment Project and similar efforts is to improve forecasting and predictions, specifically by increasing the proportion of accurate predictions. This is their incentive structure; it’s how they’re graded, and it’s how Alexander grades himself every year. This encourages two secondary behaviors. The first is the one I already mentioned: the easiest way to be correct is to predict that the status quo will continue. This is fine as far as it goes, since the status quo largely does continue, but the flip side is a bias against extreme events. These events are extreme in large part because they’re improbable, so if you want to be correct more often than not, such events are not going to get any attention. Meaning their skill set and their incentive structure are ill suited to extreme events (as evidenced by the mere 3% chance they assigned to the magnitude of the pandemic, as I mentioned above).

The second incentive is to increase the number of their predictions. This might seem unobjectionable. Why wouldn’t we want more data to evaluate them by? The problem is that not all predictions are equally difficult. To give an example (and again it’s not my intention to pick on him; I’m using him as an example more for the things he does right than the things he does wrong): in Alexander’s most recent list of predictions, out of 118, 80 were about things in his personal life, and only 38 were about issues the larger world might be interested in.

Indisputably it’s easier for someone to predict what their weight will be, or whether they will lease the same car when their current lease is up, than it is to predict whether the Dow will end the year above 25,000. Even predicting whether one of his friends will still be in a relationship is probably easier as well. But more than that, the consequences of his personal predictions being incorrect are much less than the consequences of his (or other superforecasters’) predictions about the world as a whole being wrong.

III.

The first problem to emerge from all of this is that Alexander and the Superforecasters rate their accuracy by considering all of their predictions regardless of their importance or difficulty. Thus, if they completely miss the prediction mentioned above about the number of COVID-19 cases on March 20th, but are successful in predicting when British Airways will resume service to Mainland China their success will be judged to be 50%. Even though for nearly everyone the impact of the former event is far greater than the impact of the latter! And it’s worse than that, in reality there are a lot more “British Airways” predictions being made than predictions about the number of cases. Meaning they can be judged as largely successful despite missing nearly all of the really impactful events.
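
Here is a rough sketch of the difference between grading by a simple count and grading by impact, using the two predictions from the paragraph above. The impact weights (100 and 1) are numbers I invented for the sake of illustration, and as far as I can tell nothing like `impact_weighted_accuracy` is part of anyone's actual scoring. That absence is the point.

```python
def accuracy(results):
    """Fraction of predictions judged correct, ignoring how much each one mattered."""
    return sum(1 for r in results if r["correct"]) / len(results)


def impact_weighted_accuracy(results):
    """Share of total impact that falls on the predictions judged correct."""
    total_impact = sum(r["impact"] for r in results)
    correct_impact = sum(r["impact"] for r in results if r["correct"])
    return correct_impact / total_impact


# The two predictions from the text, with made-up impact weights.
results = [
    {"question": "200,000+ COVID-19 cases by March 20", "correct": False, "impact": 100},
    {"question": "British Airways resumes service to mainland China", "correct": True, "impact": 1},
]

plain = accuracy(results)
weighted = impact_weighted_accuracy(results)

print(f"simple count:    {plain:.0%}")
print(f"impact-weighted: {weighted:.1%}")
```

With these assumed weights the same record is 50% by count and about 1% by impact. Adding more “British Airways” questions pushes the first number up while barely moving the second.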

This leads us to the biggest problem of all: the methodology of superforecasting has no system for determining impact. To put it another way, I’m sure that the Good Judgment Project and other people following the Tetlockian methodology have made thousands of forecasts about the world. Let’s be incredibly charitable and assume that out of all these thousands of predictions, 99% were correct. That out of everything they made predictions about, 99% of it came to pass. That sounds fantastic, but depending on what’s in the 1% of the things they didn’t predict, the world could still be a vastly different place than what they expected. And that assumes that their predictions encompass every possibility. In reality there are lots of very impactful things which they might never have considered assigning a probability to. In fact, they could be 100% correct about the stuff they predicted but still be caught entirely flat footed by the future because something happened they never even considered.

As far as I can tell there were no advance predictions of the probability of a pandemic by anyone following the Tetlockian methodology, say in 2019 or earlier. Or any list where “pandemic” was #1 on the “list of things superforecasters think we’re unprepared for”, or really any indication at all that people who listened to superforecasters were more prepared for this than the average individual. But the Good Judgment Project did try their hand at both Brexit and Trump and got both wrong. This is what I mean by the impact of the stuff they were wrong about being greater than the stuff they were correct about. When future historians consider the last five years or even the last 10, I’m not sure what events they will rate as being the most important, but surely those three would have to be in the top 10. They correctly predicted a lot of stuff which didn’t amount to anything and missed predicting the few things that really mattered.

That is the weakness of trying to maximize being correct. While being more right than wrong is certainly desirable, in general the few things the superforecasters end up being wrong about are far more consequential than all the things they’re right about. Also, I suspect this feeds into the classic cognitive bias, where it’s easy to ascribe everything they correctly predicted to skill while every time they were wrong gets put down to bad luck. Which is precisely what happens when something bad occurs.

Both now and during the financial crisis, when experts are asked why they didn’t see it coming or why they weren’t better prepared, they are prone to retort that these events are “black swans”. “Who could have known they would happen?” And as such, “There was nothing that could have been done!” This is the ridiculousness of superforecasting. Of course pandemics and financial crises are going to happen; any review of history would reveal that few things are more certain.

Nassim Nicholas Taleb, who came up with the term, has come to hate it for exactly this reason: people use it to excuse a lack of preparedness and inaction in general, when the concept is both more subtle and more useful. These people who throw up their hands and say “It was a black swan!” are making an essentially Tetlockian claim: “Mostly we can predict the future, except on a few rare occasions where we can’t, and those are impossible to do anything about.” The point of Taleb’s black swan theory, and to a greater extent his idea of being antifragile, is to point out that you can’t predict the future at all, and when you convince yourself that you can, it distracts you from hedging/lessening your exposure to/preparing for the really impactful events which are definitely coming.

From a historical perspective financial crashes and pandemics have happened a lot. Businesses and governments really had no excuse for not making some preparation for the possibility that one or the other, or as we’re discovering, both, would happen. And yet they didn’t. I’m not claiming that this is entirely the fault of superforecasting. But superforecasting is part of the larger movement of convincing ourselves that we have tamed randomness, and banished the unexpected. And if there’s one lesson from the pandemic greater than all others it should be that we have not.

Superforecasting and the blindness to randomness are also closely related to the drive for efficiency I mentioned recently. “There are people out there spouting extreme predictions of things which largely aren’t going to happen! People spend time worrying about these things when they could be spending that time bringing to pass the neoliberal utopia foretold by Steven Pinker!” Okay, I’m guessing that no one said that exact thing, but boiled down this is their essential message.

I recognize that I’ve been pretty harsh here, and I also recognize that it might be possible to have the best of both worlds: to get the antifragility of Taleb with the rigor of Tetlock. Indeed, in Alexander’s recent post, that is basically what he suggests: that rather than take superforecasting predictions as some sort of gold standard, we should use them to do “cost benefit analysis and reason under uncertainty.” That, as the title of his post suggests, this was not a failure of prediction, but a failure of being prepared, suggesting that predicting the future can be different from preparing for the future. And I suppose they can be. The problem with this is that people are idiots, and they won’t disentangle these two ideas. For the vast majority of people and corporations and governments, predicting the future and preparing for the future are the same thing. And when combined with a reward structure which emphasizes efficiency/fragility, the only thing they’re going to pay attention to is the rosy predictions of continued growth, not preparing for dire catastrophes which are surely coming.

To reiterate, superforecasting, by focusing on the number of correct predictions, without considering the greater impact of the predictions they get wrong, only that such missed predictions be few in number, has disentangled prediction from preparedness. What’s interesting is that while I understand the many issues with the system they’re trying to replace, of bloviating pundits making predictions which mostly didn’t come true, that system did not suffer from this same problem.

IV.

In the leadup to the pandemic there were many people predicting that it could end up being a huge catastrophe (including Taleb, who said it to my face) and that we should take draconian precautions. These were generally the same people who issued the same warnings about all previous new diseases, most of which ended up fizzling out before causing significant harm, for example Ebola. Most people are now saying we should have listened to them, at least with respect to COVID-19, but these are also generally the same people who dismissed previous worries as pessimism, panic, or straight up craziness. It’s easy to see now that they were not, and this illustrates a very important point. Because of the nature of black swans and negative events, if you’re prepared for a black swan it only has to happen once for your caution to be worth it, but if you’re not prepared then in order for that to be a wise decision it has to NEVER happen.
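
The asymmetry in that last sentence is easy to put into arithmetic. What follows is a toy model with numbers I made up (a yearly preparation cost of 1, a loss of 100 if the event hits someone prepared, a loss of 1,000 if it hits someone unprepared, over 30 years); none of it comes from real data, and the function name is just for this sketch.

```python
def cumulative_costs(years, event_year, annual_prep_cost, loss_if_prepared, loss_if_unprepared):
    """Total cost over `years` for a prepared party and an unprepared one.

    `event_year` is the year the rare event strikes, or None if it never does.
    """
    prepared_total = 0
    unprepared_total = 0
    for year in range(1, years + 1):
        prepared_total += annual_prep_cost  # caution is paid for every single year
        if year == event_year:
            prepared_total += loss_if_prepared
            unprepared_total += loss_if_unprepared
    return prepared_total, unprepared_total


YEARS = 30  # all numbers below are hypothetical

# If the event NEVER happens, the unprepared party comes out ahead.
never = cumulative_costs(YEARS, None, 1, 100, 1000)

# If it happens even once, in the final year, the picture flips.
once = cumulative_costs(YEARS, 30, 1, 100, 1000)

# Meanwhile, a "nothing will happen this year" forecast was right 29 times out of 30.
status_quo_hit_rate = (YEARS - 1) / YEARS

print(f"event never happens: prepared={never[0]}, unprepared={never[1]}")
print(f"event happens once:  prepared={once[0]}, unprepared={once[1]}")
print(f"status quo forecaster's hit rate: {status_quo_hit_rate:.1%}")
```

Under these assumptions the status quo forecaster is right in 29 of 30 years and still ends up far worse off than the person who was “wrong” for 29 years straight. How large the gap is depends entirely on the numbers you plug in, which is exactly the part an accuracy score never asks about.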

The financial crash of 2007-2008 represents an interesting example of this phenomenon. An enormous number of financial models were based on the premise that the US had never had a nationwide decline in housing prices. And it was a true and accurate position for decades, but the one year it wasn’t true made the dozens of years when it was true almost entirely inconsequential.

To take a more extreme example imagine that I’m one of these crazy people you’re always hearing about. I’m so crazy I don’t even get invited on TV. Because all I can talk about is the imminent nuclear war. As a consequence of these beliefs I’ve moved to a remote place and built a fallout shelter and stocked it with a bunch of food. Every year I confidently predict a nuclear war and every year people point me out as someone who makes outlandish predictions to get attention, because year after year I’m wrong. Until one year, I’m not. Just like with the financial crisis, it doesn’t matter how many times I was the crazy guy with a bunker in Wyoming, and everyone else was the sane defender of the status quo, because from the perspective of consequences they got all the consequences of being wrong despite years and years of being right, and I got all the benefits of being right despite years and years of being wrong.

The “crazy” people who freaked out about all the previous potential pandemics are in much the same camp. Assuming they actually took their own predictions seriously and were prepared, they got all the benefits of being right this one time despite many years of being wrong, and we got all the consequences of being wrong, in spite of years and years of not only forecasts, but SUPER forecasts telling us there was no need to worry.

I’m predicting, with 90% confidence, that you will not find this closing message to be clever. This is an easy prediction to make because once again I’m just using the methodology of predicting that the status quo will continue. Predicting that you’ll donate is the high impact rare event, and I hope that even if I’ve been wrong every other time, this time I’m right.

We Are Not Saved
