The “fundamentals” models, in fact, have had almost no predictive power at all. Over this 16-year period, there has been no relationship between the vote they forecast for the incumbent candidate and how well he actually did -- even though some of them claimed to explain as much as 90 percent of voting results.

It's an interesting post, and I certainly agree that calling people on their predictions is a useful thing to do. I started taking notes...and then John Sides beat me to the response. Which is just as well; John knows this part of the world quite a bit better than I do, anyway, and I agree with almost everything he said. So rather than composing a proper post, I'm just going to splatter my notes out here, starting with three general points, all of which John covered better than I will:
1. This can't be said enough times, and I wish Silver had said it up front. Election prediction models are a tiny sliver of what political scientists do. To the extent that they're looked on as anything but a parlor game, it's as a guide to explanation, not prediction.
2. And with that: it's not my field, and I don't keep up with it nearly as closely as I perhaps should, but really political scientists know quite a lot about voter behavior and elections. Almost all of that has nothing to do -- nor, really, should it -- with making the best election prediction models.
3. What's more, the near-consensus among political scientists is pretty simple: the economy plays a major role in elections, but campaign and candidate level effects can also be real.
4. On prediction systems. There are two reasons prediction systems can fail: because the thing they're trying to predict is not predictable, or because they're not very good models. The first would be true if, for example, candidate and campaign effects were very large -- but it also would be true if perfectly predictable effects of economic performance depended on data that were not available until after the event (or even before the event but after the prediction).
5. If the problem was that the models stink, then what we might find are some predictors that do much better than others, even if it just means they stink less.
6. That appears to be the case. Three of the predictors -- Abramowitz, Wlezien & Erikson, and Hibbs -- do quite well. Their average error (not RMSE, just the simple average of the errors Silver reports) for the two-party margin of difference is 3.3, 3.7, and 4.6, respectively. Out of fifteen predictions among them, only a couple are stinkers -- Hibbs on Gore/Bush missed by 8.7 points, and W & E missed that one by 9.5. And they get the winner right every time.
7. Then again, Hibbs is using two variables and no polling. With that, you can get an average miss of under five points? I'll take it!
8. Major warning: it's certainly possible that those three have just been lucky. However, what's reassuring is that they are consistently among the best. So that even in 2000, when none of them do particularly well, they rank 3rd, 4th, and 6th out of 9 predictions.
9. This makes me strongly suspect -- but doesn't prove -- that what we have are good and bad predictors, not an overall failure of prediction-systems-in-general, or something that is impossible to predict. Again, it doesn't prove it!
10. However, treating these three systems as similar to, say, the Lockerbie predictor -- which has missed by 19.3, 12.6, and 8.9 points in its three trials -- doesn't make a whole lot of sense to me. If someone publishes poorly constructed Senate election projections that perform poorly, does it make Nate Silver's Senate projections worse? Which suggests it's not good enough to just look at and average all predictions; you need to look at and critique which ones are well constructed and consistent with what we know generally about elections.
11. Not that I'm doing that in this post!
12. The one disagreement I have with John's response is that he gives the models a pass because they almost invariably have at least picked the right winner, and after all that's what we really care about. I'm not sure that's right. For one thing, it depends a lot on what the point is of doing the prediction. If it's to test what we think we know by projecting out into the future, then our demands are very different than if the reason is to satisfy our curiosity. Both of those are legitimate things to do, but they imply different standards to use in evaluating a predictor.
13. Silver makes much of the distinction between pure fundamentals predictors which include only non-campaign indicators, and those which incorporate polling information. That's reasonable, but I believe (and I haven't looked at all of them) that there's a wide range in how these predictors use polling. Generally, if someone uses horse race numbers from September (and I don't know that any of the models Silver uses do that) then it's not going to be nearly as useful as one that looks at presidential approval many months out.
14. Basically, what we want for an explanatory-type predictor, it seems to me, is something that excludes the influences of the campaign and of non-incumbent candidates. A pre-campaign presidential approval number essentially incorporates the effects of whatever events have happened before the campaign along with any residual popularity of the incumbent. I can see both advantages and disadvantages compared to either ad-hoc dummy variables (for, say, incumbent party while a city was flooded) or ignoring all the events that can't be systematically accounted for (as the economy or wars can).
15. By the way: if the question is whether the models in general work, then I think Silver makes the wrong choice by including multiple versions by the same author(s). If the models were updates, then only the last one should count; if they were released together...I'd probably just dismiss those altogether. Silver is of course correct that anyone who releases multiple versions and then only touts the winners is misbehaving. But that doesn't really speak to the question about whether the models as a whole are capturing anything.
16. That said, as I eyeball it, I'm not seeing that it matters much; again, just from a very quick glance, it doesn't appear that the highest-number version does (much? any?) better.
17. Silver mentions the issue of revised data. This is, again, a fairly big deal, although how it affects any of these predictions I have no idea. My general memory of these things is that there were significant post-election revisions in three of the five cycles. In 1992 the revisions were improvements, which would have made Bush a more likely winner and, again just eyeballing, would have tended to hurt the models' performance. In 2000 the revisions were downward, which would have helped W. and made most of the models' performance better; in 2008, the revisions were again downward, which would have helped Obama, helping some predictors and hurting others. Caveat: that's again based on both my memory and on eyeballing -- and of course different predictors use different numbers, so the revisions could easily affect different models differently.
18. Of course, if the purpose is to successfully predict the future, then it's a fairly big deal if it turns out that the economic data just aren't good enough quickly enough to be able to do so. If it's explanatory, then it's no big deal at all to go back and plug in the right numbers after the fact.
19. I'd be interested to see if the separation between "good" and "bad" models I noted above holds up if final economic numbers are plugged in. Not interested enough to do the work myself.
20. Silver makes much of the spread between the models -- not only are their errors large, but they don't vary together. That's a good point -- but again, if some of these models are better than others, then it's not all that interesting to know that the "bad" predictors are all over the place. The three "good" models are relatively tightly bunched in all five cycles, with no more than a six-point spread.
21. I think I'm repeating this for the third time, but it's important: I don't know that the best-performing predictors are actually good, or that the worst-performing are bad. Could be luck.
22. And last point. I can't find it, but if I recall correctly someone (Nyhan?[See update below]) showed that a weighted average of the predictors does an excellent job. Silver says the predictors are still mostly useless averaged; is that true with a weighted average, in which the "bad" models would count for much less than the "good" ones?
UPDATE: Make that this post, reporting on this article.
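
For anyone who wants to play with the arithmetic behind points 6, 10, and 22, here is a minimal Python sketch. It shows the difference between a simple average miss and RMSE, using the three Lockerbie misses quoted above, and then one possible way to build a weighted average in which "bad" predictors count for less. The inverse-error weighting is my own assumption for illustration, not the method from the post or article mentioned in the update, and the forecast values at the bottom are made up. Only the misses and average errors come from the numbers above.

```python
import math


def mean_abs_error(misses):
    """Simple average of the absolute misses (the measure used in point 6)."""
    return sum(abs(m) for m in misses) / len(misses)


def rmse(misses):
    """Root mean squared error; punishes big misses more than the simple average."""
    return math.sqrt(sum(m * m for m in misses) / len(misses))


def inverse_error_weights(avg_errors):
    """Weight each model by 1 / (average error), normalized to sum to 1.

    ASSUMPTION: this weighting scheme is only one plausible choice,
    picked here for illustration.
    """
    inverse = {name: 1.0 / err for name, err in avg_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_forecast(forecasts, weights):
    """Weighted average of the models' forecast margins."""
    return sum(forecasts[name] * weights[name] for name in forecasts)


# Misses quoted in point 10 (in points of two-party margin).
lockerbie_misses = [19.3, 12.6, 8.9]

# Average errors quoted in point 6, plus Lockerbie's computed from its misses.
avg_errors = {
    "Abramowitz": 3.3,
    "Wlezien & Erikson": 3.7,
    "Hibbs": 4.6,
    "Lockerbie": mean_abs_error(lockerbie_misses),
}
weights = inverse_error_weights(avg_errors)

# HYPOTHETICAL forecast margins for some future election; not real data.
hypothetical_forecasts = {
    "Abramowitz": 4.0,
    "Wlezien & Erikson": 5.0,
    "Hibbs": 3.0,
    "Lockerbie": 15.0,
}
simple_average = sum(hypothetical_forecasts.values()) / len(hypothetical_forecasts)
blended = weighted_forecast(hypothetical_forecasts, weights)

print(f"Lockerbie simple average miss: {mean_abs_error(lockerbie_misses):.1f}")
print(f"Lockerbie RMSE: {rmse(lockerbie_misses):.1f}")
print(f"Unweighted average forecast: {simple_average:.2f}")
print(f"Inverse-error weighted forecast: {blended:.2f}")
```

One big caveat, in the spirit of points 8 and 21: weights computed from the same five elections you then evaluate on will flatter the "good" models, so a fair test would set the weights from earlier cycles only.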