Reviewing the FiveThirtyEight 2018 Election Forecasts

Nate Silver’s FiveThirtyEight published forecasts of the 2018 mid-term elections, predicting the outcome of races for House, Senate, and Governor. The site also made it easy to download their projection data, and so I’d like to look at their forecasts and see how their models did. Silver made three variants of his model, dubbed “Lite”, “Classic”, and “Deluxe”. Lite basically uses only polling (and, where polling is scarce or non-existent, comparisons to similar districts which have polling). Classic adds in other fundamental data, like candidate fundraising and historical voting patterns. Finally, Deluxe factors in expert ratings from the Cook Political Report, Inside Elections, and University of Virginia political scientist Larry Sabato’s Crystal Ball. Silver’s expectation is that while all three models should be good, adding additional complexity to the models should improve their accuracy.
The top line results are quite good, as all three models came very close to the number of seats won in each case:

Office   Party Actual    Lite  Deluxe Classic
Governor R         20    18.3    18.8    19.0
Governor D         16    17.7    17.2    17.0
House    R        203   202.1   203.6   200.6
House    D        232   232.8   231.4   234.3
Senate   R         11     9.7     9.5     9.5
Senate   D         22    23.3    23.5    23.5
Senate   I          2     2.0     2.0     2.0

This table allocates the 9 uncalled races to the current vote total leader, and so counts the Mississippi Senate special election, which is headed for a run-off, as a Republican seat. To compute the expected number of seats for a party, I summed the win probabilities the model gave that party’s candidates across all races for that office.
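As a rough illustration of that sum (a sketch, not the actual analysis code), here is how the expected-seat calculation might look in Python. The row layout, the sample probabilities, and the name `expected_seats` are all hypothetical and are not taken from the downloaded data.

```python
from collections import defaultdict

# Hypothetical rows for one model: (office, party, win_probability).
# These numbers are made up for illustration only.
forecasts = [
    ("Senate", "R", 0.80),
    ("Senate", "D", 0.20),
    ("Senate", "R", 0.35),
    ("Senate", "D", 0.65),
]

def expected_seats(rows):
    """Sum each party's win probabilities within each office."""
    totals = defaultdict(float)
    for office, party, win_prob in rows:
        totals[(office, party)] += win_prob
    return dict(totals)

seats = expected_seats(forecasts)
for (office, party), value in sorted(seats.items()):
    print(f"{office} {party} {value:.1f}")
```

Because each race’s probabilities sum to 1, the expected seats across parties add up to the number of races, which is a handy sanity check on the input.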
The models gave probabilistic forecasts for each race, showing not only a chance of each candidate winning, but also expected vote share, as well as 10th and 90th percentile vote shares, to give a sense of the possible range of outcomes. So next I thought I’d check how well the model did at setting these percentiles:

Model  Total    < 10%      < 50%     > 90%
Lite    1233  100 8.1%  639 51.8%  102 8.3%
Classic 1233   77 6.2%  634 51.4%   74 6.0%
Deluxe  1233   84 6.8%  641 52.0%   81 6.6%

So for the 1233 candidates for whom they made projections, just over half in each model had a vote share under their projection, about what I’d expect. Interestingly, the results contained fewer surprises than the models expected. If the percentiles were well calibrated, about 10% of candidates should land below the 10th percentile and another 10% above the 90th. For Lite, just over 8% landed in each tail, and for Deluxe and Classic each tail held fewer than 7%. So the models were projecting more uncertainty than we actually saw this year.
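A check like this can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the tuple layout `(actual, p10, projected, p90)` and the sample vote shares are hypothetical, and the real data files may be organized differently.

```python
def percentile_coverage(rows):
    """Count results below the 10th percentile, below the projection,
    and above the 90th percentile.

    rows: iterable of (actual, p10, projected, p90) vote shares.
    """
    counts = {"total": 0, "below_10": 0, "below_projection": 0, "above_90": 0}
    for actual, p10, projected, p90 in rows:
        counts["total"] += 1
        if actual < p10:
            counts["below_10"] += 1
        if actual < projected:
            counts["below_projection"] += 1
        if actual > p90:
            counts["above_90"] += 1
    return counts

# Made-up candidates, for illustration only.
sample = [
    (48.0, 45.0, 50.0, 55.0),
    (61.0, 50.0, 55.0, 60.0),
    (39.0, 40.0, 45.0, 50.0),
    (52.0, 45.0, 50.0, 55.0),
]
coverage = percentile_coverage(sample)
print(coverage)
```

For a well calibrated model, `below_10` and `above_90` should each come out near 10% of `total`, and `below_projection` near 50%.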
Another way to look at this is to count how often the model’s favorite did not win, and compare that with the model’s expected number of upsets. Expected upsets is the sum, over all races, of the probability that someone other than the favorite wins. If the model were perfectly calibrated, and results normally distributed around expectations, we should see actual upsets roughly match expected upsets.

Model     Upsets   Expected Total
Lite     25  4.9% 40.5  8.0% 506
Deluxe   17  3.4% 29.9  5.9% 506
Classic  20  4.0% 34.6  6.8% 506 
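
To make the upset arithmetic concrete, here is a minimal Python sketch. The data layout (a dict of win probabilities per race plus the actual winner) and the sample races are assumptions for illustration, not the format of the downloaded forecasts.

```python
def upset_summary(races):
    """Return (actual_upsets, expected_upsets).

    races: list of (win_probs, winner), where win_probs maps each
    candidate to the model's probability that they win.
    """
    actual = 0
    expected = 0.0
    for win_probs, winner in races:
        favorite = max(win_probs, key=win_probs.get)
        # Chance that anyone other than the favorite wins this race.
        expected += 1.0 - win_probs[favorite]
        if winner != favorite:
            actual += 1
    return actual, expected

# Made-up races, for illustration only.
races = [
    ({"A": 0.90, "B": 0.10}, "A"),
    ({"A": 0.55, "B": 0.45}, "B"),
    ({"A": 0.30, "B": 0.70}, "B"),
]
actual_upsets, expected_upsets = upset_summary(races)
print(actual_upsets, round(expected_upsets, 2))
```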

All three models predicted more upsets than we actually saw, which is consistent with the models being underconfident. As Silver expected, the frequency of both projected and actual upsets decreases as model complexity increases. In simpler English, adding more inputs to a model made it better at picking winners: Lite had the most upsets, Deluxe the fewest, and Classic was in the middle.
I also broke down the above table into the four categories the site was showing on election night: Toss Up (no candidate has a 60% or better chance to win), Lean (favorite is 60-75% likely), Likely (favorite is 75-95% to win), and Solid (favorite expected to win more than 95%):

                 Toss Up                      Lean
Model     Upsets   Expected Total   Upsets   Expected Total
Lite     16 53.3% 13.6 45.3%  30    5 13.2% 11.9 31.4%  38
Deluxe    6 40.0%  6.8 45.0%  15    7 21.2% 11.1 33.6%  33
Classic   9 36.0% 11.1 44.4%  25    7 23.3%  9.8 32.6%  30
                  Likely                     Solid
Model     Upsets   Expected Total   Upsets   Expected Total
Lite      4  4.1% 12.7 12.9%  98    0  0.0%  2.3  0.7% 340
Deluxe    4  5.0% 10.5 13.2%  80    0  0.0%  1.5  0.4% 378
Classic   4  4.7% 12.0 14.1%  85    0  0.0%  1.7  0.5% 366 
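
The bucketing can be sketched in Python as follows. The thresholds come from the category definitions above, but how the exact boundaries (a favorite at precisely 75% or 95%) are assigned is my assumption, and the function names and sample races are hypothetical.

```python
CATEGORIES = ("Toss Up", "Lean", "Likely", "Solid")

def rating(favorite_prob):
    """Map the favorite's win probability to a rating category.

    Assumption: exactly 0.75 counts as Likely and exactly 0.95 as Likely.
    """
    if favorite_prob < 0.60:
        return "Toss Up"
    if favorite_prob < 0.75:
        return "Lean"
    if favorite_prob <= 0.95:
        return "Likely"
    return "Solid"

def upsets_by_rating(races):
    """races: list of (favorite_prob, favorite_won) pairs."""
    table = {name: {"upsets": 0, "expected": 0.0, "total": 0}
             for name in CATEGORIES}
    for favorite_prob, favorite_won in races:
        row = table[rating(favorite_prob)]
        if not favorite_won:
            row["upsets"] += 1
        row["expected"] += 1.0 - favorite_prob
        row["total"] += 1
    return table

# Made-up races, for illustration only.
sample_races = [(0.55, False), (0.58, True), (0.70, True), (0.90, True), (0.99, True)]
by_rating = upsets_by_rating(sample_races)
for name in CATEGORIES:
    print(name, by_rating[name])
```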

None of the solid favorites lost in any of the models, although with well over 300 races, the probabilities given would have suggested 1-2 longshot upsets. Toss Up races were good, with the favorite expected to lose about 45% of the time, and, albeit in small sample sizes, they lost roughly that often: more often under Lite (53%) and somewhat less often under Deluxe (40%) and Classic (36%). It’s the Lean and especially Likely categories where we see the big gap in upset races. Rather than winning about 1 in 7 or 1 in 8 Likely races, the underdog won only about 1 in 20, only a third as often. For Lean races we’d expect about 1 in 3 to be an upset, but the favorites won 3 in 4 or better.
So in terms of both extremity of vote share and frequency of upsets, I found that the models predicted much more uncertainty than election results showed.
Does that mean they were poorly calibrated? You can’t really tell from a single election. Often polling in retrospect underestimates one party or the other across the board (which one it favors is basically a coin flip), and in such environments you’d see more upsets. Before the election Silver talked about the downside risk for Republicans being much larger than for Democrats – that is, more GOP-held seats were Likely or Lean, and so if results proved more Blue, we would have seen many more seat gains for the Democrats, but if results were more Red, Democrats were still likely to gain House seats, just not so many.
I tweaked my analysis code to allow a parallel shift in vote share: that is, I take the actual results and then move, say, 2 points from the Democrat to the Republican in each race. That is an effective 4 point swing in the margin towards Republicans. Here’s what would have happened in that scenario:
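For readers who want to try something similar, a parallel shift might be sketched like this in Python. This is not my actual analysis code: the function names, the dict layout, and the choice to leave races without both a Democrat and a Republican unchanged are assumptions made for the example.

```python
def shift_results(vote_shares, points, toward="R", away="D"):
    """Move `points` of vote share from one party to the other in one race.

    vote_shares: dict mapping party -> vote share in percentage points.
    Assumption: races lacking either party are returned unchanged.
    """
    shifted = dict(vote_shares)
    if toward in shifted and away in shifted:
        moved = min(points, shifted[away])  # never push a share below zero
        shifted[away] -= moved
        shifted[toward] += moved
    return shifted

def winner(vote_shares):
    """Party with the largest vote share."""
    return max(vote_shares, key=vote_shares.get)

# Made-up races, for illustration only.
close_race = {"D": 51.0, "R": 49.0}
safe_race = {"D": 55.0, "R": 45.0}

shifted_close = shift_results(close_race, 2.0)
shifted_safe = shift_results(safe_race, 2.0)
print(winner(close_race), "->", winner(shifted_close))
print(winner(safe_race), "->", winner(shifted_safe))
```

Moving 2 points from one candidate to the other changes the margin by 4, so the made-up 51-49 race flips while the 55-45 race does not. Passing `toward="D", away="R"` gives the shift in the opposite direction.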

Model     Upsets   Expected Total
Lite     29  5.7% 40.5  8.0% 506
Deluxe   31  6.1% 29.9  5.9% 506
Classic  30  5.9% 34.6  6.8% 506

Office   Party Actual    Lite  Deluxe Classic
Governor R         22    18.3    18.8    19.0
Governor D         14    17.7    17.2    17.0
House    R        220   202.1   203.6   200.6
House    D        215   232.8   231.4   234.3
Senate   R         14     9.7     9.5     9.5
Senate   D         19    23.3    23.5    23.5
Senate   I          2     2.0     2.0     2.0

Model  Total    < 10%      < 50%     > 90%
Lite    1233  157 12.7%  644 52.2%  153 12.4%
Deluxe  1233  144 11.7%  643 52.1%  136 11.0%
Classic 1233  139 11.3%  643 52.1%  142 11.5%

Now the Deluxe model almost exactly nails the number of upsets, while Classic is close, and only Lite is markedly low. In this more red environment, the GOP narrowly holds the House, with 220 seats, and it wins 14 Senate seats, for a 4 seat gain there. For all three models, the number of candidates with vote shares outside the 80% range is a little more than you’d expect in each direction.
If the error were in the opposite direction, and we saw Democrats do 2 points better across the board and Republicans 2 points worse, this is what I get:

Model     Upsets   Expected Total
Lite     33  6.5% 40.5  8.0% 506
Deluxe   27  5.3% 29.9  5.9% 506
Classic  32  6.3% 34.6  6.8% 506

Office   Party Actual    Lite  Deluxe Classic
Governor R         16    18.3    18.8    19.0
Governor D         20    17.7    17.2    17.0
House    R        184   202.1   203.6   200.6
House    D        251   232.8   231.4   234.3
Senate   R          8     9.7     9.5     9.5
Senate   D         25    23.3    23.5    23.5
Senate   I          2     2.0     2.0     2.0

Model  Total    < 10%      < 50%     > 90%
Lite    1233   86 7.0%  643 52.1%   89 7.2%
Deluxe  1233   80 6.5%  651 52.8%   89 7.2%
Classic 1233   75 6.1%  648 52.6%   76 6.2%

This environment also produces more upsets, although each of the models still expected a few more upsets than occurred in this scenario.
Now instead of Republicans narrowly holding the House, Democrats flip more than 50 House seats, and also barely take control of the Senate by 1 seat. Interestingly, I still see fewer than 10% of candidates getting above the 90th percentile or below the 10th percentile expected vote share.
These scenarios also show how seat total projections would be impacted by a systematic error. Instead of virtually nailing the number of House seats, the models would be off by 15-20 or so, and they would do worse at calling Governor and Senate races.
This year was quite good overall for polling accuracy, and so the models overstated the uncertainty for this year’s election. Even in a different environment, the models only get to about the number of upsets predicted. In order to start seeing more upsets than the models predicted, I need to shift each race at least 3-4 points towards one party or the other, and that would be a *very* large polling miss. So while the models did quite well in projecting the actual outcome, I think they were likely overstating uncertainty.