Comparing Replacement Level for Starters and Relievers

A discussion at Tom Tango’s blog raised an interesting question: what’s the difference in effectiveness between a replacement-level starter and a replacement-level reliever?
To answer this, you need some way of estimating replacement level separately for starters and for relievers. Since I have a database with boxscore data since 2010, I set out to do exactly that. First, I compiled player statistics by role, either starter or reliever (the same player might appear in both roles), and then I sorted players within each role by some effectiveness statistic (runs per 9 innings, ERA, FIP, whatever). I bucketed players into groups (the top 10%, the next 10%, and so on) and took the average effectiveness statistic for each group.
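I figured the bottom 10% should be a good proxy for replacement level: if you’re performing in the bottom 10%, you’re replaceable. Sounds good. Here’s a table from 2013 data using FIP. Bucket 0 is the best 10% and bucket 9 the worst, the # columns count pitchers, and Diff is the starters’ FIP minus the relievers’ FIP: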

Bucket  # Starters  Start FIP  # Relievers  Relief FIP    Diff  Start IP  Relief IP
     0          30      2.688           53       1.705   0.983   2928.34     919.67
     1          30      3.313           52       2.506   0.807   4933.33    1967.33
     2          29      3.584           53       2.962   0.622   4331.67    1959.68
     3          30      3.890           52       3.251   0.639   3424.66    2416.67
     4          29      4.090           53       3.589   0.501   3641.68    2464.00
     5          30      4.359           52       3.924   0.435   3153.33    1644.34
     6          30      4.629           53       4.280   0.349   2924.67    1473.32
     7          29      4.999           52       4.973   0.026   1944.33    1245.66
     8          30      5.691           53       6.221  -0.531   1058.33     659.00
     9          29      7.635           52       9.597  -1.963    361.00     233.33
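
The bucketing described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce these tables: the `PitcherLine` record and the `bucket_summary` function are hypothetical names, and the IP-weighted average inside each bucket is an assumption, since the post only says the statistic was averaged per group.

```python
from dataclasses import dataclass


@dataclass
class PitcherLine:
    """One pitcher's season totals in a single role (hypothetical record)."""
    name: str
    role: str    # "SP" or "RP"
    ip: float    # innings pitched in this role
    rate: float  # rate stat such as FIP or RA9; lower is better


def bucket_summary(lines, n_buckets=10):
    """Sort best to worst by rate stat and split into near-equal buckets.

    Returns one dict per bucket with the pitcher count, the IP-weighted
    mean rate (an assumption; a simple mean is equally plausible) and
    the total innings in the bucket.
    """
    ranked = sorted(lines, key=lambda p: p.rate)
    n = len(ranked)
    summary = []
    for b in range(n_buckets):
        # Integer arithmetic spreads any remainder across the buckets.
        group = ranked[b * n // n_buckets:(b + 1) * n // n_buckets]
        total_ip = sum(p.ip for p in group)
        if total_ip > 0:
            mean_rate = sum(p.rate * p.ip for p in group) / total_ip
        else:
            mean_rate = float("nan")
        summary.append(
            {"bucket": b, "count": len(group), "rate": mean_rate, "ip": total_ip}
        )
    return summary
```

Running it once on the starter lines and once on the reliever lines, then subtracting the two `rate` values bucket by bucket, gives a Diff column of the kind shown in the table.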

Small sample size is immediately a big problem here: the 52 relievers in the worst bucket average under 5 IP each, and the 29 starters average only about 12 IP. I tinkered for a while with arbitrary innings minimums, say 100 IP as a starter and 30 IP as a reliever, ignoring pitchers who didn’t reach those totals. That gave more stable-looking numbers, but the starter-reliever difference depended heavily on where I drew the line, and that was a purely arbitrary choice.
When I model fantasy sports pricing, I use the roster sizes for determining replacement level for a given league, so I thought I’d try that here. Each of the 30 teams typically uses 5 starters and carries a bullpen of at least 6 relievers. So I changed my filter to look at only the top 150 starters and 180 relievers by innings pitched. That led to this table:

Bucket  # Starters  Start FIP  # Relievers  Relief FIP    Diff  Start IP  Relief IP
     0          15      2.714           18       1.985   0.729   2716.67    1092.00
     1          15      3.260           18       2.632   0.628   2846.00    1045.00
     2          15      3.452           18       2.884   0.568   2830.00    1124.67
     3          15      3.596           18       3.087   0.509   2425.00    1121.67
     4          15      3.833           18       3.251   0.582   2515.99    1181.00
     5          15      4.010           18       3.437   0.573   2454.00    1057.00
     6          15      4.136           18       3.612   0.524   2181.01    1025.67
     7          15      4.354           18       3.808   0.546   2185.33    1088.67
     8          15      4.555           18       4.133   0.422   2194.00     911.66
     9          15      5.010           18       4.871   0.139   1954.00     888.66
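
The roster-based filter amounts to keeping the N pitchers with the most innings in each role, where N is the number of teams times the roster slots for that role. A minimal sketch, assuming each pitcher is a `(name, ip, rate)` tuple; the function name `top_n_by_ip` and the constant names are made up for illustration:

```python
TEAMS = 30
STARTERS_PER_TEAM = 5   # rotation size used in the post
RELIEVERS_PER_TEAM = 6  # minimum bullpen size used in the post

STARTER_POOL = TEAMS * STARTERS_PER_TEAM    # 150
RELIEVER_POOL = TEAMS * RELIEVERS_PER_TEAM  # 180


def top_n_by_ip(pitchers, n):
    """Keep the n pitchers with the most innings in a role.

    pitchers is a list of (name, ip, rate) tuples. Returns the kept
    pitchers and the implied innings cutoff, i.e. the IP of the last
    pitcher to make the pool (None if the pool is empty).
    """
    ranked = sorted(pitchers, key=lambda p: p[1], reverse=True)
    kept = ranked[:n]
    cutoff = kept[-1][1] if kept else None
    return kept, cutoff
```

The innings cutoff is an output of this filter rather than an input, so it shifts a little from season to season.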

That looks much better. The resulting innings cutoffs were 79 IP for starters and 36.67 IP for relievers.
Here’s the same algorithm, just computing/bucketing by RA9 instead of FIP:

Bucket  # Starters  Start RA9  # Relievers  Relief RA9    Diff  Start IP  Relief IP
     0          15      2.718           18       1.684   1.035   2579.00    1122.34
     1          15      3.292           18       2.374   0.918   2706.67    1099.34
     2          15      3.530           18       2.732   0.798   2605.34    1060.66
     3          15      3.737           18       3.024   0.713   2620.34    1226.34
     4          15      3.936           18       3.322   0.614   2641.33    1159.66
     5          15      4.187           18       3.625   0.562   2530.00    1003.00
     6          15      4.419           18       3.976   0.443   2220.00    1025.33
     7          15      4.806           18       4.315   0.491   2236.00    1038.67
     8          15      5.296           18       4.640   0.656   2200.67     975.67
     9          15      6.103           18       5.509   0.594   1962.66     825.01
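
For reference, the two rate stats used in these tables can be computed from counting stats roughly like this. It is a sketch using the standard textbook definitions; the exact FIP variant behind the tables isn't stated, and the default `constant` below is only a placeholder assumption, since the FIP constant is recalculated for each league season.

```python
def innings_from_outs(outs):
    """Convert recorded outs to innings as a decimal (110 outs -> 36.67)."""
    return outs / 3.0


def ra9(runs, ip):
    """Runs allowed (earned or not) per nine innings."""
    return 9.0 * runs / ip


def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching, common form.

    constant is a placeholder; the real value is set each season so
    that league FIP matches league ERA.
    """
    return (13.0 * hr + 3.0 * (bb + hbp) - 2.0 * k) / ip + constant
```

Decimal innings like the 36.67 IP reliever cutoff are thirds of an inning written out, here 110 outs.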

In the discussion on Tango’s site I had speculated that the difference between elite relievers and elite starters was likely greater than the overall average, and these tables support that view: the top bucket showed the widest gap in both FIP (0.729) and RA9 (1.035).
The top buckets aren’t relevant to the question of replacement level, but it is good to see that in general teams give more innings to pitchers who are performing better. That the best bucket doesn’t have the highest IP is also expected, as being the very best in performance rate requires some good luck as well as skill, and it’s easier to post very good rate stats in smaller slices of playing time.
Next I’ll track just the bottom bucket over the past four seasons, first using FIP:

Year  # Starters  Start FIP  # Relievers  Relief FIP    Diff  Start IP  Relief IP
2010          15      5.375           18       5.392  -0.016   1981.67     906.33
2011          15      5.095           18       5.131  -0.036   2113.33     850.67
2012          15      5.411           18       5.031   0.380   1858.67     893.33
2013          15      5.010           18       4.871   0.139   1954.00     888.66

And the same years using RA9:

Year  # Starters  Start RA9  # Relievers  Relief RA9    Diff  Start IP  Relief IP
2010          15      6.195           18       6.279  -0.084   2071.67     926.00
2011          15      5.945           18       5.898   0.047   2036.00     824.00
2012          15      6.320           18       6.075   0.245   1744.33     960.00
2013          15      6.103           18       5.509   0.594   1962.66     825.01

The minimum innings threshold varied from 79 to 85.33 for starters, and from 34.33 to 37 for relievers, as each year I looked at just the top 180 relievers and 150 starters by IP. It’s interesting to see that in both FIP and RA9, there was basically no difference in bottom-decile performance between starters and relievers in 2010 and 2011. In 2012 and 2013, however, relievers did a little better than starters (by 0.380 and 0.139 in FIP, and by 0.245 and 0.594 in RA9), which is what I’d have expected.
This analysis is sensitive to where you draw the lines to exclude players for low playing time. Enlarging the pool for one role (either by raising the head count or by lowering the innings threshold) adds lower-innings pitchers, who tend to perform worse, so it will make that group’s bottom bucket look worse relative to the other group’s.
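One way to gauge that sensitivity is to recompute the bottom bucket for a range of pool sizes and watch how much it moves. A rough sketch, assuming pitchers are `(ip, rate)` tuples for a single role and that buckets are averaged with IP weights (both assumptions); `bottom_bucket_rate` and `sweep` are illustrative names:

```python
def bottom_bucket_rate(pitchers, pool_size, n_buckets=10):
    """IP-weighted mean rate of the worst bucket within a top-N-by-IP pool.

    pitchers is a list of (ip, rate) tuples for one role; lower rate
    is better. Returns NaN if the worst bucket has no innings.
    """
    pool = sorted(pitchers, key=lambda p: p[0], reverse=True)[:pool_size]
    ranked = sorted(pool, key=lambda p: p[1])
    start = (n_buckets - 1) * len(ranked) // n_buckets
    worst = ranked[start:]
    total_ip = sum(ip for ip, _ in worst)
    if total_ip <= 0:
        return float("nan")
    return sum(ip * rate for ip, rate in worst) / total_ip


def sweep(pitchers, pool_sizes):
    """Bottom-bucket rate for each candidate pool size."""
    return {n: bottom_bucket_rate(pitchers, n) for n in pool_sizes}
```

For example, `sweep(relievers, [150, 180, 210])` would show how the reliever replacement-level estimate shifts if a seventh bullpen slot per team is counted, or only five.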