Friday, October 19, 2007

What do we mean by replacement players?

What follows is something that I originally did in the process of putting together my player value series, which will continue tomorrow night. However, this part evolved into its own little (if 3000 words can be little) stand-alone paper. It's not perfect, but it's a nice little study that can help us understand something about how players are used in the big leagues.

Introduction

Comparison to replacement level has become a common practice among statistically oriented fans and analysts when trying to understand player value. The motivation behind this technique goes something like this: we can estimate, with a great deal of accuracy, how many runs a team has generated in a season from its offensive stats (# of singles, doubles, home runs, walks, etc), using any number of run estimators (runs created, linear weights, etc). Furthermore, we can calculate these same run estimates for individual players, giving us an idea of how many runs each player contributed to a particular team's offense.

However, when you try to use these run estimates to understand player value to a team, you run into a problem: playing time has as much, if not more, to do with a player's runs created as his actual performance, because players are credited for every single "Good Thing" that they do. For example, this season, Orlando Cabrera was second on the LA Angels in runs created (93) despite hitting for just a 0.742 OPS. He did rank behind Vladimir Guerrero, but ranked well ahead of Chone Figgins (0.825 OPS) and Casey Kotchman (0.840 OPS). The reason? Cabrera led the team with 701 plate appearances compared to Figgins' 503 and Kotchman's 508. It's not that Cabrera was bad this season at the plate, but he wasn't good, and he certainly wasn't the Angels' second-most valuable hitter to most observers' eyes. And while durability certainly has value, if a durable player doesn't hit (or play defense) any better than a veteran AAA call-up would, he's not doing his team any favors.

This is where comparisons to replacement players come in. In Cabrera's case, what we're interested in knowing is not how many total Good Things he did; we're interested in how many more Good Things he did compared to the number of Good Things a veteran AAA replacement player would be expected to do. It's this performance above replacement-level production that we're actually interested in, because that's the difference between someone who should be starting (or not) in the big leagues, and someone who should be riding the pine or playing in AAA.

The problem when it comes to actually quantifying performance above replacement level is that there's disagreement about how well replacement players perform. For example, folks typically estimate replacement level as some percentage of league-average production. However, you'll find a lot of disagreement about what percentage to use. Keith Woolner's VORP uses 80% of league average for most positions. In contrast, two well-respected amateur baseball researchers (Tom Tango and Brandon Heipp) recommend using a value closer to 73%. Each of these variants has its own reasonable justification, but most researchers also seem to agree that there's not a distinctly correct point at which we should assign the cutoff. So, given this uncertainty, I decided to take a look at some data describing player performance over the last several years to try to figure out how replacement players really do perform.

Methods

My study assumes that, in general, playing time reflects player quality, such that starters, bench players, and replacement players can be identified based on how often they play. Briefly, my methods, which are similar to those of some of the previous replacement player studies, are as follows: I pulled '04-'07 stats from THT's stats pages for all players in both leagues. After generating custom linear weights for each league over the 4-year time span in question, and then calculating runs per game for each player, I sorted players within each year by plate appearances (my indication of playing time). I then examined successive "slices" of players, removing one player for each team in the league (i.e. 14 players per slice from the AL, 16 players per slice from the NL).

For ease of description, I'm defining the first 8 (in NL) or 9 (in AL) slices as "starters," and the next 3 (in AL) or 4 (in NL) slices as top "bench" players, making for 12 slices that consist of players who are virtually guaranteed roster spots on their teams. The next 3 slices were defined as "fringe" players, who may or may not make their teams' rosters over the course of the season depending on how many pitchers a team carries and the particular needs of the team over the course of a season. Slices below these fringe slices (slice #16 and higher) were defined as "scrubs." Note, I did not keep track of teams or positions when doing this slicing; it was strictly based on plate appearances within years (there were a variety of reasons that I went this route, both theoretical and practical). Nevertheless, post hoc assessments of positional composition, at least, showed that it was fairly consistent across slices, especially in the AL.
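The slicing procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet work actually used for the study: the record keys (`name`, `league`, `year`, `pa`) and the function names are hypothetical, while the team counts and role boundaries are taken from the text.

```python
from collections import defaultdict

# From the text: 14 AL teams and 16 NL teams, so one slice holds one player per team.
TEAMS_PER_LEAGUE = {"AL": 14, "NL": 16}
# From the text: 9 starter slices in the AL (DH), 8 in the NL.
STARTER_SLICES = {"AL": 9, "NL": 8}


def slice_label(league, slice_no):
    """Map a 1-based slice number to the role names used in this study."""
    if slice_no <= STARTER_SLICES[league]:
        return "starter"
    if slice_no <= 12:   # starters + bench always total 12 slices
        return "bench"
    if slice_no <= 15:   # the next 3 slices
        return "fringe"
    return "scrub"       # slice #16 and higher


def assign_slices(players):
    """Rank players by plate appearances within each league-year and cut
    the ranking into slices of one player per team.

    `players` is an iterable of dicts with hypothetical keys
    'name', 'league', 'year', 'pa'. Team and position are ignored,
    as in the study. Returns (name, league, year, slice_no, role) tuples.
    """
    groups = defaultdict(list)
    for p in players:
        groups[(p["league"], p["year"])].append(p)

    rows = []
    for (league, year), group in groups.items():
        size = TEAMS_PER_LEAGUE[league]
        ranked = sorted(group, key=lambda p: p["pa"], reverse=True)
        for rank, p in enumerate(ranked):
            slice_no = rank // size + 1
            rows.append((p["name"], league, year, slice_no,
                         slice_label(league, slice_no)))
    return rows
```

Pooling the same slice number across the four seasons then gives 56 player-seasons per AL slice and 64 per NL slice.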

Now, it is true that this study is somewhat limited by the fact that it assigns player roles after the fact, rather than ahead of time, and therefore suffers from sampling bias--players who might have been counted on for substantial playing time may have faltered and lost their jobs to another player who might previously have been considered a replacement player. There are other potential problems as well. For example, if a starter is injured midway through the season, he will be classified as a part-time player in this scheme, while his backups will be ranked higher than they probably deserve. Still, especially after seeing and looking through the data, I do not think that this is a catastrophic problem for this dataset or its ultimate findings. That said, a study similar to magpie's (which was not published until after I'd completed this study), in which players are classified into their roles prior to considering their playing time that season, would be a very nice complement to this one. Maybe I'll try to do that someday, though don't let me hold you back if you're interested! ;)

Finally, a brief word on the standard I used for league average: in this study, all players were compared to league average production by position players only. If one uses true league average runs per game, NL players are compared to a standard that includes a significant number of plate appearances by pitchers, and thus is about 0.3 runs per game lower than I think it should be (NL rates from '03-'07: 4.6 r/g with pitchers, 4.9 r/g without pitchers). We're interested in performance relative to other hitters, not performance relative to pitchers pretending to be hitters.
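To make the effect of the baseline concrete, here is a small sketch of the percent-of-league-average calculation using the two NL rates quoted above. The function name and the example hitter are made up for illustration; only the 4.6 and 4.9 r/g figures come from the text.

```python
# NL runs per game quoted in the text.
NL_RG_WITH_PITCHERS = 4.6
NL_RG_POSITION_PLAYERS = 4.9


def pct_of_league_avg(player_rg, league_rg):
    """Player's runs per game as a percentage of the league baseline."""
    return 100.0 * player_rg / league_rg


# Hypothetical hitter producing 3.92 runs per game.
example_rg = 3.92
vs_all = pct_of_league_avg(example_rg, NL_RG_WITH_PITCHERS)
vs_hitters = pct_of_league_avg(example_rg, NL_RG_POSITION_PLAYERS)
print(f"vs. league incl. pitchers: {vs_all:.0f}%")
print(f"vs. position players only: {vs_hitters:.0f}%")
```

The same hitter looks several percentage points better against the baseline that includes pitchers, which is why the position-player baseline is used throughout.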

Expectations

Before we look at the results, let's think about what we're looking for in terms of replacement-level production. Here are a couple of ways that we might identify replacement level, based on the slicing procedure I'm using:
  • Replacement = Fringe: Replacement level should match the production of "fringe" players, who may or may not make the big league roster depending on the needs of their team (# of pitchers, lefty/righty composition of bench, speed on bench, defensive skill on bench, etc).
  • Replacement = Scrub: Replacement level should match that of "scrub" players, who will be unlikely to make the big league roster except at those times when the big club needs to plug an emergency hole caused by injury, trade, etc.
  • Replacement = Not Starter or Bench: Replacement level should be the average production of anyone not identified as a "starter" or "bench" player.
Think about which of these you most agree with. Or, if you disagree with all of these definitions, decide how you would identify a replacement player in this study before reading further.

Ready? Ok, go ahead and read on!

Results

Here is how those slices broke down, by league. %LgAvg is the important number with respect to replacement level. "+- Field" is a composite fielding stat and is described in a section below.

American League

Slice     PA     PA/Plyr  R/G   %LgAvg  OBP    SLG    OPS    +-Field
Starter1  39699  709      6.05  124%    0.367  0.476  0.843   -3.4
Starter2  37161  664      5.65  116%    0.353  0.469  0.822   -5.1
Starter3  35277  630      5.50  113%    0.350  0.461  0.812   -4.2
Starter4  32803  586      5.35  110%    0.348  0.452  0.799   -1.9
Starter5  30676  548      5.18  106%    0.345  0.444  0.789   -3.5
Starter6  28307  505      4.93  101%    0.335  0.437  0.772   -0.5
Starter7  25105  448      4.64   95%    0.329  0.421  0.750    0.6
Starter8  22194  396      4.55   93%    0.326  0.413  0.739    0.2
Starter9  19489  348      4.52   93%    0.326  0.413  0.738   -0.2
Bench1    16329  292      4.18   86%    0.318  0.394  0.711   -0.7
Bench2    13280  237      4.08   84%    0.313  0.393  0.706   -2.6
Bench3    11067  198      3.84   79%    0.306  0.382  0.688    0.5
Fringe1    9555  171      3.85   79%    0.304  0.384  0.688    3.1
Fringe2    7771  139      3.61   74%    0.297  0.373  0.669   -1.8
Fringe3    5907  105      3.31   68%    0.286  0.357  0.643    2.6
Scrub1     4487   80      2.99   61%    0.281  0.334  0.615    4.1
Scrub2     3398   61      3.33   68%    0.295  0.350  0.645   -1.5
Scrub3     2649   47      2.82   58%    0.274  0.329  0.604    1.4
Scrub4     1841   33      2.38   49%    0.268  0.285  0.554    5.5
Scrub5     1237   22      1.95   40%    0.254  0.265  0.519   -8.9
Other      1309   23      2.04   42%    0.248  0.281  0.529   -2.9

National League

Slice     PA     PA/Plyr  R/G   %LgAvg  OBP    SLG    OPS    +-Field
Starter1  45132  705      6.04  123%    0.365  0.489  0.854    2.6
Starter2  42042  657      5.69  116%    0.358  0.476  0.835    4.0
Starter3  39090  611      5.68  116%    0.357  0.477  0.834   -1.3
Starter4  35852  560      5.03  102%    0.343  0.442  0.785   -0.4
Starter5  32152  502      4.83   98%    0.340  0.428  0.768    0.5
Starter6  29015  453      4.77   97%    0.337  0.428  0.766    1.9
Starter7  26118  408      4.71   96%    0.338  0.418  0.756    1.1
Starter8  22607  353      4.62   94%    0.330  0.425  0.755   -0.6
Bench1    19350  302      4.55   93%    0.327  0.424  0.751    0.5
Bench2    16727  261      4.58   93%    0.331  0.420  0.751    0.8
Bench3    14357  224      4.23   86%    0.322  0.403  0.725    0.8
Bench4    12435  194      4.18   85%    0.317  0.403  0.720   -0.1
Fringe1   10415  163      4.03   82%    0.318  0.387  0.705    2.4
Fringe2    8648  135      3.75   76%    0.309  0.373  0.681    1.0
Fringe3    6959  109      3.92   80%    0.317  0.377  0.694    1.5
Scrub1     5292   83      3.13   64%    0.290  0.343  0.633    2.6
Scrub2     4004   63      3.23   66%    0.295  0.343  0.638    2.6
Scrub3     2920   46      2.62   53%    0.269  0.322  0.591    3.4
Scrub4     2100   33      2.32   47%    0.266  0.290  0.555  -22.5
Scrub5     1364   21      2.31   47%    0.263  0.293  0.556   -3.0
Other      1097   17      2.31   47%    0.253  0.305  0.558    1.9

As you can see, in both leagues, the overall amount of production, as measured by percent of league average, gradually decreases as one moves down the playing time slices. This means that teams generally gave the most plate appearances to their best hitters! Nice to see. There are certainly good performances (injuries, rookie half-seasons, etc) in some of the lower slices, and bad performances in some of the upper slices (underperforming players, hidden injuries, Adam Everett-esque defensive specialists, etc). But the overall trend is clear, and consistent with what one would expect if performance is associated with playing time. In fact, given everything that I was worried about that could go wrong with this study, I think these data look remarkably clean.

Let's now take a graphical look at offensive production and try to see what these data might say about replacement-level performance. Below I've plotted player slices against offensive performance, relative to average within each league. Vertical bands identify the different groups of slices as defined above (starter, bench, fringe, and scrub players), while the gray horizontal band denotes the range of replacement values that I've seen suggested by other researchers (73% to 80%). Colored horizontal lines indicate weighted league averages within each of the four major slice categories.
This figure does a great job illustrating how smoothly production seems to drop among player slices, which indicates to me that there is useful signal here despite potential confounds. Furthermore, I think it's remarkable how consistent the two leagues were. The primary divergences between the two leagues were among starters 4-6, two of the bench players (slices 10 and 12), as well as in the 15th player slice (fringe #3). I'm not certain what is causing the divergences at those spots, although the first separation could be related to an unusual spike in catcher numbers among the NL slices 5-7 that doesn't occur in the AL slices (though it doesn't match up to that spike perfectly, either). Alternatively, because these production estimates look only at offense, perhaps this reflects some divergence in emphasis between offense and fielding between the two leagues? More on that in a bit.

Ok, based strictly on these empirical data, here's how I would define replacement players for each of the perspectives I mentioned above:
  • Replacement = Fringe: production level typically in the range of 68-82% of league average, with a weighted mean checking in at 74% of league average offense in the AL, but 81% of league average offense in the NL. Splitting the difference puts us at ~78%.
  • Replacement = Scrub: production level typically in the range of 40-68% of league average, with a weighted mean checking in at 57% of league average offense in the AL, and 59% in the NL, or 58% overall.
  • Replacement = Not Starter or Bench: the weighted mean of all players that are defined here as either fringe or scrub players was 68 to 71% (AL vs. NL) of league averages. Over the past four years, 80593, or 11% of all plate appearances across the two leagues came from players that fell into this category.
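For readers who want to check the "Not Starter or Bench" figures against the table, here is a rough sketch. It assumes the weighted mean is weighted by plate appearances and that the "Other" row is counted with the scrubs; neither detail is spelled out above, and the inputs are the rounded %LgAvg values from the table, so treat the output as approximate.

```python
# (plate appearances, %LgAvg) for every slice below the bench,
# copied from the table: Fringe1-3, Scrub1-5, Other.
BELOW_BENCH = {
    "AL": [(9555, 79), (7771, 74), (5907, 68),
           (4487, 61), (3398, 68), (2649, 58), (1841, 49), (1237, 40),
           (1309, 42)],
    "NL": [(10415, 82), (8648, 76), (6959, 80),
           (5292, 64), (4004, 66), (2920, 53), (2100, 47), (1364, 47),
           (1097, 47)],
}


def pa_weighted_mean(rows):
    """Mean of %LgAvg weighted by plate appearances (an assumed weighting)."""
    total_pa = sum(pa for pa, _ in rows)
    return sum(pa * pct for pa, pct in rows) / total_pa


for league, rows in BELOW_BENCH.items():
    print(f"{league}: {pa_weighted_mean(rows):.1f}% of league average")
```

Under those assumptions the two leagues come out at roughly 68% (AL) and 71% (NL).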
Clearly, traditional conventions related to replacement level (73 to 80% of league average), as denoted by the gray horizontal band through the figure, best match the "Fringe" player slices. Yet some of you may have picked one of the other expectations I outlined above regarding replacement players. After all, isn't replacement level supposed to represent a minimum level of MLB player production? Why is it that we're seeing so many players performing below replacement level?

Here's one explanation: replacement level tries to describe the level of production below which players will tend to be replaced by other players. After all, if someone is producing below replacement level, you should be able to get a different player at no additional cost that will perform better. Therefore, it does make sense that the group of players who did not have playing time sufficient to rank within the top 14 or so player slices will be primarily made up of guys who didn't perform up to this minimum standard. In those cases, teams apparently decided to give more playing time and roster space to players who could actually at least perform at replacement level.

Replacement-Level Fielding

To this point, I've been looking exclusively at offensive production of position players. However, players are not given playing time simply because of their offense--defense is important as well! So how do these same slices of players fare with respect to their fielding performance?

To look at that, I've calculated fielding stats based on conversions of the THT zone rating data using a method described here. The average, in this case, is the mean across both leagues within a particular season. I only pulled these data for the players' primary positions, which means that the league totals will not sum to zero. In addition, because I'm comparing across all positions here, I added a rough positional adjustment based on the difficulty of fielding different positions, as estimated by Tom Tango. This adjustment is part (though not all) of the reason that AL slices have lower-rated defense below: there are DHs in that league, and they get a hefty negative positional adjustment to put them on an even playing field with other players. Because playing time was so different across slices, the runs saved estimates are standardized to represent per-season-per-player rates.

You can see the fielding data in the table above, but here they are in graphical form. As before, slices are on the x-axis, and colored horizontal lines indicate league averages within each major slice category:
Several things to note. First, unless there are massive park factor differences across the leagues, the NL and AL seem to value defense a bit differently. In the National League, the top slices of players in terms of plate appearances tend to be not only outstanding hitters relative to their league, but they also tend to be outstanding fielders. Below the top two slices, defense is apparently given less of a premium. Among American League clubs, however, the ~5 players per team receiving the most plate appearances tend to be below-average defenders. It's not until the bottom four starters that you tend to reach average fielding performance.

Second, with respect to replacement level, it looks like we can assume that bench and fringe players tend to be fairly average defenders. There might be a slight tendency for fringe players to be slightly above average, though it's not by more than a run or two. Furthermore, as you can see, the fielding values get rather volatile on the right side of the figure (almost certainly due to sample size--fielding stats are more volatile than hitting stats), so I think it's safest to assume that replacement level fielding is essentially the same as MLB-average fielding. This finding is consistent with similar work by Tom Tango. One thing that is very clear is that replacement players are not massively below average fielders. This puts into question the relevance of systems like BPro's Fielding Runs Above Replacement (FRAR), which sets replacement-level fielding to an approximate league minimum.

Closing Thoughts

I won't pretend that this is The Definitive Study on replacement level. There are certainly a variety of concerns that one can raise with respect to sampling, player identification, etc. Nevertheless, I think it is a good study that gives us an empirical grounding on how players perform relative to the amount of playing time they are granted by their teams and their health. From these data, we can draw some conclusions about how to best estimate replacement-level performance.

On offense, the slices that seem to best fit the description of replacement players--guys who are on the bubble of making a big league team--tended to hit at levels ranging from 68% to 82% of league average, with a rough mean of ~78%. Both of the popular standards I've seen, 73% (endorsed by Tango and Patriot) and 80% (endorsed by Woolner), fall nicely within this range. And, of course, both have their own theoretical and/or empirical justifications. But if you ask me for my recommendation after doing this study, I would recommend the lower figure of 73%. It is still a level of production above that which the "scrub" players, as I defined them here, produce. And if the primary function of a replacement level paradigm is to recognize a production threshold at which any freely available talent is likely to perform, it makes sense to me to be somewhat conservative in that threshold--that way you don't ignore production by, for example, a bench player hitting at 82% of league average, who might turn out to be difficult to replace.

With defense, it seems clear based on this and other studies that replacement players play roughly league average defense. Therefore, if you are interested in describing a player's total value above replacement level, I recommend that you follow this procedure: calculate their runs on offense relative to a hitter at 73% of position player league average runs per game, and then add to that value their fielding vs. average, as well as a positional adjustment to account for the difficulty of playing that fielder's position. More on that in coming days in my player value series.
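As a rough illustration of that procedure, here is a minimal sketch. The function and parameter names are hypothetical, and converting a runs-per-game rate into runs by scaling with outs made divided by 27 is my assumption for the example; the exact conversion is left to the player value series.

```python
def runs_above_replacement(player_rg, league_rg, outs_made,
                           fielding_runs, positional_adj,
                           replacement_pct=0.73):
    """Total value above replacement, in runs.

    player_rg       -- player's offensive runs per game
    league_rg       -- position-player league average runs per game
    outs_made       -- outs the player used (assumed: 27 outs = 1 game)
    fielding_runs   -- fielding runs relative to average
    positional_adj  -- positional adjustment, in runs
    replacement_pct -- replacement level as a fraction of league average
    """
    replacement_rg = replacement_pct * league_rg
    offense = (player_rg - replacement_rg) * outs_made / 27.0
    # Replacement-level fielding is treated as league average, so fielding
    # and the positional adjustment are added without a further baseline shift.
    return offense + fielding_runs + positional_adj


# Made-up player: 5.5 r/g in a 4.9 r/g league, 400 outs,
# +3 fielding runs, +2 positional adjustment.
value = runs_above_replacement(5.5, 4.9, 400, 3.0, 2.0)
print(f"{value:.1f} runs above replacement")
```

Swapping `replacement_pct` to 0.80 shows how much the choice of threshold moves a player's total.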

References, Resources, and Acknowledgments

All player statistics were pulled from the stats pages at The Hardball Times.

Custom linear weights were calculated for each league via base runs using a spreadsheet created by U.S. Patriot, using initial "B" coefficients pulled from this article by Tom Tango, and using league totals pulled from Doug's Stats.

This work was stimulated, in part, by a great e-mail conversation I've been having with skyking162 the past several weeks about how to value players. Sky also provided some helpful comments on a draft of this article.

Patriot has written an extremely helpful essay on baselines at his site, along with many other excellent articles on related topics.