Friday, October 19, 2007

What do we mean by replacement players?

What follows is something that I originally did in the process of putting together my player value series, which will continue tomorrow night. However, this part evolved into its own little (if 3000 words can be little) stand-alone paper. It's not perfect, but it's a nice little study that can help us understand something about how players are used in the big leagues.

Introduction

Comparison to replacement level has become a common practice among statistically oriented fans and analysts when trying to understand player value. The motivation behind this technique goes something like this: we can estimate, with a great deal of accuracy, how many runs a team has generated in a season from its offensive stats (# of singles, doubles, home runs, walks, etc), using any number of run estimators (runs created, linear weights, etc). Furthermore, we can compute these same run estimates for individual players, giving us an idea of how many runs each player contributed to a particular team's offense.

However, when you try to use these runs estimates to understand player value to a team, you run into a problem: playing time has as much, if not more, to do with a player's runs created as their actual performance, because players are credited for every single "Good Thing" that they do. For example, this season, Orlando Cabrera was second on the LA Angels in runs created (93) despite hitting for just a 0.742 OPS. He did rank behind Vladimir Guerrero, but ranked well ahead of Chone Figgins (0.825 OPS) and Casey Kotchman (0.840 OPS). The reason? Cabrera led the team with 701 plate appearances compared to Figgins' 503 and Kotchman's 508. It's not that Cabrera was bad this season at the plate, but he wasn't good, and certainly wasn't the Angels' second-most valuable hitter to most observers' eyes. And while durability certainly has value, if a durable player doesn't hit (or play defense) any better than a veteran AAA call-up would, he's not doing his team any favors.

This is where comparisons to replacement players come in. In Cabrera's case, what we're interested in knowing is not how many total Good Things he did; we're interested in how many more Good Things he did compared to the number of Good Things a veteran AAA replacement player would be expected to do. It's this performance above replacement-level production that we're actually interested in, because that's the difference between someone who should be starting (or not) in the big leagues, and someone who should be riding the pine or playing in AAA.

The problem when it comes to actually quantifying performance above replacement level is that there's disagreement about how well replacement players perform. For example, folks typically estimate replacement level as some percentage of league-average production. However, you'll find a lot of disagreement about what percentage to use. Keith Woolner's VORP uses 80% of league average for most positions. In contrast, two well-respected amateur baseball researchers (Tom Tango and Brandon Heipp) recommend using a value closer to 73%. Each of these variants has its own reasonable justification, but most researchers also seem to agree that there's not a distinctly correct point at which we should assign the cutoff. So, given this uncertainty, I decided to take a look at some data describing player performance over the last several years to try to figure out how replacement players really do perform.
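To make the competing conventions concrete, here is a minimal sketch of how a percent-of-league-average baseline turns into a runs-per-game figure. The function name is my own, and the 4.9 r/g league average (the NL position-player rate quoted later in this article) is used purely for illustration.

```python
def replacement_rpg(league_rpg, pct_of_average):
    """Replacement-level runs per game as a fraction of league average."""
    return league_rpg * pct_of_average

# Illustrative league average: the NL position-player rate quoted in this article.
LEAGUE_RPG = 4.9

for label, pct in [("Woolner (VORP)", 0.80), ("Tango / Heipp", 0.73)]:
    print(f"{label}: {replacement_rpg(LEAGUE_RPG, pct):.2f} r/g")
```

The gap between the two conventions works out to about a third of a run per game, which is why the choice of percentage matters once it is multiplied over a season of playing time.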

Methods

My study assumes that, in general, playing time reflects player quality, such that starters, bench players, and replacement players can be identified based on how often they play. Briefly, my methods, which are similar to those of some of the previous replacement player studies, are as follows: I pulled '04-'07 stats from THT's stats pages for all players in both leagues. After generating custom linear weights for each league over the 4-year time span in question, and then calculating runs per game for each player, I sorted players within each year by plate appearances (my indication of playing time). I then examined successive "slices" of players, removing one player for each team in the league (i.e. 14 players per slice from the AL, 16 players per slice from the NL).
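The slicing step described above can be sketched roughly as follows. This is only an illustration of the idea, not the spreadsheet work actually used for the study; the dict keys ('name', 'year', 'pa') are a hypothetical schema, and the sample players are made up.

```python
from collections import defaultdict

def slice_players(players, teams_in_league):
    """Rank players by PA within each season and cut the ranking into slices.

    players: iterable of dicts with 'name', 'year', and 'pa' keys (hypothetical schema).
    teams_in_league: players per slice (14 for the AL, 16 for the NL).
    Returns {year: [slice1, slice2, ...]}; the last slice may be partial.
    """
    by_year = defaultdict(list)
    for p in players:
        by_year[p["year"]].append(p)

    slices = {}
    for year, group in by_year.items():
        # Most plate appearances first, ignoring team and position.
        ranked = sorted(group, key=lambda p: p["pa"], reverse=True)
        slices[year] = [
            ranked[i:i + teams_in_league]
            for i in range(0, len(ranked), teams_in_league)
        ]
    return slices

# Toy example: a made-up two-team "league", so each slice holds two players.
toy = [
    {"name": "a", "year": 2007, "pa": 700},
    {"name": "b", "year": 2007, "pa": 650},
    {"name": "c", "year": 2007, "pa": 300},
    {"name": "d", "year": 2007, "pa": 120},
    {"name": "e", "year": 2007, "pa": 40},
]
for number, group in enumerate(slice_players(toy, 2)[2007], start=1):
    print(number, [p["name"] for p in group])
```

Pooling the same slice number across the four seasons then gives the per-slice totals reported in the results.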

For the ease of description, I'm defining the first 8 (in NL) or 9 (in AL) slices as "starters," and the next 3 (in AL) or 4 (in NL) slices as top "bench" players, making for 12 slices that consist of players that are virtually guaranteed roster spots on their teams. The next 3 slices were defined as "fringe" players, who may or may not make their teams' rosters over the course of the season depending on how many pitchers a team carries and the particular needs of the team over the course of a season. Slices below these fringe slices (slice #16 and higher) were defined as "scrubs." Note, I did not keep track of teams or positions when doing this slicing; it was strictly based on plate appearances within years (there were a variety of reasons that I went this route, both theoretical and practical). Nevertheless, post hoc assessments of positional composition, at least, showed that it was fairly consistent across positions, especially in the AL.
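As a quick restatement of those definitions, the mapping from slice number to role could be written like this. The function and constant names are mine; the cutoffs are the ones given in the paragraph above.

```python
# Number of slices in each role, per the definitions above.
STARTER_SLICES = {"AL": 9, "NL": 8}
BENCH_SLICES = {"AL": 3, "NL": 4}
FRINGE_SLICES = 3

def role_for_slice(slice_number, league):
    """Map a 1-based slice number to the role label used in this article."""
    starters = STARTER_SLICES[league]
    bench_end = starters + BENCH_SLICES[league]  # slice 12 in both leagues
    fringe_end = bench_end + FRINGE_SLICES       # slice 15 in both leagues
    if slice_number <= starters:
        return "starter"
    if slice_number <= bench_end:
        return "bench"
    if slice_number <= fringe_end:
        return "fringe"
    return "scrub"

print(role_for_slice(9, "AL"), role_for_slice(9, "NL"))
```

The only place the leagues differ is slice 9, which counts as a starter in the AL (the DH) and as the first bench player in the NL.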

Now, it is true that this study is somewhat limited by the fact that it assigns player roles after the fact, rather than ahead of time, and therefore suffers from sampling bias--players who might have been counted on for substantial playing time may have faltered and lost their jobs to another player who might have previously been considered a replacement player. There are other potential problems as well. For example, if a starter is injured mid-way through the season, he will be classified as a part-time player in this scheme, while his backups will be ranked higher than they probably deserve. Nevertheless, especially after seeing and looking through the data, I do not think that this is a catastrophic problem for this dataset, or its ultimate findings. Still, a study similar to magpie's (which was not published until after I'd completed this study), in which players are classified into their roles prior to considering their playing time that season, would be a very nice complement to this one. Maybe I'll try to do that someday, though don't let me hold you back if you're interested! ;)

Finally, a brief word on the standard I used for league average: in this study, all players were compared to league average production by position players only. If one uses true league average runs per game, NL players are compared to a standard that includes a significant number of plate appearances by pitchers, and thus is about 0.3 runs per game lower than I think it should be (NL rates from '03-'07: 4.6 r/g with pitchers, 4.9 r/g without pitchers). We're interested in performance relative to other hitters, not performance relative to pitchers pretending to be hitters.

Expectations

Before we look at the results, let's think about what we're looking for in terms of replacement-level production. Here are a couple of ways that we might identify replacement level, based on the slicing procedure I'm using:
  • Replacement = Fringe: Replacement level should match the production of "fringe" players, who may or may not make the big league roster depending on the needs of their team (# of pitchers, lefty/righty composition of bench, speed on bench, defensive skill on bench, etc).
  • Replacement = Scrub: Replacement level should match that of "scrub" players, who will be unlikely to make the big league roster except at those times when the big club needs to plug an emergency hole caused by injury, trade, etc.
  • Replacement = Not Starter or Bench: Replacement level should be the average production of anyone not identified as a "starter" or "bench" player.
Think about which of these you most agree with. Or, if you disagree with all of these definitions, decide how you would identify a replacement player in this study before reading further.

Ready? Ok, go ahead and read on!

Results

Here is how those slices broke down, by league. %LgAvg is the important number with respect to replacement level. "+- Field" is a composite fielding stat and is described in a section below.

American League

Slice     PA     PA/Plyr  R/G   %LgAvg  OBP    SLG    OPS    +-Field
Starter1  39699  709      6.05  124%    0.367  0.476  0.843  -3.4
Starter2  37161  664      5.65  116%    0.353  0.469  0.822  -5.1
Starter3  35277  630      5.50  113%    0.350  0.461  0.812  -4.2
Starter4  32803  586      5.35  110%    0.348  0.452  0.799  -1.9
Starter5  30676  548      5.18  106%    0.345  0.444  0.789  -3.5
Starter6  28307  505      4.93  101%    0.335  0.437  0.772  -0.5
Starter7  25105  448      4.64  95%     0.329  0.421  0.750  0.6
Starter8  22194  396      4.55  93%     0.326  0.413  0.739  0.2
Starter9  19489  348      4.52  93%     0.326  0.413  0.738  -0.2
Bench1    16329  292      4.18  86%     0.318  0.394  0.711  -0.7
Bench2    13280  237      4.08  84%     0.313  0.393  0.706  -2.6
Bench3    11067  198      3.84  79%     0.306  0.382  0.688  0.5
Fringe1   9555   171      3.85  79%     0.304  0.384  0.688  3.1
Fringe2   7771   139      3.61  74%     0.297  0.373  0.669  -1.8
Fringe3   5907   105      3.31  68%     0.286  0.357  0.643  2.6
Scrub1    4487   80       2.99  61%     0.281  0.334  0.615  4.1
Scrub2    3398   61       3.33  68%     0.295  0.350  0.645  -1.5
Scrub3    2649   47       2.82  58%     0.274  0.329  0.604  1.4
Scrub4    1841   33       2.38  49%     0.268  0.285  0.554  5.5
Scrub5    1237   22       1.95  40%     0.254  0.265  0.519  -8.9
Other     1309   23       2.04  42%     0.248  0.281  0.529  -2.9

National League

Slice     PA     PA/Plyr  R/G   %LgAvg  OBP    SLG    OPS    +-Field
Starter1  45132  705      6.04  123%    0.365  0.489  0.854  2.6
Starter2  42042  657      5.69  116%    0.358  0.476  0.835  4.0
Starter3  39090  611      5.68  116%    0.357  0.477  0.834  -1.3
Starter4  35852  560      5.03  102%    0.343  0.442  0.785  -0.4
Starter5  32152  502      4.83  98%     0.340  0.428  0.768  0.5
Starter6  29015  453      4.77  97%     0.337  0.428  0.766  1.9
Starter7  26118  408      4.71  96%     0.338  0.418  0.756  1.1
Starter8  22607  353      4.62  94%     0.330  0.425  0.755  -0.6
Bench1    19350  302      4.55  93%     0.327  0.424  0.751  0.5
Bench2    16727  261      4.58  93%     0.331  0.420  0.751  0.8
Bench3    14357  224      4.23  86%     0.322  0.403  0.725  0.8
Bench4    12435  194      4.18  85%     0.317  0.403  0.720  -0.1
Fringe1   10415  163      4.03  82%     0.318  0.387  0.705  2.4
Fringe2   8648   135      3.75  76%     0.309  0.373  0.681  1.0
Fringe3   6959   109      3.92  80%     0.317  0.377  0.694  1.5
Scrub1    5292   83       3.13  64%     0.290  0.343  0.633  2.6
Scrub2    4004   63       3.23  66%     0.295  0.343  0.638  2.6
Scrub3    2920   46       2.62  53%     0.269  0.322  0.591  3.4
Scrub4    2100   33       2.32  47%     0.266  0.290  0.555  -22.5
Scrub5    1364   21       2.31  47%     0.263  0.293  0.556  -3.0
Other     1097   17       2.31  47%     0.253  0.305  0.558  1.9

As you can see, in both leagues, the overall amount of production, as measured by percent of league average, gradually decreases as one moves down the playing time slices. This means that teams generally gave the most plate appearances to their best hitters! Nice to see. There are certainly good performances (injuries, rookie half-seasons, etc) in some of the lower slices, and bad performances in some of the upper slices (underperforming players, hidden injuries, Adam Everett-esque defensive specialists, etc). But the overall trend is clear, and consistent with what one would expect if performance is associated with playing time. In fact, given everything that I was worried about that could go wrong with this study, I think these data look remarkably clean.

Let's now take a graphical look at offensive production and try to see what these data might say about replacement-level performance. Below I've plotted player slices against offensive performance, relative to average within each league. Vertical bands identify the different groups of slices as defined above (starter, bench, fringe, and scrub players), while the gray horizontal band denotes the range of replacement values that I've seen suggested by other researchers (73% to 80%). Colored horizontal lines indicate weighted league averages within each of the four major slice categories.

[Figure: offensive production (% of league average) by playing-time slice, AL and NL]

This figure does a great job illustrating how smoothly production seems to drop among player slices, which indicates to me that there is useful signal here despite potential confounds. Furthermore, I think it's remarkable how consistent the two leagues were. The primary divergences between the two leagues were among starters 4-6, two of the bench players (slices 10 and 12), as well as in the 15th player slice (fringe #3). I'm not certain what is causing the divergences at those spots, although the first separation could be related to an unusual spike in catcher numbers among the NL slices 5-7 that doesn't occur in the AL slices (though it doesn't match up to that spike perfectly, either). Alternatively, because these production estimates look only at offense, perhaps this reflects some divergence in emphasis between offense and fielding between the two leagues? More on that in a bit.

Ok, based strictly on these empirical data, here's how I would define replacement players for each of the perspectives I mentioned above:
  • Replacement = Fringe: production level typically in the range of 68-82% of league average, with a weighted mean checking in at 74% of league average offense in the AL, but 81% of league average offense in the NL. Splitting the difference puts us at ~78%.
  • Replacement = Scrub: production level typically in the range of 40-68% of league average, with a weighted mean checking in at 57% of league average offense in the AL, and 59% in the NL, or 58% overall.
  • Replacement = Not Starter or Bench: the weighted mean of all players that are defined here as either fringe or scrub players was 68 to 71% (AL vs. NL) of league averages. Over the past four years, 80953, or 11%, of all plate appearances across the two leagues came from players that fell into this category.
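For anyone who wants to check this kind of weighted mean against the table, here is a rough sketch for the fringe slices. Weighting the rounded %LgAvg column by plate appearances is my assumption for this illustration; the figures quoted above were computed from the underlying data, so a result from this shortcut can differ from them by a point or so.

```python
# (PA, %LgAvg) for the three fringe slices, copied from the results table.
FRINGE = {
    "AL": [(9555, 79), (7771, 74), (5907, 68)],
    "NL": [(10415, 82), (8648, 76), (6959, 80)],
}

def pa_weighted_mean(rows):
    """Average of the percentages, weighted by plate appearances."""
    total_pa = sum(pa for pa, _ in rows)
    return sum(pa * pct for pa, pct in rows) / total_pa

for league, rows in FRINGE.items():
    print(league, round(pa_weighted_mean(rows), 1))
```

The same function works for the scrub slices or for the combined fringe-plus-scrub group; only the list of rows changes.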
Clearly, traditional conventions related to replacement level (73 to 80% of league average), as denoted by the gray horizontal band through the figure, best match the "Fringe" player slices. Yet some of you may have picked one of the other expectations I outlined above regarding replacement players. After all, isn't replacement level supposed to represent a minimum level of MLB player production? Why is it that we're seeing so many players performing below replacement level?

Here's one explanation: replacement level tries to describe the level of production below which players will tend to be replaced by other players. After all, if someone is producing below replacement level, you should be able to get a different player at no additional cost that will perform better. Therefore, it does make sense that the group of players who did not have playing time sufficient to rank within the top 14 or so player slices will be primarily made up of guys who didn't perform up to this minimum standard. In those cases, teams apparently decided to give more playing time and roster space to players who could actually at least perform at replacement level.

Replacement-Level Fielding

To this point, I've been looking exclusively at offensive production of position players. However, players are not given playing time simply because of their offense--defense is important as well! So how do these same slices of players fare with respect to their fielding performance?

To look at that, I've calculated fielding stats based on conversions of the THT zone rating data using a method described here. The average, in this case, is the mean across both leagues within a particular season. Also, I only pulled these data for the players' primary positions, which means that the league totals will not sum to zero. Also, because I'm comparing across all positions here, I added a rough positional adjustment based on the difficulty of fielding different positions, as estimated by Tom Tango, which is part (though not all) of the reason that AL slices have lower-rated defense below: there are DH's in that league, and they get a hefty negative positional adjustment to put them on an even playing field with other players. Because playing time was so different across slices, the runs saved estimates are standardized to represent per-season-per-player rates.

You can see the fielding data in the table above, but here they are in graphical form. As before, slices are on the x-axis, and colored horizontal lines indicate league averages within each major slice category:

[Figure: fielding runs vs. average (per player per season) by playing-time slice, AL and NL]

Several things to note. First, unless there are massive park factor differences across the leagues, the NL and AL seem to value defense a bit differently. In the National League, the top slices of players in terms of plate appearances tend to be not only outstanding hitters relative to their league, but they also tend to be outstanding fielders. Below the top two slices, defense is apparently given less of a premium. Among American League clubs, however, the ~5 players per team receiving the most plate appearances tend to be below-average defenders. It's not until the bottom four starters that you tend to reach average fielding performance.

Second, with respect to replacement level, it looks like we can assume that bench and fringe players tend to be fairly average defenders. There might be a slight tendency for fringe players to be slightly above average, though it's not by more than a run or two. Furthermore, as you can see, the fielding values get rather volatile on the right side of the figure (almost certainly due to sample size--fielding stats are more volatile than hitting stats), so I think it's safest to assume that replacement level fielding is essentially the same as MLB-average fielding. This finding is consistent with similar work by Tom Tango. One thing that is very clear is that replacement players are not massively below average fielders. This puts into question the relevance of systems like BPro's Fielding Runs Above Replacement (FRAR), which sets replacement-level fielding to an approximate league minimum.

Closing Thoughts

I won't pretend that this is The Definitive Study on replacement level. There are certainly a variety of concerns that one can raise with respect to sampling, player identification, etc. Nevertheless, I think it is a good study that gives us an empirical grounding on how players perform relative to the amount of playing time they are granted by their teams and their health. From these data, we can draw some conclusions about how to best estimate replacement-level performance.

On offense, the slices that seem to best fit the description of replacement players--guys who are on the bubble of making a big league team--tended to hit at levels ranging from 68% to 82% of league average, with a rough mean of ~78%. Both of the popular standards I've seen, 73% (endorsed by Tango and Patriot) and 80% (endorsed by Woolner), fall nicely within this range. And, of course, both have their own theoretical and/or empirical justifications. But if you ask me for my recommendation after doing this study, I would recommend the lower figure of 73%. It is still a level of production above that which the "scrub" players, as I defined them here, produce. And if the primary function of a replacement level paradigm is to recognize a production threshold at which any freely available talent is likely to perform, it makes sense to me to be somewhat conservative in that threshold--that way you don't ignore production by, for example, a bench player hitting at 82% of league average, who might turn out to be difficult to replace.

With defense, it seems clear based on this and other studies that replacement players play roughly league average defense. Therefore, if you are interested in describing a player's total value above replacement level, I recommend that you follow this procedure: calculate their runs on offense relative to a hitter at 73% of position player league average runs per game, and then add to that value their fielding vs. average, as well as a positional adjustment to account for the difficulty of playing that fielder's position. More on that in coming days in my player value series.
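Here is a minimal sketch of that procedure. It assumes playing time is expressed as the number of team-games' worth of offense a hitter accounts for (the `games` argument); that conversion, the function names, and every number in the example are hypothetical and only there to show the arithmetic.

```python
def runs_above_replacement(player_rpg, league_rpg, games, replacement_pct=0.73):
    """Offensive runs above a replacement hitter.

    player_rpg: the player's runs per game estimate.
    league_rpg: league average runs per game for position players only.
    games: playing time in team-games' worth of offense (an assumed unit).
    """
    return (player_rpg - replacement_pct * league_rpg) * games

def total_value(player_rpg, league_rpg, games, fielding_vs_avg, positional_adj):
    """Offense vs. replacement + fielding vs. average + positional adjustment."""
    offense = runs_above_replacement(player_rpg, league_rpg, games)
    return offense + fielding_vs_avg + positional_adj

# Made-up player: 5.5 r/g hitter in a 4.9 r/g league over 17 games' worth of
# offense, +3 runs with the glove, at a position carrying a -5 run adjustment.
print(round(total_value(5.5, 4.9, 17, 3.0, -5.0), 1))
```

Keeping the offensive baseline and the fielding baseline in separate terms is the point of the design: offense is measured against replacement, while fielding is measured against average.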

References, Resources, and Acknowledgments

All player statistics were pulled from the stats pages at The Hardball Times.

Custom linear weights were calculated for each league via base runs using a spreadsheet created by U.S. Patriot, using initial "B" coefficients pulled from this article by Tom Tango, and using league totals pulled from Doug's Stats.

This work was stimulated, in part, by a great e-mail conversation I've been having with skyking162 the past several weeks about how to value players. Sky also provided some helpful comments on a draft of this article.

Patriot has written an extremely helpful essay on baselines at his site, along with many other excellent articles on related topics.

7 comments:

  1. Nice work. I wonder if a study was done using the same methodology for the time period of say 1984-1987 if the percentage of playing time for 'scrubs' would be the same then as it is now? Since one of the goals of statistical analysis is to be able to better identify players that can play, it would seem logical to assume that the percentage would be lower. For example, if a team has a hole at second base, shouldn't statistical analysis help identify and fix that problem more readily than simply plugging in guys until you stumble upon someone who can get the job done? If the percentage of playing time given to scrubs is the same now as it was then, does it mean that statistical analysis has failed in that area? Or maybe it just means the tools provided by statistical analysis aren't being used properly.

  2. Nice point. There actually is (at least) one directly similar study that I'm aware of, by U.S. Patriot, which was done on '90-'93 data. He did things slightly differently, but found that the mean performance of players from the 14th slice onward was 73.8% of league average in the AL. In my study, individuals in the 13th slice onward (not starters or bench) in the AL performed at 68% of league average. If anything, you'd expect my number to be slightly higher, because my group starts one slice earlier than Patriot's (the 13th rather than the 14th) and so includes one more player per team.

    So there's definitely some evidence for what you're suggesting. Pretty neat, actually--I was a bit confused as to why my estimate was lower than his. Your explanation seems pretty reasonable!
    -j

  3. Justin, what are your thoughts regarding the lack of position context in studies such as this? A guy who hits .270/.330/.450 would be a horrible LF or 1B, but is perfectly acceptable at SS or at C.

    If the Angels had a corner OF hitting .300/.400/.500 in Salt Lake, he's not really an option to unseat Cabrera at SS.

    While I'm not sure how to approach it methodologically outside of a separate study by position or position group, I'm not sure how much insight we can truly get from an aggregate analysis.

  4. First, position frequency didn't vary much across slices, especially in the American League (in the NL, there was a spike in the number of catchers relative to other players in slices 5-8...the same was not true in the AL). So I don't think there's any particular bias here.

    Second, I think that positional adjustments are a red herring when you're looking at offensive replacement level.

    The purpose of a positional adjustment is to account for differences in defensive difficulties at a position. Players aren't born as "shortstops" or "second basemen." As Keppinger showed this season, you can stick a player at an unfamiliar position if you want. And if defense didn't matter, you would always make sure that you placed your best offensive players in the lineup no matter where you had to play them.

    The reason you don't is that a corner outfielder, for example, would cost the team more in bad defense at shortstop than he would provide the team in good offense. Therefore, any positional adjustments should really be applied to fielding performance, rather than offensive performance. This, in fact, is what I did in this study when you look at those fielding totals.

    Basically, here's how I am inclined to assess player value:
    [totalvalue] = [offense-vs-replacement] + [fielding-vs-average] + [positional fielding adjustment]

    I'll also say that I think the idea of comparing offense at a particular position, like VORP does, is flawed because some positions are simply more talented than others. For example, several lines of evidence indicate that center field is the most difficult defensive position to play on the ball field. And yet, center fielders tend to be ~league average hitters. This means that center fielders tend, on the whole, to be above average players. Therefore, if you base their positional adjustment on their offense, you're undervaluing them...especially relative to a weak-talent position like second base, where players have an easier defensive job while also not hitting as well!

    I'll have more on this stuff in my next two player value pieces, starting tonight with baselines.
    -j

  5. Sorry for the delay on the baselines article. It was set to go before I noticed something about using runs per game vs. runs per ab to calculate runs above average or runs above replacement. That's ended up pushing me back about two days, but the end result should be better and more interesting. -j

  6. I wonder how much your results would vary if you measured quality by R/PA rather than R/G.

    I think that's where the variance between the leagues for the lower-ranked starters arises; lower-ranked starters in the NL are more likely to be pulled in a double-switch than their counterparts in the AL, and therefore accumulate fewer runs per game, even if they get the same rate of runs per plate appearance.

    At a glance, the PA are much lower for those slices in the NL than in the AL; are the G played (which you don't list) also lower?

    Secondly, the feature of AL starters being worse fielders: is it just that the top starters are more likely to be DHs than the lower ones? After all, DHs don't get injured as much as position players. You comment that the AL will look worse overall because of the DH, but surely there are more appearances by DHs in some slices than others?

    I think you could address the second problem by ignoring fielding of players who are primarily DHs in the fielding measurement. Take Giambi as an example: he plays some 1B, but mostly DH and he wouldn't play regularly if it weren't for the DH; if he was in the NL, he wouldn't play anything like as much.

    Incidentally, one thing I've noticed is that the NL puts a big premium on big-hitting 1Bs who are competent fielders (like Pujols), and so 1Bs gravitate to the NL if they can field well, and to the AL if they can't (because they can DH).

  7. Hi Richard,

    The problem with using R/PA instead of R/G is that I'm using absolute runs numbers... And those numbers don't take into account the full negative value of an out (the "inning killer" component that Tom Tango mentions in his pieces on run creation). Therefore, R/PA will tend to artificially up-weight individuals with low OBP's, and down-weight individuals with high OBP's. Here's a forum link where I wrestled through those issues.

    With respect to the DH's...First, remember, I only considered a fielder's primary position. Giambi is a primary DH, so his fielding at 1B was not included (he was rated +0 on fielding vs. his position). I did, however, include a negative position fielding modifier for DH's, which did pull down the AL fielding numbers...Nevertheless, when I looked at these data before adding that modifier, they still showed AL starters as being sub-par fielders. I can probably resurrect those unmodified data if you're interested.
    -j
