
Monday, June 09, 2008

Why do I keep using OPS?

In a comment to my latest "weekly" stat review for the Reds, Bluzer politely critiqued my use of OPS (and, by extension, PrOPS). OPS is certainly not a great statistic. For details on some of the arguments made against it, check out Patriot's fine assault on it here.

But, in general, OPS seems to work well, and it's widely used... and it's something I've been looking at for a long time to judge players, so I'm comfortable with it. And that's pretty much the whole reason that I've continued to use it around here.

But how much am I missing by relying on it so much? Would I be better off scrapping it altogether and sticking to linear weight-based statistics like wOBA or R/G? I'm sure the answer is "yes," but how much of a difference does it make? I generally do report R/G with my player reports, but even so, I'll admit that my eye goes first to OPS when I judge a hitter. Old habits die hard.

I decided to run a quick study to see what difference it makes. I pulled MLB team offense totals, 2005-2007. No particular reason to use three years, I just figured that I wanted to include more than one year in case there was something peculiar about a given year. And you can quibble with whether team totals are a good way to evaluate stats for individual players--after all, team stats are less variable than hitter stats (update: in this sample, team OPS ranged from 0.708 for the '05 Nationals to 0.829 for the '07 Yankees). But team stats provide a clean measure of actual runs scored, which allows us to ask how well a given rate stat predicts actual runs scored. And scoring runs is pretty much the definition of good offense, right?

Anyway, below are Pearson's correlations between runs scored and a variety of rate stats that I've seen proposed and used here and there around the internets. Numbers closer to 1.0 indicate a closer relationship between runs scored and the statistic in question:

AVG: 0.7244
OBP: 0.8340
SLG: 0.8421
TA-orig (~TB/PA): 0.8739
BOP (~TB/outs): 0.8955
FOES: 0.8990
OPS: 0.9231
ABSO: 0.9250
EqA: 0.9268
OBP*SLG: 0.9310
2OPS (2*obp+slg): 0.9323
GPA ([1.8*obp+slg]/4): 0.9326
RAA/PA: 0.9344
wOBA: 0.9350
R/G: 0.9355

Note: RAA/PA and R/G use my custom linear weights for 2003-2007 MLB. They have about twice as many parameters as wOBA (including stolen bases, gdp's, etc), and they're more specifically tuned to this particular era. Also note that because EqA is so damn hard to calculate, I used Patriot's conversions to get from EqRAW to EqR. So, EqA's spot in this list might be improved by a more careful calculation. Sorry 'bout that...but I just didn't have the patience to calculate it properly, as I'm not a particularly big fan of it because of its unnecessary complexity.
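
A comparison like this is straightforward to rerun outside a spreadsheet. Below is a minimal Python sketch, not the spreadsheet used for the numbers above: it computes Pearson's r between runs per game and a few of the OBP/SLG-based stats from the list. The team lines in `teams` are made-up placeholders (not real 2005-2007 totals), and all function names are just for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Rate stats built only from OBP and SLG, using the formulas given in the list.
def ops(obp, slg):
    return obp + slg

def two_ops(obp, slg):
    return 2 * obp + slg

def gpa(obp, slg):
    return (1.8 * obp + slg) / 4

def obp_times_slg(obp, slg):
    return obp * slg

STATS = {"OPS": ops, "2OPS": two_ops, "GPA": gpa, "OBP*SLG": obp_times_slg}

# Hypothetical (OBP, SLG, runs per game) rows; placeholders, NOT real team data.
teams = [
    (0.320, 0.390, 4.10),
    (0.330, 0.440, 4.55),
    (0.340, 0.400, 4.60),
    (0.325, 0.450, 4.70),
    (0.350, 0.420, 5.05),
    (0.360, 0.460, 5.60),
]

runs = [rpg for _, _, rpg in teams]
for name, formula in STATS.items():
    values = [formula(obp, slg) for obp, slg, _ in teams]
    print(f"{name}: r = {pearson_r(values, runs):.4f}")
```

Swapping real team totals into `teams` and adding further formulas to `STATS` would extend the comparison to the rest of the list.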

Here are those data graphically, which I think helps with interpretation:

That's pretty much the order I expected to see (though I had no idea where Bluzer's stats would fall out). Big improvement by going from AVG to either OBP or SLG. Pretty big jumps from SLG on up to OPS. But not much of an improvement after you hit OPS. The most accurate stats proved to be those based on linear weights: RAA/PA, wOBA, and R/G. That's gratifying, as I've been treating them as the gold standard statistics on this site for rating hitters... but they're incremental improvements, at best, over plain old OPS in terms of predicting runs scored.

I did find it interesting that Base-Out Percentage (BOP) and Total Average (TA), two representatives of the "bases divided by outs or PA" group of statistics that recently were discussed here, did so poorly. I didn't expect them to do as well as the linear weights statistics, but I didn't really expect OPS to do so much better than they did either.

In his comment, Bluzer recommended his latest statistic, ABSO. It is true that it did a tad better than OPS in this analysis at predicting runs scored. But at the same time, equally easy-to-calculate statistics like OBP*SLG or 2OPS did even better. GPA did the best of all statistics based on OBP and SLG, which is consistent with its reputation. But again, everything from OPS to R/G gives you very similar rankings. So, the lesson is that it really doesn't matter all that much which one you choose!

That's not a novel insight by any means. In fact, after I wrote this, I discovered this two-year-old study by Dan Fox that reports much the same thing, and takes it a step further by showing how OPS can break down into something very much like linear weights. But I like to do things myself, and it's nice to see the same trend borne out in yet another study.

So, in the end, I'm going to keep using OPS as one of the ways that I judge hitters on this site. I will keep on reporting R/G as well, as it has several other advantages in addition to being more accurate that come into play now and then (especially in terms of easy park factor corrections). But for most purposes, OPS is going to do just fine. Yay.

Update: In case anyone wants to play, I've uploaded a copy of the spreadsheet I used to calculate these results to this location. It also includes data based on 5-year team totals, which I calculated after the 3-year totals as a check. They conform very closely to the results I posted here, though wOBA beats out my R/G stat (barely), and EqA looks worse. Thanks to Baseball Reference for the data.

Update2: Victor Wang pointed out another relevant study that I'd missed (and I'm sure there are others as well), which he published in a recent SABR newsletter. He looks at the coefficient used to weight OBP vs. SLG in OPS calculations, and finds that the best coefficient varies considerably from era to era. In my dataset, 2005-2007, the 1.8 weighting (like what is used in GPA) works best. But I also have a 2003-2007 dataset (see the spreadsheet I linked above), and correlations from it match Wang's finding that 1.6 works better in those years than 1.8. But, as is the theme here, it really doesn't make all that big of a difference...
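
The coefficient comparison can be sketched as a simple grid search over k in k*OBP + SLG, keeping whichever k correlates best with runs. This is only an illustration of the idea, not Wang's method or the spreadsheet's: the inputs are placeholders, and the "runs" target is built synthetically with a known weight of 1.6 so that the search can be sanity-checked.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_obp_weight(obps, slgs, runs, weights):
    """Return (k, r) for the candidate k that maximizes corr(k*OBP + SLG, runs)."""
    scored = []
    for k in weights:
        stat = [k * obp + slg for obp, slg in zip(obps, slgs)]
        scored.append((pearson_r(stat, runs), k))
    r, k = max(scored)
    return k, r

# Placeholder team OBP/SLG values; NOT real data.
obps = [0.320, 0.330, 0.340, 0.325, 0.350, 0.360]
slgs = [0.390, 0.440, 0.400, 0.450, 0.420, 0.460]

# Synthetic target with a known weight of 1.6, used only to check the search.
target = [1.6 * obp + slg for obp, slg in zip(obps, slgs)]

candidate_weights = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
k, r = best_obp_weight(obps, slgs, target, candidate_weights)
print(f"best weight: {k} (r = {r:.4f})")
```

With real team totals, `target` would be replaced by actual runs scored (or runs per game) for each team-season.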


  1. And yet, no love for GPA? :)

  2. Justin, thanks for doing this. It confirms my claim that absolute average is ABSOlutely better than OPS, although (in this sample, at least) not by as much as I would have expected. I took your advice and downloaded the Lahman database, and am currently working on my own historical study (1901 through 2007) to compare the two. I'll be looking at individual player stats rather than team stats, which may make a difference. I also think that the linear weights-based stats may have benefitted from the fact that the time period you used was roughly the same period over which the linear weights were derived.

  3. @Dave, basically, if I'm going to use something other than OPS, I might as well be using a linear-weights based statistic. :) I'm just so used to looking at OPS's scale that it's hard to use anything else that's not a dramatic improvement. Even linear weights aren't dramatic improvements... And besides, I'm so divorced from batting average at this point that I can't get used to evaluating players on that scale again!

    @bluzer, the linear weights one uses should vary depending on the run environment. So yes, I'm sure that their accuracy would not be as good if used in a different era. But that's not an appropriate use of the statistics, so it's hardly a knock against them.

    It's worth noting that OPS and your ABSO stat are also essentially a form of linear weights. You're combining values by summing them, not multiplying them in a way that allows for interactions (see Dan Fox's article). Therefore, one can't predict that OPS or ABSO will hold up any better than linear weights in different run environments. It's true that you can calculate ABSO or OPS (or lwts) for any player in history. And in general, the rankings within a year (and especially within a team) will be pretty good using OPS or ABSO or LWTS. But one shouldn't compare the raw values across eras, because the run environment is different in each era.

    For example, Jimmy Wynn's 0.269/0.406/0.537 performance in 1969 looks impressive even today. But how much more awesome does it look when you realize that a) he played in the Astrodome, and b) NL hitters averaged 0.250/0.319/0.369 and scored 4.05 runs per game in 1969, not 0.266/0.334/0.423 and 4.71 r/g like they did in 2007? If you don't adjust for that difference, you miss out on a lot of his value. FWIW, B-Ref's "neutralize stats" button, set to 2007 NL in a neutral park, puts Wynn's 1969 performance at 0.309/0.485/0.582... which, if even close to being accurate, is pretty awesome, no?

    I really think that it would be worthwhile for you to consider using linear weights to evaluate players. Honestly, I just don't see any reason not to do so. They are well-established in serious research circles, they're extremely accurate, and they're easy to use. And, with them, you can calculate custom linear weights for each run environment (be it by league or by team, depending on your needs), adjust easily for park factors with decent accuracy (Patriot has published regressed park factors going back to ~1900 IIRC), and adjust to a relevant baseline for any given era (e.g. vs. average, or vs. replacement). There are established methods for doing all of these things (though the replacement level stuff is always contentious). And if you get really ambitious, you can try to adjust for quality of competition as well (which has increased quite a bit over time thanks to the huge influx of talent post-segregation and now with Latino/Asian players). So, again, I just can't see any reason not to use them, at least not if you're interested in assessing actual value, rather than finding some stat that better conforms to perceived value.
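
To put rough numbers on the Wynn comparison above, here is a small sketch that indexes a batting line against league-average OBP and SLG, in the style of OPS+ but without any park adjustment. It is a deliberately crude baseline comparison, not B-Ref's neutralization method and not a linear-weights calculation; the function name is made up, and the league averages are the ones quoted in the comment.

```python
def league_relative_index(obp, slg, lg_obp, lg_slg):
    """OPS+-style index with no park factor: 100 means league average."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

# Jimmy Wynn's 1969 OBP/SLG, as quoted in the comment.
wynn_obp, wynn_slg = 0.406, 0.537

# League OBP/SLG for the two run environments, as quoted in the comment.
nl_1969 = (0.319, 0.369)
nl_2007 = (0.334, 0.423)

vs_1969 = league_relative_index(wynn_obp, wynn_slg, *nl_1969)
vs_2007 = league_relative_index(wynn_obp, wynn_slg, *nl_2007)

print(f"vs. 1969 NL: {vs_1969:.0f}")
print(f"vs. 2007 NL: {vs_2007:.0f}")
```

The same raw line grades out noticeably higher against the 1969 league than against the 2007 league, which is the point being made about run environments. A park adjustment for the Astrodome would be a further step on top of this.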

  4. That last bit sounds too harsh to me as I re-read it. I think my main point is just that if one is going to do a historical study comparing players across eras, one should really make at least some effort to adjust for the different environments across years, leagues, and teams. Simply comparing OPS or ABSO (or, for that matter, raw R/G or wOBA) won't let you do that. But you can use a combination of custom linear weights, park factors, and performance vs. baselines to do this.

    So while you don't *have* to use linear weights by any means, there are a lot of advantages to doing so that extend beyond just the accuracy vs. runs scored argument (which is also a reason to use linear weights and not something else). So why not use them?

  5. the (pseudo) linear weights formula i developed based on BA/OBP/SLG had an R=.9329 for AL 2007.

    i agree that linear weights is the way to go, especially when comparing across eras because just looking at the neutralized B-R stats doesn't take into account the more subtle differences in event values

  6. Hi there,

    Hadn't seen that, but thanks for the heads up.

    I will say that I don't think you can compare that correlation to those I reported here, as they're based on different samples and sample sizes. Correlations are notoriously "relativistic" stats, so you probably want to use 2005-2007 MLB team totals to get an apples to apples comparison.

    I just tried to plug in your equation to my spreadsheet to get that correlation, but I keep getting a negative number, so I must be doing something wrong. Here's my equation:
    =((-0.3149*(1-W4)) + (((((0.022*((X4/V4)^2)-(0.152*(X4/V4))+(0.607)))))*(X4))+(0.32*((W4-V4)/(1-V4))))

    W = OBP, X = SLG, V = BA. I need to get going for a bit here, so I'll check this again later. But if you see my error, please let me know.
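
For readability, the spreadsheet formula above can be transcribed into Python, with W, X, and V replaced by OBP, SLG, and BA. This is a direct term-by-term transcription of the cell formula as posted, with no claim about whether the coefficients are right; the function name `pseudo_lwts` is made up for illustration.

```python
def pseudo_lwts(ba, obp, slg):
    """Transcription of the posted spreadsheet formula (W=OBP, X=SLG, V=BA)."""
    out_term = -0.3149 * (1 - obp)

    # Quadratic in SLG/BA, used as the coefficient applied to SLG.
    bases_per_hit = slg / ba
    slg_coeff = 0.022 * bases_per_hit ** 2 - 0.152 * bases_per_hit + 0.607
    slg_term = slg_coeff * slg

    # (OBP - BA) / (1 - BA), weighted by 0.32.
    on_base_term = 0.32 * ((obp - ba) / (1 - ba))

    return out_term + slg_term + on_base_term

# An illustrative, roughly average-looking batting line (placeholder values).
print(pseudo_lwts(0.270, 0.338, 0.423))
```

For a line like .270/.338/.423 the result comes out very close to zero, so the raw output behaves like a value relative to a baseline rather than a runs-per-game figure.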


  7. Justin, I think I may have misled you with my "1901 through 2007" comment. I'm actually not comparing stats across eras at all. (I know that I made another comment elsewhere about eventually wanting to be able to do this, but I'm not attempting it here.) I am only comparing stats within the same season and the same league. I think you'll like what I'm up to, actually. I guess we'll find out when I get it done. I'm up through about 1950 now.

    Yes, I agree that absolute average approximates linear weight values in the way that I pointed out in my comment on the earlier thread: "Since OBP is undervalued in OPS, and walks are overvalued in OBP, it makes sense to correct the discrepancy by increasing the weight of the most valuable component of OBP, i.e. hits."

  8. Justin - Don't worry about using OPS. People have been questioning it since the beginning of time, and the answers to the questions remain the same. Yes, OPS is not properly weighted. Yes, it doesn't use any units. Neither of those things matter if you just use it as a quick 'n' dirty estimator. Obviously, no one should use OPS in a serious study. But there's no need to fix OPS when it's used as a rough estimate. And if something more exact or complicated than OPS is needed, you might as well use something a lot more exact and complicated.

  9. Justin,

    I understand the differences in samples. I will test it against other years, I just have to alter the coefficients a bit for the different run environments.

    The equation you were using looks right, but in order to compare it against Runs per game, you have to multiply it by (27*(1+OBP)) to get runs per 27 outs and add 4.90.

  10. "(27*(1+OBP))"

    You want 27/(1-OBP)

    So, if you have a .500 OBP, you need 27 outs and 27 times on base.

    Greg Andrew is right that if all you want is Q&D, then OPS is fine. The problem is when people use it beyond the Q&D, and then we have to put up with crappy analysis as a result.

    So, since people can't make the distinction, just take the weapon out of their hands and stop using OPS.
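
The 27/(1-OBP) point above is easy to check with a couple of lines of arithmetic. This sketch assumes the simplification implied in the comment, namely that every plate appearance either reaches base or produces exactly one out; the function name is made up.

```python
def pa_per_27_outs(obp):
    """Plate appearances needed to make 27 outs, if each PA is on-base or one out."""
    if not 0 <= obp < 1:
        raise ValueError("OBP must be in [0, 1)")
    return 27 / (1 - obp)

obp = 0.500
pa = pa_per_27_outs(obp)
print(pa)                # 54.0 plate appearances
print(pa - 27)           # 27.0 times on base
print(27 * (1 + obp))    # 40.5, what the 27*(1+OBP) version would give instead
```

At a .500 OBP the corrected form gives 54 plate appearances (27 outs plus 27 times on base), while 27*(1+OBP) gives only 40.5, and the gap widens as OBP rises.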

  11. @HFB:

    Ok, I get these correlations:

    RPA (per AB): 0.9194
    RPA into runs: 0.9198
    RPA into runs (tango's conversion, which I think isn't quite right in this case): 0.9201

    Better-tuned coefficients might work better, of course, as this was set only for 2007 AL (if I understand correctly).


    Aside from my own personal OPS habit, a problem I run into is that a lot of the folks I target with my blog are only comfortable with OPS. If I start restricting myself to wOBA or R/G with everything I do, I'll have a harder time communicating with them.

    Still, any time I do any kind of real analytical work, which isn't often these days, I do focus on R/G or wOBA... -j

  12. As long as it's purely Q&D, then ok. Any other reason, then it's not ok.

    Even here, you are testing at the team level, even though you have a large group of players that perform outside the range of even the most extreme of teams. So, all you've really done here is test for players in the .320 to .360 OBP level (or whatever it is). But, the guys that interest us are outside this range.

    And, there are teams that are outside this range. Create teams at the game level, by selecting those games where the OBP was at least .350 and SLG at least .450.

    In fact, I already did all that in my Runs Created series, and those measures that aren't well constructed break down at those levels. And this includes Linear Weights and Runs Created.

  13. The derivation and logic behind the 1.7 or 1.8*OBP + SLG is detailed on my blog