Friday, March 30, 2007

How should we calculate Zone Rating? (Part II)

In a previous post, I looked at a variety of ways to calculate zone rating--a fielding statistic--using the new dataset that is freely available at The Hardball Times. Tonight, I'd like to report some of the continued work I've done on this subject.

Introduction & Goals (you can skim this)

The first fielding statistic that I really felt at home with was devised by Baseball Info Solutions and published in The Fielding Bible last year. It had just about everything I'd want in a stat -- the methodologies behind it are straightforward (even if I can't replicate them myself), and it's easy to interpret, thanks to the fact that it's reported in terms of +/- plays vs. average (i.e. a +2 rating means that a player made two more plays than an average fielder would, given his chances).

Unfortunately, BIS decided not to release these statistics to the public this year. I inquired about buying them electronically, but they wanted at least $100 for my personal use--and probably would not have been pleased if I'd posted them here. Pretty steep price given that these stats were just a part of a $20 book last year. :)

Thankfully, the folks at The Hardball Times purchased detailed Zone Rating statistics from BIS. While they're not the same as the stats that were the highlight of The Fielding Bible, they're still very good: essentially they assign each fielder a zone (or rather, a set of zones) on the field and assess how many balls hit into that zone the fielder converts into outs. This has advantages over more traditional fielding stats like fielding percentage because it incorporates a fielder's range into the estimate of his quality, in addition to his sure-handedness and ability to throw accurately. And it's better than range factor because it accounts for the number of balls a player had the opportunity to field, rather than just assuming that all players get the same number of chances at a given position.

The THT stats, along with David Pinto's PMR stats, give us, the public, two high-quality fielding statistics that are available for free! The only problem is that it's not entirely straightforward how to use them. As I wrote about in my last piece, the ZR ratio calculated on THT.com doesn't include balls hit out of the player's zone, which seems like something we'd like to account for to understand good fielding (see Bill Hall at shortstop last year, for example). But how should we account for these out of zone balls? Furthermore, it has a similar problem to range factor in that each position has a different standard for what a "good" ZR is.

I'm excited to have a chance to use ZR stats this year as I evaluate players. But in order to use it effectively, I wanted to:
  1. incorporate, in an appropriate way, out-of-zone plays as well as plays made within a player's zone in the Zone Rating estimate.
  2. report the statistics in an easy-to-use +/- format.

Methods

Converting ZR to a +/- system.

This is a pretty easy thing to do because THT reports not only the classic ZR ratio (plays made / balls in zone) statistic, but also all the stats that are used to calculate it: balls hit into a player's zone (BIZ), plays made on balls in the player's zone (PLAYS), and plays made on balls out of zone (OOZ). This gives us the tools we need to convert ZR into a +/- statistic.

First, we'll calculate the average proportion of PLAYS per BIZ at each position by simply summing up the total number of PLAYS and total number of BIZ across all players in major league baseball at each position (easily attainable from THT's stats output) and taking the ratio. Here are the expected (average) ratios for each position based on 2006 data (I haven't confirmed that these numbers are stable across years...I'd guess that they don't vary too much though, except when you have a high-leverage player like Albert Pujols or Adam Everett throwing off the mean...which we do):

Position   Expected PLAYS/BIZ
1B         0.799
2B         0.820
3B         0.706
SS         0.818
LF         0.608
CF         0.811
RF         0.638
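
As a rough illustration of the pooling step, here is a minimal Python sketch. The row layout (position, PLAYS, BIZ) and the sample numbers are made up for the example; they are not THT's actual output format or real player lines.

```python
from collections import defaultdict

def expected_ratios(rows):
    """Pool PLAYS and BIZ over all players at each position, then take the ratio."""
    plays = defaultdict(int)
    biz = defaultdict(int)
    for position, player_plays, player_biz in rows:
        plays[position] += player_plays
        biz[position] += player_biz
    # Ratio of league totals, not an average of per-player ratios
    return {pos: plays[pos] / biz[pos] for pos in biz if biz[pos] > 0}

# Hypothetical rows: (position, PLAYS, BIZ)
rows = [("SS", 320, 400), ("SS", 250, 300), ("3B", 140, 200)]
ratios = expected_ratios(rows)
print(ratios)
```

Taking the ratio of the pooled totals means that players with more balls in zone count for more, which matches the "sum everything, then divide" description above.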

Once we've done this, we can easily convert a player's PLAYS data into a +/- Plays statistic using this equation:

+/- Plays = PLAYS(actual) - [BIZ(actual) * ExpPRATIO]

where PLAYS(actual) is the actual number of plays a player made, BIZ(actual) is the actual number of balls hit into a player's zone, and ExpPRATIO is the expected ratio of PLAYS per BIZ reported in the table above. Really easy stuff, and something anyone can do really quickly once you have the expected ratios.
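
For anyone who'd rather script it, the equation can be sketched in a few lines of Python. The ratios are the 2006 values from the table above; the function name and the example shortstop's numbers are hypothetical.

```python
# Expected PLAYS/BIZ by position, 2006 (from the table above)
EXP_PLAYS_PER_BIZ_2006 = {
    "1B": 0.799, "2B": 0.820, "3B": 0.706, "SS": 0.818,
    "LF": 0.608, "CF": 0.811, "RF": 0.638,
}

def plus_minus_plays(position, plays, biz, ratios=EXP_PLAYS_PER_BIZ_2006):
    """+/- Plays = PLAYS(actual) - [BIZ(actual) * ExpPRATIO]."""
    expected_plays = biz * ratios[position]
    return plays - expected_plays

# Hypothetical shortstop: 340 plays made on 400 balls in zone
pm_plays = plus_minus_plays("SS", 340, 400)
print(round(pm_plays, 1))
```

An average shortstop would be expected to convert about 327 of those 400 balls, so this made-up player comes out roughly 13 plays above average.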

Update: Tangotiger extended this work by reporting PLAYS/BIZ values from 2004-2006 on his blog. If you're going to use this approach on years other than 2006, you should probably use those data rather than what is in the table above.

We can do the same sort of thing with OOZ to estimate the average number of OOZ plays made per BIZ (this makes an assumption that the number of balls hit into the player's zone will be tightly correlated to the number of balls hit just outside his zone). Here are those ratios:

Position   Expected OOZ/BIZ
1B         0.415
2B         0.096
3B         0.150
SS         0.126
LF         0.030
CF         0.162
RF         0.031
(you almost have to wonder if they should add another zone of responsibility to 1B's)

Update: Tangotiger's post on his blog also provides the '04-'06 data, which would be better to use if you're doing these stats on years other than 2006. Note that he reports the expected OOZ (see below) rate as OOZ/(PLAYS + OOZ), but he reports the total BIZ and OOZ data over that time period that you'd need to calculate expected OOZ in a fashion like I did here.

We can then get a +/- figure for a player's OOZ plays as:

OOZ(actual) - [BIZ(actual) * ExpORATIO]

where OOZ(actual) is the actual number of out of zone plays a player made, BIZ(actual) is as above, and ExpORATIO is the expected ratio of OOZ per BIZ reported above.
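
To get a feel for how much the OOZ baseline differs by position, here is a small Python sketch using the 2006 OOZ/BIZ ratios from the table above. The function names and the 200-BIZ players are hypothetical.

```python
# Expected OOZ/BIZ by position, 2006 (from the table above)
EXP_OOZ_PER_BIZ_2006 = {
    "1B": 0.415, "2B": 0.096, "3B": 0.150, "SS": 0.126,
    "LF": 0.030, "CF": 0.162, "RF": 0.031,
}

def expected_ooz(position, biz, ratios=EXP_OOZ_PER_BIZ_2006):
    """Out-of-zone plays an average fielder would make, scaled by his BIZ."""
    return biz * ratios[position]

def plus_minus_ooz(position, ooz, biz, ratios=EXP_OOZ_PER_BIZ_2006):
    """+/- OOZ = OOZ(actual) - [BIZ(actual) * ExpORATIO]."""
    return ooz - expected_ooz(position, biz, ratios)

# Same hypothetical workload (200 BIZ) at two very different positions
first_base_baseline = expected_ooz("1B", 200)
left_field_baseline = expected_ooz("LF", 200)
print(round(first_base_baseline, 1), round(left_field_baseline, 1))
print(round(plus_minus_ooz("1B", 90, 200), 1))
```

With the same number of balls in zone, the average first baseman is expected to make about 83 out-of-zone plays while the average left fielder is expected to make about 6, so the same raw OOZ total means very different things at the two positions.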

How should we combine +/-PLAYS and +/-OOZ?

I took three approaches to go about finalizing this +/- statistic:
  • Ignore OOZ plays entirely and use only +/-PLAYS.
  • Simply add +/-PLAYS and +/-OOZ together.
  • Estimate a coefficient for OOZ via a regression on Pinto's PMR statistic (this was done at the suggestion of Dave in my prior post--thanks Dave!).
The first two are really straightforward. The last one is slightly more complicated. Basically, here's what I did:
  1. Converted PMR to a +/- outs statistic by subtracting expected outs from actual outs for each player (all of the needed data to do this are provided by Pinto in his reporting).
  2. Consolidated players at each position to match those reported by Pinto in his 2006 reports over the winter (his cutoff was that the player was present in the field for 1000 balls in play). I also had to calculate season totals for a number of players who switched teams during the year, since THT reports them on separate rows while PMR reports them as one row.
  3. Ran a multiple regression, setting +/-PMR as the dependent variable and setting +/-PLAYS and +/-OOZ as predictors.
  4. So long as both the effects of PLAYS and OOZ were significant, I recorded the regression coefficients from this regression. Then, to make calculations simple and straightforward (i.e. usable by others), I divided both coefficients by the coefficient for PLAYS, such that PLAYS would have a new coefficient of 1.0, while OOZ would vary from 1.0 depending on its weight in the regression equation.
  5. Used this OOZ coefficient to calculate a combined +/- statistic (I'll call it ZR_ADJ for now) that incorporates both PLAYS and OOZ:
    • ZR_ADJ = PLAYS(+/-) + WEIGHT * OOZ(+/-)
Finally, once I had all three +/- estimates based on the THT data, I compared these estimates to Pinto's +/-PMR data to see which estimates followed his work the closest. The assumption here is that, since PMR and ZR are supposed to measure the same thing, the +/-ZR estimate that is the closest fit to PMR is the best one to use.
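
Steps 3 through 5 can be sketched in plain Python as below. This is only an illustration under stated assumptions: it fits ordinary least squares with an intercept (the post doesn't say whether one was used), it skips the significance check in step 4, and the data are invented so that the true weights are known in advance.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares slopes (b1, b2) for y ~ intercept + b1*x1 + b2*x2."""
    n = len(y)
    mean_y, mean_1, mean_2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - mean_y for v in y]
    c1 = [v - mean_1 for v in x1]
    c2 = [v - mean_2 for v in x2]
    # Centered sums of squares and cross-products
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, yc))
    s2y = sum(b * v for b, v in zip(c2, yc))
    # Solve the 2x2 normal equations
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def zr_adj(pm_plays, pm_ooz, weight):
    """ZR_ADJ = PLAYS(+/-) + WEIGHT * OOZ(+/-)."""
    return pm_plays + weight * pm_ooz

# Invented +/- values, built so that PMR = 2*PLAYS + 1*OOZ exactly
pm_plays = [5.0, -3.0, 10.0, 0.0, -8.0]
pm_ooz = [2.0, 4.0, -1.0, -6.0, 3.0]
pm_pmr = [2 * p + o for p, o in zip(pm_plays, pm_ooz)]

b_plays, b_ooz = fit_two_predictors(pm_pmr, pm_plays, pm_ooz)
weight = b_ooz / b_plays  # rescale so the PLAYS coefficient becomes 1.0
print(round(weight, 3), round(zr_adj(10.0, 4.0, weight), 3))
```

Dividing by the PLAYS coefficient keeps the combined stat on the same scale as +/- PLAYS, which is what lets ZR_ADJ be compared directly with the simple sum.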

Results

The results vary by position. I'll divide them into three groups: middle infielders, corner infielders, and outfielders.

In the graphs that follow for each group (sorry about the small size), I report the closeness of fit between Pinto's +/- PMR data and the three +/- ZR-based fielding stats as "R2" (should be R-squared), which is defined as the proportion of variation in PMR data explained by the ZR data (ranges from 0 to 1, with 1.0 being a perfect match). I list the three ZR stats as ZR_THT (+/-PLAYS only, no OOZ), ZR_DIAL (simple summing of +/-PLAYS and +/-OOZ, named after Chris Dial as in my previous post), and ZR_ADJ (weight coefficient modifying +/- OOZ, added to +/-PLAYS).
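
When a single ZR stat is compared against PMR with a straight-line fit, R-squared is just the squared Pearson correlation. A minimal Python sketch, assuming that is the comparison being made (the sample lists are invented):

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Invented examples: a perfect linear match and no linear relationship at all
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])
unrelated = r_squared([-1, 0, 1], [1, 0, 1])
print(perfect, unrelated)
```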

Middle Infielders

As you can see, while there was a sizable improvement (double the R2) in the fit between ZR and PMR when you include the OOZ data, there was very little difference between simply summing PLAYS and OOZ (ZR_DIAL) and using the regression-based weighting coefficient to modify OOZ before adding this to PLAYS (ZR_Adjusted). Among middle infielders, these coefficients are fairly close to one (1.10 for 2B's and 1.28 for shortstops), so it's not surprising to see such a close correspondence. There's very little here to argue against just summing PLAYS and OOZ.

Corner Infielders

Oh, weirdness. First basemen are actually fairly similar to middle infielders, in that adding OOZ data makes a huge improvement in the R2. This is probably tied into the fact that first basemen typically make a huge number of plays outside of their zone, as currently defined (see OOZ/BIZ table above). Not much improvement when I used the weighting coefficient to modify OOZ in ZR_Adjusted.

Third basemen, however, are very different. First, there's a remarkably good correspondence between the ZR data and the PMR data. No idea why it'd be so good for third basemen over other positions. Second, you get a substantial improvement in fit when you add weighted OOZ to the PLAYS data (ZR_Adjusted), but not when you just add unweighted OOZ to PLAYS. The reason is that the coefficient modifying OOZ at 3B is the farthest from one among any of the infield positions (= 0.42). Therefore, it seems that among third basemen, OOZ plays should effectively be divided by two when calculating our ZR estimates.

By the way, that ridiculous point in the upper-right for all the 1B graphs? That's Albert Pujols. He made 39 more outs than expected of an average first baseman last year according to both PMR and ZR. Given his offensive production, that's just absurd. It's unfair to everyone else. He's definitely my pick for MVP last season.

Outfield

Here's where things get hard. As you can see, I didn't report ZR_ADJ for outfielders. The reason was that the regression equations never showed a significant effect of OOZ in any of the outfield positions. This makes anything you do with the OOZ coefficient pretty unreliable, so I didn't use them.

Explanatory power of the ZR stats is good in left field, and improves slightly when OOZ data are included. Even though OOZ didn't contribute a significant addition to the regression model, I think it's worth including them in the ZR estimate--they do improve the fit (slightly), and conceptually, I can't help but think that we should take those sorts of plays into account.

In center, the only significant relationship was between ZR_DIAL and PMR, and even then it only explains 10% of the variation. I'm not sure why center field is so hard to understand, though I imagine part of it must be due to the huge range of locations on the field where a center fielder might position himself throughout a game compared to other positions. It also may have to do with the variety of types of balls that are hit to center field. ZR does a poor job of dealing with different types of fly balls, whereas PMR seems to do a better job of at least accounting for this in its calculations. Looking at the ratings of center fielders, it's hard for me to say which stat is working better...though ZR's negative rating on Beltran does raise my eyebrows:

Rank  ZR                                                          PMR
1st   Willy Taveras (+34, rated +10 by PMR)                       Beltran (+18, rated -5 by ZR)
2nd   Curtis Granderson (+34, rated +4 by PMR)                    Corey Patterson (+16)
3rd   Corey Patterson (+22)                                       Joey Gathright (+15, rated +1 by ZR)
4th   Juan Pierre/Reggie Abercrombie (+15, both rated +4 by PMR)  Coco Crisp (+13, rated -15 by ZR)
5th   Brian Anderson (+14, rated +4 by PMR)                       Aaron Rowand/Johnny Damon (+12, rated -3 and +1 by ZR, respectively)

Right field looks similarly bad at first, but many of the issues there are driven by a single outlier who is rated highly by ZR and poorly by PMR: Brian Giles (rated as +10 by ZR_DIAL, but -21 by PMR). If you remove him from the dataset, the R2 rises from 0.29 for ZR_THT to 0.34 for ZR_DIAL. OOZ is still not significant in the multiple regression. But overall, as long as you ignore whatever the heck is happening with Giles, right field is very similar to left field--just with a bit weaker fit.

Discussion

Recommendations for ZR calculation

First, I personally find the +/- conversion to be a huge improvement over the traditional way the ZR data are reported, which is the ratio between plays made and balls hit into the zone. A +/- conversion allows anyone to immediately and intuitively assess a fielder's abilities. I highly encourage THT to consider adding two columns to their stats output that automatically report +/- PLAYS and +/- OOZ.

Second, in terms of how to incorporate the OOZ data into the ZR fielding estimate, the data seem to indicate that simply summing the two is perfectly adequate--and advantageous compared to not using OOZ data at all--for all positions except 3B. There, it's best to multiply +/- OOZ by 0.4 to down-weight its effects. I don't have any good ideas why OOZ plays should receive so little weight in this case--maybe it has to do with charging plays on bunts in the infield?

Update: Based on the discussion below, I'm torn about whether to use the coefficient on 3B. While it did result in a better fit with PMR, it could be that PMR behaves strangely with 3B and therefore that we're decreasing the accuracy of the ZR statistic by using that coefficient. Since we don't have an independent reason for adding the 0.4 coefficient (i.e. a "baseball" reason why OOZ plays shouldn't have as much of an impact on ZR at 3B), I'm now inclined to calculate it like all the other positions: [+/- ZR] = [+/- PLAYS] + [+/- OOZ].

Finally, while I've reported all values here in absolute terms, it is possible to convert +/- ZR numbers to a rate-like statistic. My suggestion is to divide the +/- ZR statistic for a player by his BIZ, and then multiply by 400 -- a number that seems to be at the upper end of how many balls a fielder (well, at least non-first basemen) will see in a full season. The actual number you multiply by doesn't matter, of course, as long as you're consistent across all individuals. You could even vary it by position, though I'm not sure if that's really worth doing.
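
A quick Python sketch of that rate conversion, using the 400-BIZ scale suggested above (the function name and the sample player are hypothetical):

```python
def zr_per_400(plus_minus_zr, biz, scale=400):
    """Scale a +/- ZR total to plays above/below average per `scale` balls in zone."""
    return plus_minus_zr * scale / biz

# Hypothetical part-timer: +12 plays on only 300 balls in zone
rate = zr_per_400(12, 300)
print(rate)
```

Here the part-timer's +12 in 300 BIZ works out to a +16 pace over 400 BIZ, which makes him comparable to fielders with a full season's workload.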

ZR Top-5 2006 Fielders

As a diagnostic, I wanted to close by having a quick look at the top player rankings at each position according to ZR (calculated as recommended above), PMR, and the Fielding Bible (FB; extracted from the Bill James 2007 Handbook).

Position  Rank  ZR +/-                    PMR                         Fielding Bible
1B        1st   Albert Pujols (+39)       Albert Pujols (+39)         Albert Pujols (+19)
          2nd   Doug Mientkiewicz (+24)   Lyle Overbay (+16)          Doug Mientkiewicz (+16)
          3rd   Chris B Shelton (+21)     4-TIED at +13               Kevin Youkilis (+10)
          4th   Richie Sexson (+19)       Niekro, Morales             3-TIED at +7:
          5th   Travis Lee (+15)          Dan & Nick Johnson          Garciaparra/Hatteberg/Lee
2B        1st   Aaron Hill (+21)          Orlando Hudson (+32)        Jose Valentin (+22)
          2nd   Jamey Carroll (+16)       Jamey Carroll (+26)         Aaron Hill (+22)
          3rd   Jose Valentin (+16)       Chase Utley (+25)           Chase Utley (+19)
          4th   Tony Graffanino (+16)     Aaron Hill (+24)            Mark Ellis (+13)
          5th   T-Ellis/Polanco (+15)     Mark Grudzielanek (+22)     Tony Graffanino (+13)
3B        1st   Scott Rolen (+27)         Joe Crede (+38)             Brandon Inge (+27)
          2nd   Brandon Inge (+25)        Pedro Feliz (+28)           Pedro Feliz (+25)
          3rd   Joe Crede (+20)           Brandon Inge (+26)          Adrian Beltre (+23)
          4th   Mike Lowell (+17)         Adrian Beltre (+22)         Joe Crede (+22)
          5th   Morgan Ensberg (+14)      Freddy Sanchez (+19)        Nick Punto (+15)
SS        1st   Adam Everett (+39)        Adam Everett (+35)          Adam Everett (+43)
          2nd   Craig Counsell (+25)      Bill Hall (+28)             Clint Barmes (+27)
          3rd   Bill Hall (+23)           Yuniesky Betancourt (+27)   Bill Hall (+18)
          4th   Clint Barmes (+23)        Craig Counsell (+19)        Craig Counsell (+17)
          5th   Alex Gonzalez (+18)       Clint Barmes (+17)          T-Reyes/Bartlett (+13)
LF        1st   Dave Roberts (+28)        Melky Cabrera (+18)         Dave Roberts (+16)
          2nd   Garret Anderson (+18)     Matt Diaz (+13)             Carl Crawford (+15)
          3rd   Juan Rivera (+18)         Reed Johnson (+13)          Alfonso Soriano (+15)
          4th   Matt Diaz (+16)           Dave Roberts (+12)          Ryan Langerhans (+15)
          5th   Alfonso Soriano (+14)     T-Murton/Fahey (+10)        Jason Bay (+14)
CF        1st   Curtis Granderson (+34)   Carlos Beltran (+18)        Corey Patterson (+34)
          2nd   Willy Taveras (+34)       Corey Patterson (+16)       Andruw Jones (+30)
          3rd   Corey Patterson (+22)     Joey Gathright (+15)        Juan Pierre (+25)
          4th   Juan Pierre (+15)         Coco Crisp (+14)            Curtis Granderson (+18)
          5th   Reggie Abercrombie (+15)  T-Damon/Rowand (+12)        Willy Taveras (+17)
RF        1st   Jose Guillen (+23)        Juan Encarnacion (+12)      Randy Winn (+22)
          2nd   Randy Winn (+20)          Damon Hollins (+9)          Alexis Rios (+20)
          3rd   Reggie Sanders (+17)      Ichiro Suzuki (+9)          J.D. Drew (+19)
          4th   Trot Nixon (+17)          Tied-Four @ +8:             Brian Giles (+18)
          5th   T-JDDrew/JJones (+16)     Drew/Jones/Freel/Quentin    Ichiro Suzuki (+17)

Interesting to see Brian Giles on the Fielding Bible list for top right fielders given how much he messed up the ZR vs. PMR calculations. :)

At first blush, neither PMR nor ZR seems to follow the Fielding Bible's +/- ratings more closely than the other. To check this, I calculated correlations between PMR, ZR, and the Fielding Bible based on individuals in the Bill James Handbook's top-10 lists (actually, they are partial correlations to factor out the influence of position). Here's the correlation matrix:

       FB     ZR     PMR
FB     1.00   0.42   0.49
ZR     0.42   1.00   0.40
PMR    0.49   0.40   1.00

It turns out that all three variables have almost equal correlations to one another, all ranging between 0.40 and 0.49. There might be a slightly higher correlation between PMR and FB, but not enough for me to worry about.
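
One common way to factor position out of a correlation is to subtract each position's mean from both stats and then correlate what's left. The Python sketch below shows that approach; it is an assumption about how such a partial correlation could be computed, not necessarily the exact procedure used for the matrix above, and the data are invented.

```python
from collections import defaultdict
from math import sqrt

def demean_by_group(values, groups):
    """Subtract each group's mean (here: each position's mean) from its members."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr_by_position(x, y, positions):
    """Correlation between x and y after removing position-level differences."""
    return pearson(demean_by_group(x, positions), demean_by_group(y, positions))

# Invented data: the two stats agree within each position but not across positions
positions = ["1B", "1B", "SS", "SS"]
stat_a = [1.0, 2.0, 11.0, 12.0]
stat_b = [21.0, 22.0, 1.0, 2.0]
raw = pearson(stat_a, stat_b)
partial = partial_corr_by_position(stat_a, stat_b, positions)
print(round(raw, 3), round(partial, 3))
```

In the toy data the raw correlation is strongly negative only because the two positions sit at different levels; once position is removed, the two stats agree perfectly.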

What this means to me is that we should incorporate both PMR and ZR into our evaluations of player performance. In fact, if you run a general linear model regressing the FB values on PMR and ZR, both predictors are highly significant, with almost identical sums of squares values. This indicates that both contribute useful, independent information that better helps us predict FB (often regarded as the best available fielding stat), and suggests that they should be weighted equally when interpreting player fielding performance.

This has been a monster of a post. :) But with this information in hand, we can make the most of the fielding stats that are available to us! I'm planning to put them towards a review of the 2006 Reds fielding--I hope I can get it done before the start of the regular season! :D

Update: One can convert the +/- Plays values into an estimated +/- Runs statistic using the runs per play values in this article by Chris Dial. They're probably not perfect conversions, as they're based on a different set of data (Stats Inc.'s zone rating rather than BIS's zone rating), but I bet they're close enough.