
Monday, June 30, 2014

Zimmerman's Tommy John Primer

Ulnar Collateral Ligaments are fragile when you throw really hard.
The Reds have fortunately managed to largely avoid entanglement in ulnar collateral ligament repair surgery this year, but it is a major topic of interest in baseball because of the recent spike in the surgery that we are seeing.  Jeff Zimmerman penned the first part of an excellent primer on Tommy John Surgery at Hardball Times.  Of note: 20% of pitchers who have the surgery just don't make it back, velocity does NOT increase in the years following the surgery, and pitchers typically struggle their first year back.

The latter is probably well-known.  But the first point is not well appreciated, and the second is an important myth.  A few months ago, on Effectively Wild, Stan Conte, the VP of Medical Services of the Los Angeles Dodgers, was on for an interview about pitcher injuries.  One of the points he made is that the myth of increased velocity resulting from Tommy John surgery has resulted in some ridiculous scenarios.  Among them are stories of fathers requesting the surgery for their completely healthy sons in hopes of helping them to add velocity.  Can you imagine?

Zimmerman's next piece will investigate probable causes of the injuries.  The biggest culprit in conversations I've read and heard is fastball velocity, which continues to rise.  I'm looking forward to seeing the data on that, and whether those claims hold up.

Update:
I received this tweet last night.  It provides two great links to Jon's work that addresses my last paragraph:



Update #2:
Here's Zimmerman's second part, with an excellent summary of his findings (all of which I agree with):
  1. Pitchers are developing their shoulders better. This leads to a pitcher’s elbow giving out before his shoulder.
  2. Pitchers are throwing at a higher velocity, which means more stress on the elbow. More stress means more of a chance for a snap or rip.
  3. Younger pitchers are putting undue stress on their arms at too early an age, with too many innings and not enough rest time. The pitchers are damaged goods before they make it to the majors.

Monday, June 23, 2014

Branch Rickey & Allen Roth in 1954

In the history track in Sabermetrics 101 this week (the 4th week), we read an article by Branch Rickey that appeared in Life Magazine in 1954 describing his and Allen Roth's efforts to develop a model that would predict team success.  Here's the model that is at the heart of the article:
To break it down:
  • The top row is the Offense term.  It is essentially OBP + 0.75*ISO + Clutch.  Clutch was a catch-all term that tracked how often a team scored once runners were on base, and includes clutchiness, baserunning, luck, etc.
  • The bottom term is the Defense term.  It includes opponent batting average, the Walk+HBP term of opponent OBP, Opponent "Clutch", and a strikeout term (weight 1/8th...this was presumably necessary because it's just the extra value of a strikeout over and above what is already tracked in the batting average term).  F is fielding (independent of the other values), which Rickey & Roth basically punted.  In fact, they have a great line in the article: "There is nothing on earth anybody can do with fielding."  They just assigned it a zero and moved on, hoping it wouldn't matter that much.
Therefore, the equation amounts to:

Offense (O) - Defense (D) = G

Where G is a stat that will track run differential quite well.
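
For anyone who wants to play with the structure, here is a minimal Python sketch of O - D = G as I've described it above.  It takes the component rates as inputs rather than deriving them from counting stats, the function and argument names are my own, the sign on the fielding term is my assumption (it's zero anyway), and the sample numbers are made up purely for illustration.

```python
def rickey_g(obp, iso, clutch,
             opp_avg, opp_walk_hbp, opp_clutch, opp_so_rate,
             fielding=0.0):
    """Sketch of the Offense - Defense structure described in this post.

    All inputs are rates.  `fielding` defaults to 0, mirroring how
    Rickey & Roth punted on it; treating it as a subtraction from the
    defense term is an assumption on my part.
    """
    # Offense: OBP + 0.75 * ISO + Clutch
    offense = obp + 0.75 * iso + clutch
    # Defense: opponent AVG + opponent walk/HBP term + opponent "Clutch",
    # minus one-eighth of the strikeout term, minus fielding
    defense = opp_avg + opp_walk_hbp + opp_clutch - opp_so_rate / 8 - fielding
    return offense - defense


# Hypothetical team, NOT real data
g = rickey_g(obp=0.330, iso=0.140, clutch=0.350,
             opp_avg=0.260, opp_walk_hbp=0.085, opp_clutch=0.340,
             opp_so_rate=0.160)
print(round(g, 3))
```

A positive G means the offense terms outweigh the defense terms, which is the direction you'd expect for a team with a positive run differential.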

Neat, right?  

There are some problems that I see.  First, it seems like the ISO term is confounded with the R term, because a lot of the value of extra-base hits lies in driving runners home (and vice versa).  The second is unquestionably the over-emphasis on BABIP when tracking pitching performances (especially when they start relating this to individual pitcher performances; this was pre-Voros McCracken, after all!).  And there's also the lack of separation of the unique effect of the home run.  And finally, the units are sort of a mish-mash of arbitrary ratio units, rather than something that has immediate meaning like runs or wins.

In short, it's not Base Runs.  But it seems to work pretty well, based on the work they did on it in the 50's.  The article itself is a great read, with a ton of great quotes.  I highly recommend it.  It's neat to think that this kind of thing was happening 60 years ago...and how hard it must have been to do the analysis, before the days of Excel, MySQL, and statistical packages!***

***At one point in the article, they mentioned sending off their data for six weeks(!) to a stat department at an institution for "correlation analysis."  What would have taken a couple of hours today (mostly just getting the data together) took WEEKS of work using mechanical calculators, slide-rules, and lots of paper computation.

Thursday, June 19, 2014

Reds Odds Rising

Just because I wanted to capture it...the Reds are back to 0.500 entering today's game, and their playoff odds are at their highest since the season began:

They have a tough opponent coming to GABP tomorrow in the Toronto Blue Jays.  And they're still a bit of a long shot.  But it's fun to start dreaming again. :)

Tuesday, June 17, 2014

Brayan Pena having great year behind the dish

Early on, the story with Brayan Pena was the surprising offensive performance the Reds got from him during April.  His bat has cooled off a lot since then, and his overall hitting numbers are now more or less on par with his career numbers: 80 wRC+ in 2014, 73 wRC+ career, 80-81 wRC+ projected by ZiPS and Steamer over the rest of the season.  That's good for something like a 0.3-0.4 WAR catcher by the end of the season.

I noticed this today, however:

Pena (highlighted) is already at +9.5 context-specific runs saved by his framing skills!  If you remove context (specific count), he's only +3 overall.  But that's still very solid with just over a third of the season played.

Entering the season, Harry Pavlidis projected Pena at +4 runs from framing alone.  Therefore, he's probably already shot past that projection.  This is great to see, and is consistent with what some of the twitter-verse told me as the season began.

Pena's other fielding numbers have been good: 36% CS rate (26% league average), and +4 runs by Chris Dial's catching system against stolen bases, passed balls, and errors.  I don't like him that much as the team's backup first-baseman, but he's been everything you could ask in a backup catcher.  And, very clearly, he has been a terrific guy in the clubhouse.

Quick bit of prognostication: if we split the difference between the context-specific and context-neutral framing numbers (+6 runs), and give him +4 runs for his other fielding performances, that gives him +10 runs on the season.  If he can double that, he'd end up a +20 fielder, which is worth 2 WAR.  I don't know if he'll do that, but even 1-1.5 WAR by season's end would be a solid performance from a backup!
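
Here's that back-of-the-envelope math as a tiny Python sketch.  The 10 runs-per-win conversion is the usual rule of thumb and is what the "+20 runs = 2 WAR" line implies; the variable names are just mine.

```python
RUNS_PER_WIN = 10.0  # rule-of-thumb conversion (assumption)

framing_context = 9.5   # context-specific framing runs
framing_neutral = 3.0   # context-neutral framing runs
other_fielding = 4.0    # runs vs. stolen bases, passed balls, errors

# Split the difference between the two framing numbers (6.25, call it +6)
framing_mid = (framing_context + framing_neutral) / 2
runs_so_far = round(framing_mid) + other_fielding

# "If he can double that" over the rest of the season
full_season_runs = 2 * runs_so_far
projected_war = full_season_runs / RUNS_PER_WIN

print(runs_so_far, full_season_runs, projected_war)
```

This is only the fielding side, of course; it ignores his bat and any playing-time changes.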

Pirates Series Preview

Today I previewed the Pirates series at Red Reporter.  They're kind of the anti-Reds: great offense, lousy starting pitching.  The net result is pretty much the same; both teams are a tick below .500.
The Pirates were forecasted by many to regress from their fairy-tale season last year, and they have.  They sit a game below 0.500, a consequence primarily of their struggling pitching staff. While their rotation was a strength last season, they have struggled this year to put together good outings.  Even more problematic, their top two starters, Francisco Liriano and Gerrit Cole, recently found themselves on the disabled list with a strained oblique and shoulder fatigue, respectively. 
The good news for the Pirates is that their offense has been dynamite.  Their 105 wRC+ is the second-best in the league, and the arrival of Gregory Polanco might mean they could get even better.  The Reds' pitching staff could well have their hands full trying to contain the Pirates' bats.

Sunday, June 15, 2014

Replay Is Working, but Not For Reds

The implementation of expanded replay has come under a fair amount of fire this year.  We've all seen a scenario like this: after the obligatory dawdle while his video guy looks at the play, the manager challenges a ruling on the field, and we all watch a number of angles at home while the umpire at MLBAM makes his decision.  The replays seem to clearly show that the call was incorrect...but, after a minute or two of a wait, the call in the field is upheld.  The reason?  No clear and indisputable evidence that the call was wrong, despite it being pretty clear that it actually is.

This scenario has both hurt and helped the Reds this year, but either way it is maddening to watch.  Replay is supposed to help get calls right.  If it's upholding the wrong call, then what is the point of going through this?

On this point, I saw this tweet come across my twitter feed a few days ago:
I thought it was a really good point.  Replay isn't perfect.  But with every overturned call, we're seeing mistakes being fixed that otherwise would have been upheld.  Umpires still blow calls, and replay still fails to correct some of them.  But we've already seen hundreds of incorrect calls fixed this season.  Hundreds!

The link in that post is to Baseball Savant, which is happily tracking all of the replay challenges this year via pitchf/x data.  It's a great resource.  As of now, the number of overturned calls has risen to 246 calls, which is 47% of all challenges.  Here's how that number has evolved this year:


The blue line is a cumulative average for the entire season.  What we can see is that the rate of calls being overturned started at or around 40% in April, but then rose steadily through May and has flat-lined around the current 47%.

The 9-day moving average is a bit more revealing.  It seems as though the replay umpires began the season being pretty conservative for the first two weeks.  Then, they started to routinely overturn just over half of the calls from that point on...with a strange dip that occurred at the end of May/beginning of June (~23 May through 3 June).  I don't know what happened during that dip, but the rate of overturned calls dropped down to 30% for a brief period.  I'm going to speculate that the umpires are still making adjustments to their internal criteria for when to overturn calls.  Therefore, we might have seen a short-term adjustment to be more conservative, which was then quickly relaxed.
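
For the curious, here's a rough Python sketch of how a cumulative rate and a 9-day moving rate like the ones described above could be computed from daily challenge counts.  The daily numbers below are invented for illustration (they are not the Baseball Savant data), and pooling the counts within each window, rather than averaging daily percentages, is my own choice.

```python
def overturn_rates(daily, window=9):
    """daily: list of (challenges, overturned) tuples, one per day.

    Returns (cumulative, moving): the season-to-date overturn rate and
    the rate over the trailing `window` days, both as lists of floats
    (None on days when no challenges have been counted yet).
    """
    cumulative, moving = [], []
    total_ch = total_ov = 0
    for i, (ch, ov) in enumerate(daily):
        total_ch += ch
        total_ov += ov
        cumulative.append(total_ov / total_ch if total_ch else None)

        # Pool the last `window` days (fewer at the start of the season)
        recent = daily[max(0, i - window + 1): i + 1]
        w_ch = sum(c for c, _ in recent)
        w_ov = sum(o for _, o in recent)
        moving.append(w_ov / w_ch if w_ch else None)
    return cumulative, moving


# Made-up daily (challenges, overturned) counts
toy_days = [(5, 2), (4, 1), (6, 3), (5, 3), (4, 2), (6, 3),
            (5, 2), (4, 2), (6, 4), (5, 3), (4, 1)]
cumulative, moving = overturn_rates(toy_days)
print(round(cumulative[-1], 3), round(moving[-1], 3))
```

Pooling counts keeps a day with one challenge from swinging the line as much as a day with ten, which matters with samples this small.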


Replays and the Reds

So, if the MLB average is that just shy of half of all challenges are successful in getting a call overturned, why does it seem like the Reds almost always lose their challenges?

Probably, it's because the Reds have almost always lost their challenges.  The Reds are tied for the fewest challenges in all of baseball with 9 (the highest are the Cubs & Rays with 23 each).  They have won just 2 of their challenges (22%).  In contrast, the teams the Reds have played have made 14 challenges against the Reds, and have won 8 of them (57%).  Replay has not been kind to the Reds thus far.

Who is to blame?  The Reds' video guy?  Small sample sizes?  Bad luck?  I'd lean toward the latter explanations.  But...yeesh, let's hope their luck changes soon.
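
To put a rough number on the "bad luck" explanation, here's a quick binomial sketch in Python.  It assumes every challenge is an independent coin flip at the league-wide 47% success rate, which is surely too simple (managers pick their spots), so treat it as a sanity check and nothing more.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


# Reds: 2 wins in 9 challenges, against a 47% league-wide overturn rate
p_reds = prob_at_most(2, 9, 0.47)
print(f"{p_reds:.3f}")
```

Under those assumptions, a league-average challenger would win 2 or fewer out of 9 roughly one time in eight.  With 30 teams, a few of them should look about this unlucky by chance alone.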

Friday, June 13, 2014

Optimism, Carlos Gomez, and Projections

I surprised myself with an upbeat preview of the Brewers vs. Reds series today:
Going into their game against the Mets last night (as I wrote this), the Brewers had a 7.5-game lead on the Reds.  Now, one can't really hope for a sweep.  But can you imagine?  That'd suddenly put the Reds 4.5 games back.  With the Cardinals still hovering around 0.500, the Reds have an opportunity here.  I like the Reds pitching match-ups in all three games.  If the offense can build on what it did the last two games of the Dodger series, this could be an exciting weekend for the Reds.  Here's to optimism! Go Reds!
Somewhere recently, I saw someone write (I think for another team) that it's more fun to be optimistic and wrong than pessimistic and right.  I'm trying to adopt that view. :)

Carlos Gomez: A lesson on the importance of patience, scouts, and tools.
Photo Credit: Keith Allison
I also wrote a bit about Carlos Gomez.  He fits into the mold of a former top prospect who had been given up on by nearly everyone, only to discover himself.  Others include Jose Bautista and Edwin Encarnacion, although to his credit, Gomez was never DFA'd.
Some of Lucroy's best competition in the MVP race is his center fielder, Carlos Gomez.  I think Gomez is fascinating.  I find it almost impossible to believe that he's only in his age-28 season, because it seems like he's been around forever.  He was the key acquisition in the Twins' deal that sent Johan Santana to the Mets before the 2008 season.  Despite playing in the majors as a 22-year old, he was largely considered a flop in the aftermath of that trade.  He earned a strong reputation as a great defensive center fielder, but he didn't hit a lick.  The Brewers acquired him in exchange for J.J. Hardy, a deal that many (myself probably included) panned.  Hardy was often-injured, but he was a quality offensive shortstop coming off a bad season, while Carlos seemed like the definitive no-hit defensive center fielder.  Then, something happened in the second half of 2012: Gomez started to hit for power.  From July through September, Gomez slugged 15 home runs (previous season high was 8).  He continued to show that power through all of last season, and FanGraphs estimated his season value at 7.6 fWAR (though that might be a bit inflated by a +24 run fielding rating...though b-ref gave him +38 runs in the field and 8.9 WAR, so....).  He's one of the best players in baseball.
This is the kind of thing that has taken me forever to wrap my head around.  I am a projection guy, and rely on projections to guide my evaluations of players to avoid getting overly excited about new changes in performance.  In fact, MGL just wrote a great piece on the importance of this approach, which was then summarized by Dave Cameron.  But there are guys on whom I am bound to miss with this approach, and Carlos Gomez (and Bautista and Encarnacion) are cases in point.  Therefore, while I use projections, I also try to keep an eye out for toolsy guys who seem to finally be figuring it out.  In my view, the lesson is to trust your projection tools, but to still be cautious about their ability to forecast the future.  I'll always have a blind spot for this kind of player, but hopefully I'm not as blind as I once was.

Brandon Phillips might have been my first lesson in that.  Here's what I wrote when the Reds acquired him:
Previously a highly-touted prospect out of the Montreal farm system, this guy is apparently going to sit on our bench this year. He is out of options, and the Reds seem to have acquired him from the Indians because they didn't want to lose him via waivers and he's not good enough to be on their team. Phillips had a very good half-year as a 21-year old in the then-Expos' AA franchise, hitting 0.327/0.380/0.506/0.886 in 245 at bats in '02. But that's the last time he had a really outstanding, prospect-like season. He was probably rushed a bit to the majors in '03, but was thoroughly ineffective there and hasn't done a whole lot since. His '04 campaign with the Indians' AAA affiliate was decent (0.353 OBP), but he regressed a bit in '05. Krivsky mentioned in his rain delay interview with Marty Brennaman today that Phillips hit 15 homers last year in AAA, which is good. But his slugging percentage was a fairly poor 0.409, particularly given those HR totals. Overall...I wouldn't expect much from this guy's bat this season, and perhaps not ever.
Phillips might be overrated by some of the Cincinnati media, but there's no question that I missed the mark badly on this one.  Phillips has posted five seasons as an above-average hitter during his time with the Reds, has played legitimate gold-glove defense, and has topped 3 WAR five times (and posted 2.9 fWAR a sixth time).  He's been an excellent player for the Reds since almost the first day he donned the Reds' uniform.  The scouts were right.

Jesse Winker: a near-future answer in left field?

Reds' prospect Jesse Winker has been having a terrific season in high-A Bakersfield.  He was a 1st-round (49th overall) compensation pick in the 2012 draft when the Reds lost Ramon Hernandez, and so far has vastly outperformed the Reds' 1st selection in that draft, Nick Travieso**.  He was the Reds' consensus #4 prospect entering the season, and had really strong preseason Oliver projections (0.331 wOBA, *if* he played in MLB this season).  Well, he hasn't disappointed:


And that performance is despite missing a week or so with a concussion following a collision with a wall.  Apparently, it's not currently a factor.

Doug Gray had a nice piece yesterday comparing his raw performance to that of other recent uber-Reds prospects, namely Joey Votto, Jay Bruce, and Devin Mesoraco.  The short version?  Albeit with caveats, he's comparing extremely well.  Excellent walk rates, solid strikeout rates, and excellent power (as measured by ISO).

What are the caveats?  Three big ones:

  • Competition level.  Jay Bruce did spend some time at high-A, but also spent time at more advanced levels.  However, Votto and Mesoraco were primarily at low-A in their age-20 seasons, meaning that Winker's numbers are even more impressive in comparison with theirs.
  • Run environment.  Because the Reds' High-A affiliate is now in Bakersfield, Winker gets to play in the California League.  That league has the highest run environment among leagues above rookie ball (~5.3 runs/game when I looked at this between 2007-2009).  And furthermore, Bakersfield is a moderate hitters' park for that league (runs park factor 1.03), and is especially favorable to the home run (1.15).  By comparison, the Midwest League is quite pitcher friendly, even if Dayton is a bit of a hitter's park (park factor 1.07), meaning that the games played by Votto and Mesoraco are close to neutral.  When Bruce played in high-A, he was playing in Sarasota with the Florida State League, which is about as pitcher-friendly of a league as you can find.
  • Sample size.  Doug is comparing Winker's 239 PA's against a full minor league season by Bruce, Votto, and Mesoraco.
Nevertheless, despite all of that, it's very encouraging to see Winker performing so well!  I think it's reasonable to expect a mid-season promotion to AA at this point.  If he hits well there, he could potentially be at least a long-shot for left field coming into spring training in 2015.  This possibility was discussed by Joel Luckhaupt in a recent Redleg Nation Radio.  Goodness knows that the Reds could use some help in left field!

** This was not to condemn Nick Travieso.  After a slow start to his career, Travieso is having a quality season at Dayton this year as a control-oriented starter.  His strikeout rate is still far from where I'd like it to be, but he could potentially still work out to be a back-end starter.  Doug Gray has said that he's seen Travieso throw in the mid-90's over the past year (albeit inconsistently), so there might still be stuff there to succeed.

Thursday, June 12, 2014

Cameron: Johnny Cueto's Fastball Unhittable

Dave Cameron pointed out today that a big part of what has made Johnny Cueto so amazing this year is that his four-seam fastball has been nearly unhittable:
I mean, sheesh.

Cueto is known as a sinkerballer, and that's still very much what he is.  His ground-ball percentage still stands at 53%, and he still throws his sinker a healthy portion of the time (22.5%).  But Cueto has been seeing uncanny results with his four-seam fastball, which he is throwing a bit more often this season:

***Note: Tony Cingrani also appears on this graph of low wOBA allowed.


That increase in his four-seamer frequency seems to be part of why he's showing a 1 mph increase in his "fastball" velocity this season, although all of his pitches are seeing small velocity increases:


I tend to expect that outliers as substantial as Cueto is in that first graph are due for a regression.  That shouldn't be a surprise, because Cueto has been so unbelievable this year.  It's pretty rare for a pitcher to be almost twice as good as everyone else on something.

Nevertheless, the Dodgers announcers were talking a lot about Cueto and the deception in his delivery last night.  I think that Cueto's increased velocity, coupled with the deception of how well he hides the ball during his delivery, are a big part of why he's been able to accomplish what he has this year.  As Cameron states, this is not just a BABIP-inspired oddity.  I can't explain it either, but I am enjoying the ride.

Saturday, June 07, 2014

Reprints: How Hitting Statistics Explain Runs Scored

This week in the SABR101x course, we're covering hitting statistics.  The lesson was very similar to an article I wrote 6 years ago on this site comparing the ability of different offensive stats to predict runs scored (many others have written such articles; it's a classic approach to addressing the question of hitting stat quality).

In it, I argued that there wasn't much of a problem to using OPS if it improved communication because the gains from it to other, better stats (like wOBA) were so meager.  Fortunately, in the time since, FanGraphs has popularized wOBA so much that I feel pretty comfortable just reporting it and ignoring OPS altogether.  And in many cases, I've even moved on to using wRC+ to get the advantages of park controls and run environment-neutrality.  OPS just isn't necessary anymore.

In any case, I thought it would be fun to reproduce that article here.  Here's the most relevant graph.  The rest appears below the jump.


Thursday, June 05, 2014

Selling Jeans: Ballplayer Height, Weight, and BMI

So, my question for today is: how have the physical attributes of ballplayers changed over the years?  Let's look at this graphically.

Player Height

I'm reporting all dates as birth year, as that seemed a logical way of organizing players.  I'm also throwing out the edges of the database that contain fewer than 50 players per birth year.  That means, for recent years, I'm not including anyone born after 1990 (i.e. 24 years old in 2014).
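
In case it helps, the grouping and the 50-player cutoff described above can be sketched in a few lines of Python.  This is not the query I actually ran; the function name is mine, and the toy rows at the bottom are invented just to show the filter working.

```python
from collections import defaultdict


def mean_height_by_birth_year(players, min_players=50):
    """players: iterable of (birth_year, height_in_inches) pairs.

    Returns {birth_year: mean height}, keeping only birth years with at
    least `min_players` players.  Rows with missing values are skipped.
    """
    by_year = defaultdict(list)
    for birth_year, height_in in players:
        if birth_year is None or height_in is None:
            continue
        by_year[birth_year].append(height_in)
    return {
        year: sum(heights) / len(heights)
        for year, heights in sorted(by_year.items())
        if len(heights) >= min_players
    }


# Invented rows; min_players lowered so the tiny example shows the cutoff
toy_players = [(1900, 69), (1900, 71), (1900, 70), (1991, 75), (None, 72)]
print(mean_height_by_birth_year(toy_players, min_players=3))
```

The same function works for weight by swapping in a weight column.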

We can see that, after an initial surge of extremely short players in late-1880's ball, average player height quickly reached 70 inches (5'10", aka jinaz-standard height) as baseball became more professional and required players to be top athletes.  Since then, players have progressively gotten taller as a group.  Currently, baseball players average just shy of 74" (6'2").

No real surprises here.  Thanks to some combination of improved diet, sanitation, medicine, and social programs, average human height has increased four inches in the past 100 years, and ballplayers are right on track with that increase:
Furthermore, major league baseball players tend to be taller than the average population.  Current average height is 5'10", while ballplayers today average 6'2".  Among those born in the 1920's, average height was around 170 cm on this graph (67", or 5'7"), while baseball players averaged 5'10".

Weight

This one's a bit more interesting:
So, we have a steady increase to 1920, then a slight increase that follows height...and then BOOM, something happens.  Weight shoots from 188 lbs to 206 lbs in a matter of 14 years (1965-1979 birth years).  That corresponds to players who played their age-27 years between 1992 and 2006.  What gives?

Before we address that question, let's first look at one more graph.


Body Mass Index


So here, we're seeing a metric that tracks both height and weight in the same number.  And again, we're seeing a steady drop in BMI as the game becomes professional, a flat-lined BMI for many decades, and then a spike again once we hit 1965 babies.
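
For reference, BMI is just weight in kilograms divided by height in meters squared.  Here's a small Python sketch that does the conversion from pounds and inches.  Pairing 205 lbs with 74 inches is only an illustration using round numbers from this post, not a figure read off the graph.

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
IN_TO_M = 0.0254       # exact definition of the inch in meters


def bmi(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches."""
    kg = weight_lb * LB_TO_KG
    m = height_in * IN_TO_M
    return kg / m**2


# Roughly a recent-era ballplayer: 205 lbs at 6'2"
print(round(bmi(205, 74), 1))
```

Because height is squared in the denominator, a group that gets both taller and heavier can show a flat BMI, which is why the weight and BMI graphs don't have to move in lockstep.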

The knee-jerk reaction is to claim that this matches up pretty well with the PED era.  There are no clear fenceposts for when that era began and ended, but I tend to think of the steroid era running from around 1994 (the year of The Strike) until the advent of MLB's testing program in 2003.  The steep part of the slope begins and ends, more or less, with players who peaked during that period of time (1992 through 2006).

The interesting thing is that it hasn't really dropped that much since MLB started its testing program.  Average weight of players has decreased slightly from its peak in 1982 babies (208.7 lbs) through 1989 babies (205 lbs).  Height has also dropped slightly during that time (0.3 inches), so BMI changes very little in that time.  That span describes players who are currently ages 25-33.  These are players who, by and large, have played their careers in a setting in which drug testing was a thing.  And yet, while weight and BMI have declined, we're a far cry from where we were before that spike.  If the spike in weight and BMI occurred due to steroids taking over the game, and if the current testing program works well enough that steroids are now largely NOT a part of the game, we'd predict weight and BMI to return to pre-steroid levels.

My feeling is that some of this could be steroids.  But I think there are two other important factors that could be involved:
  1. A shift in training regimes of players: an emphasis on being bigger, stronger, and faster through weight lifting and nutrition...and for scouts to prefer bigger players.
  2. An influx of international talent (including lots of big guys) that expands the pool of available talent.  If you have more players available to choose from, and baseball favors larger humans, you'll be able to shift up the averages by casting a larger net when selecting players.
The latter one seems like it could be a big deal, and I can think of a few ways to test it.  But I'll need to sharpen up my database skills a bit better to do so. :)

Thoughts?

What ex-Ballplayer Was Born at Sea?

So, I was playing around with some basic queries in the Lahman Database for the SABR101x course.  I decided to do a search on the birthCountry column.  Here's something that caught my eye:

Who is that?  Well:
Ed Porray pitched 10 innings for the Buffalo Buffeds in 1914.  The Buffeds were part of the Federal League.  He finished the season, and his big league career, with a 4.35 ERA and a 6.99 FIP.  And, he's now the answer to a trivia question!
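
If you want to try this kind of search yourself, here's a self-contained Python sketch that uses an in-memory SQLite table as a stand-in for the Lahman Master table (the course sandbox runs MySQL, but the SQL is the same for a query this simple).  The playerID values and the non-Porray rows are placeholders I made up, and the 'At Sea' spelling of the birthCountry value is an assumption on my part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Master ("
    "playerID TEXT, nameFirst TEXT, nameLast TEXT, birthCountry TEXT)"
)
# Placeholder rows for illustration only; not the real database contents
conn.executemany(
    "INSERT INTO Master VALUES (?, ?, ?, ?)",
    [
        ("toy01", "Player", "One", "USA"),
        ("toy02", "Player", "Two", "USA"),
        ("toy03", "Player", "Three", "D.R."),
        ("toy04", "Ed", "Porray", "At Sea"),
    ],
)

# How many players were born in each country, rarest first?
counts = conn.execute(
    "SELECT birthCountry, COUNT(*) AS n FROM Master "
    "GROUP BY birthCountry ORDER BY n ASC, birthCountry ASC"
).fetchall()

# Who is behind the odd-looking value?
rare = conn.execute(
    "SELECT nameFirst, nameLast FROM Master WHERE birthCountry = ?",
    ("At Sea",),
).fetchall()

print(counts)
print(rare)
```

Sorting the grouped counts from rarest to most common is a handy way to surface oddball values like this one.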

Wednesday, June 04, 2014

The First Week of SABR101x

We're at the end of the first week of Andy Andres' SABR101x course, offered through edX.  Having completed the materials, I thought I'd share a few reflections.

The edX Platform & Distance Education

The discussion forums are an important part of what makes courses on edX work.
This is my first edX course, but I don't think it will be my last.  I'm pretty impressed with it as a platform.  It strikes me as an excellent learning platform, with the ability to deliver a tightly organized course that presents information in multiple ways.  Furthermore, it allows students to interact with assessments via multiple choice-style questions as well as text entry, and to interact with each other via targeted discussion boards that can be inserted into specific stopping points within lectures.  

I'm a college professor in my day job.  I teach brick-and-mortar classes, and have avoided digging into the realm of online classes.  One of the things that I'll be doing is taking a look at how the course is constructed, both in terms of information progression as well as the mechanics of how Andres presents the material.  There's a lot to like, here.  The lectures are presented in short video format that usually runs 5-13 minutes in length.  In between, there are at least a few quick assessment questions, which gives students a chance to think about and process what they've just learned.  And intermixed with the lectures are short, 1-2 page written explanations that complement, but are not redundant with, the lecture material.

One thing that I didn't anticipate was how much I like having the narration to go along with the video/audio.  As a learner, I know that I do best when I can both see and hear something.  But, aside from video games, it's rare that I've had the chance to watch and listen to a narrative at the same time.  I can tell that I can grasp concepts much better when getting to read and see at the same time.  I don't plan to turn on the subtitles on my home TV any time soon, but it's great for an education setting.  Along the way, I'm keeping a Google Doc window open where I can take notes as well.

As a side note, reading the narration allows for fun little quirks.  I feel bad for whoever they had transcribing all of the words, as that doesn't seem like a fun job.  But watching them try to spell Voros McCracken's name was funny ("Vhoorees," I think?).


SABR101x Content, Week 1

This week began with a basic introduction to sabermetrics.  Andres started exactly where I tend to start most of my courses: by defining terms.  He spent a lot of time looking at dictionary definitions, as well as definitions from those in the disciplines covered here: sabermetrics, statistics, data science, and big data.  These kinds of discussions always seem a little bit laborious.  But at the same time, they provide the opportunity to dispel a lot of misconceptions.  They also help reinforce the idea that we need to be precise with our language.  I found the definitions when discussing databases to be particularly helpful, because I have very little background in that area.

Beyond definitions, this week was pretty light in content.  We took our first stab at running some MySQL queries in the BUx SQL Sandbox that Andres and his team set up on edX, and it worked well enough.  They set up a Lahman database and got everything set so that users needed only to type in the queries as presented in order to retrieve their data.  There is no inherent need to set up one's own SQL server/workbench to complete the course (although I did just that; see below).

Assessment thus far has been pretty light.  Some of the questions have been recall of minutiae.  For example, one of the first questions asks you to report the year in which Bill James coined the term sabermetrics.  Good grief. :)  But most answers have been readily apparent from the videos, if one is paying attention & taking at least light notes.  Coding submissions are graded based on the output the MySQL server returns for your query, as far as I can tell.  So far, the coding assignments have basically been copy-and-paste exercises that require almost nothing from the student.  Still, a glance at the discussions shows that students are still having trouble with this.  Therefore, basic practice in syntax and input is probably appropriate at this stage of the course.  Future modules will almost certainly require a bit more thought in the assessment sections.

There is also a History of Sabermetrics track in the course.  This week's focus was on Henry Chadwick.  Chadwick is sometimes known as the Father of Baseball, and is sometimes mentioned as an early pioneer of baseball in the same breath as Abner Doubleday (who, for a moment, was confused in my mind with Albus Dumbledore!  Go figure!).  But, as Andres notes, he was also the first real sabermetrician.  While he might not have actually invented box scores, he established a careful approach to observing, recording events, and reporting on games that was pioneering.  He also was instrumental in carefully recording and refining the rules of the game.  Furthermore, through his writing in newspapers and his books on baseball, he was instrumental in publicizing and popularizing reports of baseball.  He's a guy that I've read a bit about before, most notably in Alan Schwarz's The Numbers Game (which I read close to a decade ago!  'Tis a bit fuzzy).  Nevertheless, I found it a neat little foray into baseball history to learn more about him.  I'm looking forward to more of these history segments.

My Own MySQL Workbench

A local copy of MySQL Workbench offers a lot of
usability advantages over running from the course sandbox.
In order to get more practice, and to be set up to work on my own, I did opt to get a MySQL server running on my own computer.  I went to MySQL's website and downloaded their installer for "MySQL on Windows."  It was pretty easy to set up, although there was one hiccup where a certain "ODBC Connector" file (whatever that is) was not found by the installer and I had to download and install it manually.  Once installed, I launched the program and selected Database-->Connect to Database from the menu.  That launched the workbench, which gave options to "Startup/Shutdown" the server.  Once started, my next step was to install the Lahman database (Andres provided a specific one to users of the site--they apparently made some changes?  There were two files...I went ahead and installed both as a SABR_101x schema in MySQL, seems ok!).

Now, I'm set to run queries!  Everything that works in the course works on my rig, although mine was installed such that all table names are lowercase.  There's an option in the server settings to not do that, but things were getting screwy when I changed that.  So...I'm just going to remember that this is a difference between the course and my computer.  I actually prefer this, because tables are not case sensitive on my system.  But I'm sure I'll get a few submissions wrong in the course as a result!
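For anyone curious what the case-insensitive behavior looks like in practice, here is a minimal sketch.  It is an illustration under loud assumptions: it uses Python's built-in sqlite3 module as a stand-in rather than MySQL (SQLite happens to ignore case in table names, much like my local install), and the three rows are made-up placeholders, not real Lahman data.  The table and column names (Batting, playerID, yearID, HR) follow the Lahman naming.  On a real MySQL server, this behavior is governed by the lower_case_table_names server variable.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a local MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER)")

# Placeholder rows for illustration only; these are NOT real Lahman records.
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?)",
    [("player_a", 2013, 30), ("player_b", 2013, 12), ("player_a", 2012, 25)],
)

# Total home runs per player, highest first.
query = (
    "SELECT playerID, SUM(HR) AS total_hr "
    "FROM {table} GROUP BY playerID ORDER BY total_hr DESC"
)

# The same query with the table name spelled two ways.
rows_mixed = conn.execute(query.format(table="Batting")).fetchall()
rows_lower = conn.execute(query.format(table="batting")).fetchall()

print(rows_mixed)
print(rows_mixed == rows_lower)  # same rows either way on a case-insensitive setup
```

On a server where table names are case sensitive, the second spelling would fail with a "table doesn't exist" style error instead, which is exactly the kind of slip I expect to make when submitting to the course.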

The interface of this workbench is light-years nicer than what I used when following Colin Wyers' instructions some years ago to install the Essentials SQL server/workbench.  There are options to save queries as script files, which is huge.  As you're editing, the editor color-codes commands, and offers pop-up help whenever you put your cursor on specific functions or operators.  I also love the schema view: you can select multiple columns in a table--or even multiple tables--with your mouse, right click, and it will automatically add the appropriate bare-bones SELECT text.  It's very nice.



That's all I have for now.  If you're on the edX course, I'm going by Justin90 there.  Please feel free to say "hi" if you see me on the forums.  Or, of course, just chime in here!

Big Data Coming to Baseball


Baseball sabermetrics has really always been about fairly large datasets.  A single batter's hitting line often reaches 700 plate appearances in the course of a season.  There are over 1000 players who have donned a cap in major league baseball this year, with thousands more in the minor leagues.

One of this week's topics in Andy Andres' Sabermetrics 101x course is Big Data, with the capital B and D.  There are a number of good points made in his introduction, but one of the most important is that big data does not always mean better data.  Big data can be fraught with bias (lack of controls, systematic bias in data collection), a problem that Colin Wyers has long railed against with our favorite defensive metrics.  Big data can also lead to false conclusions because standard statistical approaches no longer work well.  When your sample size gets into the thousands, P-values below 0.05 get easy to reach, even when the difference found is too small to have any practical meaning.
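Here is a minimal sketch of that last point (not from the course).  It assumes a plain two-proportion z-test with equal group sizes, and the two rates (0.300 and 0.310) are hypothetical numbers picked only for illustration.  The gap between the groups never changes; only the sample size does.

```python
import math

def two_proportion_p_value(p1, p2, n):
    """Two-sided p-value for a two-proportion z-test with n trials per group."""
    pooled = (p1 + p2) / 2  # equal group sizes, so the pooled rate is the simple mean
    se = math.sqrt(2 * pooled * (1 - pooled) / n)  # standard error of the difference
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided tail area of the standard normal

# Hypothetical rates: the same 10-point gap at ever larger sample sizes.
for n in (1_000, 10_000, 100_000):
    print(n, two_proportion_p_value(0.300, 0.310, n))
```

The p-value shrinks as n grows even though the underlying difference is identical, which is why "statistically significant" and "meaningful" drift apart once the datasets get huge.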

This spring, MLBAM announced a new stream of data with its new player tracking system.  This system, which is already in operation in a few parks this year, will track virtually everything on the baseball field: player position, and ball trajectories from the pitcher's hand and from the bat into the field.  It is essentially a replacement of the pitchf/x, hitf/x, and fieldf/x systems we've been using (or, in the latter two cases, at least hearing about).

It's incredibly exciting to hear that this will be used.  The question will be how much of these data will be available to the fan community at large.  We probably don't need to be able to download all 7 TB of data.  But hopefully, we'll get something.  My wish list:
  • Everything we currently have with the pitchf/x system for tracking pitches.
  • The equivalent of pitchf/x, but for batted balls.  Therefore, we'd get vertical trajectory, the direction the ball was hit, velocity, spin, and landing location/hang time.
  • For fielding, ball landing location alone would be a great step forward.  It could be fed directly into our UZR-like algorithms, and immediately improve them by removing systematic bias in our hit location data.  Initial player position could also be really useful, if for nothing else than to help distinguish shifts when they happen.
Some of the other measures they show in the videos, like acceleration, reaction time, and maximum speed, could be interesting, and it would be great if they were released along with the stuff above.  But I'm also not sure, for the time being, that I'll be all that interested in them.  With what is described above, we'd have a wealth of useful analytical information at our hands.  From the perspective of valuing players, which is often my main interest, it really doesn't matter if a player gets from point A to point B because they get a great jump, because they have great route efficiency, or because they're fast.  What matters is that they get there.  

Of course, if I'm a team, I care a lot more about the minutiae.  It might be that maximum speed can't be taught.  But reaction speed might be able to be taught, and route efficiency almost certainly can be (right?).  But personally, I'm most interested in just evaluating player value.

The concern, of course, is how much of the data will actually be available to us.  I'm frankly a bit scared about this.  Potentially, MLBAM could be really stingy with these data, and we might end up with LESS information than we have now through pitchf/x.  I'm hopeful that this won't happen, however, and I've no doubt that folks like David Appelman at FanGraphs will be doing their best to have access to, or perhaps even license, some of the critical info that we as a community want.

Once the data ARE available, the question will be how well the community makes use of them.  I have no doubt that we'll see a lot of spurious conclusions in the early going.  Fortunately, the sabermetric community is pretty good at policing itself, and correcting its past mistakes.  We have a lot of bright minds in this community, and hopefully, within a few years, we'll have a good grasp of what these data can and cannot do.

Postscript

I embedded one of the videos that MLBAM released above.  What follows (below the break) are some of the other interesting ones.  It's really exciting stuff!

Tuesday, June 03, 2014

Giants Series Preview

Pablo Sandoval: Striking out at career norms!
Photo Credit: S.D. Dirk
Here's the series preview that I wrote for the Giants/Reds series this week.  It included an update to my piece on strikeout rate risers from last month:
I'm pretty sure that, based on his preseason weight reports, this was supposed to be a good Pablo Sandoval year.  He had a really tough first month, and I noted that on May 8th, his strikeout rate had risen as high as 22% while wOBAing 0.240.  He's made a nice turn-around since then, however, and is regressing quickly back up to his career norms.  So, maybe a good year isn't out of the question after all!
The lesson here is that even when you're looking at an appropriate sample size (based on Pizza Cutter's study) for a given statistic, there's still plenty of reason to expect players will project more toward their career averages than their most recent rates.  In Sandoval's case, he posted a 20.6% strikeout rate in April, but followed up with a 13% strikeout rate in May.  His career strikeout rate?  13%.
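One rough way to picture that pull toward career norms is a simple weighted average, sketched below.  To be clear about the assumptions: this is a generic shrinkage illustration, not Pizza Cutter's method or any real projection system, and both the April plate appearance count (100) and the weight given to the career rate (150 plate appearances) are hypothetical numbers chosen only to show the arithmetic.  Only the 20.6% and 13% rates come from the paragraph above.

```python
def shrink_toward_career(observed_rate, observed_pa, career_rate, prior_pa):
    """Blend a recent rate with a career rate, weighting each by plate appearances.

    prior_pa is a hypothetical tuning knob: how many plate appearances' worth of
    weight the career rate receives.
    """
    total = observed_rate * observed_pa + career_rate * prior_pa
    return total / (observed_pa + prior_pa)

# April strikeout rate (20.6%) and career rate (13%) are from the post;
# the 100 PA sample and the 150 PA prior weight are assumptions.
estimate = shrink_toward_career(0.206, 100, 0.13, 150)
print(estimate)
```

Whatever weights you pick, the estimate always lands somewhere between the hot (or cold) month and the career rate, and the more weight the career line gets, the closer it sits to 13%.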

Monday, June 02, 2014

Stephen Drew Signs; Draft Pick Penalty is Terrible, Should Go

Stephen Drew finally found a home, right back
where he started.  Photo credit: Keith Allison
I'd missed the initial news, but Stephen Drew apparently has finally found a job.  After the draft pick penalty prevented him from finding gainful employment with a number of teams who otherwise could have used his services, he ended up signing again with the Boston Red Sox.  Because they are re-signing their player, the Red Sox do not give up a compensation pick to sign Drew, but they also don't get a pick that would have otherwise come to them from Drew's signing team (although the chances of that were almost zilch at this point).

As I wrote in March, this system is terrible.  I like the idea of giving an extra draft pick to the team who loses a quality player via free agency.  Furthermore, I *really* like the qualifying offer system as a way to let the market dictate the value of a player, as opposed to using Elias's horrific player rating system.  But I see no reason why the team that signs a player should be penalized, as this ultimately results in a penalty to the player.  Stephen Drew might not be an elite player, but baseball is better with him in the game.  I'd love to see the lost-pick penalty go immediately.

Everything I've heard about it, however, indicates that nothing will change until the current CBA expires after 2016.  That means that we're in for another two offseasons of this nonsense.  Stephen Drew and the other players in his situation were effectively denied their free agency.  It's not collusion, because teams are acting in their individual best interests.  But the effect seems very much the same.

Sunday, June 01, 2014

Cueto getting early Cy Young respect

Johnny Cueto has had an amazing first third of a season.  To wit:

I mean, sheesh.  His velocity continues to be the highest it's been in the past three years, his control has been pinpoint, and he's striking guys out at easily the highest rate of his career.

David Pinto has produced a Cy Young tracker script that employs both Bill James' and Tom Tango's Cy Young point systems.  These systems are designed to predict Cy Young winners, not establish who is the best player.  Despite only having a 5-4 record, which matters for these point systems, Cueto has a sizable lead using Tom Tango's Cy Young points.

At least one mainstream outlet (albeit not necessarily a prototypical "mainstream writer" in Dayn Perry) has noticed, and tapped Cueto as the Cy Young winner for the first third of the season:
This award belonged to Adam Wainwright before his Friday night stinker against the Giants. Now it falls to Cueto, who, that previous qualifier notwithstanding, is quite worthy. Presently, he leads the NL in innings and the majors in strikeouts. More to the point, Cueto boasts an ERA of 1.83, and opposing hitters are batting just .148/.207/.251 against him. He's also logged 10 quality starts in 11 trips to the mound.
I'm still not sure if we'll see Cueto in a Reds uniform by the season's end.  The Reds would have to go on one heck of a tear to get back in the division race, and Cueto, while not expensive for his production, should almost unquestionably bring more trade value now than he ever will.  But it has been awfully fun to watch him have this kind of season.