...or, what cities really are the best baseball cities?
We always hear the players, managers, and owners of ballclubs say how they "have the best fans in the country." But can we quantify which cities really are the best baseball towns, and which are the worst? An obvious (and accessible) measure of fan interest is attendance. So to take a first stab at this issue, I decided to look at attendance relative to city population size. To do this, I employed simple linear regression. Attendance figures came from ESPN.com, and city population numbers are the '03 numbers from citypopulation.de.
An initial scatter plot comparing attendance to city population looks like this:
As you can see, there is a non-linear relationship here. I plotted both the first- and second-order polynomial fits to show this, and you can see the shape of the curve most clearly if you ignore Toronto (TOR) and the New York Mets (NYN). To correct for this, I applied a log10 transformation to the population data, which straightened the relationship very nicely and allows us to do further analysis:
[Note that now there is a much nicer spread in the data, that the first- and second-order polynomial fits trace almost identical paths, and that there is no visible curve to the datapoints themselves, even if you ignore certain teams.]
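For readers who want to try the same check themselves, here is a minimal sketch in Python with NumPy (my choice of tools for illustration, not necessarily what produced the plots above). It compares first- and second-order polynomial fits before and after the log10 transformation. The population and attendance values are synthetic placeholders built to follow a log curve, not the real ESPN.com or citypopulation.de figures, and the helper name r_squared is made up for this example.

```python
import numpy as np

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic placeholder data (NOT real figures): attendance is constructed
# to be exactly linear in log10(population).
population = np.array([3e5, 6e5, 1.2e6, 2.4e6, 4.8e6, 8e6])
attendance = 1.0e6 * np.log10(population) - 3.5e6

raw_linear = r_squared(population, attendance, 1)
raw_quadratic = r_squared(population, attendance, 2)
log_linear = r_squared(np.log10(population), attendance, 1)
log_quadratic = r_squared(np.log10(population), attendance, 2)

print(f"raw population:   linear {raw_linear:.3f}, quadratic {raw_quadratic:.3f}")
print(f"log10 population: linear {log_linear:.3f}, quadratic {log_quadratic:.3f}")
```

On the raw scale the quadratic fit does visibly better than the straight line, which is the sign of curvature. After the transformation the two fits are essentially the same, which is what "straightened" means here.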
This graph tells us a few things. First, we can predict attendance with some accuracy by simply knowing city population size. The R^2 on the linear regression of these two variables is 0.34, meaning we can explain roughly a third of the variation in attendance based on city population size alone. Not bad given how many other factors must go into attendance rates; more on this later. Second, while most cities are fairly close to the line, some are clearly pretty far off it. We'd expect that the best baseball towns would have higher attendance than their city populations would predict, while the worst towns would have much poorer attendance than predicted. This can be quantified by looking at the residuals, which are just the differences between each team's actual attendance and the attendance predicted by the regression line:
A few things jump out at me from this analysis. First, St. Louis fans attend far more games than you'd predict based on St. Louis's city population size. This supports the people who claim that St. Louis is the best baseball town in the country. Second, Tampa Bay and Toronto are terrible places for baseball. Toronto is the second largest major league city -- larger even than Los Angeles based on these numbers -- and yet it pulled in almost bottom-rung attendance last year. Tampa Bay isn't a large city to begin with, but even so it draws minuscule attendance for its size. All other cities fall within the heart of the curve.
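The residual calculation described above can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the original analysis: the team codes, populations, and attendance totals below are invented placeholders, and the function name fit_log_population is hypothetical.

```python
import numpy as np

def fit_log_population(population, attendance):
    """Regress attendance on log10(population); return fit and residuals."""
    x = np.log10(np.asarray(population, dtype=float))
    y = np.asarray(attendance, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # least-squares line
    predicted = intercept + slope * x
    residuals = y - predicted                    # actual minus predicted
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, residuals, r2

# Invented placeholder data (NOT the real 2003 figures).
teams = ["AAA", "BBB", "CCC", "DDD", "EEE"]
population = [350_000, 600_000, 1_300_000, 2_900_000, 8_000_000]
attendance = [3_000_000, 1_900_000, 2_400_000, 2_700_000, 3_300_000]

slope, intercept, residuals, r2 = fit_log_population(population, attendance)

# Rank teams from most over-attended to most under-attended.
ranked = sorted(zip(teams, residuals), key=lambda pair: pair[1], reverse=True)
for team, resid in ranked:
    print(f"{team}: {resid:+,.0f}")
print(f"R^2 = {r2:.2f}")
```

A large positive residual marks a team that draws more fans than its city size predicts, and a large negative residual marks one that draws fewer.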
Adjusting for Cities with Two Teams
An obvious criticism one could level at the above analysis is that Chicago, Los Angeles, and New York each host two teams. Since those teams have to split the market in their respective cities, shouldn't they get an adjustment in their available city population? The difficulty is deciding how to split the population. As an initial stab, I just divided the city population by two. Unfortunately, this resulted in an overall decrease in the predictive power of the regression model: the R^2 decreased from 0.34 to 0.27. Therefore, this adjustment seems inappropriate and hurts the fit of our attendance model.
An alternative approach might be to split the city population by the same ratio as the split in total attendance of the two teams within each city. For example, the total attendance in Chicago was 5,443,096 people; 3,100,262 (57%) with the Cubs, 2,342,834 (43%) with the White Sox. So if we split the total population of Chicago (2,869,121) accordingly, that gives the Cubs a city population of 1,635,399, and the White Sox a city population of 1,233,722. Even though this "forces" the population data to be slightly more in line with attendance, the overall R^2 still drops from 0.34 to 0.30. In other words, this correction also seems to hurt the fit of the model. Therefore, unless I can determine a better way to do this, I will continue to use the uncorrected population numbers in these models.
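The attendance-weighted split is easy to reproduce. Here is a minimal Python sketch using the Chicago numbers quoted above; the function name split_population is made up for illustration. One caveat: the figures in the text appear to come from the rounded 57%/43% shares, so the unrounded shares used here give a Cubs population about a thousand people lower than the one quoted.

```python
def split_population(city_population, attendance_by_team):
    """Divide a city's population among its teams in proportion to attendance."""
    total = sum(attendance_by_team.values())
    return {
        team: city_population * att / total
        for team, att in attendance_by_team.items()
    }

# Chicago figures as quoted in the text above.
chicago_attendance = {"Cubs": 3_100_262, "White Sox": 2_342_834}
chicago = split_population(2_869_121, chicago_attendance)

for team, pop in chicago.items():
    share = chicago_attendance[team] / sum(chicago_attendance.values())
    print(f"{team}: {share:.1%} of attendance, adjusted population {pop:,.0f}")
```

The adjusted populations would then replace the full city population for each two-team city before taking log10 and refitting the regression.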
Stay tuned for part 2 in the next few days, where I will incorporate team performance to better understand team attendance.
Update: Part 2 is now online.