US Half Marathon Statistics

In Statistical Analysis, Statistics on November 8, 2010 by David

Yesterday, I ran the US Half Marathon. It was my first half marathon, and I woke up at 4:30 AM to abysmal darkness and pouring rain. Getting there around 5:30 AM, I waited outside (the nearby Starbucks didn’t open till 6 AM) and wondered whether I should really be attempting this. By 6:30 AM, I had met a few other runners waiting at Starbucks, and their energetic reassurances convinced me to go for it. All in all, it was a really good experience, and I was happy that I ran. I finished in 2:17:26, but by mile 11, the region around my right lateral epicondyle (the bony structure lower and to the outside of your knee) started hurting. Try as I might, I couldn’t run anymore, and I walked the final two miles. I will write more about my experience, but I wanted to share my statistical analysis of the results of the race.

Relationship between Bib Number and the Runner's Finish Time

Later that day, I went to check out the results posted online. The results page had an easy-to-scrape PHP/JavaScript database, so I downloaded the information and did a brief statistical analysis of the results. I graphed the relationships between gender, age, bib number, and finishing time using R and ggplot2. Some results are below:

  • I registered for the half marathon using Groupon approximately three weeks before the race; all in all, 1298 people used Groupon for this event. I wondered whether you could tell the Groupon participants from the regular, hypothetically more hardcore, racers. I assumed that bib numbers (the bib is the paper you pin to yourself) were assigned in order of registration, i.e., earlier registrants received lower numbers. Either that assumption is wrong or late registrants are simply not slower, as runners with high bib numbers do not look slower than runners with low bib numbers. If anything, people with higher bib numbers seem to be slightly faster.
  • I also heard that they reserved a range of bib numbers for professionals, relatively famous marathoners, or people who are extra dedicated. This does not seem to be the case, as I could not find any clusters of bib numbers among the fast runners. The winner's bib number appears to be in the middle of the range.
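
One way to put a number on the bib-versus-time relationship is a Pearson correlation coefficient. The following is only a minimal sketch in Python (my actual plots were made in R and ggplot2); the `pearson_r` helper and the sample values are made up for illustration and are not the race data.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up illustration data, NOT the actual race results.
bibs = [101, 450, 980, 1500, 2300, 2750]
finish_seconds = [8100, 7900, 8000, 7600, 7500, 7300]

r = pearson_r(bibs, finish_seconds)
print(f"correlation between bib number and finish time: {r:.2f}")
```

A negative value would match the impression that higher bib numbers finish slightly faster; a value near zero would mean bib number tells you little about finish time.
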
Boxplot of the Relationship between Runner's Age and Finish Time (in Seconds)

  • There appears to be a positive correlation between finishing time (in seconds) and age. The older you are, the more likely you are to be slower. Makes sense.
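
To plot finish times numerically, they first have to be converted to seconds. Here is a small Python sketch of that conversion, assuming times are written as H:MM:SS like my own 2:17:26; the function name is hypothetical, and the results page may have used a different format.

```python
def finish_time_to_seconds(text):
    """Convert an 'H:MM:SS' (or 'MM:SS') finish time to whole seconds."""
    seconds = 0
    # Each colon-separated field is worth 60 times the field to its right.
    for part in text.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(finish_time_to_seconds("2:17:26"))  # 2*3600 + 17*60 + 26 = 8246
```
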
Distribution of Runner's Home Towns Across US

  • The vast majority of participants are from California. Note that the dots are on a log scale: of around 2.7k participants, maybe 2.5k were from California.
Different Divisions of Runners Represented

Gender Ratio of US Half Marathon Runners

  • I examined the distribution of gender and age of the participants. There were more females than males running the race, and people were grouped into divisions according to both age and gender. For example, M2029 would be males from 20 to 29 years old.
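
A division code like M2029 can be split back into gender and age range with a short regular expression. This Python sketch assumes every code is one letter followed by two two-digit ages, and that female divisions start with F; only M2029 is confirmed above, so the rest is a guess, and `parse_division` is a hypothetical helper.

```python
import re

def parse_division(code):
    """Split a code like 'M2029' into (gender, min_age, max_age)."""
    # Assumed layout: gender letter, two-digit lower age, two-digit upper age.
    match = re.fullmatch(r"([MF])(\d{2})(\d{2})", code)
    if match is None:
        raise ValueError(f"unrecognized division code: {code!r}")
    gender, low, high = match.groups()
    return gender, int(low), int(high)

print(parse_division("M2029"))  # ('M', 20, 29)
```

Codes that do not fit the assumed pattern raise a `ValueError` rather than being silently misread.
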
Distribution of Times By Gender

  • Men were faster than women at this 13.1-mile race, and the first 20 or so finishers were all men. I have not tested whether the difference is statistically significant.
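
One way to check significance would be a permutation test on the difference in mean finish time between the two groups. The Python sketch below is only an illustration of that idea, not something I ran on the race data: the function name, the seed, and the sample times are all made up.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)

# Made-up finish times in seconds, NOT the actual race results.
men = [6900, 7100, 7300, 7450, 7600, 7800]
women = [7400, 7650, 7800, 7950, 8100, 8300]

p = permutation_p_value(men, women)
print(f"estimated p-value: {p:.3f}")
```

A small p-value would suggest the gap between the groups is unlikely to come from random labelling alone. With a couple of thousand real finishers, a standard two-sample test would likely give a similar answer.
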

In summary, this was an interesting experience, and it was nice to see that the results were easily accessible online. I did not find any very striking relationships or hidden patterns in the results, but it is interesting to examine the relationships between the different demographics and running times.