Final Project of Manegerial Statistics

 

 

 

This report is the final project for course GSBA-524 in Fall 2015.

Olympics, the biggest sports event around the world, is held by a select country every four years. There are approximately 200 countries that take part in it each time. Obviously, a lot of talented athletics are fighting so hard to get the entry of Olympics, and, probably to win a medal for his/her country, proudly. However, as a audience I have never thought about the mysteries and secrets of those medalists. What makes them excellent? What are the interesting facts about them?

As the final project of GSBA-524, I am asked to dig deeper in this topic as an audience and as a potential data analyst. All I have is a dirty dataset that consists of details of Olympics medalists from 2000 to 2012, Country GDP, Population, HDI and ISO codes. To be honest, I am a big basketball fans and I loved Kobe so much that I also plan to make this report a tribute to my beloved 2008 USA basketball team.

Clean the data

To start with, we need to read the dataset that is in ‘xls’ format into R. I changed its format to ‘csv’ in Excel and separately save the five worksheets beforehand.

Note that the setwd command is indispensable in this case.

Apparently, almost each data frame that were read in by R contains some blank columns in the end, so I tried to trim them out to make the data frames prettier.

Note that I found there are some duplicates in “GDP.csv” so I tried to omit all those observations. Instead of using R to solve this, I use Excel function “Countif” to find out that in “Int Olympic Committee code”, “MYA” repeats 24 times, “CZE” repeats 2 times, “THA” repeats 3 times, “BUR” repeats 2 times, “BIZ” repeats 2 times. Unfortunately, for each duplicate, I cannot find out which one of them is correct, so I decide to eliminate them entirely. That’s what following code does:

Additionally, I found that in ‘Medal’ data frame there are some NA observations so I also omit those observations.

Overview

In ‘Medal’ dataset, all the athletes who have got medals in Olympics from 2000 to 2012 are present. There are 8613 observations and 13 variables. Let’s take a look at it:

It’s such a long matrix that I set echo to FALSE otherwise there will be 3 pages full of data. This table basically summarize the number of awarded athletes by his nationality and display it by time. From the table, U.S.A., China and Russia are the top 3 countries in the history with most awarded athletes.

However, sometimes I want to see the total number of medals won by a specific country instead of the number of awarded athletes, because a talented athlete, such as Michael Phelps, could win several medals at the same event. Here’s my code to summarize it:

Dive deeper, and deeper!

Summer? Winter?

One of the factors notable here is that Olympics actually is divided into two categories – summer and winter. Different season make it consist of different sports. For instance, winter Olympics has skiing and skating that are exclusive in winter, while summer Olympics has soccer and athletics and so on. It would be more reasonable if we can look at the performance of all the countries in different seasonal sports.

To do so, I would use the created, well-structured “sum” dataset and split it into two parts. Since the column names are years in which an Olmpics is held, I manually grouped them using the following codes:

Now we can take a look at the performance of all countries in different Olympics. To achieve that goal, I sumed up all the values of the cells in the same row using “apply” function:

Interestingly, this separation makes me find out some other top-performance countries. Some countries are strong in summer, some are strong in winter and some are strong in both. We have talked about the top 3 countries of all time, which is U.S.A., China and Russia. U.S.A. and Russia are still top-ranking countries in number of medals in winter Olympics, but countries such as Finland(101), Germany(169), Canada(233), Sweden(108), Switzerland(57) and so on are climbing above China(48) in the list. This result has showed the spirit of Olympics very well. Talent makes a difference, so does hard work. Even countries with huge populations such as China and India are not as good as other much smaller countries in some sports. It drives players around the world to fight for the possible vitories.

Age

It would be fun to know in what age athletes are most likely to win medals. Let’s take a look at the summary of the “age” variable in “Medal” dataset:

It’s amazing to know that during 2000-2012 athletes aged 61 and 15 can also win medals for their countries. However, the mean age to win medals is 26, which is considered to be the “Golden Age” for athletes. To extend this topic further, I’d like to know the difference in age of athletes to win gold, silver and bronze medals. What’s in people’s mind could be that younger athletes have more chances to win gold medals and is it true?

I store three kinds of medal separately. Note that when I store “Silver” and “Bronze”, I actually subset them with an additional condition to cut out those duplicates in the previous categories. Especially I want to see those players who only win bronze and silver medals, since those talented athletes often win a bunch of all kinds of medals at once, and I need to eliminate this factor. Now let’s look at the result

Surprisingly, it is difficult to state that the more excellent athletes are, the younger they will be. Although athletes who won gold medals are youngest with average age of 26.33, I found that those who won bronze medals are younger than those who won silver, which is the opposite of my theory. If I had more time, I’d like to do an F-test to see whether those three mean values are significantly equal.

Relationship with GDP

People might believe that wealthy countries will have more awarded athletes because they can invest more resource on the research and training, but is this thought legitimate? I tried to join the “GDP.csv” with “Medal.csv” in R to do a linear regression. Firstly, I summarize the created dataset “sum” by year to get the total number of medals won by countries.

However, in this process there are challenges. I can not join “GDP” and “sum” directly even they have a common variables called “Country”. Therefore, I need to join “sum” wirh “Code” at first and then use “Int.Olympic.Committee.code” to join the new dataset with “GDP”, which is confusing. The reason why I use “Int.Olympic.Committee.code” is because the more simple one – “ISO.code” is lack of the code for the Great Britain, which is a key player in Olympics.

Let’s take a look at the “calculate_total” dataset.

I realized the last variable which states the GDP per capita for all countries is in “Factor” format, and it will cause problem when I want to do linear regression using it. Therefore, I use “gsub” and regular expressions to replace all the non-digit characters.

Great! It is now in numeric format and ready for the analysis. I created a variable called “lm1” to store the linear model because I want to see the summary of it.

To sum up, the p-value for variable “calculate_total$e” is significant and the p-value for intercept is not. The R-square value is about 10% and it seems that GDP per capita does not account for the total number of medals very much. Why not take a look at the plot?

There seems to be a weak positive relationship between GDP per capita and total number of medals, with every 1 unit increase in GDP per capita would result in an increase of 0.003 unit of the number of medals.

Relationship with Population

Similarly, since we have the population data for all the countries, we can do a similar analysis to the relationship between total number of medals and its population for each country.

Looking at the summary, I can say the both estimated parameters for intercept and “calculate_total1$e” are significant, implying that the regression is quite robust. However, the R-square is about 7% and it is a reasonable value for a simple linear model with only one independent variable. There is a trend of bigger population makes more medals, but I also noticed that a lot of points are accumulated in the left-bottom corner. In other words, a lot of countries have similar number of medals and some other countries such as U.S.A and Russia have abnormal numbers of medals or population. However, instead of cutting off these obvious outliers, I think I should keep them in the analysis because these points account for almost over 50% of the total number of medals. If we eliminate them in the dataset, it would cause serious bias in our results, which are not preferred.

What is every country’s best sport

It will be interesting to know that what the best-good-at sport is for each country. For example, U.S.A has a dream team for basketball and all of those popular NBA stars play for U.S.A. in Olympics. While the dominant power of a specific sport compared to that of other conntries is a key factor, I tend to use the total number of medals earned by that kind of sport as my indicator because it will be more convienient to calculate.

The code creates a list of all kinds of the sports in Olympics and also displays what country is best in it with what number of medals are won by the country in total.

Summary

I have investigated five aspects in this report. There are obviously many other opportunies to dig deeper, for instance, to use the remaining “HDI.csv” dataset. Before starting this project, I have some stereotype in Olympics like Switzerland is good at tennis because it has Roger Federer, but I found out that actually U.S.A. has won more medals than switzerland. Another example is that my motherland China is not as good at winter sports as it is in summer sports. To sum up, I appreciate this final project to give me a practice both in theory learnt in class and R programming language. Also, it gives me a new perspective to look at Olympics so next time when I talk about some interesting topics with my friends, I’ll say my sayings are backed by numbers. 🙂


 

3 Comments

  1. Wow, great post.Really looking forward to read more.

  2. “some genuinely interesting details you have written.”

  3. danniel winarto says:

    love it, keep up good work mate!

Leave a Reply

Your email address will not be published. Required fields are marked *