Final Project of GSBA-524
Shu Wu
November 22, 2015
This report is the final project for course GSBA-524 in Fall 2015.
Olympics, the biggest sports event around the world, is held by a select country every four years. There are approximately 200 countries that take part in it each time. Obviously, a lot of talented athletics are fighting so hard to get the entry of Olympics, and, probably to win a medal for his/her country, proudly. However, as a audience I have never thought about the mysteries and secrets of those medalists. What makes them excellent? What are the interesting facts about them?
As the final project of GSBA-524, I am asked to dig deeper in this topic as an audience and as a potential data analyst. All I have is a dirty dataset that consists of details of Olympics medalists from 2000 to 2012, Country GDP, Population, HDI and ISO codes. To be honest, I am a big basketball fans and I loved Kobe so much that I also plan to make this report a tribute to my beloved 2008 USA basketball team.
Clean the data
To start with, we need to read the dataset that is in ‘xls’ format into R. I changed its format to ‘csv’ in Excel and separately save the five worksheets beforehand.
1 2 3 4 5 6 |
setwd("../My Final Project/files") Pop = read.csv('Population.csv',header = TRUE) Code = read.csv('code.csv',header = TRUE) GDP = read.csv('GDP.csv',header = TRUE) HDI = read.csv('HDI.csv', header = TRUE) Medal = read.csv('Medal.csv',header = TRUE) |
Note that the setwd command is indispensable in this case.
Apparently, almost each data frame that were read in by R contains some blank columns in the end, so I tried to trim them out to make the data frames prettier.
1 2 3 4 |
Code = Code[,1:3] GDP = GDP[,1:4] HDI = HDI[,1:9] Pop = Pop[,1:4] |
Note that I found there are some duplicates in “GDP.csv” so I tried to omit all those observations. Instead of using R to solve this, I use Excel function “Countif” to find out that in “Int Olympic Committee code”, “MYA” repeats 24 times, “CZE” repeats 2 times, “THA” repeats 3 times, “BUR” repeats 2 times, “BIZ” repeats 2 times. Unfortunately, for each duplicate, I cannot find out which one of them is correct, so I decide to eliminate them entirely. That’s what following code does:
1 2 3 4 5 |
Pop = Pop[which(Pop$Int.Olympic.Committee.code != "MYA"),] Pop = Pop[which(Pop$Int.Olympic.Committee.code != "CZE"),] Pop = Pop[which(Pop$Int.Olympic.Committee.code != "THA"),] Pop = Pop[which(Pop$Int.Olympic.Committee.code != "BUR"),] Pop = Pop[which(Pop$Int.Olympic.Committee.code != "BIZ"),] |
Additionally, I found that in ‘Medal’ data frame there are some NA observations so I also omit those observations.
1 |
Medal = na.omit(Medal) |
Overview
In ‘Medal’ dataset, all the athletes who have got medals in Olympics from 2000 to 2012 are present. There are 8613 observations and 13 variables. Let’s take a look at it:
1 |
table(Medal$Year,Medal$Competing.Country) # Awarded athletes |
It’s such a long matrix that I set echo to FALSE otherwise there will be 3 pages full of data. This table basically summarize the number of awarded athletes by his nationality and display it by time. From the table, U.S.A., China and Russia are the top 3 countries in the history with most awarded athletes.
However, sometimes I want to see the total number of medals won by a specific country instead of the number of awarded athletes, because a talented athlete, such as Michael Phelps, could win several medals at the same event. Here’s my code to summarize it:
1 2 3 4 5 |
sum = tapply(Medal$Total.Medals, list(Medal$Competing.Country,Medal$Year), sum) sum[is.na(sum)] = 0 sum |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 |
## 2000 2002 2004 2006 2008 2010 2012 ## Afghanistan 0 0 0 0 1 0 1 ## Algeria 5 0 0 0 2 0 1 ## Argentina 20 0 49 0 51 0 19 ## Armenia 1 0 0 0 6 0 3 ## Australia 183 2 156 2 149 3 114 ## Austria 4 20 8 30 3 26 0 ## Azerbaijan 3 0 5 0 7 0 10 ## Bahamas 11 0 2 0 7 0 4 ## Bahrain 0 0 0 0 0 0 1 ## Barbados 1 0 0 0 0 0 0 ## Belarus 22 1 17 1 30 3 23 ## Belgium 7 0 3 0 5 0 3 ## Botswana 0 0 0 0 0 0 1 ## Brazil 48 0 40 0 74 0 56 ## Bulgaria 13 3 17 1 5 0 2 ## Cameroon 18 0 1 0 1 0 0 ## Canada 31 74 17 69 34 90 55 ## Chile 17 0 4 0 1 0 0 ## China 79 16 94 13 184 19 125 ## Chinese Taipei 5 0 9 0 4 0 2 ## Colombia 1 0 2 0 2 0 8 ## Costa Rica 2 0 0 0 0 0 0 ## Croatia 10 4 21 3 5 3 35 ## Cuba 65 0 62 0 47 0 14 ## Cyprus 0 0 0 0 0 0 1 ## Czech Republic 9 3 12 27 7 9 14 ## Denmark 25 0 29 0 19 0 16 ## Dominican Republic 0 0 1 0 2 0 2 ## Ecuador 0 0 0 0 1 0 0 ## Egypt 0 0 5 0 1 0 2 ## Eritrea 0 0 1 0 0 0 0 ## Estonia 3 3 3 3 3 1 2 ## Ethiopia 8 0 7 0 7 0 7 ## Finland 5 13 2 41 5 47 5 ## France 66 15 53 15 77 14 78 ## Gabon 0 0 0 0 0 0 1 ## Georgia 6 0 4 0 6 0 7 ## Germany 118 61 149 54 99 54 94 ## Great Britain 54 6 57 1 77 1 126 ## Greece 18 0 31 0 7 0 3 ## Grenada 0 0 0 0 0 0 1 ## Guatemala 0 0 0 0 0 0 1 ## Hong Kong 0 0 2 0 0 0 1 ## Hungary 53 0 40 0 27 0 25 ## Iceland 1 0 0 0 14 0 0 ## India 1 0 1 0 3 0 6 ## Indonesia 8 0 5 0 7 0 2 ## Iran 4 0 6 0 2 0 12 ## Ireland 1 0 0 0 3 0 5 ## Israel 1 0 2 0 1 0 0 ## Italy 65 21 104 25 43 5 68 ## Jamaica 23 0 13 0 19 0 25 ## Japan 44 2 93 1 51 7 84 ## Kazakhstan 7 0 8 0 13 1 13 ## Kenya 7 0 7 0 14 0 11 ## Kuwait 1 0 0 0 0 0 1 ## Kyrgyzstan 1 0 0 0 2 0 0 ## Latvia 3 0 4 1 3 3 3 ## Lithuania 17 0 3 0 5 0 5 ## Macedonia 1 0 0 0 0 0 0 ## Malaysia 0 0 0 0 1 0 2 ## Mauritius 0 0 0 0 1 0 0 ## Mexico 6 0 4 0 4 0 24 ## Moldova 2 0 0 0 1 0 2 ## Mongolia 0 0 1 0 4 0 5 ## Montenegro 0 0 0 0 0 0 14 ## Morocco 5 0 3 0 2 0 1 ## Mozambique 1 0 0 0 0 0 0 ## Netherlands 79 8 77 13 61 11 69 ## New Zealand 4 0 6 0 15 0 27 ## Nigeria 8 0 8 0 23 0 0 ## North Korea 4 0 5 0 6 0 6 ## Norway 43 41 7 23 22 39 17 ## Panama 0 0 0 0 1 0 0 ## Paraguay 0 0 17 0 0 0 0 ## Poland 24 2 12 2 20 8 12 ## Portugal 2 0 3 0 2 0 2 ## Puerto Rico 0 0 0 0 0 0 2 ## Qatar 1 0 0 0 0 0 2 ## Romania 46 0 39 0 22 0 16 ## Russia 187 38 191 41 146 25 140 ## Saudi Arabia 2 0 0 0 0 0 4 ## Serbia 0 0 0 0 15 0 16 ## Serbia and Montenegro 25 0 13 0 0 0 0 ## Singapore 0 0 0 0 3 0 4 ## Slovakia 6 0 10 1 10 3 5 ## Slovenia 3 4 5 0 5 3 5 ## South Africa 5 0 10 0 1 0 9 ## South Korea 73 7 52 19 78 18 61 ## Spain 42 0 29 0 70 0 64 ## Sri Lanka 1 0 0 0 0 0 0 ## Sudan 0 0 0 0 1 0 0 ## Sweden 32 26 12 64 7 18 22 ## Switzerland 14 24 7 21 11 12 4 ## Syria 0 0 1 0 0 0 0 ## Tajikistan 0 0 0 0 2 0 1 ## Thailand 3 0 8 0 4 0 3 ## Togo 0 0 0 0 1 0 0 ## Trinidad and Tobago 2 0 1 0 6 0 10 ## Tunisia 0 0 0 0 1 0 3 ## Turkey 5 0 10 0 8 0 5 ## Uganda 0 0 0 0 0 0 1 ## Ukraine 35 0 48 3 31 0 26 ## United Arab Emirates 0 0 1 0 0 0 0 ## United States 243 84 265 52 317 97 254 ## Uruguay 1 0 0 0 0 0 0 ## Uzbekistan 4 0 5 0 6 0 4 ## Venezuela 0 0 2 0 1 0 1 ## Vietnam 1 0 0 0 1 0 0 ## Zimbabwe 0 0 3 0 4 0 0 |
Dive deeper, and deeper!
Summer? Winter?
One of the factors notable here is that Olympics actually is divided into two categories – summer and winter. Different season make it consist of different sports. For instance, winter Olympics has skiing and skating that are exclusive in winter, while summer Olympics has soccer and athletics and so on. It would be more reasonable if we can look at the performance of all the countries in different seasonal sports.
To do so, I would use the created, well-structured “sum” dataset and split it into two parts. Since the column names are years in which an Olmpics is held, I manually grouped them using the following codes:
1 2 |
winter = sum[,c('2002','2006','2010')] summer = sum[,c('2000','2004','2008','2012')] |
Now we can take a look at the performance of all countries in different Olympics. To achieve that goal, I sumed up all the values of the cells in the same row using “apply” function:
1 |
apply(winter,1,sum) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
## Afghanistan Algeria Argentina ## 0 0 0 ## Armenia Australia Austria ## 0 7 76 ## Azerbaijan Bahamas Bahrain ## 0 0 0 ## Barbados Belarus Belgium ## 0 5 0 ## Botswana Brazil Bulgaria ## 0 0 4 ## Cameroon Canada Chile ## 0 233 0 ## China Chinese Taipei Colombia ## 48 0 0 ## Costa Rica Croatia Cuba ## 0 10 0 ## Cyprus Czech Republic Denmark ## 0 39 0 ## Dominican Republic Ecuador Egypt ## 0 0 0 ## Eritrea Estonia Ethiopia ## 0 7 0 ## Finland France Gabon ## 101 44 0 ## Georgia Germany Great Britain ## 0 169 8 ## Greece Grenada Guatemala ## 0 0 0 ## Hong Kong Hungary Iceland ## 0 0 0 ## India Indonesia Iran ## 0 0 0 ## Ireland Israel Italy ## 0 0 51 ## Jamaica Japan Kazakhstan ## 0 10 1 ## Kenya Kuwait Kyrgyzstan ## 0 0 0 ## Latvia Lithuania Macedonia ## 4 0 0 ## Malaysia Mauritius Mexico ## 0 0 0 ## Moldova Mongolia Montenegro ## 0 0 0 ## Morocco Mozambique Netherlands ## 0 0 32 ## New Zealand Nigeria North Korea ## 0 0 0 ## Norway Panama Paraguay ## 103 0 0 ## Poland Portugal Puerto Rico ## 12 0 0 ## Qatar Romania Russia ## 0 0 104 ## Saudi Arabia Serbia Serbia and Montenegro ## 0 0 0 ## Singapore Slovakia Slovenia ## 0 4 7 ## South Africa South Korea Spain ## 0 44 0 ## Sri Lanka Sudan Sweden ## 0 0 108 ## Switzerland Syria Tajikistan ## 57 0 0 ## Thailand Togo Trinidad and Tobago ## 0 0 0 ## Tunisia Turkey Uganda ## 0 0 0 ## Ukraine United Arab Emirates United States ## 3 0 233 ## Uruguay Uzbekistan Venezuela ## 0 0 0 ## Vietnam Zimbabwe ## 0 0 |
1 |
apply(summer,1,sum) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
## Afghanistan Algeria Argentina ## 2 8 139 ## Armenia Australia Austria ## 10 602 15 ## Azerbaijan Bahamas Bahrain ## 25 24 1 ## Barbados Belarus Belgium ## 1 92 18 ## Botswana Brazil Bulgaria ## 1 218 37 ## Cameroon Canada Chile ## 20 137 22 ## China Chinese Taipei Colombia ## 482 20 13 ## Costa Rica Croatia Cuba ## 2 71 188 ## Cyprus Czech Republic Denmark ## 1 42 89 ## Dominican Republic Ecuador Egypt ## 5 1 8 ## Eritrea Estonia Ethiopia ## 1 11 29 ## Finland France Gabon ## 17 274 1 ## Georgia Germany Great Britain ## 23 460 314 ## Greece Grenada Guatemala ## 59 1 1 ## Hong Kong Hungary Iceland ## 3 145 15 ## India Indonesia Iran ## 11 22 24 ## Ireland Israel Italy ## 9 4 280 ## Jamaica Japan Kazakhstan ## 80 272 41 ## Kenya Kuwait Kyrgyzstan ## 39 2 3 ## Latvia Lithuania Macedonia ## 13 30 1 ## Malaysia Mauritius Mexico ## 3 1 38 ## Moldova Mongolia Montenegro ## 5 10 14 ## Morocco Mozambique Netherlands ## 11 1 286 ## New Zealand Nigeria North Korea ## 52 39 21 ## Norway Panama Paraguay ## 89 1 17 ## Poland Portugal Puerto Rico ## 68 9 2 ## Qatar Romania Russia ## 3 123 664 ## Saudi Arabia Serbia Serbia and Montenegro ## 6 31 38 ## Singapore Slovakia Slovenia ## 7 31 18 ## South Africa South Korea Spain ## 25 264 205 ## Sri Lanka Sudan Sweden ## 1 1 73 ## Switzerland Syria Tajikistan ## 36 1 3 ## Thailand Togo Trinidad and Tobago ## 18 1 19 ## Tunisia Turkey Uganda ## 4 28 1 ## Ukraine United Arab Emirates United States ## 140 1 1079 ## Uruguay Uzbekistan Venezuela ## 1 19 4 ## Vietnam Zimbabwe ## 2 7 |
Interestingly, this separation makes me find out some other top-performance countries. Some countries are strong in summer, some are strong in winter and some are strong in both. We have talked about the top 3 countries of all time, which is U.S.A., China and Russia. U.S.A. and Russia are still top-ranking countries in number of medals in winter Olympics, but countries such as Finland(101), Germany(169), Canada(233), Sweden(108), Switzerland(57) and so on are climbing above China(48) in the list. This result has showed the spirit of Olympics very well. Talent makes a difference, so does hard work. Even countries with huge populations such as China and India are not as good as other much smaller countries in some sports. It drives players around the world to fight for the possible vitories.
Age
It would be fun to know in what age athletes are most likely to win medals. Let’s take a look at the summary of the “age” variable in “Medal” dataset:
1 |
mean(na.omit(Medal$Age)) |
1 |
## [1] 26.40543 |
1 |
max(na.omit(Medal$Age)) |
1 |
## [1] 61 |
1 |
min(na.omit(Medal$Age)) |
1 |
## [1] 15 |
It’s amazing to know that during 2000-2012 athletes aged 61 and 15 can also win medals for their countries. However, the mean age to win medals is 26, which is considered to be the “Golden Age” for athletes. To extend this topic further, I’d like to know the difference in age of athletes to win gold, silver and bronze medals. What’s in people’s mind could be that younger athletes have more chances to win gold medals and is it true?
1 2 3 4 5 |
Gold <- Medal[which(Medal$Gold.Medals != 0),] Silver <- Medal[which(Medal$Silver.Medals != 0),] Silver <- subset(Silver,Silver$Gold.Medals == 0) Bronze <- Medal[which(Medal$Bronze.Medals != 0),] Bronze <- subset(Bronze,Bronze$Gold.Medals == 0 & Bronze$Silver.Medals == 0) |
I store three kinds of medal separately. Note that when I store “Silver” and “Bronze”, I actually subset them with an additional condition to cut out those duplicates in the previous categories. Especially I want to see those players who only win bronze and silver medals, since those talented athletes often win a bunch of all kinds of medals at once, and I need to eliminate this factor. Now let’s look at the result
1 |
mean(na.omit(Medal$Age)) |
1 |
## [1] 26.40543 |
1 |
mean(na.omit(Gold$Age)) |
1 |
## [1] 26.33208 |
1 |
mean(na.omit(Silver$Age)) |
1 |
## [1] 26.48961 |
1 |
mean(na.omit(Bronze$Age)) |
1 |
## [1] 26.39841 |
Surprisingly, it is difficult to state that the more excellent athletes are, the younger they will be. Although athletes who won gold medals are youngest with average age of 26.33, I found that those who won bronze medals are younger than those who won silver, which is the opposite of my theory. If I had more time, I’d like to do an F-test to see whether those three mean values are significantly equal.
Relationship with GDP
People might believe that wealthy countries will have more awarded athletes because they can invest more resource on the research and training, but is this thought legitimate? I tried to join the “GDP.csv” with “Medal.csv” in R to do a linear regression. Firstly, I summarize the created dataset “sum” by year to get the total number of medals won by countries.
However, in this process there are challenges. I can not join “GDP” and “sum” directly even they have a common variables called “Country”. Therefore, I need to join “sum” wirh “Code” at first and then use “Int.Olympic.Committee.code” to join the new dataset with “GDP”, which is confusing. The reason why I use “Int.Olympic.Committee.code” is because the more simple one – “ISO.code” is lack of the code for the Great Britain, which is a key player in Olympics.
1 2 3 4 5 6 7 8 9 |
sumdal = as.data.frame(apply(sum, 1, sum)) sumdal1 <- cbind(rownames(sumdal), sumdal) colnames(sumdal1)[1] = 'Country' total = merge.data.frame(sumdal1,Code, by = 'Country') total = merge.data.frame(total,GDP,by='Int.Olympic.Committee.code') calculate_total = total[,-(5:6)] colnames(calculate_total) = c('a','b','c','d','e') |
Let’s take a look at the “calculate_total” dataset.
1 |
str(calculate_total) |
1 2 3 4 5 6 |
## 'data.frame': 109 obs. of 5 variables: ## $ a: Factor w/ 202 levels "AFG","AHO","ALB",..: 1 4 8 9 12 13 14 15 17 19 ... ## $ b: Factor w/ 110 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 10 12 ... ## $ c: num 2 8 139 10 609 91 25 24 1 18 ... ## $ d: Factor w/ 201 levels "","AD","AE","AF",..: 4 55 10 7 13 12 15 29 17 19 ... ## $ e: Factor w/ 149 levels "$1,000","$1,100",..: 2 132 42 122 106 105 14 83 64 91 ... |
I realized the last variable which states the GDP per capita for all countries is in “Factor” format, and it will cause problem when I want to do linear regression using it. Therefore, I use “gsub” and regular expressions to replace all the non-digit characters.
1 |
calculate_total$e = as.numeric(gsub('[^a-zA-Z0-9.]', '',calculate_total$e)) |
Great! It is now in numeric format and ready for the analysis. I created a variable called “lm1” to store the linear model because I want to see the summary of it.
1 2 |
lm1 = lm(calculate_total$c~calculate_total$e) summary(lm1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
## ## Call: ## lm(formula = calculate_total$c ~ calculate_total$e) ## ## Residuals: ## Min 1Q Median 3Q Max ## -377.54 -71.59 -28.36 -3.86 1107.83 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.527e+01 2.590e+01 0.590 0.556607 ## calculate_total$e 3.578e-03 9.886e-04 3.619 0.000453 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 173.9 on 107 degrees of freedom ## Multiple R-squared: 0.1091, Adjusted R-squared: 0.1007 ## F-statistic: 13.1 on 1 and 107 DF, p-value: 0.0004533 |
To sum up, the p-value for variable “calculate_total$e” is significant and the p-value for intercept is not. The R-square value is about 10% and it seems that GDP per capita does not account for the total number of medals very much. Why not take a look at the plot?
1 2 |
## $scipen ## [1] 0 |
There seems to be a weak positive relationship between GDP per capita and total number of medals, with every 1 unit increase in GDP per capita would result in an increase of 0.003 unit of the number of medals.
Relationship with Population
Similarly, since we have the population data for all the countries, we can do a similar analysis to the relationship between total number of medals and its population for each country.
1 2 3 4 5 6 7 8 |
total1 = merge.data.frame(sumdal1,Code, by = 'Country') total1 = merge.data.frame(total1,Pop,by='Int.Olympic.Committee.code') calculate_total1 = total1[,-(5:6)] # remove wrong pairs colnames(calculate_total1) = c('a','b','c','d','e') calculate_total1$e = as.numeric(gsub('[^a-zA-Z0-9.]', '',calculate_total1$e)) lm2 = lm(calculate_total1$c~calculate_total1$e) summary(lm2) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
## ## Call: ## lm(formula = calculate_total1$c ~ calculate_total1$e) ## ## Residuals: ## Min 1Q Median 3Q Max ## -442.24 -79.12 -66.67 9.85 1135.96 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 79.6857544751 20.0415526933 3.976 0.000138 *** ## calculate_total1$e 0.0000003021 0.0000001012 2.984 0.003631 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 186.6 on 93 degrees of freedom ## Multiple R-squared: 0.08739, Adjusted R-squared: 0.07758 ## F-statistic: 8.906 on 1 and 93 DF, p-value: 0.003631 |
Looking at the summary, I can say the both estimated parameters for intercept and “calculate_total1$e” are significant, implying that the regression is quite robust. However, the R-square is about 7% and it is a reasonable value for a simple linear model with only one independent variable. There is a trend of bigger population makes more medals, but I also noticed that a lot of points are accumulated in the left-bottom corner. In other words, a lot of countries have similar number of medals and some other countries such as U.S.A and Russia have abnormal numbers of medals or population. However, instead of cutting off these obvious outliers, I think I should keep them in the analysis because these points account for almost over 50% of the total number of medals. If we eliminate them in the dataset, it would cause serious bias in our results, which are not preferred.
What is every country’s best sport
It will be interesting to know that what the best-good-at sport is for each country. For example, U.S.A has a dream team for basketball and all of those popular NBA stars play for U.S.A. in Olympics. While the dominant power of a specific sport compared to that of other conntries is a key factor, I tend to use the total number of medals earned by that kind of sport as my indicator because it will be more convienient to calculate.
1 2 3 4 5 6 7 8 9 10 11 12 |
sum1 = tapply(Medal$Total.Medals, list(Medal$Competing.Country,Medal$Sport), sum) sum1 = as.data.frame(sum1) sum1[is.na(sum1)] = 0 a = 1 for(i in sum1[,1:49]){ print(colnames(sum1)[a]) print(rownames(sum1[match(max(i),i),])) print(max(i)) a = a + 1 } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
## [1] "Alpine Skiing" ## [1] "Austria" ## [1] 27 ## [1] "Archery" ## [1] "South Korea" ## [1] 34 ## [1] "Athletics" ## [1] "United States" ## [1] 147 ## [1] "Badminton" ## [1] "China" ## [1] 44 ## [1] "Baseball" ## [1] "Cuba" ## [1] 72 ## [1] "Basketball" ## [1] "United States" ## [1] 95 ## [1] "Beach Volleyball" ## [1] "Brazil" ## [1] 18 ## [1] "Biathlon" ## [1] "Germany" ## [1] 40 ## [1] "Bobsleigh" ## [1] "Germany" ## [1] 26 ## [1] "Boxing" ## [1] "Cuba" ## [1] 26 ## [1] "Canoeing" ## [1] "Germany" ## [1] 68 ## [1] "Cross Country Skiing" ## [1] "Norway" ## [1] 38 ## [1] "Curling" ## [1] "Canada" ## [1] 30 ## [1] "Cycling" ## [1] "Great Britain" ## [1] 58 ## [1] "Diving" ## [1] "China" ## [1] 55 ## [1] "Equestrian" ## [1] "Germany" ## [1] 40 ## [1] "Fencing" ## [1] "Italy" ## [1] 57 ## [1] "Figure Skating" ## [1] "Russia" ## [1] 16 ## [1] "Football" ## [1] "Brazil" ## [1] 70 ## [1] "Freestyle Skiing" ## [1] "United States" ## [1] 8 ## [1] "Gymnastics" ## [1] "United States" ## [1] 55 ## [1] "Handball" ## [1] "Spain" ## [1] 44 ## [1] "Hockey" ## [1] "Netherlands" ## [1] 113 ## [1] "Ice Hockey" ## [1] "United States" ## [1] 106 ## [1] "Judo" ## [1] "Japan" ## [1] 32 ## [1] "Luge" ## [1] "Germany" ## [1] 17 ## [1] "Modern Pentathlon" ## [1] "Great Britain" ## [1] 5 ## [1] "Nordic Combined" ## [1] "Austria" ## [1] 17 ## [1] "Rhythmic Gymnastics" ## [1] "Russia" ## [1] 31 ## [1] "Rowing" ## [1] "Great Britain" ## [1] 80 ## [1] "Sailing" ## [1] "Great Britain" ## [1] 34 ## [1] "Shooting" ## [1] "China" ## [1] 32 ## [1] "Short-Track Speed Skating" ## [1] "Canada" ## [1] 38 ## [1] "Skeleton" ## [1] "Canada" ## [1] 4 ## [1] "Ski Jumping" ## [1] "Austria" ## [1] 12 ## [1] "Snowboarding" ## [1] "United States" ## [1] 17 ## [1] "Softball" ## [1] "Australia" ## [1] 45 ## [1] "Speed Skating" ## [1] "Netherlands" ## [1] 31 ## [1] "Swimming" ## [1] "United States" ## [1] 267 ## [1] "Synchronized Swimming" ## [1] "Russia" ## [1] 43 ## [1] "Table Tennis" ## [1] "China" ## [1] 43 ## [1] "Taekwondo" ## [1] "South Korea" ## [1] 14 ## [1] "Tennis" ## [1] "United States" ## [1] 16 ## [1] "Trampoline" ## [1] "China" ## [1] 8 ## [1] "Triathlon" ## [1] "Australia" ## [1] 5 ## [1] "Volleyball" ## [1] "Russia" ## [1] 70 ## [1] "Waterpolo" ## [1] "United States" ## [1] 63 ## [1] "Weightlifting" ## [1] "China" ## [1] 31 ## [1] "Wrestling" ## [1] "Russia" ## [1] 41 |
The code creates a list of all kinds of the sports in Olympics and also displays what country is best in it with what number of medals are won by the country in total.
Summary
I have investigated five aspects in this report. There are obviously many other opportunies to dig deeper, for instance, to use the remaining “HDI.csv” dataset. Before starting this project, I have some stereotype in Olympics like Switzerland is good at tennis because it has Roger Federer, but I found out that actually U.S.A. has won more medals than switzerland. Another example is that my motherland China is not as good at winter sports as it is in summer sports. To sum up, I appreciate this final project to give me a practice both in theory learnt in class and R programming language. Also, it gives me a new perspective to look at Olympics so next time when I talk about some interesting topics with my friends, I’ll say my sayings are backed by numbers. 🙂
Wow, great post.Really looking forward to read more.
“some genuinely interesting details you have written.”
love it, keep up good work mate!