Web-scraping ODI Cricket World Cup Matches

Web-scraping ODI Cricket World Cup Matches

Introduction

World Cup matches are now in action as all initial matches are over, while semi-final and final matches are to come. Sri Lanka is not in the final four, but not at the bottom as well thankfully we ended up in sixth place.

It should be noted that not all world cups are the same, earlier it was 60 overs for a team now it is 50 overs. Two decades ago there were less than 10 teams who played cricket, but now more than 10 teams play cricket. To be more clear in 2019 Afghanistan is playing in the world cup but not Ireland, Scotland, Netherlands, Zimbabwe or Kenya. Further, in late 1990s Duckworth-Lewis method was introduced to avoid abandoning matches because of slight drizzle.

I initially wanted to do a clear analysis on all world cups so far, but as there are clear differences among several world-cups I will with hold that for now. Yet, In order to do this Exploratory Data analysis on World cups we need data. This data is extracted from Espn CricInfo website.

Web scraping

This link provides links to all world cups. These individual webpages all have the same format, therefore it is easy to extract information from them.

2019 world cup is unfinished therefore its data format is different from others and with missing information. Therefore 2019 world cup data is not added here. To extract this information R packages “rvest”, “stringr” and “splitstackshape” are used. rvest is for information extraction from webpages, where stringr and splitstackshape are used for manipulating characters, strings, columns and rows.

Below figure is the layout from the world-cup 1975 page. Also the highlighted area is chosen by me with the help of browser extension “Selector Gadget”. This extension is used to select the css section of a given websites highlighted area.

Process

  1. Using the rvest package we can extract this highlighted information, more knowledge regarding this can be studied from this blog post and Vignettes File.

Below is the code for data extraction from website, finally we have a character class list with 22 rows of information.

# Load the packages
  library(rvest)
## Loading required package: xml2
  library(stringr)
  library(splitstackshape)

# html link for world cup 1975  
  weblink <- "http://stats.espncricinfo.com/ci/engine/series/60793.html"

# extract the webpage information from cricinfo page
  base_url <- weblink
  readpage <- read_html(base_url)
  dialogue <- html_nodes(readpage,'.small-20')
  
# extract the data from html components
  data <- html_text(dialogue)
  class(data)
  str(data)
  1. The resultant is a long list of Characters, but it could be different according to what you have highlighted in the website.
# add the infor into a matrix
  data_mat <- matrix(data)
  # View(data_mat) # view the data

  1. Here, the world cup data of 1975 has multiple rows of information but one column, one row contains all the information regarding one match. All the information means match number, who vs who, location, winning conditions, number of runs scored, number of wickets taken, number of overs played.
# one row information
  data_mat[4,]
## [1] " \n 1st Match: \n  \n  \t\n\t\tEngland v India at Lord's\n\t\n  \n  - Jun 7, 1975\n \n\n\n\t\t\t  England won by 202 runs\n\n England 334/4 (60/60 ov); India 132/3 (60/60 ov)\n\n\n\n Scorecard\n\n\tArticles (7)\n\n\tPhotos (8)\n"
  1. The difficult part is dividing this one row of information which has similar pattern over rows into separate columns of meaningful information. Also it should be noted that several rows contain unnecessary information and they should be removed. Further, each row has latex notations such as " “,”" and “, which will be replaced by”-" because already we do have “-”.
# remove unnecessary rows     
  data_mat_edit <- as.matrix(data_mat[-c((1:3),((dim(data_mat)[1]-3):dim(data_mat)[1]))])
  
# replace \n, \t and " " with - 
  data_mat_rep <- data.frame(str_replace_all(data_mat_edit,c("\n"="-","\t"="-"," "="-")))
  names(data_mat_rep) <- "Column"
  # View(data_mat_rep)

Before —–> After

  1. It should be noted that being creative is the key part here, thankfully replacing those latex notations has made it easy to split the one column data frame into multiple columns with the help of “cSplit” function. After that we can select only the columns which matters to us and make them into a new data-set.
# remove the --- patterns and split it into  columns
  data_mat_tab <- cSplit(data_mat_rep,"Column","---")
  data_mat_tab_good <- data_mat_tab[,c(2,5,8,11,12)]
  # View(data_mat_tab_good)

data_mat_tab

data_mat_tab_good

Now we have a much better data frame than earlier, which makes a lot sense.

  1. Now is the next part of our data extraction, where we have to extract much more information from the last column which is about runs, wickets, overs of both teams. Along the way we can manipulate all the columns into losing the “-” notation and cleaning the data-set.

a. Extracting matches information and storing it in a separate column after removing “:”.

# acquire information about matches, basically which one and what type
  Matches <- str_remove(data_mat_tab_good$Column_02,":")
  # View(Matches)

Matches

b. Extracting the Location and finding who vs who.

# acquiring who vs who and locations
  WhovsWhos <- data.frame(cSplit(data_mat_tab_good,
                                 "Column_05","-at-")[,c("Column_05_1","Column_05_2")])
  Location <- data.frame(WhovsWhos$Column_05_2)
  WhovsWho <- str_replace(WhovsWhos$Column_05_1,"--","")
  # View(WhovsWhos)
  # View(Location)
  # View(WhovsWho)

Whovswhos—>Location–>WhovsWho

c. Extracting the Date match and winning conditions.

# acquiring the Date
  Date <- str_replace(str_replace(data_mat_tab_good$Column_08,",",""),"-","")
  # View(Date)

# Information of how wins
  won <- str_replace_all(str_replace(data_mat_tab_good$Column_11,"-",""),"-"," ")
  # View(won)

Date—->won

d. Information of both teams who played the match into separate columns.

# match info extracted
  onetwo <- data.frame(cSplit(data_mat_tab_good,
                              "Column_12",";")[,c("Column_12_1","Column_12_2")])
  names(onetwo) <- c("Team1","Team2")
  # View(onetwo)
  
# Team 1 overs and Team 2 overs extracted
  team1_overs <- str_remove(gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", 
                                 onetwo$Team1, perl=T),"-ov")
  team2_overs <- str_remove(gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", 
                                 onetwo$Team2, perl=T),"-ov")
  # View(team1_overs)
  # View(team2_overs)
  
# Runs of Team 1 and Runs of Team 2 extracted
  Runs1 <- str_match(onetwo$Team1, "\\-(\\d{2,3})")[,2]
  Runs2 <- str_match(onetwo$Team2, "\\-(\\d{2,3})")[,2]
  # View(Runs1)
  # View(Runs2)
  
# Country Name of Team 1 and Country Name of Team 2 extracted
  Team1 <- dplyr::recode_factor(cSplit(onetwo,"Team1","-")$Team1_1,
                                "Sri"="Sri Lanka","West"="West Indies",
                                "South"="South Africa","New"="New Zealand")
  Team2 <- dplyr::recode_factor(cSplit(onetwo,"Team2","-")$Team2_2,
                                "Sri"="Sri Lanka","West"="West Indies",
                                "South"="South Africa","New"="New Zealand")
  # View(Team1)
  # View(Team2)

# Wickets by Team 1 and Team 2 extracted    
  wickets1 <- str_match(onetwo$Team1, "\\-(\\d{2,3})\\/(\\d{0,1})\\-")[,3]
  wickets1[is.na(wickets1)]<-10
  
  wickets2 <- str_match(onetwo$Team2, "\\-(\\d{2,3})\\/(\\d{0,1})\\-")[,3]
  wickets2[is.na(wickets2)]<-10
  
  # View(wickets1)
  # View(wickets2)

Extracting useful match information # change

e. All those separate information after extraction now into one data set.

  output <- cbind.data.frame(Matches,Location,WhovsWho,Date,won,Team1,team1_overs,
                             Runs1,wickets1,Team2,team2_overs,Runs2,wickets2)
  # View(output)

Output # change

Now we have a clear table output, it is time to extract the data of all world cups(except 2019) into separate tables. As I mentioned before not all world cups stay the same in format therefore we are going to check these extracted tables for discrepancy. In order to do this extraction I am going to write the above code into one function.

Below are the links for different world cups.

Function for data extraction of World cup matches

Further, below is the code for extraction of data from html webpages.

WorldCup<-function(weblink)
{
  #extract the webpage information from cricinfo page
  base_url<-weblink
  readpage<-read_html(base_url)
  dialogue<-html_nodes(readpage,'.small-20')
  
  # extract the data from html components
  data<-html_text(dialogue)
  
  # add the infor into a matrix
  data_mat<-matrix(data)
  
  # remove unnecessary rows     
  data_mat_edit<-as.matrix(data_mat[-c((1:3),((dim(data_mat)[1]-3):dim(data_mat)[1]))])
  
  # replace \n, \t and " " with - 
  data_mat_rep<-data.frame(str_replace_all(data_mat_edit,c("\n"="-","\t"="-"," "="-")))
  
  names(data_mat_rep)<-"Column"
  
  # remove the --- patterns and split it into  columns
  data_mat_tab<-cSplit(data_mat_rep,"Column","---")
  data_mat_tab_good<-data_mat_tab[,c(2,5,8,11,12)]
  
  # acquire information about matches, basically which one and what type
  Matches<-str_remove(data_mat_tab_good$Column_02,":")
  
  # acquiring who vs who and locations
  WhovsWhos<-data.frame(cSplit(data_mat_tab_good,"Column_05","-at-")[,c("Column_05_1","Column_05_2")])
  Location<-data.frame("Location"=WhovsWhos$Column_05_2)
  WhovsWho<-str_replace(WhovsWhos$Column_05_1,"--","")
  
  # acquiring the Date
  Date<-str_replace(str_replace(data_mat_tab_good$Column_08,",",""),"-","")
  
  # Information of how wins
  won<-str_replace_all(str_replace(data_mat_tab_good$Column_11,"-",""),"-"," ")
  
  # match info extracted
  onetwo<-data.frame(cSplit(data_mat_tab_good,"Column_12",";")[,c("Column_12_1","Column_12_2")])
  names(onetwo)<-c("Team1","Team2")
  
  # Team 1 overs and Team 2 overs extracted
  team1_overs<-str_remove(gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", onetwo$Team1, perl=T),"-ov")
  team2_overs<-str_remove(gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", onetwo$Team2, perl=T),"-ov")
  
  # Runs 1 and Runs 2 extracted
  Runs1<-str_match(onetwo$Team1, "\\-(\\d{2,3})")[,2]
  Runs2<-str_match(onetwo$Team2, "\\-(\\d{2,3})")[,2]
  
  # Team 1 and Team 2 extracted
  Team1<-dplyr::recode_factor(cSplit(onetwo,"Team1","-")$Team1_1,"Sri"="Sri Lanka","West"="West Indies",
                              "South"="South Africa","New"="New Zealand","East"="East Africa",
                              "United"="United Arab Emirates")
  Team2<-dplyr::recode_factor(cSplit(onetwo,"Team2","-")$Team2_2,"Sri"="Sri Lanka","West"="West Indies",
                              "South"="South Africa","New"="New Zealand","East"="East Africa",
                              "United"="United Arab Emirates")
  
  # Wickets by Team 1 and Team 2 extracted    
  wickets1 <- str_match(onetwo$Team1, "\\-(\\d{2,3})\\/(\\d{0,1})\\-")[,3]
  wickets1[is.na(wickets1)]<-10
  
  wickets2 <- str_match(onetwo$Team2, "\\-(\\d{2,3})\\/(\\d{0,1})\\-")[,3]
  wickets2[is.na(wickets2)]<-10
    
  output<-cbind.data.frame("Matches" = Matches, "Location" = Location,"WhovsWho" = WhovsWho,"Date" = Date, 
                           "WinningConditions" = won, "Team1" = Team1, "Overs1" = team1_overs, "Runs1" =  Runs1,"Wickets1"=wickets1,
                           "Team2" = Team2, "Overs2" = team2_overs, "Runs2" = Runs2,"Wickets2"=wickets2)
  
  return(output)
}

Using the links above and function developed by me it is possible to extract each world cup information into separate data-frames, after a few tweaks(because as I said earlier not all world cups share the same format) we can generate one grand data-frame.

If we compare these 11 world cups and their matches it is clear that after 1992 it is no longer the same simple format. This earlier format has initial matches, semi-finals and at the end final. This format changes to initial matches, quarter finals, semi-finals and then only final in later versions. In 1999 this format was renewed and extra set of matches such as super sixes were added but with the removal of quarter finals. This format of super eights continued in 2003 and 2007 also. Yet in 2011 and 2015 this was changed by the removal of super eight matches, finally this year all together we do not even have quarter finals. This is the summary of match formats.

WorldCup 1975

WorldCup 1999

WorldCup 2011

Because of these format changes only the first column(Matches) will be affected and as it is a minor change I am going rectify it after extracting the data using the World Cup function. My main intention of this post is to explain how simple extracting information from webpages is and I hope it is completed.

Summarize

  • Extracting information from websites is mainly affected by the selector gadget and its highlighted area.

  • Ensure that you highlight the most sensible part of the webpage and be prepared to think few steps ahead how you are going to change this css information into a data-frame.

  • If your first attempt creates a complex structure and difficult to construct a clear data-frame do not hesitate to make changes in your highlighting process using selector gadget extension in the browser.

  • After creating this initial one column data-frame it is purely based on your creative skills to construct the much divided meaningful data-frame with multiple columns. This data-frame does not necessarily has to be perfect, but it needs to be more informative.

  • With the creation of this initial data-frame it is possible to make more intuitive changes for a much clear data-frame.

  • That is all the information related to extracting data from web pages.

THANK YOU