Using R to Fetch List of Pokemon Sets
This document shows how to extract a dataset from an HTML page.
We'll start by loading two libraries. RCurl is used to download the HTML page. XML is used to parse the HTML, which can be treated as a form of XML.
library(RCurl)
## Loading required package: bitops
library(XML)
Let R know where to find the HTML page. Then download and parse it.
theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
doc <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
Use XPath to extract all tr (table row) nodes from the HTML page. Those nodes include some extraneous rows, so we'll trim the list from 70 elements to 67 by dropping the first two rows and the last one.
tr <- getNodeSet(doc, "//*/tr")
# Keep rows 3 through the second-to-last one (67 of the 70 rows).
tr_with_pokemon_sets <- tr[3:(length(tr) - 1)]
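When slicing a list with a computed range, operator precedence matters: in R, `4:length(tr)-1` is evaluated as `(4:length(tr)) - 1`, so it selects positions 3 through `length(tr) - 1`. The following is a minimal sketch of that behavior; the 70-element vector `rows` is a made-up stand-in for the real node list.

```r
# Stand-in for the node list: 70 placeholder elements (hypothetical data).
rows <- seq_len(70)

# ':' has higher precedence than binary '-', so this is (4:70) - 1, i.e. 3 to 69.
idx_implicit <- 4:length(rows) - 1

# Explicit parentheses select the same positions and are easier to read.
idx_explicit <- 3:(length(rows) - 1)

# A different grouping selects a different range: 4 to 69.
idx_other <- 4:(length(rows) - 1)

length(rows[idx_implicit])            # number of elements kept
all(idx_implicit == idx_explicit)     # same positions either way
```

Writing the parentheses out costs nothing and removes any doubt about which rows are kept.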
Let's look at one example of the HTML. It holds information about one Pokemon set. Any pound signs at the start of the output lines are not part of the data; they are just part of the printing.
tr_with_pokemon_sets[1]
[[1]]
<tr><th> 1 </th>
<td> 1 </td>
<td> </td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a> </td>
<td> Expansion Pack </td>
<td> 102 </td>
<td> 102 </td>
<td> January 9, 1999 </td>
<td> October 20, 1996 </td></tr>
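As an aside, a row like this can also be read cell by cell. The sketch below is not the approach this document takes; it assumes the XML package is installed and uses a hand-copied, cut-down version of the row (the variable names `html`, `doc_example`, `row_example`, and `cells` are made up for the example).

```r
library(XML)

# A cut-down copy of the row, wrapped in a <table> so it parses as valid HTML.
html <- paste0(
  "<table><tr><th> 1 </th><td> 1 </td><td> </td>",
  "<td> <a href=\"/wiki/Base_Set_(TCG)\">Base Set</a> </td>",
  "<td> Expansion Pack </td><td> 102 </td><td> 102 </td>",
  "<td> January 9, 1999 </td><td> October 20, 1996 </td></tr></table>"
)
doc_example <- htmlParse(html, asText = TRUE)

# Take the only row, then collect its header and data cells in document order.
row_example <- getNodeSet(doc_example, "//tr")[[1]]
cell_nodes <- getNodeSet(row_example, "./th | ./td")

# xmlValue() drops the markup; paste() guards against cells with no text at all.
cells <- vapply(
  cell_nodes,
  function(node) trimws(paste(xmlValue(node), collapse = "")),
  character(1)
)
cells
```

Reading one value per cell keeps empty cells (such as the icon column) in place, so every row yields the same number of fields.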
In order to make sense of that HTML, we'll use a custom function to manipulate each element in tr_with_pokemon_sets. Generally speaking, the function removes newlines and HTML syntax. It also provides data types and column names.
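xmlToCsv <- function(xml) {
  # Clean up the text of the row. The pattern and replacement strings of
  # these gsub() calls are missing from this listing and are left blank.
  a <- gsub( , , xmlValue(xml))
  b <- gsub( , , a)
  d <- gsub( , , b)
  e <- gsub(^ | $,, d)
  f <- gsub( , , e)
  # Column classes (defined here but not passed to read.table) and column names.
  cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
  cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
  # Parse the cleaned text into a one-row data frame.
  g <- read.table(text=f, sep=" ", header=FALSE)
  colnames(g) <- cn
  # Keep only the columns of interest.
  keeps <- c("EngNumber", "EngSet", "EngCardCount")
  return(g[keeps])
}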
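To illustrate the kind of cleanup a function like xmlToCsv performs, here is a rough sketch that works on a plain string instead of an XML node. It is not the original function: the helper name `rowTextToDataFrame`, the sample text, and the assumption that each cell sits on its own line are all made up for this example.

```r
# Hypothetical helper: turn the text of one table row into a one-row data frame.
# Assumes one cell per line, which may not match the layout of the real page.
rowTextToDataFrame <- function(row_text) {
  cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet",
          "EngCardCount", "JpCardCount", "EngDate", "JpDate")
  # Split on newlines and strip the padding around each cell.
  cells <- trimws(strsplit(row_text, "\n", fixed = TRUE)[[1]])
  # Rows with the wrong number of cells (e.g. headers) become all-NA rows,
  # so they can be dropped later with na.omit().
  if (length(cells) != length(cn)) cells <- rep(NA_character_, length(cn))
  g <- as.data.frame(setNames(as.list(cells), cn), stringsAsFactors = FALSE)
  g$EngNumber <- suppressWarnings(as.numeric(g$EngNumber))
  g[c("EngNumber", "EngSet", "EngCardCount")]
}

sample_text <- paste(" 1 ", " 1 ", " ", " Base Set ", " Expansion Pack ",
                     " 102 ", " 102 ", " January 9, 1999 ", " October 20, 1996 ",
                     sep = "\n")
rowTextToDataFrame(sample_text)
```

Returning an all-NA row for malformed input keeps every result the same shape, which is what makes the later rbind step straightforward.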
Magic happens next. We apply the custom function, convert the results to a data.frame, and remove NA values.
pokemon_set_dataframe <- na.omit(do.call(rbind, lapply(tr_with_pokemon_sets, xmlToCsv)))
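The pattern of stacking per-row results with do.call(rbind, ...) and then calling na.omit() can be tried on toy data. In this sketch the list `parsed_rows` is invented; its second element mimics a row that did not parse.

```r
# Toy stand-ins for the per-row results (hypothetical data).
parsed_rows <- list(
  data.frame(EngNumber = 1, EngSet = "Base Set", EngCardCount = "102",
             stringsAsFactors = FALSE),
  data.frame(EngNumber = NA_real_, EngSet = NA_character_,
             EngCardCount = NA_character_, stringsAsFactors = FALSE),
  data.frame(EngNumber = 2, EngSet = "Jungle", EngCardCount = "64",
             stringsAsFactors = FALSE)
)

# do.call(rbind, ...) stacks the list into a single data frame.
stacked <- do.call(rbind, parsed_rows)

# na.omit() drops rows that contain NA but keeps the original row names,
# so the printed row numbers can have gaps.
cleaned <- na.omit(stacked)
rownames(cleaned)
```

This also explains why the row numbers in a printed result may skip values: they are the positions the rows had before the NA rows were removed.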
Print the data frame to see the data so far.
pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket          83*
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation          66*
10        11                Neo Destiny         113*
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis         186*
16        14                  Aquapolis         186*
17        15                   Skyridge         182*
18        15                   Skyridge         182*
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon         100*
22        19 EX Team Magma vs Team Aqua          97*
23        20          EX Hidden Legends         102*
24        21     EX FireRed & LeafGreen         116*
25        22     EX Team Rocket Returns         111*
26        23                  EX Deoxys         108*
27        24                 EX Emerald         107*
28        25           EX Unseen Forces         145*
29        26           EX Delta Species         114*
30        27            EX Legend Maker          93*
31        28          EX Holon Phantoms         111*
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures         124*
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront         106*
42        40              Rising Rivals         120*
43        41            Supreme Victors         153*
44        42                     Arceus         111*
45        43     HeartGold & SoulSilver         124*
46        44                  Unleashed          96*
47        45                  Undaunted          91*
48        46                 Triumphant         103*
49        47            Call of Legends          106
50        48              Black & White         115*
51        49            Emerging Powers           98
52        50            Noble Victories         102*
53        51             Next Destinies         103*
54        52             Dark Explorers         111*
55        53            Dragons Exalted         128*
56        54         Boundaries Crossed         153*
57        55               Plasma Storm         138*
58        56              Plasma Freeze         122*
59        57               Plasma Blast         105*
60        58        Legendary Treasures         138*
61        59                         XY          146
62        60                  Flashfire         109*
63        61              Furious Fists         113*
64        62             Phantom Forces         122*
65        63               Primal Clash         150+
Notice those extra asterisks and plus signs? The next bit of code removes them.
# '*' and '+' are regex metacharacters; inside [ ] they are matched literally.
pokemon_set_dataframe$EngCardCount <- gsub("[*+]", "", pokemon_set_dataframe$EngCardCount)
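Because `*` and `+` are regular-expression metacharacters, they have to be matched literally to be removed. The sketch below shows two common ways to do that on a small, made-up vector `counts`, followed by a conversion to numbers, which may be useful before plotting.

```r
# Made-up sample values in the same style as the EngCardCount column.
counts <- c("102", "83*", "150+")

# Inside a character class, '*' and '+' are literal characters.
clean_class <- gsub("[*+]", "", counts)

# Equivalent: escape each metacharacter (backslashes are doubled in R strings).
clean_escaped <- gsub("\\*|\\+", "", counts)

# With the markers gone, the values can be converted to numbers.
as.numeric(clean_class)
```

The character-class form tends to be easier to read because it avoids the doubled backslashes.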
Here is the final dataset.
pokemon_set_dataframe
   EngNumber                     EngSet EngCardCount
1          1                   Base Set          102
2          2                     Jungle           64
3          3                     Fossil           62
4          4                 Base Set 2          130
5          5                Team Rocket           83
6          7              Gym Challenge          132
7          8                Neo Genesis          111
8          9              Neo Discovery           75
9         10             Neo Revelation           66
10        11                Neo Destiny          113
11        12       Legendary Collection          110
14        13        Expedition Base Set          165
15        14                  Aquapolis          186
16        14                  Aquapolis          186
17        15                   Skyridge          182
18        15                   Skyridge          182
19        16         EX Ruby & Sapphire          109
20        17               EX Sandstorm          100
21        18                  EX Dragon          100
22        19 EX Team Magma vs Team Aqua           97
23        20          EX Hidden Legends          102
24        21     EX FireRed & LeafGreen          116
25        22     EX Team Rocket Returns          111
26        23                  EX Deoxys          108
27        24                 EX Emerald          107
28        25           EX Unseen Forces          145
29        26           EX Delta Species          114
30        27            EX Legend Maker           93
31        28          EX Holon Phantoms          111
32        29       EX Crystal Guardians          100
33        30        EX Dragon Frontiers          101
34        31           EX Power Keepers          108
35        32            Diamond & Pearl          130
36        33       Mysterious Treasures          124
37        34             Secret Wonders          132
38        35           Great Encounters          106
39        36              Majestic Dawn          100
40        37           Legends Awakened          146
41        38                 Stormfront          106
42        40              Rising Rivals          120
43        41            Supreme Victors          153
44        42                     Arceus          111
45        43     HeartGold & SoulSilver          124
46        44                  Unleashed           96
47        45                  Undaunted           91
48        46                 Triumphant          103
49        47            Call of Legends          106
50        48              Black & White          115
51        49            Emerging Powers           98
52        50            Noble Victories          102
53        51             Next Destinies          103
54        52             Dark Explorers          111
55        53            Dragons Exalted          128
56        54         Boundaries Crossed          153
57        55               Plasma Storm          138
58        56              Plasma Freeze          122
59        57               Plasma Blast          105
60        58        Legendary Treasures          138
61        59                         XY          146
62        60                  Flashfire          109
63        61              Furious Fists          113
64        62             Phantom Forces          122
65        63               Primal Clash          150
With a bit more work, the row names (the first column of numbers) can be left out of the printed output.
x <- as.matrix(format(pokemon_set_dataframe))
rownames(x) <- rep("", nrow(x))
print(x, quote=FALSE)
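A shorter route to a similar result, offered here as an alternative rather than as what this document uses, is the `row.names` argument of `print()` for data frames. The small data frame `sets` below is made up for the example.

```r
# A small stand-in data frame (hypothetical data).
sets <- data.frame(
  EngNumber = c(1, 2, 3),
  EngSet = c("Base Set", "Jungle", "Fossil"),
  EngCardCount = c("102", "64", "62"),
  stringsAsFactors = FALSE
)

# print() for data frames accepts row.names = FALSE to hide the row names.
print(sets, row.names = FALSE)

# capture.output() collects the printed lines, e.g. to inspect or save them.
printed <- capture.output(print(sets, row.names = FALSE))
length(printed)
```

The column alignment differs slightly from the matrix approach, since data frames print their columns right-aligned by default.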
 EngNumber EngSet                     EngCardCount
  1        Base Set                   102
  2        Jungle                     64
  3        Fossil                     62
  4        Base Set 2                 130
  5        Team Rocket                83
  7        Gym Challenge              132
  8        Neo Genesis                111
  9        Neo Discovery              75
 10        Neo Revelation             66
 11        Neo Destiny                113
 12        Legendary Collection       110
 13        Expedition Base Set        165
 14        Aquapolis                  186
 14        Aquapolis                  186
 15        Skyridge                   182
 15        Skyridge                   182
 16        EX Ruby & Sapphire         109
 17        EX Sandstorm               100
 18        EX Dragon                  100
 19        EX Team Magma vs Team Aqua 97
 20        EX Hidden Legends          102
 21        EX FireRed & LeafGreen     116
 22        EX Team Rocket Returns     111
 23        EX Deoxys                  108
 24        EX Emerald                 107
 25        EX Unseen Forces           145
 26        EX Delta Species           114
 27        EX Legend Maker            93
 28        EX Holon Phantoms          111
 29        EX Crystal Guardians       100
 30        EX Dragon Frontiers        101
 31        EX Power Keepers           108
 32        Diamond & Pearl            130
 33        Mysterious Treasures       124
 34        Secret Wonders             132
 35        Great Encounters           106
 36        Majestic Dawn              100
 37        Legends Awakened           146
 38        Stormfront                 106
 40        Rising Rivals              120
 41        Supreme Victors            153
 42        Arceus                     111
 43        HeartGold & SoulSilver     124
 44        Unleashed                  96
 45        Undaunted                  91
 46        Triumphant                 103
 47        Call of Legends            106
 48        Black & White              115
 49        Emerging Powers            98
 50        Noble Victories            102
 51        Next Destinies             103
 52        Dark Explorers             111
 53        Dragons Exalted            128
 54        Boundaries Crossed         153
 55        Plasma Storm               138
 56        Plasma Freeze              122
 57        Plasma Blast               105
 58        Legendary Treasures        138
 59        XY                         146
 60        Flashfire                  109
 61        Furious Fists              113
 62        Phantom Forces             122
 63        Primal Clash               150
And we can plot the number of cards per set against the set number.
plot(pokemon_set_dataframe[c(1,3)])