Friday, June 22, 2012

R and the web (for beginners), Part II: XML in R


This second post of my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data via a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package (XML) that makes it quite easy to parse, process, and generate XML data with R; I use that package in this example. The XML document used in this example (taken from w3schools.com) describes a fictional plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you...

How to parse/read this XML-document into R?
 
# install and load the necessary package

install.packages("XML")
library(XML)


# Save the URL of the xml file in a variable

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreeParse function to parse the xml file directly from the web
 
xmlfile <- xmlTreeParse(xml.url)


# the xml file is now saved as an object you can easily work with in R:

class(xmlfile)
# this should print "XMLDocument" (plus, depending on the package
# version, "XMLAbstractDocument")



# Use the xmlRoot-function to access the top node

xmltop <- xmlRoot(xmlfile)

# have a look at the XML-code of the first subnodes:

xmltop[1:2]

This should look more or less like:


$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"

One can already guess how this data should look in a matrix or data frame. The goal is to extract the XML values from each tag for all $PLANT nodes and save them in a data frame with a row for each plant ($PLANT node) and a column for each tag (variable) describing it. How can you do that?


# To extract the XML-values from the document, use xmlSApply:
# the outer xmlSApply loops over the PLANT nodes, the inner one over
# each plant's child tags and pulls out their values with xmlValue

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))


# Finally, get the data into a data frame and have a look at the first
# rows and columns (plantcat is a matrix with one column per plant,
# hence the transpose)

plantcat_df <- data.frame(t(plantcat), row.names=NULL)
plantcat_df[1:5,1:4]

The first rows and columns of that data frame should look like this:
 
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
This is exactly the format we need to analyze the data in R.
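
As an aside, the XML package also supports XPath queries, which often scale better for larger or less regular documents. Here is a minimal sketch of the same extraction via xpathSApply; it assumes the document is parsed into internal (C-level) nodes with xmlParse instead of xmlTreeParse:

# parse the document again, this time as internal nodes
doc <- xmlParse(xml.url)

# extract one vector per tag with an XPath expression
common    <- xpathSApply(doc, "//PLANT/COMMON", xmlValue)
botanical <- xpathSApply(doc, "//PLANT/BOTANICAL", xmlValue)

# combine into a data frame (the remaining tags work the same way)
plantcat_df2 <- data.frame(COMMON=common, BOTANICAL=botanical)
head(plantcat_df2)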



Thursday, June 21, 2012

R and the web (for beginners), Part I: How is the local nuclear plant doing?



One of the things I especially like about R is its ability to easily access and process data from the web. If you are new to R, or have never used it to access data from the Internet, here is the first part of a little series of posts with examples to get you started. This first post gives a very simple example of how to access a data set that is saved online.

This might be particularly useful if the data set at hand is frequently updated and you want to repeatedly generate a statistical report of some kind based on it. Having the analysis and the link to the data in one R script means you only have to rerun the script whenever you want to update your report, or whenever anybody else wants to reproduce it. The data I'm using for this example is exactly of that type: a file published by the United States Nuclear Regulatory Commission (U.S. NRC) reporting the power reactor status of U.S. nuclear power plants for the last 365 days, which means it is updated every day.

How to access that data directly through R? 

# First: save the url of the data file as a character string

url.npower <- "http://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/PowerReactorStatusForLast365Days.txt"


# then read the data file into R (in this case the data file is a text file with "|" separating the columns) 

npower <- read.table(url.npower, sep="|", header=TRUE)

# and format the date column

npower$ReportDt <- as.Date(npower$ReportDt, format="%m/%d/%Y")
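
At this point a quick sanity check doesn't hurt; the column names ReportDt, Unit, and Power come from the NRC file, and the exact output will depend on the day you run the script:

# inspect the structure of the data set and the date range it covers
str(npower)
range(npower$ReportDt)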


The data set is now ready for analysis. For example: a graphical analysis of the recent power reactor status of some of the nuclear power plants:

# load the necessary lattice package
# (if it isn't installed yet, run: install.packages("lattice"))
library(lattice)

# take a subset of the data: the first 24 reactor units
# (%in% is needed instead of == because == would recycle the comparison
# vector; the name "npower_sub" also avoids masking base R's sample())
npower_sub <- npower[npower$Unit %in% unique(npower$Unit)[1:24],]

# get a graphical overview
xyplot(Power ~ ReportDt | Unit, data=npower_sub, type="l",
       col.line="black", xlab="Time", ylab="Power")



Save the code above in an R-script, rerun it some days later, and your graphical analysis will be up to date.
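
If each rerun should also produce an up-to-date report file, here is a minimal sketch (the file name is just an illustration; note that lattice plots have to be wrapped in print() when a script writes them to a graphics device):

# write the plot to a PDF whose file name carries the current date
pdf(paste("reactor_status_", Sys.Date(), ".pdf", sep=""))
print(xyplot(Power ~ ReportDt | Unit, data=npower_sub, type="l",
             col.line="black", xlab="Time", ylab="Power"))
dev.off()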