Friday 22 April 2011

How to Build a Dataset in R using an RSS feed or Web page

I recently wanted to build a dataset from content in an RSS feed - the feed of crimes in Newark provided by SpotCrime.  (They have feeds for lots of US cities, but I just wanted Newark.  Please read their Terms of Service before using this code on their feed.)  After some tinkering, I got it to work using the XML package in R. 
The first step is to read in the RSS feed XML file:


#install.packages("XML")
library(XML)
doc<-xmlTreeParse("http://s3.spotcrime.com/cache/rss/newark.xml")


The xmlTreeParse command "parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree."  There are tons of optional arguments, but as you can see, I didn't use any of them, and frankly, I don't understand many of them.  But the function did what I wanted.
Next, I used the command xmlRoot to isolate the "top level XMLNode object resulting from parsing an XML document."  Now is a good time to look at what we have:

> xmlRoot(doc)
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:georss="http://www.georss.org/georss">
 <channel>
  <atom:link href="http://spotcrime.com" rel="self" type="application/rss+xml"/>
  <title>Spotcrime.com Crime Listing - Newark, NJ</title>
  <description>Crime feed - RSS - 5 incidents. To see more visit http://spotcrime.com</description>
  <language>en-us</language>
  <link>http://spotcrime.com</link>
  <ttl>180</ttl>
  <copyright>ReportSee, Inc.</copyright>
  <item>
   <guid isPermaLink="true">http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj</guid>
   <link>http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj</link>
   <pubDate>Mon, 18 Apr 2011 00:00:00 -0400</pubDate>
   <title>Robbery on EASTON AVENUE, Franklin, NJ (via spotcrime.com)</title>
   <description>Police are seeking a man who robbed the Financial Resources Federal Credit Union</description>
   <georss:point>40.5242061 -74.495662</georss:point>
   <geo:Point>
    <geo:lat>40.5242061</geo:lat>
    <geo:long>-74.495662</geo:long>
   </geo:Point>
  </item>



This is only a portion of the full output - there are more <item> nodes, one for each crime.
So the feed starts with a header full of stuff we don't need, followed by the content in the <item> node, which is the good stuff: a link to the crime on SpotCrime, the publication date (more on this later), the crime "title," a description, and the Lat/Lon, in two different formats.  How do we get at that meaty stuff, and put it into a friendly R dataframe?  We'll use the xpathApply command:


src<-xpathApply(xmlRoot(doc), "//item")


xpathApply is a "way to find XML nodes that match a particular criterion" using XPath syntax.  XPath is a way to navigate XML trees.  My approach for a project like this is to aim, first and foremost, for code that works, and worry about advanced techniques later.  So I did a simple search for nodes identified as "item," ignoring all the other possible arguments to xpathApply.  src is now a list with 5 elements, one for each "item" node in the feed (recall that above, I only showed the first item node - four more followed).  We can now iterate through the 5 elements of src and convert the data into a dataframe:



for (i in seq_along(src)) {
    # Pull the text of every child node of item i into a named character vector
    foo <- xmlSApply(src[[i]], xmlValue)
    tmp <- data.frame(t(foo), stringsAsFactors=FALSE)
    if (i == 1) {
        # First item: create the data.frame
        DATA <- tmp
    } else {
        # Later items: append a row to the bottom
        DATA <- rbind(DATA, tmp)
    }
}
   
xmlSApply applies a function to the subnodes of an XML node.  In this case, the function is xmlValue, which returns the raw contents of a node.  So foo becomes a character vector containing all of those nice data bits for crime i. We then transpose foo into a matrix and convert it to a (1 row) data.frame.  The stringsAsFactors=FALSE prevents R from treating the strings as factors, which makes sense in this case - it might not in yours.
The first time through the loop, we want to create the data.frame; subsequent iterations, we just want to rbind a row on the bottom.  When we're done, we have what we want: the data from the RSS feed nicely formatted in a data.frame named (descriptively) DATA. 
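
For comparison, here is a minimal sketch of the same idea without the explicit loop: build one single-row data.frame per item with lapply, then stack them with a single rbind call.  The feed text below is a made-up stand-in so the example runs without touching the real feed, and the names feed, doc2, items, rows and DATA2 are mine, not part of the original script.  It assumes every item has the same child nodes, which rbind needs in order to line up the columns.

```r
library(XML)

# A tiny made-up feed, used only for illustration (not SpotCrime data)
feed <- '<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Robbery</title><link>http://example.com/1</link></item>
  <item><title>Theft</title><link>http://example.com/2</link></item>
</channel></rss>'

# Parse the text and collect every <item> node
doc2  <- xmlParse(feed, asText = TRUE)
items <- getNodeSet(doc2, "//item")

# One single-row data.frame per item
rows <- lapply(items, function(node) {
  data.frame(t(xmlSApply(node, xmlValue)), stringsAsFactors = FALSE)
})

# Stack all the rows in one call
DATA2 <- do.call(rbind, rows)
print(DATA2)
```

With only five items the loop is perfectly fine; the lapply version mostly saves typing and avoids growing the data.frame one row at a time.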

Now, returning to the date and time.  SpotCrime reports the publication date and time, not the date and time that the crime actually occurred.  What can we do?  It looks like SpotCrime reports the date and time we want on the webpage for the crime, the link to which was helpfully provided in the RSS feed.

So, let's read in the HTML for that page, and grab the correct date and time!


# Loop through the crimes, visit each crime's web page, and grab the actual date and time
date_time <- character(length(src))
for (i in seq_along(src)) {
    res <- htmlTreeParse(DATA$link[i], useInternalNodes=TRUE)
    title <- xpathApply(xmlRoot(res), "//title")
    # xmlValue returns the text inside the <title> node
    date_time[i] <- xmlValue(title[[1]])
}
# Add the vector as a new column; assigning this way keeps it as character
DATA$date_time <- date_time


Here, we used many of the same commands we used for the RSS feed. The real date and time were stored in a node called "title," so we just grabbed that node for each crime, stuck it into the appropriate slot in a vector, and slapped that vector onto our DATA data.frame.
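
If you want to try the title-grabbing step without hitting a live page, here is a small sketch that parses an HTML string instead of a URL.  The page text, including the date format inside the title, is invented for illustration; the real SpotCrime pages may well look different.

```r
library(XML)

# Made-up page; the title text and its date format are assumptions
page_html <- '<html><head><title>Robbery, 04/17/2011 10:30 PM</title></head><body><p>Details</p></body></html>'

# Parse the string as HTML and pull the text of the <title> node
page <- htmlParse(page_html, asText = TRUE)
page_title <- xpathSApply(page, "//title", xmlValue)
print(page_title)
```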
With a little string processing to extract and convert lat/lon and date/time to appropriate data types, the data collection code is finished!
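
As a rough sketch of the lat/lon part of that string processing: the georss:point value in the feed is a single string with latitude and longitude separated by a space, so splitting on the space and converting to numeric does the job.  The toy data.frame and the column name point are assumptions (check names(DATA) for what your columns are actually called), and the second coordinate pair is made up.

```r
# Toy stand-in for DATA; the column name `point` is an assumption
toy <- data.frame(
  point = c("40.5242061 -74.495662",   # from the feed excerpt above
            "40.7357 -74.1724"),       # made-up second row
  stringsAsFactors = FALSE
)

# Split each "lat lon" string on the space
parts <- strsplit(toy$point, " ", fixed = TRUE)

# First piece is latitude, second is longitude
toy$lat <- as.numeric(vapply(parts, `[`, character(1), 1))
toy$lon <- as.numeric(vapply(parts, `[`, character(1), 2))
print(toy)
```

The date/time conversion follows the same pattern, but the details depend on the exact format of the text you scraped, so I haven't tried to guess it here.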

Wednesday 20 April 2011

How to Source an R script automatically on a Mac using Automator and iCal

I wrote an R script that pulled data from an RSS feed.  The RSS feed updated frequently, so I wanted to be able to schedule the script to run automatically.  After some tinkering, I got it to work by implementing the steps below.  Note that these steps assume you do not want to save your workspace - that you are saving the objects you need explicitly within the script. 

  1. Test your R script in regular ole' R to make sure it runs without error.
  2. Add a quit(save="no") command at the end of your script. 
  3. Open Automator  (Applications -> Automator)
  4. Select Application from the Template Selection menu
  5. Select the Run Shell Script Action; double-click it or drag it over
  6. Type:  Rscript --no-save --no-restore /Users..../YourScript.R
    • The last argument should be the path of the script file you want to run
    • The --no-save and --no-restore options tell R not to save the workspace when it's done and not to restore previously saved objects on startup, respectively
    • To see a full list of arguments you can pass to Rscript, just open Terminal and type Rscript 
  7. You can test your "Application" now by hitting the Run button in the upper right corner.  Ignore the little robot's warning.  Hopefully, you'll see some green checkmarks and a "Workflow completed" message.
  8. Save your "Application" somewhere clever.
  9. Now, open iCal and create a new event.  Set the date, time, and repeat values as you wish.  Select alarm->Open File.
  10. It will default to iCal; click iCal, select Other, and navigate to your "Application," which you saved somewhere clever.
That's it, you're done.  When the scheduled time comes, the script will run in the background without even opening R.  You may see a little gear up top on your menu bar, but that's all.
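
Since the workspace is not saved, the script has to write out whatever it wants to keep.  Here is one possible sketch of that, assuming the results live in a data.frame with a link column that identifies each row.  The file name crimes.rds, the temporary directory, and the made-up rows are all placeholders; a real script would use a permanent path and the rows it actually pulled on that run.

```r
# Hypothetical output location; replace with a real path in your own script
out_file <- file.path(tempdir(), "crimes.rds")

# Stand-in for the rows pulled on this run (made-up values)
new_rows <- data.frame(
  link  = c("http://example.com/1", "http://example.com/2"),
  title = c("Robbery", "Theft"),
  stringsAsFactors = FALSE
)

if (file.exists(out_file)) {
  # Append to what earlier runs saved, then drop rows already seen
  combined <- rbind(readRDS(out_file), new_rows)
  combined <- combined[!duplicated(combined$link), ]
} else {
  # First run: nothing saved yet
  combined <- new_rows
}

# Write the result explicitly, since the workspace itself is discarded
saveRDS(combined, out_file)
```

In the scheduled script, the quit(save="no") line from step 2 would come after the saveRDS call.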