Skip to main content

Screen scraping with Nokogiri

So a few months ago I put together a side project at http://www.myrecipesavour.com/ . Basically, the site allows you to put in the URL of a cooking recipe page and will then parse the recipe for your collection.

So it turns out, reading data from another site is very easy with Nokogiri.

The source code is available here https://github.com/abreckner/MyRecipeSavour

There is a lot I am going to cover in the next few posts based on this code base (like Devise and Heroku), but for now we are focussed on this file https://github.com/abreckner/MyRecipeSavour/blob/master/app/models/site.rb

So we are going to look at the add_recipe method.

First we need to require a few packages
require 'open-uri'
require 'rubygems'
require 'nokogiri'
Unfortunately, I haven't yet figured out a heuristic for separating a recipe web page into a recipe's components (Title, Ingredients, Instructions, Amounts, etc...) but as a workaround, I maintain a catalogue of CSS selectors which define these elements per domain. When I read the page, I use NokoGiri to parse those elements for me using the CSS selectors

i.e.
html = Nokogiri::HTML(open(url).read) # open the page
  title = html.css(site.title_selector).text.strip # read the title

I then populate a recipe object with these pieces


recipe = Recipe.new
recipe.name = title
...
recipe.save


My code around the ingredients and instructions is a little more complex as my Recipe model has many Ingredients and Instructions (eventually I am going to allow users to manipulate them individually). Each ingredient/instruction is parsed based on a line break, so I need to pull in the ingredient array from Nokogiri and then merge it into a string separated by line breaks.

ingredients = html.css(site.ingredient_selector).children.inject(''){|sum, n| sum + n.text + "\n"
...
Ingredient.multi_save(ingredients, recipe)

The reason I convert it to a string and then back into an array is so that the user can later edit the ingredients via a textarea. It's fair to say that I actually write the multi_save code from a textarea for input before I did the screen scrape and I wanted to reuse it.

The other interesting piece of this add_recipe method is that I store a new Site in case the user tries to add a recipe from an "uncatalogued" site. This automatically builds up a list of the sites people are interested in saving recipes from and allows me to catalogue it at a later date


site_domain = URI.parse(url).host
    site = Site.find_by_domain site_domain
    if site.nil?
      site = Site.new
      site.domain = site_domain
      site.url = url
      site.user = current_user
      site.save!
      false
    else
... #Nokogiri scraping code goes here
end


Comments

Popular posts from this blog

Master of my domain

Hi All, I just got myself a new domain ( http://www.skuunk.com ). The reason is that Blogspot.com is offering cheap domain via GoDaddy.com and I thought after having this nickname for nigh on 10 years it was time to buy the domain before someone else did (also I read somewhere that using blogspot.com in your domain is the equivalent of an aol.com or hotmail.com email address...shudder...). Of course I forgot that I would have to re-register my blog everywhere (which is taking ages) not to mention set up all my stats stuff again. *sigh*. It's a blogger's life... In any case, don't forget to bookmark the new address and to vote me up on Technorati !

Responsive Web Design

I wanted to go over Responsive Web Design using CSS. In the old days of web development, we had to code to common screen sizes (i.e. 800 X 600, 1024 X 768) and we patiently waited for people to upgrade their computers to have a decent amount of screen real estate so we could design things the way we really wanted. We also took on semi stretchy web layouts etc to expand and contract appropriately. Then about 2 or 3 years ago, Apple released this little device called an iPhone with a 320 X 480 resolution which took the world by storm and suddenly a lot of people were viewing your website on a tiny screen again... Anyways, as it can be difficult to design a site which looks good on 320 X 480 AND 1680 X 1050, we need to come up with some kind of solution. One way is to sniff the client and then use an appropriate stylesheet, but then you are mixing CSS with either JavaScript or server side programming and also potentially maintaining a list of appropriate clients and stylesheets. Also, you...

Elixir - destructuring, function overloading and pattern matching

Why am I covering 3 Elixir topics at once? Well, perhaps it is to show you how the three are used together. Individually, any of these 3 are interesting, but combined, they provide you with a means of essentially getting rid of conditionals and spaghetti logic. Consider the following function. def greet_beatle(person) do case person.first_name do "John" -> "Hello John." "Paul" -> "Good day Paul." "George" -> "Georgie boy, how you doing?" "Ringo" -> "What a drummer!" _-> "You are not a Beatle, #{person.first_name}" end end Yeah, it basically works, but there is a big old case statement in there. If you wanted to do something more as well depending on the person, you could easily end up with some spaghetti logic. Let's see how we can simplify this a little. def greet_beatle(%{first_name: first_name}) do case first_name d...