Thursday, June 21, 2012

Screen scraping with Nokogiri

So a few months ago I put together a side project at http://www.myrecipesavour.com/. Basically, the site allows you to put in the URL of a cooking recipe page and will then parse the recipe for your collection.

So it turns out, reading data from another site is very easy with Nokogiri.

The source code is available here https://github.com/abreckner/MyRecipeSavour

There is a lot I am going to cover in the next few posts based on this code base (like Devise and Heroku), but for now we are focussed on this file https://github.com/abreckner/MyRecipeSavour/blob/master/app/models/site.rb

So we are going to look at the add_recipe method.

First we need to require a few packages:

require 'open-uri'
require 'rubygems'
require 'nokogiri'

Unfortunately, I haven't yet figured out a heuristic for separating a recipe web page into a recipe's components (Title, Ingredients, Instructions, Amounts, etc.), but as a workaround I maintain a catalogue of CSS selectors which define these elements per domain. When I read the page, I use Nokogiri to parse those elements for me using the CSS selectors.

i.e.
html = Nokogiri::HTML(open(url).read) # open the page (on Ruby 3.0+, use URI.open(url) instead)
title = html.css(site.title_selector).text.strip # read the title

I then populate a recipe object with these pieces


recipe = Recipe.new
recipe.name = title
...
recipe.save


My code around the ingredients and instructions is a little more complex as my Recipe model has many Ingredients and Instructions (eventually I am going to allow users to manipulate them individually). Each ingredient/instruction is parsed based on a line break, so I need to pull in the ingredient array from Nokogiri and then merge it into a string separated by line breaks.

ingredients = html.css(site.ingredient_selector).children.inject(''){|sum, n| sum + n.text + "\n"}
...
Ingredient.multi_save(ingredients, recipe)

The reason I convert it to a string and then back into an array is so that the user can later edit the ingredients via a textarea. It's fair to say that I actually wrote the multi_save code to take its input from a textarea before I did the screen scraping, and I wanted to reuse it.
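As a rough sketch of that round trip, here is the idea with plain strings standing in for Nokogiri nodes. Note that join_lines and split_lines are hypothetical helpers made up for illustration; they are not the multi_save code from the repo, and unlike the inject call above they also drop blank lines.

```ruby
# Hypothetical stand-ins; the real Ingredient.multi_save lives in the repo.

# Join scraped text fragments into one newline-separated string,
# the same shape a user would type into a textarea.
def join_lines(fragments)
  fragments.map { |f| f.strip }.reject { |f| f.empty? }.join("\n")
end

# Split textarea-style input back into one entry per line.
def split_lines(text)
  text.split(/\r?\n/).map { |line| line.strip }.reject { |line| line.empty? }
end

scraped = ["2 cups flour ", "", " 1 tsp salt"]
as_text = join_lines(scraped)
puts as_text
p split_lines(as_text)
```

Because both the scraper and the textarea end up producing the same kind of string, a single save routine can serve both paths.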

The other interesting piece of this add_recipe method is that I store a new Site in case the user tries to add a recipe from an "uncatalogued" site. This automatically builds up a list of the sites people are interested in saving recipes from and allows me to catalogue them at a later date.


site_domain = URI.parse(url).host
site = Site.find_by_domain site_domain
if site.nil?
  site = Site.new
  site.domain = site_domain
  site.url = url
  site.user = current_user
  site.save!
  false
else
  ... # Nokogiri scraping code goes here
end
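
To show the lookup idea in isolation, here is a minimal sketch that swaps the Site model for an in-memory Hash. Everything in it is an assumption for illustration: the CATALOGUE and UNCATALOGUED constants, the selectors_for helper, and the example domains and selectors are all made up and are not taken from the repo.

```ruby
require 'uri'

# Hypothetical in-memory catalogue: domain => CSS selectors.
# The domain and selectors below are invented for illustration.
CATALOGUE = {
  "recipes.example.com" => { title: "h1.recipe-title", ingredients: "ul.ingredients li" }
}

# Domains we were asked about but have no selectors for yet.
UNCATALOGUED = []

# Return the selectors for the URL's domain, or nil after recording the
# unknown domain so it can be catalogued later.
def selectors_for(url)
  domain = URI.parse(url).host
  selectors = CATALOGUE[domain]
  UNCATALOGUED << domain if selectors.nil? && !UNCATALOGUED.include?(domain)
  selectors
end

p selectors_for("http://recipes.example.com/pancakes")
p selectors_for("http://unknown.example.org/soup")
p UNCATALOGUED
```

The nil return plays the same role as the false in the branch above: the caller knows nothing was scraped, while the unknown domain is still remembered.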


Monday, June 11, 2012

How DRY is too DRY?

One of the earliest development principles we learn as programmers is Don't Repeat Yourself, or DRY for short.

Copy and Paste are supposedly your worst enemies. Rather than rely on copy and paste, you create functions and subroutines and call them from your code so you don't have to reimplement the same logic over and over. This has the added advantage that if you need to make a change to that subroutine, you only need to make that change once. (Note: I realize that functions and subroutines are different entities, but for the purpose of this article they are interchangeable).

Sometimes you come across two pieces of functionality which are very similar, so instead of copying and pasting, you call the same function/subroutine with different arguments to handle the differences, while the core functionality stays the same.

For example, you might have a Customer object and a Vendor object. Both customers and vendors have addresses and you need to send mail to them both from time to time so you might decide to create a format_address function which they both can use, rather than copying and pasting the format_address code from one to the other. You might even move Address out to its own object which has a Customer or Vendor parent and put the format_address function on that.
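
A minimal sketch of that last arrangement might look like this. The class layout and field names (street, city, postcode) are assumptions made for illustration only.

```ruby
# Hypothetical classes; field names are assumptions for illustration.
Address = Struct.new(:street, :city, :postcode) do
  # One shared formatter instead of a copy on each owner class.
  def format_address
    [street, "#{city} #{postcode}"].join("\n")
  end
end

Customer = Struct.new(:name, :address)
Vendor   = Struct.new(:name, :address)

customer = Customer.new("Ann", Address.new("1 Main St", "Springfield", "12345"))
vendor   = Vendor.new("Acme", Address.new("9 Dock Rd", "Shelbyville", "54321"))

puts customer.address.format_address
puts vendor.address.format_address
```

If the mailing format ever changes, only Address#format_address needs to be touched.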

This becomes second nature after a while and is almost as rewarding as recycling. Hey, you are not wasting things if you re-use them, right? An intermediate developer will try to re-use code and keep things DRY as much as possible.

However, there are times when you may not want to actually keep things DRY. The primary example is integration tests. When you are running integration tests, you might need to instantiate large numbers and different types of objects to simulate the system running as a whole. This code might look quite similar between tests so the temptation is to move it out into a subroutine and reuse it.

This could be a mistake, however. When hooking up test code in this manner, you are creating hidden dependencies in your code which can make things very difficult to change should the requirements change in only one area of your code. People might try to edit one object and find their tests failing for some unknown reason.
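
Here is a small sketch of how such a hidden dependency can bite. All of the names (SHARED_DEFAULTS, build_order, shipping_cost, test_a, test_b) are hypothetical and the "tests" are plain methods rather than a real test framework.

```ruby
# Hypothetical shared setup helper; every name here is illustrative.
SHARED_DEFAULTS = { country: "US", items: 3 }

def build_order(defaults = SHARED_DEFAULTS, overrides = {})
  defaults.merge(overrides)
end

def shipping_cost(order)
  order[:country] == "US" ? 5 : 20
end

# Test A depends on the helper's default country without saying so.
def test_a(defaults = SHARED_DEFAULTS)
  shipping_cost(build_order(defaults)) == 5
end

# Test B repeats its setup inline, so it depends only on what it states.
def test_b
  shipping_cost({ country: "US", items: 3 }) == 5
end

# Someone edits the shared defaults to suit an unrelated test...
edited = SHARED_DEFAULTS.merge(country: "CA")

puts test_a          # passes with the original defaults
puts test_a(edited)  # fails, although nothing about shipping changed
puts test_b          # unaffected by the edit
```

The repetition in test_b is the price of being able to read, and change, each test on its own.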

The other time that DRY can work against you is when you try to use it between objects that are conceptually different and unrelated. The example that most readily comes to mind is CSS. Just because two elements look similar (e.g. they both have the same color, rounded corners, and font) does not mean they are related, and DRYing up the CSS too aggressively can make the design difficult to change later. The same goes for code that is more procedural than object oriented (e.g. a lot of front-end interactive code where form elements interact with the user and each other depending on the user's actions). In those cases you must really use your judgement to decide whether or not to DRY it up (and there is a high probability you won't get it perfectly right either).

So remember, while beginners Copy and Paste and intermediates DRY everything, an experienced programmer knows there may be a time for both.

Monday, June 04, 2012

We know JavaScript is weird... enough already!

So it looks like JavaScript is close to being considered a "real" language nowadays. There are frameworks which allow you to do MVC (like Backbone.js), you can use it on the server (with Node.js) and you can even use it to interact with datastores (via MongoDB).

So why is it that almost every job posting you see for a JavaScript gig, and every interview you go to that has a JavaScript component, asks you to interpret or debug (without using a browser) some esoteric fault of JavaScript that you probably wouldn't run into in 100 years because you actually write decent JS?

Like
  var cities = ["NY", "SF"];
  cities.length = 1;
  console.log(cities); // outputs ["NY"]

or 


  var a = 1 + 1 + "1"; // equals "21"
  var b = "1" + 1 + 1; // equals "111"

It's as if they are trying to say to you "Look at that piece of crap language you are programming in! You must be an idiot!" while at the same time offering you a job in said language whilst wanting to build up their systems in it.

Also, there are so many of these quirks in JS (and web development in general) that not having seen a particular one does not mean that you are a bad programmer or that you don't know JavaScript.

Instead, ask candidates to do FizzBuzz at a console with a text editor and a browser (whilst looking over their shoulder to see that they are not cheating) and really look at how well their code is written. Ask them to write a recursive function. Ask them to create an instance of an object and add some functionality to it via its prototype.

In short, ask them to do something they do in real life and what you probably want them to do for you.

Don't remind them how shitty JS is; believe me, they know this better than you do.