Thursday, June 21, 2012

Screen scraping with Nokogiri

A few months ago I put together a side project at http://www.myrecipesavour.com/. Basically, the site lets you enter the URL of a recipe page, then parses the recipe and adds it to your collection.

So it turns out, reading data from another site is very easy with Nokogiri.

The source code is available here: https://github.com/abreckner/MyRecipeSavour

There is a lot I am going to cover in the next few posts based on this code base (like Devise and Heroku), but for now we are focussed on this file: https://github.com/abreckner/MyRecipeSavour/blob/master/app/models/site.rb

So we are going to look at the add_recipe method.

First we need to require a few packages
require 'open-uri'
require 'rubygems'
require 'nokogiri'
Unfortunately, I haven't yet figured out a heuristic for separating a recipe web page into a recipe's components (title, ingredients, instructions, amounts, etc.), but as a workaround I maintain a catalogue of CSS selectors which define these elements per domain. When I read the page, I use Nokogiri to extract those elements for me using the CSS selectors.

For example:
html = Nokogiri::HTML(URI.open(url).read) # open the page (on Ruby 3+, open-uri needs URI.open rather than plain open)
title = html.css(site.title_selector).text.strip # read the title

I then populate a recipe object with these pieces


recipe = Recipe.new
recipe.name = title
...
recipe.save


My code around the ingredients and instructions is a little more complex as my Recipe model has many Ingredients and Instructions (eventually I am going to allow users to manipulate them individually). Each ingredient/instruction is parsed based on a line break, so I need to pull in the ingredient array from Nokogiri and then merge it into a string separated by line breaks.

ingredients = html.css(site.ingredient_selector).children.inject('') { |sum, n| sum + n.text + "\n" }
...
Ingredient.multi_save(ingredients, recipe)

The reason I convert it to a string and then back into an array is so that the user can later edit the ingredients via a textarea. It's fair to say that I actually wrote the multi_save code for textarea input before I did the screen scraping, and I wanted to reuse it.
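The round trip described above can be sketched in plain Ruby. This is only an illustration under assumptions: join_lines and split_lines are hypothetical helper names, the array of strings stands in for the text of the Nokogiri nodes, and the real multi_save may well split and clean its input differently.

```ruby
# Hypothetical helpers, not the actual code from site.rb or Ingredient.multi_save.

# Merge scraped text fragments into one newline-separated string,
# the same shape a user would later edit in a textarea.
def join_lines(fragments)
  fragments.inject('') { |sum, text| sum + text + "\n" }
end

# Split textarea-style input back into individual entries,
# trimming whitespace and dropping blank lines.
def split_lines(text)
  text.split("\n").map(&:strip).reject(&:empty?)
end

# Stand-in for the text of each scraped node (assumed sample data).
scraped = ['2 cups flour', ' 1 tsp salt ', '', '3 eggs']

as_text = join_lines(scraped)
ingredients = split_lines(as_text)
puts ingredients.inspect
```

Because both the scraper and the textarea produce the same newline-separated string, a single splitting routine can serve both inputs.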

The other interesting piece of this add_recipe method is that I store a new Site when the user tries to add a recipe from an "uncatalogued" site. This automatically builds up a list of the sites people are interested in saving recipes from and allows me to catalogue them at a later date.


site_domain = URI.parse(url).host
site = Site.find_by_domain site_domain
if site.nil?
  site = Site.new
  site.domain = site_domain
  site.url = url
  site.user = current_user
  site.save!
  false
else
  ... # Nokogiri scraping code goes here
end
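
To try the lookup-or-record idea without Rails, here is a minimal sketch using only the standard library. The CATALOGUE hash, the UNCATALOGUED list, the selectors_for method and the example domains and selectors are all hypothetical stand-ins for the Site model, not code from the project.

```ruby
require 'uri'

# Hypothetical in-memory catalogue standing in for the Site model:
# domain => CSS selectors for that site's recipe pages.
CATALOGUE = {
  'www.example.com' => { title: 'h1.recipe-title', ingredients: 'ul.ingredients li' }
}

# Domains users have asked for that have no selectors yet.
UNCATALOGUED = []

# Return the selectors for the URL's domain, or nil after noting
# the domain so it can be catalogued later.
def selectors_for(url)
  domain = URI.parse(url).host
  selectors = CATALOGUE[domain]
  if selectors.nil?
    UNCATALOGUED << domain unless UNCATALOGUED.include?(domain)
    return nil
  end
  selectors
end

known = selectors_for('http://www.example.com/recipes/42')
unknown = selectors_for('http://unknown.example.org/pie')
puts known.inspect
puts unknown.inspect
puts UNCATALOGUED.inspect
```

Returning nil for an unknown domain plays the same role as the false in the method above: the caller can tell the user that the site is not supported yet.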


Monday, June 11, 2012

How DRY is too DRY?

One of the earliest development principles we learn as programmers is Don't Repeat Yourself, or DRY for short.

Copy and Paste are supposedly your worst enemies. Rather than rely on copy and paste, you create functions and subroutines and call them from your code so you don't have to reimplement it continuously. It also has the added advantage that if you need to make a change to that subroutine, you only need to make that change once. (Note: I realize that functions and subroutines are different entities, but for the purpose of this article they are interchangeable).

Sometimes you come across two pieces of functionality which are very similar, so instead of copying and pasting, you call one function/subroutine with different arguments that handle the differences, while the core functionality stays the same.

For example, you might have a Customer object and a Vendor object. Both customers and vendors have addresses and you need to send mail to them both from time to time so you might decide to create a format_address function which they both can use, rather than copying and pasting the format_address code from one to the other. You might even move Address out to its own object which has a Customer or Vendor parent and put the format_address function on that.
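
As a rough sketch of that last arrangement (the class names come from the example above, but the fields and the output format are assumptions made up for illustration):

```ruby
# Hypothetical Address object that owns the formatting logic.
Address = Struct.new(:street, :city, :postcode) do
  # The single shared implementation: neither Customer nor Vendor
  # needs its own copy of this code.
  def format_address(recipient)
    "#{recipient}\n#{street}\n#{city} #{postcode}"
  end
end

# Customers and vendors each simply hold an Address.
Customer = Struct.new(:name, :address)
Vendor   = Struct.new(:company, :address)

customer = Customer.new('Ann Lee', Address.new('1 Main St', 'Springfield', '12345'))
vendor   = Vendor.new('Acme Ltd', Address.new('9 Dock Rd', 'Shelbyville', '67890'))

customer_label = customer.address.format_address(customer.name)
vendor_label   = vendor.address.format_address(vendor.company)
puts customer_label
puts vendor_label
```

A change to the label layout now happens in one place and applies to both kinds of recipient.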

This becomes second nature after a while and is almost as rewarding as recycling. Hey, you are not wasting things by re-use right? An intermediate developer will try and re-use and keep things DRY as much as possible.

However, there are times when you may not want to actually keep things DRY. The primary example is integration tests. When you are running integration tests, you might need to instantiate large numbers and different types of objects to simulate the system running as a whole. This code might look quite similar between tests so the temptation is to move it out into a subroutine and reuse it.

This could be a mistake, however. When you share setup code between tests in this manner, you create hidden dependencies which can make things very difficult to change should the requirements change in only one area of your code. People might edit one object and find unrelated tests failing for no obvious reason.

The other time that DRY can work against you is when you try to apply it to objects that are conceptually different and unrelated. The example that most readily comes to mind is CSS. Just because two objects look similar (e.g. they both have the same color, rounded corners, and font) does not mean they are related, and DRYing up the CSS too much can make the design difficult to change later. The same goes for code that is more procedural than object oriented (e.g. a lot of front end interactive code where form elements interact with the user and each other depending on the user's actions). In those cases you must really use your judgement to decide whether or not to DRY it up (and there is a high probability you won't get it perfectly right either).

So remember, while beginners Copy and Paste and intermediates DRY everything, an experienced programmer knows there may be a time for both.

Monday, June 04, 2012

We know JavaScript is weird... enough already!

So it looks like JavaScript is close to being considered a "real" language nowadays. There are frameworks which allow you to do MVC (like Backbone.js), you can use it on the server (with Node.js) and you can even use it to interact with datastores (via MongoDB).

So why is it that almost every job posting you see for a JavaScript gig, and every interview you go to that has a JavaScript component, asks you to interpret or debug (without using a browser) some esoteric fault of JavaScript that you probably wouldn't run into in a hundred years because you actually write decent JS?

Like
  var cities = ["NY", "SF"];
  cities.length = 1;
  console.log(cities); // outputs ["NY"]

or 


  var a = 1 + 1 + "1"; // equals "21"
  var b = "1" + 1 + 1; // equals "111"

It's as if they are trying to say to you "Look at that piece of crap language you are programming in! You must be an idiot!" while at the same time offering you a job in said language whilst wanting to build up their systems in it.

Also, there are so many of these quirks in JS (and web development in general) that just because you may not have seen one, it does not mean that you are a bad programmer and you don't know JavaScript.

Instead, ask candidates to do FizzBuzz at a console with a text editor and a browser (whilst looking over their shoulder to see that they are not cheating) and really look at how well their code is written. Ask them to write a recursive function. Ask them to create an instance of an object and add some functionality to it via prototype.

In short, ask them to do something they do in real life and what you probably want them to do for you.

Don't remind them how shitty JS is. Believe me, they know this better than you do.

Thursday, May 31, 2012

HTML Canvas Libraries

So I was remarking to a coworker today that the HTML Canvas API is very low level and hard to use, and his reaction to that was actually positive (and he had a point). By being very low level, it means that pretty much everything is exposed and going forward you won't have to wait for browser vendors to update their libraries in order to get the latest features. Basically, the browser vendors are removing themselves from the equation.

However, application developers are left with a bit of a dilemma. Do we really want to reinvent the wheel every time we build an app? Why is it that you have to redraw everything every frame? Wouldn't it be easier to work with objects rather than pushing pixels?

Well, while the browser vendors might have removed themselves from the library equation, fortunately a number of other people are stepping in. There are now a myriad of third-party libraries out there for manipulating the HTML5 Canvas, some more sophisticated than others.

I haven't had time to delve into any of them for real yet, but from what I have seen so far, these seem to be the most advanced (as far as being Adobe Flash replacements).

CreateJS - http://www.createjs.com/#!/CreateJS
This looks to be the most complete library out of all the ones I have seen, with support not only for Canvas but also tweening, sounds, and preloading. It also supports a stage object. EaselJS is the part that manipulates the canvas.

oCanvas - http://ocanvas.org/
This also looks good, and the code samples I have seen look very similar to ActionScript. The "o" in oCanvas stands for Object.

Paper.js - http://paperjs.org/
This one also looks interesting. I need to look at it more. They also created a superset of JavaScript called PaperScript for handling objects more easily. Not sure if this is a good or bad thing.

There are a myriad more as well. The only problem with having so many libraries out there is that you need to make sure you choose the right one when you start your project. I am not sure how inter-operable they all are and what would happen to you if you chose the wrong library.


Saturday, May 12, 2012

Working on a side project

There are many pluses to working on a side project. You get to work with the latest technology. You get to decide which features go in and which don't. You can show it off to potential employers. The list is endless. However, there are some tips to remember as well. We will cover these here.

Time management

Unless you are unemployed, your time is now a precious resource, which means that you are now a resource to your own project. Try to organize blocks of time during the week when you can work (e.g. a few hours on Saturday or an hour on Thursday night) and stick to them. Try to break up your work into small chunks and aim to have a feature done in each block. This will help motivate you.

Feature Management

Related to time management, it's important to maintain a list of what you want to do and be able to check items off this list. Ask friends for feature ideas and add them to the list. Save anything that's a large feature for the weekend and try and do the smaller features during the week.

Also, while developing your site/app, try to maintain a list of every little thing that is bugging you. If it bugs you, it is sure to bug the user.

You also have to be judicious about which features really add value, and the time when you are not actually working on the site is a good time to figure that out. Remember, your time is not limitless...

Site/App Design

I am not a designer, but I know my way around CSS pretty well. I found Twitter Bootstrap to be a good framework to help me get started. There are tons of resources online as well. You probably won't get the design done in one go so it's important to remember that you will iterate on it repeatedly.

Friends

Your friends are your best resource. Invite them early and get feedback ASAP. They will give you ideas for improvements and features and also help you determine which features to implement first (if everyone asks for the same feature, it must be important). At the end of the day though, you are in charge.

Tests

It is tempting at first to leave out automated unit tests (because, hey, you are having fun, right?) and that is fine for a little while, but don't let it drag on too long. Remember that you may be showing this side project to a potential employer, and you wouldn't hire someone who doesn't write tests, would you?

That's all I can think of for now, but I will probably add to this as I continue working on my project. Happy coding!