Thursday, June 21, 2012

Screen scraping with Nokogiri

So a few months ago I put together a side project at http://www.myrecipesavour.com/. Basically, the site lets you put in the URL of a recipe page and then parses the recipe into your collection.

So it turns out, reading data from another site is very easy with Nokogiri.

The source code is available here https://github.com/abreckner/MyRecipeSavour

There is a lot I am going to cover in the next few posts based on this code base (like Devise and Heroku), but for now we are focussed on this file https://github.com/abreckner/MyRecipeSavour/blob/master/app/models/site.rb

So we are going to look at the add_recipe method.

First we need to require a few packages:

require 'open-uri'
require 'rubygems'
require 'nokogiri'

Unfortunately, I haven't yet figured out a heuristic for separating a recipe web page into its components (title, ingredients, instructions, amounts, etc.), so as a workaround I maintain a catalogue of CSS selectors which define these elements per domain. When I read the page, I use Nokogiri to pull out those elements using the CSS selectors.

i.e.
html = Nokogiri::HTML(URI.open(url).read) # fetch and parse the page (Ruby 3+ needs URI.open here, not plain open)
title = html.css(site.title_selector).text.strip # read the title

I then populate a recipe object with these pieces


recipe = Recipe.new
recipe.name = title
...
recipe.save


My code around the ingredients and instructions is a little more complex as my Recipe model has many Ingredients and Instructions (eventually I am going to allow users to manipulate them individually). Each ingredient/instruction is parsed based on a line break, so I need to pull in the ingredient array from Nokogiri and then merge it into a string separated by line breaks.

ingredients = html.css(site.ingredient_selector).children.inject('') { |sum, n| sum + n.text + "\n" }
...
Ingredient.multi_save(ingredients, recipe)

The reason I convert it to a string and then back into an array is so that the user can later edit the ingredients via a textarea. In fairness, I actually wrote the multi_save code for textarea input before I did the screen scrape, and I wanted to reuse it.
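To make the string round trip concrete, here is a minimal sketch in plain Ruby. The helper names join_lines and split_lines are hypothetical and are not the actual multi_save code from the repository; they only show the idea of joining scraped fragments into textarea-style text and splitting that text back into one entry per line.

```ruby
# Hypothetical helpers, not the real multi_save implementation.

# Join scraped text fragments into one newline-separated string,
# which is the shape a textarea would give us.
def join_lines(fragments)
  fragments.map(&:strip).reject(&:empty?).join("\n")
end

# Split textarea-style text back into individual entries,
# tolerating Windows line endings and blank lines.
def split_lines(text)
  text.split(/\r?\n/).map(&:strip).reject(&:empty?)
end

scraped = ["  2 cups flour ", "", "1 tsp salt\n"]
text = join_lines(scraped)   # same format whether scraped or typed by a user
items = split_lines(text)    # one entry per ingredient
```

Because both the scraper and the textarea feed the same text format, only one parsing path has to be maintained.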

The other interesting piece of this add_recipe method is that I store a new Site in case the user tries to add a recipe from an "uncatalogued" site. This automatically builds up a list of the sites people are interested in saving recipes from and allows me to catalogue them at a later date.


site_domain = URI.parse(url).host
site = Site.find_by_domain site_domain
if site.nil?
  site = Site.new
  site.domain = site_domain
  site.url = url
  site.user = current_user
  site.save!
  false
else
  ... # Nokogiri scraping code goes here
end
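The same lookup-or-remember idea can be tried outside Rails. The sketch below uses an in-memory hash in place of the Site model; the class name SelectorCatalogue, the example domains, and the selectors are all made up for illustration.

```ruby
require 'uri'

# Hypothetical in-memory stand-in for the Site table.
class SelectorCatalogue
  attr_reader :uncatalogued

  def initialize(selectors_by_domain)
    @selectors_by_domain = selectors_by_domain
    @uncatalogued = []
  end

  # Returns the selectors for the URL's domain, or nil if the domain is
  # unknown. Unknown domains are remembered so they can be catalogued later.
  def lookup(url)
    domain = URI.parse(url).host
    selectors = @selectors_by_domain[domain]
    @uncatalogued << domain if selectors.nil? && !@uncatalogued.include?(domain)
    selectors
  end
end

catalogue = SelectorCatalogue.new({
  'www.example.com' => { title: 'h1.recipe-title', ingredients: 'ul.ingredients li' }
})
known = catalogue.lookup('http://www.example.com/recipes/42')
unknown = catalogue.lookup('http://cooking.example.org/pie')
```

Here known holds the selectors to hand to Nokogiri, while unknown is nil and the new domain ends up in catalogue.uncatalogued.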


Monday, June 11, 2012

How DRY is too DRY?

One of the earliest development principles we learn as programmers is Don't Repeat Yourself, or DRY for short.

Copy and Paste are supposedly your worst enemies. Rather than rely on copy and paste, you create functions and subroutines and call them from your code so you don't have to reimplement it continuously. It also has the added advantage that if you need to make a change to that subroutine, you only need to make that change once. (Note: I realize that functions and subroutines are different entities, but for the purpose of this article they are interchangeable).

Sometimes you come across 2 pieces of functionality which are very similar, so instead of copying and pasting, you merely instantiate your function/subroutine with different variables to factor out/handle the differences, even where the core functionality is the same.

For example, you might have a Customer object and a Vendor object. Both customers and vendors have addresses and you need to send mail to them both from time to time so you might decide to create a format_address function which they both can use, rather than copying and pasting the format_address code from one to the other. You might even move Address out to its own object which has a Customer or Vendor parent and put the format_address function on that.

This becomes second nature after a while and is almost as rewarding as recycling. Hey, you are not wasting things by re-use right? An intermediate developer will try and re-use and keep things DRY as much as possible.

However, there are times when you may not want to actually keep things DRY. The primary example is integration tests. When you are running integration tests, you might need to instantiate large numbers and different types of objects to simulate the system running as a whole. This code might look quite similar between tests so the temptation is to move it out into a subroutine and reuse it.

This could be a mistake however. When hooking up test code in this manner, you are creating hidden dependencies in your code which can make things very difficult to change should the requirements change in only one area of your code. People might try and edit one object and find their tests failing for some unknown reason.

The other time that DRY can work against you is if you try and use it between objects that are really conceptually different and unrelated. The example that most readily comes to mind is in something like CSS. Just because 2 objects may look similar (e.g. they both have the same color, rounded corners, and font) does not mean they are related, and attempts to DRY up the CSS code too much mean that changing the design later can be difficult. The same goes for code that is more procedural than object oriented (e.g. a lot of front end interactive code where form elements interact with the user and each other depending on the user's actions). In those cases you must really use your judgement to decide whether or not to DRY it up (and there is a high probability you won't get it perfectly right either).

So remember, while beginners Copy and Paste and intermediates DRY everything, an experienced programmer knows there may be a time for both.

Monday, June 04, 2012

We know JavaScript is weird... enough already!

So it looks like JavaScript is close to being considered a "real" language nowadays. There are frameworks which allow you to do MVC (like Backbone.js), you can use it on the server (with Node.js) and you can even use it to interact with datastores (via MongoDB).

So why is it that almost every job posting you see for a JavaScript gig, and/or every interview you go to that has a JavaScript component, asks you to interpret/debug (without using a browser) some esoteric fault of JavaScript that you probably wouldn't run into in 100 years because you actually write decent JS?

Like
  var cities = ["NY", "SF"];
  cities.length = 1;
  console.log(cities); // outputs ["NY"]

or 


  var a = 1 + 1 + "1"; // equals "21"
  var b = "1" + 1 + 1; // equals "111"

It's as if they are trying to say to you "Look at that piece of crap language you are programming in! You must be an idiot!" while at the same time offering you a job in said language whilst wanting to build up their systems in it.

Also, there are so many of these quirks in JS (and web development in general) that just because you may not have seen one, it does not mean that you are a bad programmer and you don't know JavaScript.

Instead, ask them to do FizzBuzz at a console with a text editor and a browser (whilst looking over their shoulder to see that they are not cheating) and really look at how well their code is written. Ask them to generate a recursive function. Ask them to create an instance of an object and add some functionality to it via prototype.

In short, ask them to do something they do in real life and what you probably want them to do for you.

Don't remind them how shitty JS is, believe me, they know this more than you.

Thursday, May 31, 2012

HTML Canvas Libraries

So I was remarking to a coworker today that the HTML Canvas API is very low level and hard to use, and his reaction to that was actually positive (and he had a point). By being very low level, it means that pretty much everything is exposed and going forward you won't have to wait for browser vendors to update their libraries in order to get the latest features. Basically, the browser vendors are removing themselves from the equation.

However, application developers are left with a bit of a dilemma. Do we really want to reinvent the wheel every time we build an app? Why is it that you have to redraw everything every frame? Wouldn't it be easier to work with objects rather than pushing pixels?

Well, while the browser vendors might have removed themselves from the library equation, fortunately a number of other people are stepping in. It looks like there are a myriad of 3rd party libraries out there now for manipulating the HTML 5 Canvas, some of which are more sophisticated than others.

I haven't had time to delve into any of them for real yet, but from what I have seen so far, these seem to be the most advanced (as far as being Adobe Flash replacements).

CreateJS - http://www.createjs.com/#!/CreateJS
This looks to be the most complete library out of all the ones I have seen, with support not only for Canvas, but also tweening, sounds, and preloading. It also supports a stage object. EaselJS is the part that manipulates the canvas.

oCanvas - http://ocanvas.org/
This also looks good, and the code samples I have seen look very similar to ActionScript. The "o" in oCanvas stands for Object.

Paper.js - http://paperjs.org/
This one also looks interesting. I need to look at it more. They also created a superset of JavaScript called PaperScript for handling objects more easily. Not sure if this is a good or bad thing.

There are a myriad more as well. The only problem with having so many libraries out there is that you need to make sure you choose the right one when you start your project. I am not sure how inter-operable they all are and what would happen to you if you chose the wrong library.


Saturday, May 12, 2012

Working on a side project

There are many pluses when working on a side project. You get to work with the latest technology. You get to decide which features go in and which don't. You can show it off to potential employers. The list is endless. However, there are some tips to remember as well. We will cover these here.

Time management

Unless you are unemployed, your time is now a precious resource, which means that you are now a resource to your own project. Try and organize blocks of time during the week when you can work (e.g. a few hours on Saturday or an hour on Thursday night) and stick to them. Try to break up your work into small chunks and aim to have a feature done in that block. This will help motivate you.

Feature Management

Related to time management, it's important to maintain a list of what you want to do and be able to check items off this list. Ask friends for feature ideas and add them to the list. Save anything that's a large feature for the weekend and try and do the smaller features during the week.

Also, while developing your site/app, try and maintain a list of every little thing that is bugging you. If it bugs you, it is sure to bug the user.

You also have to be judicious when you are not actually working on the site in figuring out which features really add value. Remember, your time is not limitless...

Site/App Design

I am not a designer, but I know my way around CSS pretty well. I found Twitter Bootstrap to be a good framework to help me get started. There are tons of resources online as well. You probably won't get the design done in one go so it's important to remember that you will iterate on it repeatedly.

Friends

Your friends are your best resource. Invite them early and get feedback ASAP. They will give you ideas for improvements and features and also help you determine which features to implement first (if everyone asks for the same feature, it must be important). At the end of the day though, you are in charge.

Tests

It is tempting at first to leave out automated unit tests (because, hey, you are having fun right?) and that is fine for a little while, but don't let it drag on too long. Remember that you may be showing this side project to a potential employer, and you wouldn't hire someone that doesn't write tests would you?

That's all I can think of for now, but I will probably add to this as I continue working on my project. Happy coding!

Thursday, May 10, 2012

For whom do you write code?

Ok, this was going to be called "Who do you write code for?" but I was told never to end a sentence with a preposition.

Anyways, I was thinking the other day about who the audience is for the code I write and I came up with the following. Bear in mind I live by the principles of KISS (Keep It Simple, Stupid) and YAGNI (You Ain't Gonna Need It). I developed these attitudes after reading tons of other people's code (as well as my own code 6-12 months down the line).

So in descending order...

1. The User

Obviously you are writing code for someone to use (or a service to consume). I am not going to go into UX or HCI at this stage, just that as far as priority goes, this guy is the top. Make sure the user gets good and timely feedback for everything he does and that the steps he has to take make sense to him.


2. The Compiler/Interpreter

The code you write has to be compiled or interpreted by a computer before the user can use it. Don't worry too much about optimization at first (YAGNI, remember?) but don't write code that you know will perform badly off the bat. Try and stay away from nested, O(N²)-style loops which could blow up in complexity, follow good OO techniques, and write tests where possible.
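As a rough illustration of a loop that blows up, here is a sketch with made-up data and hypothetical method names. The first version scans the whole customer list for every order, so the work grows with orders times customers; the second builds a Set once and does a cheap lookup per order.

```ruby
require 'set'

# Hypothetical example data shapes: orders are hashes with a :customer_id.

# Scans customer_ids for every order (Array#include? is a linear search).
def orders_for_known_customers_slow(orders, customer_ids)
  orders.select { |order| customer_ids.include?(order[:customer_id]) }
end

# Builds a Set once, then each membership check is roughly constant time.
def orders_for_known_customers_fast(orders, customer_ids)
  known = Set.new(customer_ids)
  orders.select { |order| known.include?(order[:customer_id]) }
end

orders = [{ id: 1, customer_id: 10 }, { id: 2, customer_id: 99 }, { id: 3, customer_id: 11 }]
customer_ids = [10, 11, 12]

slow = orders_for_known_customers_slow(orders, customer_ids)
fast = orders_for_known_customers_fast(orders, customer_ids)
```

Both return the same orders; the second is barely more code, which is about the level of optimization worth doing up front.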


3. Your coworkers (or you, 6+ months down the line)

Whenever I come up with a "clever" solution, I ask myself, "Will someone else understand this code without me explaining it to them?". Of course the caveat is that that someone else is also a coder and not a plumber, but I normally think of the most junior developer in the team (in terms of experience, not necessarily age) and whether or not he will comprehend it. Sometimes that person is me (ok, most of the time). The other thing I learnt is that code you wrote 6 months ago, is no longer yours. Someone else who looks a lot like you wrote it, but for all practical purposes it was not you.


4. (and finally...) You

You are at the bottom of the list, but face it, you were really at the top all along. You write code because it's interesting, it's exciting, and it's fun. You like brain teasers and you like solving new and interesting problems and you never wanted to be an accountant (not that there is anything wrong with accountancy, you also get to work with numbers and you get paid more). So write code for yourself, but remember that you are also at the bottom of the list.

Tuesday, May 08, 2012

Speeding up RSpec

So today I have been looking into getting our enormous battery of tests to run faster. I have yet to find anything that works for Cucumber, but I did find an interesting way to speed up RSpec which is detailed here.

https://makandracards.com/makandra/950-speed-up-rspec-by-deferring-garbage-collection

Basically, it seems that by not collecting garbage too frequently, you can make your tests run much faster (at the expense of memory management of course). We observed a 30% reduction in the time it takes to run an RSpec test suite.
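For reference, the idea can be sketched in plain Ruby like this. It is a minimal sketch of the general technique, not a copy of the code from the linked card; the class name DeferredGC and the 10 second threshold are assumptions for illustration.

```ruby
# Hypothetical helper: turn automatic GC off and only collect
# when at least threshold_seconds have passed since the last run.
class DeferredGC
  attr_reader :runs

  def initialize(threshold_seconds = 10.0)
    @threshold = threshold_seconds
    @last_run = now
    @runs = 0
  end

  # Stop Ruby from collecting garbage on its own.
  def start
    GC.disable
  end

  # Run one manual collection if enough time has passed.
  # Returns true when a collection actually ran.
  def reconsider(current_time = now)
    return false if current_time - @last_run < @threshold
    GC.enable
    GC.start
    GC.disable
    @last_run = current_time
    @runs += 1
    true
  end

  # Hand control back to the normal garbage collector.
  def stop
    GC.enable
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# One way this could be wired into RSpec (assumed setup, e.g. in spec_helper.rb):
#   deferred_gc = DeferredGC.new
#   RSpec.configure do |config|
#     config.before(:all) { deferred_gc.start }
#     config.after(:each) { deferred_gc.reconsider }
#     config.after(:all)  { deferred_gc.stop }
#   end
```

The trade-off is the one described above: fewer collections means faster examples but a larger memory footprint between collections.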

I did try to implement this on Cucumber, however because we need to store much more in memory to set up and tear down our objects, it meant that I kept running out of memory when I wasn't using the default Garbage Collection and the tests took even longer (so, buyer beware). I suppose if you had a small set of features though you might see some benefit.

Tuesday, April 17, 2012

Cucumber - my perspective

I have been using Cucumber (http://cukes.info/) at my current gig for about a year now. My initial reaction was that I absolutely hated it. It didn't seem to make sense for a programmer to write out tests (features) in plain English and then write out a bunch of regular expressions to turn that plain English into runnable code. What a palaver!

The other problem is that the Cucumber tests were extremely fragile. Even making text and/or HTML changes would break things in lots of random places.

Anyways, as it turns out, I don't really hate Cucumber, I just hate the way it is implemented in my current gig. Here are some lessons I learnt on the way...

1) Features are not supposed to be written by programmers.
You can write features as a programmer, but you are not the intended audience. The reason why features are written in plain text is that they are supposed to be written by business owners. As a programmer though, you can use features to organize your thoughts in plain English.

For example

Given I am a shopper
When I add something to the cart
Then I should see it in the cart

2) Keep all implementation details out of the features.
The features are not for you (the programmer). They are for the business owners. You are just supposed to make them pass and use them as a guide.

3) Don't aim to re-use features
Testing is about the one place where DRY (Don't Repeat Yourself) does not apply. When you re-use the same tests/step definitions in multiple places, you are creating hidden dependencies between 2 sections of code which probably don't need to be there. A lot of developers also fall into this trap with CSS. As developers we are trained to look for similar behaviours and abstract them out into their own methods, but unless the business owner really decides the two features are linked, they should be kept separate. The one area where I don't agree with Cucumber is that all the step definitions are shared globally (I personally think each feature should have its own step definition file).

4) Features are not supposed to cover every possible edge case
Specs are used to make sure all your edge cases are covered. Features should only cover what is asked for by the business owners.

5) Do not tie your step definitions to HTML elements
This makes it impossible to do any redesign. It is a bit annoying that copy (or text) changes can break your tests, but at the end of the day, business owners are the ones who are in charge of copy and they should be aware that copy changes will come at a cost.
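To show how plain English steps end up as runnable code, here is a toy step matcher in plain Ruby that runs the shopper feature from point 1. It is not Cucumber's real API; the StepRegistry class and the hash used as the "world" are hypothetical. Note that the steps talk to a cart object rather than to HTML elements, in the spirit of point 5.

```ruby
# Toy stand-in for Cucumber's step definitions: each step is a
# regular expression paired with a block of Ruby to run.
class StepRegistry
  def initialize
    @steps = []
  end

  def define(pattern, &block)
    @steps << [pattern, block]
  end

  # Strip the Given/When/Then keyword, find the first matching
  # pattern, and call its block with any captured groups.
  def run(line, world)
    text = line.sub(/\A(Given|When|Then|And|But)\s+/, '')
    @steps.each do |pattern, block|
      match = pattern.match(text)
      return block.call(world, *match.captures) if match
    end
    raise ArgumentError, "undefined step: #{line}"
  end
end

steps = StepRegistry.new
steps.define(/\AI am a shopper\z/) { |world| world[:cart] = [] }
steps.define(/\AI add (.+) to the cart\z/) do |world, item|
  world[:cart] << item
  world[:last_added] = item
end
steps.define(/\AI should see it in the cart\z/) do |world|
  raise "#{world[:last_added]} is not in the cart" unless world[:cart].include?(world[:last_added])
end

feature = [
  'Given I am a shopper',
  'When I add something to the cart',
  'Then I should see it in the cart'
]
world = {}
feature.each { |line| steps.run(line, world) }
```

Since the feature text never mentions markup, a redesign only has to touch the step bodies, not the feature.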

So at the end of the day, keep your features concise, don't reuse steps and write them like you are not a programmer and then you should be happy with Cucumber.

Friday, April 13, 2012

JSON caching with Rails

So the other day, I needed to cache an action which was basically a proxy action to return JSON.

Basically, 3rd party company X has an XML API for its deals. Ideally we would use JavaScript to pull the feed, but unfortunately, the feed is using XML which means we need to use the server to pull this "deal" and then reformat it in JSON for our site to use.

To render the JSON we were using a render call


render :json => offer


It can get expensive to pull this deal over the wire every time, so we looked into using caches_action and set it to expire in 5 minutes (because that is how often the deal feed updates).


caches_action :index, :expires_in => 5.minutes


(Oh, also we need to turn caching on in dev to see this happening)


config.action_controller.perform_caching = true


While this worked fine for the first call, I noticed that in the subsequent calls, the content type was being set to 'text/html' instead of 'application/json'. This was causing the AJAX call to fail. I then noticed that the AJAX call was calling the action directly instead of using the .json suffix.

i.e.
/offers/1234

instead of

/offers/1234.json

So it appeared that because I was not using the suffix, the cache was forcing the Content-Type header to text/html.

In any case, after a quick update to the JavaScript call, the caching worked fine.