Thursday, December 22, 2011

Can We Have Too Many Tools?

Vim is a great editor. There are so many plugins that make it even better and increase my productivity. But can there be a saturation point where it's not worth finding the next plugin to shave a keystroke off of a command? I've been trying to find the sweet spot for the right amount of tools for the job.

Bigger Toolbox

Having more developers to talk to and work with, I've been exposed to different methods of development. Yan, in particular, has been a tremendous Vim resource. He's made it his focus to optimize working with this editor as much as possible, and it's quite impressive. I've adopted some of his ideas, no longer content to be "good enough" with Vim.

Janus has been a tremendous help. It exposed me to Command-T, which is now my favorite thing ever. I no longer use a buffer explorer because it's easier to find a file with Command-T than to search through the buffer list. I also rarely use NERDTree now, since I was using it mostly as a convenience to open project files. I'll still use it for looking through directories, but it's not open by default anymore.

From Yan, I've installed git grep and mapped K to search for the word under the cursor, and that's been a faster and more pleasant experience than using vimgrep. Yan is also cleaning up a plugin for RSpec integration that provides some nice wins.

Too Big?

But how far do I take this path? I could continue to add plugins and map commonly used keystrokes to further increase efficiency, but when do I start to get diminishing returns? For example, I could map a letter to :GitGrep and save six keystrokes, but I haven't found the need to do that yet. Usually when I'm searching for something, I'm thinking about what I want to search for while typing the command, so I'm effectively multitasking and not wasting time with all that extra typing. Sometimes slowing down can be a good thing to allow that planning. Plus each new mapping or plugin is something new to learn, and it can occasionally become overwhelming with all the new options.

Just Right, For Now

I'll definitely continue to improve my Vim Fu, but I'm not in a hurry to continually add to my repertoire until I've mastered what I currently have available. I've come this far on a basic setup, and Vim has been around long enough that there have been many others who have gotten by with less, so I'm not going to stress out about not optimizing every single key I type.

Sunday, December 4, 2011

Pair Programming Feedback

It's been a few weeks of pair programming with the new guys, and it's been so enjoyable and valuable. Here's what I've experienced.

Improved Toolset

Since there are multiple people working on the same computers, we need a shared toolset. Some of us like RubyMine, while others prefer the awesomeness of vim. Since everyone prefers one of these two, we can set them both up on each machine and switch editors as needed. Yes, it would be optimal for pairing to use the exact same environment and setup like Pivotal Labs, but you can pry vim out of my cold, dead hands (hard drive?).

It's easy to keep one's toolset static after getting used to a way of doing things, so working with others who have different processes lets us all question our own and change the parts that are lacking. I've finally come around to Janus, a suite of vim customizations and plugins, and it's been great. Command-T is almost worth it by itself. Yan also introduced me to git grep, which has been much faster than vimgrep.

Improved Code

The value of the discussion aspect of pairing becomes immediately obvious when there's a disagreement. Trying to argue one's position on a subject exposes how much one really understands about that subject. If I can't explain to my pair why we should use technique X, do I really know enough to say for sure that we should? It's a great didactic tool for all parts of the development process - pretty much anything we write is subject to debate, and the discussions are invaluable for coming up with the best solution.

More Fun

Having more people in the office has made the place much more lively and interesting. Since we're such a small startup, any new person is going to have a major impact on the physical space of the office and the dynamic of the group (going from four to five is kind of a big deal). Our new interview process seems to have worked because it's been a pleasure working with each new team member, and everyone gets along with everyone else.

Exhaustion

Pair programming is fun, but it's tiring to work so intimately and intensely with someone all day. Productivity has gone up a lot because of the focus, breaks have become more distinct (not just reading an article on the computer but getting up and moving away from the damn thing), and I'm definitely ready to stop coding by the end of the day. It's a good exhaustion, a feeling that much was accomplished, and it comes with a certain buzz that's different from working by myself all day. It's a new stress, but I'll adapt to it and get even better.

We're in a good place, and we're ready to continue writing great software.

Friday, November 18, 2011

Welcome the New Guys!

We recently hired a trio of new developers at work, and it's been such an enjoyable experience to have them around. Tim, Clay, and Yan (who starts soon) are all solid developers and great guys as well.

I've paired with Clay and Tim full-time, and I'm completely sold on the practice. The mind-share we create through immediate feedback and conversation is invaluable. Management likes it because it means the new hires are productive from the outset, as the training happens in real-time. We're learning from each other in numerous ways. For example, the default MacVim/git interaction doesn't work, so I had to use the command line for commit messages. Tim kindly informed me that I can set my shell's default editor to MacVim using:
export EDITOR="mvim -f"
Now I can write my commit messages in MacVim. The -f flag keeps MacVim from forking at startup, so git waits until the editor closes before proceeding. A nice thing to have.
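
If you'd rather scope this to git alone, the equivalent git config setting (same effect) is:
git config --global core.editor "mvim -f"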

Here's to the new guys!

Friday, November 4, 2011

Rails 2.3.5 Bug With accepts_nested_attributes_for

The Setup

At work, our application has a series of questions it asks users, and an administrator can set a target of what he thinks (or wants) the answer to be.
class Question < ActiveRecord::Base
  has_many :targets
  accepts_nested_attributes_for :targets, :allow_destroy => true
end

class Target < ActiveRecord::Base
  belongs_to :question
end

The administrator fills out the form, choosing a target or not, and Rails will save, update, or destroy it as normal. Well, almost. Apparently there is a bug in Rails 2.3.5 that saves the child model twice, and we're experiencing it now.

What I Want To Do

Upgrade to Rails 3.1, get all the new hotness and bug fixes, and not worry about this problem.

What I'm Going To Do

Since we don't have the resources to upgrade Rails (yet...), we need a workaround. Here's what we came up with.

after_save :ensure_one_target_while_in_draft

# Works around the Rails 2.3.5 double save: if the duplicate save created
# extra targets while in draft, keep the first and destroy the rest.
def ensure_one_target_while_in_draft
  if draft? && targets.count > 1
    targets[1..-1].each { |t| t.destroy }
  end
end

This problem only comes up when a question has not been activated and is still in draft mode. Once it's activated, the administrator cannot change the target, only add additional ones. So while in the draft state, if there are multiple targets, we destroy all but the first. This needs to be in an after_save callback because the double save does not happen until after the question is saved. I don't want to patch Rails and have to remember to deal with it later when we finally do upgrade, and the situation is isolated enough that there isn't a performance hit and the workaround is easy to remove once it's no longer necessary. Elegant? No. But it works around a framework bug, it's well documented in the code, and it works.
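
A regression spec for the workaround might look something like this (a sketch; the factory name and draft flag are assumptions about our codebase):
describe Question do
  context "while in draft" do
    it "keeps at most one target" do
      question = Factory(:question, :draft => true)
      2.times { question.targets.create }
      question.save
      question.targets.count.should == 1
    end
  end
end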

Saturday, October 22, 2011

Rails Security Refactor: Protect Those Attributes!

Rails in the enterprise is still a fairly new concept, but the same web development principles apply in this new realm. One no-brainer is security, and one no-brainer part of security is protecting data from malicious user input.

The basic Rails ways to protect attributes on a model are attr_protected and attr_accessible. If neither is set, it's easy to imagine a situation where a user updates his attributes and the corresponding model has a boolean admin field on it. The user can trivially submit POST data that looks like this:
params: { "user" => { "email" => "foo@bar.com", "first_name" => "Joe", "last_name" => "Hacker", "admin" => "true" } }
Oops! Now Mr. Hacker is Mr. Admin!

So how do we get from here to (more) secure? Throwing attr_accessible on the model to whitelist fields is safest, but it can cause a lot of unknown breakage if there aren't tests around the fields - and there probably aren't, because why would someone test fields for accessibility if they are automatically accessible? An interim step is to create a blacklist using attr_protected to protect only specified fields, get tests around those protected fields, and then upgrade to attr_accessible.

For our user model, let's protect that admin field:
class User < ActiveRecord::Base
  attr_protected :admin
end
And the tests:
describe User do
  describe "with protected fields" do
    context "including admin" do
      let(:user) { Factory.build(:user, :admin => false) }

      it "cannot mass-update" do
        user.update_attributes({ :admin => true })
        user.admin.should be_false
      end
    end
  end  
end
We instantiate a user object with admin set to false, try to update that field, and ensure that it did not get updated. If we used user.update_attribute(:admin, true), the test would fail because that method skips all the ActiveRecord protection, so we use user.update_attributes(). Doing this for all the fields we want to protect will eventually get us to the point where we can swap out attr_protected for the easier-to-deal-with attr_accessible. Since we need to be explicit with attr_protected, it can quickly become difficult to maintain: we have to remember to add each new field we want to protect to the list and test it.
class User < ActiveRecord::Base
  # attr_protected :id, :admin, :awesomeness_rating, :money
  attr_accessible :email, :first_name, :last_name
end
That's better. Tests and iterative development to the rescue!

Friday, October 7, 2011

Our Interview Process

Our Hiring Process

At Crowdcast, we're currently hiring developers, and through much practice we have come up with an interview process that maximizes our chances of finding a candidate who will be a good fit.

There are a few steps that aren't always in the same order, and we may even skip some depending on the candidate.

The Phone Screen

I'm not involved in this first step. The engineering manager will talk to a candidate initially to find out if there are any immediate red flags and to clarify his experience. If nothing strange happens (believe me, there have been strange happenings), we'll schedule the candidate to come in or give him a preliminary coding test.

The Rails Test

We send the candidate a small Rails coding project that should only take a few hours. We'd like him to create a Rails application with some basic functionality and minimal markup and styling (design is a bonus but not a requirement), put the project on his Github account, and send us a URL to the code. This is a FizzBuzz-style question for the Rails framework, and it also implies that he has a Github account. You do have one, right? If the candidate can already show us Rails code, we may skip this step and just bring him in to the office.

The Onsite Interview

We then bring the candidate in for a few interview rounds. I'll ask a candidate about previous experience and what he enjoys working on to establish a little history and familiarity and to decode what's written on his resume. I'll next go through a technical screening about Ruby, some basic questions and some slightly-less-basic questions, and I'll let the candidate explain as much or as little as he'd like. A good one will definitely have a lot to say for some of the answers, and a short reply is usually an indication of minimal experience with the subject matter.

Next, we'll get into some Rails-specific questions, both methods and techniques. It's certainly not necessary to know everything about the framework, especially given its development speed, but there are some core ideas that get used often that he should be familiar with. If the candidate can teach me something, that makes me very happy.

The Pair Programming Exercise Part I

A new wrinkle we've introduced. Since I'll be pair programming with this person, I'd like to know how he thinks and how we'll work together. We'll sit at one computer and, with the candidate driving, work through a small Ruby problem. We'll do TDD (we use RSpec, but that's not a requirement). The problem should take 20-40 minutes and leaves a lot of room to delve into interesting design issues and refactorings. This is the fun part.

The Pair Programming Exercise Part II

The last step is to invite the candidate back to pair program on real code for a few hours. We need to make sure we're a fit for each other, and this gives him an opportunity to engage our code base and be part of the team. This is not just for me but for everyone in the office, and of course the candidate, to assess the interactions.

Post Mortem

Does any interview process find the perfect candidate? Not that I've discovered. We can't go as in-depth as Hashrocket, but we can get partially there with the multiple pairing exercises and full team interaction. It's so incredibly important to get along with coworkers since we spend as much time at work as sleeping (actually, it's probably more time at work), so just having the technical chops is not enough. We hope that our process gives us the knowledge to make the right decisions for everyone involved.

Thursday, September 15, 2011

Testing controller JSON responses in Rspec

I was recently rewriting some controller specs because they were way too heavy: all the models were saved to the database and there was no mocking. While trying to test the JSON response of an action, I got the following exception:
ActiveSupport::JSON::Encoding::CircularReferenceError
Here is the relevant controller code:
format.json do
  render :json => as_json(@questions)
end
Here is the test's mock:
question = mock_model(Question)
A little cryptic, right? After a little digging, I changed the mock to this:
question = mock_model(Question, :as_json => {'foo' => 'bar'})
Ah, there we go!

The spec ended up looking like this:
context "as json" do
  it "lists the questions" do
    question = mock_model(Question, :as_json => {'foo' => 'bar'})
    Question.should_receive(:find_ordered_subjects).and_return([question])
    get :index, :format => 'json'
    response.body.should == "[{\"foo\":\"bar\"}]"
  end
end
In our application_controller.rb, an as_json() method called to_json, which would in turn call as_json in the test, resulting in a circular reference. Oops.
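
In other words, something shaped like this (a sketch based on the description, not our exact code):
# app/controllers/application_controller.rb
# to_json calls as_json on each object, so an unstubbed mock bounces
# right back here and the encoder detects a circular reference.
def as_json(objects)
  objects.to_json
end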

Don't forget to stub as_json()!

Saturday, August 27, 2011

Emailing with Delayed Job

We use Delayed Job to queue emails sent out to users, both to offload that blocking process and for scheduling. It had worked well, but recently some strange bugs popped up. Some emails were stuck in the queue, and the error messages were about bad YAML syntax.

Delayed Job serializes objects into its handler field, and some user input that wasn't escaped properly created invalid YAML. For example, this could happen:
id: 1
foo: 'here is some 'text'
bar: 'something else'
Notice the odd number of single quotes in foo? Yeah, that's bad. Since we already had that kind of data saved, we needed another way to fix this.

Instead of having methods in the notifier.rb file like so:
def forgot_password(user)
  ...
end
We did it like this:
def forgot_password(user_id)
  user = User.find(user_id)
  ...
end
Delayed Job now serializes just the user id and not the entire user object, so any potentially harmful data isn't saved. This is a bit more expensive since the object has to be instantiated again, but sending out email isn't a hot path for our app, so this solution worked well.
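
The enqueueing side barely changes; only the argument does. A sketch, assuming delayed_job's send_later API from that era and a Rails 2-style deliver_ method:
# Before: the whole user object is YAML-serialized into the handler column.
Notifier.send_later(:deliver_forgot_password, user)

# After: only the integer id is serialized; the job re-fetches the record.
Notifier.send_later(:deliver_forgot_password, user.id)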

If you ever get strange YAML syntax errors from Delayed Job, perhaps this method will work for you.

Friday, August 12, 2011

Attaching Events to a Disabled Submit Button

There was a form that had a few required fields, and I wanted to show a message when the user hovered over the submit button while not every field was completed. The problem was that the submit button was disabled until the fields were filled in, and I couldn't attach an event to a disabled form element.

One solution is to add an invisible element over the button.
var $disabledSubmit = $('#submit_wrapper input:disabled');
var $disabledSubmitParent = $('#submit_wrapper');
var $overlay = $('<div />');
$overlay.css({
  position: 'absolute',
  top: $disabledSubmit.position().top,
  left: $disabledSubmit.position().left,
  width: $disabledSubmit.outerWidth(),
  height: $disabledSubmit.outerHeight(),
  zIndex: 10,
  opacity: 0
});
$overlay.mouseover(this.submitHoverOver);
$overlay.mouseout(this.submitHoverOut);
$disabledSubmitParent.append($overlay);
This created an overlay over the button that handled the hiding and showing (submitHoverOver()/submitHoverOut()) of the message.

When the form was ready to submit and the button was enabled, we needed to do two things. The first was to lower the z-index of the overlay so the user can access the button.
$overlay.css('z-index', -1);
The second was to unbind the events on the overlay so we didn't continue to show the message.
$overlay.unbind();
If the user changed the data to be in a bad state, we disabled the submit button. We also needed to reattach the events and crank up the z-index of the overlay.
$overlay.css('z-index', 10);
$overlay.mouseover(this.submitHoverOver);
$overlay.mouseout(this.submitHoverOut);
Now, instead of putting up an error message or more text about required fields, the user would be directed to finish the form if he hadn't done so when he tried to submit.

Saturday, August 6, 2011

My name is Danny...and I make mistakes.

Yes, it's true. Here's what I did, and here's the reaction.



At work a few weeks ago, I was going through some callback code that sets metadata on a model's associations. I noticed that a related flag wasn't getting set there, even though it was set in one other specific place along with the metadata. Hmm...let's fix that, shall we? Flag added, moving on.

Star wipe to this week.

We found a bug when displaying historical data, and I quickly realized that the display was wrong because it was skipping over objects it shouldn't, objects that were flagged when they shouldn't be. Cue pants pooping.

There was a fix: run a script to update the flag on all the associated models affected after the callback, ignoring the ones that explicitly get the flag set at the other, correct time. But should I tell anyone, or do I just run the script and say I fixed the display bug? Well, WWJD (what would Jack Nicholson do)? He'd tell everyone, damn it, because he's like that. Keeping it real. Not like Chuck Norris.



Anyway, I sent out an email admitting the mistake, including a high-level explanation of what happened along with a technical one. I explained that there was a fix, that we wouldn't lose any data, and that it would fix the current display bug, but that we needed to run some tests to make sure it didn't affect any other parts of the application.

My manager's response? "Hey, man, shit happens. Glad you fixed it." That's why I like working here.

Wednesday, July 13, 2011

Adding Field Separation for List Data

A Big List


MetalDetectr is effectively a list of data as specific as a user wishes to see, anywhere from only the releases by bands in a user's last.fm library to everything on metal-archives.com. A big concern is presenting it properly, and one method is to delineate releases by whatever sort the user chooses. This can be by release date, by the band's name, by the release's name, or by the release's format (e.g., EP, full-length, DVD).

The Algorithm


  • Start with a table row showing the earliest or most recent, depending on sort order, of the selected sort column.
  • Loop through the releases.
  • If the current release's relevant field is greater/less than the preceding one, show another table row with the current release's field value.
  • Show the release.

For example, the default sort is by US release date, starting at the earliest date (and at the current month, so there's less noise). The list will display the current month and every album released during that month. When a release falls in the next month, it shows that month and then all releases from it, continuing through the rest of the releases. If the user wants to see the list in descending order, it shows the last month first and works its way back to the current month.

The Code


First find the first value and display it in a full column span table row:
# views/releases/index.html.haml
- comparison_value = @releases.first.chain_methods(Release::FIELDS_WITH_METHODS[Release.default_sort(params[:s])])
= separator_row(comparison_value)
These two lines use the following methods:
# models/release.rb
# Sets the sort order to what's passed or us_date.
def self.default_sort(sort)
  sort || 'us_date'
end

# models/release.rb
# Takes an array of symbols and calls them on the release instance if it
# responds to them.
# Example: release.chain_methods([:us_date, :month]) => release.us_date.month
def chain_methods(methods)
  methods.inject(nil) do |memo, acc|
    target = memo ? memo : self
    target.respond_to?(acc) ? target.send(acc) : memo
  end
end

# helpers/releases_helper.rb
# Creates a row with a full colspan for the value.
def separator_row(value)
  value = Date::MONTHNAMES[value] if value.is_a?(Fixnum)
  content_tag(:tr, :class => cycle('even', 'odd')) do
    content_tag(:td, value, :class => 'separator_row', :colspan => 7)
  end
end
FIELDS_WITH_METHODS is a constant that contains a mapping of field names and methods to call on them to display properly:
# models/release.rb
FIELDS_WITH_METHODS = {
  'band' => [:band, :first, :downcase],
  'name' => [:name, :first, :downcase],
  'us_date' => [:us_date, :month],
  'euro_date' => [:euro_date, :month],
  'format' => [:format],
  nil => [:us_date, :month]
}
Then we loop through each release, updating the comparison value when we get to the next one:
# views/releases/index.html.haml
- @releases.each do |release|
  - current_value = release.chain_methods(Release::FIELDS_WITH_METHODS[Release.default_sort(params[:s])])
  - if Release.values_compared?(current_value, comparison_value, params[:d])
    - comparison_value = current_value
    = separator_row(comparison_value)
  - else
    - comparison_value = current_value
  = render release
Compare the two values based on the sort order:
# models/release.rb
# Sets the comparison operator to be greater than if the direction is nil or ascending,
# or less than if the direction is descending.
def self.comparison_operator(direction)
  (direction.nil? || direction == 'asc') ? :> : :<
end

# models/release.rb
# True if both value and comparison exist and
# if the direction is ascending:
#   true if value > comparison, false otherwise
# if the direction is descending:
#   true if value < comparison, false otherwise
def self.values_compared?(value, comparison, direction)
  value &&
  comparison &&
  value.send(
    Release.comparison_operator(direction),
    comparison
  )
end
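
A quick illustration of the comparison helper (values made up):
Release.values_compared?(5, 3, nil)     # => true: ascending, and 5 > 3
Release.values_compared?(5, 3, 'desc')  # => false: descending wants 5 < 3
Release.values_compared?(nil, 3, 'asc') # => nil, which acts as false
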
We tried to abstract the comparisons and what's displayed so we can add new fields and only need to update the field-method mapping. There is always the possibility that a field is nil, since we don't always get all the possible data for every release, so chain_methods calls as many of the methods as it can on a release instance until it finishes or hits nil. We could have chained a bunch of try()s together, but that didn't look right.
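
To make that nil guard concrete:
release = Release.new(:us_date => nil)
release.chain_methods([:us_date, :month]) # => nil instead of a NoMethodError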

We also tried to get as much code out of the view as we could, and it can be improved, but it's okay for now.

Friday, July 8, 2011

Namespacing /lib Files and RSpec

I've been in an ongoing battle with RSpec to get it to properly load files in the /lib directory of a rails app. There's a class MetalArchivesFetcher wrapped in a MetalDetectr module as a namespace:

module MetalDetectr
  class MetalArchivesFetcher
    ...
  end
end

The spec file starts like this:

require 'spec_helper'
require 'metal_archives_fetcher'

describe MetalDetectr::MetalArchivesFetcher do
  ...
end

Without the require, I would receive this message:

load_missing_constant': Expected /Users/danny/code/metaldetectr/lib/metal_archives_fetcher.rb to define MetalArchivesFetcher (LoadError)

It felt a little off to need to require the file again because Rails already loads it with config.autoload_paths += Dir["#{config.root}/lib/**/"] set in config/application.rb. I could put the require in spec_helper.rb, but it still felt strange.

I decided to remove the module namespace. That lets me remove the require line and the MetalDetectr:: prefix from every MetalArchivesFetcher reference in the spec. Is this the right decision? It's definitely DRYer, but I do create tighter coupling. Jim Weirich's talk "The Building Blocks of Modularity" (which I can't find online) goes over the trade-offs of writing code that is more tightly or loosely coupled, and my takeaway is this: since the file is already coupled to the application and its models, why add an additional layer? It's more of a perceived loosening while adding a bit more complexity. And that's usually not a good thing.

Perhaps I'll add it back in later, but for now, I'm not going to need it.

Sunday, June 26, 2011

A Real Life Github Success Story

Github has been a real treasure for developers, and I've used it both at work and for personal projects. Until now, I haven't used it to its full effect, that is, contributing.

For MetalDetectr, I wanted to allow a user to filter the list to see releases from artists he had in his last.fm library. A quick search led me to this gem, only it wasn't as fully-featured as I needed.

So I forked it.

Github made this really easy to do. Soon I had the repository in my account, cloned it locally, checked out a new branch, and I was working.

The code was clean and certainly made it easy to get what I wanted. There was a /method_categories folder that contained the methods to make API calls to get or create information for artists, tracks, and users. I wanted to read in a user's library of artists, so I simply modeled this after the other files.

class Lastfm
  module MethodCategory
    class Library < Base
      regular_method :get_artists, [:user], [[:limit, nil], [:page, nil]] do |response|
        response.xml['artists']['artist']
      end
    end
  end
end
This creates a GET request for a last.fm user's library, with an optional limit on the number of fetched results and an optional page number to scan to. Along with the API key, these fields are outlined in the last.fm API docs.
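
Using the new method then looks something like this (constructor arguments per the gem's README; the username is made up):
lastfm = Lastfm.new(api_key, api_secret)
artists = lastfm.library.get_artists('some_user')
artists.each { |artist| puts artist['name'] }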

Testing worked similarly. A spec file contained the other method tests, so adding the following, plus a fixture of the xml response, was super easy.

  describe '#library' do
    it 'should return an instance of Lastfm::Library' do
      @lastfm.library.should be_an_instance_of(Lastfm::MethodCategory::Library)
    end

    describe '#get_artists' do
      it 'should get the artists\' info' do
        @lastfm.should_receive(:request).with('library.getArtists', {
          :user => 'test',
          :limit => nil,
          :page => nil
        }).and_return(make_response('library_get_artists'))
        artists = @lastfm.library.get_artists('test')
        artists[1]['name'].should eql('Dark Castle')
        artists.size.should == 2
      end
    end
  end
After adding these methods, I pushed the branch to my Github repository and sent a pull request to the original repository. Again, Github makes this trivially easy. Before it was accepted, I had this line in the MetalDetectr Gemfile:
gem 'lastfm', :git => 'git://github.com/dbolson/ruby-lastfm.git', :branch => 'library_get_artists'
With the pull request accepted and my code merged into the master branch, it looked like this:
gem 'lastfm'

That's all it took to contribute to open source software.

Wednesday, June 15, 2011

Metal Archives' JSON Results Parsing

A recent post mentioned getting Metal Archives' JSON data, and some further explanation of how is necessary. Through reading the markup and some trial and error, I found the URL that returns the data I needed. Here it is:

http://www.metal-archives.com/search/ajax-advanced/searching/albums \
/?&releaseYearFrom=2011&releaseMonthFrom=1&releaseYearTo=2011 \
&releaseMonthTo=12&_=1&sEcho=0&iColumns=4&sColumns=&iDisplayStart=1& \
iDisplayLength=100&sNames=%2C%2C%2C

This returns a result set that looks like this:

{
  "error": "",
  "iTotalRecords": 3637,
  "iTotalDisplayRecords": 3637,
  "sEcho": 0,
  "aaData": [
    [
      "<a href=\"http://www.metal-archives.com/bands/037/3540277845\" title=\"037 (ES)\">037</a>",
      "<a href=\"http://www.metal-archives.com/albums/037/Los_Fuertes_Sobreviven/307703\">Los Fuertes Sobreviven</a>",
      "Full-length",
      "May 24th, 2011 <!-- 2011-05-24 -->"
    ],
    [
      "<a href=\"http://www.metal-archives.com/bands/037/3540277845\" title=\"037 (ES)\">037</a>",
      "<a href=\"http://www.metal-archives.com/albums/037/Tantas_Vidas/306172\">Tantas Vidas</a>",
      "Single",
      "May 6th, 2011 <!-- 2011-05-06 -->"
    ]
  ]
}
You'll notice the iTotalRecords field, which conveniently provides the total number of releases available. You'll also notice the iDisplayStart parameter in the URL that lets us step through the results 100 at a time. By looping (iTotalRecords / 100 + 1) times, incrementing iDisplayStart by i * 100, we can get a result set for all the records very quickly.
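
Here's a minimal sketch of that loop, assuming the URL above with iDisplayStart pulled out as a parameter and nothing fancier than open-uri:
require 'json'
require 'open-uri'

SEARCH_URL = 'http://www.metal-archives.com/search/ajax-advanced/searching/albums' \
             '/?&releaseYearFrom=2011&releaseMonthFrom=1&releaseYearTo=2011' \
             '&releaseMonthTo=12&_=1&sEcho=0&iColumns=4&sColumns=' \
             '&iDisplayLength=100&sNames=%2C%2C%2C'

# Fetch one page of up to 100 results, offset by iDisplayStart.
def fetch_page(display_start)
  JSON.parse(open("#{SEARCH_URL}&iDisplayStart=#{display_start}").read)
end

pages = fetch_page(0)['iTotalRecords'] / 100 + 1

albums = []
pages.times do |i|
  albums.concat(fetch_page(i * 100)['aaData'])
end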

Now that we have the results, we just need a little regular expression magic to pull out all the information.

BAND_NAME_AND_COUNTRY_REGEXP = /(.+)\s{1}\(([a-zA-Z]{2})\)/
ALBUM_URL_AND_NAME_REGEXP = /"(.+)">(.+)<\/a>/
RELEASE_DATE_REGEXP = /<!--\s(.{10})\s-->/
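
For illustration, here's how those expressions pull the fields out of one aaData row (extracting the title attribute first is my assumption about the intermediate step):
row = [
  '<a href="http://www.metal-archives.com/bands/037/3540277845" title="037 (ES)">037</a>',
  '<a href="http://www.metal-archives.com/albums/037/Tantas_Vidas/306172">Tantas Vidas</a>',
  'Single',
  'May 6th, 2011 <!-- 2011-05-06 -->'
]

url, name = row[1].match(ALBUM_URL_AND_NAME_REGEXP).captures
date = row[3][RELEASE_DATE_REGEXP, 1]                               # => "2011-05-06"
title = row[0][/title="([^"]+)"/, 1]                                # => "037 (ES)"
band, country = title.match(BAND_NAME_AND_COUNTRY_REGEXP).captures  # => ["037", "ES"]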

There was a strange situation where an album didn't have a band page but displayed a message that the band didn't exist, so there is one last regular expression used to guard against a slightly alternative format for the data:

NO_BAND_REGEXP = /span.+<\/span/

The data are much easier to gather and requests no longer time out, so I was able to get rid of all the intermediate saving steps, such as saving the paginated links and recording the last release searched when the site timed out. I'll probably have to add some of that back to get the record label of each release, since you'll notice it's absent from the JSON but available on the release's page.

The code to save the albums now looks like this:

agent = ::MetalArchives::Agent.new
agent.paginated_albums.each_with_index do |album_page, index|
  album_page.each do |album|
    if album[0].match(::MetalArchives::Agent::NO_BAND_REGEXP).nil?
      Release.create(
        :name => agent.album_name(album),
        :band => agent.band_name(album),
        :format => agent.release_type(album),
        :url => agent.album_url(album),
        :country => agent.country(album),
        :us_date => agent.release_date(album)
      )
    end
    CompletedStep.find_or_create_by_step(CompletedStep::ReleasesCollected)
  end
end

Quick and simple.

Tuesday, May 31, 2011

Histograms in Ruby with Seer

Google has a robust set of charts in their visualization API, which they use for Analytics, so when I needed a solution to graph a histogram, I turned to a Ruby implementation of Google Charts. There are other options, but most have little activity. The Seer gem seems to provide enough power to get what I need, with a decent set of configuration options and a lot of simplicity.

To get started, add this bit of JavaScript to the head of the page:

= Seer::init_visualization

In the view (preferably a view helper), add this:

begin
  Seer::visualize(
    question.data_points,
    :as => :column_chart,
    :in_element => 'histogram',
    :series => {
      :series_label => 'label',
      :data_method => 'percentage'
    },
    :chart_options => {
      :max => 100,
      :min => 0,
      :height => 355,
      :width => 488,
      :is_3_d => false,
      :legend => 'none',
      :colors => "[{color:'#990000', darker:'#660000'}]",
      :title => 'Best Estimates',
      :title_x => 'Numbers',
      :title_y => 'Percent'
    })
rescue ArgumentError => e
  show_no_data_message_for_histogram
end

Here, we provide a Struct (question.data_points, which I'll get to later) to the column chart (as per Google's API) to render in a div with a histogram id. The series_label and data_method names are important, as they are the two methods the Struct responds to, providing the x- and y-axis data respectively. There are a fair number of options to play with. Notice that visualize will raise an exception if there is no data instead of just rendering a blank chart, so we need to catch that and display something else. I put up a message telling the user as much:

def show_no_data_message_for_histogram
  content_tag(:div,
    content_tag(:h1, 'There is not enough data to display the Crowd Belief chart'),
    :id => 'empty_histogram_text'
  )
end

The interesting part is packaging up the data into the Struct. Bear in mind that this code is not totally clean and refactored, but the test coverage will make it a lot easier to do so later.

class DataPointsContainer
  class DataPoint < Struct.new(:label, :percentage); end
end
That's all we need to get the Seer visualization working, so now we can provide DataPoint with some data points. As background, a Question has a numerical answer, and we want to get a set of data points where each point is a range of equal size containing the answers. We will show up to 11 points, depending on the size of the range (not the amount of data), so the graph doesn't look too bare or too cluttered.

One last fun wrinkle is dealing with outlier data. We decided we don't want to show those data points as separate pieces of data but to absorb them into the first and last points of the chart. To calculate the range without the outliers, we first calculate the mean and the standard deviation, and we return two lists: one with the outliers (those more than 1.96 standard deviations above or below the mean, roughly the 5th and 95th percentiles) and one with the answers we keep.
# 1.96 standard deviations marks the outlier cutoff described above.
STDDEV_FACTOR = 1.96

def outliers_and_kept_answers(all_answers)
  mean = mean(all_answers)
  stddev = stddev(all_answers, mean)

  # partition returns two arrays: [outliers, kept_answers]
  all_answers.partition do |i|
    i > high_threshold(all_answers, mean, stddev) ||
    i < low_threshold(all_answers, mean, stddev)
  end
end

def low_threshold(numbers, mean, stddev)
  mean - (stddev * STDDEV_FACTOR).round
end

def high_threshold(numbers, mean, stddev)
  mean + (stddev * STDDEV_FACTOR).round
end
Now we can find the smallest and largest points of the new range of data without worrying about the outliers messing up everything. We start with the first data point as the lowest in the range and add the width of each range to get the highest point:
low_point = answers.minimum
high_point = low_point + range_width(answers.minimum, answers.maximum, amount_of_points)

def range_width(min, max, points_size)
  width = ((max - min).to_f / points_size.to_f)
  width = 1.0 if width < 1.0 && @question.precision.to_i == 0
  width
end
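To make the width calculation concrete (numbers made up):
range_width(10, 120, 11) # => 10.0, so the buckets run [10, 20), [20, 30), ...
range_width(0, 4, 11)    # => 1.0 when the question's precision is 0, since sub-1 widths get clamped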
For each data point (up to 11), we create a DataPoint, set the low to the current high, and find the new high by adding the range width to the current low (which is the old high). If we're on the last data point, we use the last number in the range instead. When creating the DataPoint, we calculate the percentage by doing a SQL count of all answers within the range, grouped by value:
answers.count(
  :conditions => "value >= #{low} AND value < #{high}",
  :order => 'value ASC',
  :group => 'value'
)
Finally, we add the additional outliers to either the first or last DataPoint, if the outliers exist:
if index == 0
  additional = outliers.count { |x| x < low }
end
if index == total_points_count - 1
  additional = outliers.count { |x| x > high }
end
amount += additional
The only thing left to do is to give the Question model access to the data points in its class:
def data_points
  DataPointsContainer.new(self).data_points
end
A downside to this gem is that there is not a lot of discussion about it, but I found that's also true for the other solutions. Google does provide more options than the gem currently offers, but I haven't needed them yet, and I can fork and add the functionality later if necessary. The only frustration so far is on Google's side, because there are some options (removing the pop-up bubbles or editing their contents) that aren't available.

Monday, May 23, 2011

Scraping a Site When It Changes Its Design

MetalDetectr has hit a snag. The site I'm scraping for all the release information, metal-archives.com, recently changed their entire user interface, rendering the current functionality of my screen-scraping gem useless. What shall I do?

I started looking into their new UI, and they use the jQuery DataTables plugin to display the list of albums through ajax calls. A little Firebug and I can read the JSON that the plugin uses to populate the tables.

Now that I can access all the data, I will just need to rewrite the metal-archives gem to grab the JSON for every paginated list of a result set, and I'll have all the information again! Check back for updates to the application.

Saturday, May 14, 2011

Introducing MetalDetectr

It came from a blog post. In July 2010, Cosmo Lee, creator of the metal blog Invisible Oranges, requested "a simple list" of upcoming releases from metal-archives.com. I thought that this would be the perfect opportunity to:

1. Create something of value for a community I belong to
2. Help someone I genuinely appreciated for his hard work within the same community
3. Create a Rails 3 application and use some new technologies
4. Show off a little code on Github
5. Have fun!

Thus, MetalDetectr. This application will search metal-archives.com, the canonical site for all things metal, for upcoming releases and create a list with basic information for each release. The method is to programmatically search through the current year, returning a paginated list of results. By accessing each page of this list, we generate another list of possible releases. Going through that list and choosing only those releases that are in the future, we come up with the final list as requested. However, a few interesting problems arose through the development process.

The first is that metal-archives.com times out. A lot. Each step in the above process needs to be able to save its spot when the site becomes inaccessible. If we're searching through the paginated list of results and it fails, we need to note which page we're on and continue from there the next time. The same goes for the links to each album.

Another problem is that metal-archives.com has human input, so there are human errors. Not every album has a specified release date (some list only a month), some are in languages that aren't easy to parse, and some have titles or band names so unusual that it's almost impossible to automatically figure out what each field is supposed to be. There will always be a manual step to clean up some data, but the main functionality is there.

To get better data, we pass each album through Amazon's US site and a few European sites (.uk, .de, and .fr currently) to check if the release date is different. Since Amazon actually sells the albums instead of just listing them, their release dates take highest priority.

I hope this ends up providing some value to the metal community. I constantly read about the desire for a one-stop list of upcoming releases, and I would like MetalDetectr to be that list.

You can look at the code yourself on my github account.

Tuesday, May 10, 2011

Messing With Magic Encoding

There's some irony in how difficult Ruby can be with different character sets considering it was written by someone who doesn't speak English natively. I don't want to get political here, so I'm just going to mention some encoding strangeness.

Ruby 1.9 is much improved over Ruby 1.8, but I don't know that defaulting to US-ASCII encoding is a good choice. I suppose it is, if one wishes to be more explicit with some sort of default. This became an issue when I wrote a screen-scraping library that read a site with foreign (to me) characters. With the default encoding, the program would stop on those characters, returning early without reading the entire word. It took a long time to track down because the site I was scraping would time out a lot, so I assumed that was the issue. But when it would seemingly time out on the same pages each time, I had to look into the problem more.

After finding the problem, I needed a solution. This post gave me the direction to fix it, and I ended up with the following:

agent = Mechanize.new
page = agent.get('/url')
page.encoding = 'iso-8859-1'
page.search(...)

Setting the encoding to iso-8859-1 let me get the foreign characters I needed. In the tests, I put the following magic comment at the top of the file:

# coding: utf-8

Now I can test with an HTML fixture containing foreign characters. Crazy stuff, but it works.

Monday, May 2, 2011

I Am Become Genius

As a follow-up to my previous post, I would like to share an article that's made the rounds on the internet recently. "How to get to Genius," an excellent synthesis of key ideas for success, correlates with themes from Pragmatic Thinking and Learning.

The working definition for "genius" is "the extreme form of insight...in terms of perspective," or similarly, the intuition an expert has gained from many years of deliberate study (Malcolm Gladwell's 10,000-hour rule). A genius has internalized a system so thoroughly that he is effectively part of that system. Perspective is what separates the genius/expert from the novice, because the novice understands neither the rules of the system nor the broad implications of why the system exists. It's the "ability to notice these vague connections amongst all the noise, amongst all the internal chatter going on inside your head, [that] separates the insightful from the unaware, the unobservant."

Neuroscience gives us evidence of brain plasticity, and in the computer/brain metaphor from Pragmatic Thinking, each read is a write: the act of remembering changes the brain's memories, so constantly recalling information writes it more permanently to memory. This brings together the necessity of deliberate practice and of seeing patterns within a system. The long hours thinking about and practicing something ingrain what's learned deeply in the mind, creating greater understanding and recall of the rules. This in turn grants the ability to understand more of what we study, and thus we can gain a broader perspective. With these tools, it becomes possible to see something different than if we were to only focus our attention on small details. Thus, becoming a genius, while certainly not easy, is conceptually simple and, more importantly, possible.

Tuesday, April 26, 2011

Book Review: Pragmatic Thinking and Learning: Refactor Your Wetware

The Pragmatic Programmers have consistently put out good material for years, and whether it's their own or other authors', the quality is almost always high. I just finished Pragmatic Thinking and Learning, and the numerous other reviews I've read are all very positive. The community liked it, and that's usually a good sign (we'll see what happens when not everyone is happy).

Important Relationships


The first major concept introduced, and repeatedly referenced, is that the relationships between objects are more interesting than the objects themselves. Discrete "things" - whether facts, concepts, or people - exist not in a vacuum but with other similar and different "things." Emergent behaviors and ideas spring up from the interactions between everything in their specific system. These specific contexts generate yet more powerful versions of what is interacting in them.

A non-programming example is opening a locked door. The reason one wants to open the door matters in how one will attempt to open it. Whether there is a baby in a burning house with a locked front door or there are documents we want to steal from a guarded hotel room will influence the method used - an axe or lock-picking tools. The why strongly influences the how, and context matters.

Got Skills?


We then turn to the Dreyfus model of skill acquisition to see how people learn new topics. We start as a novice, who needs context-free rules (like recipes) because he doesn't know enough to form any big picture of the problem domain, and end as an expert, who understands the system he works in so well that he operates by intuition and pattern matching; along the way, the rules go from strict to practically nonexistent. The greater the expertise, the better we understand what is possible and what we should consider, giving us both faster access to solutions and a smaller, more easily understood problem.

Impr-you-vement


The method for improving skills is simple but not easy. Only through deliberate practice can we get to the highest levels of understanding, and this involves a lot of well-structured, hard work. We need a system for working on well-defined tasks that are appropriately challenging, continuous feedback from those tasks to keep us working on relevant ideas, and lots of repetition to strengthen the ideas and make them part of our long-term memory.

It also helps to find the relationships between the things we're learning, looking at the big picture to understand the overall meaning without getting bogged down in minute details (at least not at the beginning). Analogies to previous knowledge further create relationships and help establish the context of the new ideas. For example, when reading a non-fiction book, we should first scan the table of contents and chapter summaries to get an overview of what we're going to read and establish the main ideas to focus on. While reading, we should summarize the concepts and create metaphors for the material, and, when finished, expand our notes with a reread and discuss the ideas with colleagues. These tactics will cement the knowledge in the brain better than a cursory skim of the book.

Nothing is Perfect, However


One weakness of the book is its reliance on the left/right brain dichotomy, an idea that isn't fully backed by science. This criticism goes only as deep as one relies on the truth of that assertion; if we treat it as a metaphor, like the book's main metaphor of the brain as a computer (another incomplete metaphor), and remember that it's just a device pointing in the general direction of understanding rather than something literally accurate, we can get away with the comparisons. Regardless, I'm sure there is value in the takeaways, such as creating analogies between disparate topics, drawing out ideas, talking out loud, and using other mixed media.

So What?


Pragmatic Learning is an excellent book, faults and all, and I did take away some concrete plans.

  • I've created a wiki for knowledge dumps and connecting ideas
  • I'm researching mind mapping software (either this one or this one)
  • I'm writing down my ideas in a notebook, along with notes on books I'm reading, which I then transfer to an online medium

There are dozens of other specific tasks to do, and I plan to come back to the book and implement more as needed. It's certainly not necessary to try every single one, but it is reasonable to expect that, if the motivation is there (and who wouldn't want to learn how to learn better?), the results will follow.

Sunday, April 17, 2011

Testing content_tag in Rails 2.3.5 with RSpec

I'm working on a codebase that's still on Rails 2.3.5, and recently I added a group of radio buttons for users to estimate their expertise level when answering a question. I wanted to play with content_tag() more than I have, so here is the view helper:

module AnswersHelper
  # Creates the markup for displaying the expertise choices for an answer.
  def expertise_choices(answer)
    content_tag(:div, :id => 'choices') do
      content_tag(:span, :class => 'clarification') { 'Not at all' } +
      collect_expertise_choices(answer) +
      content_tag(:span, :class => 'clarification') { 'Very Much So' }
    end
  end

  private

  # Creates 5 radio buttons and selects the one with the value of the answer's
  # expertise value if it exists.
  def collect_expertise_choices(answer)
    (1..5).collect do |i|
      checked = (i == answer.expertise) ? { :checked => 'checked' } : {}
      radio_button('answer', 'expertise', i, checked)
    end.to_s
  end
end

Nothing difficult to get through, but some small notes of interest:

content_tag() calls can nest within other content_tag() calls, and you can concatenate the pieces of markup to get everything you need to display properly. Also, don't forget to call to_s() to get a string, not an array, of the radio buttons.

Here is the partial that calls the helper:

#expertise
  Are you an expert on this topic?
  %br
  = expertise_choices(answer)

Finally, here are the accompanying tests:

require 'spec_helper'
include AnswersHelper

describe AnswersHelper do
  describe "#expertise_choices" do
    it "should display five radio buttons" do
      answer = mock_model(Answer, :expertise => nil)
      results = expertise_choices(answer)
      (1..5).each do |i|
        results.should have_tag('input', :id => "answer_expertise_#{i}", :type => 'radio', :value => i)
      end
    end

    it "should have a #choices div" do
      answer = mock_model(Answer, :expertise => nil)
      results = expertise_choices(answer)
      results.should have_tag('div#choices')
    end

    it "should have two .clarification spans" do
      answer = mock_model(Answer, :expertise => nil)
      results = expertise_choices(answer)
      results.should have_tag('span.clarification', :minimum => 2)
    end

    context "when editing" do
      it "should check the existing choice" do
        answer = mock_model(Answer, :expertise => 4)
        results = expertise_choices(answer)
        results.should have_tag('input[checked="checked"]', :type => 'radio', :value => 4)
      end
    end
  end
end

Again, nothing difficult to understand, but you can see how cool and powerful have_tag() is. Unfortunately, when we upgrade to RSpec 2, we'll need to change these tests to use webrat's have_selector(). But for now, let's just enjoy the time we have together, okay?

Wednesday, April 6, 2011

Composite Pattern FTW

Background

A post by Paul Graham that I found recently resonated with what I've been doing at work. In his essay "Taste for Makers," PG posits that beauty is not wholly subjective and that good design is beautiful. Among others, good design:
  • is simple
  • solves the right problem
  • is suggestive
  • looks easy
  • uses symmetry
  • is redesign
  • can copy
  • is often quite strange
  • happens in chunks

I'd like to focus on a few of these descriptions, using an example from something I've recently done.

In his fantastic book, Design Patterns in Ruby, Russ Olsen describes one tenet of the GOF book: to "prefer composition over inheritance." Inheritance creates tighter coupling between classes, since the children of the base class need to know about the internals of the base, even though the coupling is very specific to the implementation and (should be) well understood. Composition, however, changes the relationship between objects. An object no longer is another type of object but has the functionality of another object (is-a vs. has-a). This relationship increases the encapsulation of the composite object by providing an interface to the composed object instead of exposing the underlying details of a base class.

Slices and Dices!

Now I know there is a tendency to think of design patterns as a silver bullet, but bear with me. The situation is fine when the inheritance tree is simple and the functionality basic. The complexity grows as the tree grows and as more functionality is required. Soon, you're not quite sure if a new class should inherit from Foo, which inherits from Bar, or if you should just inherit from Baz way up near the base. You'll have to dig into the classes to find out which one is closest to what you want and hope it makes the most sense to place the new class wherever you end up putting it. Favoring composition, however, gives us much more flexibility for creating new classes and giving them abilities.

An Example

There is a system that asks users different types of questions. One type asks when an event will happen (DateQuestion), one type asks the numerical results of an event (NumberQuestion), and one asks which event will happen given a set of choices (ChoiceQuestion). We have a base Question that each inherits from, and since dates can be represented as numbers, DateQuestion will inherit from NumberQuestion. These questions allow answers, comments, and access control lists, and they have a specific workflow (create, activate, suspend, close, etc.).

Later on, the system needs to support a few more types of questions: a numeric range (NumberRangeQuestion), a date range (DateRangeQuestion), a yes/no-only (YesNoQuestion)...you get the point. We need to figure out where these new types go in the inheritance tree - whether one is a child of a DateQuestion (itself a child of NumberQuestion), or if it's just a child of NumberQuestion, or maybe it's its own type and only inherits from the base Question type. We start to bump into complexity issues, that is, unnecessary complexity.

I'll Take a Little of This...

Let's approach this problem from a different angle. Given our original Question types, we can make them all inherit from a base Question class and then give them abilities as needed. So now our classes look like this:

class Question
  include Commentable
  include AccessListControllable
  include Workflowable
end

class NumberQuestion < Question
  include Numerical
end

class DateQuestion < Question
  include Numerical
  include Dateable
end

class ChoiceQuestion < Question
  include Choiceable
end

NumberQuestion and DateQuestion are numerical; that is, they have the functionality numerical objects need. DateQuestion is also dateable, so it has the additional properties a dateable object needs, while NumberQuestion, not needing them, doesn't have those abilities. So when we need additional Question types, we can choose which abilities they get. A DateRangeQuestion? It's dateable, numerical, and it has its own class-specific functionality as well.
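
The ability modules themselves are plain mixins. Here's a minimal sketch of what one might look like (the method and attribute names are hypothetical, just to show the shape):
# A hypothetical ability module: anything that includes Numerical
# gets numeric validation and a formatting helper.
module Numerical
  def self.included(base)
    base.validates_numericality_of :answer_value
  end

  def formatted_answer
    "%.1f" % answer_value
  end
end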

There are some trade-offs. Some modules may not have all the functionality an object needs, and there is the potential for similar code that provides slightly different abilities. There can also be unneeded functionality in a module that an object will never use. These problems aren't specific to composition, as they can occur with regular inheritance as well.

Some Clarity

We've refactored to better organize our code, making the application more maintainable and extensible, both good things, and the process was relatively painless. Since the functionality never changed, just the organization, passing tests give us confidence that our models still work how we want.