Tuesday, May 31, 2011

Histograms in Ruby with Seer

Google has a robust set of charts through their visualization API that they use for Analytics, so when I needed a solution to graph a histogram, I turned to a Ruby implementation of Google Charts. There are other options, but most have little activity. The Seer gem seems to provide enough power for what I need, with a decent set of configuration options and a lot of simplicity.

To get started, add this bit of JavaScript to the head of the page:

= Seer::init_visualization

In the view (preferably a view helper), add this:

begin
  Seer::visualize(
    question.data_points,
    :as => :column_chart,
    :in_element => 'histogram',
    :series => {
      :series_label => 'label',
      :data_method => 'percentage'
    },
    :chart_options => {
      :max => 100,
      :min => 0,
      :height => 355,
      :width => 488,
      :is_3_d => false,
      :legend => 'none',
      :colors => "[{color:'#990000', darker:'#660000'}]",
      :title => 'Best Estimates',
      :title_x => 'Numbers',
      :title_y => 'Percent'
    })
rescue ArgumentError => e
  show_no_data_message_for_histogram
end

Here, we provide a Struct (question.data_points, which I'll get to later) to the column chart (as per Google's API) to render in a div with a histogram id. The series_label and data_method names are important, as they are the two methods the Struct responds to, and they provide the x- and y-axis data respectively. There is a fair number of options to play with. Notice that visualize will raise an exception if there is no data instead of just rendering a blank chart, so we need to rescue that and display something else. I put up a message to tell the user as much:

def show_no_data_message_for_histogram
  # Wrap the heading in a div so the page can style the empty state.
  content_tag(:div,
    content_tag(:h1, 'There is not enough data to display the Crowd Belief chart'),
    :id => 'empty_histogram_text'
  )
end

The interesting part is packaging up the data into the Struct. Bear in mind that this code is not totally clean and refactored, but the test coverage will make it a lot easier to do so later.

class DataPointsContainer
  class DataPoint < Struct.new(:label, :percentage); end
end

That's all we need to get the Seer visualization working, so now we can provide DataPoint with some data points. As background, a Question has a numerical answer, and we want to get a set of data points where each point is a range of equal size containing the answers. We will show up to 11 points, depending on the size of the range (not the amount of data), so the graph doesn't look too bare or too cluttered. One last fun wrinkle is dealing with outlier data. We decided we don't want to show those data points as separate pieces of data but absorb them into the first and last points of the chart. To calculate the range without the outliers, we first calculate the mean and the standard deviation, and we return two lists: one with the outliers (those more than 1.96 standard deviations above or below the mean, which for normally distributed data is roughly the outermost 5%) and one with the answers we keep.

def outliers_and_kept_answers(all_answers)
  mean = mean(all_answers)
  stddev = stddev(all_answers, mean)

  all_answers.partition do |i|
    i > high_threshold(all_answers, mean, stddev) ||
    i < low_threshold(all_answers, mean, stddev)
  end
end

def low_threshold(numbers, mean, stddev)
  mean - (stddev * STDDEV_FACTOR).round
end

def high_threshold(numbers, mean, stddev)
  mean + (stddev * STDDEV_FACTOR).round
end
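
The mean and stddev helpers aren't shown above. Here is a minimal sketch of how they could look, assuming a population standard deviation and a STDDEV_FACTOR of 1.96 (the figure from the text); the bodies and the sample answers are my assumptions, not the application's actual code:

```ruby
# Assumed value, taken from the 1.96 mentioned in the text.
STDDEV_FACTOR = 1.96

# Arithmetic mean of a list of numbers.
def mean(numbers)
  numbers.inject(0.0) { |sum, n| sum + n } / numbers.size
end

# Population standard deviation (an assumption; the real helper
# may use the sample form, dividing by n - 1 instead).
def stddev(numbers, mean)
  variance = numbers.inject(0.0) { |sum, n| sum + (n - mean)**2 } / numbers.size
  Math.sqrt(variance)
end

# Made-up answers with one obvious outlier.
answers = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 500]
m = mean(answers)
s = stddev(answers, m)
high = m + (s * STDDEV_FACTOR).round
low = m - (s * STDDEV_FACTOR).round

# partition returns the matching items first, so outliers come first.
outliers, kept = answers.partition { |i| i > high || i < low }
```

With this sample, the lone 500 lands in outliers and the other ten answers are kept.
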

Now we can find the smallest and largest points of the new range of data without worrying about the outliers messing up everything. We start with the first data point as the lowest in the range and add the width of each range to get the highest point:

low_point = answers.minimum
high_point = low_point + range_width(answers.minimum, answers.maximum, amount_of_points)

def range_width(min, max, points_size)
  width = ((max - min).to_f / points_size.to_f)
  width = 1.0 if width < 1.0 && @question.precision.to_i == 0
  width
end

For each data point (up to 11), we create a DataPoint, set the new low to the current high, and find the new high by adding the range width to that new low. If we're on the last data point, we use the last number in the range instead. When creating the DataPoint, we calculate the percentage by doing a SQL count of all answers within the range, grouped by value:

answers.count(
  :conditions => ['value >= ? AND value < ?', low, high],
  :order => 'value ASC',
  :group => 'value'
)
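
To make the walk over the ranges concrete, here is a rough sketch in plain Ruby that works on an in-memory array instead of a SQL count. The bucket_ranges and percentages helpers are hypothetical names of mine, and letting the last range include its upper bound is an assumption about how the maximum answer gets counted:

```ruby
# Hypothetical helper: returns [low, high] pairs covering min..max.
def bucket_ranges(min, max, points)
  width = (max - min).to_f / points
  low = min.to_f
  (0...points).map do |index|
    # The last bucket ends at the top of the range rather than low + width,
    # so floating-point drift never leaves the maximum out.
    high = index == points - 1 ? max.to_f : low + width
    range = [low, high]
    low = high
    range
  end
end

# Percentage of answers in each half-open range [low, high); the last
# range also includes its upper bound so the maximum is counted.
def percentages(answers, ranges)
  ranges.each_with_index.map do |(low, high), index|
    last = index == ranges.size - 1
    count = answers.count { |v| v >= low && (last ? v <= high : v < high) }
    count * 100.0 / answers.size
  end
end

ranges = bucket_ranges(0, 10, 5)
shares = percentages([0, 1, 2, 3, 9, 10], ranges)
```

Each answer falls into exactly one range, so the percentages add up to 100.
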

Finally, we add the additional outliers to either the first or last DataPoint, if the outliers exist:

# Default to zero so the middle data points don't add nil.
additional = 0
if index == 0
  additional = outliers.count { |x| x < low }
end
if index == total_points_count - 1
  additional = outliers.count { |x| x > high }
end
amount += additional

The only thing left to do is to give the Question model access to the data points in its class:

def data_points
  DataPointsContainer.new(self).data_points
end

A downside to this gem is that there is not a lot of discussion about it, but I found that is also true for other solutions. Google does provide more options than the gem currently offers, but I haven't had a need for these yet, and I can fork and add the functionality later if needed. The only frustration so far is on Google's part, because there are some options (removing the pop-up bubbles or editing their contents) that aren't available.

Monday, May 23, 2011

Scraping a Site When It Changes Its Design

MetalDetectr has hit a snag. The site I'm scraping for all the release information, metal-archives.com, recently changed their entire user interface, rendering the current functionality of my screen-scraping gem useless. What shall I do?

I started looking into their new UI, and they use the jQuery DataTables plugin to display the list of albums through Ajax calls. With a little Firebug, I can read the JSON that the plugin uses to populate the tables.

Now that I can access all the data, I will just need to rewrite the metal-archives gem to grab the JSON for every paginated list of a result set, and I have all the information again! Check back for updates to the application.
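
As a rough idea of the parsing side, here is a sketch that assumes the site answers in the legacy DataTables server-side format (iTotalDisplayRecords and aaData in the response, iDisplayStart for paging). The sample payload, the column layout, and the helper names are all made up for illustration, and the actual HTTP fetching is left out:

```ruby
require 'json'

# Hypothetical sample of one page of results.
SAMPLE_PAGE = <<-PAYLOAD
{
  "sEcho": 1,
  "iTotalRecords": 5,
  "iTotalDisplayRecords": 5,
  "aaData": [
    ["Band A", "Album One", "Full-length", "June 2011"],
    ["Band B", "Album Two", "EP", "July 2011"]
  ]
}
PAYLOAD

# Rows on this page, each an array of column values.
def rows_from(json_text)
  JSON.parse(json_text).fetch('aaData')
end

# Offsets to send as iDisplayStart in order to walk every page.
def page_offsets(json_text, page_size)
  total = JSON.parse(json_text).fetch('iTotalDisplayRecords').to_i
  (0...total).step(page_size).to_a
end

rows = rows_from(SAMPLE_PAGE)
offsets = page_offsets(SAMPLE_PAGE, 2)
```

The first response tells us how many records there are in total, which is enough to work out every offset to request afterwards.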

Saturday, May 14, 2011

Introducing MetalDetectr

It came from a blog post. In July 2010, Cosmo Lee, creator of the metal blog Invisible Oranges, requested "a simple list" of upcoming releases from metal-archives.com. I thought that this would be the perfect opportunity to:

1. Create something of value for a community I belong to
2. Help someone I genuinely appreciated for his hard work within the same community
3. Create a Rails 3 application and use some new technologies
4. Show off a little code on GitHub
5. Have fun!

Thus, MetalDetectr. This application will search metal-archives.com, the canonical site for all things metal, for upcoming releases and create a list with basic information for each release. The method is to programmatically search through the current year, returning a paginated list of results. By accessing each page of this list, we generate another list of possible releases. Going through that list and choosing only those releases that are in the future, we come up with the final list as requested. However, a few interesting problems arose during the development process.

The first is that metal-archives.com times out. A lot. Each step in the above process needs to be able to save the spot it is on when the site is no longer accessible. If we're searching through the paginated list of results and it fails, we need to note which page we're on and continue from there the next time. The same with the links to each album.
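
Here is a minimal sketch of that save-your-spot idea. The scrape_pages helper and the Hash used as a checkpoint are hypothetical stand-ins (a real application would more likely persist the position in the database), and the time-out is simulated rather than coming from the site:

```ruby
require 'timeout'

# Walks pages from a saved position; `checkpoint` is any Hash-like store,
# and the block does the actual fetching.
def scrape_pages(total_pages, checkpoint)
  start = checkpoint.fetch(:next_page, 1)
  (start..total_pages).each do |page|
    yield page
    # Only advance the saved position once the page has been handled.
    checkpoint[:next_page] = page + 1
  end
  true
rescue Timeout::Error
  # Leave the checkpoint where it is so the next run retries this page.
  false
end

checkpoint = {}
seen = []
attempts = 0

# Simulated site that times out the first time page 3 is requested.
fetch = lambda do |page|
  attempts += 1
  raise Timeout::Error, 'site unavailable' if page == 3 && attempts == 3
  seen << page
end

first_run = scrape_pages(5, checkpoint, &fetch)
second_run = scrape_pages(5, checkpoint, &fetch)
```

The first run stops at page 3 and reports failure; the second run picks up at page 3 and finishes without fetching pages 1 and 2 again.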

Another problem is that metal-archives.com has human input, so there are human errors. Not every album has a specified release date, and some list only a month. Some albums are in different languages that aren't easy to parse, and some have crazy titles or band names, so it can be almost impossible to figure out automatically what each one is supposed to be. There will always need to be a manual step to clean up some data, but the main functionality is there.

To get better data, we pass each album through Amazon's US site and a few European sites (.uk, .de, and .fr currently) to check if the release date is different. Since Amazon actually provides the albums instead of just listing them, their release dates take highest priority.
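
The priority rule can be sketched like this. The source names and the ordering among the Amazon sites are my assumptions; the only thing taken from the description above is that any Amazon date beats the one from metal-archives.com:

```ruby
require 'date'

# Hypothetical source names, listed from most to least trusted.
SOURCE_PRIORITY = [:amazon_us, :amazon_uk, :amazon_de, :amazon_fr, :metal_archives]

# Returns the date from the most trusted source that has one, or nil.
def best_release_date(dates_by_source)
  source = SOURCE_PRIORITY.find { |name| dates_by_source[name] }
  source && dates_by_source[source]
end

dates = {
  :metal_archives => Date.new(2011, 6, 1),
  :amazon_de      => Date.new(2011, 6, 14)
}
chosen = best_release_date(dates)
```

Here the German Amazon date wins because no US or UK date was found, and the metal-archives.com date is only a fallback.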

I hope this ends up providing some value to the metal community. I constantly read about the desire for a one-stop list of upcoming releases, and I would like MetalDetectr to be that list.

You can look at the code yourself on my GitHub account.

Tuesday, May 10, 2011

Messing With Magic Encoding

There's some irony in how difficult Ruby can be with different character sets considering it was written by someone who doesn't speak English natively. I don't want to get political here, so I'm just going to mention some encoding strangeness.

Ruby 1.9 is much improved over Ruby 1.8, but I don't know that defaulting to US-ASCII encoding is a good choice. I suppose so, if one wishes to be more explicit with some sort of default. This became an issue when I wrote a screen-scraping library that read a site with foreign (to me) characters. With the default encoding, the program would stop on those characters, returning early without reading the entire word. It took a long time to track down because the site I was scraping would time out a lot, so I assumed that that was the issue. But when it seemingly timed out on the same pages each time, I had to look into the problem more.

After finding the problem, I needed a solution. This post gave me the direction to fix it, so I was able to check the following for a result:

agent = Mechanize.new
page = agent.get('/url')
page.encoding = 'iso-8859-1'
page.search(...)

Setting the encoding to iso-8859-1 let me get the foreign characters I needed. In the tests, I put the following magic comment at the top of the file:

# coding: utf-8

Now I can test with an HTML fixture containing foreign characters. Crazy stuff, but it works.
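
To show what is going on at the string level, here is a small sketch using only core Ruby (1.9 or later). The band name is just sample data; the bytes are written with an escape so the example doesn't depend on the file's own encoding:

```ruby
# Bytes as they might arrive from an ISO-8859-1 page: \xF6 is an
# o with an umlaut in that character set.
raw = "Mot\xF6rhead".force_encoding('ISO-8859-1')

# Transcode to UTF-8 so the string compares equal to UTF-8 literals.
utf8 = raw.encode('UTF-8')

# Same nine characters, but the umlaut now takes two bytes.
raw_bytes = raw.bytes.to_a.size
utf8_bytes = utf8.bytes.to_a.size
```

force_encoding only relabels the bytes, while encode actually converts them, which is why the byte count changes but the character count does not.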

Monday, May 2, 2011

I Am Become Genius

As a follow-up to my previous post, I would like to share an article that's made the rounds on the internet recently. "How to get to Genius," an excellent synthesis of key ideas for success, correlates with themes from Pragmatic Thinking and Learning.

The working definition for "genius" is "the extreme form of insight...in terms of perspective," or similarly, the intuition an expert has gained from many years of deliberate study (Malcolm Gladwell's 10,000 hour rule). A genius has internalized a system so much that he is effectively part of that system. A genius/expert's perspective separates him from the novice because the latter does not understand either the rules of the system or the broad implications of why the system exists. It's the "ability to notice these vague connections amongst all the noise, amongst all the internal chatter going on inside your head, [that] separates the insightful from the unaware, the unobservant."

Neuroscience offers evidence of brain plasticity, and the computer/brain metaphor from Pragmatic Thinking holds that each read is a write: the act of remembering changes the memories themselves, and constantly recalling information writes it more permanently to memory. Together, these explain why both much deliberate practice and seeing patterns within a system are necessary. The long hours thinking and practicing something will ingrain what's learned deeply in the mind, creating greater understanding and recall of the rules. This in turn grants the ability to understand more of what we study, and thus we can gain a broader perspective. With these tools, it becomes possible to see something different than if we were to only focus our attention on small details. Thus, becoming a genius, while certainly not easy, is conceptually simple and, more importantly, possible.