Tuesday, May 10, 2011

Messing With Magic Encoding

There's some irony in how difficult Ruby can be with different character sets considering it was written by someone who doesn't speak English natively. I don't want to get political here, so I'm just going to mention some encoding strangeness.

Ruby 1.9 is much improved over Ruby 1.8, but I don't know that defaulting to US-ASCII encoding is a good one. I suppose so, if one wishes to be more explicit with some sort of default. This became an issue when I wrote a screen-scraping library that read a site with foreign (to me) characters. With the default encoding, the program would stop on those characters, returning early without reading the entire word. It took a long time to track down because the site I was scrapping would time-out a lot, so I assumed that that was the issue. But when it would seemingly time-out on the same pages each time, I had to look into the problem more.

After finding the problem, I needed a solution. This post gave me the direction to fix it, so I was able to check the following for a result:

page = Mechanize.new('/url')
page.encoding = 'iso-8859-1'
page.search(...)

Setting the encoding to iso-8859-1 let me get the foreign characters I needed. In the tests, I put the following magic comment at the top of the file:

# coding: utf-8

Now I can test with an HTML fixture containing foreign characters. Crazy stuff, but it works.

No comments:

Post a Comment