Ruby 1.9 is much improved over Ruby 1.8, but I'm not convinced that defaulting to US-ASCII encoding is a good choice. I suppose it makes sense if the goal is to make people be explicit about anything beyond a minimal default. This became an issue when I wrote a screen-scraping library that read a site with foreign (to me) characters. With the default encoding, the program would stop on those characters, returning early without reading the entire word. It took a long time to track down because the site I was scraping timed out a lot, so I assumed that was the issue. But when it seemed to time out on the same pages every time, I had to look into the problem more closely.
After finding the problem, I needed a solution. This post pointed me in the right direction, and I ended up with something like the following:
# Mechanize.new builds the agent; get fetches the page
agent = Mechanize.new
page = agent.get('/url')
page.encoding = 'iso-8859-1'
page.search(...)
Setting the encoding to iso-8859-1 let me get the foreign characters I needed. In the tests, I put the following magic comment at the top of the file:
# coding: utf-8
Now I can test with an HTML fixture containing foreign characters. Crazy stuff, but it works.
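For anyone curious what the encoding setting is doing underneath, here is a minimal sketch in plain Ruby, with no Mechanize involved. The sample bytes and the helper name latin1_to_utf8 are made up for illustration; they are not from my scraper. The idea is that the same bytes are invalid when read as UTF-8 but perfectly fine once Ruby is told they are ISO-8859-1.

```ruby
# Hypothetical helper (not part of Mechanize): label raw bytes as
# ISO-8859-1, then transcode them to UTF-8.
def latin1_to_utf8(bytes)
  # dup so the caller's string is left untouched
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# "café" as ISO-8859-1 bytes: c, a, f, then 0xE9 for the accented e.
raw = [0x63, 0x61, 0x66, 0xE9].pack('C*')

# Read as UTF-8, the lone 0xE9 byte is an incomplete sequence.
puts raw.dup.force_encoding('UTF-8').valid_encoding?

# Read as ISO-8859-1 and transcoded, it becomes a valid UTF-8 string.
fixed = latin1_to_utf8(raw)
puts fixed.encoding
puts fixed.valid_encoding?
```

Note that force_encoding only changes the label on the bytes, while encode actually rewrites them, which is why the sketch uses both.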