16 April 2013

I’ve started working on a website that will aggregate different arts calendars in the DC area. I’m writing it in ruby using the sinatra framework, and part of the project involves writing scripts that will scrape websites to create atom feeds. Yesterday I ran into a problem where a file I was generating in one script was causing an ‘invalid byte sequence’ error when I tried to read it into another script, and I decided to write up the solution in case some other ruby encodings neophyte gets bitten by the same bug.

Say you use nokogiri to parse an html fragment containing a non-breaking space (these snippets use ruby 2.0.0):

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> doc = Nokogiri::HTML 'Hallo,&nbsp;Welt.'
=> #<Nokogiri::HTML::Document:0x1231274 ... >

When we call the text method on the document object, we get a ruby string encoded using utf-8. The non-breaking space becomes the unicode character \u00A0.

irb(main):003:0> s = doc.text
=> "Hallo,\u00A0Welt."
irb(main):004:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> s.valid_encoding?
=> true

Now let’s write the string to a file and read it back in.

irb(main):008:0> File.open('temp', 'w') {|f| f.write s}
=> 13
irb(main):009:0> s2 = File.open('temp') {|f| f.gets}
=> "Hallo,\xC2\xA0Welt."

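We didn’t get our \u00A0 back. Instead we got the two raw bytes \xC2\xA0, which are how utf-8 encodes that character. If you were to view the temporary file in a text editor that doesn’t assume utf-8, you might see an A with a circumflex diacritic or a pair of funny-looking question mark characters. The problem is that ruby didn’t read in the file using utf-8: since I gave no encoding, it tagged the string with its default external encoding, which in my environment was US-ASCII.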

irb(main):010:0> s2.encoding
=> #<Encoding:US-ASCII>
irb(main):011:0> s2.valid_encoding?
=> false
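To convince yourself that the bytes on disk are fine and only the encoding label is wrong, here is a minimal sketch. The helper name read_tagged and the temporary directory are my own inventions for illustration; forcing 'us-ascii' in the mode string is just a way to mimic the environment above on a machine whose default happens to be utf-8.

```ruby
require 'tmpdir'

# Hypothetical helper: write the raw bytes of `text` to a scratch file, then
# read them back tagged with the given external encoding.
def read_tagged(text, external)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'temp')
    File.binwrite(path, text)
    File.open(path, "r:#{external}") { |f| f.read }
  end
end

original = "Hallo,\u00A0Welt."

# What a bare File.open would use on this machine.
puts "default external: #{Encoding.default_external}"

ascii = read_tagged(original, 'us-ascii')
puts "tagged as:        #{ascii.encoding}"
puts "valid?            #{ascii.valid_encoding?}"
puts "same bytes?       #{ascii.bytes == original.bytes}"

# Relabel the same bytes as utf-8 without changing them.
fixed = ascii.dup.force_encoding('utf-8')
puts "after relabel:    #{fixed == original}"

# Matching a regular expression against the mislabelled string should raise
# the 'invalid byte sequence' error.
begin
  ascii =~ /Welt/
rescue ArgumentError => e
  puts "regex match raised: #{e.message}"
end
```

Note that force_encoding only changes the label on the string; it does not convert anything, which is exactly what we want when the bytes were right all along.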

Processing the resulting string will cause certain methods (regular expression matching, for example) to fail with the aforementioned ‘invalid byte sequence’ error. For me, this happened when I tried to parse my auto-generated atom feed using simple-rss. Fortunately the solution is simple: we just need to explicitly set the encoding when we open the file.

irb(main):012:0> s3 = File.open('temp', 'r:utf-8') {|f| f.gets}
=> "Hallo,\u00A0Welt."
irb(main):013:0> s3.encoding
=> #<Encoding:UTF-8>
irb(main):014:0> s3.valid_encoding?
=> true
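The mode string isn’t the only way to name the encoding. As a rough sketch (the helper read_utf8_variants and the scratch file are made up for this example), here are three spellings that should all hand back a utf-8 string:

```ruby
require 'tmpdir'

# Hypothetical helper: read the same file three ways, each naming utf-8
# explicitly, and return the results keyed by approach.
def read_utf8_variants(path)
  {
    mode_string:  File.open(path, 'r:utf-8') { |f| f.read },
    open_keyword: File.open(path, 'r', encoding: 'utf-8') { |f| f.read },
    file_read:    File.read(path, encoding: 'utf-8')
  }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'temp')
  File.binwrite(path, "Hallo,\u00A0Welt.")

  read_utf8_variants(path).each do |how, text|
    puts "#{how}: #{text.encoding} valid=#{text.valid_encoding?}"
  end
end
```

If you would rather change the default for the whole process, you can assign Encoding.default_external = Encoding::UTF_8 early in your script or start ruby with -E utf-8. That affects every file opened afterwards, so I’d treat it as a bigger hammer than naming the encoding at the one call that needs it.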

Alternatively, you can write a byte-order mark at the beginning of your file and have ruby check for it when it opens the file (by opening it with a mode such as 'r:BOM|UTF-8'). This approach is described in this stack overflow question. Personally, though, I’m not so sure I need these fancy-schmancy non-breaking spaces to begin with.

irb(main):015:0> s.gsub!(/(\u00A0)+/, ' ')
=> "Hallo, Welt."
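Going back to the byte-order mark alternative mentioned above, here is a minimal sketch of how it could look. The constant UTF8_BOM and the two helper names are hypothetical, chosen only for this example; the 'BOM|UTF-8' mode is the piece that comes from ruby itself.

```ruby
require 'tmpdir'

# U+FEFF, which utf-8 encodes as the three bytes EF BB BF.
UTF8_BOM = "\uFEFF"

# Hypothetical helper: write `text` with a leading byte-order mark.
def write_with_bom(path, text)
  File.open(path, 'w:utf-8') { |f| f.write(UTF8_BOM + text) }
end

# Hypothetical helper: 'BOM|UTF-8' asks ruby to skip a leading utf-8
# byte-order mark if one is present and to tag what it reads as utf-8.
def read_checking_bom(path)
  File.open(path, 'r:BOM|UTF-8') { |f| f.read }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'temp')
  write_with_bom(path, "Hallo,\u00A0Welt.")

  text = read_checking_bom(path)
  puts text.encoding
  puts text.valid_encoding?
  puts text.start_with?(UTF8_BOM) # the mark should have been skipped on read
end
```

The mark mostly helps other programs guess the encoding. On the ruby side you still have to ask for the check in the mode string, so it is no less typing than plain 'r:utf-8'.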

Postscript: ironically, jekyll (or, more specifically, the liquid gem) crashed when I tried to process the first draft of this post because it included non-ASCII characters. Encoding!