[Home]WikiBugs/Perl5.8BreaksUseModDataFiles

UseModWiki | WikiBugs | RecentChanges | Preferences

I believe this is a serious problem, although I have only been able to test it on ActiveState?'s ActivePerl 5.8.8 on Windows (Vista) with Apache, the "new" multi-byte $FS, and the UTF-8 character set. Other configurations may work differently or malfunction in different ways.

Earlier discussions (SupportForUtf8) noted that the (old) pragma "use utf8;" breaks things. Starting from Perl 5.8, however, that pragma is no longer needed, because Perl is now fully Unicode-aware, as stated in the perluniintro man page: "The principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data are transparently upgraded to Unicode." The effect of this on UseMod (version 1.04), at least in a multi-lingual setup where UTF-8 multi-byte characters are encountered regularly (but probably also affecting all pages because of the multi-byte $FS?), is that Perl silently "upgrades" strings and then, because UseMod doesn't specify a PerlIO? layer, writes these strings to the .db files using its native internal encoding (the raw bytes by which it represents these strings to itself), which happens to be... UTF-8. However, when Perl later reads those files back, because UseMod doesn't specify any decoding layer, it is unable to interpret the bytes (especially with the non-UTF $FS byte sequence) other than as a series of single-byte characters (I think: I haven't been able to work out exactly what is happening). In any case, the result is that UseMod throws a "Bad page version (or corrupt page)" error for any page that has been saved after upgrading to Perl 5.8.
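The bytes-versus-characters distinction behind this "upgrading" can be demonstrated in a few lines of Perl (a minimal sketch, not UseMod code; the euro sign is just an arbitrary multi-byte character):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same euro sign, in two forms:
my $bytes = "\xE2\x82\xAC";   # encoded: three raw UTF-8 bytes
my $chars = "\x{20AC}";       # decoded: one Perl character

print length($bytes), "\n";   # 3 -- Perl sees three single-byte characters
print length($chars), "\n";   # 1 -- Perl sees one (upgraded) character

# encode() turns the character string back into exactly those bytes,
# which is what Perl effectively does when an upgraded string is
# written out with no PerlIO layer specified:
print encode('UTF-8', $chars) eq $bytes ? "same bytes\n" : "different\n";
```

Because both forms render identically in an editor, the mismatch only becomes visible when the file is read back and the byte sequences (including $FS) no longer compare equal.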

Examining these .db files in an editor shows that they are not corrupt; they are merely encoded as "UTF-8" (according to my text editor), whereas any files produced on versions of Perl prior to 5.8 (regardless of whether or not they contain UTF-8 byte sequences: most of mine do) are encoded as "ANSI" (again, according to my text editor). Now, if I insert the line "use open ':utf8';" at the beginning of UseMod (e.g., directly after "use strict;"), UseMod can magically read these UTF-8 .db files without any error. HOWEVER, that pragma instructs Perl to pass all reads and writes through an implicit :utf8 encoding/decoding layer, so now it is the old-format files which throw the "Bad page version" error and become unreadable.
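The asymmetry described above -- the same file reading back differently with and without the :utf8 layer -- can be reproduced in isolation (a minimal sketch using File::Temp, not UseMod code):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $file) = tempfile();
# Write one character string through an explicit :utf8 layer;
# the e-acute (\x{E9}) is stored on disk as two bytes.
binmode($fh, ':utf8');
print $fh "caf\x{E9}\n";
close $fh;

# Read it back WITHOUT a decoding layer: we get raw bytes, so the
# single e-acute comes back as two separate characters.
open my $raw, '<', $file or die $!;
my $line = <$raw>;
close $raw;
print length($line), "\n";   # 6 -- "caf" + 2 bytes + newline

# Read it back WITH the layer: the two bytes decode to one character.
open my $dec, '<:utf8', $file or die $!;
$line = <$dec>;
close $dec;
print length($line), "\n";   # 5 -- "caf" + 1 char + newline
```

This is exactly why the reader must use the same convention the writer used -- and why a wiki with files written under both conventions is stuck either way.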

If I were starting a new Wiki, I would simply insert that one line (use open ':utf8';) and the problem would be solved. However, I maintain a large embedded Wiki used as a content-management system, with a lot of old-style pages which would be rendered unreadable that way, while leaving out that line means that any attempt to edit an old-style page renders it unreadable. Catch-22.

The Perl 5.8 Unicode tutorial (perlunitut) recommends that all read and write operations in new Perl code should explicitly encode and decode, to avoid precisely this kind of "upgrading" problem: "Decode everything you receive, encode everything you send out. (If it's text data.)" Reads and writes should be done like this:

open(IN, '<:utf8', $fileName) or die $!; $string = <IN>; etc.
open(OUT, '>:utf8', $fileName) or die $!; print OUT $string; etc.

Or alternatively, in UseMod, you could decode at some later stage (but before Perl does any "upgrading" of its own -- if you leave it too late, the script halts with "Wide character" warnings). To use decode on existing strings, you need to insert a "use Encode;" line at the beginning of the UseMod script. So, in sub OpenPage, instead of the line $data = &ReadFileOrDie($fname);, you can do this:

$data = decode('UTF-8', &ReadFileOrDie($fname));

Of course you might not want to hard-code 'UTF-8' here, so you could use $data = decode($HttpCharset, &ReadFileOrDie($fname)); instead. But you also need a fallback in case you are reading files which have a different encoding from the one specified by the user variable $HttpCharset. My amendments to the OpenPage subroutine are as follows, and this works for me (remember to use Encode; or use Encode qw(encode decode); at the beginning of the script):

sub OpenPage {
  my ($id) = @_;
  my ($fname, $data);

  if ($OpenPageName eq $id) {
    return;
  }
  %Section = ();
  %Text = ();
  $fname = &GetPageFile($id);
  if (-f $fname) {
#   $data = &ReadFileOrDie($fname); ###OLD
    $data = decode($HttpCharset, &ReadFileOrDie($fname)); ###NEW
    %Page = split(/$FS1/, $data, -1);  # -1 keeps trailing null fields
  } else {
    &OpenNewPage($id);
  }
  if ($Page{'version'} != 3) {
###<NEW>### attempt generic fallback for pages written in earlier Perl
    my $CharsetFallback = (lc($HttpCharset) eq 'utf-8') ? 'ISO-8859-1' : 'UTF-8';
    $data = decode($CharsetFallback, &ReadFileOrDie($fname));
    %Page = split(/$FS1/, $data, -1);
###</NEW>###
    &UpdatePageVersion() if ($Page{'version'} != 3); ###NEW if clause
  }
  $OpenPageName = $id;
}

I'm sure there are more elegant solutions than this, and it's quite possible that this kind of code would break installs on earlier Perl versions. Maybe a simpler solution would be to make the $Page{'version'} test more flexible so that it understands the raw bytes even if it doesn't know which character encoding has been used in the file, although patching it that way rather than grasping the need for explicit encoding/decoding might be storing up trouble for later. --Geoffrey K 2008/01/05.
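One possibly more elegant fallback, sketched here as an illustration only (decode_page_data is a hypothetical helper, not part of UseMod): instead of re-reading the file after the version test fails, attempt a strict UTF-8 decode first and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. Encode's FB_CROAK check makes decode() die on malformed input, which works as a cheap validity test:

```perl
use strict;
use warnings;
use Encode qw(decode);

sub decode_page_data {
    my ($bytes) = @_;
    # decode() may modify its source argument in place when a CHECK
    # value is given, so pass a copy and keep $bytes intact for the
    # fallback. ISO-8859-1 never fails: every byte maps to a character.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $chars ? $chars : decode('ISO-8859-1', $bytes);
}

print decode_page_data("caf\xC3\xA9"), "\n";  # valid UTF-8: decoded as such
print decode_page_data("caf\xE9"), "\n";      # invalid UTF-8: Latin-1 fallback
```

A caveat: a file in a single-byte encoding can occasionally happen to be valid UTF-8 as well, so this heuristic is not infallible, but in practice such collisions are rare.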

Thanks for reporting this. I'll have a look at it. This may take quite some time as I am not that familiar with Unicode. -- MarkusLude


Last edited January 13, 2008 11:34 pm by MarkusLude (diff)