SupportForUtf8

Note: This is included in the UseMod 1.0 release; see UseModWiki/NewFeatures.

See some discussion at UseMod:MultilingualWiki

Perl 5.6 seems to have enough support for UTF-8 character encoding[1] to get by. (See "perldoc perlunicode".)

Enabling UTF-8 support in UseModWiki should just be a matter of setting the $HttpCharset option to "UTF-8".

However, UseModWiki's internal data uses the byte 0xb3 for field separation (variable $FS), which conflicts with UTF-8. In UTF-8 the bytes 0xfe and 0xff never appear, so these are more appropriate. This is not a general solution, though, because these two bytes are real characters in Latin-1, for example (0xfe is 'þ', thorn, and 0xff is 'ÿ').
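
To see the conflict concretely, here is a tiny illustration (not part of UseModWiki): 0xb3 occurs as a continuation byte inside many UTF-8 sequences, so splitting stored data on it tears characters apart.

  # "ó" (U+00F3) is encoded in UTF-8 as the two bytes 0xc3 0xb3.
  my $text   = "a" . "\xc3\xb3" . "b";   # the UTF-8 bytes for "aób"
  my @fields = split /\xb3/, $text;
  print scalar(@fields), "\n";           # prints 2: the character was cut in half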

Why 0xb3 (superscript 3)?

CliffordAdams wanted a character that can be printed on a terminal (for debugging, human-readable data formats are always best!). A superscript 3 is printable and still not used very often (although on German keyboards, AltGr+3 produces it quite easily). In Latin-1 locales, the data can therefore be inspected with simple tools like `less' and `vi'.

In addition to breaking in multibyte encodings, 0xb3 is a real letter in some single-byte encodings. For instance, in Latin-2 it's lowercase slashed-l, a very common letter in Polish.

Adding UTF-8 support

TakuyaMurata posted a patch at WikiPatches/UtfEight. It doesn't solve the problem but supports UTF-8 to some extent -- instead of always removing 0xb3 from the text, it only removes it when followed by 1, 2, or 3. Thus, far fewer UTF-8 characters are corrupted.
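
A rough sketch of that idea (for illustration only; this is not the actual patch code):

  # Strip the separator byte only where it could be mistaken for a field
  # marker, i.e. when it is immediately followed by the digit 1, 2 or 3.
  $text =~ s/\xb3(?=[123])//g;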

Eventually, however, the database format will have to be overhauled. Moving to XML seems like an option, for example.

Automatic translation into entities

How about converting multibyte characters that contain 0xb3 into HTML entities referring to code points?

The problem is that you need Perl code that finds complete UTF-8 characters in order to translate them correctly.
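
A hedged sketch of how that could look (this assumes the Encode module, which only ships with Perl 5.8 and later; the function name is made up for illustration):

  use Encode qw(decode);

  # Replace every multi-byte UTF-8 sequence that contains the byte 0xb3
  # with a numeric character reference; leave everything else untouched.
  sub EntityEscapeConflicts {
    my ($text) = @_;
    $text =~ s{([\xc2-\xf4][\x80-\xbf]{1,3})}{
      index($1, "\xb3") >= 0 ? sprintf('&#%d;', ord(decode('UTF-8', $1))) : $1
    }ge;
    return $text;
  }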

Document how to start new UTF-8 wikis

When starting a UTF-8 wiki, just use a $FS that can never appear in UTF-8 -- use 0xfe or 0xff (see RFC 2279 for details). All we need to do is document this close to $HttpCharset, and make sure that the admin understands that changing $FS on an existing wiki will corrupt the database; converting it requires external tools.
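
For a brand-new wiki the relevant assignments could look like this (a sketch; $FS is assigned inside the wiki script itself, so this means editing that assignment, and it must never be changed once pages exist):

  $FS          = "\xfe";     # field separator; the byte 0xfe never occurs in UTF-8
  $HttpCharset = "UTF-8";    # so browsers send and receive UTF-8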

Integrating this conversion into the script might be a very nice next step, though. Here is a small CGI script that does it for you by converting 0xb3 (\263) to 0xfe (\376):

UNTESTED! Just fix bugs if you encounter them, or remove this note if it worked for you.

    #!/usr/bin/perl
    use CGI qw/:standard/;
    use CGI::Carp 'fatalsToBrowser';
    print header;
    print start_html('Converting UseMod Database $FS from 0xb3 to 0xfe');
    print h1('Converting UseMod Database $FS from 0xb3 to 0xfe');
    chdir ("/usr/home/v1/a0013621/html");   # the wiki's data directory
    undef $/;                               # slurp whole files at once
    @files = glob("keep/*/* page/*/* user/*/*");
    foreach $file (@files) {
      print p($file);
      open(F, $file) or next;
      $_ = <F>;
      close F;
      tr/\263/\376/;                        # 0xb3 -> 0xfe
      rename($file, "$file~") unless -e "$file~";   # keep a backup copy
      open(F, ">$file") or next;
      print F;
      close F;
    }
    print p("Done.");
    print end_html;
A bug: the script above does not handle SubPages at all. A risky fix is to simply remove the 'rename' line.
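
Another possible fix (a sketch, assuming the usual UseModWiki layout where subpage files live one directory level deeper) is to widen the glob and skip directories:

  @files = grep { -f } glob("keep/*/* keep/*/*/* page/*/* page/*/*/* user/*/*");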

A Python script instead:

#!/usr/bin/python
import os, os.path

def replacef(path, old='\263', new='\376', base='/tmp'):
    # Write a converted copy of 'path' under 'base', creating directories as needed.
    if os.path.exists(path):
        newdirname = base + os.path.dirname(path)
        basename = os.path.basename(path)
        newpath = newdirname + "/" + basename

        try:
            os.makedirs(newdirname)
        except OSError:
            pass

        print "Converting %s ..." % path

        f = open(path, 'rb')
        ff = open(newpath, 'wb')
        ff.write(f.read().replace(old, new))
        f.close()
        ff.close()

def convert(dummy, path, names):
    # names = filter(names, exclude)
    for name in names:
        fn = "%s/%s" % (path, name)
        if os.path.isfile(fn):
            replacef(fn)

def main():
    os.path.walk('/your/usemod-wiki/datadir', convert, '')

main()

CGI charset

Set the CGI query charset correctly in InitRequest in order to prevent incorrect HTML escapes:

  # Fix some issues with editing UTF8 pages (if charset specified)
  if ($HttpCharset ne '') {
    $q->charset($HttpCharset);
  }

CAVEAT: Depending on the version of CGI.pm, this may not work ($q->charset may be undefined).

This works for me. I am running a UTF-8 wiki site with Perl 5.6.1 from Debian Woody.
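
If an older CGI.pm lacks the charset method, one possible workaround (a sketch, not from UseModWiki) is to probe for the method first:

  # Only set the charset if this version of CGI.pm supports the method.
  if ($HttpCharset ne '' && $q->can('charset')) {
    $q->charset($HttpCharset);
  }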


Do not "use utf8;"

Using Perl's "use utf8;" breaks things (page edit always gives the default blank text area). Just don't use it; it shouldn't be needed.

UseModWiki's handling of $UpperLetter, $LowerLetter, $NonEnglish, and $FreeLinks is a mess when it comes to UTF-8. For example, I tried to set

  $UpperLetter = "([A-Z]|\xc3[\x80-\x9e])";
  $LowerLetter = "([a-z]|\xc3[\x9f-\xbf])";
  $AnyLetter = "([A-Za-z0-9_]|\xc3[\x80-\xbf])";

for the UTF-8 representation of the lower- and upper-case letters defined in ISO 8859-1. This works with patterns like $AnyLetter+ etc., but in the Pattern/Store substitutions in CommonMarkup you then have to count all the parentheses to get $1, $2 right. Expanding these regexps to include more upper- and lower-case letters leads to absurdly long expressions. Shouldn't "use utf8" and "use locale" be able to solve these problems, letting the system decide what is an upper-case letter, a lower-case letter, or a non-letter simply through the [:lower:] and [:upper:] character classes? Can the problems that "use utf8" causes be solved?

Try:

  $UpperLetter = "(?:[A-Z]|\xc3[\x80-\x9e])";
  $LowerLetter = "(?:[a-z]|\xc3[\x9f-\xbf])";
  $AnyLetter = "(?:[A-Za-z0-9_]|\xc3[\x80-\xbf])";

This seems to work as far as making CamelCase links goes, though it's incomplete and a locale solution would be cleaner if it worked. --BrionVibber

If you are looking to add MySQL support, [chapter 8 of the MySQL manual] describes the UTF-8 support in MySQL version 4.1. The first alpha 4.1.0 was released in April 2003. See the [MySQL Change History] for a release schedule.

From the manual:

The `use utf8' pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope.

It has no effect on files read from the system, for example. It only affects literal strings in the source file when it is parsed. -- AlexSchroeder
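
A small illustration of that scope (hedged; the file name is made up):

  use utf8;                   # affects how this source file itself is parsed ...
  my $literal = "ü";          # ... so this UTF-8 literal becomes a character string
  open(F, "< page.db");       # (hypothetical file name)
  my $stored = <F>;           # ... but data read from disk is still raw bytes
  close(F);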

The corresponding sub from OddMuse:

sub InitLinkPatterns {
  my ($UpperLetter, $LowerLetter, $AnyLetter, $WikiWord, $QDelim);
  $QDelim = '(?:"")?';# Optional quote delimiter (removed from the output)
  $WikiWord = '[A-Z]+[a-z\x80-\xff]+[A-Z][A-Za-z\x80-\xff]*';
  $LinkPattern = "($WikiWord)";
  $LinkPattern .= $QDelim;
  # Inter-site convention: sites must start with uppercase letter.
  # This avoids confusion with URLs.
  $InterSitePattern = '[A-Z]+[A-Za-z\x80-\xff]+';
  $InterLinkPattern = "($InterSitePattern:[-a-zA-Z0-9\x80-\xff_=!?#$@~`%&*+\\/:;.,]+[-a-zA-Z0-9\x80-\xff_=#$@~`%&*+\\/])$QDelim";
  $FreeLinkPattern = "([-,.()' _0-9A-Za-z\x80-\xff]+)$QDelim";
  $UrlProtocols = 'http|https|ftp|afs|news|nntp|mid|cid|mailto|wais|'
                  . 'prospero|telnet|gopher';
  $UrlProtocols .= '|file'  if $NetworkFile;
  $UrlPattern = "((?:$UrlProtocols):(?://[-a-zA-Z0-9_.]+:[0-9]*)?[-a-zA-Z0-9_=!?#$\@~`%&*+\\/:;.,]+[-a-zA-Z0-9_=#$\@~`%&*+\\/])$QDelim";
  $ImageExtensions = '(gif|jpg|png|bmp|jpeg)';
  $RFCPattern = "RFC\\s?(\\d+)";
  $ISBNPattern = 'ISBN:?([0-9- xX]{10,})';
}

Browser compatibility

Implementers should be warned that Netscape is not happy with UTF-8. Although it is quite facile at displaying UTF-8 text, it does not edit UTF-8 text properly. If you edit a UTF-8 textarea that contains more than 512 characters, even if all of them are 7-bit characters, it strips the trailing character from the textarea before submitting it. (There may be additional similar bugs of which I am not aware.) So if you use UTF-8, I recommend you require that site editors use Internet Explorer. -- MeatBall:ScottMoonen

''Saying "Netscape" is too general. What version and platform? Mozilla seems to be ok. Info is available on problems with Netscape 4.x [2].''

We've got a partial list of known good and bad UTF-8-editing browsers going at MetaWikiPedia:meta.wikipedia.org_technical_issues.

A reasonable solution may be to check the browser version when the user attempts an edit. If a problematic browser is detected, we redirect to a page suggesting alternatives.
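
A very rough sketch of such a check (hedged: the user-agent pattern is a guess at detecting Netscape 4.x, and the warning page is simply printed inline rather than using any UseModWiki helper):

  my $agent = $ENV{'HTTP_USER_AGENT'} || '';
  # Old Netscape 4.x identifies itself as Mozilla/4.x without MSIE/Opera/Gecko.
  if ($agent =~ m{^Mozilla/4\.} && $agent !~ /MSIE|Opera|Gecko/) {
    print "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    print "<p>Your browser is known to corrupt UTF-8 edits. ",
          "Please use a different browser to edit pages.</p>\n";
    exit;
  }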

As for text browsers -- you can use xterm -u8, and then start w3m-m17n. -- AlexSchroeder

Alex's suggestion

I like WikiPatches/UtfEight. One way to improve it even more would be this: Whenever we save text, check for $FS, and if found, replace it with some reserved byte sequence. When loading text, do the reverse. Then $FS characters could be part of the text anyhow, and the DB format remains unchanged. -- AlexSchroeder

That is a great idea. I wonder why I didn't realize that what we need is not to handle UTF-8 specially but just to escape 0xb3 in some way. I will post a patch supporting this idea in a few days. --TakuyaMurata

I wonder, however: Why did you not use some combination of characters that will never appear in UTF-8 text? You could have used 0xfe 0xff 0xfe 0xff 0xfe 0xff or something similarly unlikely as the escape sequence. Then there is still a very tiny chance for a clash with non-UTF-8 databases, but it will *always* work with UTF-8 databases. -- AlexSchroeder

There is an important reason: I don't want the database to depend on the charset. If you choose 0xfe, your database is safe for UTF-8 but not for Latin-1. That is also why I didn't choose control characters that may conflict with other Japanese charsets such as SJIS. My proposed tab + '3' [3] sounds nasty but should work better. Anyway, that patch remains tentative, so I welcome any suggestions. I'd like to discuss the reserved byte sequences. -- TakuyaMurata

Hm, I do not know how SJIS really works, so perhaps you are right. On the other hand, I would try to make sure that the replacement sequence used is really, really improbable. And TAB 3 does not seem to be: e.g., what happens when a user pastes indented source code and one of the lines starts with 3?

       (format "Debug: testing %d %d %d"
               (+ 372 other-value)
               foo
               3)

OK, it is contrived, but not too much. Perhaps you can use a control character instead, such as 0x07 (the bell, ^G). But as I said, I do not know the other coding systems. If they share the first 128 ASCII characters, then all is well. -- AlexSchroeder

I totally forgot that users can paste a tab even though they can't type one in the browser. Also, my mistake: SJIS doesn't actually use control characters, and after some research I found that other encodings such as EUC and JIS don't use control characters either. I think the bell, 0x07, is a good choice. It is still a control character, but it may be considered somewhat viable. I really don't know much about charsets, so tell me if you have an idea. --TakuyaMurata

If you use cat to output the database files, you will hear many, many beeps. :) But when you use less, you will usually see a ^G instead of hearing the bell, so I think it is OK. Judging from the description of what they were once used for (for example here: http://www.robelle.com/smugbook/ascii.html), I think the other control characters should not be used.

As to the definitions of all the other character sets, you can read RFC 1345. I tried to find the information we are looking for: "Is the bell 0x07 really the bell in almost all interesting coding systems, and therefore a good choice?" I grepped the RFC for \bBL\b (the bell character) and it *seems* to always be at the same position: 0x07. -- AlexSchroeder
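
Putting the pieces of this thread together, a minimal sketch of the save/load escaping idea could look like this (the escape character and suffix letters are arbitrary illustrative choices, not taken from any posted patch):

  # \xb3 is UseModWiki's field separator; \x07 is the bell character discussed above.
  sub EscapeFS {      # call on page text just before it is stored
    my ($t) = @_;
    $t =~ s/\x07/\x07e/g;   # protect literal escape characters first
    $t =~ s/\xb3/\x07f/g;   # then hide the separator byte
    return $t;
  }

  sub UnescapeFS {    # call on page text just after it is loaded
    my ($t) = @_;
    $t =~ s/\x07f/\xb3/g;
    $t =~ s/\x07e/\x07/g;
    return $t;
  }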

Why not use US (Unit Separator, 0x1F)?

US (Unit Separator, 0x1F) — along with RS (Record Separator, 0x1E), GS (Group Separator, 0x1D), and FS (File Separator, 0x1C) — has been created exactly for this purpose. It's part of ASCII, and thus is part of every ASCII-compatible character set, including Latin-1 and Unicode/UTF-8. The only drawback is that it's usually not printable. -- JulianMehnle

The main reason is that I didn't know that this was a standard character in "every ASCII-compatible character set". Anyone setting up a new wiki can simply change the code so that $FS is set to \x1f (or another of the separator characters) instead of \xb3, if they like. The current default of \xb3 is kept for compatibility with existing wikis.

Another reason is that I didn't want to limit the wiki to just ASCII-compatible character sets. I thought that using a single character/byte might cause problems in other character sets, especially if that byte is used as a secondary or later byte in a multi-byte sequence. The multi-byte $FS (selected by setting the $NewFS option to 1) should be unusual in any character set. It includes the Record Separator character at the end, so it should not appear in ordinary text. --CliffordAdams

Another Wiki's approach

I'm not a Usemod user, but you might be interested in how another Perl-based Wiki is approaching this... TWiki ended up using NUL (\0) as the internal translation token (which we need for various things, a bit like your field separator character). This is quite safe for UTF-8 and all single-byte character sets, and for Unix systems generally since NUL is used by many Unix APIs and C programs as a string terminator. Perl treats NUL as just another character. For multi-byte character sets, there's a line in the TWiki code suggesting people change this to a multiple-byte string using characters that don't occur in the relevant character set - however, it's much cleaner to just require support for UTF-8 in the stored text of the page, and convert to and from other encodings. This is necessary to get Wikis to work at all (without enormous hassles) with escape-sequence based character sets such as ISO-2022-JP, since these embed ordinary ASCII characters within 'multi-byte' encoding sequences. For TWiki, I'm only going to support UTF-8 and single-byte character sets as the storage format, hence NUL will work fine, though EUC-JP and GB-2312 may also work even with this restriction.

For more information and lots of links to I18N resources and Wiki I18N efforts, including test pages, see http://twiki.org/cgi-bin/view/Codev/InternationalisationEnhancements - the UTF-8 work is at http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 and I have a few linked test pages for Cyrillic, Japanese and Chinese. TWiki already has locale-based I18N to enable non-UTF-8 WikiWords, which might help with some of the earlier discussion on this page if that still applies - I am currently putting in support for UTF-8 URL encodings at the moment. --RichardDonkin

