UseMod Wiki: WikiPatches/EscapeFS

This patch should not be necessary for 1.0, which has an optional longer $FS (0x1e 0xFF 0xFE 0x1e). New wikis can use the new $FS, and old wikis can be converted using a conversion utility. --CliffordAdams

Comments from Cliff: I like this patch, and I think I will use this idea for the next beta version (0.97). I would like to change the escape-byte sequence to something that should never appear in any encoded text. I am thinking of using a long sequence like 0xFF 0xFE 0xFF 0xFE 0x07 0x33 0xFE 0xFF which would be very unlikely to occur in any text encoding. (I am worried that a short sequence like 0x07 0x33 might occur as part of a longer sequence in some multi-byte encoding.) --CliffordAdams

I think it is too long (8 bytes!). Because 0xB3 can appear quite frequently in UTF-8 text, that choice makes the database too big. I chose a short sequence because I want to keep the text small. If 0x07 0x33 is not good enough, a sequence like 0xfe 0x07 0x33 should be, I guess; I don't see how that sequence could be part of a multi-byte encoding. Oh well, you never know, and since we are talking about paranoia, I'd better do some research. --TakuyaMurata

Size is not a good argument. Assume the escaped byte appears 100 times in the page. Then you will see 100 * (8 - 1) = 700 bytes of extra space. Not even 1 KB! And notice that these 700 bytes do not travel over the network -- they are just read from and written to the file on the server. This cannot be important. -- AlexSchroeder
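The estimate above can be reproduced for other escape lengths. The following is only a sketch in Python (not part of the patch), and the function name escape_overhead is made up for illustration. It assumes each escaped occurrence replaces a single byte.

```python
def escape_overhead(occurrences, escape_len, original_len=1):
    """Extra bytes stored when each occurrence of a byte sequence of
    length original_len is replaced by an escape of length escape_len."""
    return occurrences * (escape_len - original_len)


# 8-byte escape, 100 occurrences (the case discussed above)
long_escape_100 = escape_overhead(100, 8)
# 2-byte escape (0x07 0x33), 100 occurrences
short_escape_100 = escape_overhead(100, 2)

print(long_escape_100, short_escape_100)
```

With these inputs the 8-byte escape costs 700 extra bytes and the 2-byte escape costs 100.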

Well, what if you assume it appears 1000 times? Then you get 7 KB extra. It is unlikely but possible: not every wiki site or page is as short and neat as those on MeatBall. I didn't want to assume that size is not a big deal. Someone might store an entire novel as a single page. I disagree with that idea, but we have to make sure he can do it without a prohibitive penalty.

The problem is that we are talking about paranoia. Thinking about what can really happen, the WikiPatches/UtfEight patch is good enough for me, and indeed I applied it to my wiki site rather than this one. Still, with it we can't use some words, and that can be a serious problem. Occasionally you really need a certain character. What if your name contains a character encoded as 0xB3 + 0x33 or something like it? You can't just substitute another character. WikiPatches/UtfEight cannot be more than a compromise. But I can't ignore the benefit that the patch is simple and therefore safe.

I am not so sure that UseMod supports multiple languages by default. I really love the simplicity of UseMod, and I don't like this hacky patch, which is tricky and less compatible with the existing database, and the benefits are not that great. (I posted it simply because I can, haha.) Anyway, my new proposal below seems fine to me in the end. --TakuyaMurata

: I was thinking: if we really want to make sure certain characters never appear in any text, is it possible to use a sequence like 0xB3 0x42 (capital B, to avoid digits)? Because we have already quoted 0xB3, we are free to use 0xB3. It sounds tricky and could be dangerous, though it seems to work for me. --TakuyaMurata

Patch Description

By default, the field separator of the page database, $FS, has the value 0xB3. To prevent corruption of the page database, this byte is removed from text submitted by users. This makes the corresponding character disappear in single-byte encodings, and it can corrupt characters in multi-byte encodings such as UTF-8.
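The effect of stripping 0xB3 can be seen in a small experiment. This is a Python sketch, not code from the script; strip_fs is a hypothetical stand-in for the s/$FS//g substitution in DoPost, and the sample strings are made up.

```python
FS = b"\xb3"  # default UseMod field separator


def strip_fs(raw: bytes) -> bytes:
    """Mimic the unpatched behaviour: drop every FS byte from the input."""
    return raw.replace(FS, b"")


# Single-byte encoding: in Latin-1, 0xB3 is the character U+00B3,
# so that character simply disappears.
latin1_result = strip_fs("x\u00b3".encode("latin-1")).decode("latin-1")

# Multi-byte encoding: in UTF-8, U+00F3 (o with acute) is 0xC3 0xB3.
# Removing 0xB3 leaves a dangling lead byte, which is invalid UTF-8.
utf8_damaged = strip_fs("G\u00f3mez".encode("utf-8"))
utf8_result = utf8_damaged.decode("utf-8", errors="replace")

print(repr(latin1_result), repr(utf8_result))
```

In the UTF-8 case the decoder has to substitute U+FFFD for the orphaned byte, which is the kind of corruption described above.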

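This patch by TakuyaMurata escapes 0xB3 instead of removing it: on save, each 0xB3 is replaced with the bell character followed by the digit three (0x07 0x33), and the sequence is turned back into 0xB3 when the text is displayed or edited.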

The patch should work better than /UtfEight.
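The quoting scheme can be sketched outside the script as follows. This is a Python illustration that assumes byte-string input; quote_fs and unquote_fs only mirror the QuoteFs and UnquoteFs subs added by the patch and are not the patch code itself.

```python
FS = b"\xb3"          # default UseMod field separator
ESCAPE = b"\x07\x33"  # bell character followed by ASCII '3'


def quote_fs(text: bytes) -> bytes:
    """Replace every FS byte with the escape sequence before storing."""
    return text.replace(FS, ESCAPE)


def unquote_fs(text: bytes) -> bytes:
    """Turn the escape sequence back into FS when displaying or editing."""
    return text.replace(ESCAPE, FS)


page = "G\u00f3mez".encode("utf-8")  # contains 0xC3 0xB3
stored = quote_fs(page)              # safe to write: no FS byte left
restored = unquote_fs(stored)        # identical to the submitted bytes

# Known limitation of this simple scheme: input that already contains
# the escape sequence comes back with FS in its place.
literal = b"ring\x07\x33"
changed = unquote_fs(quote_fs(literal))

print(stored, restored == page, changed)
```

The last case shows why the choice of escape sequence matters: the round trip is only lossless if the sequence never occurs in submitted text.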

The reserved byte sequence is still quite tentative. I chose the bell followed by '3' for the following reasons:

you can read it easily for debugging [SupportForUtf8]
it never appears in the text a user would submit, no matter what coding system is used (see RFC 1345)
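Whether the second point holds seems to depend on the encoding. In UTF-8 and other ASCII-compatible encodings, bytes below 0x80 only ever stand for ASCII characters, so 0x07 0x33 can occur only if the text really contains a bell followed by '3'. In UTF-16 the same two bytes can be a single character. A quick Python check, with a made-up helper name:

```python
ESCAPE = b"\x07\x33"  # bell character followed by ASCII '3'


def contains_escape(text: str, encoding: str) -> bool:
    """Report whether the encoded form of text contains the escape bytes."""
    return ESCAPE in text.encode(encoding)


sample = "\u0733"  # a code point in the Syriac block

in_utf16_be = contains_escape(sample, "utf-16-be")  # bytes are 0x07 0x33
in_utf8 = contains_escape(sample, "utf-8")          # bytes are 0xDC 0xB3
literal_bell = contains_escape("\x07" + "3", "utf-8")

print(in_utf16_be, in_utf8, literal_bell)
```

Incidentally, the UTF-8 form of this sample character contains 0xB3, the default $FS.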

Note:

The existing database remains untouched.
The patch stores the byte sequence (07 33) in the database, so unpatched versions of the script may show corrupt characters (mojibake) [1].

--- wiki_92.pl	2002-12-24 11:53:12.000000000 -0600
+++ wiki_text.cgi	2002-12-24 12:34:40.000000000 -0600
@@ -55,7 +55,7 @@
   $q $Now $UserID $TimeZoneOffset $ScriptName $BrowseCode $OtherCode);
 
 # == Configuration =====================================================
-$DataDir     = "/tmp/mywikidb"; # Main wiki directory
+$DataDir     = "../../../home/admin1312/mywikidb"; # Main wiki directory
 $UseConfig   = 1;       # 1 = use config file,    0 = do not look for config
 
 # Default configuration (used if UseConfig is 0)
@@ -666,7 +666,8 @@
       $author = &GetAuthorLink($host, "", 0);
     }
     $sum = "";
-    if (($summary ne "") && ($summary ne "*")) {
+    if (($summary ne "") && ($summary ne "*")) {
+      $summary = &UnquoteFs($summary);
       $summary = &QuoteHtml($summary);
       $sum = "<strong>[$summary]</strong> ";
     }
@@ -765,7 +766,8 @@
   }
   $html .= ". . " . $minor . &TimeToText($ts) . " ";
   $html .= T('by') . ' ' . &GetAuthorLink($host, $user, $uid) . " ";
-  if (defined($summary) && ($summary ne "") && ($summary ne "*")) {
+  if (defined($summary) && ($summary ne "") && ($summary ne "*")) {
+    $summary = &UnquoteFs($summary);
     $summary = &QuoteHtml($summary);   # Thanks Sunir! :-)
     $html .= "<b>[$summary]</b> ";
   }
@@ -1166,7 +1168,8 @@
   $pageText = &CommonMarkup($pageText, 1, 0);   # Multi-line markup
   $pageText = &WikiLinesToHtml($pageText);      # Line-oriented markup
   $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge;   # Restore saved text
-  $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge;   # Restore nested saved text
+  $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge;   # Restore nested saved text
+  $pageText = &UnquoteFs($pageText);
   return $pageText;
 }
 
@@ -2643,15 +2646,18 @@
 
 sub GetTextArea {
   my ($name, $text, $rows, $cols) = @_;
-
-  if (&GetParam("editwide", 1)) {
-    return $q->textarea(-name=>$name, -default=>$text,
-                        -rows=>$rows, -columns=>$cols, -override=>1,
-                        -style=>'width:100%', -wrap=>'virtual');
-  }
-  return $q->textarea(-name=>$name, -default=>$text,
-                      -rows=>$rows, -columns=>$cols, -override=>1,
-                      -wrap=>'virtual');
+  my ($ta) = '';
+
+  $text = &UnquoteFs($text);
+
+  # To avoid the bug in CGI.pm, make textarea directly
+  # without CGI.pm
+
+  $ta = "<textarea name=\"$name\" rows=\"$rows\" cols=\"$cols\" wrap=\"virtual\"";
+  if (&GetParam('editwide', 1)) {
+    $ta .= " style=\"width:100%;\"";
+  }
+  return "$ta>$text</textarea>\n";
 }
 
 sub DoEditPrefs {
@@ -3188,6 +3194,18 @@
   return @links;
 }
 
+sub QuoteFs {
+  my ($text) = @_;
+  $text =~ s/$FS/\a3/og;
+  return $text;
+}
+
+sub UnquoteFs {
+  my ($text) = @_;
+  $text =~ s/\a3/$FS/ge;
+  return $text;
+}
+
 sub DoPost {
   my ($editDiff, $old, $newAuthor, $pgtime, $oldrev, $preview, $user);
   my $string = &GetParam("text", undef);
@@ -3214,8 +3232,8 @@
     &ReportError(Ts('[[%s]] cannot be defined.', $id));
     return;
   }
-  $string =~ s/$FS//g;
-  $summary =~ s/$FS//g;
+  $string = &QuoteFs($string);
+  $summary = &QuoteFs($summary);
   $summary =~ s/[\r\n]//g;
   # Add a newline to the end of the string (if it doesn't have one)
   $string .= "\n"  if (!($string =~ /\n$/));

WikiPatches