I think it is too long (8 bytes!). Because 0xB3 may appear quite frequently in UTF-8 text, that choice makes the database too big. I chose a short sequence because I want to keep the stored text small. If 0x07 0x33 is not good enough, I guess a sequence like 0xFE 0x07 0x33 would be, since I don't see how that sequence could be part of a multi-byte encoding. Oh well, you never know, and since we are just talking about paranoia, I'd better do some research. --TakuyaMurata
Size is not a good argument. Assume the escape sequence appears 100 times in the page. Then you will see 100 * (8 - 1) = 700 bytes of extra space. Not even 1 KB! And notice that these 700 bytes do not travel over the network -- they are just read from and written to the file on the server. This cannot be important. -- AlexSchroeder
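The estimate above can be reproduced for other escape lengths with a few lines of Perl. This is only a sketch: the helper name `escape_overhead` is made up for illustration, and it assumes every escaped 0xB3 is a single byte that gets replaced by a fixed-length sequence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of UseMod): extra bytes stored when each
# 1-byte 0xB3 is replaced by an escape sequence of $escape_len bytes.
sub escape_overhead {
    my ($occurrences, $escape_len) = @_;
    return $occurrences * ($escape_len - 1);
}

# Compare a 2-byte, a 3-byte and an 8-byte escape for 100 occurrences.
for my $len (2, 3, 8) {
    printf "%d-byte escape, 100 occurrences: %d extra bytes\n",
        $len, escape_overhead(100, $len);
}
```

With 100 occurrences the 8-byte escape costs 700 extra bytes, while the 2-byte escape 0x07 0x33 costs 100.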
Well, if you assume it appears 1000 times, you get 7 KB extra. That is unlikely but possible: not every wiki site or page is as short and neat as those on MeatBall. I didn't want to assume that size is not a big deal. Say someone stores an entire novel as a single page. I disagree with that idea, but we have to ensure he can do it without a prohibitive penalty.
The problem is that we are talking about paranoia. Actually, thinking about what can realistically happen, the WikiPatches/UtfEight patch is good enough for me, and indeed I applied it, rather than the patch on this page, to my wiki site. Still, with it we can't use some words, and that can be a serious problem. Occasionally you really need a certain character. What if your name contains a character whose encoding includes 0xB3 + 0x33 or something like it? You cannot simply substitute another character. WikiPatches/UtfEight cannot be more than a compromise. But I can't ignore its benefit: the patch is simple and therefore safe.
I am not so sure that UseMod supports multilingual text by default. I really love the simplicity of UseMod, and I don't like this hackish patch, which is tricky, less compatible with the existing database, and does not bring much benefit. (I posted it simply because I can, haha.) Anyway, my new proposal below seems fine to me in the end. --TakuyaMurata
By default, the field separator of the page database, $FS, has the value 0xB3. To prevent corruption of the page database, this byte is removed from the text submitted by users. This makes the corresponding character disappear in single-byte encodings, and it may corrupt characters in multi-byte encodings such as UTF-8.
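The effect on UTF-8 can be seen in a small sketch. It assumes the sample character U+00F3 (encoded as the bytes 0xC3 0xB3) and uses two illustrative helpers, `hex_dump` and `is_valid_utf8`, which are not part of UseMod; the substitution is the same one the unpatched DoPost applies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $FS = "\xb3";    # UseMod's default field separator

# Illustrative helper: show a byte string as hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Illustrative helper: utf8::decode works in place and returns false
# for malformed UTF-8, so it is applied to a copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

my $text = "\xc3\xb3";                  # UTF-8 bytes of U+00F3
(my $stripped = $text) =~ s/$FS//g;     # what the unpatched code does

printf "before: %s (valid UTF-8: %d)\n", hex_dump($text),     is_valid_utf8($text);
printf "after:  %s (valid UTF-8: %d)\n", hex_dump($stripped), is_valid_utf8($stripped);
```

After stripping, only the lead byte 0xC3 is left, which is not a complete UTF-8 sequence on its own.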
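This patch by TakuyaMurata replaces each 0xB3 in submitted text with the bell character followed by the digit three (0x07 0x33) when the page is saved, and converts the sequence back to 0xB3 when the text is displayed or edited.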
The patch should work better than /UtfEight.
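As a rough illustration, the two functions from the patch can be tried outside the wiki script. This sketch copies their substitutions but sets $FS by hand and drops the /o and /e modifiers, so treat it as an approximation of the patched behaviour rather than the patch itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $FS = "\xb3";    # assumed: the default field separator

# Escape the separator before the text goes into the page database.
sub QuoteFs {
    my ($text) = @_;
    $text =~ s/$FS/\a3/g;    # 0xB3 -> bell (0x07) + '3' (0x33)
    return $text;
}

# Restore the separator byte when the text is displayed or edited.
sub UnquoteFs {
    my ($text) = @_;
    $text =~ s/\a3/$FS/g;
    return $text;
}

my $submitted = "x \xc3\xb3 y";         # contains UTF-8 for U+00F3
my $stored    = QuoteFs($submitted);
my $shown     = UnquoteFs($stored);

print "stored text is free of the separator\n" if index($stored, $FS) < 0;
print "round trip preserved the text\n"        if $shown eq $submitted;
```

One property worth keeping in mind: text that already contains a literal bell followed by '3' would also come back as 0xB3 after UnquoteFs.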
The reserved byte sequence is still quite tentative. I chose the bell followed by '3' for the following reasons:
--- wiki_92.pl 2002-12-24 11:53:12.000000000 -0600
+++ wiki_text.cgi 2002-12-24 12:34:40.000000000 -0600
@@ -55,7 +55,7 @@
   $q $Now $UserID $TimeZoneOffset $ScriptName $BrowseCode $OtherCode);
 
 # == Configuration =====================================================
-$DataDir = "/tmp/mywikidb"; # Main wiki directory
+$DataDir = "../../../home/admin1312/mywikidb"; # Main wiki directory
 $UseConfig = 1; # 1 = use config file, 0 = do not look for config
 
 # Default configuration (used if UseConfig is 0)
@@ -666,7 +666,8 @@
     $author = &GetAuthorLink($host, "", 0);
   }
   $sum = "";
-  if (($summary ne "") && ($summary ne "*")) {
+  if (($summary ne "") && ($summary ne "*")) {
+    $summary = &UnquoteFs($summary);
     $summary = &QuoteHtml($summary);
     $sum = "<strong>[$summary]</strong> ";
   }
@@ -765,7 +766,8 @@
   }
   $html .= ". . " . $minor . &TimeToText($ts) . " ";
   $html .= T('by') . ' ' . &GetAuthorLink($host, $user, $uid) . " ";
-  if (defined($summary) && ($summary ne "") && ($summary ne "*")) {
+  if (defined($summary) && ($summary ne "") && ($summary ne "*")) {
+    $summary = &UnquoteFs($summary);
     $summary = &QuoteHtml($summary); # Thanks Sunir! :-)
     $html .= "<b>[$summary]</b> ";
   }
@@ -1166,7 +1168,8 @@
   $pageText = &CommonMarkup($pageText, 1, 0); # Multi-line markup
   $pageText = &WikiLinesToHtml($pageText); # Line-oriented markup
   $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge; # Restore saved text
-  $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge; # Restore nested saved text
+  $pageText =~ s/$FS(\d+)$FS/$SaveUrl{$1}/ge; # Restore nested saved text
+  $pageText = &UnquoteFs($pageText);
   return $pageText;
 }
 
@@ -2643,15 +2646,18 @@
 
 sub GetTextArea {
   my ($name, $text, $rows, $cols) = @_;
-
-  if (&GetParam("editwide", 1)) {
-    return $q->textarea(-name=>$name, -default=>$text,
-                        -rows=>$rows, -columns=>$cols, -override=>1,
-                        -style=>'width:100%', -wrap=>'virtual');
-  }
-  return $q->textarea(-name=>$name, -default=>$text,
-                      -rows=>$rows, -columns=>$cols, -override=>1,
-                      -wrap=>'virtual');
+  my ($ta) = '';
+
+  $text = &UnquoteFs($text);
+
+  # To avoid the bug in CGI.pm, make textarea directly
+  # without CGI.pm
+
+  $ta = "<textarea name=\"$name\" rows=\"$rows\" cols=\"$cols\" wrap=\"virtual\"";
+  if (&GetParam('editwide', 1)) {
+    $ta .= " style=\"width:100%;\"";
+  }
+  return "$ta>$text</textarea>\n";
 }
 
 sub DoEditPrefs {
@@ -3188,6 +3194,18 @@
   return @links;
 }
 
+sub QuoteFs {
+  my ($text) = @_;
+  $text =~ s/$FS/\a3/og;
+  return $text;
+}
+
+sub UnquoteFs {
+  my ($text) = @_;
+  $text =~ s/\a3/$FS/ge;
+  return $text;
+}
+
 sub DoPost {
   my ($editDiff, $old, $newAuthor, $pgtime, $oldrev, $preview, $user);
   my $string = &GetParam("text", undef);
@@ -3214,8 +3232,8 @@
     &ReportError(Ts('[[%s]] cannot be defined.', $id));
     return;
   }
-  $string =~ s/$FS//g;
-  $summary =~ s/$FS//g;
+  $string = &QuoteFs($string);
+  $summary = &QuoteFs($summary);
   $summary =~ s/[\r\n]//g;
   # Add a newline to the end of the string (if it doesn't have one)
   $string .= "\n" if (!($string =~ /\n$/));