UseMod Wiki: WikiBugs/NonEnglishRSS

The RSS feature does not work properly when non-ASCII characters are involved. These are the problems I have identified:

: Function UriEscape

: Suggested new function. The reason for this is that the currently used CGI::escape does not do the right thing. The function actually called is CGI::Util::escape(), which (for reasons unknown to me) encodes the string to UTF-8 before URI escaping it. This is not correct for UseModWiki, since the strings to be escaped are byte strings encoded in $HttpCharset.

: Function GetRcRss

: The encoding attribute in the XML declaration should not be hard-coded.

: $RCName is not URI-escaped before it's used in the link to the RecentChanges page.

: Function GetRssRcLine

: CGI::escape is used to assign a URI-escaped version of $pagename to $pagenameEsc. See above for the reason why that is a problem.

: $author is not URI-escaped before it's used in the link to the author's page.

: $pagename, which is used in the "title" element, may contain underscores instead of spaces.

: Function DoRss

: Character encoding is not specified in the text/xml content-type header.
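
The byte-wise escaping described for UriEscape above can be illustrated with a small standalone sketch. It is not part of wiki.pl: the name uri_escape_bytes and the sample strings are made up for illustration, and the character class is simply the one used in the patch. The point it tries to show is that the escaped form depends only on the bytes passed in, so a page name stored as ISO-8859-1 and the same name stored as UTF-8 produce different, but each internally consistent, URIs.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: escape every byte that is not
# in the "safe" set as %XX, without re-encoding the string first.
sub uri_escape_bytes {
    my $bytes = shift;
    $bytes =~ s/([^A-Za-z0-9_.!~*'()-])/sprintf '%%%02X', ord $1/eg;
    return $bytes;
}

# The same word, once as ISO-8859-1 bytes and once as UTF-8 bytes.
my $latin1 = "R\xE4ksm\xF6rg\xE5s";
my $utf8   = "R\xC3\xA4ksm\xC3\xB6rg\xC3\xA5s";

print uri_escape_bytes($latin1), "\n";   # R%E4ksm%F6rg%E5s
print uri_escape_bytes($utf8),   "\n";   # R%C3%A4ksm%C3%B6rg%C3%A5s
```

If the first string were converted to UTF-8 before escaping, as the report says CGI::Util::escape() does, it would come out like the second line, which no longer matches a page name stored as ISO-8859-1 bytes.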

Below please find a patch that fixes the above problems. --GunnarH

I don't have much experience with RSS. There may be a lot more problems with charsets other than ASCII. I'll have a closer look. There's already a function UriEscape in 1.0.4, but it uses a positive instead of a negative list of chars. I need to check whether yours covers all the normal characters or whether the current one has too much in it. -- MarkusLude

--- usemod-1.0.4/wiki.pl        2007-11-30 14:48:11.000000000 -0500
+++ wiki.pl     2009-03-26 14:59:13.000000000 -0400
@@ -882,6 +882,12 @@
   return $html;
 }

+sub UriEscape {
+  my $bytes = shift;
+  $bytes =~ s/([^A-Za-z0-9_.!~*'()-])/sprintf '%%%02X', ord $1/eg;
+  $bytes;
+}
+
 sub GetRcRss {
   my ($rssHeader, $headList, $items);

@@ -893,7 +899,7 @@
   my $ChannelAbout = &QuoteHtml($FullUrl . &ScriptLinkChar()
                                 . $ENV{QUERY_STRING});
   $rssHeader = <<RSS ;
-<?xml version="1.0" encoding="ISO-8859-1"?>
+<?xml version="1.0" encoding="@{[$HttpCharset or 'ISO-8859-1']}"?>
 <rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns="http://purl.org/rss/1.0/"
@@ -902,7 +908,7 @@
 >
     <channel rdf:about="$ChannelAbout">
         <title>${\(&QuoteHtml($SiteName))}</title>
-        <link>${\($QuotedFullUrl . &ScriptLinkChar() . &QuoteHtml("$RCName"))}</link>
+        <link>${\($QuotedFullUrl . &ScriptLinkChar() . &QuoteHtml( UriEscape($RCName) ))}</link>
         <description>${\(&QuoteHtml($SiteDescription))}</description>
         <wiki:interwiki>
             <rdf:Description link="$QuotedFullUrl">
@@ -935,7 +941,7 @@
   my ($pagenameEsc, $itemID, $description, $authorLink, $author, $status,
       $importance, $date, $item, $headItem);

-  $pagenameEsc = CGI::escape($pagename);
+  $pagenameEsc = UriEscape($pagename);
   # Add to list of items in the <channel/>
   $itemID = $FullUrl . &ScriptLinkChar()
             . &GetOldPageParameters('browse', $pagenameEsc, $revision);
@@ -948,7 +954,7 @@
   $host = &QuoteHtml($host);
   if ($userName) {
     $author = &QuoteHtml($userName);
-    $authorLink = 'link="' . $QuotedFullUrl . &ScriptLinkChar() . $author . '"';
+    $authorLink = 'link="' . $QuotedFullUrl . &ScriptLinkChar() . UriEscape($author) . '"';
   } else {
     $author = $host;
   }
@@ -959,7 +965,7 @@
   $year += 1900;
   $date = sprintf("%4d-%02d-%02dT%02d:%02d:%02d+%02d:00",
     $year, $mon+1, $mday, $hour, $min, $sec, $TimeZoneOffset/(60*60));
-  $pagename = &QuoteHtml($pagename);
+  ( $pagename = &QuoteHtml($pagename) ) =~ tr/_/ /;
   # Write it out longhand
   $item = <<RSS ;
     <item rdf:about="$itemID">
@@ -983,7 +989,7 @@
 }

 sub DoRss {
-  print "Content-type: text/xml\n\n";
+  print 'Content-type: text/xml', $HttpCharset ? "; charset=$HttpCharset" : '', "\n\n";
   &DoRc(0);
 }

Thanks, I've added most of it for 1.0.5. There's already a function UriEscape which does quite a similar thing. The difference is that yours uses white-listing, while the current one uses black-listing. I need to check whether you have all the characters, and I tend to prefer yours. -- MarkusLude

At least '=' is missing from the list in your UriEscape function. -- MarkusLude
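
To make the point about '=' concrete, here is a minimal sketch of how the character class from the patch treats a few punctuation characters. The helper name uri_escape_bytes and the list of test characters are assumptions made for illustration; only the regular expression is taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical copy of the patch's escaping rule, for illustration only.
sub uri_escape_bytes {
    my $bytes = shift;
    $bytes =~ s/([^A-Za-z0-9_.!~*'()-])/sprintf '%%%02X', ord $1/eg;
    return $bytes;
}

# Characters listed in the class pass through; everything else is escaped.
for my $ch ('=', '&', '/', '?', ' ', '_', '-') {
    printf "%-5s -> %s\n", "'$ch'", uri_escape_bytes($ch);
}
```

With this rule '=', '&', '/', '?' and the space become %3D, %26, %2F, %3F and %20, while '_' and '-' are left alone. Whether escaping '=' is wanted presumably depends on whether the function is applied to a single page name or to a string that already contains key=value separators; the discussion above does not say which.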