Convert entire Foswiki RCS databases from one character set to UTF-8
The character set encoding determines the range of characters that can be used for naming wiki topics and attachments, and in content stored in topics.
(To understand what this means on a technical level, read Foswiki:Development.UnderstandingEncodings)
Before Foswiki 2.0, Foswiki had to be configured with a {Site}{CharSet}, which set the encoding used for characters in topic and attachment names, and topic content.
The default encoding used by Foswiki before 2.0 was iso-8859-1, which was a reasonable choice for many western languages. However, there are many other languages (for example, Arabic, Chinese, Hebrew, Hindi) whose characters do not appear in this character set. Even some basic characters like the euro symbol are missing from iso-8859-1. For this reason, Foswiki has now moved to supporting the standard UTF-8 character encoding, which is designed to support a very wide range of characters.
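As a small illustration of the point about missing characters, the following sketch uses Python's built-in codecs (not Foswiki code) to check whether a string can be represented in a given character set. The helper name `can_encode` and the sample strings are made up for this example.

```python
# Illustration only: uses Python's standard codecs, not anything from Foswiki.
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings: French, the euro symbol, Hebrew, Chinese.
samples = ["café", "€", "שלום", "中文"]

# Map each sample to (fits in iso-8859-1, fits in utf-8).
results = {s: (can_encode(s, "iso-8859-1"), can_encode(s, "utf-8")) for s in samples}
```

Only the first sample fits in iso-8859-1, while all four can be encoded as UTF-8.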
Unfortunately, once you chose a {Site}{CharSet} and created a bunch
of topics, it became very risky to change because the charset is
associated with the entire database, and not with individual topics.
It was even possible to paste content in a different encoding into the
text editor and have it stored in that encoding, resulting in what looked
like garbled topics.
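The garbling described above can be reproduced outside Foswiki. This sketch assumes a site configured for iso-8859-1 that receives text whose bytes are actually UTF-8; the variable names are illustrative only.

```python
# Illustration only: shows how bytes in one encoding look when read as another.
original = "café"

# The text is stored using UTF-8 bytes, e.g. pasted from a UTF-8 source.
stored = original.encode("utf-8")

# The site then reads those bytes back using its configured iso-8859-1 charset.
misread = stored.decode("iso-8859-1")

# Because iso-8859-1 maps every byte to a character, the damage is reversible
# as long as the bytes themselves were never altered.
recovered = misread.encode("iso-8859-1").decode("utf-8")
```

Here `misread` is the garbled form `cafÃ©`, and `recovered` is the original text again. The bytes on disk are intact; only their interpretation was wrong.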
Ideally all Foswiki sites should use UTF-8, even those still running older Foswiki versions, but we have a legacy of existing sites that don't. So we need some way to convert an RCS-based wiki from any existing character encoding to UTF-8.
And that's what this module provides. It is for a store that uses:
- a {Site}{CharSet} other than UTF-8
- RcsWrap or RcsLite as its {Store}{Implementation}
Even if you don't have an immediate need for non-western character sets this is worth doing, as Foswiki 2.0 and later work exclusively with UTF-8 content.
Note that this module converts all the histories of all your topics, as well as the latest version of the topic. It also maps all web, topic and attachment names. It does not, however, touch the content of attachments.
See also tools/bulk_copy.pl, as recommended in the release notes.
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".
If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See http://foswiki.org/Support/ManuallyInstallingExtensions for more help.
The convertor is used from the command-line on your wiki server (if you do not have access to the command line then we are sorry, but there is currently no way for you to use the conversion).
To use the convertor, cd to the tools directory in your installation and run perl convert_charset.pl -i.
perl convert_charset.pl
The script will convert the Foswiki RCS database pointed at by {DataDir} and {PubDir} from the existing character set (as set by {Site}{CharSet}) to UTF-8.
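Conceptually, converting a stored item means decoding its bytes with the source character set and re-encoding them as UTF-8. The sketch below is a simplified illustration in Python, not the script's implementation: it ignores RCS histories and file handling, and the names `to_utf8`, `plan` and the sample data are hypothetical. The `plan` helper loosely mirrors the idea of the -i option by reporting what would change without writing anything.

```python
# Illustration only: a simplified model of charset conversion, not the
# logic of convert_charset.pl.
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode `raw` strictly as `source_charset` and re-encode it as UTF-8."""
    return raw.decode(source_charset, errors="strict").encode("utf-8")


def plan(items: dict, source_charset: str) -> list:
    """Return the names of items whose bytes would change; nothing is written."""
    return [
        name
        for name, raw in items.items()
        if to_utf8(raw, source_charset) != raw
    ]


# Hypothetical store contents: topic name -> raw bytes as stored on disk.
store = {
    "PlainTopic": b"plain ASCII text",   # identical in iso-8859-1 and UTF-8
    "AccentTopic": b"caf\xe9",           # 0xE9 is e-acute in iso-8859-1
}

would_change = plan(store, "iso-8859-1")
converted = {name: to_utf8(raw, "iso-8859-1") for name, raw in store.items()}
```

Pure ASCII content is byte-for-byte identical in both encodings, so only the topic containing the accented character would be rewritten.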
Options:

-i | info - report what would be done only, do not convert anything |
-q | quiet - work silently (unless there's an error) |
-a | abort - on error (default is to report and continue) |
-r | repair - detect the encoding of each string and repair inconsistencies. |
Expert options | |
-web=webname | Restrict conversion to a single web and its subwebs. |
-encoding=charset | Override the source encoding. (Required if running the conversion on Foswiki 2.0.) |
Use -r if your site may contain content which cannot be decoded using the {Site}{CharSet} (if this is the case, -i will abort with an error).
If the -r option is given, then any number of additional repair options can follow. These are of two types:
- detected-encoding=actual-encoding
- topic-path=actual-encoding
The first overrides a detected-encoding, while the second allows you to select an individual topic and override the encoding of the content of just that topic. If you need to override the encoding of a web or topic name, use :N after the topic-path, e.g. Sandbox/NorthKorea:N=EUC-KR
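To make the idea of detecting an encoding and then overriding the detected result more concrete, here is a rough sketch in Python. It uses one common heuristic (try ASCII, then strict UTF-8, then fall back to the site charset), which is an assumption for illustration and not necessarily how the -r option detects encodings. The function names and the `overrides` mapping are hypothetical; the mapping plays the role of a detected-encoding=actual-encoding option.

```python
# Illustration only: a simple detection heuristic, not the script's own.
def detect_encoding(raw: bytes, fallback: str) -> str:
    """Guess the encoding of `raw`: ascii, then strict utf-8, else `fallback`."""
    for candidate in ("ascii", "utf-8"):
        try:
            raw.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    return fallback


def repair(raw: bytes, fallback: str, overrides=None) -> bytes:
    """Re-encode `raw` as UTF-8, applying detected -> actual overrides."""
    detected = detect_encoding(raw, fallback)
    # An override replaces the detected encoding with the one the
    # administrator knows to be correct.
    actual = (overrides or {}).get(detected, detected)
    return raw.decode(actual).encode("utf-8")


# Bytes that are already valid UTF-8 are left unchanged.
already_utf8 = repair(b"caf\xc3\xa9", "iso-8859-1")

# Bytes that are not valid UTF-8 fall back to the site charset.
from_latin1 = repair(b"caf\xe9", "iso-8859-1")

# With an override, content detected as iso-8859-1 is treated as cp1252,
# where byte 0x80 is the euro symbol.
from_cp1252 = repair(b"\x80", "iso-8859-1", {"iso-8859-1": "cp1252"})
```

The override matters because some byte sequences decode without error in more than one encoding, so automatic detection can only ever be a best guess.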
Although this extension is intended for use on Foswiki 1.1, there may be cases
where an individual web requires conversion on a Foswiki 2.0 system, such as
a single web migrated at a later date from an older system. Use extreme
caution converting individual webs: Foswiki does not support mixed encoding.
For example, to convert the Oops web from iso-8859-1 on a system already
converted to utf-8:
perl convert_charset.pl -web=Oops -encoding=iso-8859-1 -i
Once you have run the script without -i, set {Site}{CharSet} to 'utf-8'.
Change History: | |
1.4 (15 Sep 2015) | Foswikitask:Item13702 - Actually use the encoding detected by -r repair option. |
1.3 (15 Jul 2015) | Foswikitask:Item13523 - Better job of detecting Foswiki 2.0. |
1.2 (1 Jun 2015) | Foswikitask:Item13442 - Add repair option to detect exceptions to the encoding. Add limited support for Foswiki 2.0. Add more flexible overrides for detected encoding. |
1.1 (11 Jun 2014) |
Name | Version | Description |
---|---|---|
File::Copy | >0 | Required |
Foswiki | >=1.1 | Required |
Author | Foswiki:Main.CrawfordCurrie |
Version | 1.4 |
Release | 15 Sep 2015 |
Repository | https://github.com/foswiki/distro |
Copyright | © 2011-2015, Foswiki:Main.CrawfordCurrie |
License | GPL (GNU General Public License) |
Home | http://foswiki.org/Extensions/CommentPlugin |
Support | http://foswiki.org/Support/CommentPlugin |