Unicode Support: Using Non-English Characters in Filenames, Comments and Passwords
Introduction
- You should use the Unicode features of the ZipArchive Library when the filenames,
comments or passwords in the archives you use contain non-ASCII characters.
- Without Unicode support, the strings in archives are stored under Windows using
the following code pages:
- filenames - current system OEM code page (CP_OEMCP),
- comments, passwords - current system ANSI code page (CP_ACP).
Under other platforms, all strings are stored using the current system's code page.
- To use the Unicode functionality under Windows, you should compile the library and
your application for Unicode. Under systems that use Unicode UTF-8 as the default
code page (like Linux and OS X), there are no special considerations needed.
On other systems, the Unicode support is not available.
- When calling the CZipFileHeader::SetFileName() method,
the current Unicode mode will be applied to the file being renamed. This will also
affect the Unicode mode used for the file's comment.
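As an illustration of the code page dependence described in the list above, the sketch below shows how a single character can be garbled when a byte written with one code page is read with another. It assumes code pages 1252 (ANSI) and 437 (OEM), a common pairing on Western European Windows systems. decodeCp437 is a hypothetical helper that covers only the two bytes needed here; it is not part of the ZipArchive Library.

```cpp
#include <cassert>

// Byte used for U+00E9 (e with acute accent) in two Windows code pages.
constexpr unsigned char kEAcuteCp1252 = 0xE9; // ANSI code page 1252
constexpr unsigned char kEAcuteCp437  = 0x82; // OEM code page 437

// Hypothetical helper: the Unicode code point that a reader assuming code
// page 437 sees for the two bytes covered here (0 for any other byte).
constexpr char32_t decodeCp437(unsigned char byte) {
    return byte == 0x82 ? U'\u00E9'   // e with acute accent
         : byte == 0xE9 ? U'\u0398'   // Greek capital theta
         : U'\0';
}
```

A byte written as 0xE9 under code page 1252 is therefore shown as a Greek theta by a reader that assumes code page 437, which is the kind of mismatch the Unicode modes are meant to avoid.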
Using Unicode in Filenames and File Comments (Full Version Only)
Introduction
- This feature is compatible with WinZip Unicode support and allows creating cross-platform
Unicode archives that are extractable by utilities provided with the system under
Linux and OS X.
- To use this functionality, make sure _ZIP_UNICODE is
defined in the _features.h file. Rebuild the ZipArchive
Library and your application, if you modify this definition.
Usage
Call the CZipArchive::SetUnicodeMode() method and pass CZipArchive::umExtra or
CZipArchive::umString as the parameter. You can also pass a combination of these
two values.
- CZipArchive::umExtra stores Unicode information in extra headers. The extra headers
are used for a filename or comment only when the string contains non-ASCII characters.
This value is used by default under Windows.
- CZipArchive::umString stores the filename and comment directly in Unicode and sets
a special flag in the file header inside the archive. Some utilities under Windows
may display invalid strings in this case. This value is used by default under Linux/OS X.
- To determine what Unicode mode is used by a file, use the CZipFileHeader::GetState() method.
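The condition under which CZipArchive::umExtra adds an extra header (the string contains non-ASCII characters) can be pictured with a small check. This is only a sketch of the idea and does not show the library's implementation. isPureAscii is a hypothetical helper, and it assumes the string is held as UTF-8 in a std::string.

```cpp
#include <cassert>
#include <string>

// Returns true when every byte of a UTF-8 encoded string is in the ASCII
// range (0x00-0x7F). Hypothetical helper, not part of the ZipArchive Library.
bool isPureAscii(const std::string& utf8) {
    for (unsigned char c : utf8) {
        if (c > 0x7F) {
            return false; // first byte of a multi-byte sequence found
        }
    }
    return true;
}
```

Under this assumption, a name such as "readme.txt" would need no extra header, while a name containing an accented letter would.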
Preserving the Compatibility
The ZipArchive Library correctly decompresses archives created under different systems
without additional settings.
- If you need an archive created under Windows to be extracted correctly by Linux
utilities, set the archive compatibility to ZipCompatibility::zcUnix with the
CZipArchive::SetSystemCompatibility() method. To make the archive also readable by
Windows utilities, additionally set one of the Unicode modes. Not all Windows
utilities support the Unicode modes.
- If you need an archive created under Windows to be extracted correctly by Mac OS
X utilities, set the Unicode mode to CZipArchive::umString or follow the same steps
as for the Linux platform.
- If you need an archive created under Linux/OS X to be extracted correctly by
WinZip under Windows, there is no need to change anything, because the
CZipArchive::umString mode is set by default. You may need to set CZipArchive::umExtra
for other Windows utilities that do not support the CZipArchive::umString mode.
Setting Unicode Password and Archive Comment (Windows Only)
- You can set the code page to be used when setting a password with the
CZipArchive::SetPassword() method.
- You can set the code page to be used when setting an archive global comment with
the CZipArchive::SetGlobalComment() method.
- If your password or comment contains non-ASCII characters and you intend to compress
files under Windows and extract them under Linux/OS X or vice versa, set the
code page to CP_UTF8.
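The reason UTF-8 suits cross-platform passwords and comments is that it gives the same bytes for a character regardless of the system code page. The sketch below encodes a single code point by hand to make those bytes visible. encodeUtf8 is a hypothetical helper written for this illustration (it does not validate surrogates or out-of-range values) and is not how the library performs the conversion.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point (assumed to be a valid scalar value up to
// U+10FFFF) as a UTF-8 byte sequence.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0104 always becomes the two bytes 0xC4 0x84, whichever platform produces or consumes them.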
Setting Locale in STL Applications
If your locale is different from English and you wish to use non-English characters
in archives, you need to set your locale globally; the setlocale() function
is not sufficient in this case.
- To set the global locale to be the same as your system locale use the function:
std::locale::global(std::locale(""));
- To set the global locale to a particular value, call the function like this, for example:
std::locale::global(std::locale("German"));
- When you use Unicode, do not use the _T() macro in the
above calls.
- Remember to put #include <locale> in your code.
Remember to restore the global locale to the previous value (returned by
std::locale::global) after processing, because the change may affect other parts
of your application.
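One way to make sure the previous locale is restored is a small scope guard. The class below is a sketch under that assumption; GlobalLocaleGuard is a hypothetical name and is not part of the ZipArchive Library. It relies only on the fact that std::locale::global returns the locale that was in effect before the call.

```cpp
#include <cassert>
#include <locale>

// Sets the global locale on construction and restores the previous one on
// destruction. Hypothetical helper for illustration only.
class GlobalLocaleGuard {
public:
    explicit GlobalLocaleGuard(const std::locale& newLocale)
        : m_previous(std::locale::global(newLocale)) {} // keeps the old locale
    ~GlobalLocaleGuard() { std::locale::global(m_previous); }

    GlobalLocaleGuard(const GlobalLocaleGuard&) = delete;
    GlobalLocaleGuard& operator=(const GlobalLocaleGuard&) = delete;

private:
    std::locale m_previous;
};
```

A guard created as GlobalLocaleGuard guard(std::locale("")); at the start of a block that processes archives would switch to the system locale for that block and put the earlier locale back when the block ends, including when an exception is thrown.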
Additional Considerations (Windows Only)
Unicode Normalization
When you decompress archives that store filenames using a Unicode normalization form
other than form C (used by Windows), you should define _ZIP_UNICODE_NORMALIZE in the
_features.h file, because some software under Windows may be unable to open files
with filenames in a different form. This converts any other normalization form to
form C. This is the case, for example, when extracting archives created under OS X
(it uses form D).
- Under Windows Vista and later, you need to use the Windows SDK appropriate for your
system and make sure that you compile for that platform (WINVER
should be defined to be at least 0x600).
- Under Windows XP and Windows Server 2003, you need to download Microsoft Internationalized
Domain Name (IDN) Mitigation APIs to use this functionality.
- Under Windows 95/98/Me this functionality is unsupported.
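To show why the normalization form matters, the sketch below writes the same visible filename in form C and in form D as UTF-8 bytes. It does not perform normalization and does not use the library. The constant names and sameBytes are hypothetical and exist only to show that the two forms are different byte sequences.

```cpp
#include <cassert>
#include <string>

// The same visible filename as UTF-8 bytes in two normalization forms.
// Form C: one precomposed code point U+00E9.
const std::string kNameFormC = "\xC3\xA9.txt";
// Form D: base letter U+0065 followed by the combining acute accent U+0301.
const std::string kNameFormD = "e\xCC\x81.txt";

// A plain byte comparison treats the two spellings as different names.
bool sameBytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Software that compares or looks up names byte by byte will not match the form D spelling against the form C one, which is why converting to form C on extraction can help under Windows.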
Safe Windows API
The Unicode version of the library uses the Windows API WideCharToMultiByte and
MultiByteToWideChar functions to perform conversions from the ANSI code page to the
OEM code page and vice versa. It takes four function calls to perform one conversion.
The alternative is to use the CharToOemBuffA and OemToCharBuffA functions, which take
only one function call per conversion. However, these functions are considered unsafe
and are banned by Microsoft. If you prefer the fast solution with unsafe functions,
comment out the _ZIP_SAFE_WINDOWS_API definition in the ZipPlatform_win.cpp file.
Custom Unicode Handling (Windows Only)
- This functionality is specific to the ZipArchive Library and external software will
not be able to benefit from it.
- To use this functionality, make sure _ZIP_UNICODE_CUSTOM
is defined in the _features.h file. Rebuild the ZipArchive
Library and your application, if you modify this definition. You also need to set
the Unicode mode with the CZipArchive::SetUnicodeMode()
method to the CZipArchive::umCustom value.
- The ZipArchive Library will save the code pages used during compression and automatically
use them during extraction. The code pages are saved in zip extra fields. See below for more information.
- Setting string store settings with one of the API
calls does not affect existing files and comments.
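- If you open an existing archive with intent to add new files to it and you want
the new files to use the same string store settings as the existing files, then
retrieve the settings of an existing file with CZipFileHeader::GetStringStoreSettings()
and pass them to CZipArchive::SetStringStoreSettings().
Otherwise the library will use the default settings for the current system (ZipPlatform::GetSystemID()).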
- If you want to open an archive created with a previous version of the ZipArchive
Library, or with any program that uses filename or comment code pages other than
the standard ones, set the code pages before opening the archive. The library will
use them while decoding filenames and comments. The settings will be ignored if the
archive contains extra fields with code pages created by the ZipArchive Library.
In this case, code pages from the extra fields will be used instead. Note that
these settings will also be used during compression in the same archive (unless
changed).
- When you close an archive, the string store settings are reset to their default values
for the current system, just like with the CZipStringStoreSettings::Reset()
method call. This way, if you open the next archive using the same CZipArchive
object, its string store settings are not affected by the previous archive's settings.
Storing Unicode Filenames in a Zip Archive
You can control how the ZipArchive Library stores filenames in archives by adjusting
the first parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method.
- If you plan for the archive to be extracted under Linux/OS X, set this parameter
to the identifier of the code page used by the system under which you want to
extract the archive. You may try setting it to CP_ACP; the current system ANSI code
page will then be used. This works correctly if the target platform uses the same
code page as your system.
- If you use e.g. Japanese or Korean characters, you may set this parameter to
CP_UTF8. Unicode UTF-8 will be used.
- You can set the code page directly using its identifier. Be sure it is installed
on your system and on the system you plan to extract the archive on.
- To restore the OEM encoding under Windows, set this parameter back to CP_OEMCP.
Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// store the next filename using UTF-8
zip.SetStringStoreSettings(CP_UTF8);
zip.AddNewFile(emptyFile, _T("\u0391\u03A9"));
// store the next filename using code page 1250
zip.SetStringStoreSettings(1250);
zip.AddNewFile(emptyFile, _T("\u010D\u011B"));
// restore the OEM encoding
zip.SetStringStoreSettings(CP_OEMCP);
zip.AddNewFile(emptyFile, _T("English characters only"));
zip.Close();
// reopen the archive and extract the second file (index 1)
zip.Open(zipFileName);
zip.ExtractFile(1, _T("C:\\Temp"));
zip.Close();
Preserving Compatibility with the Standard Zip Format
It is assumed that under Windows filenames are stored using the current system OEM
code page (CP_OEMCP). Hence external software will not be able to properly
decode filenames if they are stored using a different code page. For this reason,
the ZipArchive Library allows storing filenames encoded with a custom code page
in extra fields. The filenames in the standard location (the central directory and
local headers) are encoded using the OEM code page. This way, external software will
see conventionally encoded filenames and the ZipArchive Library will know the original
filenames during extraction.
Note that this method takes the additional space needed for storing a filename
in an extra field.
To store filenames in extra fields, set the second parameter of the
CZipArchive::SetStringStoreSettings(UINT, bool) method to true.
Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// store the filename encoded with code page 1250 in an extra field
zip.SetStringStoreSettings(1250, true);
zip.AddNewFile(emptyFile, _T("\u0104\u0118"));
zip.Close();
You can specify a different code page for file comments, e.g. by modifying the object
returned by the CZipArchive::GetStringStoreSettings() method call.
The comment code page setting does not affect the global comment. Use
CZipArchive::SetGlobalComment() to use a different code page in this case.
Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
zip.AddNewFile(emptyFile, _T("empty file"));
// use UTF-8 for file comments
zip.GetStringStoreSettings().m_uCommentCodePage = CP_UTF8;
LPCTSTR comment = _T("\u0104\u0118");
zip[0]->SetComment(comment);
zip.SetGlobalComment(comment);
zip.Close();
// reopen the archive and read both comments back
zip.Open(zipFileName);
CZipFileHeader* info = zip.GetFileInfo(0);
CZipString result = info->GetComment();
// apply the file's string store settings to the archive,
// then read the global comment
zip.SetStringStoreSettings(info->GetStringStoreSettings());
result = zip.GetGlobalComment();
zip.Close();
The ZipArchive Library stores code page information and, if requested, the encoded
filename in extra fields in the central directory. The global format of the ZipArchive
extra field is as follows:
Field | Size (bytes) | Value
Header ID | 2 | 0x5A4C
Data Size | 2 |
Data | as specified by Data Size |
The format of the Data field is as follows (not all sub-fields may be present):
Sub-field | Size (bytes) | Value
Version | 1 | 0x01
Flag | 1 | 1, 3, 4
Filename Code Page | 4 |
Encoded Filename | variable |
Comment Code Page | 4 |
The Flag field values have the following meaning:
Bits | Value | Meaning
0 | 1 | the Filename Code Page field is present
0 and 1 | 3 | the Encoded Filename field is present (and the Filename Code Page field must be present too)
2 | 4 | the Comment Code Page field is present
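The Flag values listed above can be read as a bit mask. The sketch below decodes a Flag byte into the sub-fields it announces. It is an illustration, not the library's own parser. ExtraFieldFlags and decodeFlag are hypothetical names, and the sketch assumes that the listed values can be combined (for example 1 + 4 = 5), which the source does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Which optional sub-fields of the Data field the Flag byte announces.
struct ExtraFieldFlags {
    bool hasFilenameCodePage; // bit 0
    bool hasEncodedFilename;  // bit 1 (requires bit 0 as well)
    bool hasCommentCodePage;  // bit 2
    bool valid;               // false when bit 1 is set without bit 0
};

// Hypothetical decoder for the Flag byte; combining bits is an assumption.
ExtraFieldFlags decodeFlag(std::uint8_t flag) {
    ExtraFieldFlags f;
    f.hasFilenameCodePage = (flag & 0x01) != 0;
    f.hasEncodedFilename  = (flag & 0x02) != 0;
    f.hasCommentCodePage  = (flag & 0x04) != 0;
    // The Encoded Filename field needs the Filename Code Page field too.
    f.valid = !f.hasEncodedFilename || f.hasFilenameCodePage;
    return f;
}
```

With this reading, a Flag of 3 announces both the Filename Code Page and the Encoded Filename fields, and a Flag of 4 announces only the Comment Code Page field.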