Unicode Support: Using Non-English Characters in Filenames, Comments and Passwords
Applies To: All

Introduction

Using Unicode in Filenames and File Comments (Full Version Only)

Introduction

Usage

Call the CZipArchive::SetUnicodeMode() method and pass CZipArchive::umExtra or CZipArchive::umString as the parameter. You can also pass a combination of these two values.

Preserving the Compatibility

The ZipArchive Library correctly decompresses archives created under different systems without additional settings.

Setting Unicode Password and Archive Comment (Windows Only)

Setting Locale in STL Applications

If your locale is different from English and you wish to use non-English characters in archives, you need to set your locale globally; the setlocale() function is not sufficient in this case. Remember to restore the global locale to the previous value (returned by std::locale::global()) after processing, because the change may affect other parts of your application.
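
One way to make sure the previous global locale is always restored is a small scope guard. The sketch below is only an illustration under assumptions: GlobalLocaleGuard and processWithLocale are hypothetical names, not part of the ZipArchive Library, and the archive processing itself is left as a comment.

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper (not part of the ZipArchive Library):
// installs a global locale and restores the previous one on scope exit.
class GlobalLocaleGuard {
public:
    explicit GlobalLocaleGuard(const std::locale& loc)
        : m_previous(std::locale::global(loc)) {}   // global() returns the previous locale
    ~GlobalLocaleGuard() { std::locale::global(m_previous); }

    GlobalLocaleGuard(const GlobalLocaleGuard&) = delete;
    GlobalLocaleGuard& operator=(const GlobalLocaleGuard&) = delete;

private:
    std::locale m_previous;
};

// Hypothetical usage: run the archive processing under the given locale.
void processWithLocale(const std::locale& loc) {
    const std::locale before;            // copy of the current global locale
    {
        GlobalLocaleGuard guard(loc);
        // ... open, modify and close archives here ...
    }
    assert(std::locale() == before);     // the global locale has been restored
}
```

A caller might pass std::locale("") to select the user's preferred locale; note that this constructor can throw std::runtime_error if the environment names a locale that is not installed.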

Additional Considerations (Windows Only)

Unicode Normalization

When you decompress archives that store filenames in a Unicode normalization form other than form C (used by Windows), you should define _ZIP_UNICODE_NORMALIZE in the _features.h file, because some software under Windows may be unable to open files with filenames in a different form. With this definition, the library converts any other normalization form to form C. This is the case, for example, when extracting archives created under OS X, which uses form D.
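
To see why the normalization form matters, the following sketch compares the same visible character in form C and in form D. It is only an illustration: the function names are hypothetical, and the code does not perform any normalization itself.

```cpp
#include <cassert>
#include <string>

// "e with acute accent" in form C: a single precomposed code point, U+00E9.
inline std::u32string eAcuteFormC() { return U"\u00E9"; }

// The same character in form D: the letter "e" followed by
// U+0301 (combining acute accent).
inline std::u32string eAcuteFormD() { return U"e\u0301"; }

// Shows that the two forms, although they render identically,
// are different sequences of code points.
void compareForms() {
    const std::u32string c = eAcuteFormC();
    const std::u32string d = eAcuteFormD();
    assert(c.size() == 1);
    assert(d.size() == 2);
    assert(c != d);   // a plain comparison treats them as different filenames
}
```

Software that compares or looks up filenames code point by code point will therefore not match a form D name against its form C equivalent unless one of them is converted first.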

Safe Windows API

The Unicode version of the library uses the Windows API WideCharToMultiByte and MultiByteToWideChar functions to convert between the ANSI code page and the OEM code page. One conversion takes four function calls. The alternative is to use the CharToOemBuffA and OemToCharBuffA functions, which need only one function call per conversion. However, these functions are considered unsafe and are banned by Microsoft. If you prefer the faster solution with the unsafe functions, comment out the _ZIP_SAFE_WINDOWS_API definition in the ZipPlatform_win.cpp file.

Custom Unicode Handling (Windows Only)

Storing Unicode Filenames in a Zip Archive

You may control the way the ZipArchive Library stores filenames in archives by adjusting the first parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method.

Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// by default the current OEM code page is used, change it to UTF-8
zip.SetStringStoreSettings(CP_UTF8);
// use some non-English characters
zip.AddNewFile(emptyFile, _T("\u0391\u03A9"));
// set the code page using its identifier
zip.SetStringStoreSettings(1250);
zip.AddNewFile(emptyFile, _T("\u010D\u011B"));
// restore the OEM code page
zip.SetStringStoreSettings(CP_OEMCP);
zip.AddNewFile(emptyFile, _T("English characters only"));
zip.Close();
// extract one file now
zip.Open(zipFileName);
zip.ExtractFile(1, _T("C:\\Temp"));
zip.Close();

Preserving Compatibility with the Standard Zip Format

It is assumed that under Windows filenames are stored using the current system OEM code page (CP_OEMCP). Hence external software will not be able to properly decode filenames that are stored using a different code page. For this reason, the ZipArchive Library allows storing filenames encoded with a custom code page in extra fields. The filenames in the standard location (the central directory and local headers) are then encoded using the OEM code page. This way, external software will see typically encoded filenames and the ZipArchive Library will know the original filenames during extraction.

Note that this method requires additional space to store the filename in an extra field.

To store filenames in extra fields, set the second parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method to true.

Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// set the code page and request storing the filename
// encoded using this code page in the extra field
zip.SetStringStoreSettings(1250, true);
// use some non-English characters
zip.AddNewFile(emptyFile, _T("\u0104\u0118"));
zip.Close();

Choosing a Code Page for Comments in a Zip Archive

You can specify a different code page for file comments, e.g. by modifying the object returned by the CZipArchive::GetStringStoreSettings() method call.

Archive Global Comment Encoding and Decoding

The comment code page setting does not affect the global comment. To use a different code page in this case, use the CZipArchive::SetGlobalComment() method.

Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
zip.AddNewFile(emptyFile, _T("empty file"));
// set a specific code page for comments
zip.GetStringStoreSettings().m_uCommentCodePage = CP_UTF8;
// use some non-English characters
LPCTSTR comment = _T("\u0104\u0118");
zip[0]->SetComment(comment);
// the comment code page setting affects the global comment encoding as well
zip.SetGlobalComment(comment);
zip.Close();
// extract the comments
zip.Open(zipFileName);
CZipFileHeader* info = zip.GetFileInfo(0);
// the file comment, the comment code page is read from the stored settings
CZipString result = info->GetComment();
// adjust the settings to properly decode the global comment
zip.SetStringStoreSettings(info->GetStringStoreSettings());
result = zip.GetGlobalComment();

ZipArchive Library Extra Field Format

The ZipArchive Library stores code page information and, if requested, the encoded filename in extra fields in the central directory. The overall format of the ZipArchive extra field is as follows:

Sub-field    Size in bytes                Value
Header ID    2                            0x5A4C
Data Size    2
Data         as specified by Data Size

The format of the Data field is as follows (not all sub-fields may be present):

Sub-field             Size in bytes    Values
Version               1                0x01
Flag                  1                1, 3, 4
Filename Code Page    4
Encoded Filename      variable
Comment Code Page     4

The Flag field values have the following meaning:

Bits Set    Value    Meaning
0           1        the Filename Code Page field is present
0 and 1     3        the Encoded Filename field is present (and the Filename Code Page field must be present too)
2           4        the Comment Code Page field is present
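
The meaning of the Flag bits can be sketched in code as follows. This is an illustration based only on the table above, not the library's own implementation: the names ExtraFieldFlags, decodeFlag and checkDocumentedValues are hypothetical, and rejecting flags with bits other than 0, 1 and 2 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Flag bits as listed in the table above.
constexpr std::uint8_t kHasFilenameCodePage = 0x01;  // bit 0
constexpr std::uint8_t kHasEncodedFilename  = 0x02;  // bit 1
constexpr std::uint8_t kHasCommentCodePage  = 0x04;  // bit 2

// Hypothetical helper type describing which sub-fields are present.
struct ExtraFieldFlags {
    bool valid;
    bool filenameCodePage;
    bool encodedFilename;
    bool commentCodePage;
};

ExtraFieldFlags decodeFlag(std::uint8_t flag) {
    ExtraFieldFlags f{};
    f.filenameCodePage = (flag & kHasFilenameCodePage) != 0;
    f.encodedFilename  = (flag & kHasEncodedFilename) != 0;
    f.commentCodePage  = (flag & kHasCommentCodePage) != 0;
    // The Encoded Filename field requires the Filename Code Page field.
    // Assumption: bits other than 0, 1 and 2 are treated as invalid.
    f.valid = (flag & ~0x07) == 0 &&
              (!f.encodedFilename || f.filenameCodePage);
    return f;
}

// Checks the three values listed in the table.
void checkDocumentedValues() {
    assert(decodeFlag(1).filenameCodePage && !decodeFlag(1).encodedFilename);
    assert(decodeFlag(3).filenameCodePage && decodeFlag(3).encodedFilename);
    assert(decodeFlag(4).commentCodePage && !decodeFlag(4).filenameCodePage);
}
```

Under this reading, a Flag value of 2 (bit 1 without bit 0) is reported as invalid, because an encoded filename cannot be decoded without its code page.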


Article ID: 0610051525
Copyright © 2000 - 2022 Artpol Software - Tadeusz Dracz