Introduction to Uniscribe

Design & Implementation of a Win32 Text Editor

Uniscribe is a low-level Win32 API that provides a high degree of control over the processing and display of Unicode text. The API is designed to provide a generic interface to all forms of Unicode text (complex or otherwise), and transparently handles properties such as bi-directional text and combining character sequences.

Uniscribe is implemented as a single DLL, USP10.DLL, which contains all of the Uniscribe APIs. This DLL is present on Windows 2000 and above, and on any computer with Internet Explorer 5.0 (or greater) installed. Two Platform SDK files (USP10.H and USP10.LIB) are provided by Microsoft to allow an application to make use of this complex-script support. An important point about Uniscribe is that it doesn't just handle complex scripts - it can process and display all Unicode text, so it can be used as a direct replacement for existing text-output routines such as DrawText and TextOut.

The Uniscribe API is divided into two categories - the low-level API itself, and a wrapper library called ScriptString which hides much of the complexity of dealing with Uniscribe directly. The purpose of this tutorial is to give a brief introduction to the world of Uniscribe before we start delving in properly.

Uniscribe in Windows

When I first started Neatpad I was unfamiliar with exactly what Uniscribe entailed, and it was only after researching Unicode that I fully appreciated the issues surrounding the display of Unicode text. Although Uniscribe occupies its own section within the MSDN documentation, other than the occasional reference it is very easy to miss unless you already know of its existence.

MSDN states that since Windows 2000, the ExtTextOut function (and others like TextOut, DrawText etc) have been extended to support complex scripts. Although this is true, it gives the impression that an application can call ExtTextOutW (the Unicode version) at any time with a buffer of UTF-16 text and it will always display correctly.

Unless Windows has been configured to do so, functions such as ExtTextOut do not automatically support complex scripts. The image above shows the "Regional and Language Options" dialog. The two settings which have been highlighted are not normally enabled by a default installation of American/English Windows.

Enabling complex-script support installs a number of extra libraries, after which ExtTextOut will use Uniscribe when necessary to display complex scripts.

BOOL ExtTextOut (
  HDC       hdc,          // handle to DC
  int       X,            // x-coordinate of reference point
  int       Y,            // y-coordinate of reference point
  UINT      fuOptions,    // text-output options (ETO_GLYPH_INDEX etc)
  RECT    * lprc,         // optional dimensions
  LPCTSTR   lpString,     // string
  UINT      cbCount,      // number of characters in string
  INT     * lpDx          // array of spacing values
);

ExtTextOut is most commonly used to display a string of text. However, it can do much more than this. When the ETO_GLYPH_INDEX option is specified (optionally combined with ETO_PDY, which allows both x and y offsets to be supplied for each glyph), ExtTextOut displays a buffer of glyph indices instead of characters. This feature of ExtTextOut is used when displaying a string containing complex scripts, as the diagram below illustrates.

Text drawing in Windows 2000 and above

For any string containing complex scripts, ExtTextOut makes use of Uniscribe to display it. Uniscribe breaks the string down into groups of glyphs and then re-calls ExtTextOut, this time with the ETO_GLYPH_INDEX option and a buffer of glyph indices instead of the original character values. For regular Unicode text which doesn't require any special processing, ExtTextOut behaves exactly the same as it did under previous Windows versions.

You may be wondering why Uniscribe is necessary if routines such as DrawText and TextOut can for the most part render complex scripts quite successfully. For applications which just output single strings of text, Uniscribe is not necessary.

It is only when a string must be broken up (for the purposes of styling/formatting) that Uniscribe is required. It is just not possible to split a Unicode string into sections and draw each one separately (as we have been doing with Neatpad up until now). Doing so breaks all kinds of things such as contextual shaping behaviours and bidirectional support. A modern text-editor simply must support Unicode and all the various scripts that come along with that - we have no other choice than to move to Uniscribe.
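To make the splitting problem concrete, here is a small portable C sketch. It is not part of Neatpad or Uniscribe, and is_safe_split is a hypothetical helper invented for illustration. It checks only two of the simplest ways an arbitrary split can go wrong: cutting a UTF-16 surrogate pair in half, and detaching a combining mark from its base character. As a loud assumption, it recognises only the Combining Diacritical Marks block (U+0300 to U+036F); real cluster boundaries, contextual shaping and bidirectional reordering can only be determined by Uniscribe itself.

```c
#include <stddef.h>

typedef unsigned short utf16_t;   /* one UTF-16 code unit (WCHAR on Windows) */

static int is_high_surrogate(utf16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static int is_low_surrogate(utf16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

/* Assumption: only U+0300..U+036F is treated as "combining" in this sketch. */
static int is_combining_mark(utf16_t c) { return c >= 0x0300 && c <= 0x036F; }

/*
 * Returns 1 if text[0..pos) and text[pos..len) could be handled as two
 * separate pieces without splitting a surrogate pair or detaching a
 * combining mark from its base character; returns 0 otherwise.
 */
int is_safe_split(const utf16_t *text, size_t len, size_t pos)
{
    if (pos == 0 || pos >= len)
        return 1;                     /* splitting at either end is harmless */

    if (is_high_surrogate(text[pos - 1]) && is_low_surrogate(text[pos]))
        return 0;                     /* would cut a surrogate pair in half */

    if (is_combining_mark(text[pos]))
        return 0;                     /* mark would lose its base character */

    return 1;
}
```

Even when this check passes, the two halves may still shape differently when drawn separately (Arabic joining forms are the classic case), which is why the string has to be analysed as a whole.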

The ScriptString API

The ScriptString API is designed for applications which want to display text in a single font and colour. Notepad (and the standard Windows EDIT control) are prime examples of where the ScriptString API is used. One of the nice features of this API is that it allows you to display a string of text with a portion of that string optionally displayed as 'selected'. This is a very nice touch, as it saves a tremendous amount of effort.

The ScriptStringAnalyse function is the starting point with Uniscribe. It is a pretty intimidating function to look at. However, its purpose is straightforward: it performs shaping and glyph-generation for any string of Unicode text, and returns a SCRIPT_STRING_ANALYSIS structure when complete.

HRESULT WINAPI ScriptStringAnalyse (
  HDC                       hdc,
  void                    * pString,
  int                       cString,
  int                       cGlyphs,
  int                       iCharset,
  DWORD                     dwFlags,
  int                       iReqWidth,
  SCRIPT_CONTROL          * psControl,
  SCRIPT_STATE            * psState,
  int                     * piDx,
  SCRIPT_TABDEF           * pTabdef,
  BYTE                    * pbInClass,
  SCRIPT_STRING_ANALYSIS  * pssa
);

SCRIPT_STRING_ANALYSIS is an opaque structure - there is no documentation which details what it contains. This is not important though, as this structure is simply passed to the rest of the ScriptString API without requiring any further knowledge.

HRESULT WINAPI ScriptStringOut (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       iX, 
  int                       iY, 
  UINT                      uOptions, 
  RECT                    * prc, 
  int                       iMinSel, 
  int                       iMaxSel, 
  BOOL                      fDisabled 
);

ScriptStringOut is used to display a string of text that was previously analyzed. Note that a text-string is not specified with this call - only the SCRIPT_STRING_ANALYSIS structure is passed which contains all the necessary information to display the original string.

HRESULT WINAPI ScriptStringXtoCP (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       iX, 
  int                     * piCh, 
  int                     * piTrailing 
);

ScriptStringXtoCP is an interesting function. It provides a mechanism for caret and mouse positioning within a string of Unicode text.

HRESULT WINAPI ScriptStringCPtoX (
  SCRIPT_STRING_ANALYSIS    ssa, 
  int                       icp, 
  BOOL                      fTrailing, 
  int                     * pX 
);

ScriptStringCPtoX is the counterpart to ScriptStringXtoCP. It performs the opposite task - converting a string-position to a display-coordinate.
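The relationship between these two functions, and in particular the meaning of the trailing flag, can be modelled in a few lines of portable C. This is only a sketch under simplifying assumptions: left-to-right text, exactly one glyph per character, and a hypothetical array of per-character advance widths. It does not call Uniscribe, the names model_cp_to_x and model_x_to_cp are made up for this illustration, and it makes no attempt to reproduce the real functions' behaviour at the edges of the string.

```c
/* advance[i] is the pixel width of character i (hypothetical, LTR only). */

/* Model of ScriptStringCPtoX: x-coordinate of the leading edge
   (trailing == 0) or trailing edge (trailing != 0) of character cp. */
int model_cp_to_x(const int *advance, int count, int cp, int trailing)
{
    int x = 0;
    int i;

    for (i = 0; i < cp && i < count; i++)
        x += advance[i];

    if (trailing && cp < count)
        x += advance[cp];

    return x;
}

/* Model of ScriptStringXtoCP: find the character under x. *trailing is
   set to 1 when x falls in the right-hand half of that character, which
   means a caret placed by a mouse click belongs after it. */
void model_x_to_cp(const int *advance, int count, int x, int *cp, int *trailing)
{
    int left = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (x < left + advance[i]) {
            *cp = i;
            *trailing = (x - left) >= advance[i] / 2;
            return;
        }
        left += advance[i];
    }

    *cp = count;        /* x lies beyond the end of the line */
    *trailing = 0;
}
```

With this model, a mouse click maps to the caret offset cp + trailing, and feeding that offset back through model_cp_to_x gives the pixel position at which to draw the caret. The real functions do the same job but also cope with right-to-left runs and clusters, where one character does not correspond to one glyph.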

HRESULT WINAPI ScriptStringFree(
  SCRIPT_STRING_ANALYSIS  * pssa  
);

When an application has finished displaying the string, the ScriptStringFree function can be used to clean up. There are more ScriptString functions than I have listed here, but with just these five an application can implement the front-end to a fully-functional text-editor with minimal effort.
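Putting the analyse, draw and free calls together, a single line of text might be drawn roughly as follows. This is a minimal sketch rather than the code from the demo application: draw_line and recommended_glyph_count are hypothetical helper names, the flag combination (SSA_GLYPHS, SSA_FALLBACK and SSA_LINK) is just one reasonable choice, and error handling is reduced to returning the HRESULT. The glyph-buffer size follows the 1.5 * cString + 16 recommendation in the ScriptStringAnalyse documentation. The Windows-specific part is guarded so that the file still compiles elsewhere.

```c
/* Glyph buffer size recommended for ScriptStringAnalyse:
   1.5 times the character count, plus 16. */
int recommended_glyph_count(int cString)
{
    return (3 * cString) / 2 + 16;
}

#ifdef _WIN32
#include <windows.h>
#include <usp10.h>      /* link with usp10.lib and gdi32.lib */

/* Draw one line of UTF-16 text at (x, y). Characters from selStart up to
   selEnd are drawn highlighted; pass selStart >= selEnd for no selection. */
HRESULT draw_line(HDC hdc, const WCHAR *text, int len,
                  int x, int y, int selStart, int selEnd)
{
    SCRIPT_STRING_ANALYSIS ssa = NULL;
    HRESULT hr;

    if (len <= 0)
        return S_OK;    /* the string passed in must contain at least 1 char */

    hr = ScriptStringAnalyse(
            hdc, text, len,
            recommended_glyph_count(len),
            -1,                               /* -1: text is Unicode */
            SSA_GLYPHS | SSA_FALLBACK | SSA_LINK,
            0,                                /* iReqWidth: unused here */
            NULL, NULL, NULL, NULL, NULL,
            &ssa);

    if (FAILED(hr))
        return hr;

    /* Use the same device context that was used for the analysis. */
    hr = ScriptStringOut(ssa, x, y, 0, NULL, selStart, selEnd, FALSE);

    ScriptStringFree(&ssa);
    return hr;
}
#endif
```

A WM_PAINT handler could call draw_line once per visible line, passing the HDC obtained from BeginPaint.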

The image above shows a simple application I wrote which demonstrates the ScriptString API. The source-code and demo executable can be downloaded at the top of this article.

An oddity of ScriptString is this: ScriptStringOut fails if the device-context used to render is not the same as the one used when analyzing the string with ScriptStringAnalyse!

Introducing UspLib

The main problem with the ScriptString API is its inability to display text in more than one font and colour. This makes it particularly unsuitable for our purposes with Neatpad. Our only option is to make use of the low-level Uniscribe functions directly.

UspLib is a library I have written to provide a far richer capability than ScriptString can offer. This new library provides a wrapper around the low-level Uniscribe API that we will be discussing over the next couple of tutorials. UspLib is very similar in approach to the ScriptString Uniscribe wrapper, but goes a lot further in terms of text-colouring and formatting.

USPDATA * USP_Allocate();

The first API is USP_Allocate. This function returns a pointer to a USPDATA object which must be used for subsequent UspLib operations.

BOOL USP_Analyze (
  USPDATA   * uspData,
  HDC         hdc,
  WCHAR     * wstr,
  int         wlen,
  ATTR      * attrRunList,
  UINT        flags,
  USPFONT   * uspFont
);

USP_Analyze is similar to ScriptStringAnalyse. The difference is that a string of text can be re-analyzed using an existing USPDATA object.

void USP_ApplyAttributes (
  USPDATA  * uspData,
  ATTR     * attrRunList
);

Once a string has been analyzed (i.e. itemized and shaped etc.), colour-attributes can be reapplied at any time using USP_ApplyAttributes. The font-information stored in the ATTR run-list is ignored.

void USP_ApplySelection (
  USPDATA  * uspData,
  int        selStart,
  int        selEnd
);

USP_ApplySelection performs a similar task to USP_ApplyAttributes. However, this time only the selection-flags are modified in the USPDATA object; colours and fonts are left untouched.
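The idea of updating selection-flags without touching anything else can be sketched in portable C. Note that UspLib's internals have not been shown at this point, so everything here is an assumption made purely for illustration: the char_attr structure, its field names and the apply_selection function are all hypothetical and are not the real ATTR or USPDATA definitions.

```c
#include <stddef.h>

/* Hypothetical per-character attribute record (NOT UspLib's real ATTR). */
typedef struct {
    unsigned long fg;        /* foreground colour */
    unsigned long bg;        /* background colour */
    unsigned      selected;  /* 1 if the character lies inside the selection */
} char_attr;

/* Mark characters from sel_start up to (not including) sel_end as
   selected and clear the flag everywhere else. Colours are not modified. */
void apply_selection(char_attr *attrs, size_t count,
                     size_t sel_start, size_t sel_end)
{
    size_t i;

    /* Accept the range in either order, as a selection can be dragged
       backwards as well as forwards. */
    if (sel_start > sel_end) {
        size_t tmp = sel_start;
        sel_start  = sel_end;
        sel_end    = tmp;
    }

    for (i = 0; i < count; i++)
        attrs[i].selected = (i >= sel_start && i < sel_end) ? 1u : 0u;
}
```

Because only the flag changes, a selection can be updated on every mouse-move without having to itemize or shape the text again.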

int USP_TextOut (
  USPDATA  *  uspData,
  HDC         hdc,
  int         xpos,
  int         ypos,
  RECT     *  rect);

USP_TextOut is the counterpart to ScriptStringOut. It takes as input the USPDATA object which was previously analyzed, and draws it to the specified location. Any fonts, colours and selection-highlights are applied to the text as it is drawn.

void USP_Free(USPDATA * uspData);

USP_Free should be called when the USPDATA object is no longer needed. Over the course of the next two or three tutorials I will be detailing how I have implemented UspLib, and will provide details and examples of using Uniscribe directly.

I have designed UspLib in isolation from Neatpad. My intention is that it is a completely stand-alone library, which can be used by any application to add complex-script support. It should certainly be possible to import UspLib into your projects and use it straight away, because it contains no dependencies other than the Uniscribe DLL.

Further Reading

There is very little information available about Uniscribe other than what is available in MSDN.

Uniscribe Platform SDK Reference

Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0

Globalization Step-by-Step - Complex Scripts Awareness

Windows Glyph Processing

There is also the CSSamp example program from the Platform SDK, in the Samples sub-directory: \PlatformSDK\Samples\winui\globaldev\CSSamp

Alternatives to Uniscribe

Not every editor uses Uniscribe. If open-source is your thing then there are currently two very impressive efforts available which offer strong alternatives to Uniscribe. There is also an equivalent of Uniscribe available for Apple's OS X called ATSUI.

International Components for Unicode (ICU) is IBM's open-source Unicode support library. It contains a lot of functionality, ranging from character conversion and analysis to searching and layout.

Pango is an open-source library for laying out and rendering Unicode text. It is used by the GTK toolkit for text layout, and can use either Cairo or Win32 (Uniscribe) rendering back-ends. It offers a more complete solution than Uniscribe and appears to be very well designed and implemented. However, Pango is UTF-8 based, so this may be a consideration if the rest of your application is UTF-16.

Apple Type Services For Unicode Imaging (ATSUI) is Apple's own version of Uniscribe, although it appears to be higher-level than Microsoft's effort. A brief look at the documentation for ATSUI indicated a much easier-to-use design, and substantially better documentation than Microsoft had managed for Uniscribe.

Coming up in Part 12

This was just a short introduction to Uniscribe - hopefully you are a little more aware of what Uniscribe is capable of, and have downloaded and tested the ScriptString sample program.

Part 12 will focus on the first two Uniscribe functions: ScriptItemize and ScriptLayout. There is a lot of detail to cover with just these two APIs and it won't be until Part 13 that we actually see any text being drawn this way with Neatpad.

Lastly, I've not had much feedback in the last few months about Neatpad - did you read this tutorial and find it useful?