Introduction to Uniscribe
Design & Implementation of a Win32 Text Editor
Uniscribe is a low-level Win32 API that provides a high degree of control over the processing and display of Unicode text. The API is designed to provide a generic interface to all forms of Unicode text (complex or otherwise), and transparently handles properties such as bi-directional text and combining character sequences.
Uniscribe is a single DLL called USP10.DLL, which contains all of the Uniscribe APIs. This DLL is present on Windows 2000 and above, or any computer with Internet Explorer 5.0 (or greater) installed. Two Platform SDK files (including USP10.LIB) are provided by Microsoft to allow an application to make use of this complex-script support. An important point about Uniscribe is that it doesn't just handle complex scripts - it can be used to process and display all Unicode text, so it can serve as a direct replacement for existing text-output routines.
The Uniscribe API is divided into two categories - the low-level API itself, and a wrapper library called ScriptString which hides much of the complexity of dealing with Uniscribe directly. The purpose of this tutorial is to give a brief introduction to the world of Uniscribe before we start delving in properly.
Uniscribe in Windows
When I first started Neatpad I was unsure exactly what Uniscribe entailed, and it was only after researching Unicode that I fully appreciated the issues surrounding the display of Unicode text. Although Uniscribe occupies its own section within the MSDN documentation, other than the occasional reference it is very easy to miss unless you already know of its existence.
MSDN states that since Windows 2000, the ExtTextOut function (and others like DrawText etc) have been extended to support complex scripts. Although this is true, it gives the impression that an application can call ExtTextOutW (the Unicode version) at any time with a buffer of UTF-16 text and it will always display correctly.
Unless Windows has been configured to do so, functions such as ExtTextOut do not automatically support complex scripts. The image above shows the "Regional and Language Options" dialog. The two settings which have been highlighted are not normally enabled by a default installation of American/English Windows.
Enabling complex-script support installs a number of extra libraries, after which ExtTextOut will use Uniscribe when necessary to display complex scripts.
BOOL ExtTextOut (
    HDC      hdc,        // handle to DC
    int      X,          // x-coordinate of reference point
    int      Y,          // y-coordinate of reference point
    UINT     fuOptions,  // text-output options (ETO_GLYPH_INDEX etc)
    RECT *   lprc,       // optional dimensions
    LPCTSTR  lpString,   // string
    UINT     cbCount,    // number of characters in string
    INT *    lpDx        // array of spacing values
);
ExtTextOut is most commonly used to display a string of text. However it can do much more than this. When the ETO_GLYPH_INDEX option is specified (optionally together with ETO_PDY), ExtTextOut can be used to display a buffer of glyphs instead of characters. This feature of ExtTextOut is used when displaying a string containing complex scripts, as the diagram below illustrates.
Text drawing in Windows 2000 and above
For any string containing complex scripts, ExtTextOut makes use of Uniscribe to display it. Uniscribe breaks the string down into groups of glyphs and then re-calls ExtTextOut, this time with the ETO_GLYPH_INDEX option, and a buffer of glyph-indices instead of the original character values. For regular Unicode text which doesn't require any special processing, ExtTextOut behaves exactly the same as it did under previous Windows versions.
You may be wondering why Uniscribe is necessary if routines such as TextOut can for the most part render complex scripts quite successfully. For applications which just output single strings of text, Uniscribe is not necessary.
It is only when a string must be broken up (for the purposes of styling/formatting) that Uniscribe is required. It is just not possible to split a Unicode string into sections and draw each section independently (as we have been doing with Neatpad up 'til now). Doing so breaks all kinds of things such as contextual shaping behaviours and bidirectional support. A modern text-editor simply must support Unicode and all the various scripts that come along with that - we have no other choice than to move to Uniscribe.
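To make the splitting problem concrete, here is a minimal sketch in portable C of a check for whether a UTF-16 buffer can be cut at a given index without separating a surrogate pair or detaching a combining mark from its base character. The name is_safe_split, the use of uint16_t in place of WCHAR, and the restriction to the U+0300..U+036F combining block are all assumptions made for illustration. Real cluster boundaries, contextual shaping and bidirectional reordering depend on script data that only a shaping engine such as Uniscribe has, so passing this check does not mean a split is safe in general.

```c
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogate: first half of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t c) { return c >= 0xD800 && c <= 0xDBFF; }

/* Low (trail) surrogate: second half of a UTF-16 surrogate pair. */
static int is_low_surrogate(uint16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

/* Simplification: only the Combining Diacritical Marks block is recognised. */
static int is_combining_mark(uint16_t c) { return c >= 0x0300 && c <= 0x036F; }

/*
 * Hypothetical helper: returns 1 if text[0..pos) and text[pos..len) could be
 * processed separately without cutting a surrogate pair in half or leaving a
 * combining mark without its base character, and 0 otherwise.
 */
int is_safe_split(const uint16_t *text, size_t len, size_t pos)
{
    if (pos == 0 || pos >= len)
        return 1;   /* nothing is separated at either end of the buffer */

    if (is_high_surrogate(text[pos - 1]) && is_low_surrogate(text[pos]))
        return 0;   /* would split one code point into two halves */

    if (is_combining_mark(text[pos]))
        return 0;   /* would detach the mark from the preceding base */

    return 1;
}
```

Even this tiny check shows that a style boundary cannot be placed at an arbitrary character index. The cases it cannot detect (Arabic joining, Indic reordering, mixed-direction runs) are the ones that need Uniscribe.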
The ScriptString API
The ScriptString API is designed for applications which want to display text in a single font and colour. Notepad (and the standard Windows EDIT control) is a prime example of an application that uses the ScriptString API. One of the nice features of this API is that it allows you to display a string of text, with a portion of that string optionally displayed as 'selected'. This is actually a very nice touch as it saves a tremendous amount of effort.
The ScriptStringAnalyse function is the starting point with Uniscribe. It is a pretty intimidating function to look at. However its purpose is simply to perform shaping and glyph-generation for any string of Unicode text, and it returns a SCRIPT_STRING_ANALYSIS structure when complete.
HRESULT WINAPI ScriptStringAnalyse (
    HDC                      hdc,
    void *                   pString,
    int                      cString,
    int                      cGlyphs,
    int                      iCharset,
    DWORD                    dwFlags,
    int                      iReqWidth,
    SCRIPT_CONTROL *         psControl,
    SCRIPT_STATE *           psState,
    int *                    piDx,
    SCRIPT_TABDEF *          pTabdef,
    BYTE *                   pbInClass,
    SCRIPT_STRING_ANALYSIS * pssa
);
SCRIPT_STRING_ANALYSIS is an opaque structure - there is no documentation which details what it contains. This is not important though, as this structure is simply passed to the rest of the ScriptString API without requiring any further knowledge.
HRESULT WINAPI ScriptStringOut (
    SCRIPT_STRING_ANALYSIS ssa,
    int                    iX,
    int                    iY,
    UINT                   uOptions,
    RECT *                 prc,
    int                    iMinSel,
    int                    iMaxSel,
    BOOL                   fDisabled
);
ScriptStringOut is used to display a string of text that was previously analyzed. Note that a text-string is not specified with this call - only the SCRIPT_STRING_ANALYSIS structure is passed, which contains all the necessary information to display the original string.
HRESULT WINAPI ScriptStringXtoCP (
    SCRIPT_STRING_ANALYSIS ssa,
    int                    iX,
    int *                  piCh,
    int *                  piTrailing
);
ScriptStringXtoCP is an interesting function. It provides a mechanism for caret and mouse positioning within a string of Unicode text.
HRESULT WINAPI ScriptStringCPtoX (
    SCRIPT_STRING_ANALYSIS ssa,
    int                    icp,
    BOOL                   fTrailing,
    int *                  pX
);
ScriptStringCPtoX is the counterpart to ScriptStringXtoCP. It performs the opposite task - converting a string-position to a display-coordinate.
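The relationship between these two functions can be sketched with a simplified model in portable C. It assumes left-to-right text with exactly one glyph per character and a caller-supplied array of advance widths. The names model_x_to_cp and model_cp_to_x are hypothetical, and this is not how Uniscribe is implemented: the real functions also cope with clusters, ligatures and right-to-left runs, and their behaviour for coordinates outside the string may differ from the clamping used here.

```c
#include <stddef.h>

/*
 * Model of x-coordinate -> character position.
 * advance[i] is the pixel width of character i; x is relative to the start
 * of the string. On return *ch is the character under x and *trailing is 1
 * when x is nearer that character's right edge, 0 when nearer its left edge.
 * Coordinates outside the string are clamped to the first or last character.
 */
void model_x_to_cp(const int *advance, int count, int x, int *ch, int *trailing)
{
    int left = 0;
    int i;

    for (i = 0; i < count; i++) {
        int right = left + advance[i];
        if (x < right) {
            *ch = i;
            *trailing = ((x - left) * 2 >= advance[i]);
            return;
        }
        left = right;
    }

    /* x is past the end of the string (or the string is empty). */
    *ch = (count > 0) ? count - 1 : 0;
    *trailing = (count > 0) ? 1 : 0;
}

/*
 * Model of character position -> x-coordinate.
 * Returns the left edge of character cp, or its right edge when trailing
 * is non-zero.
 */
int model_cp_to_x(const int *advance, int count, int cp, int trailing)
{
    int x = 0;
    int i;

    for (i = 0; i < cp && i < count; i++)
        x += advance[i];

    if (trailing && cp >= 0 && cp < count)
        x += advance[cp];

    return x;
}
```

In an editor, a mouse click would go through the first function to find a character and an edge, and the caret would then be drawn at the coordinate the second function returns for that same pair, which snaps the caret to the nearest character boundary.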
HRESULT WINAPI ScriptStringFree( SCRIPT_STRING_ANALYSIS * pssa );
When an application has finished displaying the string, the ScriptStringFree function can be used to clean up. There are more ScriptString functions than what I have listed here, but with just these five an application can implement the front-end to a fully-functional text-editor with minimal effort.
The image above shows a simple application I wrote which demonstrates the ScriptString API. The source-code and demo executable can be downloaded at the top of this article.
An oddity of ScriptString is this: ScriptStringOut fails if the device-context used to render is not the same as the one used when analyzing the string with ScriptStringAnalyse!
The main problem with the ScriptString API is its inability to display text in more than one font and colour. This makes it particularly unsuitable for our purposes with Neatpad. Our only option is to make use of the low-level Uniscribe functions directly.
UspLib is a library I have written to provide a far richer capability than ScriptString can offer. This new library provides a wrapper around the low-level Uniscribe API that we will be discussing over the next couple of tutorials. UspLib is very similar in approach to the ScriptString Uniscribe wrapper, but goes a lot further in terms of text-colouring and formatting.
USPDATA * USP_Allocate();
The first API is USP_Allocate. This function returns a pointer to a USPDATA object which must be used for subsequent UspLib operations.
BOOL USP_Analyze (
    USPDATA * uspData,
    HDC       hdc,
    WCHAR *   wstr,
    int       wlen,
    ATTR *    attrRunList,
    UINT      flags,
    USPFONT * uspFont
);
USP_Analyze is similar to ScriptStringAnalyse. The difference is, a string of text can be re-analyzed using an existing
void USP_ApplyAttributes ( USPDATA * uspData, ATTR * attrRunList );
Once a string has been analyzed (i.e. itemized and shaped etc.), colour-attributes can be reapplied at any time using USP_ApplyAttributes. The font-information stored in the ATTR run-list is ignored.
void USP_ApplySelection ( USPDATA * uspData, int selStart, int selEnd );
USP_ApplySelection performs a similar task to USP_ApplyAttributes. However this time only the selection-flags are modified in the
int USP_TextOut ( USPDATA * uspData, HDC hdc, int xpos, int ypos, RECT * rect);
USP_TextOut is the counterpart to ScriptStringOut. It takes as input the USPDATA object which was previously analyzed, and draws it to the specified location. Any fonts, colours and selection-highlights are applied to the text as it is drawn.
void USP_Free(USPDATA * uspData);
USP_Free should be called when the USPDATA object is no longer needed. Over the course of the next two or three tutorials I will be detailing how I have implemented UspLib, and will provide details and examples of using Uniscribe directly.
I have designed UspLib in isolation from Neatpad. My intention is that it is a completely stand-alone library, which can be used by any application to add complex-script support. It should certainly be possible to import UspLib into your projects and use it straight away, because it contains no dependencies other than the Uniscribe DLL.
There is very little information available about Uniscribe other than what is available in MSDN.
There is also the CSSamp example program from Platform SDK, in the Samples sub-directory:
Alternatives to Uniscribe
Not every editor uses Uniscribe. If open-source is your thing then there are currently two very impressive efforts available which offer a very strong alternative to Uniscribe. There is also an equivalent version of Uniscribe available for Apple's OSX called ATSUI.
International Components for Unicode (ICU) is IBM's open-source Unicode support library. It contains a lot of functionality, ranging from character conversion and analysis to searching and layout.
Pango is an open-source library for laying out and rendering Unicode text. It appears to sit on top of the GTK display library and can specify either Cairo or Win32 (Uniscribe) rendering back-ends. It offers a more complete solution than Uniscribe and appears to be very well designed and implemented. However Pango is UTF-8 based so this may be a consideration if the rest of your application is UTF-16.
Apple Type Services For Unicode Imaging (ATSUI) is Apple's own version of Uniscribe, although it appears to be higher-level than Microsoft's effort. A brief look at the documentation for ATSUI indicated a much easier-to-use design, and substantially better documentation than Microsoft had managed for Uniscribe.
Coming up in Part 12
This was just a short introduction to Uniscribe - hopefully you are a little more aware of what Uniscribe is capable of, and have downloaded and tested the ScriptString sample program.
Part 12 will focus on the first two Uniscribe functions:
ScriptLayout. There is a lot of detail to cover with just these two APIs and it won't be until Part 13 that we actually see any text being drawn this way with Neatpad.
Lastly, I've not had much feedback in the last few months about Neatpad - did you read this tutorial and find it useful?