Uniscribe Mysteries

Design & Implementation of a Win32 Text Editor

The last tutorial presented a very brief overview of the Uniscribe ScriptString API. Unfortunately ScriptString is insufficient for our purposes with Neatpad because it is limited to a single font and colour. The aim of this tutorial is therefore to investigate the "low-level" Uniscribe API. Because we have very specific requirements for Neatpad's text display, our approach will be that of a multi-font, syntax-coloured text editor.

The string of Unicode text shown below will be used as the basis for much of our discussion. The Arabic phrase in the middle has been chosen because its Unicode properties suit the context of our discussion, not because it has any special meaning.

HelloيُساوِيWorld

You will notice that two of the "glyphs" in the Arabic phrase above have been highlighted in different colours to the rest of the string. These two letters are "U+0633 ARABIC LETTER SEEN" and "U+0627 ARABIC LETTER ALEF". In isolation they display as follows:

 
Left box (shaped together): سا
Right box (rendered separately): س ا

The box on the left shows the two characters rendered with contextual-shaping (assuming you are using a Unicode-enabled web-browser such as Internet Explorer and have the appropriate fonts installed). This is the behaviour we are aiming for. The box on the right shows the characters rendered separately from each other. If both boxes look the same then your browser is not displaying Unicode properly.

One of the big reasons Uniscribe exists is to provide the kind of complex "shaping" behaviour illustrated above. The requirement on our part (as programmers) is that we do not split Unicode strings into individual characters, because this would break the shaping behaviour we are aiming for. Therefore the major goal of this tutorial is to explain how characters can be drawn individually (in different colours) whilst still maintaining the contextual shaping.

Basic Outline

The basic set of steps for drawing text with Uniscribe is outlined below. Note that I am omitting word-wrapping (and the ScriptBreak API) for the moment. So assuming that we have a string of UTF-16 Unicode text, this is what we do:

  1. ScriptItemize - to break the string into distinct scripts or "item-runs".
  2. Merge item runs with application-defined "style" runs to produce finer-grained items.
  3. ScriptLayout - to potentially reorder the items.

Then for each item/run (in the order dictated by the ScriptLayout results)

  4. ScriptShape - to apply contextual shaping behaviour and convert the characters from each run into a series of glyphs.
  5. ScriptPlace - to calculate the width and positions of each glyph in the run.
  6. Apply colouring/highlighting to the individual glyphs.
  7. ScriptTextOut - to display the glyphs.

This outline closely follows how Microsoft recommends you use the Uniscribe API. Note however that I have included an extra step 6 (text-colouring) which is not mentioned in MSDN. The reasoning behind this difference will be explained as we progress through the tutorial. I will leave the subject of word-wrapping to a later tutorial, as this is more a problem of line-buffer management than of using Uniscribe.

BOOL UspAnalyze (
  USPDATA         * uspData,   
  HDC               hdc,
  WCHAR           * wstr,
  int               wlen,
  ATTR            * attrRunList,
  UINT              flags,
  SCRIPT_TABDEF   * tabDef,
  USPFONT         * uspFont
);

The function prototype above is for a function called UspAnalyze. It is part of the new UspLib text-rendering engine that I have written for Neatpad. UspAnalyze is similar in many ways to ScriptStringAnalyze, but with the additional capability of allowing the caller to specify font and style information for the string.

The rest of this tutorial will begin to focus on each aspect of the Uniscribe API as outlined above and will discuss any issues related to each stage. However each stage that we look at will be a key step towards implementing the UspAnalyze function.

1. ScriptItemize

ScriptItemize is usually the first Uniscribe function to be called when displaying a string of Unicode text. Its purpose is to identify the various scripts in a string, and then split this string into items (or runs) according to the script, with one item per script.

Index  Code point (hex)  Character  Item
0      0048              H          0
1      0065              e          0
2      006C              l          0
3      006C              l          0
4      006F              o          0
5      064A              ي          1
6      064F              ُ          1
7      0633              س          1
8      0627              ا          1
9      0648              و          1
10     0650              ِ          1
11     064A              ي          1
12     0057              W          2
13     006F              o          2
14     0072              r          2
15     006C              l          2
16     0064              d          2

The table above illustrates how the UTF-16 string "HelloيُساوِيWorld" would be treated by ScriptItemize. The characters are shown in logical order - in other words, the order that they appear when stored in memory. The string has been divided into three segments. Note that these items are derived purely by their script - not by the finer-grained glyphs and grapheme clusters that are present in the string.

HRESULT WINAPI ScriptItemize(
  WCHAR          * wszText,       // pointer to unicode string
  int              wszLength,     // count of WCHARs         
  int              cMaxItems,     // length of pItems buffer
  SCRIPT_CONTROL * psControl,    
  SCRIPT_STATE   * psState, 
  SCRIPT_ITEM    * pItems,        // out - array of SCRIPT_ITEM structures
  int            * pcItems        // out - count of items
);

ScriptItemize returns an array of SCRIPT_ITEM structures, one for each "shapable" item (script) in the paragraph of text. The number of structures is returned in *pcItems. In the example above, *pcItems would hold the value "3". This SCRIPT_ITEM structure is very simple and is shown below.

struct SCRIPT_ITEM
{ 
   int              iCharPos; 
   SCRIPT_ANALYSIS  a;
};

The SCRIPT_ITEM::iCharPos variable is used to identify the starting position of each "run" of text in the string. The SCRIPT_ANALYSIS child structure holds a lot of extra information about the run, including the reading-direction and the shaping-engine that should be used to convert the run into glyphs.

This time the illustration shows how our Unicode string is represented by an array of SCRIPT_ITEM structures. For the example string the iCharPos values are 0, 5 and 12, followed by a final entry at 17.

Notice that there is always a "hidden" SCRIPT_ITEM on the end of the array which represents the end-of-string. This makes it possible to calculate the length of each SCRIPT_ITEM by using the following construct:

itemLength = pItems[i+1].iCharPos - pItems[i].iCharPos;
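
As a quick check of this construct, the sketch below walks an item array laid out the way our example string would be (items starting at 0, 5 and 12, with the terminating entry at 17). ITEM_POS is a hypothetical stand-in that keeps only the iCharPos member so that the fragment compiles without the Windows headers; with Uniscribe you would pass the real SCRIPT_ITEM array instead. ItemLength and DumpItems are illustrative names, not part of any API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for SCRIPT_ITEM: only iCharPos matters here. */
typedef struct { int iCharPos; } ITEM_POS;

/* Length (in WCHARs) of item i. This relies on the terminating entry
   stored at items[itemCount], so i must be less than itemCount. */
int ItemLength(const ITEM_POS *items, int itemCount, int i)
{
    assert(i >= 0 && i < itemCount);
    return items[i + 1].iCharPos - items[i].iCharPos;
}

/* Print the start position and length of every item. */
void DumpItems(const ITEM_POS *items, int itemCount)
{
    int i;

    for (i = 0; i < itemCount; i++)
    {
        printf("item %d: start=%d length=%d\n",
               i, items[i].iCharPos, ItemLength(items, itemCount, i));
    }
}
```

Note that itemCount here is the value returned in *pcItems, which does not include the terminating entry.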

There are a couple of general points worth making here. Notice that the first parameter to ScriptItemize is a WCHAR *. There is no ANSI version of this function so from now on Neatpad will be a pure Unicode application. Unless we can use the Microsoft Layer for Unicode (MSLU) we will have to drop support for Win9x.

Note also that you can never know in advance how many SCRIPT_ITEMs will be returned for a string of text, so it is usually necessary to use a loop of some kind - which allocates more and more memory for the SCRIPT_ITEM buffer until the call to ScriptItemize succeeds:

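SCRIPT_CONTROL scriptControl = { 0 };
SCRIPT_STATE   scriptState   = { 0 };

SCRIPT_ITEM   *itemList      = 0;
int            itemCount     = 0;
int            allocLen      = 0;
HRESULT        hr;

do {
    SCRIPT_ITEM *newList;

    // grow the buffer on every pass (the step of 16 is an arbitrary choice);
    // leave room for one extra SCRIPT_ITEM, the end-of-string terminator
    allocLen += 16;
    newList   = realloc(itemList, (allocLen + 1) * sizeof(SCRIPT_ITEM));

    if(newList == 0)
    {
        hr = E_OUTOFMEMORY;
        break;
    }

    itemList = newList;

    hr = ScriptItemize(
            wstr,
            wlen,
            allocLen,
            &scriptControl,
            &scriptState,
            itemList,
            &itemCount);

    // any failure other than "buffer too small" is fatal
    if(hr != S_OK && hr != E_OUTOFMEMORY)
        break;

} while(hr != S_OK);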

A word of warning here - make sure you always pass fully initialized SCRIPT_CONTROL and SCRIPT_STATE structures to ScriptItemize even if their contents are initialized to all "zeros". Unless both these structures are specified, the Unicode bi-directional algorithm will not be used for the purposes of itemizing the string. This can result in incorrect identification of item-run positions in some circumstances (such as LTR and RTL scripts appearing in the same string).

Interestingly, MSDN says that when the SCRIPT_CONTROL and SCRIPT_STATE are NULL the itemization is based purely on character code. When non-null, the full bidirectional algorithm is applied as stated above. For this latter case the entire paragraph must be in memory. Although I'm not going to go down this path, this does suggest a method for handling arbitrarily long lines of text that cannot reside in memory as whole paragraphs.

2. Merging Style Runs

The reason we are using Uniscribe directly instead of the ScriptString functions is because we want finer-grained control over text colouring and font selection. And we have now reached the point (after calling ScriptItemize) where Microsoft's documentation advises us to merge "application-defined" style runs with the item information returned by ScriptItemize. Here's the quote from MSDN:

"Before using Uniscribe, an application divides the paragraph into runs, that is, a string of characters with the same style. The style depends on what the application has implemented, but typically includes such attributes as font, size, and color....Merge the item information with the run information to produce runs with a single style, script and direction."

This quote is one of the most confusing, cryptic and misleading statements in the whole of the Uniscribe documentation. The problem is, there are no hints in MSDN as to how one should merge style-runs with item-runs, or even what a "style run" actually is. We will look at how to "merge runs" a little further down, but first let's understand what is meant by the term "style run".

Of course, a style-run is whatever an application wants it to be. In essence it is a range of text that has been assigned a specific set of attributes. In the case of Neatpad I have used an ATTR structure to represent colour and font - one for each character in a string of text. The string of text and the attribute-list looked something like this:

WCHAR buff[ MAXLINELEN ];
ATTR  attr[ MAXLINELEN ];

However, since migrating to Uniscribe and the 'inversion-highlighting' scheme, I have extended the ATTR structure somewhat, so that it is no longer 'one ATTR per character':

struct ATTR
{
   COLORREF     fg;    // foreground text colour
   COLORREF     bg;    // background text colour
   int  len   : 16;    // length of this run (in WCHARs)
   int  font  : 7;     // font-index
   int  sel   : 1;     // selection flag (yes/no)
   int  ctrl  : 1;     // show as an isolated control-character
};

The foreground and background colours remain unchanged. The new structure-members are detailed below.

  • The first change is a new length field which represents the length (in characters) of an attribute-run - simply because Uniscribe likes to deal with "runs" of things rather than single characters. I won't be modifying the existing Neatpad code just yet (which assumes 1 ATTR per character) and this extra field will just be used for "internal housekeeping" when dealing with the Uniscribe API.
  • The font field is no different than before. It is still used as an index into a font-table.
  • The sel boolean is used to indicate the selection-state of the text-run - in other words, if the run should be rendered with selection-highlighting. This is quite an important change - I no longer store the selection-colours in ::fg and ::bg - a separate flag is used to indicate whether a character (or range of text) is selected. The move to an 'inversion highlighting' scheme requires this change.
  • The ctrl boolean is the last addition and is used to indicate whether or not the characters in the text-run should be rendered normally, or as individual control-characters.

The problem we now face is that we have two lists of unrelated entities - a SCRIPT_ITEM list which identifies the position of scripts within the original character array, and an ATTR list which identifies the ranges of style in the original string. We need to understand what MSDN means when it instructs us to merge these two unrelated lists together:

SCRIPT_ITEM *itemList;
ATTR        *attrList;

The basic process is to look at the style-runs and item-runs together and identify any position within the string where a run of one type overlaps another. For example, suppose a SCRIPT_ITEM run overlaps the boundary-position between two ATTR structures. This SCRIPT_ITEM would have to be split into two new halves - each representing a different ATTR style-run.

The way the split occurs is like this: the SCRIPT_ITEM::iCharPos variables are modified to point to new positions within the original string and an array of ITEM_RUN structures is built up which holds these new character-positions. The other contents of the SCRIPT_ITEM (i.e. the SCRIPT_ANALYSIS structure) must be duplicated between the resulting two halves. Think of it as follows: The ScriptItemize function first breaks the string into discrete units based on script. The merge process then further breaks the string into smaller units based on style, should there be any overlap between the two.

The following diagram hopefully illustrates what is meant by a "style merge":

Now here's the problem. If we break up a SCRIPT_ITEM won't this affect the contextual shaping behaviour of the Uniscribe engines? The short answer is, yes, we will break the Uniscribe shaping behaviours by breaking up a string - and no, there is no magic way to get around this problem.

You may notice in the above diagram that I have written "font only" next to the ATTR style runs. This is deliberate, because although Microsoft advises us to break up a string based on style, this is not really correct. In fact, breaking a string due to colour differences (for the purpose of selection/syntax highlights) at this stage is wrong:

We must only take fonts into account when merging style-runs and item-runs, and ignore colour-information entirely.

Hopefully I have gotten this point across adequately. After following the advice of the Microsoft docs I wasted about a week trying to figure out how to colourise a string, only to realise that I was going about it the wrong way. Syntax-colouring (or any kind of text colouring for that matter) must be applied to a string after the shaping has taken place - i.e. after ScriptShape and ScriptPlace have been called, and just prior to calling ScriptTextOut. This doesn't mean that we can't store colours in our ATTR structures - it's just that we don't use this information whilst performing the 'merge'. Any ATTR structures which share the same font must therefore be coalesced into a single run by the merge process before doing any "splits".
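
The coalescing step can be sketched in a few lines. This is only an illustration under stated assumptions, not the UspLib source: STYLE_ATTR is a cut-down, hypothetical stand-in for ATTR (colours plus len and font, without the bitfields), FONT_RUN and CoalesceByFont are names invented for this example, and the output array is assumed to have room for attrCount entries.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the ATTR structure. */
typedef struct
{
    unsigned long fg;     /* foreground colour (ignored by the merge) */
    unsigned long bg;     /* background colour (ignored by the merge) */
    int           len;    /* length of this run in WCHARs             */
    int           font;   /* font-index                               */
} STYLE_ATTR;

/* A run of text that shares one font, whatever its colours. */
typedef struct
{
    int len;
    int font;
} FONT_RUN;

/* Collapse adjacent style runs that use the same font into one FONT_RUN.
   Colour changes never start a new run. Returns the number of runs
   written to out[], which must have room for attrCount entries. */
int CoalesceByFont(const STYLE_ATTR *attrList, int attrCount, FONT_RUN *out)
{
    int i, n = 0;

    assert(attrCount >= 0);

    for (i = 0; i < attrCount; i++)
    {
        if (n > 0 && out[n - 1].font == attrList[i].font)
        {
            /* same font as the previous run: just extend it */
            out[n - 1].len += attrList[i].len;
        }
        else
        {
            /* the font changed: start a new run */
            out[n].font = attrList[i].font;
            out[n].len  = attrList[i].len;
            n++;
        }
    }

    return n;
}
```

Three style runs of 3, 2 and 6 characters, where the first two differ only in colour, would come out as two font runs of 5 and 6 characters. The colours stay in the original list, ready to be applied after shaping.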

OK so once we have broken up the ATTR and SCRIPT_ITEM structures what do we do with them? I have defined a new structure called ITEM_RUN which contains the necessary content from the SCRIPT_ITEM and ATTR structures:

struct ITEM_RUN
{
   SCRIPT_ANALYSIS  analysis;      // from the original SCRIPT_ITEM
   int              charPos;       // character-offset within the original string
   int              len;           // length of run in WCHARs
   int              font;          // only font is required, not colours
   ...
};

ITEM_RUN basically allows us to keep "formatting" information alongside the item-runs. Once we have itemized the string, Uniscribe only cares about the SCRIPT_ANALYSIS structures for each run. The other members of the ITEM_RUN structure are for our own private use. The item-run-list is stored inside the USPDATA structure for the string, in the itemRunList field:

struct USPDATA
{
   ITEM_RUN   * itemRunList;
   int          itemRunCount;
   ...
};

The algorithm to merge runs is actually quite complicated - in fact it's one of the trickier aspects of Uniscribe programming, not helped by the fact that Microsoft give absolutely no hint as to how this should be performed, other than the 7-year old CSSamp application from the 1998 article "Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0".

To solve this problem I have written a new function called BuildMergedItemRunList that builds an array of ITEM_RUN structures for a given Uniscribe string. It performs two tasks - calling ScriptItemize and then merging the results with the style-runs specified by attrList.

BOOL BuildMergedItemRunList(
                 USPDATA  * uspData,       // in/out - holds results of merge
                 WCHAR    * wstr,          
                 int        wlen, 
                 ATTR     * attrList
 );

BuildMergedItemRunList is a private function of UspLib, and is called by UspAnalyze as one of the first steps when building a USPDATA object. Taken in isolation, the function is used something like this:

ATTR attrList[2] = 
{
    { RGB(0xff, 0x00, 0xff), RGB(0,0,0), 5, 0, 0 },    // five characters using font#0
    { RGB(0xAA, 0x22, 0xAA), RGB(0,0,0), 6, 1, 0 }     // six  characters using font#1
};

BuildMergedItemRunList(uspData, L"Hello World", 11, attrList);
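
To give a feel for what happens inside such a function, here is a minimal sketch of the merge loop itself. It is not the UspLib implementation. ITEM, FONT_RUN, MERGED_RUN and MergeRuns are hypothetical names, the analysis member is a plain integer standing in for the SCRIPT_ANALYSIS that would really be copied, and the sketch assumes that the font runs have already been coalesced, have non-zero lengths, and together cover the whole string.

```c
#include <assert.h>

/* Stand-in for SCRIPT_ITEM: start position plus a tag for the analysis. */
typedef struct { int iCharPos; int analysis; } ITEM;

/* Font-only style run. */
typedef struct { int len; int font; } FONT_RUN;

/* Stand-in for ITEM_RUN. */
typedef struct { int analysis; int charPos; int len; int font; } MERGED_RUN;

/* Walk both lists in step, starting a new output run wherever either an
   item boundary or a font boundary occurs. items[] must include the
   terminating entry at items[itemCount]. Returns the number of runs
   written, or -1 if out[] is too small. */
int MergeRuns(const ITEM *items, int itemCount,
              const FONT_RUN *fonts, int fontCount,
              MERGED_RUN *out, int maxOut)
{
    int i = 0, f = 0, n = 0;
    int pos     = 0;                              /* current character position */
    int fontEnd = (fontCount > 0) ? fonts[0].len : 0;

    assert(itemCount >= 0 && fontCount >= 0);

    while (i < itemCount && f < fontCount)
    {
        int itemEnd = items[i + 1].iCharPos;
        int runEnd  = (itemEnd < fontEnd) ? itemEnd : fontEnd;

        if (n == maxOut)
            return -1;

        out[n].analysis = items[i].analysis;      /* duplicated across splits */
        out[n].charPos  = pos;
        out[n].len      = runEnd - pos;
        out[n].font     = fonts[f].font;
        n++;

        pos = runEnd;

        if (pos == itemEnd)                       /* finished this item */
            i++;

        if (pos == fontEnd)                       /* finished this font run */
        {
            f++;
            if (f < fontCount)
                fontEnd += fonts[f].len;
        }
    }

    return n;
}
```

For our three-item example (items starting at 0, 5 and 12, terminator at 17) and two font runs of 7 and 10 characters, this produces four runs: the Arabic item is split at position 7, and both halves carry the same analysis value.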

Understand that the big advantage of using Uniscribe is the contextual-shaping and complex-script support. Dividing a Unicode string into sections by splitting SCRIPT_ITEM structures will break the script-shaping behaviour that we seek. We must try to keep the number of split SCRIPT_ITEMs to a minimum - and splitting based on colour differences at this stage is wrong. Although Neatpad will already have built its ATTR style-lists before displaying text with Uniscribe, using the colour information in these lists must occur after shaping has taken place.

Finally, if you are building a text-editor that only deals with a single font then you can completely skip this phase and save yourself a lot of work (or even use the ScriptString API if you don't want syntax colouring!)

3. ScriptLayout

The next stage with Uniscribe is to take the merged item-runs and establish the correct visual order for display. In our case we use the array of ITEM_RUN structures produced by BuildMergedItemRunList. This is an important step and is the key to the correct display of bidirectional text. Note that unless a string contains right-to-left scripts, reordering is not necessary, but we still need to go through the motions because we won't know until runtime what kind of scripts and languages we might be processing.

The Uniscribe ScriptLayout function is called to perform the reordering, and uses the Unicode Bidirectional Algorithm to achieve this task.

HRESULT WINAPI ScriptLayout(
   int     cRuns, 
   BYTE  * pbLevel,              // in
   int   * piVisualToLogical,    // out
   int   * piLogicalToVisual     // out
);

ScriptLayout takes as input a simple array of BYTEs which represent the bidi run-embedding levels of the string - one BYTE per item-run. This bidi run-embedding value is stored in the SCRIPT_STATE::uBidiLevel variable for each ITEM_RUN. It is up to us to build this BYTE[] array before calling ScriptLayout.

We therefore have to manually extract the uBidiLevel variable from each item-run. uBidiLevel is buried deep within each SCRIPT_ANALYSIS, as a member of the SCRIPT_STATE structure. Once the BYTE[] array is built the ScriptLayout API can be called. It all seems like rather a lot of work just to return a further array of integers, but that's just the way it is. Presumably the Uniscribe developers did it this way because they assumed that you would be creating and merging your own ITEM_RUN (or similar) structures.
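
To make the reordering itself less mysterious, the sketch below applies the run-reversal rule of the Unicode Bidirectional Algorithm (rule L2) to an array of embedding levels. I am assuming this is the rule ScriptLayout follows, so treat it as an illustration of the idea rather than a description of Uniscribe's internals. ReorderRuns is a hypothetical name, and plain unsigned char is used in place of BYTE so the fragment builds without the Windows headers.

```c
#include <assert.h>

/* Fill visualToLogical[] from one bidi embedding level per run.
   Rule L2: from the highest level down to the lowest odd level, reverse
   every contiguous sequence of runs that is at that level or higher. */
void ReorderRuns(const unsigned char *levels, int count, int *visualToLogical)
{
    int i, level;
    int maxLevel    = 0;
    int minOddLevel = 255;

    assert(count >= 0);

    /* start with the identity mapping and find the level range */
    for (i = 0; i < count; i++)
    {
        visualToLogical[i] = i;

        if (levels[i] > maxLevel)
            maxLevel = levels[i];

        if ((levels[i] & 1) && levels[i] < minOddLevel)
            minOddLevel = levels[i];
    }

    /* if there are no odd (right-to-left) levels this loop never runs */
    for (level = maxLevel; level >= minOddLevel; level--)
    {
        int start = 0;

        while (start < count)
        {
            int lo, hi, end;

            if (levels[start] < level)
            {
                start++;
                continue;
            }

            /* find the end of this sequence of runs at 'level' or above */
            end = start;
            while (end + 1 < count && levels[end + 1] >= level)
                end++;

            /* reverse the entries in [start, end] */
            for (lo = start, hi = end; lo < hi; lo++, hi--)
            {
                int tmp             = visualToLogical[lo];
                visualToLogical[lo] = visualToLogical[hi];
                visualToLogical[hi] = tmp;
            }

            start = end + 1;
        }
    }
}
```

For our example string the levels would be { 0, 1, 0 }, which leaves the three runs in their original order: a single right-to-left run between two left-to-right runs does not move, and only its glyphs are laid out in reverse. Two adjacent right-to-left runs, as in { 0, 1, 1, 0 }, do swap places.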

VOID BuildVisualMapping( ITEM_RUN *  itemRunList, 
                         int         itemRunCount, 
                         int         visualToLogicalList[]  // out
  )
{
    int     i;
    BYTE  * bidiLevel = malloc(itemRunCount * sizeof(BYTE));

    // nothing can be done without the temporary buffer
    if(bidiLevel == NULL)
        return;

    // Manually extract bidi-embedding-levels ready for ScriptLayout
    for(i = 0; i < itemRunCount; i++)
        bidiLevel[i] = itemRunList[i].analysis.s.uBidiLevel;

    // Build a visual-to-logical mapping order
    ScriptLayout(itemRunCount, bidiLevel, visualToLogicalList, NULL);

    // free the temporary BYTE[] buffer
    free(bidiLevel);
}

The function above shows how to obtain the visual-mapping list given an array of ITEM_RUN structures. This list is essential when displaying a string of text, or in fact doing anything which requires visual-order processing such as mouse/caret hit testing.

int visualIdx;
int xpos = 0, ypos = 0;

for(visualIdx = 0; visualIdx < itemRunCount; visualIdx++)
{
    int logicalIdx    = visualToLogicalList[visualIdx];
    ITEM_RUN *itemRun = &itemRunList[logicalIdx];   // pointer to the run in logical order

    ProcessRun(itemRun, xpos, ypos);

    xpos += itemRun->width;
}

This type of processing-loop is necessary because even though we may be dealing with right-to-left scripts (e.g. Arabic or Hebrew), when it comes to text-display we still draw everything from left-to-right, including the 'backwards' runs. The visual-to-logical list provides a way to map from a visual to logical index and ensures we always process the runs in the appropriate order.
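
The same loop shape serves for hit testing. The sketch below is one possible arrangement, using names of my own (ComputeRunOffsets, RunFromX) and plain integer arrays in place of ITEM_RUN members: it records the x-offset of each run while walking in visual order, then uses those offsets to find which run lies under a given x-coordinate.

```c
#include <assert.h>

/* Walk the runs in visual order and record where each one starts.
   runWidth[] and runX[] are indexed by LOGICAL run index.
   Returns the total width of the line. */
int ComputeRunOffsets(const int *runWidth, const int *visualToLogical,
                      int runCount, int *runX)
{
    int visualIdx;
    int xpos = 0;

    assert(runCount >= 0);

    for (visualIdx = 0; visualIdx < runCount; visualIdx++)
    {
        int logicalIdx = visualToLogical[visualIdx];

        runX[logicalIdx] = xpos;
        xpos += runWidth[logicalIdx];
    }

    return xpos;
}

/* Return the logical index of the run that contains x, or -1 if x
   falls outside the line. */
int RunFromX(const int *runX, const int *runWidth, int runCount, int x)
{
    int i;

    for (i = 0; i < runCount; i++)
    {
        if (x >= runX[i] && x < runX[i] + runWidth[i])
            return i;
    }

    return -1;
}
```

With run widths of 30, 50 and 40 pixels (in logical order) and a visual order of { 0, 2, 1 }, the third logical run is drawn second, so it starts at x = 30 and the second logical run starts at x = 70.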

Coming up in Part 13

Uniscribe is a very complicated business as you can probably tell from reading this tutorial. Unfortunately this is a necessary evil, as all software written today should be fully Unicode compliant. Don't think for a minute that Uniscribe can be ignored - we need to support Unicode, and we must accept the added complications that it brings. The days of ASCII/English text display have gone for good.

So far we have covered the process of breaking up and reordering a string of Unicode text into a series of item-runs. However, we are still only half-way towards implementing the UspAnalyze function. The next tutorial will reveal how to take the item-runs we have produced and generate glyph and width information using the ScriptShape and ScriptPlace APIs.