Integrating UspLib

14 minute read • 6 Mar 2006

It’s finally here! - a new and improved Neatpad which demonstrates the rendering capabilities of UspLib. The purpose of this tutorial is firstly to document the UspLib API, and secondly to mention a few details about how UspLib was integrated into Neatpad’s code. I very much hope that the design of UspLib is good enough that it will allow others to import it into their own editors and get instant styled-text support!

The image above shows Neatpad’s new Unicode text-rendering engine in action. Five different scripts are being displayed - Devanagari, Tamil, Thai, Arabic and of course Latin. Font-fallback is not currently supported in Neatpad, so to display all of these different scripts a suitable font must be selected. In the example above I used the “Arial Unicode MS” font which weighs in at a hefty 22Mb!

Now, don’t get too excited about this latest version. On the surface it is no different than before - it is only when you load a Unicode file containing lots of complex scripts that you will see where all the work has gone.

The UspLib API has been documented below. Please let me know if you were successful in integrating this API into your own projects!

To use UspLib, include the single header-file usplib.h, and link against usplib.lib. The library itself has no dependencies other than the Uniscribe Script Processor DLL (usp10.dll), which will be present on Windows 2000 and above.

UspAllocate

USPDATA * UspAllocate();

UspAllocate initializes and returns a new USPDATA object, which must be used for all subsequent UspLib operations.

UspAnalyze

BOOL UspAnalyze (
  USPDATA * uspData,
  HDC hdc,
  WCHAR * wstr,
  int wlen, 
  ATTR * attrRunList, // optional
  UINT flags,
  USPFONT * uspFont, // optional
  SCRIPT_CONTROL * scriptControl,
  SCRIPT_STATE * scriptState,
  SCRIPT_TABDEF * scriptTabdef // optional
);

UspAnalyze takes as input a single USPDATA object and analyzes the specified paragraph of UTF-16 text, saving the results back in uspData.

uspData points to a single USPDATA object which will hold the results of the analysis. This object can be reused from a previous call to UspAnalyze, which results in less memory allocation overheads.
hdc is a handle to a device-context.
wstr and wlen together identify the wide-character string to be analyzed.
attrRunList points to an optional array of ATTR structures. The size of this array is not directly specified in the call to UspAnalyze. Instead, the ATTR::len fields of the array elements are assumed to add up to wlen - the same length as the string being analyzed.

If attrRunList is NULL, the string is initialized with a single default attribute spanning the entire range of text, using the default system colours and the font specified by uspFont.

flags is a single DWORD variable which should be set to zero.
uspFont points to an optional array of USPFONT structures. Each array element must have been initialized using UspInitFont beforehand. If uspFont is NULL, the font currently selected into hdc is used instead. This same font must be re-selected into the target device-context when calling UspTextOut.
scriptControl points to an optional SCRIPT_CONTROL structure. See MSDN for details.
scriptState points to an optional SCRIPT_STATE structure. See MSDN for details.
scriptTabdef points to an optional SCRIPT_TABDEF structure, which defines the tab-stop positions to be used when performing tab-expansion. See MSDN for details.

UspAnalyze must be used to analyze an entire paragraph of text. The resulting USPDATA object can be used in subsequent calls to UspTextOut and UspSnapXToOffset.

struct ATTR
{
   COLORREF fg; // foreground text colour
   COLORREF bg; // background text colour

   int len : 16; // length of this run (in WCHARs)
   int font : 7; // font-index into the USPFONT table 
   int sel : 1; // selection flag (yes/no)
   int ctrl : 1; // show as an isolated control-character
   int eol : 1; // only valid for last character in line, prevents mouse selection
   int reserved : 6; // unused
};

All fields of the ATTR structure must be initialized before use. Any unrequired field should be set to zero. The ATTR::font field is used as an index into the USPFONT table. Any font referenced by ATTR::font must have been initialized using UspInitFont.
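The requirement that the run lengths add up to the length of the analyzed string is easy to get wrong, so it can be worth checking before calling UspAnalyze. The sketch below is a portable stand-in rather than UspLib code: AttrRun mirrors the ATTR layout with uint32_t in place of COLORREF and unsigned bit-fields, and the helpers MakeRun and RunsCoverText are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable stand-in for ATTR (hypothetical name, not the UspLib type).
// COLORREF is replaced by uint32_t so that windows.h is not required.
struct AttrRun
{
    std::uint32_t fg;         // foreground text colour
    std::uint32_t bg;         // background text colour
    unsigned len      : 16;   // length of this run (in WCHARs)
    unsigned font     : 7;    // font-index into the font table
    unsigned sel      : 1;    // selection flag
    unsigned ctrl     : 1;    // show as an isolated control-character
    unsigned eol      : 1;    // end-of-line marker
    unsigned reserved : 6;    // unused
};

// Hypothetical helper: builds a run with every unrequired field set to zero.
AttrRun MakeRun(unsigned len, unsigned font, std::uint32_t fg, std::uint32_t bg)
{
    assert(len <= 0xFFFFu && font <= 0x7Fu);   // must fit the bit-fields

    AttrRun run = {};   // value-initialize: all fields start at zero
    run.fg   = fg;
    run.bg   = bg;
    run.len  = len;
    run.font = font;
    return run;
}

// Hypothetical helper: true when the runs cover exactly wlen characters
// and every font index lies inside a font table of fontCount entries.
bool RunsCoverText(const std::vector<AttrRun>& runs, int wlen, unsigned fontCount)
{
    long total = 0;

    for (const AttrRun& run : runs)
    {
        if (run.font >= fontCount)
            return false;

        total += run.len;
    }

    return total == wlen;
}
```

A run list of 5 + 7 characters passes the check for a 12-character string and fails for any other length, or when a run refers to a font index beyond the table.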

UspInitFont

void UspInitFont (
  USPFONT * uspFont,
  HDC hdc,            
  HFONT hFont
);

UspInitFont must be called once for each font referenced by UspAnalyze in the attrRunList array. Several font-related resources are managed by the USPFONT object, including the Uniscribe SCRIPT_CACHE object, and the text-metrics for the font.

uspFont points to a single USPFONT structure.
hdc is a handle to a device-context.
hFont is a handle to the font resource.

The USPFONT structure is defined below:

struct USPFONT
{
  HFONT hFont;        
  SCRIPT_CACHE scriptCache;  
  TEXTMETRIC tm;           
  int yoffset; // height-adjustment when drawing font (set to zero)
};

The yoffset field is user-defined and specifies the vertical adjustment to be applied to all text using this font. UspInitFont initially sets this value to zero, but it can be modified after this call. All other structure members are managed by UspInitFont and should not be modified by the caller.

UspFreeFont

void UspFreeFont (
  USPFONT * uspFont
);

UspFreeFont must be called when the specified USPFONT resource is no longer required. The font-handle specified in the call to UspInitFont is released, as well as the SCRIPT_CACHE object held internally to the structure.

UspApplyAttributes

void UspApplyAttributes (
  USPDATA * uspData, 
  ATTR * attrRunList
);

UspApplyAttributes can be called at any time to re-apply the style-run attributes for the specified USPDATA object. Only the colour and selection information is used - all other fields of the attribute-runs (including the font) are ignored.

attrRunList specifies a new list of style-runs for the text.

The attribute-run list must reference a range of text the same length as the string that was previously analyzed by UspAnalyze.

UspApplySelection

void UspApplySelection (
  USPDATA * uspData, 
  int selStart,
  int selEnd
);

UspApplySelection performs a similar task to UspApplyAttributes. However this time only the selection-flags are modified in the USPDATA object.

selStart is the starting position in the string where the selection-highlight should begin.
selEnd is the ending position of the selection-highlight.
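To make the effect of a selection range concrete, here is a small portable model that keeps one selection flag per character. It is only a sketch: ApplySelectionModel is a hypothetical name, and treating the range as half-open (selEnd excluded), accepting the two ends in either order and clamping them to the text are all assumptions of this example rather than documented UspLib behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: returns one selection flag per character of a string
// textLen characters long, with [selStart, selEnd) marked as selected.
std::vector<bool> ApplySelectionModel(int textLen, int selStart, int selEnd)
{
    assert(textLen >= 0);

    // assumption: the caller may pass the two ends in either order
    if (selStart > selEnd)
        std::swap(selStart, selEnd);

    // assumption: positions outside the string are clamped to its ends
    selStart = std::max(0, std::min(selStart, textLen));
    selEnd   = std::max(0, std::min(selEnd, textLen));

    std::vector<bool> sel(static_cast<std::size_t>(textLen), false);

    for (int i = selStart; i < selEnd; ++i)
        sel[static_cast<std::size_t>(i)] = true;

    return sel;
}
```

Because only the flags change, a model like this never needs to touch the glyph data, which is the same reason a selection update can be much cheaper than a full re-analysis.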

UspSetSelColor

void UspSetSelColor (
  USPDATA * uspData,
  COLORREF fg,
  COLORREF bg
);

UspSetSelColor controls the selection-highlight colour to be used when calling UspTextOut. Any character marked with the ATTR::sel attribute, or any range of text identified by UspApplySelection will be drawn using this colour. Note that by default, the Windows selection-highlight colours will be used.

fg is the COLORREF value of the selection foreground (text) colour.
bg is the COLORREF value of the selection background colour.

UspTextOut

int UspTextOut (
  USPDATA * uspData,
  HDC hdc,
  int xpos, 
  int ypos,
  int lineHeight,
  int lineOffsetY,
  RECT * rect
);

UspTextOut is the counterpart to ScriptStringOut. It takes as input the USPDATA object which was previously analyzed, and draws the text to the specified location. Any fonts, colours and selection-highlights are applied to the text as it is drawn.

hdc is a handle to a device-context.
xpos is the x-coordinate where text-output should begin.
ypos is the y-coordinate where the text-output should begin.
lineHeight specifies the total height, in pixels, that will be occupied by each line. The text background will be filled to this extent. Applications will usually set this value to be the same as (rect.bottom - rect.top).
lineOffsetY specifies the vertical distance in pixels - relative to ypos - from which the text will be offset. This value is in addition to any y-adjustment specified by the USPFONT structures. Can be zero.
rect is the bounding rectangle beyond which clipping will occur. This parameter must be specified and at a minimum should identify the client-area rectangle of the device-context.

It is recommended to “double-buffer” the output of this function as the multi-pass rendering will result in flickering. The text-alignment mode, background-mode and colours of the device-context are unspecified on this function’s return.

UspTextOut will change in the future to support word-wrapping.
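The way the vertical parameters combine can be written out as simple arithmetic. The sketch below is a portable model under stated assumptions, not UspLib’s code: it assumes the background spans from ypos to ypos + lineHeight, and that lineOffsetY and the font’s yoffset are simply added to ypos. PlaceLine, LinePlacement and CenterOffset are hypothetical names.

```cpp
#include <cassert>

// Hypothetical result type: where one line's background and text would start.
struct LinePlacement
{
    int backgroundTop;      // first pixel row filled with background
    int backgroundBottom;   // one past the last filled pixel row
    int textTop;            // y-coordinate at which the text is drawn
};

// Assumed model: the offsets are added to ypos, the background fills
// exactly lineHeight pixels starting at ypos.
LinePlacement PlaceLine(int ypos, int lineHeight, int lineOffsetY, int fontYOffset)
{
    assert(lineHeight >= 0);

    LinePlacement p;
    p.backgroundTop    = ypos;
    p.backgroundBottom = ypos + lineHeight;
    p.textTop          = ypos + lineOffsetY + fontYOffset;
    return p;
}

// One possible use of lineOffsetY (an illustration only): centre a font
// of fontHeight pixels inside a taller line.
int CenterOffset(int lineHeight, int fontHeight)
{
    return (lineHeight - fontHeight) / 2;
}
```

For a 20-pixel line holding a 16-pixel font, CenterOffset gives an offset of 2 pixels, so text drawn at ypos = 100 would start at y = 102 while the background still covers rows 100 to 119.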

UspSnapXToOffset

BOOL UspSnapXToOffset (
  USPDATA * uspData,	
  int xpos,
  int * snappedX, // out, optional
  int * charPos, // out
  BOOL * fRTL // out, optional
);

UspSnapXToOffset converts an x-coordinate to the nearest character-offset. In addition it returns the x-coordinate of the selected character.

xpos specifies the x coordinate.
snappedX points to an integer that receives the adjusted x-coordinate.
charPos points to an integer that receives the character position corresponding to xpos.
fRTL points to a BOOL that receives the direction of the item-run corresponding to xpos. If TRUE it indicates a right-to-left run, if FALSE it indicates a left-to-right run.

The fRTL parameter is useful for the case when the text-caret’s shape is modified to reflect the reading-direction of the run of text that corresponds to xpos.
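The idea of “snapping” can be illustrated with a deliberately simplified model. The sketch below assumes plain left-to-right text in which every character has its own advance width; it ignores clusters, ligatures and right-to-left runs, all of which the real function has to deal with through Uniscribe. SnapXToOffsetModel and SnapResult are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the caret position and its x-coordinate.
struct SnapResult
{
    int charPos;    // character-offset nearest to the input x-coordinate
    int snappedX;   // x-coordinate of that character boundary
};

// Simplified left-to-right model: widths[i] is the advance width, in
// pixels, of character i. The x-coordinate is snapped to whichever
// character boundary is closer.
SnapResult SnapXToOffsetModel(const std::vector<int>& widths, int xpos)
{
    int left = 0;

    for (std::size_t i = 0; i < widths.size(); ++i)
    {
        const int right = left + widths[i];

        if (xpos < right)
        {
            // closer to the left edge: caret goes before character i,
            // otherwise it goes after it
            if (xpos - left < right - xpos)
                return { static_cast<int>(i), left };

            return { static_cast<int>(i) + 1, right };
        }

        left = right;
    }

    // beyond the end of the text: snap to the end of the line
    return { static_cast<int>(widths.size()), left };
}
```

With three characters of 10 pixels each, an x-coordinate of 3 snaps to offset 0 at x = 0, while an x-coordinate of 7 snaps to offset 1 at x = 10.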

UspXToOffset

BOOL UspXToOffset (
  USPDATA * uspData,
  int xpos,
  int * charPos, // out
  BOOL * trailing, // out
  BOOL * fRTL // out, optional
);

UspXToOffset converts an x-coordinate to a character position.

xpos specifies the x coordinate.
charPos points to a variable that receives the character position corresponding to xpos.
trailing points to a variable that receives an indicator whether the position is the leading or trailing edge of the character.
fRTL points to a variable that receives the direction of the item-run corresponding to xpos. If TRUE it indicates a right-to-left run, if FALSE it indicates a left-to-right run.

UspOffsetToX

BOOL UspOffsetToX (
  USPDATA * uspData,	
  int offset,
  BOOL trailing,
  int * xpos // out
);

UspOffsetToX returns the x-coordinate for the leading or trailing edge of a character position.

offset specifies the character position in the string.
trailing indicates the edge of the character that corresponds to the x coordinate. If TRUE it indicates the trailing edge, if FALSE it indicates the leading edge.
xpos points to a variable that receives the corresponding x coordinate for the character-offset.
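The relationship between a character position, its leading and trailing edges and the x-coordinate can be sketched with the same kind of simplified model. The code below assumes a single left-to-right run with one advance width per character, so the trailing edge of one character coincides with the leading edge of the next; this does not hold in general for right-to-left or mixed-direction text. OffsetToXModel, XToOffsetModel and HitResult are hypothetical names, not UspLib functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for the hit-test direction.
struct HitResult
{
    int  charPos;    // character containing the x-coordinate
    bool trailing;   // true if the hit was in the trailing half
};

// Left-to-right model: x-coordinate of the leading or trailing edge of
// the character at 'offset'. Out-of-range offsets are clamped.
int OffsetToXModel(const std::vector<int>& widths, int offset, bool trailing)
{
    const int count = static_cast<int>(widths.size());
    int x = 0;

    if (offset < 0)
        offset = 0;

    // sum the widths of every character before 'offset'
    for (int i = 0; i < offset && i < count; ++i)
        x += widths[static_cast<std::size_t>(i)];

    // the trailing edge additionally includes the character itself
    if (trailing && offset < count)
        x += widths[static_cast<std::size_t>(offset)];

    return x;
}

// Left-to-right model: which character contains xpos, and which half.
HitResult XToOffsetModel(const std::vector<int>& widths, int xpos)
{
    const int count = static_cast<int>(widths.size());
    int left = 0;

    for (int i = 0; i < count; ++i)
    {
        const int width = widths[static_cast<std::size_t>(i)];

        if (xpos < left + width)
            return { i, (xpos - left) * 2 >= width };

        left += width;
    }

    // past the end: report the trailing edge of the last character
    if (count == 0)
        return { 0, false };

    return { count - 1, true };
}
```

Feeding the result of XToOffsetModel back into OffsetToXModel yields the x-coordinate of the nearest character edge, which is essentially the “snap” behaviour expressed in terms of these two conversions.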

UspFree

void UspFree(USPDATA * uspData);

UspFree should be called when the specified USPDATA object is no longer required.

Changes to Neatpad

It was very straightforward to import UspLib into Neatpad’s existing codebase. However there were several changes made to key aspects of the TextView library which made this possible. These changes are briefly mentioned below.

References to NeatTextOut and NeatTextWidth have been removed, as have the functions themselves. This means that all previous tutorials that discussed drawing and painting (prior to UspLib) are effectively obsolete. Although the ideas they presented were good, the method by which styled text was drawn (successive calls to ExtTextOut) has been superseded by the UspLib library.
The existing mouse-handling code has been substantially reduced in complexity. The caret hit-testing ideas presented in previous tutorials have again been superseded by the functionality provided by UspLib.
Font-handling has been moved in part to UspLib.
Control-character rendering is fully handled by UspLib so all of the related code has been removed from the TextView.

Whilst a large amount of code has been removed from the TextView, in reality these areas of functionality have been transferred to UspLib which now handles all aspects of drawing, fonts and mouse hit-testing.

UspLib was designed primarily for use with Neatpad. However this does not mean that it cannot be used for other text-editor projects, or in fact any application that requires the use of complex, styled text. Remember, UspLib is Freeware and can be used in any project!

Problems with paragraphs

With Uniscribe (and therefore UspLib), the basic unit of text is the paragraph. For text-editors such as Neatpad, an entire line can be treated as a paragraph. This concept is important however, as it imposes a restriction on how UspLib should be used. Because whole lines must be analyzed, an entire line of text must be in memory at one time. The consequence is that we must impose a “line length” limit on text files that we load. In Neatpad, any line of text beyond a certain length will be truncated.

I’m quite pleased about this limitation actually as I wasn’t looking forward to handling arbitrarily long lines. These are just a few of the issues that long lines present:

How to apply the Unicode bi-directional algorithm with ScriptItemize when the whole line must be in memory?
What would the maximum line-length be anyway? 2Gb? 4Gb?
How would we represent x-coordinates on a line this long? The x-coordinate would overflow the limits of a 32bit integer.
How would we count the characters on a line that contained tabs? Tab-expansion could potentially push the line-length beyond 4Gb.

I don’t have any good answers to these questions so I’m happy for the moment to have a simple restriction of something like 65Kb per line. I’d like to hear any thoughts in this area though!
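To put some numbers on the x-coordinate problem, and to show what a simple truncation might look like, here is a portable sketch. Both functions are hypothetical and are not taken from Neatpad: the average character width is an assumed input, and the rule of never cutting a line between the two halves of a UTF-16 surrogate pair is a precaution added for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// How many characters of the given average pixel width fit on one line
// before a signed 32-bit x-coordinate would overflow.
std::int64_t MaxCharsBeforeOverflow(int avgCharWidthPixels)
{
    assert(avgCharWidthPixels > 0);
    return std::numeric_limits<std::int32_t>::max() / avgCharWidthPixels;
}

// Truncates a UTF-16 line to at most maxUnits code units. If the cut
// would leave a lone high surrogate at the end, that unit is dropped too.
std::u16string TruncateLine(const std::u16string& line, std::size_t maxUnits)
{
    if (line.size() <= maxUnits)
        return line;

    std::size_t keep = maxUnits;

    // high surrogates occupy the range 0xD800..0xDBFF
    if (keep > 0 && line[keep - 1] >= 0xD800 && line[keep - 1] <= 0xDBFF)
        --keep;

    return line.substr(0, keep);
}
```

With an assumed average width of 8 pixels per character, MaxCharsBeforeOverflow(8) gives 268,435,455 characters, so a 32-bit x-coordinate runs out long before a 2Gb line does.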

Caching with GetUspData

The big issue with Uniscribe is all the memory that must be allocated in order to display just a single line of text. UspLib hides this complexity behind the USPDATA object. However the memory overhead that each USPDATA imposes is quite significant:

16 bytes per glyph.
14 bytes per wide-character.
32 bytes for each item-run.
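As a rough worked example of these figures, the per-line cost can be estimated with a one-line formula. EstimateUspDataBytes is a hypothetical helper written for this illustration; it only adds up the three per-item costs quoted above and ignores any fixed overhead of the USPDATA object itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical estimate of the memory needed to hold one analyzed line,
// using the per-item costs quoted in the text.
std::size_t EstimateUspDataBytes(std::size_t glyphs,
                                 std::size_t wideChars,
                                 std::size_t itemRuns)
{
    return 16 * glyphs + 14 * wideChars + 32 * itemRuns;
}
```

A 100-character line that produces 100 glyphs in a single item-run comes to 1600 + 1400 + 32 = 3032 bytes, roughly fifteen times the 200 bytes occupied by the raw UTF-16 text.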

For a typical string of UTF-16 text we are looking at an increase of many times that of the original string length. Obviously this is far too much to be creating USPDATA objects for every line of text in a file. To solve this problem a new TextView member function was written, which manages USPDATA objects from an internal cache.

struct USP_CACHE
{
   USPDATA * uspData; // the UspLib data for this line
   ULONG lineno; // which line this refers to
   ULONG usage; // usage count for caching purposes
};

class TextView
{
   ...
   // keep an internal cache of USPDATA objects
   USP_CACHE m_uspCache[USP_CACHE_SIZE];
};

Whenever a line of text is required by the TextView (for drawing or mouse hit-testing), the GetUspData function is called. The drawing and mouse-related routines no longer directly access the underlying TextDocument. All data-access is now through this single function.

USPDATA *TextView::GetUspData(HDC hdc, ULONG nLineNo)
{
    TCHAR buff[TEXTBUFSIZE];
    ATTR attr[TEXTBUFSIZE];
    ULONG off_chars = 0; // filled in by getline
    ULONG colno = 0; // passed to ApplyTextAttributes
    int len;

    USPDATA * uspData = << find a cached object >>

    // if found a match (an already analyzed line) then return it here!!
    if(....)
        return uspData;

    // otherwise we need to style + analyze a new line
    len = m_pTextDoc->getline(nLineNo, buff, TEXTBUFSIZE, &off_chars);
    len = ApplyTextAttributes(nLineNo, off_chars, colno, buff, len, attr);

    // setup the tabs
    int tablist[] = { m_nTabWidthChars };
    SCRIPT_TABDEF tabdef = { 1, 0, tablist, 0 };
    SCRIPT_CONTROL scriptControl = { 0 }; 

    SCRIPT_STATE scriptState = { 0 };

    // generate glyphs etc
    UspAnalyze(uspData, hdc, buff, len, attr, 0, m_uspFontList,
       &scriptControl, &scriptState, &tabdef);

    return uspData;
}

The sample-code above gives the general idea for how GetUspData works. The caching details are rather boring so I’ve omitted them here (just look in the real sources). The idea behind this method though, is that any time we want a USPDATA object, GetUspData will return one ready-analyzed. Most of the time this object will be from the cache, and only occasionally will a line need to be fetched from the TextDocument and analyzed with UspAnalyze.
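For readers who would like to see the shape of such a cache without opening the sources, here is a minimal portable sketch. It is not Neatpad’s implementation: the eviction policy (reuse an empty slot, otherwise the least recently used entry), the use of the usage field as a “last used” stamp, the cache size of 4 and the names LineCache, CacheEntry and Analyze are all assumptions of this example. A std::string stands in for the analyzed USPDATA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const std::size_t CACHE_SIZE = 4;   // hypothetical, kept small for the demo

// Stand-in for USP_CACHE: a string replaces the USPDATA pointer.
struct CacheEntry
{
    bool          valid;    // does this slot hold an analyzed line?
    unsigned long lineno;   // which line this refers to
    unsigned long usage;    // stamp of the most recent access
    std::string   data;     // stands in for the analyzed line
};

class LineCache
{
public:
    LineCache() : m_clock(0), m_analyses(0)
    {
        for (CacheEntry& e : m_entries)
        {
            e.valid  = false;
            e.lineno = 0;
            e.usage  = 0;
        }
    }

    // Returns the analyzed form of a line, analyzing it only on a miss.
    const std::string& Get(unsigned long lineno)
    {
        // cache hit: refresh the usage stamp and return the stored result
        for (CacheEntry& e : m_entries)
        {
            if (e.valid && e.lineno == lineno)
            {
                e.usage = ++m_clock;
                return e.data;
            }
        }

        // cache miss: reuse an empty slot, else the least recently used one
        CacheEntry* victim = &m_entries[0];

        for (CacheEntry& e : m_entries)
        {
            if (!e.valid) { victim = &e; break; }
            if (e.usage < victim->usage) victim = &e;
        }

        victim->valid  = true;
        victim->lineno = lineno;
        victim->usage  = ++m_clock;
        victim->data   = Analyze(lineno);
        return victim->data;
    }

    // Number of times a line had to be analyzed (i.e. cache misses).
    int Analyses() const { return m_analyses; }

private:
    // Stand-in for fetching the line and calling UspAnalyze on it.
    std::string Analyze(unsigned long lineno)
    {
        ++m_analyses;
        return "analyzed line " + std::to_string(lineno);
    }

    CacheEntry    m_entries[CACHE_SIZE];
    unsigned long m_clock;
    int           m_analyses;
};
```

Repeated requests for the same few lines, which is what painting and mouse hit-testing tend to produce, are served from the cache; only a request for a line that is not cached triggers a fresh analysis and pushes out the entry that was used longest ago.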

Conclusions

The move to Uniscribe defines a turning-point in Neatpad’s development. It has taken a lot of effort to get here but the future now looks a lot clearer. In many ways I wish I had started this project with Uniscribe right from the beginning - it would have saved a lot of work. However Unicode is quite complicated and I think the beginning tutorials would have suffered from this extra complexity. Besides, I think it is good to see the evolution that has occurred since the start of this project, and also the mistakes that I have made along the way.

Overall I’ve found working with Uniscribe to be a very rewarding experience. The API itself is rather complicated but it is very well designed. The main difficulty is coming to terms with the concept of glyph-based rendering. However I do find the MSDN documentation for Uniscribe to be rather inadequate in places. As someone who had no prior experience in displaying Unicode text, I struggled for quite some time before finally completing this phase of the project.

As a comparison, take a look at the Apple documentation for ATSUI (an equivalent API to Uniscribe but higher-level). The documentation is much clearer in my opinion - it doesn’t just document the ATSUI API but gives guidelines on how it should be used on the Apple system.

Coming up in Part 16

There are still some minor “todo’s” with UspLib which I haven’t quite managed to finish. The issue of CRLF sequences at the end of a line of text needs addressing for bi-directional texts. Sometimes the CRLF will not be on the far-right of a line - for RTL texts it can be on the left-side, or even in the middle of the line! The other issue is properly displaying the file with full right-to-left alignment, with the scrollbar positioned on the left.

The next tutorial will look at adding keyboard support to the TextView. We will focus only on caret-movement with the keyboard, as actual text-entry must wait until the TextDocument can actually edit text. The caret-movement code will be using Uniscribe’s ScriptBreak routine, which will probably result in a couple more UspLib functions being added.

Beyond this I will probably tackle syntax highlighting, and once the GUI is completely finished I will finally move onto file-editing. The end is getting a lot closer now I feel!

Downloads

neatpad15.zip