Unicode Text Processing



The last tutorial presented an overview of the various encoding formats that are used to store Unicode text. It is now time to take this theory and apply it to Neatpad. Therefore the subject of this article will be Unicode text processing.

The image above shows Neatpad's new Encoding menu option, with a UTF-8 file displayed in all its glory. The downloads for this tutorial include a collection of Unicode files which you can use to test Neatpad's Unicode capability.

Loading text files

Previous incarnations of Neatpad supported a single text encoding - plain ASCII text. A Unicode text editor must naturally support the various Unicode file-formats so our first step will be to modify the TextDocument’s init() function to detect what type of file we are opening.

Of course it isn't possible to detect what type of encoding a text-file uses until we actually open the file and read the first few bytes. We will use what Unicode terms the "Byte Order Mark" - a specific sequence of bytes which can only appear at the start of a Unicode text file, and which, if present, identifies the exact encoding method used to save the file.

Byte Signature    Unicode Format           Neatpad Format
(none)            Plain ASCII/ANSI         NCP_ASCII
EF BB BF          UTF-8                    NCP_UTF8
FF FE             UTF-16, little-endian    NCP_UTF16
FE FF             UTF-16, big-endian       NCP_UTF16BE
FF FE 00 00       UTF-32, little-endian    NCP_UTF32
00 00 FE FF       UTF-32, big-endian       NCP_UTF32BE

Therefore a new function has been added to the TextDocument - TextDocument::detect_file_format - whose purpose is to detect the format of the text-file as it is being loaded during TextDocument::init. In the absence of any file-signature we will assume that the file contents are plain ASCII/ANSI text.

int TextDocument::detect_file_format(int *headersize);

This function's sole task is to analyse the first few bytes of a file and compare them against the Byte-Order-Mark values defined in the table above. It is literally a matter of performing a series of memcmp calls until we match a format. The detect_file_format function returns an appropriate NCP_xxx value (Neatpad Codepage) to indicate what type of file is being processed.

The file’s text-format is stored internally by the TextDocument (in member-variable fileformat). The length of the Byte-Order-Mark header is also saved away in the headersize member-variable - so that we can always identify the start of the real content no matter what type of file we are loading.
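
As a rough sketch of what this boils down to (filesize is an assumed name for the mapped file's length; the real routine may be structured differently), note that the longer signatures must be tested first, because UTF-32's FF FE 00 00 begins with UTF-16's FF FE:

int TextDocument::detect_file_format(int *headersize)
{
    // test the longest signatures first - "FF FE 00 00" would
    // otherwise be mistaken for the UTF-16 "FF FE" signature
    if (filesize >= 4 && memcmp(buffer, "\xFF\xFE\x00\x00", 4) == 0)
        { *headersize = 4; return NCP_UTF32;   }

    if (filesize >= 4 && memcmp(buffer, "\x00\x00\xFE\xFF", 4) == 0)
        { *headersize = 4; return NCP_UTF32BE; }

    if (filesize >= 3 && memcmp(buffer, "\xEF\xBB\xBF", 3) == 0)
        { *headersize = 3; return NCP_UTF8;    }

    if (filesize >= 2 && memcmp(buffer, "\xFF\xFE", 2) == 0)
        { *headersize = 2; return NCP_UTF16;   }

    if (filesize >= 2 && memcmp(buffer, "\xFE\xFF", 2) == 0)
        { *headersize = 2; return NCP_UTF16BE; }

    // no Byte-Order-Mark - assume plain ASCII/ANSI
    *headersize = 0;
    return NCP_ASCII;
}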

Internal text representations

Most text-editors (such as Notepad) will load an entire text-file into memory. No matter what the underlying file format (i.e. ASCII, UTF-8 or UTF-16), the contents will be converted to an internal format to make it easier to work with. For Windows programs, this is usually (but not always) the native UTF-16 format of Windows NT. This makes sense because all of the text-based Windows APIs are designed to handle UTF-16/UCS2.

This is a great way to structure a program because you maintain one set of source-code for the main editor (which interfaces directly with the OS's text routines), and then write a set of simple file I/O conversion routines which load and save each of your supported formats. The editor is kept very simple because the text it processes is always in one format. When it comes to saving a file in its original format, the entire text is converted back again.

Of course this method can require large amounts of memory because the entire file must be loaded at once. In order to support our goal of a multi-gigabyte text editor we must leave the file in its "raw" state and only map specific parts of it into memory as required - much like the HexEdit program on this site.

However this leaves us with a problem - how do we handle many different forms of text within the same program but still keep a single code-base which is not over-complicated by the various encodings it must process? Here are the two basic strategies available:

  1. Write separate versions of the TextView/TextDocument for each specific file format. We would then create a specific instance of TextView (i.e. TextViewUtf8 / TextViewUtf16) depending on what type of file we encountered. We could potentially use macros / C++ templates to make our lives easier, but I believe this method would be a code-maintenance nightmare. Avoid at all costs!
  2. Write a generic TextView which always handles text in the "native" format (i.e. UTF-16 for Windows). The TextView would have no knowledge of the underlying file-format, and it would be up to the TextDocument to convert the underlying file-format into UTF-16 as the TextView requests it.

I think that method #2 will provide the greatest flexibility and, with careful design, should work well for Neatpad.

Generic text processing

The idea behind a “generic” design is that the TextView always gets to see and process UTF-16 text (i.e. standard wide-character Unicode strings). It is completely unaware that the underlying file the TextDocument is reading is anything other than UTF-16 text. This means that whenever the TextView asks for text to display, it is up to the TextDocument to translate (if necessary) the underlying file contents into UTF-16 (i.e. on the fly in realtime).

The TextDocument on the other hand understands “all” types of file-format. It knows how to read the various encodings that we will support - so this would be ASCII, UTF-8 and UTF-16.

I feel that this type of design will suit Neatpad very well. Because the user-interface (the TextView) has the potential to be so complicated, it is very important to try and isolate all of the text-conversion problems into one place so that we only have to worry about it once. It also has the advantage that we could add further text-formats to the TextDocument (i.e. UTF-32) and the TextView would never have to be modified. The TextView should only care about UTF-16.

Two coordinate systems

Deciding to move to this “generic” text model has introduced a major problem, because we now have two coordinate systems to consider - one for the TextView, and one for the TextDocument. At this point we could just say “we’ll support UTF-16 for the moment and add UTF-8 later on” - but this would be a mistake. The design of a “single format” editor is very different to an editor that must handle arbitrary file-formats and we must move to a more generic design or this will cause us even bigger problems later on.

So, we have decided that the TextView will work exclusively in UTF-16 units. This is a good thing. It basically means that the entire “user-interface” to the TextView control is in the Native Windows Unicode format. Don’t underestimate how important this is. We haven’t progressed this far yet, but try to imagine a “user” of the TextView control (i.e. a programmer) using it in a C++ project:

This programmer's project will naturally be Unicode and all text operations will therefore also be UTF-16. TextView operations such as cursor positioning, selection management, getting and setting text at specific offsets, searching for text etc. must be UTF-16 also. The user/programmer doesn't care what the underlying format of the text-file is; all they see of the world is UTF-16, and all operations must match this view of the world. Therefore our cursor offsets and selection offsets - our entire coordinate "front end" to the control - must be UTF-16 based. This is where we hit our problem though:

The TextDocument has a different view of things. It must work with arbitrary file formats and it won't know - until the file is loaded - what format a text-file will be in. It could be dealing in single-byte formats (ASCII), multi-byte formats (UTF-8) or wide-character formats (UTF-16). The TextDocument must use a coordinate-system that is common to all these formats. Of course this will be a byte-oriented system - so all line-offsets and text-accesses must be byte-based.

To try and illustrate this TextView/TextDocument divide, let’s look at a quick example of some Unicode text:

U+0041 LATIN CAPITAL LETTER A • U+06AF ARABIC LETTER GAF • U+16D4 RUNIC LETTER DOTTED-P • U+10416 DESERET CAPITAL LETTER JEE

The text above is just a random collection of four Unicode characters - with the Unicode code-point values listed to the side. To see how these characters relate to Neatpad, we will imagine that the text above has been encoded as UTF-8 and loaded into Neatpad. The TextDocument would therefore be working in UTF-8 multi-byte units:
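
The table below shows both views of the same four characters - the multi-byte UTF-8 sequences stored in the file, and the UTF-16 code-units the TextView receives (note the surrogate pair for U+10416):

Character                            UTF-8 bytes (file)   UTF-16 units (TextView)
U+0041  LATIN CAPITAL LETTER A       41                   0041
U+06AF  ARABIC LETTER GAF            DA AF                06AF
U+16D4  RUNIC LETTER DOTTED-P        E1 9B 94             16D4
U+10416 DESERET CAPITAL LETTER JEE   F0 90 90 96          D801 DC16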

The TextView of course sees the file as UTF-16. Hopefully this breakdown illustrates just how separated the TextView has become from the underlying file. Apart from the first character ('A'), the raw data that the TextView sees is completely different from how it appears on disk. Remember, all of this conversion happens in real time, not during the file-loading process.

But we still haven’t solved our problem. The TextView speaks UTF-16 and the TextDocument speaks byte-offsets. We need to devise some kind of mechanism to perform mappings between UTF-16 offsets (i.e. code-unit offsets) and the underlying file-content (whatever that may be). This task will fall to the TextDocument, and it will be the line-offset buffer that will be doing all the hard work.

Reading Unicode data

The decision to make the TextView UTF-16 only means that our TextDocument::getline routine must change. Remember that this is the main “gateway” between the TextView and the TextDocument. Let’s look at what we had before:

ULONG TextDocument::getline(ULONG lineno, ULONG offset, char *buf, size_t len, ULONG *fileoff)

The TextDocument::getline routine basically returns a block of text from the specified line - and always returns this text as plain ANSI. Two things are going to change here. Obviously the text-type must change from char* to wchar_t* if we want to support Unicode. This change has been achieved by converting all char* types to TCHAR* on a project-wide basis and creating a separate Unicode build.

The second change is to move away from a line-oriented text-retrieval model. What we have now is a getline replacement - called TextDocument::gettext. The purpose of this new routine is to return UTF-16 text from the specified byte offset within the current file:

int TextDocument::gettext(ULONG offset, ULONG maxbytes, TCHAR *buf, int *buflen)

No matter what the underlying text-format, this routine will always return UTF-16 data (for a Unicode build). The text is stored in the buf parameter, and the number of “characters” stored in buf is returned in the *buflen parameter.

TCHAR buf[200];
int   buflen = 200;
ULONG len_bytes;

// read a block of text as UTF-16 from the current byte position
// (off_bytes and max_bytes track our place in the underlying file)
len_bytes = textDoc->gettext(off_bytes, max_bytes, buf, &buflen);

// adjust the byte offsets ready for the next read
off_bytes += len_bytes;
max_bytes -= len_bytes;

Most importantly though, the number of bytes that were processed from the underlying file is returned from the function directly - i.e. the return value represents the number of ASCII/UTF-8/UTF-16 bytes that were processed during the conversion to UTF-16. This is required so that we can keep track of the “byte position” in the underlying file - to allow us to continue reading blocks of UTF-16 in an iterative fashion.

Even though the TextView will be reading UTF-16 data (and using UTF-16 based offsets for cursor positioning etc), we must access the underlying file using byte-offsets. This is to make the text-retrieval a direct form of access to the underlying file, converting whatever data happens to be at the byte-offsets into UTF-16. If we used UTF-16 coordinates to access the file content, we would have to convert this character-offset to a byte-offset by performing lengthy processing.

The new TextDocument::gettext function is a little more complicated than what we had before:

int TextDocument::gettext(ULONG offset, ULONG lenbytes, WCHAR *buf, int *buflen)
{
    BYTE *rawdata = buffer + headersize + offset;

    switch(fileformat)
    {
    case NCP_ASCII:
        return ascii_to_utf16(rawdata, lenbytes, buf, buflen);

    case NCP_UTF8:
        return utf8_to_utf16(rawdata, lenbytes, buf, buflen);

    case NCP_UTF16:
        return copy_utf16(rawdata, lenbytes/sizeof(WCHAR), buf, buflen);

    case NCP_UTF16BE:
        return swap_utf16(rawdata, lenbytes/sizeof(WCHAR), buf, buflen);

    default:
        // unsupported format - return no text at all
        *buflen = 0;
        return 0;
    }
}

We must use the TextDocument::fileformat member-variable to decide how to convert the underlying text into UTF-16. Notice that there is one conversion routine for each type of text that we will support.

One thing I should mention which isn't detailed here is the actual conversion process to UTF-16. We must be very careful never to accidentally "split apart" UTF-16 surrogate pairs during conversion. This could potentially happen when converting from UTF-8 if we run out of buffer space to store both halves of a surrogate pair. The conversion routines all make sure that surrogate pairs are kept together.

Problems with MultiByteToWideChar

You may have noticed that I have written my own Unicode-conversion routines in the TextDocument::gettext function. I really wanted to use the MultiByteToWideChar API to perform all conversions to UTF-16. Unfortunately nothing is that simple. Although MultiByteToWideChar is good at converting valid UTF-8 data, it is not so good when it comes to invalid text-sequences (such as malformed or overlong sequences).

When it comes to processing this type of data, the preferred behaviour for a text-editor is to indicate invalid sequences of UTF-8/UTF-16 by substituting a special Unicode character - U+FFFD REPLACEMENT CHARACTER. The problem with MultiByteToWideChar is that it doesn't perform this substitution for invalid sequences - it just returns a failure, without telling you how many characters were invalid. This makes it impossible to restart the conversion process, because you don't know where to restart from.

By writing my own routines I was able to process both valid and invalid data in a manner that is more suitable for text-stream processing - i.e. more suitable for a text editor.
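
To give a flavour of what these routines involve, here is a simplified sketch of a UTF-8 to UTF-16 converter which substitutes U+FFFD for invalid sequences and never splits a surrogate pair. It omits the overlong-sequence and codepoint-range checks a complete converter needs, and Neatpad's actual routines may differ:

#include <windows.h>

int utf8_to_utf16(const BYTE *src, size_t srclen, WCHAR *dest, int *destlen)
{
    const BYTE *ptr    = src;
    const BYTE *end    = src + srclen;
    WCHAR      *out    = dest;
    WCHAR      *outend = dest + *destlen;

    while (ptr < end && out < outend)
    {
        DWORD ch = *ptr;
        int   trail;

        // the lead byte tells us how many continuation bytes follow
        if      (ch < 0x80)           trail = 0;
        else if ((ch & 0xE0) == 0xC0) { ch &= 0x1F; trail = 1; }
        else if ((ch & 0xF0) == 0xE0) { ch &= 0x0F; trail = 2; }
        else if ((ch & 0xF8) == 0xF0) { ch &= 0x07; trail = 3; }
        else { *out++ = 0xFFFD; ptr++; continue; }   // invalid lead byte

        // stop cleanly if the sequence runs past the end of the buffer
        if (ptr + trail >= end)
            break;

        // accumulate the continuation bytes
        BOOL valid = TRUE;
        for (int i = 1; i <= trail; i++)
        {
            if ((ptr[i] & 0xC0) != 0x80) { valid = FALSE; break; }
            ch = (ch << 6) | (ptr[i] & 0x3F);
        }

        if (!valid)
        {
            *out++ = 0xFFFD;    // substitute the replacement character
            ptr++;              // resynchronize on the very next byte
            continue;
        }

        if (ch >= 0x10000)
        {
            // never split a surrogate pair: stop if only one slot remains
            if (out + 2 > outend)
                break;

            ch -= 0x10000;
            *out++ = (WCHAR)(0xD800 + (ch >> 10));
            *out++ = (WCHAR)(0xDC00 + (ch & 0x3FF));
        }
        else
        {
            *out++ = (WCHAR)ch;
        }

        ptr += trail + 1;
    }

    *destlen = (int)(out - dest);   // UTF-16 units produced
    return    (int)(ptr - src);     // source bytes consumed
}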

Line Buffer Management

Changing to Unicode and the “double coordinate system” means that the line-buffer scheme we developed earlier in the series needs revisiting. I’m not going to go into too much detail here because I know full well that I will be changing it yet again when we come to adding “gigabyte file support” later in the series. But this needs some discussion right now so here goes:

The line-buffer in Neatpad serves two purposes. Firstly, it provides a method to quickly locate a line of text’s physical location within a file. This provides a kind of “random access” to the file content. The second purpose of the line buffer is to perform the reverse operation - i.e. given a physical “cursor offset”, work out which line contains this offset.
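
The reverse lookup is normally implemented as a binary search over the sorted array of line-start offsets. Here is a minimal sketch - the lineno_from_offset function is hypothetical, and linebuffer and numlines are assumed member names:

ULONG TextDocument::lineno_from_offset(ULONG offset)
{
    ULONG low = 0, high = numlines;

    // invariant: linebuffer[low] <= offset and (high == numlines or
    // linebuffer[high] > offset), so 'low' converges on the answer
    while (low + 1 < high)
    {
        ULONG mid = (low + high) / 2;

        if (linebuffer[mid] <= offset)
            low = mid;
        else
            high = mid;
    }

    return low;
}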

Now that the coordinate system of the TextView is UTF-16, we need to rethink the design of the line-buffer. The problem is, we still need to know where lines of text are physically located within a file, so we can't just change the line-buffer to UTF-16 coordinates. Of course a "single-format" editor such as Notepad could go down this route, but because we must support multiple file-formats we need to be dealing with real, physical locations.

What I have done is add a second line-buffer to the TextDocument - adjacent to the original “byte-based” line-buffer. So the original line-buffer still holds real, physical byte-offsets of each line’s starting position within the file. The new line-buffer records each line’s starting position, however this time it stores the information as UTF-16 offsets (character positions) rather than byte-offsets. Even if the underlying file is UTF-8, this second line-buffer stores each line’s offset as if it were encoded as UTF-16.

The example below should hopefully illustrate what I am trying to explain.
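
Suppose a UTF-8 file contains the two lines "Aγ" and "€x", each terminated by CR/LF. γ (U+03B3) occupies two bytes in UTF-8 and € (U+20AC) three, yet each is a single UTF-16 unit, so the two line-buffers diverge immediately:

File bytes (UTF-8):  41  CE B3  0D  0A  E2 82 AC  78  0D  0A
                     A   γ      \r  \n  €         x   \r  \n

Line    Byte offset (buffer 1)    UTF-16 offset (buffer 2)
0       0                         0
1       5                         4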

Moving to Unicode has introduced yet another problem: exactly how do we initialize the line-buffer(s) with these "line start" offsets now that there are all these extra formats to handle? Previously the method used to search for CR/LF combinations was a simple byte search. Unfortunately this method will no longer be sufficient for our multi-format text editor:

We can't do a byte-search for "\r" and "\n" characters and expect it to work any more. There is no problem with ASCII and UTF-8 (they are still byte-based and the CR/LF bytes are no different), but UTF-16 presents a challenge. The CR/LF byte sequence (0x0D followed by 0x0A) is actually "U+0D0A MALAYALAM LETTER UU" when interpreted as a single big-endian UTF-16 code-unit. We must search specifically for the 16-bit values U+000D and U+000A when processing a UTF-16 text file.

Unicode also defines its own line-breaking and paragraph-breaking codepoints: U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. However the convention in text-files is still to use CR/LF sequences, so we really must support all of these conventions.

We have two options for parsing lines of text and building the line-buffers. The first is to write separate routines - one for each format we will support. Although this might be the most efficient approach in terms of processing-speed, it is definitely not the most efficient in terms of code-maintenance. Perhaps when Neatpad is complete I will look at this approach, but for now I prefer the following method:

Quite simply, the better method for the time being is to use a generic line-parsing routine. The TextDocument::init_linebuffer function remains intact, and still processes the file on a character-by-character basis, searching for CR/LF sequences. The difference is that the file is now converted into a stream of UTF-32 characters as it is scanned, which enables us to handle all forms of text with a single routine, as the sketch below shows.
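
Here is a sketch of the idea. The getchar32 helper is hypothetical - something that uses fileformat to decode one UTF-32 character from the given byte offset and returns the number of bytes consumed - and linebuf_byte, linebuf_char and numlines are assumed names for the two line-buffers and the line count:

ULONG offset_bytes = 0;     // position within the raw file
ULONG offset_chars = 0;     // the same position in UTF-16 units
DWORD ch;
int   len;

// line 0 always starts at offset zero in both coordinate systems
linebuf_byte[numlines] = 0;
linebuf_char[numlines] = 0;
numlines++;

while ((len = getchar32(offset_bytes, &ch)) > 0)
{
    offset_bytes += len;
    offset_chars += (ch >= 0x10000) ? 2 : 1;   // surrogate pairs occupy two UTF-16 units

    if (ch == '\r')
    {
        // consume the LF of a CR/LF pair as part of the same line-break
        DWORD ch2;
        int   len2 = getchar32(offset_bytes, &ch2);

        if (len2 > 0 && ch2 == '\n')
        {
            offset_bytes += len2;
            offset_chars += 1;
        }
    }

    // CR, LF, U+2028 and U+2029 all terminate a line
    if (ch == '\r' || ch == '\n' || ch == 0x2028 || ch == 0x2029)
    {
        // record the start of the following line in both coordinate systems
        linebuf_byte[numlines] = offset_bytes;
        linebuf_char[numlines] = offset_chars;
        numlines++;
    }
}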

Text Iteration

You may be thinking that this is all getting quite complicated now - and you’d be right, it is! The main complication arises (as we already know) because the TextView deals in UTF-16 character offsets, whereas the TextDocument deals in byte-offsets. Although the TextView always retrieves UTF-16 text from the TextDocument, it must still do so using byte-offsets. This is not a terribly neat solution.

To solve the problem I have introduced a third C++ class called TextIterator. The purpose of this class is to provide a “bridge” between the coordinate system of the TextView and the underlying file-format that the TextDocument understands. This means that the TextView no longer asks the Document directly for text - all text retrieval now goes through the Iterator.

class TextIterator
{
    friend class TextDocument;

public:
    int gettext(WCHAR *buf, int len);

private:
    // only "friends" of the TextIterator can create them
    TextIterator(ULONG off, ULONG len, TextDocument *doc);

    // keep track of position within the specified TextDocument
    TextDocument * text_doc;
    ULONG off_bytes;
    ULONG len_bytes;
};

As you can see the class definition for a TextIterator is very simple. It keeps track of the TextDocument that it is being used for, and the byte-offset within the document. These values are set when the TextIterator is constructed. The only code which actually does anything useful is shown below in the TextIterator::gettext function.

int TextIterator::gettext(WCHAR *buf, int buflen)
{
    // get text from the TextDocument at the specified byte-position
    int len = text_doc->gettext(off_bytes, len_bytes, buf, &buflen);

    // adjust the iterator's internal position
    off_bytes += len;
    len_bytes -= len;

    return buflen;
}

The TextIterator basically encapsulates the byte-based fileoffset details and hides them from the TextView. A single function has also been added to the TextDocument which is used to start a line-iteration:

TextIterator TextDocument::iterate_line(ULONG lineno, ULONG *linestart, ULONG *linelen)
{
    ULONG offset_bytes;
    ULONG length_bytes;

    lineinfo_from_lineno(lineno, linestart, linelen, &offset_bytes, &length_bytes);

    return TextIterator(offset_bytes, length_bytes, this);
}

The iterate_line function returns an independent TextIterator object which can then be used to access the file's text in a transparent manner. An example of text-iteration using this new class is shown below:

ULONG linestart, linelen;
TextIterator itor = m_pTextDoc->iterate_line(100, &linestart, &linelen);

WCHAR buf[200];
int len;

len = itor.gettext(buf, 200);

You can see how simple the process is now. The TextView now accesses the file-content through the TextIterator. Everything is line/character offset based as far as the TextView is concerned. The nasty byte-offsets and conversion details are hidden away in the Iterator and TextDocument, which is exactly how we want it.

In all probability I will end up changing the design yet again when some other issue crops up (I am expecting headaches with bidirectional text and complex scripts), but for the time being the TextView/TextIterator/TextDocument design that I have outlined here seems to work pretty well.

Additions to Neatpad

A quick mention of some changes to the actual Neatpad application. I have added three things. The first is command-line support. It is now possible to specify a text-file on the command-line (just like with Notepad) and the file opens automatically when Neatpad starts.

The second addition is shell-menu support. There is a new setting in Neatpad’s options to add an entry to Explorer’s shell context menu for all filetypes, enabling you to right-click any file and select “Open with Neatpad”. I always add this entry for Notepad when I build a new system and having the same (automatic) feature for Neatpad will be very useful in my opinion.
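
For the curious, registering such a verb boils down to writing a command string under HKEY_CLASSES_ROOT\*\shell. A minimal sketch - the verb name and the lack of error handling are simplifications, and Neatpad's actual settings code may differ:

#include <windows.h>
#include <tchar.h>

BOOL RegisterShellVerb()
{
    HKEY  hKey;
    TCHAR exe[MAX_PATH];
    TCHAR cmd[MAX_PATH + 8];

    // build the command-line: "C:\path\Neatpad.exe" "%1"
    GetModuleFileName(NULL, exe, MAX_PATH);
    wsprintf(cmd, _T("\"%s\" \"%%1\""), exe);

    // HKCR\*\shell\<verb>\command applies the verb to every filetype
    if (RegCreateKeyEx(HKEY_CLASSES_ROOT,
                       _T("*\\shell\\Open with Neatpad\\command"),
                       0, NULL, 0, KEY_WRITE, NULL, &hKey, NULL) != ERROR_SUCCESS)
    {
        return FALSE;
    }

    // the key's default value holds the command to execute
    RegSetValueEx(hKey, NULL, 0, REG_SZ, (BYTE *)cmd,
                  (lstrlen(cmd) + 1) * sizeof(TCHAR));

    RegCloseKey(hKey);
    return TRUE;
}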

The last addition is window-position persistence. You may have noticed that Notepad saves its window-position each time it exits, so that the next time it starts the window is restored to the saved position. I have gone one step further than this - Neatpad saves the window position for individual files rather than for the application as a whole. This means that you can open and close different files in Neatpad and each remembers its own position on screen.

The way I have done this is to use NTFS Alternate Data Streams. I have been dying to find a use for "NTFS Streams" since they first appeared in Windows NT, and I believe I have found the perfect use for them. Each time a file is opened, an NTFS stream attached to the file (called Neatpad.WinPos) is opened as well. A WINDOWPLACEMENT structure is saved in this stream - so when a file is opened, the SetWindowPlacement API is called using the saved structure. And when a file is closed, Neatpad's current window position is retrieved with GetWindowPlacement and saved back into the main file's Neatpad.WinPos stream.
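
Writing to an alternate data stream is just normal file I/O against a "filename:streamname" path. A minimal sketch, using a hypothetical SaveWindowPos helper with the error handling trimmed:

#include <windows.h>
#include <tchar.h>

BOOL SaveWindowPos(HWND hwnd, LPCTSTR filename)
{
    TCHAR  stream[MAX_PATH + 32];
    HANDLE hFile;
    DWORD  written;
    WINDOWPLACEMENT wp = { sizeof(wp) };

    GetWindowPlacement(hwnd, &wp);

    // an alternate data stream is addressed as "filename:streamname"
    wsprintf(stream, _T("%s:Neatpad.WinPos"), filename);

    hFile = CreateFile(stream, GENERIC_WRITE, 0, NULL,
                       CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hFile == INVALID_HANDLE_VALUE)
        return FALSE;       // e.g. the file isn't on an NTFS volume

    WriteFile(hFile, &wp, sizeof(wp), &written, NULL);
    CloseHandle(hFile);

    return written == sizeof(wp);
}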

Coming up in Part 10

The subject of Unicode has proven to be rather difficult to solve. In fact I went through several rewrites of the TextView/TextDocument classes before I arrived at the solution I've presented here. This is just one of the reasons why it took such a long time to get right - the other reason being that I had to do a lot of background reading to make sure I understood all of the issues surrounding Unicode before I started.

I will mention again the book “Unicode Demystified” by Richard Gillam - this book is well worth a read and covers many more Unicode topics than I can present here. Although it was written for Unicode 3.0 don’t let this put you off - the changes between Unicode 3.0 and 4.0 are fairly minimal and are basically just things like additions to the character repertoire.

Moving on to Part 10. The next tutorial will focus on the proper display of Unicode text. Understand that at the moment, all I have really done is turn Neatpad into a "wide-character" text viewer which happens to support the UTF-8 and UTF-16 encoding formats. Although we are now using the Unicode Windows APIs (specifically TextOutW) we are still a long way off being a real Unicode editor. Complex scripts, combining characters and bidirectional text are not supported yet. If you thought displaying Unicode text was a matter of simply calling TextOutW then think again - Unicode text display is a very complicated problem which cannot be solved using TextOut on its own.

The next tutorial will therefore focus on the Uniscribe API. This API (available since Windows 2000) provides support for displaying complex-scripts and bidirectional text. We will have to redesign Neatpad’s text-display engine slightly because of the way Uniscribe works, and also modify the mouse-input and selection routines, but hopefully after the next tutorial we will be in a very good position in terms of Unicode support.


Downloads
neatpad9.zip