From aa5ccb01a59901cb15a25995b70a7f49d2b03b57 Mon Sep 17 00:00:00 2001 From: Eli Zaretskii Date: Sun, 6 Apr 2014 18:56:01 +0300 Subject: [PATCH] src/bidi.c: Describe the design of reordering engine in the commentary. --- src/bidi.c | 190 +++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 186 insertions(+), 4 deletions(-) diff --git a/src/bidi.c b/src/bidi.c index b96cc24bbd1..53c2dad1b6b 100644 --- a/src/bidi.c +++ b/src/bidi.c @@ -22,9 +22,16 @@ along with GNU Emacs. If not, see . */ A sequential implementation of the Unicode Bidirectional algorithm, (UBA) as per UAX#9, a part of the Unicode Standard. - Unlike the reference and most other implementations, this one is - designed to be called once for every character in the buffer or - string. + Unlike the Reference Implementation and most other implementations, + this one is designed to be called once for every character in the + buffer or string. That way, we can leave intact the design of the + Emacs display engine, whereby an iterator object is used to + traverse buffer or string text character by character, and generate + the necessary data for displaying each character in 'struct glyph' + objects. (See xdisp.c for the details of that iteration.) The + functions on this file replace the original linear iteration in the + logical order of the text with a non-linear iteration in the visual + order, i.e. in the order characters should be shown on display. The main entry point is bidi_move_to_visually_next. Each time it is called, it finds the next character in the visual order, and @@ -52,7 +59,182 @@ along with GNU Emacs. If not, see . */ A note about references to UAX#9 rules: if the reference says something like "X9/Retaining", it means that you need to refer to rule X9 and to its modifications described in the "Implementation - Notes" section of UAX#9, under "Retaining Format Codes". */ + Notes" section of UAX#9, under "Retaining Format Codes". + + Here's the overview of the design of the reordering engine + implemented by this file. + + Basic implementation structure + ------------------------------ + + The sequential processing steps described by UAX#9 are implemented + as recursive levels of processing, all of which examine the next + character in the logical order. This hierarchy of processing looks + as follows, from the innermost (deepest) to the outermost level, + omitting some subroutines used by each level: + + bidi_fetch_char -- fetch next character + bidi_resolve_explicit -- resolve explicit levels and directions + bidi_resolve_weak -- resolve weak types + bidi_resolve_neutral -- resolve neutral types + bidi_level_of_next_char -- resolve implicit levels + + Each level calls the level below it, and works on the result + returned by the lower level, including all of its sub-levels. + + Unlike all the levels below it, bidi_level_of_next_char can return + the information about either the next or previous character in the + logical order, depending on the current direction of scanning the + buffer or string. For the next character, it calls all the levels + below it; for the previous character, it uses the cache, described + below. + + Thus, the result of calling bidi_level_of_next_char is the resolved + level of the next or the previous character in the logical order. + Based on this information, the function bidi_move_to_visually_next + finds the next character in the visual order and updates the + direction in which the buffer is scanned, either forward or + backward, to find the next character to be displayed. (Text is + scanned backwards when it needs to be reversed for display, i.e. if + the visual order is the inverse of the logical order.) This + implements the last, reordering steps of the UBA, by successively + calling bidi_level_of_next_char until the character of the required + embedding level is found; the scan direction is dynamically updated + as a side effect. See the commentary before the 'while' loop in + bidi_move_to_visually_next, for the details. + + Fetching characters + ------------------- + + In a nutshell, fetching the next character boils down to calling + STRING_CHAR_AND_LENGTH, passing it the address of a buffer or + string position. See bidi_fetch_char. However, if the next + character is "covered" by a display property of some kind, + bidi_fetch_char returns the u+FFFC "object replacement character" + that represents the entire run of text covered by the display + property. (The ch_len and nchars members of 'struct bidi_it' + reflect the length in bytes and characters of that text.) This is + so we reorder text on both sides of the display property as + appropriate for an image or embedded string. Similarly, text + covered by a display spec of the form '(space ...)', is replaced + with the u+2029 paragraph separator character, so such display + specs produce the same effect as a TAB under UBA. Both these + special characters are not actually displayed -- the display + property is displayed instead -- but just used to compute the + embedding level of the surrounding text so as to produce the + required effect. + + Bidi iterator states + -------------------- + + The UBA is highly context dependent in some of its parts, + i.e. results of processing a character can generally depend on + characters very far away. The UAX#9 description of the UBA + prescribes a stateful processing of each character, whereby the + results of this processing depend on various state variables, such + as the current embedding level, level stack, and directional + override status. In addition, the UAX#9 description includes many + passages like this (from rule W2 in this case): + + Search backward from each instance of a European number until the + first strong type (R, L, AL, or sos) is found. If an AL is found, + change the type of the European number to Arabic number. + + To support this, we use a bidi iterator object, 'struct bidi_it', + which is a sub-structure of 'struct it' used by xdisp.c (see + dispextern.h for the definition of both of these structures). The + bidi iterator holds the entire state of the iteration required by + the UBA, and is updated as the text is traversed. In particular, + the embedding level of the current character being resolved is + recorded in the iterator state. To avoid costly searches backward + in support of rules like W2 above, the necessary character types + are also recorded in the iterator state as they are found during + the forward scan, and then used when such rules need to be applied. + (Forward scans cannot be avoided in this way; they need to be + performed at least once, and the results recorded in the iterator + state, to be reused until the forward scan oversteps the recorded + position.) + + In this manner, the iterator state acts as a mini-cache of + contextual information required for resolving the level of the + current character by various UBA rules. + + Caching of bidi iterator states + ------------------------------- + + As described above, the reordering engine uses the information + recorded in the bidi iterator state in order to resolve the + embedding level of the current character. When the reordering + engine needs to process the next character in the logical order, it + fetches it and applies to it all the UBA levels, updating the + iterator state as it goes. But when the buffer or string is + scanned backwards, i.e. in the reverse order of buffer/string + positions, the scanned characters were already processed during the + preceding forward scan (see bidi_find_other_level_edge). To avoid + costly re-processing of characters that were already processed + during the forward scan, the iterator states computed while + scanning forward are cached. + + The cache is just a linear array of 'struct bidi_it' objects, which + is dynamically allocated and reallocated as needed, since the size + of the cache depends on the text being processed. We only need the + cache while processing embedded levels higher than the base + paragraph embedding level, because these higher levels require + changes in scan direction. Therefore, as soon as we are back to + the base embedding level, we can free the cache; see the calls to + bidi_cache_reset and bidi_cache_shrink, for the conditions to do + this. + + The cache maintains the index of the next unused cache slot -- this + is where the next iterator state will be cached. The function + bidi_cache_iterator_state saves an instance of the state in the + cache and increments the unused slot index. The companion function + bidi_cache_find looks up a cached state that corresponds to a given + buffer/string position. All of the cached states must correspond + 1:1 to the buffer or string region whose processing they reflect; + bidi.c will abort if it finds cache slots that violate this 1:1 + correspondence. + + When the parent iterator 'struct it' is pushed (see push_it in + xdisp.c) to pause the current iteration and start iterating over a + different object (e.g., a 'display' string that covers some buffer + text), the bidi iterator cache needs to be "pushed" as well, so + that a new empty cache could be used while iterating over the new + object. Later, when the new object is exhausted, and xdisp.c calls + pop_it, we need to "pop" the bidi cache as well and return to the + original cache. See bidi_push_it and bidi_pop_it for how this is + done. + + Some functions of the display engine save copies of 'struct it' in + local variables, and restore them later. For examples, see + pos_visible_p and move_it_in_display_line_to in xdisp.c, and + window_scroll_pixel_based in window.c. When this happens, we need + to save and restore the bidi cache as well, because conceptually + the cache is part of the 'struct it' state, and needs to be in + perfect sync with the portion of the buffer/string that is being + processed. This saving and restoring of the cache state is handled + by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros + SAVE_IT and RESTORE_IT defined on xdisp.c. + + Note that, because reordering is implemented below the level in + xdisp.c that breaks glyphs into screen lines, we are violating + paragraph 3.4 of UAX#9. which mandates that line breaking shall be + done before reordering each screen line separately. However, + following UAX#9 to the letter in this matter goes against the basic + design of the Emacs display engine, and so we choose here this + minor deviation from the UBA letter in preference to redesign of + the display engine. The effect of this is only seen in continued + lines that are broken into screen lines in the middle of a run + whose direction is opposite to the paragraph's base direction. + + Important design and implementation note: when the code needs to + scan far ahead, be sure to avoid such scans as much as possible + when the buffer/string doesn't contain any RTL characters. Users + of left-to-right scripts will never forgive you if you introduce + some slow-down due to bidi in situations that don't involve any + bidirectional text. See the large comment near the beginning of + bidi_resolve_neutral, for one situation where such shortcut was + necessary. */ #include #include -- 2.39.2