joe   Joe's Own Editor

Joe's Own Editor - Hacking

Home page
Change Log
News
Man page
Hints
Commands
Options
Hacking
History

Project page
Download source

Coroutines in JOE

JOE 3.6 now uses co-routines to help reduce the amount of event-driven code which has to be written. All previous versions of JOE were entirely in the event driven coding style, which made user dialog code a chore to write. Now, with the help of co-routines, dialog code can be written as old fashioned blocking code which is much cleaner, even though JOE can have multiple un-finished dialogs on the screen.

The coroutine library has only three (now four) functions:

int co_call(int (*func)(va_list args),...);Call a function as a co-routine.
int co_yield(Coroutine *t, int val);Suspend current co-routine and return to caller with given return value.
int co_resume(Coroutine *t, int val);Suspend current co-routine and resume specified one with given return value.
int co_suspend(Coroutine *u, int val);Suspend current co-routine (remembering its suspended invoking coroutines) and resume the top-level. The current coroutine is stored in u. When u is resumed and returns, the suspended chain is resumed (the resumer is continued when the chain tries to continue the top level.

co_call() invokes 'func' as a co-routine- it will run on its own stack. 'func' is called immediately upon the call to co_call. All of the arguments after func are passed as a va_list to the function. co_call calls va_start and va_end, 'func' should call va_arg once for each passed argument (see stdarg.h). The co-routine should not yield before it has finished reading all of the arguments, otherwise va_end will be called early. When 'func' returns, the co-routine is destroyed and co_call() returns with the return value of 'func'. If the co-routine yields by calling co_yield, co_call returns with the return value specified in the call to co_yield. Note that co_call only ever returns once for a particular call.

co_yield() suspends the current co-routine and returns with the specified return value to the caller (either to co_call or co_resume). co_yield() writes information about the co-routine to the 'Coroutine' structure whose address is passed to co_yield. The Coroutine structure's address is used as a handle for the co-routine and can be passed to co_resume().

co_resume() suspends the current co-routine (henceforth called the original co-routine) and resumes the specified co-routine. When the specified co-routine returns (when the function given in the co_call that created the co-routine returns), the original co-routine is continued and co_resume() returns with the function's return value. If the specified co-routine yields, the original co-routine is continued and co_resume() returns with the return value specified in the call to co_yield.

co_suspend() is like co_resume, except that the top-level task is resumed instead of the calling co-routine. This has the effect of suspending the entire chain of co-routines which invoked the caller of co_suspend. Co_suspend stores the suspended co_routine in another co_routine. That co_routine will resume this one when it finally returns.

The co-routine library is structured this way (which is different from the usual way where you get a co-routine handle immediately upon creation, for example see Knuth) to encourage proper stack un-winding. Basically, a co-routine can not be forcibly destroyed. Instead, each created co-routine is only destroyed when the initial function returns.

The co-routine library is based on ucontext.h and the getcontext, swapcontext, makecontext and setcontext system calls. If the system does not have these calls, it uses setjmp/longjmp tricks in a highly portable way: the stacks for each co-routine are allocated off the top of the main stack. The main stack grows after each allocation. The stacks are never freed, but they are reused. This works for any machine with a single normal stack. It will not work on IA64 (two stacks) or Cray (the stack is really a linked list).

I was originally going to use simple cooperative multi-threading, with only two basic calls: sleep_on() and wake_up() (inspired by old versions of the Linux kernel) and a top-level scheduler. Sleep_on would suspend the current thread, and create a new thread which runs just the scheduler. Wake_up would mark a thread as ready to run, and on the next return to the scheduler, it would be run.

This turns out to not be powerful enough to implement JOE's keyboard macro language, at least without much rewriting and inelegance. It could be done, but the keyboard macro player would have to feed events into the top-level scheduler or would have to be integrated with it, whereas in older versions of JOE (even without co-routines) the macro player just calls the edit functions (more or less) directly. The problem is that sleep_on (called by a function asking for user input) would suspend the task all the way up to the scheduler including the macro player, thus preventing the macro player from simulating user input.

On the other hand with real co-routines, the macro player will call each edit function as a co-routine (which I like to think of as a task with a very clear return point). If the edit function has to yield, it returns immediately to the macro player for the next macro step. If the yield is due to a request for input, the macro player supplies it by calling the recorded edit functions. If the macro wants to allow the user to provide input (using the "query" function, also known as an interactive macro), it can itself yield (since it is also running in a coroutine) back to the top-level edit loop (in actuality, the macro player calls a function which calls co_suspend() to do this). The macro player is continued when the user finishes supplying the input (hits the return key at a prompt). Several interactive macros can be running at once with this scheme.

Semi-Automatic Variables

JOE 3.6 uses "semi-automatic" variables for strings. There is a global "object-stack", accessed through these functions:

void *obj_malloc(int size);Allocate a block of memory.
void *obj_realloc(void *ptr, int size);Change allocation size.
void obj_free(void *ptr);Free a specified block and all newer blocks.
void obj_perm(void *ptr);Mark a block as permanent, which in effect moves the block from the stack to the heap.
int obj_len(void *ptr);Access hidden "length" part of block (string length).
int obj_size(void *ptr);Access hidden "size" part of block (allocation size).

These functions remember the order of allocation. When an allocated block is freed, it and all newer blocks are freed as well. For example, in:

unsigned char *a = obj_malloc(10);
unsigned char *b = obj_malloc(10);
unsigned char *c = obj_malloc(10);
unsigned char *d = obj_malloc(10);

obj_free(b);

b, c, and d are all freed, but a remains allocated. The top-level edit loop is structured to call obj_free after every editor command:

unsigned char *gc = obj_malloc(1);
execmd(...); /* Execute edit commands */
obj_free(gc); /* Collect garbage */

Thus, you rarely have to call obj_free- the only time you might want to do it is if you are calling obj_malloc in a loop. This could cause a large amount of memory to be allocated, so obj_free() should be called at the end of each loop iteration.

obj_perm() marks an object as a heap object instead of a stack object. When a heap object is freed, only itself is freed- the stack is not modified. Also when the stack is freed, heap objects are not touched.

Variable Strings

JOE uses dynamic strings built on top of the global object stack functions. These strings automatically resize themselves to fit their contents. They also know their own length, which allows you to have NULs in the middle of a string. They also always have a terminating NUL to make them compatible with C-strings. In fact they have (almost) the same type as C-strings: unsigned char *. The length is hidden in the obj_malloc data structure. These functions are provided:

unsigned char *vsmk(int prealloc);Allocate a zero length string, but with enough space to grow to the specified length.
int vslen(unsigned char *s);Return string length or 0 if s is NULL.
unsigned char *vsensure(unsigned char *s, int len);Make sure there is enough space for a string of the specified length.
unsigned char *vstrunc(unsigned char *s, int len);Set length of string: the string is truncated or extended with spaces to the specified length.
unsigned char *vsncpy(unsigned char *s,int offset,unsigned char *block, int block_size);Copy a memory block to a string. The string is extended with spaces if offset > string length.
unsigned char *vsdupz(unsigned char *s);Duplicate a z-string to a variable string.
unsigned char *vsdup(unsigned char *s);Duplicate a variable string.
unsigned char *vscpyz(unsigned char *s,unsigned char *z);Copy a z-string to a variable string.
unsigned char *vscpy(unsigned char *s,unsigned char *block,int size);Copy a memory block to a variable string.
unsigned char *vscatz(unsigned char *s,unsigned char *z);Append a z-string to a variable string.
unsigned char *vscat(unsigned char *s,unsigned char *block,int size);Append a memory block to a variable string.
unsigned char *vsadd(unsigned char *s,unsigned char c);Append a single character to a string.
unsigned char *vsfmt(unsigned char *s,int offset,const char *format,...);Sprintf() to a variable string.
unsigned char *vsgets(unsigned char **sp,FILE *f);fgets() to a variable string (returns NULL on EOF, does not put '\n' in the string).

Note that it is OK to pass NULL in place of 's' in any of the above functions. In this case, a new string is allocated.

Most of these function return the address of the variable string, in case it was modified. One exception is vsgets()- it takes an address of a variable holding the string. It returns the string address, but it also writes it to the variable directly.

Many of the above functions take a memory block, which is specified by an address and a length. These macros can be used to help construct these arguments:

#define sc(s) (s), (sizeof(s)-1)For string constants.
#define sz(s) (s), zlen(s)For C-strings (zlen is the same as strlen).
#define sv(s) (s), vslen(s)For variable strings.

So for example, vscatz is defined as:

unsigned char *vscatz(unsigned char *s, unsigned char *z);
{
	return vsncpy(sv(s), sz(z));
}

Variable arrays of variable strings

JOE also semi-automatic variable arrays of variable strings. These functions are provided:

unsigned char *vamk(int prealloc);Allocate a zero length array, but with enough space to grow to the specified length.
void varm(unsigned char **a);Free an array.
int valen(unsigned char *a);Return array length or 0 if a is NULL.
unsigned char *vaensure(unsigned char *s, int len);Make sure there is enough space for an array of the specified length.
unsigned char *vatrunc(unsigned char *s, int len);Set length of array: the array is truncated or extended with NULLs to the specified length.
unsigned char *vaadd(unsigned char **a,unsigned char *s);Append a single string to an array.
unsigned char *vasort(unsigned char **a,int len);Sort an array.
unsigned char *vawords(unsigned char **a,unsigned char *s,int len, unsigned char *sep, int seplen);Convert a word list in s (words separated by characters given in sep/seplen) into an array of strings.
void vaperm(unsigned char **a);Mark an array as preferment (heap instead of stack).

When an array is marked as permanent, any strings in the array also marked as permanent, and any string later added to the array are also marked as permanent.

Edit Buffers

API:

Look at the comments in b.h for more information.

B *bfind(unsigned char *name); Load disk file into memory buffer 'B'.

bsave(P *p,unsigned char *name,int size); Write size bytes from buffer beginning at p to disk file

brm(b); Free data structure

Once you have a B you can access the characters in it via P pointers (which are like C++ STL iterators).

B *b = bfind("foo"); Load file into memory

P *p = pdup(b->bof); Get pointer to beginning of file (duplicate b->bof which is a P).

prm(p); Free p when we're done with it.

int c=brch(p); Get character at p.

int c=pgetc(p); Get character at p and advance it.

int c=prgetc(p); Retract p, then return character at it.

- These return -1 (NO_MORE_DATA) if you try to read end of file or before beginning of file.

- A pointer can point to any character of the file and right after the end of the file.

- For UTF-8 files, character can be between 0 and 0x7FFFFFFF

Publicly readable members of P:

p->byte The byte offset into the buffer

p->line The line number

p->xcol If P is the cursor, this is the column number where the cursor will be displayed on the screen (which does not have to match the column of the character at P).

Some predicates:

pisbof(p); True if pointer is at beginning of buffer

piseof(p); True if pointer is at end of buffer

pisbol(p); True if pointer is at beginning of line

piseol(p); True if pointer is at end of line

pisbow(p); True if pointer is at beginning of a word

piseow(p); True if pointer is at end of a word

More information about character at p:

piscol(p); Get column number of character at p.

Some other ways of moving a P through a B:

pnextl(p); Go to beginning of next line

pprevl(p); Go to end of previous line

pfwrd(p,int n); Move forward n characters

pbkwd(p,int n); Move backward n characters

p_goto_bof(p);

p_goto_eof(p);

p_goto_bol(p);

p_goto_eol(p);

pset(p,q); Move p to same position as q.

pline(p,n); Goto to beginning of a specific line.

pgoto(p,n); Goto a specific byte position.

pfind(P,unsigned char *s,int len); Fast Boyer-Moore search forward.

prfind(P,unsigned char *s,int len); Fast Boyer-Moore search backward.

These are very fast- they look at low level data structure and don't go through pgetc(). Boyer-Moore allows you to skip over many characters without reading them, so you can get something like O(n/len).

Some facts:

Local operations are fast: pgetc(), prgetc().

Copy is fast: pset().

pline() and pgoto() are slower, but look for the closest existing P to start from.

The column number is stored in P, but it is only updated if it is easy to do so. If it's hard (like you crossed a line boundary backward) it's marked as invalid. piscol() then has to recalculate it.

Modifying a buffer:

binsc(p,int c); Insert character at p.

bdel(P *from,P *to); Delete character between two Ps.

Note that when you insert or delete, all of the Ps after the insertion/ deletion point are adjusted so that they continue to point to the same character before the insert or delete.

Insert and Delete create undo records.

Insert and Delete set dirty flags on lines which are currently being displayed on the screen, so that when you return to the edit loop, these lines automatically get redrawn.

Internal:

An edit buffer is made up of a doubly-linked list of fixed sized (4 KB) gap buffers. A gap buffer has two parts: a ~16 byte header, which is always in memory, and the actual buffer, which can be paged out to a swap file (a vfile- see vfile.h). A gap buffer consists of three regions: text before the gap, the gap and text after the gap (which always goes all the way to the end of buffer). (hole and ehole in the header indicate the gap position). The size of the gap may be 0 (which is the case when a file is first loaded). Gap buffers are fast for small inserts and deletes when the cursor is at the gap (for delete you just adjust a pointer, for insert you copy the data into gap). When you reposition the cursor, you have to move the gap before any inserts or deletes occur. If you had only one window and a single large gap buffer for the file, you could always keep the gap at the cursor- the right-arrow key copies one character across the gap.

Of course for edits which span gap buffers or which are larger than a gap buffer, you get a big mess of gap buffer splitting and merging plus doubly-linked list splicing.

Still, this buffer method is quite fast: you never have to do large memory moves since the gap buffers are limited in size. To help search for line numbers, the number of newlines '\n's contained in each gap buffer is stored in the header. Reads are fast as long as you have a P at the place you want to read from, which is almost always the case.

It should be possible to quickly load files by mapping them directly into memory (using mmap()) and treating each 4KB page as a gap buffer with 0 size gap. When page is modified, use copy-on-write to move the page into the swap file (change pointer in header). This is not done now. Instead the file is copied when loaded.

Windowing System

There is a tiny object-oriented windowing system built into JOE. This is the class hierarchy:

SCRN A optimizing terminal screen driver (very similar to 'curses'). has a pointer to a CAP, which has the terminal capabilities read from termcap or terminfo.

writes output to screen with calls to the macro ttputc(). (tty.c is the actual interface to the tty device).

cpos() - set cursor position

outatr() - draw a character at a screen position with attributes

eraeol() - erase from some position to the end of the line

SCREEN Contains list of windows on the screen (W *topwin).

Points to window with focus (W *curwin).

Contains pointer to a 'SCRN', the tty driver for the particular terminal type.

W A window on a screen.

Has position and size of window.

Has: void *object- pointer to a structure which inherits window (W should really be a base class for these objects- since C doesn't have this concept, a pointer to the derived class is used instead- the derived class has a pointer to the base class: it's called 'parent').

Currently this is one of:

BW * a text buffer window (screen update code is here.)

QW * query window (single character yes/no questions)

MENU * file selection menu

BW * is inherited by (in the same way that a BW inherits a W):

PW * a single line prompt window (file name prompts)

TW * a text buffer window (main editing window).

WATOM *watom- Gives type of this window. The WATOM structure has pointers to virtual member functions.

KBD *kbd- The keyboard handler for this window. When window has focus, keyboard events are sent to this object. When key sequences are recognized, macros bound to them are invoked.

Some window are operators on others. For example ^K E, load a file into a window prompt operates on a text window. If you hit tab, a file selection menu which operates on the prompt window appears below this. When a window is the target of operator windows is killed, the operators are killed also.

Currently all windows are currently the width of the screen (should be fixed in the future). The windows are in a big circular list (think of a big loop of paper). The screen is small window onto this list. So unlike emacs, you can have windows which exist, but which are not on the screen.

^K N and ^K P move the cursor to next or previous window. If the next window is off the screen it is moved onto the screen, along with any operator windows are target it.

MACROS

- add something here.

Screen update

- add something here.

Syntax Highlighter

There are two parts to the syntax highlighter: the parser (in syntax.c) and the line attribute database (in lattr.c). The parser is simple enough: it is a finite state machine interpreter. You supply a line of text and the starting state, and it returns the starting state of the next line and an array containing the color to be used for each character. Each state has a color. The line is scanned left to right, character by character: the state and the character index a table, which gives an action, which includes the next state.

The action specifies the next state, and also indicates whether some previous N characters (including the current) should be "recolored" with the color of the next state, or whether the color of the current state should be used for the character. The action also indicates whether to move on to the next character, or to stay put and re-evaluate the current character with the new state. This leads to simpler hand-coding of the state machine, but can create an infinite loop. The action also indicates if the character should be added to a string buffer. The string buffer can be used (when specified by an action) as a key for a hash table look-up (typically filled with key-words). If there is a match, the hash-table entry supplies the action instead of the state table.

This parser is fast and simple and powerful enough to lexically analyze more than 40 languages. However a few enhancements were made to improve both the class of languages which can be highlighted and to improve the ease of writing the state machine. To support "here" documents in perl and sh, as well as to simplify matching delimiters (like open with close parenthesis), a string (from the string buffer) can be stored along with the state. This can be matched against the string buffer. So first the terminating word from the here document is stored in the string, and then the state machine looks forward until it finds this string to determine the end of the here document.

The state is only a single integer, not an entire stack. The parser can therefore not handle recursive grammars. Unfortunately, some languages have recursive elements in their lexical structure. One way to deal with this is to support only a very limited recursion depth by tediously repeating a set of states in the hand written state machine N times, where N is your maximum recursion depth. To ease writing this type of thing, and also to allow state machine to be reused in cases where one language is embedded within another (PHP in HTML, for example), a macro system was added to the state machine loader. Recursion is allowed in the macro calls, but is limited to a depth of 5.

Line attribute cache

The second part of the syntax highlighter is the line attribute cache. In the original implementation, the state of the first line of each window is maintained and the parser is invoked during the screen update algorithm to highlight all of the lines on the screen. This works well when the window remains in place or when it scrolls forward through the file. However, if the window scrolls backward you may have to reparse the entire file or from the nearest previous window if there happens to be one. This can be slow for large files, and certainly wastes a lot of CPU time when scrolling backward through a file.

One attempt to solve this was to instead of starting at the beginning of the file, to instead assume that the line 200 lines (a configurable option) back was at the initial state, parse forward from there and hope for the best. Sometimes the state machine would find itself back in the proper state, but not always- to the point where most syntaxes had specified to just always reparse from the beginning of the file.

Finally, the line attribute cache was added. It stores the parser state at the beginning of each line, from the beginning of the file to last line actually displayed on the screen. This solves the scrolling problem, but introduces a new one: how to keep the line attribute cache up to date. This turns out to not be so difficult. First of all, the states are stored in a gap-buffer, which makes it easy to perform insertions and deletions. Whenever lines are inserted or deleted in the file, the line attribute cache is expanded or contracted to keep in sync. Now, when any change is made to the file, we reparse from just before the start of the change until the state of the beginning of a line again matches the corresponding entry in the cache. If a match is found, the rest of the cache is assumed to be correct. Some changes will require a large amount of reparsing, but many will not. In any case, the the highlighting on the screen is always correct.

Files

b.cText buffer management
blocks.cLibrary: fast memcpy() functions (necessary on really old versions of UNIX).
bw.cA class: text buffer window (screen update code is here)
charmap.cUNICODE to 8-bit conversion functions
cmd.cTable of user edit functions
coroutine.cCoroutine library
dir.cDirectory reading functions (for old UNIXs).
hash.cLibrary: automatically expanding hash table functions.
help.cImplement the on-line help window
i18n.cUnicode character type information database
kbd.cKeymap data structure (keysequence to macro bindings).
lattr.cLine attribute cache
macro.cKeyboard and joerc file macros
main.cHas main() and top level edit loop
menu.cA class: menu windows
obj.cSemi-automatic objects, plus dynamic string and array library
path.cLibrary: file name and path manipulation functions
poshist.cCursor position history
pw.cA class: prompt windows
queue.cLibrary: doubly linked lists
qw.cA class: single-key query windows
rc.cJoerc file loader
regex.cRegular expressions
scrn.cTerminal update functions (like curses)
selinux.cSecure linux (for RedHat) functions
syntax.cSyntax highlighter
tab.cTab completion for file selection prompt
termcap.cLoad terminal capabilities from /etc/termcap file or terminfo database
tw.cA class: main text editing window
ublock.cUser edit functions: block moves
uedit.cUser edit functions: basic edit functions
uerror.cUser edit functions: parse compiler error messages and goto next error, previous error
ufile.cUser edit functions: load and save file
uformat.cUser edit functions: paragraph formatting, centering
uisrch.cUser edit functions: incremental search
umath.cUser edit functions: calculator
undo.cUndo system
usearch.cUser edit functions: search & replace
ushell.cUser edit functions: subshell
utag.cUser edit functions: tags file search
utf8.cUTF-8 to unicode coding functions
utils.cMisc. utilities
vfile.cLibrary: virtual memory functions (allows you to edit files larger than memory)
w.cA class: base class for all windows