Poly/ML Interface to the C Programming Language

Nick Chapman    June 6, 1994

  1. Introduction
  2. Dynamic Libraries
  3. Creating a Dynamic Library
  4. Calling Simple C-functions
  5. A family of calln functions
  6. Predefined Conversions
  7. Volatile Types: vol, sym and dylib.
  8. Calling C-functions with return-parameters
  9. A family of callnretr functions
  10. C structures
  11. A family of structn Conversionals
  12. Lower Level Calling Mechanism: call_sym
  13. Creating New Conversions
  14. Enumerated Types
  15. C Programming Primitives
  16. Example: Quicksort
  17. Volatile Implementation

1 Introduction

It is now possible for Poly/ML to call functions which have been written in the C programming language. These functions are accessed from a dynamic library, and so don't have to be statically linked into the Poly/ML runtime system. The C interface is contained in the structure CInterface, which is built into every ML database. The facilities available allow dynamic libraries to be loaded and symbols to be extracted from these libraries. Symbols which represent C-functions can be executed.

The arguments to a C-function need to be in a format which the C-function can understand. Similarly, the return value from a C-function will be in a standard C format. All such C-values are represented in ML using the abstract type vol. Values of this type are volatile because they do not persist from one ML session to the next. There are facilities to convert between ML-values and vols, together with a collection of 'C-programming' primitives to manipulate vols.

2 Dynamic Libraries

exception Foreign of string
val load_lib : string -> dylib
val load_sym : dylib -> string -> sym
val get_sym : string -> string -> sym

The function load_lib takes an ML string containing the pathname of a dynamic library. This should preferably be a full pathname. If it is a relative pathname it will be interpreted with respect to the directory from which the ML session was started. The return value is a dylib representing the dynamic library. If the dynamic library cannot be found, the exception Foreign is raised with a string describing the problem.

If the file named by the filename exists but is not in the correct format for a dynamic library, the underlying C-function dlopen prints an error message and then kills the ML session. So far, I have been unable to catch this error.

Once a library has been opened, a symbol may be extracted from the library with the function load_sym. This takes a dylib representing the dynamic library and an ML string naming the symbol. The return value is a sym representing the symbol. If the symbol is not contained in the dynamic library, the exception Foreign is raised with a string describing the problem.

Often the return value of the function load_lib is passed directly to the function load_sym. This combination is captured by the function get_sym, which takes two strings naming the dynamic library and the symbol, and returns the sym representing the symbol, or raises the exception Foreign.

fun get_sym lib sym = load_sym (load_lib lib) sym;

Values of type dylib and sym share the volatile nature of vol; they do not persist from one ML session to the next. This is explained in more detail in Section 7.

3 Creating a Dynamic Library

Suppose we have written a C-function called difference, which computes the difference of two integers. The function is contained in a file named sample.c.

int difference (int x, int y) {
    return x > y ? x - y : y - x;
}

To create a dynamic library containing this function we carry out the following steps at the shell prompt:

Pinky$ gcc -c sample.c -o sample.o
Pinky$ ld -o sample.so sample.o

These steps create a dynamic library named sample.so. Often many symbols will be retrieved from the same dynamic library, and so it is useful to partially apply the function get_sym to the name of the common library. Most of the examples in this document use symbols retrieved from the library sample.so.

val get = get_sym "sample.so";

4 Calling Simple C-functions

To call the C-function difference we use the function call2 from the structure CInterface. This function allows us to call C-functions that take two arguments:

val call2 : sym -> 'a Conversion * 'b Conversion -> 'c Conversion
                -> 'a * 'b -> 'c

The first parameter of call2 is the sym representing the symbol that we wish to call. This is usually obtained from a call to get_sym. The second parameter is a pair of Conversions describing the two arguments to the C-function; the third parameter is a Conversion describing the return value of the C-function. The fourth parameter is a pair containing the actual arguments to be passed to the C-function. Notice how the type of each argument matches the type variable contained in the corresponding Conversion parameter.

The purpose of a Conversion is twofold. Firstly, it specifies the C-type required by the C-function. This needs to be known at the lowest level so that the correct argument passing and return conventions can be used when calling the C-function. Secondly, the Conversion performs the conversion between a C-value (in this case a C integer) and an ML-value. The conversion necessary to call the example C-function difference is INT, which has type int Conversion. We can now define an ML function as a wrapper around the underlying C-function.

val diff = call2 (get "difference") (INT,INT) INT;

Because the Conversion INT has type int Conversion, the type of diff is constrained to be int * int -> int, which is just what we require. We can now apply the ML function, for example (diff (13,50)), which evaluates to 37.

5 A family of calln functions

There is a family of calln functions from call0 to call9.

val calln :
   sym -> 'a1 Conversion *  ... * 'an Conversion
       -> 'b Conversion
       -> 'a1 * ... * 'an -> 'b

We need a collection of functions because we cannot give a legal ML type to a function which takes a list of Conversions without forcing them all to have the same type parameter. C-functions with more than nine parameters can still be called, but the lower level calling mechanism must be used, see Section 12.

6 Predefined Conversions

In the structure CInterface, there are various predefined Conversions. The name of each Conversion indicates the C-type required/returned, whereas the ML type of the Conversion constrains the resulting type when the Conversion is used as an argument to a calln function.

val CHAR    : char Conversion
val DOUBLE  : real Conversion
val FLOAT   : real Conversion
val INT     : int Conversion
val LONG    : int Conversion
val SHORT   : int Conversion
val STRING  : string Conversion
val VOID    : unit Conversion
val BOOL    : bool Conversion
val POINTER : vol Conversion

The Conversions CHAR, DOUBLE, FLOAT, INT, LONG and SHORT are primitive in the sense that they convert between small fixed-size C types.

The Conversion STRING converts between an ML string and a C pointer; the pointer points at a null terminated array of characters. This Conversion is built out of the CHAR Conversion and the C programming primitives, see Section 15.

The Conversion VOID is really a one way Conversion intended for the result of C-functions that return void. Attempts to use this Conversion the other way around raise the exception Foreign with an appropriate message.

The Conversion BOOL is built on top of the Conversion INT. It converts between an ML bool and a C integer.

The Conversion POINTER is basically the identity Conversion. No conversion is performed and the underlying vol becomes accessible.

7 Volatile Types: vol, sym and dylib.

There is a problem with the definition of the ML-function diff given above. The call to get_sym (within the partial application get) returns a value of type sym which like values of type vol does not persist from one ML session to the next. If after the definition of diff we were to commit the database and leave the ML session, we would find that on restarting the ML session, the function diff no longer operates as expected, but instead causes the exception Foreign to be raised:

> commit();
> diff (13,50);
val it = 37
> quit();
Pinky$ ml
> diff (13,50);
Exception- Foreign "Invalid volatile" raised

One solution is to redefine the ML function diff as:

fun diff args =
   call2 (get "difference") (INT,INT) INT args;

The new version of diff is very similar to the old version, except that the subexpression get "difference" will be executed every time the function is applied to the tuple of arguments, instead of just once. This causes the library and symbol to be reloaded on every invocation of the function diff, ensuring that the sym is valid. Efficiency-wise this is not as horrific as it sounds. The underlying dynamic library manipulation functions appear to cache what has already been loaded, and so do little work on subsequent calls to load the same library or symbol.

8 Calling C-functions with return-parameters

Although C is strictly a call-by-value language, call-by-reference is often simulated with the use of parameters of a pointer type. When a function is called with a parameter that has a pointer type, the called function can then modify the value pointed at by the pointer. For example, the C-function diff_sum below computes both the difference and the sum of two integers. The function has four parameters: two input parameters and two return-parameters.

void diff_sum (int x, int y, int *diff, int *sum) {
  *diff = x > y ? x - y : y - x;
  *sum = x+y;
}

With C, this function would be invoked with something like:

  int diff,sum;
  diff_sum (13, 50, &diff, &sum);

To call the C-function diff_sum from ML we use the function call4ret2. This allows us to call C-functions that have four parameters, the last two being return-parameters.

val call4ret2 : sym
  -> 'a Conversion * 'b Conversion -> 'c Conversion * 'd Conversion
  -> 'a            * 'b             -> 'c             * 'd

Now we can write an ML wrapper function:

fun diff_sum x y =
   call4ret2 (get "diff_sum") (INT,INT) (INT,INT) (x,y);

Evaluating (diff_sum 13 50) results in (37,63).

9 A family of callnretr functions

There is a limited family of callnretr functions defined to call C-functions that have n - r input-parameters followed by r return-parameters. This family contains functions for n ranging from 1 to 5, with r as either 1 or 2. (Exception: there is no call1ret2 because this makes no sense.)

val call1ret1 : sym -> unit -> 'a Conversion -> unit -> 'a
val callnretr :
   sym -> 'a1 Conversion * ... * 'an-r Conversion
       -> 'an-r+1 Conversion * ... * 'an Conversion
       -> 'a1 * ... * 'an-r -> 'an-r+1 * ... * 'an

The lower level calling mechanism (Section 12) can be used in the remaining cases: for other combinations of n and r; when a non-final parameter in the parameter list must be a return-parameter; or when the actual return result is required together with the use of return-parameters.

10 C structures

C functions may be called which take/return C structure values. For example, the following piece of C defines a typedefed structure called Point, and a function which manipulates these Points called addPoint.

typedef struct {int x; int y;} Point;

Point addPoint (Point p1, Point p2) {
  p1.x += p2.x;
  p1.y += p2.y;
  return p1;
}

To create the necessary Conversion for Points we can use the Conversional, STRUCT2. This function takes a pair of Conversions and returns a new Conversion suitable for a C structure containing those types. The type of STRUCT2 is:

val STRUCT2 : 'a Conversion * 'b Conversion -> ('a * 'b) Conversion

We now define the Conversion POINT and an ML wrapper function for addPoint:

val POINT = STRUCT2 (INT,INT);

fun addPoint p1 p2 =
   call2 (get "addPoint") (POINT,POINT) POINT (p1, p2);

Now, (addPoint (5, 6) (8,9)) evaluates to (13, 15).

11 A family of structn Conversionals

There is a family of STRUCTn functions from STRUCT2 to STRUCT9.

val STRUCTn : 'a1 Conversion * ... * 'an Conversion
               -> ('a1 * ... * 'an) Conversion

Manipulation of structures with more than nine components can be achieved with the use of the lower level calling mechanism, see Section 12.

12 Lower Level Calling Mechanism: call_sym

Occasionally it is necessary to access the dynamic calling mechanism at a lower level. The collection of functions calln and callnretr are all defined in terms of the function call_sym, which has the following type:

val call_sym : sym -> (Ctype * vol) list -> Ctype -> vol

The second argument to call_sym is a list of Ctype/vol pairs, which allows C-functions of any number of arguments to be called. This function is more cumbersome to use than the calln and callnretr functions because the two stages, specification of the C-type and conversion between ML-values and C-values (vols), have been separated. The specification of the C-type is achieved by using a constructor of the datatype Ctype:

datatype Ctype =
    Cchar | Cdouble | Cfloat | Cint | Clong | Cshort | Cvoid
  | Cpointer of Ctype
  | Cstruct of Ctype list
  | Cfunction of Ctype list * Ctype

The following collection of functions is used to convert from and to values of type vol.

val fromCstring : vol -> string
val fromCchar   : vol -> char
val fromCdouble : vol -> real
val fromCfloat  : vol -> real
val fromCint    : vol -> int
val fromClong   : vol -> int
val fromCshort  : vol -> int
val toCstring   : string -> vol
val toCchar     : char -> vol
val toCdouble   : real -> vol
val toCfloat    : real -> vol
val toCint      : int -> vol
val toClong     : int -> vol
val toCshort    : int -> vol

For example, this is how to define diff directly in terms of call_sym.

fun diff x y =
  fromCint (call_sym (get "difference")
    [(Cint, toCint x),(Cint, toCint y)] Cint)

Manipulation of C structures is achieved with the following two functions:

val make_struct  : (Ctype * vol) list -> vol
val break_struct : Ctype list -> vol -> vol list

13 Creating New Conversions

Recall that a Conversion encapsulates three things: an underlying C-type; a function to convert from the C-value (of type vol) to an ML value of a given type; and a function which converts from the ML value back into the C-value (of type vol). Sometimes it is useful to be able to create new Conversions, or to retrieve the components from an existing Conversion.

val mkConversion    : (vol -> 'a) -> ('a -> vol) -> Ctype -> 'a Conversion
val breakConversion : 'a Conversion -> (vol -> 'a) * ('a -> vol) * Ctype

The function mkConversion creates a new Conversion from its three components. The function breakConversion takes an existing Conversion and returns a triple containing the components. For example, the standard conversion INT might be defined as:

val INT = mkConversion fromCint toCint Cint

A good reason for creating a new Conversion is to give a different ML type to values of type vol which are to be used in a particular way. For example, we may be interfacing to a collection of C-functions that take/return pointers which are being used to implement a particular abstract type, for example a tree node. By creating a new conversion we can use the ML type system to avoid mixing values of this new type with other normal vols.

abstype node = Node of vol
with val NODE = mkConversion Node (fn (Node n) => n) (Cpointer Cvoid)
end

fun lookupNode s = call1 (get "lookupNode") STRING NODE s
fun printNode n = call1 (get "printNode") NODE VOID n

The types of these two functions are:

val lookupNode : string -> node
val printNode  : node -> unit

14 Enumerated Types

Another reason for creating a new Conversion is when we want to call a C-function that takes/returns values of an enumerated type. For example, suppose colour is declared as:

typedef enum {
  white,
  red = 5,
  green,
  blue,
  /* leave room for extra colours in the future */
  black = 100
} colour;

This example shows that C enumerations are just sugar for integers, so much so that we can even specify which constructors correspond to which integer values. When an enumeration is declared that specifies integer values for just some constructors (as in colour above), the first constructor is assigned 0 if it is unspecified, and each successive unspecified constructor is assigned the next integer value, e.g. green is 6.
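
The numbering rule can be checked directly in C. The following is only an illustrative sketch: the helper name colour_to_int is hypothetical, and the enumeration is assumed to have the constructors white, red, green, blue and black with the values used in the rest of this section.

```c
#include <assert.h>

/* Assumed declaration: white is first and unspecified, so it is 0;
   green and blue follow red = 5, so they are 6 and 7. */
typedef enum {
  white,
  red = 5,
  green,
  blue,
  black = 100
} colour;

/* Hypothetical helper: returns the integer behind an enumeration value. */
int colour_to_int (colour c) {
  /* Reject integers that do not correspond to any constructor. */
  assert (c == white || (c >= red && c <= blue) || c == black);
  return (int) c;
}
```

Calling colour_to_int on each constructor gives the integers that a generated ML conversion function would need to map.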

We would like to convert C-enumerations like colour into an equivalent ML datatype, together with functions to convert between values of the datatype and ML integers. This can be achieved automatically by using the script proc-enums, contained in the scripts subdirectory of the source tree.

Usage: proc-enums <struct-name> {<filename>}+

The first parameter to proc-enums is the name of the generated ML structure. The remaining parameters specify C-files in which to search for C typedefed enumeration declarations. No formatting conventions are assumed, i.e. arbitrary white space and comments are allowed within the declaration. Other declarations and definitions are ignored. The generated file is named <struct-name>.ML.

For the colour example, we would type 'proc-enums colour colour.h' at the shell prompt. This would generate a file colour.ML containing the following ML definitions.

structure colour = struct

datatype colour
= white
| red
| green
| blue
| black

exception Int2colour

fun int2colour i = case i of
  0 => white
| 5 => red
| 6 => green
| 7 => blue
| 100 => black
| _ => raise Int2colour

fun colour2int i = case i of
  white => 0
| red => 5
| green => 6
| blue => 7
| black => 100

end (* struct *)

Once these definitions have been generated (and the structure colour has been opened) we can create a new Conversion:

val COLOUR =
  mkConversion (int2colour o fromCint) (toCint o colour2int) Cint;

Now, suppose we have a C-function nameOfColour,

#include "colour.h"
char* nameOfColour (colour c) {
  switch (c) {
    case white: return "white";
    case red:   return "red";
    case green: return "green";
    case blue:  return "blue";
    case black: return "black";
    default:    return "Error: No such colour";
  }
}

we can write an ML wrapper for this function as:

fun nameOfColour c =
   call1 (get "nameOfColour") COLOUR STRING c;

Now we can execute (nameOfColour blue), which evaluates to the ML string "blue".

15 C Programming Primitives

Occasionally, we need to manipulate C-values in greater detail. The following example shows how an ML wrapper can be written for the C-function diff_sum, without using a callnretr function.

fun diff_sum x y =
    let val diff = alloc 1 Cint
        val sum = alloc 1 Cint
    in
        call4 (get "diff_sum") (INT,INT,POINTER,POINTER) VOID
            (x, y, address diff, address sum);
        (fromCint diff, fromCint sum)
    end

This example uses two of a collection of six ML functions allowing basic C-programming.

val sizeof  : Ctype -> int
val alloc   : int -> Ctype -> vol
val address : vol -> vol
val deref   : vol -> vol
val assign  : Ctype -> vol -> vol -> unit
val offset  : int -> Ctype -> vol -> vol

These functions are intrinsically unsafe: incorrect usage can cause the ML session to die.

The application (sizeof t) returns the size (in bytes) of the Ctype t.

The application (alloc n t) returns a vol encapsulating some freshly allocated memory of size (n*sizeof t) bytes. Unlike allocation facilities in C, which return a pointer to the newly allocated space, the result of alloc encapsulates the space directly.

The underlying implementation of alloc does in fact use malloc to gain some newly allocated space, and does in fact consist of a pointer to this space. However, all the above ML functions work at an extra level of indirection to the corresponding C-operation. This extra indirection is removed before the C-value is passed to a real C-function.

The application (address v) returns a new vol containing the address of v. This function corresponds to the C operator &.

The application (deref v) returns a vol which is the result of dereferencing the address contained in v. This function corresponds to the C operator *. If v is not a valid address, the ML session will die with a segmentation error.

The application (assign t v w) copies (sizeof t) bytes of data from w into v. This function corresponds to the C operator =, or the standard C function memcpy.

The application (offset i t v) returns a new vol that is offset (i*sizeof t) bytes in memory from v. The closest corresponding operator in C is structure dereferencing (.). Pointer arithmetic can be achieved by combining the function offset with the functions address and deref.
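
In C terms, the byte arithmetic behind alloc, assign and offset might be sketched as follows. The helper names c_alloc, c_assign and c_offset are hypothetical and are not part of CInterface, and the sketch ignores the extra level of indirection at which the ML functions work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Like (alloc n t): n * sizeof t bytes of fresh space (NULL on failure). */
void *c_alloc (size_t n, size_t size_of_t) {
  return malloc (n * size_of_t);
}

/* Like (assign t v w): copy sizeof t bytes of data from w into v. */
void c_assign (size_t size_of_t, void *v, const void *w) {
  assert (v != NULL && w != NULL);
  memcpy (v, w, size_of_t);
}

/* Like (offset i t v): the address i * sizeof t bytes away from v. */
void *c_offset (ptrdiff_t i, size_t size_of_t, void *v) {
  assert (v != NULL);
  return (char *) v + i * (ptrdiff_t) size_of_t;
}
```

With these helpers, c_offset (2, sizeof (int), a) names the same location as &a[2] for an int array a, which is the pointer arithmetic that offset provides on the ML side.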

The functions address and deref create the same aliasing as the corresponding C operators. For example, the following sequence of C statements causes the final value of i to be 123:

  int i = 0;
  int *p = &i;
  *p = 123;

Likewise, the following sequence of ML statements:

> val i = toCint 0;
> val p = address i;
> assign Cint (deref p) (toCint 123);
> fromCint i;
val it = 123

16 Example: Quicksort

The following example shows how the C-programming primitives are intended to be used. The example involves interfacing to the standard C-function qsort. On many Unix systems this function can be retrieved from a dynamic library in /usr/lib.

val getC = get_sym "/usr/lib/libc.so.1.7";

The function qsort takes four parameters.

void qsort (void *base, int nel, int width, int (*compar)());

The first parameter, base, is a pointer to an array of elements to be sorted; the second parameter, nel, is the number of elements in the array; the third parameter, width, is the size (in bytes) of each element; the fourth parameter, compar, is a comparison function which must return an integer less than, equal to, or greater than zero. See the qsort manual page for more details.

In our example we wish to sort pairs of strings. The first string is the key to be sorted, while the second string is arbitrary data. In C we would represent this pair as a structure, and would write the comparison function compare using strcmp.

typedef struct {
  char *key;
  char *data;
} pair;

int compare (pair x, pair y) {
   return strcmp(x.key, y.key);
}
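
Standard qsort passes the comparison function pointers to the two elements being compared, so a comparison function that takes its structures by value relies on the platform's calling convention. A more portable variant might look like the sketch below; the names compare_pairs and sort_pairs are hypothetical and do not appear elsewhere in this document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *key;
  char *data;
} pair;

/* Portable comparison: each argument points at one array element. */
int compare_pairs (const void *x, const void *y) {
  const pair *px = x;
  const pair *py = y;
  return strcmp (px->key, py->key);
}

/* Hypothetical helper: sort nel pairs in place, ordered by key. */
void sort_pairs (pair *table, size_t nel) {
  assert (table != NULL);
  qsort (table, nel, sizeof (pair), compare_pairs);
}
```

The ML wrapper would be unchanged apart from the symbol name passed to get, since the comparison function is still handed to qsort as an opaque pointer.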

We want to define an ML wrapper qsort which takes a list of string pairs and returns the sorted list. Other than the C-programming primitives, the only additional function needed is volOfSym. This is needed to supply the fourth argument to qsort, a pointer to a comparison function. The application (volOfSym s) returns the vol encapsulated in the symbol s.

val volOfSym : sym -> vol

We can now define qsort, together with two auxiliary functions fill and read.

val (fromPair,toPair,pairType) = breakConversion (STRUCT2 (STRING,STRING));

fun fill p [] = ()
  | fill p ((key,data)::xs) =
         (assign pairType p (toPair (key,data));
          fill (offset 1 pairType p) xs)

fun read p 0 = []
  | read p n = fromPair p :: read (offset 1 pairType p) (n-1)

fun qsort xs =
    let
      val len = length xs
      val table = alloc len pairType
      val compare = volOfSym (get "compare")
      val sort = call4 (getC "qsort") (POINTER,INT,INT,POINTER) VOID
    in
      fill table xs;
      sort (address table, len, sizeof pairType, compare);
      read table len
    end

The function fill takes a pointer into some allocated space (which must be big enough), and a string pair list. It fills the array with structures created from the list. The function offset is used to move along the allocated area.

The function read is the inverse of fill. It takes an array of structures and an integer n and reconstructs a list of n string pairs.

The ML function qsort operates by first allocating enough space for the array of structures, then using fill to fill this array from the argument list xs. A call to the C-function qsort is made to sort this array. Notice how the first argument to sort is (address table) which generates the required array pointer for the C-function qsort. Finally, a list is reconstructed from the sorted array using read.

Now we can evaluate the following:

> qsort [("one","fred"), ("two", "dave"), ("three", "bob"), ("four", "mary")];
val it =
  [("four", "mary"), ("one", "fred"), ("three", "bob"), ("two", "dave")]

17 Volatile Implementation

The C-data contained in a volatile is managed in a separate space from normal ML data, which is stored in the heap. There are two reasons for this. The first is that data contained in the ML heap is liable to change its address during garbage collection, and C-functions cannot cope with this. The second reason is safety: we do not want foreign C-functions to obtain a pointer into the ML heap. Because the C-function is running in the same Unix process, it is always possible for it to corrupt the ML heap; however, the most usual cause of corruption is an off-by-one error. If the C-data were stored in the ML heap, such an error would corrupt a neighbouring heap cell.

Every ML value of type vol has two components: (1) an ML heap cell; (2) a slot in the vols array, a runtime system variable declared and managed in the file Driver/foreign.c. The ML heap cell indexes a slot in the vols array. This slot contains three items: (1) a back pointer, pointing at the corresponding ML heap cell; (2) a C-pointer, pointing to the actual C-data; (3) a boolean, indicating whether this volatile owns the space pointed to by the C-pointer.

The combination of vols array index and the back pointer found there enables the validity of a volatile to be checked as it is dereferenced. If the volatile is invalid then the exception Foreign is raised.
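
The validity check can be pictured with the following C sketch. It is not the code in Driver/foreign.c: the type names MLCell and VolSlot, the fixed table size NUM_SLOTS, and the function vol_is_valid are all hypothetical, and an ML heap cell is modelled as a struct that holds only its slot index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 16   /* assumed fixed size, for illustration only */

/* Stand-in for the ML heap cell: it just records a vols array index. */
typedef struct { size_t index; } MLCell;

/* The three items held in a slot of the vols array. */
typedef struct {
  MLCell *back;   /* back pointer to the corresponding ML heap cell */
  void   *cptr;   /* pointer to the actual C-data */
  bool    owns;   /* does this volatile own the space at cptr? */
} VolSlot;

VolSlot vols[NUM_SLOTS];

/* A volatile is valid only if its slot exists and points back at it. */
bool vol_is_valid (const MLCell *cell) {
  assert (cell != NULL);
  return cell->index < NUM_SLOTS && vols[cell->index].back == cell;
}
```

Under this model, a cell that carries an index into a slot now belonging to a different cell, or an index outside the table, fails the check, which is the point at which Foreign would be raised.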

The collection of functions that convert ML values into vols (e.g. toCint and toCfloat), together with the functions alloc and address create new volatiles; that is, volatiles that own the space pointed to by the C-pointer in their vols array slot. This space is obtained from a call to malloc. There is always exactly one owner of any piece of malloced space. The deref and offset functions create vols that point to previously allocated space and so are not regarded as the owner.
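
The ownership rule can be illustrated with a small C sketch. The type Vol and the functions vol_new, vol_view and vol_release are hypothetical names chosen for this example; they only model the owns flag and are not taken from the runtime system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
  void *cptr;   /* pointer to the C-data */
  bool  owns;   /* true for exactly one volatile per malloced block */
} Vol;

/* Like toCint or alloc: a new volatile that owns freshly malloced space. */
Vol vol_new (size_t size) {
  Vol v = { malloc (size), true };
  return v;
}

/* Like deref or offset: a volatile pointing into space owned elsewhere. */
Vol vol_view (void *p) {
  Vol v = { p, false };
  return v;
}

/* Release a volatile: only the owner frees the underlying space. */
void vol_release (Vol *v) {
  assert (v != NULL);
  if (v->owns)
    free (v->cptr);
  v->cptr = NULL;
}
```

Releasing a non-owning volatile leaves the owner's space untouched, while releasing the owner frees it, so any remaining non-owning volatile would then hold a dangling pointer.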

Volatiles are garbage collected in such a way that malloced space is freed when there are no remaining references to the ML cell which owns that space. However, by itself this scheme is too vicious. For example:

val a = address (toCint 999);

When a garbage collection occurs, although the space owned by a (containing the pointer) will be preserved, the space allocated to hold the C-integer 999 will be reclaimed because there are no references to its owner, the anonymous expression (toCint 999).

If we now evaluate the expression (fromCint (deref a)), it will result in whatever garbage happened to be pointed to by the redundant C-pointer contained in the volatile a. What is needed is a way to ensure that the volatile a holds an ML reference to the anonymous volatile (toCint 999) for the duration of its lifetime. In a similar manner, any volatile that does not own its own space, i.e. the result of the expression (deref (address (toCint 999))), needs to hold a reference to the owner of the space it points at. This scheme of maintaining references is implemented in Volatile.ML in the directory Prelude/Foreign, and is completely transparent to the user.

In some unusual situations we might want to allocate some space which persists after all ML references to it have disappeared. For example, we might have to allocate space for a buffer, and then hand a pointer to this buffer over to a foreign C-function. This can be achieved in two ways. We could carefully maintain an ML reference to the vol encapsulating the buffer. Alternatively, we could use the dynamic library manipulation functions to use the real C-function malloc.