Fundamental Data Structures


                              Introduction

The modern digital computer was invented and intended as a device that should facilitate and speed up complicated and time-consuming computations. In the majority of applications its capability to store and access large amounts of information plays the dominant part and is considered to be its primary characteristic, and its ability to compute, i.e., to calculate, to perform arithmetic, has in many cases become almost irrelevant.

In all these cases, the large amount of information that is to be processed in some sense represents an abstraction of a part of reality(aka a model).
  • The information that is available to the computer consists of a selected set of data about the actual problem, namely that set that is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. 
  • The data represent an abstraction of reality in the sense that certain properties and characteristics of the real objects are ignored because they are peripheral and irrelevant to the particular problem. 
  • An abstraction is thereby also a simplification of facts.
We may regard a personnel file of an employer as an example. Every employee is represented (abstracted) on this file by a set of data relevant either to the employer or to his accounting procedures.
  • This set may include some identification of the employee, for example, his or her name and salary. But it will most probably not include irrelevant data such as the hair color, weight, and height.
  • In solving a problem with or without a computer it is necessary to choose an abstraction of reality, i.e., to define a set of data that is to represent the real situation. This choice must be guided by the problem to be solved. 
  • Then follows a choice of representation of this information. This choice is guided by the tool that is to solve the problem, i.e., by the facilities offered by the computer. In most cases these two steps are not entirely separable.
  • The choice of representation of data is often a fairly difficult one, and it is not uniquely determined by the facilities available. It must always be taken in the light of the operations that are to be performed on the data. 
  • A good example is the representation of numbers, which are themselves abstractions of properties of objects to be characterized. If addition is the only (or at least the dominant) operation to be performed, then a good way to represent the number n is to write n strokes. The addition rule on this representation is indeed very obvious and simple. The Roman numerals are based on the same principle of simplicity, and the adding rules are similarly straightforward for small numbers. On the other hand, the representation by Arabic numerals requires rules that are far from obvious (for small numbers) and they must be memorized.
  • However, the situation is reversed when we consider either addition of large numbers or multiplication and division. The decomposition of these operations into simpler ones is much easier in the case of representation by Arabic numerals because of their systematic structuring principle that is based on positional weight of the digits.
  • It is generally known that computers use an internal representation based on binary digits (bits). This representation is unsuitable for human beings because of the usually large number of digits involved, but it is most suitable for electronic circuits because the two values 0 and 1 can be represented conveniently and reliably by the presence or absence of electric currents, electric charge, or magnetic fields.
From this example we can also see that the question of representation often transcends several levels of detail. Given the problem of representing, say, the position of an object,
  • The first decision may lead to the choice of a pair of real numbers in, say, either Cartesian or polar coordinates
  • The second decision may lead to a floating-point representation, where every real number x consists of a pair of integers denoting a fraction f and an exponent e to a certain base (such that x = f×2^e). 
  • The third decision, based on the knowledge that the data are to be stored in a computer, may lead to a binary, positional representation of integers, and 
  • the final decision could be to represent binary digits by the electric charge in a semiconductor storage device.
Evidently, the first decision in this chain is mainly influenced by the problem situation, and the later ones are progressively dependent on the tool and its technology. Thus, it can hardly be required
that a programmer decide on the number representation to be employed, or even on the storage device characteristics. These lower-level decisions can be left to the designers of computer equipment, who have the most information available on current technology with which to make a sensible choice that will be
acceptable for all (or almost all) applications where numbers play a role.

In this context, the significance of programming languages becomes apparent.
A programming language represents an abstract computer capable of interpreting the terms used in this language, which may embody a certain level of abstraction from the objects used by the actual machine. Thus, the programmer who uses such a higher-level language will be freed (and barred) from questions of number representation, if the number is an elementary object in the realm of this language.
The importance of using a language that offers a convenient set of basic abstractions common to most problems of data processing lies mainly in the area of reliability of the resulting programs. It is easier to design a program based on reasoning with familiar notions of numbers, sets, sequences, and repetitions
than on bits, storage units, and jumps.
  1. Of course, an actual computer represents all data, whether numbers, sets, or sequences, as a large mass of bits. But this is irrelevant to the programmer as long as he or she does not have to worry about the details of representation of the chosen abstractions, and as long as he or she can rest assured that the corresponding representation chosen by the computer (or compiler) is reasonable for the stated purposes.
  2. The closer the abstractions are to a given computer, the easier it is to make a representation choice for the engineer or implementor of the language, and the higher is the probability that a single choice will be suitable for all (or almost all) conceivable applications. 
  3. This fact sets definite limits on the degree of abstraction from a given real computer. For example, it would not make sense to include geometric objects as basic data items in a general-purpose language, since their proper representation will, because of its inherent complexity, be largely dependent on the operations to be applied to these objects. The nature and
    frequency of these operations will, however, not be known to the designer of a general-purpose language and its compiler, and any choice the designer makes may be inappropriate for some potential applications.
Here these deliberations determine the choice of notation for the description of algorithms and their data. Clearly, we wish to use familiar notions of mathematics, such as numbers, sets, sequences, and so on, rather than computer-dependent entities such as bitstrings.
But equally clearly we wish to use a notation for which efficient compilers are known to exist.
It is equally unwise to use a closely machine-oriented and machine-dependent language, as it is unhelpful to describe computer programs in an abstract notation that leaves problems of representation widely open.

The programming language Pascal had been designed in an attempt to find a compromise between these extremes, and the successor languages Modula-2 and Oberon are the result of decades of experience [1-3]. Oberon retains Pascal's basic concepts and incorporates some improvements and some extensions. It has been successfully implemented on several computers, and it has been shown that the notation is sufficiently close to real machines that the chosen features and their representations can be clearly explained. The language is also sufficiently close to other languages, and hence the lessons taught here may equally well be applied in their use.


                      The Concept of Data Type

In mathematics it is customary to classify variables according to certain important characteristics. Clear distinctions are made between real, complex, and logical variables or between variables representing individual values, or sets of values, or sets of sets, or between functions, functionals, sets of functions, and so on.
This notion of classification is equally if not more important in data processing.
We will adhere to the principle that every constant, variable, expression, or function is of a certain type. This type essentially characterizes the set of values to which a constant belongs, or which can be assumed by a variable or expression, or which can be generated by a function.
In mathematical texts the type of a variable is usually deducible from the typeface without consideration of context; this is not feasible in computer programs. Usually there is one typeface available on computer equipment (i.e., Latin letters).
The rule is therefore widely accepted that the associated type is made explicit in a declaration of the constant, variable, or function, and that this declaration textually precedes the application of that constant, variable, or function. 
This rule is particularly sensible if one considers the fact that a compiler has to make a choice of representation of the object within the store of a computer.
Evidently, the amount of storage allocated to a variable will have to be chosen according to the size of the range of values that the variable may assume. If this information is known to a compiler, so-called dynamic storage allocation can be avoided. This is very often the key to an efficient realization of an algorithm.

The primary characteristics of the concept of type that is also  embodied in the programming language Oberon, are the following [1-2]:
  1. A data type determines the set of values to which a constant belongs, or which may be assumed by a variable or an expression, or which may be generated by an operator or a function.
  2. The type of a value denoted by a constant, variable, or expression may be derived from its form or its declaration without the necessity of executing the computational process.
  3. Each operator or function expects arguments of a fixed type and yields a result of a fixed type. If an operator admits arguments of several types (e.g., + is used for addition of both integers and real numbers), then the type of the result can be determined from specific language rules.
As a consequence, a compiler may use this information on types to check the legality of various constructs.
For example, the mistaken assignment of a Boolean (logical) value to an arithmetic variable may be detected without executing the program.
This kind of redundancy in the program text is extremely useful as an aid in the development of programs, and it must be considered as the primary advantage of good high-level languages over machine code (or symbolic assembly code). 
Evidently, the data will ultimately be represented by a large number of binary digits, irrespective of whether or not the program had initially been conceived in a high-level language using the concept of type or in a typeless assembly code.
To the computer, the store is a homogeneous mass of bits without apparent structure. But it is exactly this abstract structure which alone is enabling human programmers to recognize meaning in the  monotonous landscape of a computer store.
The theory presented here and the programming language Oberon specify certain methods of defining data types.
  1. In most cases new data types are defined in terms of previously defined data types. Values of such a type are usually conglomerates of component values of the previously defined constituent types, and they are said to be structured
  2. If there is only one constituent type, that is, if all components are of the same constituent type, then it is known as the base type
  3. The number of distinct values belonging to a type T is called its cardinality. The cardinality provides a measure for the amount of storage needed to represent a variable x of the type T, denoted by x: T.
  4. Since constituent types may again be structured, entire hierarchies of structures may be built up, but, obviously, the ultimate components of a structure are atomic. Therefore, it is necessary that a notation is provided to introduce such primitive, unstructured types as well. 
  5. A straightforward method is that of enumerating the values that are to constitute the type. For example in a program concerned with plane geometric figures, we may introduce a primitive type called shape, whose values may be denoted by the identifiers rectangle, square, ellipse, circle. 
  6. But apart from such programmer-defined types, there will have to be some standard, predefined types. They usually include numbers and logical values. 
  7. If an ordering exists among the individual values, then the type is said to be ordered or scalar. In Oberon, all unstructured types are ordered; in the case of explicit enumeration, the values are assumed to be ordered by their enumeration sequence.
With this tool in hand, it is possible to define primitive types and to build conglomerates, structured types up to an arbitrary degree of nesting. In practice, it is not sufficient to have only one general method of combining constituent types into a structure. With due regard to practical problems of representation and use, a general-purpose programming language must offer several methods of structuring.
In a mathematical sense, they are equivalent; they differ in the operators available to select components of these structures.
The basic structuring methods presented here are the
  • array, 
  • the record, 
  • the set, and 
  • the sequence.
More complicated structures are not usually defined as static types, but are instead dynamically generated during the execution of the program, when they may vary in size and shape.
Such structures include
  • lists, 
  • rings, 
  • trees, and 
  • general, finite graphs.
Variables and data types are introduced in a program in order to be used for computation. To this end, a set of operators must be available.
For each standard data type a programming language offers a certain set of primitive, standard operators, and likewise with each structuring method a distinct operation and notation
for selecting a component.
The task of composition of operations is often considered the heart of the art of programming. However, it will become evident that the appropriate composition of data is equally fundamental and essential.
The most important basic operators are
  1. comparison and 
  2. assignment,
i.e., the test for equality (and for order in the case of ordered types), and the command to enforce equality.
The fundamental difference between these two operations is emphasized by the clear distinction in their denotation throughout this text.
Test for equality: x=y (an expression with value TRUE or FALSE)
Assignment to x: x := y (a statement making x equal to y)
These fundamental operators are defined for most data types, but it should be noted that their execution may involve a substantial amount of computational effort, if the data are large and highly structured.

For the standard primitive data types, we postulate not only the availability of assignment and comparison, but also a set of operators to create (compute) new values.
Thus we introduce the standard operations of arithmetic for numeric types and the elementary operators of propositional logic for logical values.

                     Primitive Data Types

A new, primitive type is definable by enumerating the distinct values belonging to it. Such a type is called an enumeration type. Its definition has the form

TYPE T = (c1, c2, ... , cn)
T is the new type identifier, and the ci are the new constant identifiers.

Examples

TYPE shape = (rectangle, square, ellipse, circle)
TYPE color = (red, yellow, green)
TYPE sex = (male, female)
TYPE weekday = (Monday, Tuesday, Wednesday, Thursday, Friday,
Saturday, Sunday)
TYPE currency = (franc, mark, pound, dollar, shilling, lira, guilder,
krone, ruble, cruzeiro, yen)
TYPE destination = (hell, purgatory, heaven)
TYPE vehicle = (train, bus, automobile, boat, airplane)
TYPE rank = (private, corporal, sergeant, lieutenant, captain, major,
colonel, general)
TYPE object = (constant, type, variable, procedure, module)
TYPE structure = (array, record, set, sequence)
TYPE condition = (manual, unloaded, parity, skew)

The definition of such types introduces not only a new type identifier, but at the same time the set of identifiers denoting the values of the new type.
These identifiers may then be used as constants throughout the program, and they enhance its understandability considerably. If, as an example, we introduce variables s, d, r, and b.
VAR s: sex
VAR d: weekday
VAR r: rank
then the following assignment statements are possible:
s := male
d := Sunday
r := major
b := TRUE
Evidently, they are considerably more informative than their counterparts
s := 1 d := 7 r := 6 b := 2
which are based on the assumption that c, d, r, and b are defined as integers and that the constants are mapped onto the natural numbers in the order of their enumeration.
Furthermore, a compiler can check against the inconsistent use of operators. For example, given the declaration of s above, the statement
s := s+1 
would be meaningless.
If, however, we recall that enumerations are ordered, then it is sensible to introduce operators that generate the successor and predecessor of their argument. We therefore postulate the following standard operators, which assign to their argument its successor and predecessor respectively:
INC(x)
DEC(x)


                  Standard Primitive Types

Standard primitive types are those types that are available on most computers as built-in features.
They include
  • the whole numbers, 
  • the logical truth values, and 
  • a set of printable characters.
  • On many computers fractional numbers are also incorporated, together with the standard arithmetic operations.
We denote these types by the identifiers

INTEGER, REAL, BOOLEAN, CHAR, SET


                            Integer types

The type INTEGER comprises a subset of the whole numbers whose size may vary among individual computer systems.
  1. If a computer uses n bits to represent an integer in two's complement notation, then the admissible values x must satisfy -2^n-1 <= x < 2^n-1. 
  2. It is assumed that all operations on data of this type are exact and correspond to the ordinary laws of arithmetic, and that the computation will be interrupted in the case of a result lying outside the representable subset. This event is called overflow
  3. The standard operators are the four basic arithmetic operations of addition (+), subtraction (-), multiplication (*), and division (/, DIV).
  4. Whereas the slash denotes ordinary division resulting in a value of type REAL, the operator DIV denotes integer division resulting in a value of type INTEGER. If we define the quotient q = m DIV n and the remainder r = m MOD n, the following relations hold, assuming n > 0: q*n + r = m and 0 <= r
Examples:

 31 DIV 10 = 3           31 MOD 10 = 1
-31 DIV 10 = -4         -31 MOD 10 = 9

We know that dividing by 10^n can be achieved by merely shifting the decimal digits n places to the right and thereby ignoring the lost digits.
The same method applies, if numbers are represented in binary instead of decimal form.
If two's complement representation is used (as in practically all modern computers), then the shifts implement a division as defined by the above DIV operation. Moderately sophisticated compilers
will therefore represent an operation of the form m DIV 2^n or m MOD 2n by a fast shift (or mask) operation.

                                The type REAL

The type REAL denotes a subset of the real numbers.
Whereas arithmetic with operands of the types INTEGER is assumed to yield exact results, arithmetic on values of type REAL is permitted to be inaccurate within the limits of round-off errors caused by computation on a finite number of digits. This is the principal reason for the explicit distinction between the types INTEGER and REAL, as it is made in most programming languages.
The standard operators are the four basic arithmetic operations of
addition (+), subtraction (-),multiplication (*), and division (/).
It is an essence of data typing that different types are incompatible under assignment. An exception to this rule is made for assignment of integer values to real variables,because here the semanitcs are unambiguous. After all, integers form a subset of real numbers.
However, the inverse direction is not permissible: Assignment of a real value to an integer variable requires an operation such as truncation or rounding. 
The standard transfer function Entier(x) yields the integral part of
x.
Rounding of x is obtained by Entier(x + 0.5).

Many programming languages do not include an exponentiation operator. The following is an algorithm for the fast computation of y = x^n, where n is a non-negative integer.

y := 1.0;
i := n;
WHILE i > 0 DO (* x0^n = x^i * y *)
  IF ODD(i) THEN
    y := y*x
  END ;
  x := x*x;
  i := i DIV 2
END



[to be continued...]

Comments