
The storage machinery layout

This document is very technical; you don't need to read it if you just want to run a Pliant database. Reading it will, however, explain what is going on under the hood.

The Pliant global cache

At the bottom of the Pliant storage machinery is the Pliant global cache.
The global cache is implemented in /pliant/language/data/cache.pli
The global cache associates a Pliant object with an identifier (a string, in fact).

Let's think about fonts. Loading a font from disk is expensive, so once it is done, it is better to keep the decoded font in memory because we will likely use it again later in the program. Under memory pressure, however, it is better to drop some fonts in order to reduce memory consumption: should we need one of them later, it can still be reloaded from disk. This is exactly what the Pliant global cache is designed for: keep objects in main memory to speed things up, unless too much memory is already consumed, in which case it is wise to start dropping some of them.

The amount of memory the Pliant process is expected to use is defined by the 'memory_assigned' global variable. It is set at startup time by the /pliant/language/context/memory.pli module to the value of the '/pliant/memory/assigned' variable in this_computer.pdb, the Pliant global configuration database.

All objects in the global cache must have a data type that inherits from CachePrototype.
An object in the global cache can implement three generic methods (update, dump and signal) to cooperate with the upper layers of the Pliant storage machinery.

module "/pliant/language/compiler.pli"
module "/pliant/language/compiler/inherit.pli"
module "/pliant/language/data/cache.pli"
type MyCache
  inherit CachePrototype
  field Int my_field
CachePrototype maybe MyCache

Fetching an object in the global cache is done using 'cache_open':

var Str id := "abc"
var Link:MyCache c
if (cache_open "/my_org/my_cache/"+id MyCache ((addressof Link:MyCache c) map Link:CachePrototype))
  c my_field := undefined
  (var Stream s) open "data:/my_org/my_table" in+safe
  while not s:atend
    if (s:readline parse word:id (var Int i))
      c my_field := i
  s close
  if c:my_field<>undefined
    cache_setup ((addressof Link:MyCache c) map Link:CachePrototype) cache_class_cheap
    cache_ready ((addressof Link:MyCache c) map Link:CachePrototype)
  else
    cache_cancel ((addressof Link:MyCache c) map Link:CachePrototype)

'cache_open' returns true if the object was not yet in the cache and therefore has to be constructed. If construction succeeds, the program must call 'cache_ready' to make the object available; if it fails, the program must call 'cache_cancel' to immediately remove the object from the cache.
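
The open, then ready or cancel protocol can be modelled in a few lines. The following is only a single-threaded Python sketch of the idea, not Pliant code: the names 'BuildCache' and 'fetch' are hypothetical, and the real cache also handles locking and memory accounting.

```python
class BuildCache:
    """Toy model of the open / ready / cancel protocol (not the Pliant API)."""

    def __init__(self):
        self.entries = {}  # identifier -> [object, ready flag]

    def open(self, ident, factory):
        """Return (object, must_build); must_build is True for a fresh entry."""
        if ident in self.entries:
            return self.entries[ident][0], False
        obj = factory()
        self.entries[ident] = [obj, False]
        return obj, True

    def ready(self, ident):
        # Construction succeeded: the entry becomes available.
        self.entries[ident][1] = True

    def cancel(self, ident):
        # Construction failed: remove the entry immediately.
        del self.entries[ident]


def fetch(cache, ident, table):
    """Build the object from 'table' on first access, as the listing does."""
    obj, must_build = cache.open(ident, dict)
    if must_build:
        if ident in table:
            obj["my_field"] = table[ident]
            cache.ready(ident)
        else:
            cache.cancel(ident)
            return None
    return obj
```

A second call to 'fetch' with the same identifier returns the object already in the cache without consulting the table again.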

When the Pliant process detects that it is consuming too much memory, it frees (drops) some objects in the global cache. Objects are expected to be dropped on a least recently used basis. However, each object is assigned a class, and all objects of a given class are freed before any object of the next class. The class of an object is set through the 'cache_setup' function.
The classes are:
cache_class_cheap
cache_class_standard
cache_class_costy
cache_class_should_keep
cache_class_must_keep
Standard is the default class, so there is no need to call 'cache_setup' if it is the expected class for the new object. Class 'should_keep' means that if we drop the object, the program will misbehave, but that is still better than crashing. Class 'must_keep' means that dropping the object is just as bad as crashing.
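
The dropping policy (cheapest class first, least recently used first within a class) can be illustrated with a small model. This is a Python sketch under stated assumptions, not the code of cache.pli: 'ToyCache' and its methods are hypothetical names, and sizes are plain numbers supplied by the caller.

```python
from collections import OrderedDict

# Class names mirror the list above; a lower index is dropped first.
CLASSES = ["cheap", "standard", "costy", "should_keep", "must_keep"]


class ToyCache:
    def __init__(self):
        # One LRU queue per class, oldest entry first.
        self.queues = {name: OrderedDict() for name in CLASSES}

    def put(self, key, size, cls="standard"):
        self.queues[cls][key] = size

    def touch(self, key):
        # Mark an entry as most recently used within its class.
        for queue in self.queues.values():
            if key in queue:
                queue.move_to_end(key)
                return True
        return False

    def used(self):
        return sum(sum(q.values()) for q in self.queues.values())

    def shrink(self, target, max_cls="must_keep"):
        # Drop entries, cheapest class first and least recently used first,
        # never touching a class above max_cls, until usage <= target.
        dropped = []
        for name in CLASSES[: CLASSES.index(max_cls) + 1]:
            queue = self.queues[name]
            while queue and self.used() > target:
                key, _ = queue.popitem(last=False)
                dropped.append(key)
        return dropped
```

Keeping one queue per class makes the "all objects of a class before the next class" rule a simple outer loop, at the price of a linear scan in 'touch'.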

In the 'cache_open' listing, the part that is hard to understand might be:

((addressof Link:MyCache c) map Link:CachePrototype)

Basically, in order to be able to access our application specific field 'my_field', the 'c' variable needs to have the Link:MyCache data type. On the other hand, the 'cache_open' function and the others expect a variable with type Link:CachePrototype. If we write:

addressof:c map CachePrototype

then it is wrong, because the result of the expression is a temporary variable with data type CachePrototype, not Link:CachePrototype. That is why we use the version of addressof with two arguments instead of one: it lets us specify that we want the address of the link, not the address of the object the link is pointing to.

Now come the less frequently used functions to deal with the global cache:

var Str id := "abc"
var CBool found := cache_search "/my_org/my_cache/"+id (var Link:CachePrototype p)

Please notice that if 'found' is true, then 'p' points to the object found in the cache, but we have no guarantee that the object has type 'MyCache', so we could write something like:

if found and (entry_type addressof:p)=MyCache
  ...

We can also force an object to be immediately dropped:

var Str id := "abc"
cache_delete "/my_org/my_cache/"+id

We can also force the cache to drop many objects immediately through:

cache_shrink 64*2^20 cache_class_costy

The exact meaning is: drop objects whose class is not greater than the specified one, until memory consumption falls below the specified amount, here 64 MB.

The last function available is 'cache_broadcast', which applies a specified function to all objects currently in the cache. The following code displays on the console the identifier associated with each object in the cache, and the exact data type of the object:

function display object param
  oarg_rw CachePrototype object ; arg_rw Universal param
  console object:cache_id " " (entry_type addressof:object):name eol
cache_broadcast (the_function display CachePrototype Universal) void

Objects on disk

Objects of the Pliant storage are stored in memory in the global cache that we have just described, and on disk as a set of PML encoded instructions. PML encoding is described in another article.

The biggest difficulty when building classes of objects on top of the Pliant storage mechanism is to define the instruction set that will be used to immediately write to disk the changes applied to the object in memory. That way, in case of application crash or restart, or if the object is dropped from the global cache because of lack of available memory, it is later possible to rebuild the object in memory by parsing the set of PML instructions.
The bad solution would be to write the whole content of the object each time it is slightly modified, which is what desktop applications tend to do (file save menu).
Two application level classes of objects are currently provided as part of the Pliant storage system: one for storing databases, and one for storing desktop documents (arbitrary XML trees).

Several objects can be stored in a single PML file. Each of them is associated with an identifier, and we call such an object a fiber. In other words, we do multiplexing.
If you study the content of a storage PML file, you will see that it is a sequence of

open 'fiber' "an id" 'host' "a server" 'user' "a user account" 'ts' a_date_and_time body ... close

and

open 'f' body ... close

token sequences.
The 'f' sequence is just a more compact form of the 'fiber' sequence that can be used when 'host' and 'user' are the same as in the previous sequence, and 'ts' (which means timestamp) is not too different.

As a result, the fiber notion in a PML encoded file not only brings multiplexing capabilities, but also the ability to record the identity of the user responsible for each change, and when it happened. That is full traceability.
Please notice that the multiplexing capability was introduced because we wanted traceability (each change to the object is logged with an associated user and timestamp), so metadata had to be encoded in the PML file anyway. Multiplexing (fibers) came at nearly no extra cost.
In fact, multiplexing might be useful in only two situations. The first is adding meta information to the file that will just be ignored by the main application (desktop icons and the like). It might not be a good choice from the computational point of view, since reading the meta information, which is generally small, requires reading the whole content of the file, which can be very long (even if clever use of 'fiber_dump' might bring the overhead down in this situation).
The other situation is handling many very small objects, where storing each of them in its own file might be costly (mainstream filesystems still allocate room to files on a block basis, generally 4 KB, not on a byte basis, and multi-server synchronisation might suffer from connection latency). Here again, the computational cost might be high because reading one small object requires reading all of them.
So, in the end, in 99% of cases the fiber multiplexing feature is of no use, and the only fiber used is the one associated with the empty identifier.
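
The computational cost mentioned above comes from the fact that a multiplexed log has to be scanned from start to end, whatever fiber is wanted. The following Python sketch is a simplified model, not the PML format: records are reduced to (fiber identifier, instruction) pairs, and 'read_fiber' is a hypothetical name.

```python
def read_fiber(records, wanted):
    """Extract one fiber from a multiplexed log.

    records: iterable of (fiber_id, instruction) pairs in file order.
    Returns the instructions of the wanted fiber and the number of
    records that had to be visited to find them.
    """
    scanned = 0
    found = []
    for fiber_id, instruction in records:
        scanned += 1  # every record is visited, whatever its fiber
        if fiber_id == wanted:
            found.append(instruction)
    return found, scanned
```

With a log of four records where only one belongs to an "icons" fiber, reading that fiber still visits all four records.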

Storage control

'StorageControl' objects are used to organize the connection between objects in the global cache and the associated PML encoded representation on disk.
They are also used to store access control rights for data sharing between servers.
The 'StorageControl' data type is implemented in the /pliant/storage/ground/control.pli module.

Assuming that 'obj' is a link to an object in the global cache, you can get a link to the associated storage control using 'storage_control' method:

module "/pliant/storage/ground/object.pli"
var Link:CachePrototype obj
...
var Link:StorageControl c :> obj storage_control

Then the main methods on a storage control are 'fiber_modify_begin' and 'fiber_modify_end':

var Str fiber := ""
var Str user := "me"
var Link:Stream s :> c fiber_modify_begin fiber user
s otag "some_instruction"
c fiber_modify_end

'fiber_modify_begin' gets a Stream to access the associated PML file, does some locking in order to prevent several objects that use different fibers of the same PML file from writing at the same time, then writes the open 'fiber' ... body sequence or the 'f' body sequence.
'fiber_modify_end' writes the close token in the PML file, then unlocks access to the stream.

Since the PML stream is an infinite log that grows over time, in applications that modify the same data again and again, reading the log might become very long. So a 'fiber_dump' method is provided to write a fresh second version of the object content that contains only the up-to-date content instead of the whole history. From that point on, if the object content is modified again, reading the object will parse the dump file, plus only the tail of the infinite log written after the dump was last written.

c fiber_dump ""
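
The dump plus log tail idea can be illustrated with a small model. This is a Python sketch under simplifying assumptions, not the Pliant implementation: an instruction is a (field, value) pair that overwrites one field, a dump is kept as a list of instructions together with the log position it covers, and the names 'load' and 'dump' are hypothetical.

```python
def apply(state, instruction):
    # An instruction is a (field, value) pair that overwrites one field.
    field, value = instruction
    state[field] = value


def load(log, snapshot=None):
    """Rebuild the state; snapshot is None or (instructions, log_position)."""
    state = {}
    start = 0
    if snapshot is not None:
        instructions, start = snapshot
        for instruction in instructions:
            apply(state, instruction)
    # Only the tail of the log written after the snapshot is replayed.
    for instruction in log[start:]:
        apply(state, instruction)
    return state, len(log) - start


def dump(log):
    """Write a snapshot holding only the current content, not the history."""
    state, _ = load(log)
    return sorted(state.items()), len(log)
```

Loading with or without the snapshot gives the same state; the snapshot only reduces the number of log entries that have to be replayed.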

The last important method is:

c sync

On a secondary server, it synchronises with the master server: it sends what has been modified locally, and applies changes that have been made on other servers. In order to have synchronisation work across several servers, you need three things:

   •   define a resolver as described in the next paragraph
   •   have the storage server service running on each server (see the 'Service' button on the Fullpliant main menu)
   •   have proper cryptographic keys on each server so that servers can talk to each other using the Pliant secured channel notion.

Resolving

A resolver is a function you provide that Pliant can use to ask your application for the value of a variable associated with a given identifier.

In the following sample, we record a 'my_resolver' function that resolves identifiers with class 'my_int' by trying to find a line with the specified identifier and the associated value in the 'data:/my_org/my_table' ASCII file.

module "/pliant/language/unsafe.pli"
module "/pliant/language/data/resolve.pli"
module "/pliant/language/stream.pli"
function my_resolver class id adr t -> status
  arg Str class id ; arg Address adr ; arg Type t ; arg Status status
  status := failure
  if class="my_int" and t=Int
    (var Stream s) open "data:/my_org/my_table" in+safe
    while not s:atend
      if (s:readline parse word:id (var Int i))
        adr map Int := i
        status := success
    s close
resolve_domain 1 (the_function my_resolver Str Str Address Type)

The first parameter of 'resolve_domain' is the priority of the resolver. The resolver with higher priority will be tried first. Here is a sample usage:

var Int i
if (resolve_value "my_int" "foo" Int i)=success
  console "found it." eol
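
The priority rule can be modelled as an ordered list of functions that are tried in turn. This is a Python sketch of the dispatch idea only, not the code of resolve.pli: 'register_resolver' and 'resolve' are hypothetical stand-ins for 'resolve_domain' and 'resolve_value', and the behaviour for equal priorities (registration order) is an assumption.

```python
resolvers = []  # list of (priority, function) pairs, highest priority first


def register_resolver(priority, function):
    """Record a resolver; it returns (found, value) for a (class, id) pair."""
    resolvers.append((priority, function))
    # The sort is stable, so equal priorities keep registration order.
    resolvers.sort(key=lambda entry: -entry[0])


def resolve(cls, ident):
    """Try each resolver in priority order until one succeeds."""
    for _, function in resolvers:
        found, value = function(cls, ident)
        if found:
            return True, value
    return False, None
```

A resolver that does not know the class or the identifier simply reports failure, and the next one in the list gets its chance.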

The resolving machinery is implemented in /pliant/language/data/resolve.pli

At the moment, the only place in Pliant where the resolver machinery is really used is to attach replication information to storage objects.
See 'Sharing the database among several servers' in the 'Using database advanced features' article.

Filesystem

We have seen that the StorageControl object associated with a storage object in the main memory global cache is handling the connection with the on disk PML encoded representation of the object.
We have also seen that the on disk PML file is multiplexed.
The Pliant filesystem implemented in /pliant/storage/ground/filesystem.pli provides easy reading or writing of one of the fibers in the PML file.

module "/pliant/language/stream.pli"
module "/pliant/storage/ground/filesystem.pli"
module "/pliant/util/pml/io.pli"
(var Stream s) open "storage:/my_org/my_app/object1" in+safe
while not s:atend
  if (s ipick open (var Ident tag))
    console tag eol
  s iskip
s close

Should we want to read fiber "abc" instead of fiber "", then we would just write:

(var Stream s) open "storage:/my_org/my_app/object1" "fiber [dq]abc[dq]" in+safe

Writing to a fiber is just as easy:

(var Stream s) open "storage:/my_org/my_app/object1" "fiber [dq]my_meta_infos[dq]" out+safe
s otag "backup_performed_on" datetime
s close

We could do nearly the same with the following code:

(var Stream s) open "data:/my_org/my_app/object1/log" append+safe
s otag "fiber" "my_meta_infos"
s obody_begin
s otag "backup_performed_on" datetime
s obody_end
s close

But using the storage filesystem provides proper locking and is compatible with multiple servers.
In other words, the storage filesystem will use the StorageControl object that we have seen previously and use 'fiber_modify_begin' and 'fiber_modify_end' to handle multiplexing and locking.

Application level storage classes

Summary of the previous episodes:
The storage object is stored in the main memory cache.
Data persistence (data is not lost when the process or computer is stopped) and the infinite log are provided through an on disk PML encoded file that stores instructions specifying all changes applied to the object over time.

Now, as we have already said, the big issue is to cleverly define the instruction set, and it is not an easy task because complex in-memory data structures tend to use pointers that are hard to put on disk.

Pliant currently provides two storage classes, one for handling databases, one for handling XML trees for desktop applications. We are now going to see how to build a new trivial class, just to explain how the storage engine works. Usage of the existing classes will be explained in dedicated articles.

Let's start by defining our 'MyUser' data type.

module "/pliant/language/compiler.pli"
module "/pliant/language/compiler/inherit.pli"
module "/pliant/language/data/cache.pli"
module "/pliant/language/stream.pli"
module "/pliant/util/pml/io.pli"

type MyUser
  inherit CachePrototype
  field Str name
  field Str email
CachePrototype maybe MyUser

method u update s -> status
  oarg_rw MyUser u ; arg_rw Stream s ; arg ExtendedStatus status
  if (s itag "name" (var Str v))
    u name := v
    s inext
    status := success
  eif (s itag "email" (var Str v))
    u email := v
    s inext
    status := success
  else
    status := failure

method u dump s -> status
  oarg_rw MyUser u ; arg_rw Stream s ; arg ExtendedStatus status
  s otag "name" u:name
  s otag "email" u:email
  status := success

All the logic is provided through the generic methods 'update' and 'dump'.
'update' is used at load time to set the value of the in-memory object in the global cache from the PML encoded stream stored on disk. It is used by the 'storage_load' function in module /pliant/storage/ground/object.pli. It is also used to propagate changes from one server to another when the storage object is shared on several servers. This is implemented in the 'sync_down' function called by 'sync', which we have already seen in this document, and is implemented in module /pliant/storage/ground/control.pli
'dump' is used by the storage control 'fiber_dump' function that we have seen earlier in this document. It is intended to make a snapshot that speeds up further loading by avoiding reading the whole infinite log every time.

High level storage function

Now we can study the high level function used to handle storage objects.
It is 'storage_link' and is implemented in /pliant/storage/ground/object.pli

var Str id := "bob"
var Link:MyUser u :> storage_link "/my_org/my_user/"+id "" MyUser

Let's explain what 'storage_link' truly does, since it is in fact a summary of the storage machinery.
First, 'storage_link' is a meta, just to make usage easier; the real function is in fact 'storage_raw'.
It calls 'cache_open' to check whether the object is already in the main memory global cache. If it is not, the object has to be constructed, which for a storage object means read from the PML file, so it calls 'storage_load'.
'storage_load' uses the storage filesystem that we have seen earlier to demultiplex the PML stream, and calls 'update' to apply changes to the object.

Very attentive readers might have noticed that the 'MyUser' implementation does not specify how to extend the PML file when some field of 'MyUser' is modified. Using the 'dump' function would not be wise, because if the storage object were a huge database instead of the tiny 'MyUser', it would mean rewriting the whole database each time a single field is changed, which would be crazy.

So, we will now finish 'MyUser' implementation with:

method u set_name n
  arg_rw MyUser u ; arg Str n
  # 'sem' is the semaphore field that is added to 'MyUser' later in this document
  u:sem request
  (addressof:u omap CachePrototype) storage_id (var Str id) (var Str fiber)
  var Link:StorageControl c :> (addressof:u omap CachePrototype) storage_control
  var Pointer:Stream s :> c fiber_modify_begin fiber current_thread_header:user
  s otag "name" n
  c fiber_modify_end
  u name := n
  u:sem release

This code might require a little more explanation.
'storage_id' gives us the PML file path associated with the 'MyUser' storage object in the global cache, plus the identifier of the fiber it is associated with.
We have already seen 'storage_control', which gives us a link to the StorageControl.
'fiber_modify_begin' and 'fiber_modify_end' are also known functions that handle multiplexing and locking in the PML file.
'current_thread_header:user' is the standard way to know which user is currently executing this code; it has been set by the Pliant user interface (UI) server.

We can now write in our application:

u set_name "Bob Dylan"

However, our 'MyUser' storage class is still not working, because we have not protected read access. If some 'MyUser' objects are shared on several servers, we might read the 'name' field while the storage machinery is changing it as a side effect of synchronisation. This could make our Pliant process crash or, even worse but less likely, silently corrupt some of its data. Too bad.
Let's introduce multithreading (simultaneous access) protection:

type MyUser
  inherit CachePrototype
  field Str name
  field Str email
  field Sem sem
CachePrototype maybe MyUser

method u get_name n
  arg_rw MyUser u ; arg_w Str n
  u:sem rd_request
  n := u name
  u:sem rd_release

Well, this is a very bad user interface at application level, so, just for fun, here is a better implementation:

type MyUser
  inherit CachePrototype
  field Str name_
  field Str email_
  field Sem sem
CachePrototype maybe MyUser

method u name -> n
  arg_rw MyUser u ; arg Str n
  u:sem rd_request
  n := u name_
  u:sem rd_release

method u 'name :=' n
  arg_rw MyUser u ; arg Str n
  u:sem request
  (addressof:u omap CachePrototype) storage_id (var Str id) (var Str fiber)
  var Link:StorageControl c :> (addressof:u omap CachePrototype) storage_control
  var Pointer:Stream s :> c fiber_modify_begin fiber current_thread_header:user
  s otag "name" n
  c fiber_modify_end
  u name_ := n
  u:sem release

method u update s -> status
  oarg_rw MyUser u ; arg_rw Stream s ; arg ExtendedStatus status
  if (s itag "name" (var Str v))
    u:sem request
    u name_ := v
    u:sem release
    s inext
    status := success
  eif (s itag "email" (var Str v))
    u:sem request
    u email_ := v
    u:sem release
    s inext
    status := success
  else
    status := failure

So, we can now write the application in a clean way:

var Str id := "bob"
var Link:MyUser u :> storage_link "/my_org/my_user/"+id "" MyUser
console u:name eol
u name := "Bob Dylan"

Should somebody still be reading at this point, let's introduce the last refinement.
We might prefer to implement 'update' through:

method u update s -> status
  oarg_rw MyUser u ; arg_rw Stream s ; arg ExtendedStatus status
  if (s itag "name" (var Str v))
    u name := v
    s inext
    status := success
  eif (s itag "email" (var Str v))
    u email := v
    s inext
    status := success
  else
    status := failure

In our example, it does not really improve the code, but if 'name :=' and 'email :=' were quite complex functions, we would be happy not to write the code twice.
Also, without extra care, we would hit a problem: at load time, calling the 'name :=' function would have the side effect of extending the PML file, just as if the field had been modified by the application. We would end up with the PML file growing until it fills the whole disk, and the Pliant process would stop working as a result.
So, we change the 'name :=' function to:

method u 'name :=' n
  arg_rw MyUser u ; arg Str n
  u:sem request
  if not u:is_update
    (addressof:u omap CachePrototype) storage_id (var Str id) (var Str fiber)
    var Link:StorageControl c :> (addressof:u omap CachePrototype) storage_control
    var Pointer:Stream s :> c fiber_modify_begin fiber current_thread_header:user
    s otag "name" n
    c fiber_modify_end
  u name_ := n
  u:sem release

Basically, each object in the global cache has a few flags. One of them is set by the 'storage_load' function in order to signal that the 'update' function is being called for loading purposes, as opposed to being called for change logging purposes.
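
The role of the flag can be shown with a small model in which the same setter is used both by the application and by the loader. This is a Python sketch, not Pliant code: 'ToyUser' is a hypothetical class, a plain list stands in for the PML file, and the 'is_update' attribute only mimics the flag described above.

```python
class ToyUser:
    def __init__(self, log):
        self.log = log          # stands in for the PML file
        self.is_update = False  # True only while the log is being replayed
        self._name = ""

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        if not self.is_update:
            # Application change: record it in the log.
            self.log.append(("name", value))
        self._name = value

    def load(self):
        # Replay the log through the setter, with logging switched off.
        self.is_update = True
        try:
            for field, value in list(self.log):
                if field == "name":
                    self.name = value
        finally:
            self.is_update = False
```

Without the flag test in the setter, each load would append one new entry per entry already in the log, which is the unbounded growth described above.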

Here we are.
Please notice that this is just an explanation of how the underlying storage machinery works. Most of you will only use the provided database or document classes, which are much more straightforward to use and will be introduced in the next articles.