Language

Pliant markup language

A (fall) short classification of file formats

Historically, there have been mainly two classes of encodings for file formats and network protocols:

•	binary

•	ASCII (clear text)

The advantages of ASCII are:

•	it is easier to understand the content without exhaustive documentation

•	it is easier to debug in case of troubles

The disadvantages of ASCII are:

•	it is slower to encode and decode

•	it tends to be less compact

The compactness issue with ASCII is largely removed by ZLIB encoding the data flow (ZLIB is a fairly standard compression mechanism), so in the end mostly the computing overhead remains.
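
As a rough illustration of the compactness point, here is a small sketch written in Python rather than Pliant. The sample text is made up, and the ratio obtained depends heavily on how repetitive the data is, so take the numbers as indicative only.

```python
import zlib

# Hypothetical sample: a verbose, repetitive ASCII (XML-like) document.
text = ('<point x="10" y="20" />\n' * 1000).encode("ascii")

packed = zlib.compress(text)        # ZLIB encode the data flow
restored = zlib.decompress(packed)  # the receiving side undoes it

print(len(text), "bytes of ASCII,", len(packed), "bytes once compressed")
```

The price is the extra compression and decompression work, which is the computing overhead mentioned above.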

As a result, ASCII formats have been selected for most Internet protocols, and this is probably one of the reasons for the Internet's success, since the time needed to write a partial implementation of a protocol and debug interoperability issues has been greatly reduced. On the other hand, nearly all image file formats are binary because the encoding overhead seems too high.

Then comes the extensible versus bloated notion. A file format or network protocol is said to be extensible if you can add information without requiring existing readers to upgrade. Most early binary file formats were completely bloated. Then there have been attempts to provide extensible binary file formats, with more or less success: JPEG is a simple and extensible framework, TIFF is a much more complex one with no more extension capabilities.

On the ASCII front, the huge jump was introduced by SGML, better known through HTML and XML, the underlying formats of the web. The decisive extension it introduces is the ability to attach to a tag (an instruction) a set of (identifier, value) pairs, in other words a dictionary, which I will call a set of options, with readers just skipping the options they don't understand.
As a comparison, TIFF provides the ability to add new instructions (fields), with readers skipping the instructions they don't understand.
Adding options is greatly superior to adding instructions because it makes it possible to refine the format over time. When a format has been used for some time, ambiguities appear, and adding options makes it possible to smoothly add information that helps resolve interoperability issues or small troubles. In other words, adding instructions is too coarse a granularity for format evolution: it's useful, but not enough.
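
To make the "skip the options you don't understand" idea concrete, here is a minimal Python sketch using the standard XML parser. The 'image' tag and its 'width', 'height' and 'dpi' options are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def read_image(xml_text):
    """An 'old' reader: it only knows the width and height options."""
    tag = ET.fromstring(xml_text)
    return int(tag.get("width")), int(tag.get("height"))

old_file = '<image width="640" height="480" />'
# A newer writer refines the format with an extra option;
# the old reader never asks for it, so it is simply ignored.
new_file = '<image width="640" height="480" dpi="300" />'

print(read_image(old_file), read_image(new_file))
```

The old reader gives the same answer for both files, which is what lets the format evolve without forcing every reader to upgrade.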

As a summary, file formats are an awful field because it's too easy to create a new one, so there are zillions of poorly engineered ones. Then evolution goes in all directions: the HP printer file format was ASCII (PCL5) and turned to binary (PCL6) in order to reduce decoding overhead on cheap underpowered printers. Desktop suite file formats were ASCII in the early days (rich text), then became binary as they grew more complex, and are now back to ASCII with ODF in order to ease interoperability. PDF started as mostly ASCII, just like PostScript, and is now turning into a crazy mixture due to the addition of multiple layers of short-term extensions. And finally, the king, HTML, makes compromises (binary encoded images, polyline points provided in an option instead of an XML tree) when the overhead would hurt too much, or (JavaScript and CSS using a completely different syntax from XML) when XML rules would make it harder to read in the end.

The other important notion when classifying file formats is streaming versus direct access.
A streaming file format is a format that is intended to be read or written sequentially from the first to the last byte.

The advantages of streaming are:

•	the file can be produced by one computer and consumed on the other side with no need for storage in the middle

•	it avoids latency issues (disk direct access is very slow, so is direct access over the network)

The disadvantages are:

•	picking only part of the content can require reading all of it

PML design choices

In very few words, PML is binary encoded XML.
So, compared to XML, it mostly gets rid of the ASCII encoding overhead, which means that no dirty compromise is required when handling huge volumes of data such as images.
The drawback is that it first needs to be converted before being editable in a standard text editor, but the drawback is very limited because the conversion is independent of the content.

As a summary, PML main characteristics are:

•	low overhead

•	highly extensible

•	streamable

•	editable through content independent conversion

PML is even more general than XML because with PML the value associated with a tag option can be a subtree, whereas in XML it must be a single string value, so that if the value is complex, falling back to dirty ASCII encoding tricks is mandatory.
Also, with PML, tag options are not necessarily a sequence of one identifier followed by one value, but can be one identifier followed by several values. As an example, 'range' 2 3 instead of mini=2 maxi=3. Not a big deal.

How it works

A PML stream is a sequence of tokens. The type of each token is mostly defined by its first byte.

The available tokens are:

•	open, close and body framing tokens

•	Identifiers

•	Integer values

•	Floating point numbers

•	Strings

•	Boolean, date and time

•	Blobs (undefined raw content)

•	Strange tokens that are currently not used, but make room for future extensions, with today's parsers able to cope with future tokens.

Let's give an example. In XML we could encode something like:

<tag1 id1="value1" id2="value2"><tag2 id3="value3" /></tag1>

In PML, it could be encoded as the following stream of tokens:

open 'tag1' 'id1' "value1" 'id2' "value2" body open 'tag2' 'id3' "value3" close close

The open token is a single byte with value 11000101b (that is 197 in decimal notation).
'tag1' is a 5 byte token starting with byte 10110100b, which means a 4 character long identifier, followed by the 4 ASCII signs t - a - g - 1
"value1" is a 7 byte token starting with byte 10100110b, meaning a 6 character long string, followed by the 6 ASCII signs v - a - l - u - e - 1

The fact that a PML stream can be converted to an ASCII one without understanding the content comes from each token providing its data type in its first byte, so it's possible to follow the token sequence without understanding the semantics.
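
The byte values quoted above are enough to sketch both directions for a small subset of tokens. The following Python illustration (not the Pliant implementation) encodes open, close, short identifiers and short strings, then turns the bytes back into text by looking only at each first byte. It assumes that the low four bits of the first byte hold the length, which is inferred from the two examples above and can only cover lengths below 16. The 'close' byte value 11000100b is the one quoted in the API section, and every other token type is deliberately left out.

```python
OPEN = 0b11000101    # 'open' token (197)
CLOSE = 0b11000100   # 'close' token (196)
IDENT = 0b10110000   # high four bits of a short identifier token
STRING = 0b10100000  # high four bits of a short string token

def short_token(kind, text):
    """First byte = kind + length, followed by the ASCII characters.
    Assumption: this form is only valid for lengths below 16."""
    data = text.encode("ascii")
    if len(data) > 15:
        raise ValueError("longer tokens use an encoding not covered here")
    return bytes([kind | len(data)]) + data

def describe(stream):
    """Render a token stream as text without knowing what the tags mean."""
    words, pos = [], 0
    while pos < len(stream):
        first = stream[pos]
        pos += 1
        if first == OPEN:
            words.append("open")
        elif first == CLOSE:
            words.append("close")
        elif (first & 0xF0) in (IDENT, STRING):
            size = first & 0x0F
            text = stream[pos:pos + size].decode("ascii")
            pos += size
            quote = "'" if (first & 0xF0) == IDENT else '"'
            words.append(quote + text + quote)
        else:
            raise ValueError("token type %d is not covered by this sketch" % first)
    return " ".join(words)

# open 'tag1' 'id1' "value1" close
stream = (bytes([OPEN]) + short_token(IDENT, "tag1") + short_token(IDENT, "id1")
          + short_token(STRING, "value1") + bytes([CLOSE]))
print(describe(stream))
```

Note that 'describe' never needs to know what 'tag1' or 'id1' mean: the first byte of each token is all it uses to decide how many bytes to read and how to print them.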

For extra details about PML encoding, see /pliant/util/pml/coding.txt

Pliant API for encoding PML streams

/pliant/util/pml/prototype.pli defines the prototypes of the generic functions that specify how to send any data on a PML encoded stream, or try to pick any data from a PML encoded stream.
The implementation for basic Pliant data types (Int, Str, etc) is provided at the beginning of /pliant/util/pml/io.pli and high level functions to read or write PML streams are defined at the end of the same module.

Before we start, please notice that if you define a new data type for your application, and define the 'to stream' and 'from stream' methods for it, then all high level parsing functions below will work for your new data type.
Module /pliant/util/pml/canonical.pli even provides a 'pml_canonical' function that you can use to define these two methods automatically if your data type is trivial, I mean just a set of fields and well known data sets such as lists, arrays, indexes or dictionaries.

Now, let's assume that the 's' variable is a Pliant stream:

(var Stream s) open "file:/tmp/my_pml_file.pli" out+safe

The basic function for writing a tag to the PML stream is:

s otag "foo" 10 "abc"

and it will append the following tokens sequence to the stream:

open 'foo' 10 "abc" close

which is basically equivalent to an XML tag that would be:

<foo foo1="10" foo2="abc" />

Now, if we want to provide more options to the tag, we could write:

s otag "foo" 10 "abc"
s oattr "bar" 20
s oattr "another" 30 40

and it would produce the following tokens sequence:

open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

which is basically equivalent to an XML tag that would be:

<foo foo1="10" foo2="abc" bar="20" another="30" another1="40" />

You might have noticed that the first parameter to 'otag' and 'oattr' is a string, but it will produce an identifier PML token. All parameters coming next will be encoded straight on the PML stream.

As specified earlier in this document, the difference between PML and XML is that PML does not require an identifier to be associated with each option. It just uses identifier tokens to specify what optional sequence of parameters is coming next.

Now, let's assume that we want to provide a tag with some body:

s otag "para"
s oattr "style" "my_style"
s obody_begin
s otag "foo"
s otag "bar"
s obody_end

The PML tokens sequence will be:

open 'para' 'style' "my_style" body open 'foo' close open 'bar' close close

and it is basically equivalent to the XML sequence:

<para style="my_style"><foo /><bar /></para>

You might also have noticed that 'otag' and 'oattr' immediately put the 'close' token at the end of the PML stream, but it will be removed by 'oattr' and 'obody_begin' if they are called next. In other words, 'oattr' and 'obody_begin' do something a bit dirty on the Stream cache: they first remove the last byte if it's 11000100b (close PML token).
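
The "append 'close' now, take it back later" behaviour can be modelled in a few lines. The sketch below is a toy Python model, not the Pliant code: tokens are kept as text in a list instead of bytes in the Stream cache, and the class name 'TokenWriter' and helper 'show' are invented for the illustration.

```python
def show(value):
    """Text form of a value token: strings are quoted, integers are not."""
    return '"%s"' % value if isinstance(value, str) else str(value)

class TokenWriter:
    """Toy model of otag / oattr / obody_begin / obody_end."""

    def __init__(self):
        self.tokens = []

    def _drop_close(self):
        # The "dirty" step: remove a trailing 'close' so the tag can be extended.
        if self.tokens and self.tokens[-1] == "close":
            self.tokens.pop()

    def otag(self, name, *values):
        self.tokens += ["open", "'%s'" % name] + [show(v) for v in values] + ["close"]

    def oattr(self, name, *values):
        self._drop_close()
        self.tokens += ["'%s'" % name] + [show(v) for v in values] + ["close"]

    def obody_begin(self):
        self._drop_close()
        self.tokens.append("body")

    def obody_end(self):
        self.tokens.append("close")

s = TokenWriter()
s.otag("para")
s.oattr("style", "my_style")
s.obody_begin()
s.otag("foo")
s.otag("bar")
s.obody_end()
print(" ".join(s.tokens))
```

The benefit of this design is that the stream is a complete, well closed tag after every single call, at the cost of each follow-up call having to peek at what was written last.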

It is also possible to write tokens freely to a PML stream. Here is some code that produces exactly the same sequence of tokens:

s oraw open (cast "para" Ident) (cast "style" Ident) "my_style" body open (cast "foo" Ident) close open (cast "bar" Ident) close close

Pliant API for decoding PML streams

Let's assume that the 's' variable is now the following Pliant stream:

(var Stream s) open "file:/tmp/my_pml_file.pli" in+safe

Then, the basic function for recognising a PML tag will be:

if (s itag "foo" (var Int i1) (var Str s2)) ...

The previous sequence is expected to recognize the following tokens sequence:

open 'foo' 10 "abc"

Then we can use 'iattr' to find the tag options:

if (s iattr "another" (var Int i) (var Int j)) ...

would recognize the following sequence:

'another' 30 40

What is important to notice is that if the stream tokens sequence is the following:

open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

then the previous 'iattr' would succeed, and the reason is that 'iattr' is not expecting the specified tokens sequence to come just next in the PML stream. It can skip tokens until it reaches either the 'body' or the 'close' token. Nesting is supported, so that if it finds an 'open' token, the whole sequence up to the matching 'close' will be skipped. Also, there is a limit (generally 1 MB) on the amount of bytes one tag and its options are allowed to consume. Lastly, when the 'iattr' function returns, no stream data has been consumed, so searching for another attribute, or even the same one, is still possible.
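
The search rule just described can be sketched as follows. This is a Python model over an in-memory token list, not the Pliant implementation: the 'Ident' class, the sentinel objects and the 'count' parameter are invented for the illustration, and the 1 MB limit is not modelled.

```python
OPEN, BODY, CLOSE = object(), object(), object()  # framing tokens

class Ident(str):
    """Marks a token as an identifier rather than a plain string value."""

def iattr(tokens, pos, name, count):
    """Search the options of the current tag, starting at index 'pos', for
    identifier 'name' and return the 'count' values that follow it, or None.
    'pos' is never advanced, so the same search can be repeated."""
    depth, i = 0, pos
    while i < len(tokens):
        tok = tokens[i]
        if tok is OPEN:
            depth += 1              # entering a nested subtree: skip it
        elif tok is CLOSE:
            if depth == 0:
                return None         # end of the tag: not found
            depth -= 1
        elif tok is BODY and depth == 0:
            return None             # options end where the body starts
        elif depth == 0 and isinstance(tok, Ident) and tok == name:
            return tokens[i + 1:i + 1 + count]
        i += 1
    return None

# open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close
tokens = [OPEN, Ident("foo"), 10, "abc",
          Ident("bar"), 20, Ident("another"), 30, 40, CLOSE]
pos = 4  # where the read position stands once 'foo' 10 "abc" has been recognized

print(iattr(tokens, pos, "another", 2), iattr(tokens, pos, "bar", 1))
```

Because the position is left untouched, options can be looked up in any order, which is what makes the order of options in the stream irrelevant to the reader.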

So, if you write:

if (s itag "foo" (var Int i1) (var Str s2))
  console "ok" eol
if (s iattr "another" (var Int i) (var Int j))
  console "ok" eol
if (s iattr "bar" (var Int k))
  console "ok" eol

then ok will be displayed 3 times.

When you have scanned all the attributes of a tag you are interested in, you have to issue either 'ibody_begin' to enter the body of the tag, or 'ibody_none' to move to the next tag.

if (s itag "foo" (var Int i1) (var Str s2))
  console "ok" eol
if (s iattr "another" (var Int i) (var Int j))
  console "ok" eol
if (s iattr "bar" (var Int k))
  console "ok" eol
if s:ibody_none
  console "now we are ready to read next tag"

The previous sequence is intended to read a tag with no body, I mean something like:

open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

Now, if the tag has a body, let's say the tokens sequence is:

open 'para' 'style' "my_style" body open 'foo' close open 'bar' close close

Then the code to parse it could look like:

if (s itag "para")
  console "ok" eol
if (s iattr "style" (var Str style))
  console "ok" eol
if s:ibody_begin
  console "now we are going to read the tags in the body"
while s:imore
  if (s itag "foo") or (s itag "bar")
    s ibody_none
if s:ibody_end
  console "now we are ready to read the next tag"

When parsing a PML stream, we can also do raw parsing using 'iraw':

if (s iraw open (var Ident tag_id)) console "we have found tag " (cast tag_id Str) eol

We could also use 'ipick' instead of 'iraw': it does the same thing, but does not consume the data from the input stream.

Other utilities for manipulating PML streams

Module /pliant/util/pml/ascii.pli is providing ready to use functions for turning a PML encoded file to an ASCII file that can be studied using a text editor.

module "/pliant/language/stream.pli"
module "/pliant/util/pml/ascii.pli"
(var Stream s) open "file:/tmp/my_pml_file.pml" in+safe
(var Stream d) open "file:/tmp/my_pml_file.txt" out+safe
pml_decode s d "detailed"

A higher level decoding function that does exactly the same thing is also provided:

module "/pliant/language/stream.pli"
module "/pliant/util/pml/ascii.pli"
pml_decode "file:/tmp/my_pml_file.pml" "file:/tmp/my_pml_file.txt" "detailed"

The third parameter of the 'pml_decode' function provides extra flags to specify how the ASCII file is expected to be formatted. Only 'detailed' is implemented at the moment.
Please notice that I'm currently not satisfied at all with the ASCII presentation of the PML stream, so I will completely rework it at some time, and that the reverse function is missing, so it will have to be provided at some point.

Other files in /pliant/util/pml/ directory

This part of the documentation should not be there, it's just a memo for myself.

channel.pli implements a TCP security layer based on RSA+RC4 cryptography (it prevents the content from being seen or modified by a man in the middle between the client and the server).

multiplexer.pli is the dispatcher of all Pliant new services based on PML streams that all receive clients on the same TCP port. It can use channel.pli to achieve security on top of TCP.

locker.pli provides a service that makes it possible to get a ticket (password) that grants access to some content. Any server that sends the ticket will get access to the content.