Pliant markup language

A (fall) short classification of file formats

Historically, there have been mainly two classes of file formats and network protocol encodings: ASCII and binary.
The advantages of ASCII are:
The disadvantages of ASCII are:
The compactness issue with ASCII is largely removed by ZLIB encoding the data flow (ZLIB is a fairly standard compression mechanism), so only the computing overhead remains in the end. As a result, ASCII formats have been selected for carrying most Internet protocols, and this is probably one of the reasons for the Internet's success, since the time needed to do a partial implementation of a protocol and debug interoperability issues has been greatly reduced. On the other hand, nearly all image file formats are binary because the encoding overhead seems too high.

Then comes the extensible versus bloated notion. A file format or network protocol is said to be extensible if you can add information without requiring existing readers to upgrade. Most early binary file formats were completely bloated. Then there have been attempts to provide extensible binary file formats, with more or less success: JPEG is a simple and extensible framework, TIFF is a much more complex one with no more extension capabilities. On the ASCII front, the huge jump was introduced by SGML, better known through HTML and XML, the underlying formats of the web. The decisive extension it introduces is the ability to provide, with a tag (an instruction), a set of (identifier, value) pairs, that is a dictionary that I will call a set of options, with readers just skipping the options they don't understand.

As a summary, file formats are an awful field because it's too easy to create a new one, so there are zillions of poorly engineered ones. Then evolution goes in all directions: the HP printer file format was ASCII (PCL5) and turned to binary (PCL6) in order to reduce decoding overhead on cheap underpowered printers. The desktop suite file formats were ASCII in the early days (rich text), then became binary as they grew more complex, and are now back to ASCII with ODF in order to ease interoperability.
PDF started as mostly ASCII, just like PostScript, and is now turning into a crazy mixture due to the addition of multiple layers of short-term extensions. And finally, the king, HTML, makes compromises: binary encoded images and polyline points provided in an option instead of an XML tree when the overhead would hurt too much, and Javascript and CSS using a completely different syntax than XML when XML rules would make them harder to read in the end.

The other important notion when classifying file formats is streaming versus direct access. The advantages of streaming are:
The disadvantages are:
PML design choices

In very few words, PML is binary encoded XML. As a summary, the main characteristics of PML are:
PML is even more general than XML because with PML the value associated with a tag option can be a subtree, whereas in XML it must be a single string value, so that if the value is complex, falling back to dirty ASCII encoding tricks is mandatory.

How it works

A PML stream is a sequence of tokens. The type of each token is mostly defined by its first byte. The available tokens are:
Let's give an example. In XML we could encode something like:

    <tag1 id1="value1" id2="value2">

In PML, it could be encoded as the following stream of tokens:

    open 'tag1' 'id1' "value1" 'id2' "value2" body

The open token is a single byte with value 11000101b (that is 197 in decimal notation).

The fact that a PML stream can be converted to an ASCII one without understanding the content comes from the fact that each token provides its data type in its first byte, so it's possible to follow the token sequence without understanding the semantics. For extra details about PML encoding, see /pliant/util/pml/coding.txt

Pliant API for encoding PML streams

/pliant/util/pml/prototype.pli defines the prototypes of the generic functions that specify how to send any data on a PML encoded stream, or try to pick any data from a PML encoded stream. Before we start, please notice that if you define a new data type for your application, and define the 'to stream' and 'from stream' methods for it, then all the high level parsing functions below will work for your new data type.

Now, let's assume that the 's' variable is a Pliant stream:

    (var Stream s) open "file:/tmp/my_pml_file.pli" out+safe

The basic function for writing a tag to the PML stream is:

    s otag "foo" 10 "abc"

and it will append the following token sequence to the stream:

    open 'foo' 10 "abc" close

which is basically equivalent to an XML tag that would be:

    <foo foo1="10" foo2="abc" />

Now, if we want to provide more options to the tag, we could write:

    s otag "foo" 10 "abc"
    s oattr "bar" 20
    s oattr "another" 30 40

and it would produce the following token sequence:

    open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

which is basically equivalent to an XML tag that would be:

    <foo foo1="10" foo2="abc" bar="20" another="30" another1="40" />

You might have noticed that the first parameter to 'otag' and 'oattr' is a string, but it produces an identifier PML token. All parameters coming next are encoded straight on the PML stream.
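The way 'otag' and 'oattr' build a token sequence can be modelled in a few lines of Python. This is only a symbolic sketch, not the Pliant implementation: the names `Tok`, `Ident`, `PmlWriter` and `render` are made up for the illustration, tokens are kept as Python objects instead of encoded bytes, and the way `oattr` drops a trailing close token is a modelling assumption chosen so that the output matches the sequences shown above.

```python
from dataclasses import dataclass
from enum import Enum


class Tok(Enum):
    """Symbolic stand-ins for the structural PML tokens (not the real byte values)."""
    OPEN = "open"
    BODY = "body"
    CLOSE = "close"


@dataclass(frozen=True)
class Ident:
    """An identifier token, as opposed to a plain string value."""
    name: str


class PmlWriter:
    """Hypothetical in-memory model of a PML output stream."""

    def __init__(self):
        self.tokens = []

    def otag(self, name, *values):
        # The tag name is passed as a string but stored as an identifier token.
        self.tokens += [Tok.OPEN, Ident(name), *values, Tok.CLOSE]

    def oattr(self, name, *values):
        # Assumption: reopen the tag by dropping the trailing close token.
        if self.tokens and self.tokens[-1] is Tok.CLOSE:
            self.tokens.pop()
        self.tokens += [Ident(name), *values, Tok.CLOSE]


def render(tokens):
    """Print tokens in the notation used in this document."""
    def one(tok):
        if isinstance(tok, Tok):
            return tok.value
        if isinstance(tok, Ident):
            return f"'{tok.name}'"
        if isinstance(tok, str):
            return f'"{tok}"'
        return str(tok)
    return " ".join(one(t) for t in tokens)


w = PmlWriter()
w.otag("foo", 10, "abc")
w.oattr("bar", 20)
w.oattr("another", 30, 40)
print(render(w.tokens))
# open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close
```

Keeping identifiers in a separate `Ident` type is what lets a reader tell the option name 'bar' apart from a string value "bar" without knowing anything about the tag.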
As specified earlier in this document, the difference between PML and XML is that PML does not require an identifier to be associated with each option. It just uses identifier tokens to specify what optional sequence of parameters is coming next.

Now, let's assume that we want to provide a tag with some body:

    s otag "para"
    ...

The PML token sequence will be:

    open 'para' 'style' "my_style" body open 'foo' close open 'bar' close close

and it is basically equivalent to the XML sequence:

    <para style="my_style"> <foo /> <bar /> </para>

You might also have noticed that 'otag' and 'oattr' immediately put the 'close' token at the end of the PML stream, but it will be removed by 'oattr' and 'obody_begin' if they are called next. In other words, 'oattr' and 'obody_begin' do something a bit dirty on the Stream cache: they first remove the last byte if it's 11000100b (the close PML token).

It is also possible to write tokens freely to a PML stream. Here is some code that produces exactly the same sequence of tokens:

    s oraw open (cast "para" Ident) (cast "style" Ident) "my_style" body open (cast "foo" Ident) close open (cast "bar" Ident) close close

Pliant API for decoding PML streams

Let's assume that the 's' variable is now the Pliant stream:

    (var Stream s) open "file:/tmp/my_pml_file.pli" in+safe

Then, the basic function for recognising a PML tag will be:

    if (s itag "foo" (var Int i1) (var Str s2))
      ...

The previous sequence is expected to recognize the following token sequence:

    open 'foo' 10 "abc"

Then we can use 'iattr' to find the tag options:

    if (s iattr "another" (var Int i) (var Int j))
      ...

would recognize the following sequence:

    'another' 30 40

What is important to notice is that if the stream token sequence is the following:

    open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

then the previous 'iattr' would succeed, and the reason is that 'iattr' is not expecting the specified token sequence to come just next in the PML stream. It can skip tokens until it reaches either the 'body' or the 'close' token.
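The look-ahead behaviour of 'iattr' can be sketched in Python as a scan over a token list. This is a hedged model rather than the actual Pliant code: `Tok`, `Ident` and the Python function `iattr` with its `(tokens, pos, name, *types)` signature are assumptions made for the illustration, the stream is a plain list, and the size limit on a tag is not modelled. The function never moves `pos`, and it steps over any nested open ... close group.

```python
from dataclasses import dataclass
from enum import Enum


class Tok(Enum):
    """Symbolic stand-ins for the structural PML tokens."""
    OPEN = "open"
    BODY = "body"
    CLOSE = "close"


@dataclass(frozen=True)
class Ident:
    name: str


def iattr(tokens, pos, name, *types):
    """Search the options of the current tag, starting at pos, for `name`.

    Returns the tuple of values following the identifier if they have the
    expected types, or None. Nothing is consumed: pos is only read.
    """
    depth = 0  # > 0 while inside a nested open ... close group
    i = pos
    while i < len(tokens):
        tok = tokens[i]
        if tok is Tok.OPEN:
            depth += 1
        elif tok is Tok.CLOSE:
            if depth == 0:
                return None  # end of the current tag: option not found
            depth -= 1
        elif tok is Tok.BODY and depth == 0:
            return None  # body of the current tag starts: stop searching
        elif depth == 0 and tok == Ident(name):
            values = tokens[i + 1:i + 1 + len(types)]
            if len(values) == len(types) and all(
                type(v) is t for v, t in zip(values, types)
            ):
                return tuple(values)
        i += 1
    return None


stream = [Tok.OPEN, Ident("foo"), 10, "abc",
          Ident("bar"), 20, Ident("another"), 30, 40, Tok.CLOSE]
pos = 4  # position just after: open 'foo' 10 "abc"
print(iattr(stream, pos, "another", int, int))  # (30, 40)
print(iattr(stream, pos, "bar", int))           # (20,)
print(iattr(stream, pos, "missing", int))       # None
```

Because `pos` is untouched, the options can be queried in any order and as many times as needed before moving on to the body or to the next tag.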
Nesting is supported, so that if 'iattr' finds an 'open' token, the whole sequence up to the matching 'close' will be skipped. Also, there is a limit (generally 1 MB) on the number of bytes one tag and its options are allowed to consume. Lastly, when the 'iattr' function returns, no stream data has been consumed, so searching for another attribute, or even the same one, is still possible. So, if you write:

    if (s itag "foo" (var Int i1) (var Str s2))
      ...

then ok will be displayed 3 times.

When you have scanned all the attributes of a tag you are interested in, you have to either issue 'ibody_begin' to enter the body of the tag, or 'ibody_none' to move to the next tag.

    if (s itag "foo" (var Int i1) (var Str s2))
      ...

The previous sequence is intended to read a tag with no body, I mean something like:

    open 'foo' 10 "abc" 'bar' 20 'another' 30 40 close

Now, if the tag has a body, let's say the token sequence is:

    open 'para' 'style' "my_style" body open 'foo' close open 'bar' close close

Then the code to parse it could look like:

    if (s itag "para")
      ...

When parsing a PML stream, we can also do raw parsing using 'iraw':

    if (s iraw open (var Ident tag_id))
      ...

We could also use 'ipick' instead of 'iraw': it does the same thing, but it does not consume the data from the input stream.

Other utilities for manipulating PML streams

Module /pliant/util/pml/ascii.pli provides ready to use functions for turning a PML encoded file into an ASCII file that can be studied using a text editor.

    module "/pliant/language/stream.pli"
    ...

A higher level decoding function that does exactly the same thing is also provided:

    module "/pliant/language/stream.pli"
    ...

The third parameter of the 'pml_decode' function provides extra flags to specify how the ASCII file is expected to be formatted. Only 'detailed' is implemented at the moment.

Other files in /pliant/util/pml/ directory

This part of the documentation should not be there, it's just a memo for myself.
channel.pli implements a TCP security layer based on RSA+RC4 cryptography (it prevents the content from being seen or modified by a man in the middle between the client and the server).

multiplexer.pli is the dispatcher of all the new Pliant services based on PML streams, which all receive clients on the same TCP port. It can use channel.pli to achieve security on top of TCP.

locker.pli provides a service that enables a client to get a ticket (password) that grants access to some content. Any server that sends the ticket will get access to the content.