Text handling

Literal strings

Literal strings are written by enclosing the content in double quotes:

console "abc" eol

Special characters are written as bracket sequences:
[lb] is left bracket
[rb] is right bracket
[dq] is double quote
[cr] is carriage return
[lf] is line feed
[tab] is tabulation
[0] is character 0

console "abc[lf]def[rb]" eol 

Basic operations on strings

The data type for storing a text string is 'Str'.
The length of the string is returned by the 'len' method:

var Str s := "abc"
console s:len eol

Concatenation uses the plus sign:

var Str s2 := s+"def"

A substring is obtained by providing the index of the first selected character (the first character of the string has index 0, not 1) followed by the number of selected characters.
The index of the first selected character must not be negative (at least 0), and can be bigger than the length of the string (an empty string will be returned).
The number of selected characters must not be negative (at least 0), and can be bigger than the number of remaining characters in the string (the returned string will have fewer than the specified number of characters).

var Str s := "abcde"
console (s 3 2) eol
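
As an illustration of the two boundary rules above, here is a small sketch. It uses only constructs already shown in this document; the comments state what the rules above imply, not output checked against a particular Pliant release:

var Str s := "abcde"
console (s 3 10) eol # only 2 characters remain after index 3, so this gives "de"
console (s 7 2) eol # index 7 is beyond the end of the string, so this gives an empty string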

Characters

A character can be constructed from its ASCII number with the 'character' instruction:

var Char c := character 65
console c eol

A character can be extracted from a string, or changed, by providing its index. The index must not be negative and must be smaller than the string length:

var Str s := "abcde"
var Char c := s 1
console c eol
s 2 := "C"
console s eol

A literal string with only one character can also be used as a character:

var Char c := "A" # ok
var Char c := "AB" # does not work

The ASCII number of a character can be obtained with the 'number' method:

var Char c := "A"
console c:number eol

UTF8 encoding issues

Recent releases of Pliant assume that a 'Str' data type is UTF8 encoded. This has nasty consequences you have to keep in mind.

Let's start with a sample:

var Str s := "rêve"
console s:len eol

The result is 5, not 4.
The reason is that the 'ê' character is not an ASCII one, so it is encoded as two bytes in the UTF8 encoding scheme.
So what was inaccurate in the previous sections is that I spoke about characters, when in fact it was really bytes of the UTF8 encoding.
In other words, 'len' does not return the number of characters in the string, but the number of bytes of its UTF8 encoding. The two are the same only if the string contains nothing but ASCII characters.
The same kind of issue applies when picking a substring: the index of the first character is in fact the index of the first byte in the UTF8 encoding, and the number of characters is in fact the number of bytes in the UTF8 encoding.
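
To make the byte indexing concrete, here is a sketch based on the byte layout described above ('r' is byte 0, 'ê' is bytes 1 and 2, 'v' is byte 3, 'e' is byte 4). The comments are derived from that layout rather than from a checked run:

var Str s := "rêve"
console (s 0 3) eol # 3 bytes: "r" plus the two bytes of "ê", so "rê"
console (s 3 2) eol # bytes 3 and 4, so "ve"

Asking for (s 0 2) would cut the 'ê' in the middle of its two-byte encoding, so substring boundaries have to be chosen with care when the string is not pure ASCII.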

Let's continue with encoding related issues:

module "/pliant/language/type/text/str32.pli" # we need this because 'Str32' data type is not part of the default Pliant dialect: we need an extension
var Str32 s := "rêve"
console s:len eol

The result is this time 4.
The reason is that 'Str32' encodes each character on exactly 32 bits, so with 'Str32' the number of 32-bit words, and therefore of positions, truly is the number of characters in the string.

Of course, 'character32' and 'number' are also available for the 'Char32' characters that 'Str32' strings are made of:

module "/pliant/language/type/text/str32.pli"
var Char32 c := character32 234
console c:number eol

In summary, Pliant uses UTF8 encoding for the default string data type because it is more memory efficient than using 16 or 32 bits per character. The drawback is that one character no longer matches one position. In many situations this is not really a problem, because UTF8 is a smart encoding: you can, for example, search for substrings, and it will always work.
If you need exactly one position per character, for example to know the exact number of characters in the string, you can convert the 'Str' data to 'Str32'.
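
A minimal sketch of such a conversion follows. It assumes that assigning a 'Str' variable to a 'Str32' variable converts it implicitly, in the same way the "rêve" literal was accepted as a 'Str32' initial value in the sample above. That assumption is not confirmed by this document, so check it against your Pliant release:

module "/pliant/language/type/text/str32.pli"
var Str s := "rêve"
var Str32 s32 := s # assumed implicit conversion from 'Str' to 'Str32'
console s:len eol # number of bytes in the UTF8 encoding
console s32:len eol # number of characters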

A 'Str8' data type exists, that encodes each character exactly on 8 bits:

module "/pliant/language/type/text/str8.pli"
var Str8 s := "rêve"
console s:len eol

Please notice that there is currently no 'Char8' data type and no 'character8' function, because they are in fact 'Char' and 'character'.

Searching for a substring

The first parameter of the 'search' method is the substring to search for; it must not be the empty string. The second is the value to return if the substring is not found. Once more, the result is not the index of the character where the substring has been found, but the index of the byte in the UTF8 encoding.

var Str s := "abcde"
console (s search "cd" -1) eol
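
The two other cases described above can be sketched the same way. The values in the comments follow from the description of 'search' and from the UTF8 layout of "rêve" discussed earlier, not from a checked run:

var Str s := "abcde"
console (s search "xy" -1) eol # not found, so the second parameter, -1, is returned
var Str s2 := "rêve"
console (s2 search "ve" -1) eol # byte index 3, not character index 2, because "ê" takes two bytes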

Parsing

The 'parse' method is a very powerful way to parse a string. It scans the string from left to right and tries to match the provided arguments one after the other. The result is true if all arguments have been matched and the end of the string has been reached.
Let's start with a sample:

var Str s := "abc12def"
if (s parse "abc" (var Int i) "def")
  console "the value in the middle is " i eol

In this first sample, we have seen two kinds of elements: literal strings and a variable. A literal string must be matched literally, or parse will fail. A variable must find characters that form a valid value for the variable's data type, and the variable is then set to that value; otherwise parse will fail.

The 'any' keyword matches any sequence of characters. If it has a variable as an argument, the matched substring will be returned in the variable:

var Str s := "abcdef"
if (s parse "ab" any "ef")
  console "good" eol
if (s parse "ab" any:(var Str middle) "ef")
  console "middle content is " middle eol

For a string variable, a matching sequence of characters looks just like a Pliant literal string: it starts with a double quote, ends with a double quote, and in between, the bracket character introduces a special character:

var Str s := "ab[dq]cd[dq]ef"
if (s parse "ab" (var Str v) "ef")
  console "the string value is " v eol
var Str s := "abcdef"
if (s parse "ab" (var Str v) "ef")
  console "oh no !" eol
if (s parse "ab" any:(var Str v) "ef")
  console "the characters in the middle are " v eol

The 'pattern' keyword forces its argument to be handled as a literal string to be matched instead of a variable to be filled:

var Str s := "abcdef"
var Str p := "ab"
if (s parse pattern:p "cdef")
  console "good" eol

The 'word' keyword checks that what is matched is a full word, meaning that the previous and next characters are not letters:

var Str s := "abcdef"
if (s parse "ab" word:"cd" "ef")
  console "oh no !" eol
var Str s := "ab cd ef"
if (s parse "ab" word:"cd" "ef")
  console "good" eol

Case can be ignored by using 'acpattern' or 'acword' instead of 'pattern' or 'word'; 'ac' means 'any case':

var Str s := "ab cd ef"
if (s parse acpattern:"AB" acword:"CD" acpattern:"EF")
  console "good" eol

One can also match spaces explicitly. The underscore keyword matches any number of spaces:

var Str s := "ab cd ef"
if (s parse any:(var Str w1) _ any:(var Str w2) _ any:(var Str w3))
  console "we found three words; they are " w1 ", " w2 " and " w3 eol

All spaces between matched elements are automatically dropped. If you don't want spaces to be automatically dropped, use 'eparse', which stands for 'exact parsing', instead of 'parse':

var Str s := "abc12def"
if (s parse "abc" (var Int i) "def")
  console "good" eol
if (s eparse "abc" (var Int i) "def")
  console "good" eol
var Str s := "abc 12 def"
if (s parse "abc" (var Int i) "def")
  console "good" eol
if not (s eparse "abc" (var Int i) "def")
  console "good" eol

Options

Options are a way to store a whole dictionary (a set of keyword -> value pairs) in a single string variable.

Let's take an example:

var Str s := "id [dq]r1[dq] name [dq]Dupont[dq] count 3 mini 2 maxi 10 country [dq]Spain[dq]"

We can test if a keyword is defined:

if (s option "name")
  console "name is defined" eol

We can also pick the value associated with the keyword:

console "name is " (s option "name" Str) eol

A default value can be provided; it is returned if the requested keyword is not found or if the corresponding value does not match the requested data type:

console "name is " (s option "name" Str "nobody") eol

We can also query the position of the keyword in the string:

console "name keyword found at position " (s option_position "name" -1) eol

The second parameter of 'option_position' is the value to return if the keyword is not found in the string.
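
As a sketch of these fallback rules, the sample below queries a string that has no "name" keyword. It reuses only the call forms shown above; the comments state what the descriptions above imply rather than checked output:

var Str s := "id [dq]r1[dq] count 3"
console (s option "count" Int 0) eol # "count" is defined and followed by 3
console (s option "name" Str "nobody") eol # "name" is missing, so the default "nobody" is returned
console (s option_position "name" -1) eol # "name" is missing, so -1 is returned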

By combining 'option_position' and 'parse', we can also read two values following a keyword:

var Str s := "id [dq]r1[dq] name [dq]Dupont[dq] count 3 range 2 10 country [dq]Spain[dq]"
var Int i := s option_position "range" 0
if ((s i s:len) parse word:"range" (var Int mini) (var Int maxi) any)
  console "range is " mini " to " maxi eol

Lastly, let's assume that we want to pass the same keyword several times. Several of the methods we have seen previously accept an extra parameter, just after the keyword parameter, that specifies the index of the keyword instance we want to consider. When this parameter is omitted, it is the same as setting it to zero. Here is a sample:

var Str s := "value 2 value 5 value 10"
var Int i := 0
while (s option_position "value" i -1)<>(-1)
  console "value " i " is " (s option "value" i Int) eol
  i += 1

Other string related functions

No explanation needed, is there?

console (repeat 5 "abc") eol

console upper:"Hello" eol
console lower:"Hello" eol