String operations v0606

From EuWiki

w32support.e defines a few routines that deal with strings. Strings are defined as sequences of integers. Hence, all Sequence operations apply as well.

The library defines the w32string type as a sequence of 8-bit unsigned integers. However, this type is not used to type check the arguments of string routines, so that they may be used (for the most part) on UTF-16/32 encoded strings.

Whitespace trimming

Whitespace is hardcoded as any character which is a space, a tab ('\t'), a vertical tab (11), a line feed ('\n'), a form feed (12) or a carriage return ('\r'). This definition extends Euphoria's, since VT and FF are not valid characters in source code.

w32trim_left(sequence text) removes any leading whitespace from text and returns the modified string.

w32trim_right(sequence text) does the same, but removes trailing whitespace instead.

w32trim(object text) does a little more than both w32trim_left() and w32trim_right(). It returns atoms unchanged, strips enclosing double quotes, then performs leading and trailing whitespace removal (the above routines are not called). Finally, another double quote stripping is performed before returning the modified text.
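
The order of operations described for w32trim() can be modelled in plain Euphoria. This is only a sketch of the behaviour stated above, not the library's code; the names WHITESPACE, DQUOTE, strip_quotes and trim_model are hypothetical, and it assumes that exactly one pair of enclosing quotes is removed at each stripping step.

```euphoria
-- Whitespace as listed above: space, tab, VT (11), LF, FF (12), CR.
constant WHITESPACE = {' ', '\t', 11, '\n', 12, '\r'}
constant DQUOTE = 34   -- character code of the double quote

-- Hypothetical helper: strip one pair of enclosing double quotes, if present.
function strip_quotes(sequence text)
    if length(text) >= 2 and equal(text[1], DQUOTE)
       and equal(text[length(text)], DQUOTE) then
        return text[2..length(text)-1]
    end if
    return text
end function

-- Hypothetical model of the steps described for w32trim().
function trim_model(object text)
    integer lo, hi
    if atom(text) then
        return text                 -- atoms are returned unchanged
    end if
    text = strip_quotes(text)       -- first quote stripping
    lo = 1
    hi = length(text)
    while lo <= hi and find(text[lo], WHITESPACE) do
        lo += 1                     -- skip leading whitespace
    end while
    while hi >= lo and find(text[hi], WHITESPACE) do
        hi -= 1                     -- skip trailing whitespace
    end while
    return strip_quotes(text[lo..hi])   -- second quote stripping
end function

puts(1, trim_model("  \"hello\"  ") & '\n')   -- prints hello
```

Here the quotes only become the enclosing characters once the whitespace is gone, which is why the second stripping step matters.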

Character sets

The library defines sets of characters and provides routines to define such sets and know whether a character belongs to a set. You can think of character sets as style flags for characters. Very unfortunately, these routines do not work for UTF-16/32 character encodings.

Character set definition

What you can actually do is state that one or more characters belong to a given character set. The library provides 6 initially empty sets. You can of course redefine existing sets, but be aware that the library uses them internally: at the very least, the NameChar_CT, Digit_CT and Punct_CT sets should be left alone. You can even create new sets as powers of 2 greater than #8000 (undocumented).

See the documentation for w32setCType() for a list of known character sets and their default definition.

The w32setCType(chars,sets) procedure tells the library that all chars designated by chars belong to all sets designated by sets:

  • sets is a set constant, a sum or a sequence of these. It is processed like a regular style flag argument in a create() instruction;
  • chars is either:
    • a single character;
    • a sequence of characters: all the characters in the sequence are affected;
    • a pair of 1-character strings: the first string defines the lower bound of a range, the second string defines the upper bound for that range, and all characters in the range are affected.
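
The three accepted forms of chars can be illustrated with a small helper that expands any of them into a flat list of character codes. This is a sketch under assumptions: expand_chars is a hypothetical name, and the library may well process the argument differently.

```euphoria
-- Hypothetical helper: expand a chars argument into a flat list of codes.
function expand_chars(object chars)
    sequence result
    if atom(chars) then
        return {chars}              -- a single character
    end if
    if length(chars) = 2 and sequence(chars[1]) and sequence(chars[2])
       and length(chars[1]) = 1 and length(chars[2]) = 1 then
        -- a pair of 1-character strings: an inclusive range
        result = {}
        for c = chars[1][1] to chars[2][1] do
            result &= c
        end for
        return result
    end if
    return chars                    -- a plain sequence of characters
end function

puts(1, expand_chars({"a", "e"}) & '\n')   -- prints abcde
```

Note that a two-character string such as "ab" is not mistaken for a range, because its elements are atoms rather than 1-character strings.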

There is no simple function to remove a character from a set. To do this, you must determine which sets the character belongs to, compute a new set of sets by removing the one(s) you wish, and call w32setCType() with the affected character(s) and the new set of sets.

Testing

The w32CType(char,sets) function tests whether the character char belongs to all sets listed in sets. The result is 1 if it does and 0 otherwise.

The w32GetCType(chars) function returns the set of sets chars belongs to if chars is a single character. This set of sets is the sum of all the character set flags the character belongs to (use get_bits() to turn it into a sequence if needed). The function extends in the standard way to sequences.
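
The flag mechanism behind character sets can be modelled in plain Euphoria with one flag word per 8-bit character. This is a sketch only: every name below is hypothetical, the flag values are arbitrary powers of 2 above #8000, and the model assumes that defining a character's sets overwrites its previous flags, which is what the removal recipe above suggests.

```euphoria
-- Hypothetical flag values, chosen only for this sketch.
constant VOWEL_SET = #10000, UPPER_SET = #20000

sequence ctype_table            -- one flag word per 8-bit character code
ctype_table = repeat(0, 256)

-- Assign the flags in sets to every character in chars (assumed to overwrite).
procedure set_ctype(sequence chars, atom sets)
    for i = 1 to length(chars) do
        ctype_table[chars[i] + 1] = sets
    end for
end procedure

-- Sum of the flags of all sets ch belongs to.
function get_ctype(integer ch)
    return ctype_table[ch + 1]
end function

-- 1 if ch belongs to all the sets in sets, 0 otherwise.
function is_ctype(integer ch, atom sets)
    return and_bits(get_ctype(ch), sets) = sets
end function

-- Removal as described in the text: read the flags, clear some, write back.
procedure remove_from_set(integer ch, atom unwanted)
    atom flags
    flags = get_ctype(ch)
    set_ctype({ch}, flags - and_bits(flags, unwanted))
end procedure

set_ctype("AEIOU", VOWEL_SET + UPPER_SET)
set_ctype("aeiou", VOWEL_SET)
remove_from_set('A', UPPER_SET)   -- 'A' stays a vowel but is no longer upper
```

Because the flags are summed into a single value, membership in several sets at once is tested with a single and_bits() comparison.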

Other

The routines below are useful, but didn't seem to fit anywhere else, nor to form a more meaningful group or break into such.

The makeStandardName(text) function strips any non-alphabetical character from the string text. This is the only place the library uses the character sets described in the previous section.

The w32split(text,delim) function splits text into substrings and removes delimiters. delim is either:

  • a sequence: it is then assumed to be a substring;
  • a {string}, in which case any character in the string is assumed to be a delimiter;
  • a single delimiter. In this case only, a depth analysis is performed, using '(' and '{' as group openers and ')' and '}' as group closers, and delimiters found inside groups do not trigger a substring split.
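
The depth analysis performed for a single delimiter can be sketched in plain Euphoria as follows. The function name split_outside_groups is hypothetical and the code is a model of the described behaviour, not the library's implementation; in particular, it assumes that an unmatched closer is simply ignored.

```euphoria
-- Hypothetical sketch: split on one delimiter, ignoring delimiters
-- that occur inside (...) or {...} groups.
function split_outside_groups(sequence text, integer delim)
    sequence parts
    integer depth, start
    parts = {}
    depth = 0
    start = 1
    for i = 1 to length(text) do
        if find(text[i], "({") then
            depth += 1                      -- entering a group
        elsif find(text[i], ")}") and depth > 0 then
            depth -= 1                      -- leaving a group
        elsif text[i] = delim and depth = 0 then
            parts = append(parts, text[start..i-1])
            start = i + 1
        end if
    end for
    return append(parts, text[start..length(text)])
end function

sequence parts
parts = split_outside_groups("a,f(b,c),{d,e}", ',')
-- parts is {"a", "f(b,c)", "{d,e}"}
```

Only the commas at depth 0 split the text, so argument lists and sequence literals survive intact.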

There is no general w32join() inverse function, but it is not difficult to craft something rudimentary:

function w32join(sequence substrings,object delimiter,integer flags)
-- rebuilds a single sequence, using the parts in substrings
-- and the delimiter delimiter.
-- flags is possibly 1 (to add a delimiter at the beginning),
-- plus possibly 2 (to add a delimiter at the end)
sequence result,item
if length(substrings)=0 then
   return {}   -- nothing to join
end if
if and_bits(flags,1) then
   result=delimiter & substrings[1]
else
   result=substrings[1]
end if
for i=2 to length(substrings) do
   item=delimiter & substrings[i]
   result&=item
end for
if and_bits(flags,2) then
   return result & delimiter
else
   return result
end if
end function

However, there is an undocumented w32ToString(object x) which behaves like w32join(x,',',0). It returns strings unchanged, and turns atoms into their string representations.

The undocumented w32MaxWidth(object x) returns:

  • for an atom, the length of its string representation, using the %15.15g format;
  • for strings, the length of the string;
  • for other sequences, the maximum width of all the atoms and strings the sequence has.

The w32Encode(sequence text,sequence mask,integer size) function computes a hash value for the string text. That hash value is a string of length size made of digits in the '0'..'9' range. Use mask to add some extra input so that the chances of duplicate {string,mask} pairs are tiny and ideally nil. There is no point in attempting to recover text from the returned value. See the password.exw demo program for an example of use.

The w32TextToNumber(text) function is an alternative to the Euphoria value() and works slightly differently:

  • If text is a string, it tries to convert it into an atom using rules stated below. The atom is returned on success and 0 on failure.
  • If text is a sequence {some_string, some_nonzero_atom}, then the returned value is a pair. The second element of the pair is the place where conversion stopped, or 0 if no premature stop occurred. The first element is the atom obtained by converting the usable head of the string;
  • If text is {some_string,0.0}, some_string is processed as if specified alone.
  • The conversion routine is more flexible than value():
    • a base can be specified by having the first character of the string read '#' (base 16), '@' (base 8) or '!' (base 2);
    • Fractional numbers are understood whatever the base;
    • For hexadecimal values, the 'A'..'F' digits may be lower case;
    • A '+' or '-' sign may precede or follow the digits, and cause sign to be interpreted properly;
    • Commas and underscores are skipped, which makes reading the numbers easier;
    • Adding any number of '%' signs at the end of the string causes as many divisions by 100 as there are trailing '%' signs.
    • If any currency symbol appears before digits, it is ignored.
    • Surrounding whitespace is ignored.
  • Some undocumented rules:
    • The trailing '%' signs, whitespace characters and '+'/'-' sign characters may appear in any combination. The furthest inside sign symbol is eventually retained.
    • The leading base symbols, whitespace characters and '+'/'-' sign characters may appear in any combination. Only the sign or base symbol furthest inside is retained.
    • The set of currency signs is hardcoded as "$£¤¥€". The currency sign, if any, must be followed by a digit or dot and appear before any digit has been found. Yet trailing currency signs are commonplace, especially for '€';
    • The decimal separator must be the EN/US one, '.'. However, some countries use ',' as decimal separator and '.' for thousands, so there should be an option to swap '.' and ',' so as to properly read strings in the other locale format.
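
A few of the conversion rules above (base prefix, fractions in any base, lower-case hexadecimal digits, skipped commas and underscores, trailing '%' signs) can be modelled in plain Euphoria. This is a deliberately simplified sketch with hypothetical names; signs, currency symbols, whitespace and the {string, atom} calling form are left out, and it does not claim to reproduce the library's parsing.

```euphoria
constant DIGITS = "0123456789ABCDEF"
constant PREFIXES = "#@!", BASES = {16, 8, 2}

-- Hypothetical simplified model; returns 0 on failure, like the plain-string form.
function text_to_number_model(sequence text)
    atom result, scale
    integer base, pos, digit, ch
    base = 10
    pos = 1
    if length(text) > 0 and find(text[1], PREFIXES) then
        base = BASES[find(text[1], PREFIXES)]
        pos = 2
    end if
    result = 0
    scale = 0                       -- stays 0 until the decimal point is seen
    while pos <= length(text) do
        ch = text[pos]
        if ch >= 'a' and ch <= 'f' then
            ch -= 32                -- accept lower-case hexadecimal digits
        end if
        digit = find(ch, DIGITS) - 1
        if ch = '.' and scale = 0 then
            scale = 1
        elsif ch = ',' or ch = '_' then
            -- separators are skipped
        elsif digit >= 0 and digit < base then
            result = result * base + digit
            scale *= base           -- counts fractional digits after the point
        else
            exit                    -- first unusable character
        end if
        pos += 1
    end while
    if scale > 0 then
        result /= scale
    end if
    while pos <= length(text) and text[pos] = '%' do
        result /= 100               -- one division per trailing '%'
        pos += 1
    end while
    if pos <= length(text) then
        return 0                    -- leftover characters count as failure
    end if
    return result
end function

printf(1, "%g %g %g\n", {text_to_number_model("#ff.8"),
                         text_to_number_model("1_000,000"),
                         text_to_number_model("50%")})
-- prints 255.5 1e+06 0.5
```

Swapping the roles of '.' and ',' in such a model would be one way to support the other locale format mentioned in the last rule.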