Detailed Description

Version: $Id$

strings

Define UTF8_CORE as required. Wrapper around mb_strlen. Assumes mb_internal_encoding is already set to UTF-8. Note: this function does not count bad bytes in the string; they are simply ignored.

Parameters:

string	UTF-8 string

Returns: int number of UTF-8 characters in string
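
The gap between byte length and character length can be sketched as follows. This is a minimal illustration, assuming the mbstring extension is loaded; it calls mb_strlen directly rather than the library wrapper, and the variable names are illustrative only.

```php
<?php
// Assumption: the mbstring extension is available.
mb_internal_encoding('UTF-8');

// "héllo" written with explicit UTF-8 bytes, so the example does not
// depend on the encoding of this source file.
$word = "h\xC3\xA9llo";

$byteCount = strlen($word);    // counts bytes
$charCount = mb_strlen($word); // counts UTF-8 characters

echo $byteCount, " bytes, ", $charCount, " characters\n";
```

The two counts differ because the accented letter occupies two bytes in UTF-8.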

strings

Wrapper around mb_strpos. Finds the position of the first occurrence of a string. Assumes the mbstring internal encoding is set to UTF-8.

Parameters:

string	haystack
string	needle (you should validate this with utf8_is_valid)
integer	offset in characters (from left)

Returns: mixed integer position or FALSE on failure
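
A short sketch of why character positions and byte positions diverge, assuming the mbstring extension is loaded. It uses mb_strpos directly rather than the library wrapper, and the sample strings are made up for illustration.

```php
<?php
// Assumption: the mbstring extension is available.
mb_internal_encoding('UTF-8');

// "naïve café" written with explicit UTF-8 bytes.
$haystack = "na\xC3\xAFve caf\xC3\xA9";
$needle   = "caf";

$bytePos = strpos($haystack, $needle);    // offset counted in bytes
$charPos = mb_strpos($haystack, $needle); // offset counted in characters

// A match at the very start yields 0, which is loosely equal to FALSE,
// so "not found" should be tested with the strict === operator.
$missing = mb_strpos($haystack, "xyz");

echo "byte offset $bytePos, character offset $charPos\n";
var_dump($missing === false);
```

The byte offset is one larger than the character offset here because the two-byte "ï" precedes the match.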

strings

Wrapper around mb_strrpos. Finds the position of the last occurrence of a character in a string. Assumes the mbstring internal encoding is set to UTF-8.

Parameters:

string	haystack
string	needle (you should validate this with utf8_is_valid)
integer	(optional) offset (from left)

Returns: mixed integer position or FALSE on failure

strings

Wrapper around mb_substr. Returns part of a string given a character offset (and optionally a length). Assumes the mbstring internal encoding is set to UTF-8.

Parameters:

string
integer	number of UTF-8 characters offset (from left)
integer	(optional) length in UTF-8 characters from offset

Returns: mixed string or FALSE on failure
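
The following sketch contrasts cutting by characters with cutting by bytes. It assumes the mbstring extension is loaded and calls mb_substr and mb_check_encoding directly; the variable names are illustrative and not part of the library.

```php
<?php
// Assumption: the mbstring extension is available.
mb_internal_encoding('UTF-8');

// "déjà vu" written with explicit UTF-8 bytes.
$text = "d\xC3\xA9j\xC3\xA0 vu";

// Cutting by characters keeps "é" whole.
$firstTwoChars = mb_substr($text, 0, 2);

// Cutting by bytes splits "é" after its first byte.
$firstTwoBytes = substr($text, 0, 2);
$cutIsValid    = mb_check_encoding($firstTwoBytes, 'UTF-8');

// An offset past the end of the string gives an empty string.
$pastEnd = mb_substr($text, 100);

var_dump($firstTwoChars === "d\xC3\xA9", $cutIsValid, $pastEnd);
```

The byte-based cut leaves a truncated sequence that is no longer valid UTF-8, which is the situation a character-aware substr avoids.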

strings

Wrapper around mb_strtolower. Makes a string lowercase. Assumes the mbstring internal encoding is set to UTF-8. Note: the concept of a character's "case" exists only in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings.

Parameters:

string

Returns: mixed either string in lowercase or FALSE if UTF-8 is invalid
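
As a small illustration of the note on case, the sketch below lowercases a string that mixes Latin and Chinese characters. It assumes the mbstring extension is loaded and calls mb_strtolower directly; the sample text is made up.

```php
<?php
// Assumption: the mbstring extension is available.

// "ÉCOLE 中文" written with explicit UTF-8 bytes.
$mixed = "\xC3\x89COLE \xE4\xB8\xAD\xE6\x96\x87";

// Latin letters, including the accented "É", are mapped to lowercase;
// the Chinese characters have no case and pass through unchanged.
$lower = mb_strtolower($mixed, 'UTF-8');

var_dump($lower === "\xC3\xA9cole \xE4\xB8\xAD\xE6\x96\x87");
```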

strings

Wrapper around mb_strtoupper. Makes a string uppercase. Assumes the mbstring internal encoding is set to UTF-8. Note: the concept of a character's "case" exists only in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings.

Parameters:

string

Returns: mixed either string in uppercase or FALSE if UTF-8 is invalid

strings

Define UTF8_CORE as required. Unicode-aware replacement for strlen(). Returns the number of characters in the string (not the number of bytes) by replacing multibyte characters with a single-byte equivalent. utf8_decode() converts characters that are not in ISO-8859-1 to '?', which, for the purpose of counting, is fine. It is much faster than iconv_strlen. Note: this function does not count bad UTF-8 bytes in the string; they are simply ignored.

Author: <chernyshevsky at hotmail dot com>

Parameters:

string	UTF-8 string

Returns: int number of UTF-8 characters in string

See also: http://www.php.net/manual/en/function.utf8-decode.php

strings

UTF-8 aware alternative to strpos. Finds the position of the first occurrence of a string. Note: this gets a lot slower if offset is used. Note: requires utf8_strlen and utf8_substr to be loaded.

Parameters:

string	haystack
string	needle (you should validate this with utf8_is_valid)
integer	offset in characters (from left)

Returns: mixed integer position or FALSE on failure

See also: http://www.php.net/strpos, utf8_strlen, utf8_substr

strings

UTF-8 aware alternative to strrpos. Finds the position of the last occurrence of a character in a string. Note: this gets a lot slower if offset is used. Note: requires utf8_substr and utf8_strlen to be loaded.

Parameters:

string	haystack
string	needle (you should validate this with utf8_is_valid)
integer	(optional) offset (from left)

Returns: mixed integer position or FALSE on failure

See also: http://www.php.net/strrpos, utf8_substr, utf8_strlen

strings

UTF-8 aware alternative to substr. Returns part of a string given a character offset (and optionally a length).

Note on arguments: compared to substr, if offset or length are not integers, this version will not complain but rather massages them into an integer.

Note on returned values: the substr documentation states that false can be returned in some cases (e.g. offset > string length). mb_substr never returns false; it returns an empty string instead. This function adopts the mb_substr approach.

Note on implementation: PCRE only supports repetitions of less than 65536. In order to accept up to MAXINT values for offset and length, a group of 65535 characters is repeated when needed.

Note on implementation: calculating the number of characters in the string is a relatively expensive operation, so it is only carried out when necessary. It isn't necessary for positive offsets with no specified length.

Author: Chris Smith <chris@jalakai.co.uk>

Parameters:

string
integer	number of UTF-8 characters offset (from left)
integer	(optional) length in UTF-8 characters from offset

Returns: mixed string or FALSE on failure

strings

UTF-8 aware alternative to strtolower. Makes a string lowercase. Note: the concept of a character's "case" exists only in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings. Note: requires utf8_to_unicode and utf8_from_unicode.

Author: Andreas Gohr <andi@splitbrain.org>

Parameters:

string

Returns: mixed either string in lowercase or FALSE if UTF-8 is invalid

See also: http://www.php.net/strtolower, utf8_to_unicode, utf8_from_unicode, http://www.unicode.org/reports/tr21/tr21-5.html, http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

strings

UTF-8 aware alternative to strtoupper. Makes a string uppercase. Note: the concept of a character's "case" exists only in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings. Note: requires utf8_to_unicode and utf8_from_unicode.

Author: Andreas Gohr <andi@splitbrain.org>

Parameters:

string

Returns: mixed either string in uppercase or FALSE if UTF-8 is invalid

See also: http://www.php.net/strtoupper, utf8_to_unicode, utf8_from_unicode, http://www.unicode.org/reports/tr21/tr21-5.html, http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

strings

UTF-8 aware alternative to str_ireplace. Case-insensitive version of str_replace. Note: requires utf8_strtolower. Note: it is not fast and gets slower if $search / $replace is an array. Note: it is based on the assumption that the lowercase and uppercase versions of a UTF-8 character have the same length in bytes, which is currently true given the hash table used by strtolower.

Parameters:

string

Returns: string

See also: http://www.php.net/str_ireplace, utf8_strtolower

strings

Replacement for str_pad. $padStr may contain multi-byte characters.

Author: Oliver Saunders <oliver (a) osinternetservices.com>

Parameters:

string	$input
int	$length
string	$padStr
int	$type (same constants as str_pad)

Returns: string

See also: http://www.php.net/str_pad, utf8_substr

strings

UTF-8 aware alternative to str_split. Converts a string to an array. Note: requires utf8_strlen to be loaded.

Parameters:

string	UTF-8 encoded
int	number of characters to split string by

Returns: array the string split into pieces

See also: http://www.php.net/str_split, utf8_strlen

strings

UTF-8 aware alternative to strcasecmp. A case-insensitive string comparison. Note: requires utf8_strtolower.

Parameters:

string
string

Returns: int

See also: http://www.php.net/strcasecmp, utf8_strtolower

strings

UTF-8 aware alternative to strcspn. Finds the length of the initial segment not matching mask. Note: requires utf8_strlen and utf8_substr (if start, length are used).

Parameters:

string

Returns: int

See also: http://www.php.net/strcspn, utf8_strlen

strings

UTF-8 aware alternative to stristr. Finds the first occurrence of a string using case-insensitive comparison. Note: requires utf8_strtolower.

Parameters:

string
string

Returns: int

See also: http://www.php.net/strcasecmp, utf8_strtolower

strings

UTF-8 aware alternative to strrev. Reverses a string.

Parameters:

string	UTF-8 encoded

Returns: string with its characters reversed

See also: http://www.php.net/strrev

strings

UTF-8 aware alternative to strspn. Finds the length of the initial segment matching mask. Note: requires utf8_strlen and utf8_substr (if start, length are used).

Parameters:

string

Returns: int

See also: http://www.php.net/strspn

strings

UTF-8 aware replacement for ltrim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise ltrim will work normally on a UTF-8 string.

Author: Andreas Gohr <andi@splitbrain.org>

Returns: string

See also: http://www.php.net/ltrim, http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

strings

UTF-8 aware replacement for rtrim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise rtrim will work normally on a UTF-8 string.

Author: Andreas Gohr <andi@splitbrain.org>

Returns: string

See also: http://www.php.net/rtrim, http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

strings

UTF-8 aware replacement for trim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise trim will work normally on a UTF-8 string.

Author: Andreas Gohr <andi@splitbrain.org>

Returns: string

See also: http://www.php.net/trim, http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

strings

UTF-8 aware alternative to ucfirst. Makes a string's first character uppercase. Note: requires utf8_strtoupper.

Parameters:

string

Returns: string with first character as upper case (if applicable)

See also: http://www.php.net/ucfirst, utf8_strtoupper

strings

UTF-8 aware alternative to ucwords. Uppercases the first character of each word in a string. Note: requires utf8_substr_replace and utf8_strtoupper.

Parameters:

string

Returns: string with first char of each word uppercase

See also: http://www.php.net/ucwords

strings

Callback function for the preg_replace_callback call in utf8_ucwords. You don't need to call this yourself.

Parameters:

array	of matches corresponding to a single word

Returns: string with first char of the word in uppercase

See also: utf8_ucwords, utf8_strtoupper

strings

This is the dynamic loader for the library. It checks whether you have the mbstring extension available and includes the relevant files on that basis, falling back to the native (as in written in PHP) version if mbstring is unavailable. It is probably easiest to use this if you don't want to understand the dependencies involved, in conjunction with PHP versions etc. At the same time, you might get better performance by managing loading yourself. The smartest way to do this, bearing in mind performance, is probably to "load on demand", i.e. just before you use these functions in your code, load the version you need.

It makes sure that the following functions are available: utf8_strlen, utf8_strpos, utf8_strrpos, utf8_substr, utf8_strtolower, utf8_strtoupper. Other functions in the ./native directory depend on these six functions being available.

Tools to help with ASCII in UTF-8

Version: $Id$

ascii

Tests whether a string contains only 7bit ASCII bytes. You might use this to conditionally check whether a string needs handling as UTF-8 or not, potentially offering performance benefits by using the native PHP equivalent if it's just ASCII, e.g.:

    if ( utf8_is_ascii($someString) ) {
        // It's just ASCII - use the native PHP version
        $someString = strtolower($someString);
    } else {
        $someString = utf8_strtolower($someString);
    }

Parameters:

string

Returns: boolean TRUE if it's all ASCII

See also: utf8_is_ascii_ctrl

ascii

Tests whether a string contains only 7bit ASCII bytes with device control codes omitted. The device control codes can be found in the second table here: http://www.w3schools.com/tags/ref_ascii.asp

Parameters:

string

Returns: boolean TRUE if it's all ASCII without device control codes

See also: utf8_is_ascii

ascii

Strips out all non-7bit ASCII bytes. If you need to transmit a string to a system which you know can only support 7bit ASCII, you could use this function.

Parameters:

string

Returns: string with non ASCII bytes removed

See also: utf8_strip_non_ascii_ctrl

ascii

Strips out all non-7bit ASCII bytes and ASCII device control codes. For a list of ASCII device control codes see the second table here: http://www.w3schools.com/tags/ref_ascii.asp

Parameters:

string

Returns: string with non ASCII bytes and device control codes removed

ascii

Replaces accented UTF-8 characters with unaccented ASCII-7 "equivalents". The purpose of this function is to replace characters commonly found in Latin alphabets with something more or less equivalent from the ASCII range. This can be useful for converting a UTF-8 string to something ready for a filename, for example. Following the use of this function, you would probably also pass the string through utf8_strip_non_ascii to clean out any other non-ASCII chars.

Use the optional parameter to deaccent only lowercase ($case = -1) or only uppercase ($case = 1) letters. The default is to deaccent both cases ($case = 0).

For a more complete implementation of transliteration, see the utf8_to_ascii package available from the phputf8 project downloads: http://prdownloads.sourceforge.net/phputf8

Author: Andreas Gohr <andi@splitbrain.org>

Parameters:

string	UTF-8 string
int	(optional) -1 lowercase only, +1 uppercase only, 0 both cases

Returns: string UTF-8 with accented characters replaced by ASCII equivalents

ascii

Tools for locating / replacing bad bytes in UTF-8 strings

Version: $Id$

The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved. Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with phputf8 library by Harry Fuecks (hfuecks gmail com).

See also: http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp, http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp, http://hsivonen.iki.fi/php-utf8/, utf8_is_valid

bad

Locates the first bad byte in a UTF-8 string, returning its byte index in the string. The PCRE pattern used to locate bad bytes in a UTF-8 string comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars. http://www.w3.org/International/questions/qa-forms-utf-8

Parameters:

string

Returns: mixed integer byte index or FALSE if no bad bytes found

bad

Locates all bad bytes in a UTF-8 string and returns a list of their byte indexes in the string. The PCRE pattern used to locate bad bytes in a UTF-8 string comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars. http://www.w3.org/International/questions/qa-forms-utf-8

Parameters:

string

Returns: mixed array of integers or FALSE if no bad bytes found

bad

Strips out any bad bytes from a UTF-8 string and returns the rest. The PCRE pattern used to locate bad bytes in a UTF-8 string comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars. http://www.w3.org/International/questions/qa-forms-utf-8

Parameters:

string

Returns: string

bad

Replaces bad bytes with an alternative character. An ASCII character is recommended as the replacement char. The PCRE pattern used to locate bad bytes in a UTF-8 string comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars. http://www.w3.org/International/questions/qa-forms-utf-8

Parameters:

string	to search
string	to replace bad bytes with (defaults to '?') - use ASCII

Returns: string

bad

Return code from utf8_bad_identify() when a five octet sequence is detected. Note: 5 octet sequences are valid UTF-8 but are not supported by Unicode, so they do not represent a useful character.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify() when a six octet sequence is detected. Note: 6 octet sequences are valid UTF-8 but are not supported by Unicode, so they do not represent a useful character.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify(). Invalid octet for use as start of multi-byte UTF-8 sequence.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify(). From Unicode 3.1, non-shortest form is illegal.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify(). From Unicode 3.2, surrogate characters are illegal.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify(). Codepoints outside the Unicode range are illegal.

See also: utf8_bad_identify

bad

Return code from utf8_bad_identify(). Incomplete multi-octet sequence. Note: this is kind of a "catch-all".

See also: utf8_bad_identify

bad

Reports on the type of bad byte found in a UTF-8 string. Returns a status code for the first bad byte found.

Author: <hsivonen@iki.fi>

Parameters:

string	UTF-8 encoded string

Returns: mixed integer constant describing problem or FALSE if valid UTF-8

See also: utf8_bad_explain, http://hsivonen.iki.fi/php-utf8/

bad

Takes a return code from utf8_bad_identify() and returns a message (in English) explaining what the problem is.

Parameters:

int	return code from utf8_bad_identify

Returns: mixed string message or FALSE if return code unknown

See also: utf8_bad_identify

bad

PCRE regular expressions for UTF-8. Note: this file is not actually used by the rest of the library, but these regular expressions can be useful to have available.

Version: $Id$

See also: http://www.w3.org/International/questions/qa-forms-utf-8

patterns

PCRE pattern to check that a UTF-8 string is valid. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.

See also: http://www.w3.org/International/questions/qa-forms-utf-8

patterns

PCRE pattern to match single UTF-8 characters. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.

See also: http://www.w3.org/International/questions/qa-forms-utf-8

patterns

PCRE pattern to locate bad bytes in a UTF-8 string. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.

See also: http://www.w3.org/International/questions/qa-forms-utf-8

patterns

Locate a byte index given a UTF-8 character index

Version: $Id$

position

Given a string and a character index in the string, in terms of the UTF-8 character position, returns the byte index of that character. This can be useful when you want to use PHP's native string functions, but be warned: locating the byte can be expensive. Takes a variable number of parameters: the first must be the search string, followed by 1 to n UTF-8 character positions to obtain byte indexes for. It is more efficient to search the string for multiple characters at once than to make repeated calls to this function.

Author: Chris Smith <chris@jalakai.co.uk>

Parameters:

string	string to locate index in
int	(n times)

Returns: mixed - int if only one input int, array if more

position

Given a string and any byte index, returns the byte index of the start of the current UTF-8 character, relative to the supplied position. If the current character begins at the same place as the supplied byte index, that byte index will be returned. Otherwise this function will step backwards, looking for the index where the current UTF-8 character begins.

Author: Chris Smith <chris@jalakai.co.uk>

Parameters:

string
int	byte index in the string

Returns: int byte index of start of current UTF-8 character

position

Given a string and any byte index, returns the byte index of the start of the next UTF-8 character, relative to the supplied position. If the next character begins at the same place as the supplied byte index, that byte index will be returned.

Author: Chris Smith <chris@jalakai.co.uk>

Parameters:

string
int	byte index in the string

Returns: int byte index of start of next UTF-8 character

position

Utilities for processing "special" characters in UTF-8. "Special" largely means anything which would be regarded as a non-word character, like ASCII control characters and punctuation. This has a "Roman" bias: it would be unaware of modern Chinese "punctuation" characters, for example. Note: requires utils/unicode.php to be loaded.

Version: $Id$

See also: utf8_is_valid

utils

Used internally. Builds a PCRE pattern from the $UTF8_SPECIAL_CHARS array defined in this file. $UTF8_SPECIAL_CHARS should contain all special characters (non-letter/non-digit) defined in the various local charsets; it is not a complete list of non-alphanumeric characters in UTF-8. It is not perfect but should match most cases of special chars. This function adds the control chars 0x00 to 0x19 to the array of special chars (they are not included in $UTF8_SPECIAL_CHARS).

Returns: string

See also: utf8_from_unicode, utf8_is_word_chars, utf8_strip_specials

utils

Checks whether a string contains only word characters. This is logically equivalent to the \w PCRE meta character. Note that this is not a 100% guarantee that the string only contains alphanumeric characters, but just that common non-alphanumeric characters are not in the string, including ASCII device control characters.

Parameters:

string	to check

Returns: boolean TRUE if the string only contains word characters

See also: utf8_specials_pattern

utils

Removes special (non-alphanumeric) characters from a UTF-8 string. This can be useful as a helper for sanitizing a string for use as something like a file name or a unique identifier. Be warned, though: it does not handle all possible non-alphanumeric characters and is not intended as some kind of security / injection filter.

Author: Andreas Gohr <andi@splitbrain.org>

Parameters:

string	$string The UTF8 string to strip of special chars
string	(optional) $repl Replace special with this string

Returns: string with common non-alphanumeric characters removed

See also: utf8_specials_pattern

utils

Tools for conversion between UTF-8 and unicode

Version: $Id$

The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved. Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with phputf8 library by Harry Fuecks (hfuecks gmail com).

See also: http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp, http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp, http://hsivonen.iki.fi/php-utf8/

unicode

Takes a UTF-8 string and returns an array of ints representing the Unicode characters. Astral planes are supported, i.e. the ints in the output can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates are not allowed. Returns false if the input string isn't a valid UTF-8 octet sequence and raises a PHP error at level E_USER_WARNING. Note: this function has been modified slightly in this library to trigger errors on encountering bad bytes.

Author: <hsivonen@iki.fi>

Parameters:

string	UTF-8 encoded string

Returns: mixed array of unicode code points or FALSE if UTF-8 invalid

See also: utf8_from_unicode, http://hsivonen.iki.fi/php-utf8/

unicode

Takes an array of ints representing the Unicode characters and returns a UTF-8 string. Astral planes are supported, i.e. the ints in the input can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates are not allowed. Returns false if the input array contains ints that represent surrogates or are outside the Unicode range, and raises a PHP error at level E_USER_WARNING. Note: this function has been modified slightly in this library to use output buffering to concatenate the UTF-8 string (faster) as well as to reference the array by its keys.

Author: <hsivonen@iki.fi>

Parameters:

array	of unicode code points representing a string

Returns: mixed UTF-8 string or FALSE if array contains invalid code points

See also: utf8_to_unicode, http://hsivonen.iki.fi/php-utf8/

unicode

Tools for validating that a UTF-8 string is well formed.

Version: $Id$

The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved. Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with phputf8 library by Harry Fuecks (hfuecks gmail com).

See also: http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp, http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp, http://hsivonen.iki.fi/php-utf8/

validation

Tests whether a string is valid UTF-8 and supported by the Unicode standard. Note: this function has been modified to simply return true or false.

Author: <hsivonen@iki.fi>

Parameters:

string	UTF-8 encoded string

Returns: boolean true if valid

See also: http://hsivonen.iki.fi/php-utf8/, utf8_compliant

validation

Tests whether a string complies as UTF-8. This will be much faster than utf8_is_valid but will pass five and six octet UTF-8 sequences, which are not supported by Unicode and so cannot be displayed correctly in a browser. In other words, it is not as strict as utf8_is_valid but it is faster. If you use it to validate user input, you run the risk that attackers will be able to inject 5 and 6 byte sequences (which may or may not be a significant risk, depending on what you are doing).

Parameters:

string	UTF-8 string to check

Returns: boolean TRUE if string is valid UTF-8

See also: utf8_is_valid, http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805

validation
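
A PCRE-based compliance check of the kind described above can be sketched as follows. This is a minimal sketch, not the library's code: sketch_is_utf8 is a hypothetical name, and it relies on preg_match returning FALSE when the /u modifier is given a subject that PCRE rejects as malformed. How five and six octet sequences are treated depends on the PCRE version in use, so the sketch makes no claim about them.

```php
<?php
// Hypothetical helper for illustration; not part of the library.
function sketch_is_utf8(string $str): bool
{
    if ($str === '') {
        return true; // nothing to check in an empty string
    }
    // With the /u modifier PCRE checks the subject's encoding before
    // matching; on malformed input preg_match() returns FALSE, not 1.
    return preg_match('/^.{1}/us', $str) === 1;
}

var_dump(sketch_is_utf8("h\xC3\xA9llo")); // well-formed two-byte sequence
var_dump(sketch_is_utf8("abc\xFF"));      // 0xFF never occurs in UTF-8
```

Because the check is done by the regex engine in a single call, it avoids walking the string byte by byte in PHP code.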