- Version:
- Id:
- case.php 10381 2008-06-01 03:35:53Z pasamio
strings
Define UTF8_CASE as required.
Wrapper around mb_strtolower. Assumes the mbstring internal encoding is set to UTF-8. Makes a string lowercase. Note: the concept of a character's "case" only exists in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings.
- Parameters:
-
- Returns:
- mixed either the string in lowercase or FALSE if the UTF-8 is invalid
strings
Wrapper around mb_strtoupper. Assumes the mbstring internal encoding is set to UTF-8. Makes a string uppercase. Note: the concept of a character's "case" only exists in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings.
- Parameters:
-
- Returns:
- mixed either the string in uppercase or FALSE if the UTF-8 is invalid
strings
- Version:
- Id:
- core.php 10381 2008-06-01 03:35:53Z pasamio
strings
Define UTF8_CORE as required.
Wrapper around mb_strpos. Assumes the mbstring internal encoding is set to UTF-8. Finds the position of the first occurrence of a string.
- Parameters:
-
string | haystack |
string | needle (you should validate this with utf8_is_valid) |
integer | offset in characters (from left) |
- Returns:
- mixed integer position or FALSE on failure
strings
Wrapper around mb_strrpos. Assumes the mbstring internal encoding is set to UTF-8. Finds the position of the last occurrence of a character in a string.
- Parameters:
-
string | haystack |
string | needle (you should validate this with utf8_is_valid) |
integer | (optional) offset (from left) |
- Returns:
- mixed integer position or FALSE on failure
strings
Wrapper around mb_substr. Assumes the mbstring internal encoding is set to UTF-8. Returns part of a string given a character offset (and optionally a length).
- Parameters:
-
string | |
integer | number of UTF-8 characters offset (from left) |
integer | (optional) length in UTF-8 characters from offset |
- Returns:
- mixed string or FALSE on failure
strings
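The difference between byte offsets and character offsets that these wrappers address can be illustrated with a minimal sketch. It calls the mbstring functions directly and passes 'UTF-8' explicitly rather than relying on the internal encoding, which is an assumption made for the example and not necessarily how the wrappers are written. It requires the mbstring extension.

```php
<?php
// "héllo wörld" written with explicit UTF-8 byte escapes.
$text = "h\xC3\xA9llo w\xC3\xB6rld";

// Byte-oriented functions count the 2-byte "é" as two positions.
$bytePos = strpos($text, 'w');                 // byte offset
$charPos = mb_strpos($text, 'w', 0, 'UTF-8');  // character offset

// Three bytes starting at "w" cover only "w" plus the 2-byte "ö".
$byteSlice = substr($text, $bytePos, 3);
// Three characters starting at "w" cover "wör".
$charSlice = mb_substr($text, $charPos, 3, 'UTF-8');
```

Mixing the two kinds of offset, for instance passing a value from strpos to a character-based substring call, is the typical mistake this illustrates.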
- Version:
- Id:
- strlen.php 10381 2008-06-01 03:35:53Z pasamio
strings
Define UTF8_STRLEN as required.
Wrapper around mb_strlen. Assumes mb_internal_encoding has already been set to UTF-8. Note: this function does not count bad bytes in the string; these are simply ignored.
- Parameters:
-
- Returns:
- int number of UTF-8 characters in string
strings
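Where mbstring is not available, one way to count characters rather than bytes is to subtract the continuation bytes from the byte length. This is a hedged sketch of that general technique, not the approach this library documents; `sketch_utf8_length` is a hypothetical name. Unlike the behaviour described above, it assumes well-formed input and does not skip bad bytes.

```php
<?php
// Hypothetical helper: counts UTF-8 characters in a well-formed string.
function sketch_utf8_length(string $str): int
{
    // Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF).
    // Every character has exactly one non-continuation (lead) byte.
    $continuation = preg_match_all('/[\x80-\xBF]/', $str);
    return strlen($str) - $continuation;
}
```

For "héllo" (six bytes) this yields five, because the single continuation byte of "é" is subtracted.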
Define UTF8_CASE as required.
UTF-8 aware alternative to strtolower. Makes a string lowercase. Note: the concept of a character's "case" only exists in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings. Note: requires utf8_to_unicode and utf8_from_unicode.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- Parameters:
-
- Returns:
- mixed either the string in lowercase or FALSE if the UTF-8 is invalid
- See also:
- http://www.php.net/strtolower
-
utf8_to_unicode
-
utf8_from_unicode
-
http://www.unicode.org/reports/tr21/tr21-5.html
-
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
strings
UTF-8 case lookup table. This lookup table maps lower case letters to their corresponding upper case letters in UTF-8.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
-
utf8_strtolower
strings
UTF-8 aware alternative to strtoupper. Makes a string uppercase. Note: the concept of a character's "case" only exists in some alphabets, such as Latin, Greek, Cyrillic, Armenian and archaic Georgian; it does not exist in Chinese script, for example. See Unicode Standard Annex #21: Case Mappings. Note: requires utf8_to_unicode and utf8_from_unicode.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- Parameters:
-
- Returns:
- mixed either the string in uppercase or FALSE if the UTF-8 is invalid
- See also:
- http://www.php.net/strtoupper
-
utf8_to_unicode
-
utf8_from_unicode
-
http://www.unicode.org/reports/tr21/tr21-5.html
-
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
strings
UTF-8 case lookup table. This lookup table maps upper case letters to their corresponding lower case letters in UTF-8.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- utf8_strtoupper
strings
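The idea of a table-driven case mapping, as in the lookup tables above, can be sketched with a byte-string replacement table. This is only an illustration: the three-entry table and the name `sketch_upper` are hypothetical, and the documented functions are described as working through utf8_to_unicode and utf8_from_unicode rather than through strtr.

```php
<?php
// Hypothetical miniature table; a real table would be far larger.
$sketchLowerToUpper = [
    "\xC3\xA9" => "\xC3\x89", // é => É
    "\xC3\xB6" => "\xC3\x96", // ö => Ö
    "\xCE\xB1" => "\xCE\x91", // α => Α
];

// Uppercases ASCII letters plus whatever the table covers.
function sketch_upper(string $str, array $table): string
{
    // Map ASCII explicitly so the result does not depend on the locale.
    $str = strtr($str, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
    // With an array, strtr replaces whole multi-byte keys.
    return strtr($str, $table);
}
```

Characters missing from the table pass through unchanged, which matches the note above that case simply does not exist for many scripts.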
- See also:
- http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
Define UTF8_CORE as required.
UTF-8 aware alternative to strpos. Finds the position of the first occurrence of a string. Note: this will get a lot slower if offset is used. Note: requires utf8_strlen and utf8_substr to be loaded.
- Parameters:
-
string | haystack |
string | needle (you should validate this with utf8_is_valid) |
integer | offset in characters (from left) |
- Returns:
- mixed integer position or FALSE on failure
- See also:
- http://www.php.net/strpos
-
utf8_strlen
-
utf8_substr
strings
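A character position can be derived from a byte position by counting the characters that precede the match. The sketch below shows that general idea under stated assumptions: `sketch_utf8_char_pos` is a hypothetical name, the input is assumed to be well-formed UTF-8, the needle is assumed to be non-empty, and no offset argument is supported. It is not taken from the library's source.

```php
<?php
// Hypothetical helper: character index of the first match, or false.
function sketch_utf8_char_pos(string $haystack, string $needle)
{
    // Byte-level search is safe because UTF-8 is self-synchronizing.
    $bytePos = strpos($haystack, $needle);
    if ($bytePos === false) {
        return false;
    }
    // Characters before the match = bytes minus continuation bytes.
    $prefix = substr($haystack, 0, $bytePos);
    return strlen($prefix) - preg_match_all('/[\x80-\xBF]/', $prefix);
}
```

Supporting an offset would mean converting it from characters to bytes first, which is consistent with the note above that using an offset is slower.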
UTF-8 aware alternative to strrpos. Finds the position of the last occurrence of a character in a string. Note: this will get a lot slower if offset is used. Note: requires utf8_substr and utf8_strlen to be loaded.
- Parameters:
-
string | haystack |
string | needle (you should validate this with utf8_is_valid) |
integer | (optional) offset (from left) |
- Returns:
- mixed integer position or FALSE on failure
- See also:
- http://www.php.net/strrpos
-
utf8_substr
-
utf8_strlen
strings
UTF-8 aware alternative to substr. Returns part of a string given a character offset (and optionally a length). Note: supports negative offsets and lengths, but is slower when they are used.
- Parameters:
-
string | |
integer | number of UTF-8 characters offset (from left) |
integer | (optional) length in UTF-8 characters from offset |
- Returns:
- mixed string or FALSE on failure
strings
Define UTF8_STRLEN as required.
Unicode aware replacement for strlen(). Returns the number of characters in the string (not the number of bytes) by replacing multibyte characters with a single byte equivalent. utf8_decode() converts characters that are not in ISO-8859-1 to '?', which, for the purpose of counting, is fine. It is much faster than iconv_strlen. Note: this function does not count bad UTF-8 bytes in the string; these are simply ignored.
- Author:
- chernyshevsky at hotmail dot com
- See also:
- http://www.php.net/manual/en/function.utf8-decode.php
- Parameters:
-
string | UTF-8 string |
- Returns:
- int number of UTF-8 characters in string
strings
UTF-8 aware alternative to str_ireplace. Case-insensitive version of str_replace. Note: requires utf8_strtolower. Note: it is not fast and gets slower if $search / $replace is an array. Note: it is based on the assumption that the lower and uppercase versions of a UTF-8 character have the same length in bytes, which is currently true given the hash table used by strtolower.
- Parameters:
-
string | |
- Returns:
- string
- See also:
- http://www.php.net/str_ireplace
-
utf8_strtolower
strings
UTF-8 aware alternative to str_split. Converts a string to an array. Note: requires utf8_strlen to be loaded.
- Parameters:
-
string | UTF-8 encoded |
int | number of characters to split string by |
- Returns:
- string characters in string reverses
- See also:
- http://www.php.net/str_split
-
utf8_strlen
strings
UTF-8 aware alternative to strcasecmp. A case-insensitive string comparison. Note: requires utf8_strtolower.
- Parameters:
-
string | |
string | |
- Returns:
- int
- See also:
- http://www.php.net/strcasecmp
-
utf8_strtolower
strings
UTF-8 aware alternative to strcspn. Finds the length of the initial segment not matching the mask. Note: requires utf8_strlen and utf8_substr (if start, length are used).
- Parameters:
-
string | |
- Returns:
- int
- See also:
- http://www.php.net/strcspn
-
utf8_strlen
strings
UTF-8 aware alternative to stristr. Finds the first occurrence of a string using a case-insensitive comparison. Note: requires utf8_strtolower.
- Parameters:
-
string | |
string | |
- Returns:
- int
- See also:
- http://www.php.net/strcasecmp
-
utf8_strtolower
strings
UTF-8 aware alternative to strrev. Reverses a string.
- Parameters:
-
string | UTF-8 encoded |
- Returns:
- string characters in string reversed
- See also:
- http://www.php.net/strrev
strings
UTF-8 aware alternative to strspn. Finds the length of the initial segment matching the mask. Note: requires utf8_strlen and utf8_substr (if start, length are used).
- Parameters:
-
string | |
- Returns:
- int
- See also:
- http://www.php.net/strspn
strings
UTF-8 aware replacement for ltrim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise ltrim will work normally on a UTF-8 string.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- http://www.php.net/ltrim
-
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
- Returns:
- string
strings
UTF-8 aware replacement for rtrim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise rtrim will work normally on a UTF-8 string.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- http://www.php.net/rtrim
-
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
- Returns:
- string
strings
UTF-8 aware replacement for trim(). Note: you only need to use this if you are supplying the optional charlist argument and it contains UTF-8 characters. Otherwise trim will work normally on a UTF-8 string.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- http://www.php.net/trim
-
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
- Returns:
- string
strings
UTF-8 aware alternative to ucfirst. Makes a string's first character uppercase. Note: requires utf8_strtoupper.
- Parameters:
-
string | |
- Returns:
- string with first character as upper case (if applicable)
- See also:
- http://www.php.net/ucfirst
-
utf8_strtoupper
strings
UTF-8 aware alternative to ucwords. Uppercases the first character of each word in a string. Note: requires utf8_substr_replace and utf8_strtoupper.
- Parameters:
-
string | |
- Returns:
- string with first char of each word uppercase
- See also:
- http://www.php.net/ucwords
strings
Callback function for the preg_replace_callback call in utf8_ucwords. You don't need to call this yourself.
- Parameters:
-
array | of matches corresponding to a single word |
- Returns:
- string with first char of the word in uppercase
- See also:
- utf8_ucwords
-
utf8_strtoupper
strings
This is the dynamic loader for the library. It checks whether you have the mbstring extension available and includes relevant files on that basis, falling back to the native (as in written in PHP) version if mbstring is unavailable. It is probably easiest to use this if you don't want to understand the dependencies involved, in conjunction with PHP versions etc.
At the same time, you might get better performance by managing loading yourself. The smartest way to do this, bearing in mind performance, is probably to "load on demand", i.e. just before you use these functions in your code, load the version you need.
The loader makes sure that the following functions are available: utf8_strlen, utf8_strpos, utf8_strrpos, utf8_substr, utf8_strtolower, utf8_strtoupper. Other functions in the ./native directory depend on these six functions being available.
Tools to help with ASCII in UTF-8
ascii
Tests whether a string contains only 7-bit ASCII bytes. You might use this to conditionally check whether a string needs handling as UTF-8 or not, potentially offering performance benefits by using the native PHP equivalent if it's just ASCII, e.g.:
if ( utf8_is_ascii($someString) ) {
    // It's just ASCII - use the native PHP version
    $someString = strtolower($someString);
} else {
    $someString = utf8_strtolower($someString);
}
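A 7-bit ASCII test of the kind described here can be expressed as a single byte-range match. The following is a minimal sketch under that assumption; `sketch_is_ascii` is a hypothetical name and the pattern is illustrative, not a statement about how utf8_is_ascii is implemented.

```php
<?php
// Hypothetical helper: true when every byte is in the range 0x00-0x7F.
function sketch_is_ascii(string $str): bool
{
    // Without the /u modifier PCRE works on bytes, so the negated
    // class matches any byte from 0x80 to 0xFF.
    return preg_match('/[^\x00-\x7F]/', $str) !== 1;
}
```

An empty string contains no high bytes and is therefore reported as ASCII.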
- Parameters:
-
string | |
- Returns:
- boolean TRUE if it's all ASCII
- See also:
- utf8_is_ascii_ctrl
ascii
Tests whether a string contains only 7-bit ASCII bytes with device control codes omitted. The device control codes can be found in the second table here: http://www.w3schools.com/tags/ref_ascii.asp
- Parameters:
-
string | |
- Returns:
- boolean TRUE if it's all ASCII without device control codes
- See also:
- utf8_is_ascii
ascii
Strips out all non-7-bit ASCII bytes. If you need to transmit a string to a system which you know can only support 7-bit ASCII, you could use this function.
- Parameters:
-
string | |
- Returns:
- string with non-ASCII bytes removed
- See also:
- utf8_strip_non_ascii_ctrl
ascii
Strips out all non-7-bit ASCII bytes and ASCII device control codes. For a list of ASCII device control codes see the second table here: http://www.w3schools.com/tags/ref_ascii.asp
- Parameters:
-
string | |
- Returns:
- boolean TRUE if it's all ASCII
ascii
Replaces accented UTF-8 characters with unaccented ASCII-7 "equivalents". The purpose of this function is to replace characters commonly found in Latin alphabets with something more or less equivalent from the ASCII range. This can be useful for converting a UTF-8 string to something ready for a filename, for example. Following the use of this function, you would probably also pass the string through utf8_strip_non_ascii to clean out any other non-ASCII chars. Use the optional parameter to deaccent just lower ($case = -1) or upper ($case = 1) letters. The default is to deaccent both cases ($case = 0). For a more complete implementation of transliteration, see the utf8_to_ascii package available from the phputf8 project downloads: http://prdownloads.sourceforge.net/phputf8
- Parameters:
-
string | UTF-8 string |
int | (optional) -1 lowercase only, +1 uppercase only, 0 both cases |
- Returns:
- string accented chars replaced with ASCII equivalents (UTF-8 with accented characters replaced by ASCII chars)
- Author:
- Andreas Gohr <andi@splitbrain.org>
ascii
Tools for locating / replacing bad bytes in UTF-8 strings. The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation.
Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved. Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with the phputf8 library by Harry Fuecks (hfuecks gmail com).
- See also:
- http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp
-
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp
-
http://hsivonen.iki.fi/php-utf8/
-
utf8_is_valid
bad
Locates the first bad byte in a UTF-8 string, returning its byte index in the string. Uses a PCRE pattern to locate bad bytes in a UTF-8 string, which comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
- Parameters:
-
string | |
- Returns:
- mixed integer byte index or FALSE if no bad byte found
bad
Locates all bad bytes in a UTF-8 string and returns a list of their byte indexes in the string. Uses a PCRE pattern to locate bad bytes in a UTF-8 string, which comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
- Parameters:
-
string | |
- Returns:
- mixed array of integers or FALSE if no bad byte found
bad
Strips out any bad bytes from a UTF-8 string and returns the rest. Uses a PCRE pattern to locate bad bytes in a UTF-8 string, which comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
- Parameters:
-
string | |
- Returns:
- string
bad
Replaces bad bytes with an alternative character; an ASCII character is recommended as the replacement char. Uses a PCRE pattern to locate bad bytes in a UTF-8 string, which comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
- Parameters:
-
string | to search |
string | to replace bad bytes with (defaults to '?') - use ASCII |
- Returns:
- string
bad
Return code from utf8_bad_identify() when a five octet sequence is detected.
Note: 5-octet sequences were permitted by the original UTF-8 definition but are not supported by Unicode, so they do not represent a useful character.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify() when a six octet sequence is detected. Note: 6-octet sequences were permitted by the original UTF-8 definition but are not supported by Unicode, so they do not represent a useful character.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify(). Invalid octet for use as the start of a multi-byte UTF-8 sequence.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify(). From Unicode 3.1, non-shortest form is illegal.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify(). From Unicode 3.2, surrogate characters are illegal.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify(). Codepoints outside the Unicode range are illegal.
- See also:
- utf8_bad_identify
bad
Return code from utf8_bad_identify(). Incomplete multi-octet sequence. Note: this is kind of a "catch-all".
- See also:
- utf8_bad_identify
bad
Reports on the type of bad byte found in a UTF-8 string. Returns a status code for the first bad byte found.
- Author:
- <hsivonen@iki.fi>
- Parameters:
-
string | UTF-8 encoded string |
- Returns:
- mixed integer constant describing the problem or FALSE if valid UTF-8
- See also:
- utf8_bad_explain
-
http://hsivonen.iki.fi/php-utf8/
bad
Takes a return code from utf8_bad_identify() and returns a message (in English) explaining what the problem is.
- Parameters:
-
int | return code from utf8_bad_identify |
- Returns:
- mixed string message or FALSE if the return code is unknown
- See also:
- utf8_bad_identify
bad
PCRE regular expressions for UTF-8. Note that this file is not actually used by the rest of the library, but these regular expressions can be useful to have available.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
patterns
PCRE pattern to check that a UTF-8 string is valid. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
patterns
PCRE pattern to match single UTF-8 characters. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
patterns
PCRE pattern to locate bad bytes in a UTF-8 string. Comes from the W3 FAQ: Multilingual Forms. Note: modified to include the full ASCII range including control chars.
- See also:
- http://www.w3.org/International/questions/qa-forms-utf-8
patterns
Utilities for processing "special" characters in UTF-8. "Special" largely means anything which would be regarded as a non-word character, like ASCII control characters and punctuation. This has a "Roman" bias: it would be unaware of modern Chinese "punctuation" characters, for example. Note: requires utils/unicode.php to be loaded.
- See also:
- utf8_is_valid
utils
Used internally. Builds a PCRE pattern from the $UTF8_SPECIAL_CHARS array defined in this file. This function adds the control chars 0x00 to 0x19 to the array of special chars (they are not included in $UTF8_SPECIAL_CHARS).
- Returns:
- string
- See also:
- utf8_from_unicode
-
utf8_is_word_chars
-
utf8_strip_specials
utils
Checks whether a string contains only word characters. This is logically equivalent to the \w PCRE meta character. Note that this is not a 100% guarantee that the string only contains alphanumeric characters, just that common non-alphanumeric characters are not in the string, including ASCII device control characters.
- Parameters:
-
string | to check |
- Returns:
- boolean TRUE if the string only contains word characters
- See also:
- utf8_specials_pattern
utils
Removes special characters (non-alphanumeric) from a UTF-8 string. This can be useful as a helper for sanitizing a string for use as something like a file name or a unique identifier.
Be warned, though, that it does not handle all possible non-alphanumeric characters and is not intended as some kind of security / injection filter.
- Author:
- Andreas Gohr <andi@splitbrain.org>
- Parameters:
-
string | $string The UTF-8 string to strip of special chars |
string | (optional) $repl Replace special chars with this string |
- Returns:
- string with common non-alphanumeric characters removed
- See also:
- utf8_specials_pattern
utils
UTF-8 array of common special characters. This array should contain all special characters (not a letter or digit) defined in the various local charsets; it is not a complete list of non-alphanumeric characters in UTF-8. It is not perfect but should match most cases of special chars. The control chars 0x00 to 0x19 are _not_ included in this array. The space 0x20 is! These chars are _not_ in the array either: _ (0x5f), : (0x3a), . (0x2e), - (0x2d)
- Author:
- Andreas Gohr <andi@splitbrain.org>
- See also:
- utf8_specials_pattern
utils
Tools for conversion between UTF-8 and Unicode. The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved. Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with the phputf8 library by Harry Fuecks (hfuecks gmail com).
- See also:
- http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp
-
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp
-
http://hsivonen.iki.fi/php-utf8/
unicode
Takes a UTF-8 string and returns an array of ints representing the Unicode characters. Astral planes are supported, i.e. the ints in the output can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates are not allowed.
Returns false if the input string isn't a valid UTF-8 octet sequence and raises a PHP error at level E_USER_WARNING. Note: this function has been modified slightly in this library to trigger errors on encountering bad bytes.
- Author:
- <hsivonen@iki.fi>
- Parameters:
-
string | UTF-8 encoded string |
- Returns:
- mixed array of Unicode code points or FALSE if the UTF-8 is invalid
- See also:
- utf8_from_unicode
-
http://hsivonen.iki.fi/php-utf8/
unicode
Takes an array of ints representing the Unicode characters and returns a UTF-8 string. Astral planes are supported, i.e. the ints in the input can be > 0xFFFF. Occurrences of the BOM are ignored. Surrogates are not allowed. Returns false if the input array contains ints that represent surrogates or are outside the Unicode range, and raises a PHP error at level E_USER_WARNING. Note: this function has been modified slightly in this library to use output buffering to concatenate the UTF-8 string (faster) as well as to reference the array by its keys.
- Parameters:
-
array | of Unicode code points representing a string |
- Returns:
- mixed UTF-8 string or FALSE if the array contains invalid code points
- Author:
- <hsivonen@iki.fi>
- See also:
- utf8_to_unicode
-
http://hsivonen.iki.fi/php-utf8/
unicode
Tools for validating that a UTF-8 string is well formed. The Original Code is Mozilla Communicator client code. The Initial Developer of the Original Code is Netscape Communications Corporation. Portions created by the Initial Developer are Copyright (C) 1998 the Initial Developer. All Rights Reserved.
Ported to PHP by Henri Sivonen (http://hsivonen.iki.fi). Slight modifications to fit with the phputf8 library by Harry Fuecks (hfuecks gmail com).
- See also:
- http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp
-
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUnicodeToUTF8.cpp
-
http://hsivonen.iki.fi/php-utf8/
validation
Tests whether a string is valid UTF-8 and supported by the Unicode standard. Note: this function has been modified to simply return true or false.
- Author:
- <hsivonen@iki.fi>
- Parameters:
-
string | UTF-8 encoded string |
- Returns:
- boolean true if valid
- See also:
- http://hsivonen.iki.fi/php-utf8/
-
utf8_compliant
validation
Tests whether a string complies as UTF-8. This will be much faster than utf8_is_valid but will pass five and six octet UTF-8 sequences, which are not supported by Unicode and so cannot be displayed correctly in a browser. In other words, it is not as strict as utf8_is_valid but it is faster. If you use it to validate user input, you run the risk that attackers will be able to inject 5 and 6 byte sequences (which may or may not be a significant risk, depending on what you are doing).
- See also:
- utf8_is_valid
-
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805
- Parameters:
-
string | UTF-8 string to check |
- Returns:
- boolean TRUE if string is valid UTF-8
validation
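A fast, PCRE-based compliance check in the spirit of the last entry can be sketched by letting the /u modifier validate the subject. The name `sketch_is_wellformed_utf8` is hypothetical and the pattern is an assumption, not the library's own. Whether five and six octet sequences pass depends on the PCRE version in use, so the result may be stricter than the behaviour described above.

```php
<?php
// Hypothetical helper: true when PCRE accepts the string as UTF-8.
function sketch_is_wellformed_utf8(string $str): bool
{
    // With /u, preg_match returns false (not 0) if the subject is
    // not valid UTF-8; the empty pattern matches any valid subject.
    return preg_match('//u', $str) === 1;
}
```

Checking the result with === 1 matters here, because a loose comparison would not distinguish "no match" from "invalid subject".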