mb_detect_encoding

Description

string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

mb_detect_encoding() detects the character encoding of the string str and returns the name of the detected encoding.

encoding_list is a list of candidate character encodings. The order may be given as an array or as a comma-separated string.

If encoding_list is omitted, detect_order is used.

Example 1. mb_detect_encoding() example
<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>
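
The signature lists a third argument, strict, that the description does not cover. The following is a minimal sketch of its effect, assuming a PHP build with the mbstring extension enabled; the sample bytes and variable names are chosen for illustration only.

```php
<?php
// 'áéóú' encoded as ISO-8859-1: four bytes that are neither valid ASCII
// nor valid UTF-8.
$latin1 = "\xE1\xE9\xF3\xFA";

// In strict mode no guess is made: none of the candidates fits the bytes,
// so the function reports failure.
$strictNoMatch = mb_detect_encoding($latin1, ['ASCII', 'UTF-8'], true);

// Once a candidate that fits is in the list, strict mode returns it.
$strictMatch = mb_detect_encoding($latin1, ['ASCII', 'UTF-8', 'ISO-8859-1'], true);

var_dump($strictNoMatch); // bool(false)
var_dump($strictMatch);   // string(10) "ISO-8859-1"
```

Without strict mode the function may fall back to the closest candidate, which is how invalid byte sequences can end up labelled with an encoding they do not satisfy.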

See also mb_detect_order().

User Contributed Notes

telemach
28-Jul-2005 09:48


Beware: even if you use the following detection order to distinguish between UTF-8 and ISO-8859-1 (as Chrigu suggests),

mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding().
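
One way to sidestep this, when only these two encodings are possible, is to validate the bytes as UTF-8 and treat everything else as ISO-8859-1. The sketch below assumes the strings in this note were ISO-8859-1 bytes (so the trailing 'é' is the single byte 0xE9); guess_utf8_or_latin1() is a hypothetical helper, not part of mbstring.

```php
<?php
// Hypothetical helper (not an mbstring function): choose between exactly
// two candidates by validating the bytes as UTF-8 first.
function guess_utf8_or_latin1(string $bytes): string
{
    // mb_check_encoding() validates the whole string, so a truncated
    // multibyte sequence at the end is rejected rather than guessed at.
    return mb_check_encoding($bytes, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

echo guess_utf8_or_latin1("accentu\xE9"), "\n";     // ISO-8859-1
echo guess_utf8_or_latin1("accentu\xE9e"), "\n";    // ISO-8859-1
echo guess_utf8_or_latin1("accentu\xC3\xA9"), "\n"; // UTF-8
```

The fallback is only a guess: any byte string that is not valid UTF-8 is labelled ISO-8859-1, whatever its real encoding.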

Chrigu
29-Mar-2005 11:32


If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1, because every byte sequence is valid ISO-8859-1.
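
The reason ISO-8859-1 always matches can be checked directly: the encoding assigns a character to every byte value. A small sketch, assuming mbstring is available; the variable names are illustrative.

```php
<?php
// Build a string containing every possible byte value, 0x00 through 0xFF.
$allBytes = implode('', array_map('chr', range(0, 255)));

// ISO-8859-1 maps each byte to a character, so validation cannot fail.
$validLatin1 = mb_check_encoding($allBytes, 'ISO-8859-1');

// The same bytes are not valid UTF-8 (for example, 0x80 appears without
// a lead byte), so UTF-8 is the candidate that can actually be ruled out.
$validUtf8 = mb_check_encoding($allBytes, 'UTF-8');

var_dump($validLatin1); // bool(true)
var_dump($validUtf8);   // bool(false)
```

Since only the UTF-8 check can fail, it is the one that carries information, which is why it belongs first in the list.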

php-note-2005 at ryandesign dot com
17-Feb-2005 11:57


A much simpler UTF-8 checker, using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    // preg_match() returns 1, 0 or false, so compare against 1 to get a boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string) === 1;
} // function is_utf8

?>
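
A shorter variant is possible because PCRE validates the subject string itself when the /u modifier is set. The sketch below assumes a PHP build with PCRE UTF-8 support; is_utf8_pcre() is a hypothetical name, not a built-in.

```php
<?php
// Hypothetical helper: relies on the UTF-8 check that PCRE performs on
// the subject string when the /u modifier is present.
function is_utf8_pcre(string $string): bool
{
    // The empty pattern matches any valid subject and yields 1.
    // For a subject that is not valid UTF-8, preg_match() returns false.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_pcre("caf\xC3\xA9"));  // bool(true)  valid 2-byte sequence
var_dump(is_utf8_pcre("caf\xE9"));      // bool(false) lone ISO-8859-1 byte
var_dump(is_utf8_pcre("\xC0\xAF"));     // bool(false) overlong encoding
var_dump(is_utf8_pcre("\xED\xA0\x80")); // bool(false) UTF-16 surrogate
```

The two checks are not identical: the W3C pattern also rejects ASCII control characters other than tab, line feed and carriage return, while the /u check accepts them.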

jaaks at playtech dot com
14-Jan-2005 04:27


The last example for verifying UTF-8 (maarten's valid_utf8()) has one small bug: a 10xxxxxx byte that occurs alone, i.e. not as part of a multibyte character, is accepted, although that violates the UTF-8 rules. Make the following replacement to fix it.

Replace

         } // goto next char

with

         } else {
           return false; // 10xxxxxx occurring alone
         } // goto next char

maarten
13-Jan-2005 07:55


Sometimes mb_detect_encoding() is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8, and mb_detect_encoding() reports some ISO-8859-1 encoded text as UTF-8.

To verify UTF-8, use the following:



//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }

    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }

    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }

    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }



For a drawing of the state machine, see http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
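
With the correction from the note by jaaks folded in, the same byte walk can be condensed into one function. This is only a sketch under that assumption: utf8_structure_ok() is a hypothetical name, and like the code above it checks the lead/continuation byte structure only, so overlong forms and surrogates still pass.

```php
<?php
// Hypothetical condensed variant of the byte walk, including the
// "lone continuation byte" correction. Structure check only.
function utf8_structure_ok(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $lead = ord($bytes[$i++]);
        if (($lead & 0x80) === 0x00) {
            $need = 0;          // 0xxxxxxx: single byte
        } elseif (($lead & 0xE0) === 0xC0) {
            $need = 1;          // 110xxxxx: one continuation byte follows
        } elseif (($lead & 0xF0) === 0xE0) {
            $need = 2;          // 1110xxxx: two continuation bytes follow
        } elseif (($lead & 0xF8) === 0xF0) {
            $need = 3;          // 11110xxx: three continuation bytes follow
        } else {
            return false;       // lone 10xxxxxx, or 11111xxx
        }
        // Each expected continuation byte must exist and look like 10xxxxxx.
        while ($need-- > 0) {
            if ($i >= $len || (ord($bytes[$i++]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }
    return true;
}

var_dump(utf8_structure_ok("a\xC3\xA9")); // bool(true)
var_dump(utf8_structure_ok("\x80"));      // bool(false) lone continuation byte
var_dump(utf8_structure_ok("\xC3"));      // bool(false) truncated sequence
var_dump(utf8_structure_ok("\xC0\x80"));  // bool(true)  overlong form still passes
```

The last call shows the limit of a structure-only check; rejecting overlong forms and surrogates needs the tighter byte ranges used in the W3C regular expression.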
