phpxmlrpc icon indicating copy to clipboard operation
phpxmlrpc copied to clipboard

Please support more internal encodings than those of the xml parser.

Open cmanley opened this issue 8 years ago • 1 comments

In particular CP1252 :)

If the set internal encoding isn't one supported by the xml parser, then the XML parser can be told output UTF-8 (which is already the case), then afterwards, the output can be transcoding the internal encoding.

For starters, I rewrote Helper\Charset::encodeEntities() to be less limited in terms of encodings (and less code):

public function encodeEntities($data, $srcEncoding = '', $destEncoding = '')
    {
        # Get/guess internal encoding
        if ($srcEncoding == '') {
            // lame, but we know no better...
            $srcEncoding = PhpXmlRpc::$xmlrpc_internalencoding;
        }
        $srcEncoding  = strtoupper($srcEncoding);
        if ($srcEncoding == 'US_ASCII') {
            $srcEncoding = 'ASCII'; # See http://www.php.net/manual/en/function.mb-list-encodings.php
        }
        $destEncoding = strtoupper($destEncoding);
        if ($destEncoding == 'US_ASCII') {
            $destEncoding = 'ASCII'; # See http://www.php.net/manual/en/function.mb-list-encodings.php
        }

        # Convert $data into destination encoding if necessary.
        if ($srcEncoding != $destEncoding) {
            if (function_exists('mb_convert_encoding')) {   # TODO store this as protected boolean or perhaps just force the requirement ext-mbstring that everyone should have by now.
                # TODO store and use a protected regex of valid mb_list_encodings()
                $data = mb_convert_encoding($data, $destEncoding, $srcEncoding);    # See http://www.php.net/manual/en/function.mb-list-encodings.php
            }           
            elseif ((($srcEncoding == 'ISO-8859-1') || ($srcEncoding == 'ASCII')) && ($destEncoding == 'UTF-8')) {
                $data = utf8_encode($data);
            }
            elseif (($destEncoding == 'ISO-8859-1') && ($srcEncoding == 'UTF-8')) {
                $data = utf8_decode($data);
            }
            elseif (function_exists('iconv')) {
                $data = iconv($srcEncoding, $destEncoding . '//TRANSLIT', $data);   # iconv() is buggy, that's why I prefer mb_convert_encoding()
            }
            else {
                error_log('XML-RPC: ' . __METHOD__ . ": Converting from $srcEncoding to $destEncoding: not supported...");
            }
        }

        # Escape XML entities
        $escapedData = htmlspecialchars(
            $data,
            ENT_QUOTES | ENT_XHTML,
            $destEncoding   # probably not necessary as the escaped characters share the same code points in all supported encodings
        );
        return $escapedData;
    }

Then I think in Request::parseResponse(), either all strings should be transcoded into the internal encoding before being passed into the Response constructor, or the Response constructor should receive the actual encoding of the strings as well as the desired internal encoding and do the transcoding as needed.

Other ideas I had reading the source code: Perhaps replace all checks for the mbstring extension with function_exists('...'). This allows the use of identically named substitution functions. Perhaps create a Helper\Transcoder class that performs the transcoding and supports any of mbstring, iconv, checks for supported encodings, etc.

cmanley avatar Oct 01 '16 22:10 cmanley

Thanks for the suggestions. Checking for function names instead of presence of a specific extension and adding a transcoder class seem like good ideas. The only downside is that, if the lib has to provide support for different transcoder backends, testing will become even messier than it currently is (and tbh charset problems make the bulk of issues reported by users since many years now, the codebase being quite stable). I will look into this when I have some spare time

gggeek avatar Nov 20 '16 00:11 gggeek

Long time no see :-D

I have come around to implementing what you had requested a while ago.

It's in the master branch as of today, and will be in the next release, expected to be 4.10.0. All that is needed is adding some tests and docs...

I did not use the code you suggested verbatim, as I preferred to keep the existing manual conversion logic, which is able to encode characters outside the target encoding using their entity representation. The resulting code is quite messy, but it should cater to as many cases as possible.

gggeek avatar Jan 16 '23 15:01 gggeek

Tests, demo code and docs added

gggeek avatar Jan 17 '23 18:01 gggeek