Problem with local codepages

Open LuxCave opened this issue 6 years ago • 12 comments

The library doesn't seem to work with codepages other than UTF-8.

For instance, on a Greek Blogger page (such as "https://ippokrateio-miterakaipaidi.blogspot.gr/2017/09/blog-post.html"),

$text = $html->find($text_container); echo $text;

returns

Î Î?Î?Î?Î? ΡΥÎ?Î?ΤÎ?Î?Î?Î?Σ ΤÎ?Î¥ <br>Î?Î?ΣÎ?Î?Î?Î?Î?Î?Î?Î¥
Î?Î?ΤÎ?ΡÎ?Σ Î?Î?Î? Î Î?Î?Î?Î?Î?Î¥

LuxCave avatar Apr 07 '18 11:04 LuxCave
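
The `Î`-prefixed pairs in the output above are consistent with UTF-8 bytes being read as a single-byte Latin encoding: Greek capital letters encode in UTF-8 as two bytes starting with 0xCE, and 0xCE is `Î` in ISO-8859-1. The following is a minimal sketch of that effect, not a diagnosis confirmed by the thread. The helper name `misread_as_latin1` is made up for illustration and is not part of advanced_html_dom, and ISO-8859-1 as the wrongly assumed encoding is an assumption.

```php
<?php
// Hypothetical helper: treat every byte of a UTF-8 string as one
// ISO-8859-1 character and re-encode the result as UTF-8.
// This reproduces the classic "double encoding" mojibake.
function misread_as_latin1(string $utf8): string
{
    $out = '';
    foreach (str_split($utf8) as $byte) {
        $code = ord($byte);
        // ASCII bytes pass through; bytes >= 0x80 become a two-byte
        // UTF-8 sequence for the code point U+0080..U+00FF.
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

// Sample word taken from the page title in the report above.
echo misread_as_latin1('ΠΑΙΔΙΟΥ'), "\n";
```

For `Π` (bytes CE A0) this yields `Î` followed by a non-breaking space, and for `Α` (bytes CE 91) it yields `Î` followed by an unprintable control character, which many viewers show as `?`.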

Could this be your terminal? It looks OK when I run it with PHP 7.0.9.

monkeysuffrage avatar Apr 07 '18 13:04 monkeysuffrage

I tried it:

  1. on my server (Ubuntu, nginx, PHP 7.1 FPM, MySQL, 16 GB, i7 3440),
  2. on my local Windows server on my PC,
  3. and in an "Instant WordPress" container,

all with similar specs as above. All of them run Simple HTML DOM with no problems but produce this garbage output (wrong-codepage characters) with Advanced HTML DOM.

I am now trying to set up a Docker container with an older PHP version to check whether the PHP version has any effect on this.

LuxCave avatar Apr 07 '18 17:04 LuxCave

How are you looking at the output? In a terminal?

monkeysuffrage avatar Apr 07 '18 22:04 monkeysuffrage

Show me the output of: php myfile.php | file -

monkeysuffrage avatar Apr 07 '18 22:04 monkeysuffrage

It is

:~$ php myfile.php | file -
/dev/stdin: ASCII text

also

:~$ printf "\ufeff...\n" | file -
/dev/stdin: UTF-8 Unicode (with BOM) text

I have the same problem on Windows, though.

LuxCave avatar Apr 08 '18 07:04 LuxCave

Hmm, I think maybe I wasn't clear that you need to replace myfile.php with the filename of your script.

monkeysuffrage avatar Apr 08 '18 08:04 monkeysuffrage

I don't know how to do this on Windows, but I stripped my script down to the following (it scrapes a random Greek blog page):

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<meta charset="utf-8"> 

<?php include_once( 'advanced_html_dom.php');

$url = "http://noetic-on.blogspot.gr/2016/01/blog-post_29.html";
$text_container = ".entry-content";
$html = file_get_html($url);
$text_content = $html->find($text_container);
echo $text_content;	
?>

and I am still getting garbage like this: Στην οικονομία, αγαπητά μου παιδιά, 1 + 1 ΔΕΝ κάνει 2.

I have an unused web domain on my server (with the specs above), and I put the script there so you can see it running:

https://nett.gr/

There I did what you asked:

.../nett/web$ php index.php | file -

and got this:

/dev/stdin: UTF-8 Unicode text, with very long lines, with CRLF, LF, NEL line terminators

Update: The same garbage is returned even if I echo $html.

LuxCave avatar Apr 08 '18 21:04 LuxCave
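
The `file -` check used above has no direct equivalent on a stock Windows install. A rough approximation inside PHP is sketched below, assuming only the bundled PCRE extension; the function name `looks_like_utf8` is hypothetical. Note that this only tests whether the bytes form valid UTF-8. Text that was mis-decoded and then re-encoded is still valid UTF-8, which would explain why `file` reports "UTF-8 Unicode text" even though the page shows garbage.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
function looks_like_utf8(string $bytes): bool
{
    // With the /u modifier, PCRE refuses subjects that are not valid
    // UTF-8, so preg_match returns false instead of 1.
    return preg_match('//u', $bytes) === 1;
}

var_dump(looks_like_utf8("\xCE\xA0")); // complete two-byte sequence: bool(true)
var_dump(looks_like_utf8("\xCE"));     // truncated sequence: bool(false)
```

To apply it to a script's output, one option is to redirect the output to a file (for example `php index.php > out.txt`) and pass `file_get_contents('out.txt')` to the helper.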

Your HTML isn't valid, for one thing. Make sure to have a doctype and a meta charset inside of <head>.

On Mon, Apr 9, 2018 at 5:18 AM, LuxCave wrote:

Another update: The problem does not appear on other Greek pages, such as:

https://news.makedonias.gr/2018/04/3410336/

There I get the correct Greek letters.

monkeysuffrage avatar Apr 08 '18 23:04 monkeysuffrage

OK, but this doesn't fix the problem. Here's the new code, which is also live at https://nett.gr:

<!DOCTYPE html>
<html lang="el">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Hello</title>
</head>
<body>
<?php include_once( 'advanced_html_dom.php');
$url = "http://noetic-on.blogspot.gr/2016/01/blog-post_29.html";
$text_container = ".entry-content";
$html = file_get_html($url); 
$text_content = $html->find($text_container);
echo $text_content;	
?>
</body></html>

LuxCave avatar Apr 09 '18 07:04 LuxCave

That still doesn't look right to me, but I don't think the HTML is the problem. What happens when you do:

echo file_get_contents($url);

monkeysuffrage avatar Apr 09 '18 07:04 monkeysuffrage

file_get_contents($url) returns the webpage correctly, as you can see at https://nett.gr. The problem appears with file_get_html($url).

LuxCave avatar Apr 09 '18 08:04 LuxCave

It's the charset not getting set properly that's causing this. Try it like this:

$html = str_get_html('' . file_get_contents($url));

monkeysuffrage avatar Apr 09 '18 09:04 monkeysuffrage
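
The string literal in the last snippet appears empty in this copy of the thread, so what was originally prepended is not known. One common way to make the charset explicit before parsing, assumed here and not confirmed by the thread, is to prepend a charset declaration to the markup. The sketch below uses PHP's built-in `DOMDocument` directly instead of advanced_html_dom, so it shows the general effect only; the helper `first_div_text` and the sample fragment are made up for illustration.

```php
<?php
// Assumption: a tiny UTF-8 fragment standing in for the downloaded page.
$fragment = '<div class="entry-content">Στην οικονομία</div>';

// Hypothetical helper: parse HTML with DOMDocument and return the
// text of the first <div>.
function first_div_text(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc->getElementsByTagName('div')->item(0)->textContent;
}

// Parsed as-is: no charset is declared, so the parser has to guess.
$without = first_div_text($fragment);

// Parsed with an explicit charset declaration prepended.
$declaration = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$with = first_div_text($declaration . $fragment);

var_dump($with === 'Στην οικονομία');
var_dump($without === 'Στην οικονομία');
```

On typical libxml2 builds the first comparison should hold and the second should not, because the undeclared input is decoded with a single-byte fallback encoding. Applied to the thread's snippet, the same idea would mean passing the declaration plus `file_get_contents($url)` to `str_get_html`.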