advanced_html_dom
Problem with local codepages
The library doesn't seem to work with codepages other than utf8.
For instance, on a Greek Blogger page (such as "https://ippokrateio-miterakaipaidi.blogspot.gr/2017/09/blog-post.html"),
$text = $html->find($text_container); echo $text;
returns
Î Î?Î?Î?Î? ΡΥÎ?Î?ΤÎ?Î?Î?Î?Σ ΤÎ?Î¥ <br>Î?Î?ΣÎ?Î?Î?Î?Î?Î?Î?Î¥
Î?Î?ΤÎ?ΡÎ?Σ Î?Î?Î? Î Î?Î?Î?Î?Î?Î¥
Can this be your terminal? It looks ok when I run it with php 7.0.9.
On Sat, Apr 7, 2018 at 7:54 PM, LuxCave wrote the report above. View it on GitHub: https://github.com/monkeysuffrage/advanced_html_dom/issues/18
I tried it:
- on my server (Ubuntu, nginx, PHP 7.1 FPM, MySQL, 16 GB, i7 3440),
- on my local Windows server on my PC,
- and in an "Instant WordPress" container,
all with similar specs as above. All of them run Simple HTML DOM with no problems but produce this garbage output (characters from the wrong codepage) with Advanced HTML DOM.
I am now trying to set up a Docker container with an older PHP version to check whether the PHP version has any effect on this.
How are you looking at the output? In a terminal?
Show me the output of: php myfile.php | file -
It is
:~$ php myfile.php | file -
/dev/stdin: ASCII text
also
:~$ printf "\ufeff...\n" | file -
/dev/stdin: UTF-8 Unicode (with BOM) text
I have the same problem on Windows, though.
Hmm, I think maybe I wasn't clear that you need to replace myfile.php with the filename of your script.
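Where the `file` utility is not available (on Windows, for instance), a rough version of the same check can be done inside PHP. The following is only a sketch: `describe_encoding` is a hypothetical helper, not part of advanced_html_dom, and it assumes the mbstring extension is loaded. It distinguishes only pure ASCII, valid UTF-8, and everything else.

```php
<?php
// Requires the mbstring extension.

// Hypothetical helper: classify a byte string roughly the way `file -` does.
function describe_encoding(string $bytes): string
{
    // Only bytes 0x00-0x7F present: plain ASCII.
    if (preg_match('/^[\x00-\x7F]*$/', $bytes) === 1) {
        return 'ASCII';
    }
    // High bytes present and they form well-formed UTF-8 sequences.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'not valid UTF-8';
}

$ascii   = describe_encoding("plain text");
$utf8    = describe_encoding("\xCE\xA0"); // Greek capital pi in UTF-8
$invalid = describe_encoding("\xD0");     // a lead byte with nothing after it

echo $ascii, "\n";   // ASCII
echo $utf8, "\n";    // UTF-8
echo $invalid, "\n"; // not valid UTF-8
```

To check a whole script's output, the output could be captured with `ob_start()` and `ob_get_clean()` and passed to the helper.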
I don't know how to do this on Windows, but I stripped my script down to the following (it scrapes a random Greek blog page):
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<meta charset="utf-8">
<?php include_once( 'advanced_html_dom.php');
$url = "http://noetic-on.blogspot.gr/2016/01/blog-post_29.html";
$text_container = ".entry-content";
$html = file_get_html($url);
$text_content = $html->find($text_container);
echo $text_content;
?>
and I am still getting garbage like this:
ΣÏην οικονομία, αγαÏηÏά μοÏ
Ïαιδιά, 1 + 1 ÎÎΠκάνει 2.
I have an unused webdomain on my server (with the specs above), and I put the script there, you can see it working:
https://nett.gr/
there I did what you ask:
.../nett/web$ php index.php | file -
and got this:
/dev/stdin: UTF-8 Unicode text, with very long lines, with CRLF, LF, NEL line terminators
Update: The same garbage is returned even if I echo $html.
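The garbage above has the shape of UTF-8 bytes that were read as a single-byte encoding and then encoded as UTF-8 a second time. That diagnosis is an assumption based on the look of the output, and ISO-8859-1 is used below only as a plausible stand-in for the single-byte encoding. The sketch also shows why `file -` can still report "UTF-8 Unicode text" for such output: double-encoded text is well-formed UTF-8, just not the intended characters. It assumes the mbstring extension is loaded.

```php
<?php
// Requires the mbstring extension.

// Greek capital pi is the two bytes 0xCE 0xA0 in UTF-8.
$original = "\xCE\xA0";

// Treat those two bytes as ISO-8859-1 characters and encode them as UTF-8
// again: each byte becomes its own two-byte character.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The garbled string is still well-formed UTF-8.
$still_valid = mb_check_encoding($garbled, 'UTF-8');

// Reversing the extra step recovers the original bytes.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // cea0
echo bin2hex($garbled), "\n";  // c38ec2a0
var_dump($still_valid);        // bool(true)
echo bin2hex($repaired), "\n"; // cea0
```

Reversing the conversion after the fact is shown only to illustrate the mechanism. Telling the parser the right charset before parsing is the cleaner route.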
Your HTML isn't valid, for one thing. Make sure to have a doctype and a meta charset inside of <head>.
On Mon, Apr 9, 2018 at 5:18 AM, LuxCave wrote:
Another update: The problem does not appear on other Greek pages, such as:
https://news.makedonias.gr/2018/04/3410336/
There I get the correct Greek letters.
Ok, but this doesn't change the problem. Here's the new code, present also in https://nett.gr
<!DOCTYPE html>
<html lang="el">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Hello</title>
</head>
<body>
<?php include_once( 'advanced_html_dom.php');
$url = "http://noetic-on.blogspot.gr/2016/01/blog-post_29.html";
$text_container = ".entry-content";
$html = file_get_html($url);
$text_content = $html->find($text_container);
echo $text_content;
?>
</body></html>
That still doesn't look right to me, but I don't think the html is the problem. What happens when you do:
echo file_get_contents($url);
file_get_contents($url) returns the webpage correctly, as you can see at https://nett.gr. The problem appears with file_get_html($url).
It's the charset not getting set properly that's causing this. Try it like this:
$html = str_get_html('' . file_get_contents($url));
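The empty string in the line above looks like a place where markup was stripped when the message was rendered. What it originally contained cannot be recovered from this thread. One common way to give an HTML parser a charset hint is to prepend a charset declaration to the markup before parsing, and the sketch below illustrates that idea under stated assumptions. It uses PHP's built-in DOMDocument rather than advanced_html_dom, `$page` is a made-up stand-in for the fetched page, and `first_paragraph_text` is a hypothetical helper. How the text comes out without the hint depends on the libxml2 version, so the sketch prints it without claiming a result.

```php
<?php
// Requires the dom extension (DOMDocument).

// Made-up stand-in for a fetched page: UTF-8 bytes, no charset declared.
// "\xCE\xA0" is Greek capital pi in UTF-8.
$page = "<html><body><p class=\"entry-content\">\xCE\xA0</p></body></html>";

// Hypothetical helper: parse the markup and return the first <p>'s text.
function first_paragraph_text(string $markup): string
{
    $dom = new DOMDocument();
    // Silence warnings about sloppy real-world markup.
    @$dom->loadHTML($markup);
    return $dom->getElementsByTagName('p')->item(0)->textContent;
}

// Assumption: a Content-Type meta declaration is used as the charset hint.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$without_hint = first_paragraph_text($page);
$with_hint    = first_paragraph_text($hint . $page);

echo bin2hex($without_hint), "\n"; // depends on the libxml2 version
echo bin2hex($with_hint), "\n";    // cea0
```

With the hint in place the parser has an explicit charset to work with, so the Greek letter survives as the same two bytes.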