scrapely ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

Hi, I am having the following problem. I'm not sure if I am following the right steps. This is the repro. Regards,

--------------------------------------
root
--------------------------------------
root@tex:/home/scraper# python --version
Python 3.4.3+
root@tex:/home/scraper# virtualenv venv_scrapely
Using base prefix '/usr'
New python executable in /home/scraper/venv_scrapely/bin/python3
Also creating executable in /home/scraper/venv_scrapely/bin/python
Installing setuptools, pip, wheel...done.
root@tex:/home/scraper# ls -lrt
total 4
drwxr-xr-x 5 root root 4096 Feb  6 18:23 venv_scrapely
root@tex:/home/scraper# source ./venv_scrapely/bin/activate
(venv_scrapely) root@tex:/home/scraper# pip install scrapely
Collecting scrapely
Collecting w3lib (from scrapely)
  Using cached w3lib-1.16.0-py2.py3-none-any.whl
Collecting numpy (from scrapely)
  Using cached numpy-1.12.0-cp34-cp34m-manylinux1_i686.whl
Requirement already satisfied: six in ./venv_scrapely/lib/python3.4/site-packages (from scrapely)
Installing collected packages: w3lib, numpy, scrapely
Successfully installed numpy-1.12.0 scrapely-0.13.3 w3lib-1.16.0
(venv_scrapely) root@tex:/home/scraper#
(venv_scrapely) root@tex:/home/scraper# pip list
(1.4.0)
numpy (1.12.0)
packaging (16.8)
pip (9.0.1)
pyparsing (2.1.10)
scrapely (0.13.3)
setuptools (34.1.1)
six (1.10.0)
w3lib (1.16.0)
wheel (0.29.0)
------------------------
with user scraper
------------------------
scraper@tex:$ source ./venv_scrapely/bin/activate
(venv_scrapely) scraper@tex:~$ python --version
Python 3.4.3+
(venv_scrapely) scraper@tex:~$ python
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> url1='https://github.com/ripple/rippled'
>>> data={'name':'ripple/rippled','commits':'11,292','releases':'66','contributors':'56'}
>>> s.train(url1,data)
>>> url2='https://github.com/scrapy/scrapely/'
>>> s.scrape(url2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 53, in scrape
    return self.scrape_page(page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/__init__.py", line 59, in scrape_page
    return self._ex.extract(page)[0]
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/__init__.py", line 119, in extract
    extracted = extraction_tree.extract(extraction_page)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 575, in extract
    items.extend(extractor.extract(page, start_index, end_index, self.template.ignored_regions))
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 351, in extract
    _, _, attributes = self._doextract(page, extractors, start_index, end_index, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/regionextract.py", line 396, in _doextract
    labelled, start_index, end_index_exclusive, self.best_match, **kwargs)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 148, in similar_region
    data_length - range_end, data_length - range_start)
  File "/home/scraper/venv_scrapely/lib/python3.4/site-packages/scrapely/extraction/similarity.py", line 85, in longest_unique_subsequence
    matches = naive_match_length(to_search, subsequence, range_start, range_end)
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

Feb 06 '17 17:02 aceri

I got the same error running the example code:

from scrapely import Scraper

s = Scraper()
url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(url1, data)
url2 = 'http://pypi.python.org/pypi/Django/1.3'
s.scrape(url2)

Gives me the same error.

Mar 05 '17 01:03 aschi2

@aceri, @aschi2 I'm unable to replicate the issue. I guess both of you are using 32-bit systems and that is causing the problem. If you can confirm you are on 32-bit systems, I can add a fallback that just uses the Python implementation there.
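
A quick way to check is sketched below. It uses only the standard library, and the helper name interpreter_bits is made up for illustration. Note that it reports the width of the Python build, which can be 32-bit even on a 64-bit operating system.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer width (32 or 64) of the running Python build."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


print("interpreter bits:", interpreter_bits())
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
# The machine type describes the host, not necessarily the Python build.
print("machine:", platform.machine())
```

Posting these three lines of output together with the numpy wheel name from the pip install log should be enough to tell the cases apart.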

Mar 06 '17 09:03 ruairif

I am using a 64-bit system and 64-bit Python 2.7.

Mar 06 '17 16:03 aschi2

I get the exact same error, 64 bit system.

Mar 18 '17 23:03 pavelmalai

I can't replicate the issue either. @ruairif, I have some doubts about the six library.

This is its code for finding the maxsize:

class X(object):

    def __len__(self):
        return 1 << 31

try:
    len(X())
except OverflowError:
    # 32-bit
    MAXSIZE = int((1 << 31) - 1)
else:
    # 64-bit
    MAXSIZE = int((1 << 63) - 1)
del X

In my opinion, the return value of __len__ should be 1 << 63.

If this is valid could this be a source of the problem?
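
One way to test that premise is to probe what len() accepts on a given build. This is a minimal sketch, not code from six; len_accepts and Probe are names made up for illustration. It relies on CPython raising OverflowError when __len__ returns a value larger than sys.maxsize, so 1 << 63 should be rejected on 32-bit and 64-bit builds alike, while 1 << 31 is rejected only on 32-bit builds.

```python
import sys


def len_accepts(value):
    """Return True if len() accepts `value` as the result of __len__."""

    class Probe(object):
        def __len__(self):
            return value

    try:
        len(Probe())
    except OverflowError:
        # The value does not fit in the interpreter's index type.
        return False
    return True


print("sys.maxsize:", sys.maxsize)
print("1 << 31 accepted:", len_accepts(1 << 31))
print("1 << 63 accepted:", len_accepts(1 << 63))
```

If 1 << 63 is rejected everywhere, a probe using that value could not distinguish the two cases, which would suggest the existing 1 << 31 check is not the source of the problem.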

Mar 27 '17 05:03 hackrush01

I am also facing the same problem on Python 3.5 64bit Windows!

Apr 03 '17 11:04 bhavsarpratik

I have the same issue on Python 2.7.11 MSC v.1500 64 bit (AMD64) on win32 under a virtual environment. No answers yet?

Nov 30 '17 23:11 andreylisovskiy

I have the same problem with 32-bit Python 3.6.3 on Windows 10 Enterprise x64.

Dec 16 '17 08:12 Navid61

I got the same problem on 64-bit Python 2.7.13, both system-wide and under a virtual environment, on Windows 10 Home.

Jan 08 '18 03:01 hiadore

The same (similar?) bug here. Python 2.7.14 in a venv, macOS High Sierra.

ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'double'

@ruairif It may be hard to reproduce because the bug is pretty rare. It shows up in only about 2% of my tests: 5 times in 203 trials.

Jan 22 '18 12:01 indywidualny

I am getting this error consistently, regardless of input data. Even the small example on the front page of scrapely's GitHub repository, which illustrates how to scrape PyPI, fails with this error.

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] Windows 7, 64-bit.

numpy (1.14.0)
pip (9.0.1)
scrapely (0.13.4)
setuptools (28.8.0)
six (1.11.0)
w3lib (1.18.0)

Jan 25 '18 11:01 bitblomster

Hi @bitblomster, me too, but only on Windows. I have no issue with scrapely on Ubuntu.

But something interesting happened. I copied the scrapely folder from my Ubuntu Python environment (in site-packages) to Windows, into the same folder as my project that uses scrapely. All the issues are gone and scrapely works properly after this. @ruairif, might something be missing from scrapely on Windows?

Jan 27 '18 08:01 hiadore

I keep getting the same error in Windows whenever I try to scrape a website (using the API as well as using the command line):

Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)] on win32
[...]
 File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
    cpdef naive_match_length(sequence, pattern, int start=0, int end=-1):
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
    return np_naive_match_length(sequence, pattern, start, end)
  File "scrapely/extraction/_similarity.pyx", line 87, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2802)
    cdef np_naive_match_length(np.ndarray[np.int64_t, ndim=1] sequence,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

I've managed to try it on Ubuntu on another computer: it works, with no issues when scraping. I tried copying the Ubuntu scrapely folder to Windows, as @hiadore suggested, but I still get the exact same error. I have no clue!

Feb 09 '18 15:02 dbenitog

I also have exactly the same problem on Windows 10. Any workarounds?

May 18 '18 15:05 ramedey

@ramedey same issue here, but I'm having initial success running scrapely under WSL, https://docs.microsoft.com/en-us/windows/wsl/about (the example from the readme works :) )

May 26 '18 20:05 pawelkmiec

I have the same issue. The problem lies with numpy (a scrapely dependency) and how it treats int differently on 32-bit and 64-bit Windows systems.
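
As a rough illustration (standard library only, not scrapely's or numpy's code), the sketch below prints the width of a few C integer types on the current platform. It assumes that the numpy versions discussed in this thread map their default integer to C long, which is 32 bits wide on Windows even in 64-bit builds. The helper name c_type_sizes is made up for the example.

```python
import ctypes
import struct


def c_type_sizes():
    """Return the width in bits of a few C types on this platform."""
    return {
        "long": ctypes.sizeof(ctypes.c_long) * 8,
        "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
        "int64_t": ctypes.sizeof(ctypes.c_int64) * 8,
        "pointer": struct.calcsize("P") * 8,
    }


sizes = c_type_sizes()
for name in sorted(sizes):
    print("%-10s %d bits" % (name, sizes[name]))

if sizes["long"] != sizes["int64_t"]:
    # A buffer of C longs cannot be handed to code compiled for int64_t.
    print("C long is not 64 bits wide here")
```

On a platform where "long" prints as 32 bits, an array built with the default integer type would not match a Cython buffer declared as int64_t, which is consistent with the error message in this issue.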

Sep 15 '18 12:09 ronaldgreeff

Any workarounds on this issue?

Jun 24 '19 15:06 maximeboun
