[BUG] reviews_all doesn't download all reviews of an app with large amount of reviews

Open Jl-wei opened this issue 11 months ago • 59 comments

Library version 1.2.6

Describe the bug I cannot download all the reviews of an app with a large number of reviews. The number of downloaded reviews is always a multiple of 199.

Code

from google_play_scraper import reviews_all

result = reviews_all("com.google.android.apps.fitness")
print(len(result))
# observed output: 995

Expected behavior Expect to download all the reviews with reviews_all, which should be at least 20k

Additional context No

Jl-wei avatar Mar 01 '24 20:03 Jl-wei

I'm seeing the same issue even when I set the number of reviews (25000 in my case). I'm only getting back about 500, and the number changes each time I run it.

funnan avatar Mar 02 '24 06:03 funnan

Me too, and I found that the output number is always a multiple of 199. It seems that Google Play randomly blocks retrieval of the next page of reviews.

Jl-wei avatar Mar 02 '24 16:03 Jl-wei

This is probably a dupe of #208.

The error seems to be the Play service intermittently returning an error inside a 200 success response, which then fails to parse as the JSON the library expects. The response contains this ....store.error.PlayDataError message.

)]}'

[["wrb.fr","UsvDTd",null,null,null,[5,null,[["type.googleapis.com/wireless.android.finsky.boq.web.data.store.error.PlayDataError",[1]]]],"generic"],["di",45],["af.httprm",45,"-6355766929392607683",2]]

The error seems to happen frequently but not reliably. Scraping in chunks of 200 reviews, basically every request has a decent chance of crashing, resulting in usually 200-1000 total reviews scraped before it craps out.

Currently, the library swallows this exception silently and quits. Handling this error lets the scraping continue as normal.
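
A minimal sketch of how such a response could be detected before parsing, assuming the error body always has the shape shown above: a `)]}'` guard line followed by a JSON envelope whose `wrb.fr` entry has a null payload at index 2. The function names are illustrative and are not part of google-play-scraper.

```python
import json

XSSI_PREFIX = ")]}'"


def parse_envelope(body: str) -> list:
    """Strip the leading guard line and parse the JSON envelope."""
    text = body.strip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def has_review_payload(body: str) -> bool:
    """Return False when the "wrb.fr" entry carries no payload (the error case)."""
    for entry in parse_envelope(body):
        if entry and entry[0] == "wrb.fr":
            # Assumption: on success index 2 holds a JSON string; in the
            # error response quoted above it is null.
            return isinstance(entry[2], str)
    return False


error_body = """)]}'

[["wrb.fr","UsvDTd",null,null,null,[5,null,[["type.googleapis.com/wireless.android.finsky.boq.web.data.store.error.PlayDataError",[1]]]],"generic"],["di",45],["af.httprm",45,"-6355766929392607683",2]]"""

print(has_review_payload(error_body))
```

A null payload would also explain a TypeError further down the line, since `json.loads(None)` raises one.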

We monkey-patched around it like this and seem to have gotten back to workable scraping:

import json
from typing import Optional

import google_play_scraper
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]


google_play_scraper.reviews._fetch_review_items = _fetch_review_items

adilosa avatar Mar 05 '24 02:03 adilosa
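
The patch above retries by calling itself with no limit, so a long run of errors would keep recursing. A minimal sketch of a bounded retry with backoff, assuming the only failure signal is the `error.PlayDataError` marker in the response body. `fetch_body`, `max_attempts`, `base_delay` and `sleep` are hypothetical names, not library parameters.

```python
import time
from typing import Callable

ERROR_MARKER = "error.PlayDataError"


def fetch_with_retry(
    fetch_body: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_body until the body has no error marker or attempts run out."""
    for attempt in range(max_attempts):
        body = fetch_body()
        if ERROR_MARKER not in body:
            return body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still failing after {max_attempts} attempts")


# Simulated responses: two errors, then a usable body.
responses = iter(["error.PlayDataError", "error.PlayDataError", "reviews-json"])
print(fetch_with_retry(lambda: next(responses), sleep=lambda seconds: None))
```

Raising after the last attempt makes the failure visible to the caller instead of ending the scrape silently.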

Still not able to get more than a few hundred reviews.

funnan avatar Mar 05 '24 12:03 funnan

@funnan, the monkey patch @adilosa posted worked well for me.

paulolacombe avatar Mar 05 '24 14:03 paulolacombe

Hey @adilosa @funnan @paulolacombe, can you please explain how to implement this to fix the issue? I am trying to scrape reviews using reviews_all in Google Colab, but it won't work for me. It would be great if you could help!

Shivam-170103 avatar Mar 06 '24 06:03 Shivam-170103

Hey @Shivam-170103, you need to use the code lines that @adilosa provided to replace the corresponding ones in the reviews.py file in your environment. Let me know if that helps, as I am not that familiar with Google Colab.

paulolacombe avatar Mar 06 '24 15:03 paulolacombe

Thanks @adilosa and @paulolacombe, your posts worked for me :)

terrichiachia avatar Mar 07 '24 02:03 terrichiachia

I don't know why, but even after applying @adilosa's solution, the number of reviews returned here is still very low.

lucasbral avatar Mar 07 '24 12:03 lucasbral

Hello! I tried the monkey patch suggested by @adilosa, scraping a big app like eBay.

Instead of getting 8 or 10 reviews, I did end up getting 199, but I am expecting thousands of reviews (that's how it used to be several weeks ago).

Any update on getting this fixed? Cheers, and thank you

ej-white avatar Mar 09 '24 23:03 ej-white

Same for me TT. The number of reviews scraped has plummeted since around 15 Feb, and @adilosa's patch does not change my numbers by much. Is there something else I can try?

sfischerw avatar Mar 11 '24 05:03 sfischerw

This mod did not work for me either. I tried a different approach that worked for me:

In reviews.py:

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            #funnan MOD start
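            # continuation_token is None on the first call, so guard the attribute access
            token = continuation_token.token if continuation_token is not None else None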
            continue
            #MOD end

funnan avatar Mar 11 '24 05:03 funnan

@funnan, thanks for sharing it! It does not fix the issue for me. I still only retrieve 200-300 reviews for an app like eBay, and every run still yields a different number of reviews.

sfischerw avatar Mar 17 '24 21:03 sfischerw

@funnan Thank you! I tried that and seemed to get a few more reviews, but not the full count. I'm not sure I implemented the patch correctly, though.

What I did was put the entire features/reviews.py into a new file (my_reviews.py), updated the try/except block with your change, and patched it like this:

import google_play_scraper
from my_reviews import reviews  # <- patched version

google_play_scraper.features.reviews = reviews

# Then call google_play_scraper.reviews(app, count=1000, ...)

Is this how to apply your patch? If not, could you provide an example of the correct way? Thanks so much

ej-white avatar Mar 17 '24 21:03 ej-white
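
On how a patch like this takes effect in general: a function looks up its helpers in the globals of the module where it was defined, so a replacement has to be assigned on that module object. A minimal sketch using a stand-in module built with `types.ModuleType`. `fake_reviews` and its functions are hypothetical and only mimic the layout of a reviews.py file; whether the same assignment reaches the right object in google-play-scraper depends on its package layout.

```python
import types

# Build a stand-in module with a public function that calls a private helper.
fake_reviews = types.ModuleType("fake_reviews")
exec(
    "def _fetch_review_items():\n"
    "    return 'original'\n"
    "\n"
    "def reviews():\n"
    "    return _fetch_review_items()\n",
    fake_reviews.__dict__,
)

# A name bound earlier still refers to the same function object.
reviews = fake_reviews.reviews
before = reviews()


def patched_fetch():
    return "patched"


# Assign the replacement on the module that defines the helper.
fake_reviews._fetch_review_items = patched_fetch
after = reviews()

print(before, after)
```

The already-imported `reviews` picks up the new helper because the lookup happens at call time, in the defining module's namespace.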

Neither mod works for me. The first doesn't change anything, and funnan's just loops forever and never returns.

Bigsy avatar Mar 20 '24 11:03 Bigsy
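
A loop that retries on every failure without a cap never exits if the service keeps failing, which would match the behaviour reported here. A minimal sketch of a pagination loop that gives up after a number of consecutive failures. `fetch_page` and `max_consecutive_failures` are hypothetical; the exception types are the ones caught in the try/except patch earlier in the thread.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def collect_pages(
    fetch_page: Callable[[Optional[str]], Page],
    max_consecutive_failures: int = 5,
) -> List[dict]:
    """Follow pagination tokens, retrying a failed page a bounded number of times."""
    results: List[dict] = []
    token: Optional[str] = None
    failures = 0
    while True:
        try:
            items, next_token = fetch_page(token)
        except (TypeError, IndexError):
            failures += 1
            if failures >= max_consecutive_failures:
                break  # give up instead of looping forever
            continue  # token is unchanged, so the same page is retried
        failures = 0
        results.extend(items)
        if next_token is None:
            break
        token = next_token
    return results


def always_failing(token):
    raise TypeError("simulated error response")


# Terminates with an empty list instead of hanging.
print(collect_pages(always_failing))
```

Whatever was collected before the cap is reached is still returned, so a partial scrape is not lost.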

I'm having the same issue, and trying to use the workaround posted by @adilosa (thx!).

However, it's giving me a pagination token error.

TypeError: Formats._Reviews.build_body() missing 1 required positional argument: 'pagination_token'

Can someone please tell me what this should be set to? I've tried None, 0, 100, 200, and 2000 as values for 'pagination_token', but always get the same TypeError.

This is how I have the variables defined:

google_play_scraper.reviews._fetch_review_items = _fetch_review_items

# Set values for 'url', 'app_id', 'sort', 'count', 'filter_score_with', and 'pagination_token'
url = 'https://play.google.com/store/getreviews'
app_id = 'com.doctorondemand.android.patient'
sort = 1  # 1 for most relevant, 2 for newest
count = 20  # Number of reviews to fetch
filter_score_with = None
pagination_token = 100

# Example call to the function with provided values
_fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)

Greatly appreciate any input.

MemeRunner avatar Mar 20 '24 12:03 MemeRunner
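
The TypeError suggests that `build_body` in the installed version takes one more positional argument than the patch passes; the full reviews.py posted later in this thread passes `filter_device_with` between `filter_score_with` and `pagination_token`. A minimal sketch that picks the argument list from the callable's parameter count. `build_v5` and `build_v6` are hypothetical stand-ins, not the library's functions. Going by the `Optional[str]` type hint, the token itself is a string handed back by a previous request (or None for the first page), so numeric guesses would not be expected to help.

```python
import inspect
from typing import Callable, Optional


def review_body_args(
    build_body: Callable,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
) -> tuple:
    """Return positional args matching the number of parameters build_body takes."""
    score = "null" if filter_score_with is None else filter_score_with
    device = "null" if filter_device_with is None else filter_device_with
    n_params = len(inspect.signature(build_body).parameters)
    if n_params >= 6:
        return (app_id, sort, count, score, device, pagination_token)
    return (app_id, sort, count, score, pagination_token)


# Hypothetical stand-ins for an older and a newer build_body.
def build_v5(app_id, sort, count, score, token):
    return "5 args"


def build_v6(app_id, sort, count, score, device, token):
    return "6 args"


for build in (build_v5, build_v6):
    args = review_body_args(build, "com.example.app", 2, 20, None, None, None)
    print(build(*args))
```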

Here's my code (I set the number of reviews I need and break out of the loop once that number has been reached):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

# Fetch reviews using google_play_scraper, Replace with ur app-id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
# Colab only: download the CSV to the local machine
from google.colab import files
files.download(f'reviews-{app_id}_{today}.csv')

And in reviews.py I added the mod from my earlier comment.

funnan avatar Mar 20 '24 19:03 funnan
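
If a retry re-requests a page that was already collected, the combined list may contain repeats. A minimal sketch for dropping them before building the DataFrame, assuming each review dict carries a unique `reviewId` key (check the keys in your own results; the sample data below is made up).

```python
from typing import Iterable, List


def dedupe_reviews(reviews: Iterable[dict], key: str = "reviewId") -> List[dict]:
    """Keep the first occurrence of each review id, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        review_id = review.get(key)
        if review_id is None:
            # No id to compare on, so keep the row rather than guess.
            unique.append(review)
            continue
        if review_id in seen:
            continue
        seen.add(review_id)
        unique.append(review)
    return unique


sample = [
    {"reviewId": "a", "score": 5},
    {"reviewId": "b", "score": 1},
    {"reviewId": "a", "score": 5},
]
print(len(dedupe_reviews(sample)))
```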

I tried your code, and it worked for me running on Colab.

Mayumiwandi avatar Mar 21 '24 16:03 Mayumiwandi

@funnan Thank you, that works!

ej-white avatar Mar 24 '24 23:03 ej-white

@JoMingyu Any chance we could get @funnan 's fix added to the code and merged?

It works for me and others, and I can once again scrape tens of thousands of reviews. Based on this discussion, it seems like this issue is affecting many people! Cheers

ej-white avatar Mar 24 '24 23:03 ej-white

Thanks bro worked for me as well

HuDHuD0x1 avatar Mar 29 '24 10:03 HuDHuD0x1

Unfortunately, it is still not working for me. I suspect that Google has put some limits on crawling.

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

app_id = 'com.zhiliaoapp.musically'

result = []
continuation_token = None
reviews_count = 5000

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=199
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
print(len(df))

The progress bar stops after displaying the following: 8%|▊ | 398/5000 [00:00<00:03, 1302.69it/s]398. Sometimes it retrieves more, such as 995 reviews, but most of the time only 199 or 398 are retrieved.

myownhoney avatar Apr 01 '24 22:04 myownhoney
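
The totals reported in this thread (199, 398, 995) are whole multiples of 199, which is the value of `MAX_COUNT_EACH_FETCH` in the reviews.py posted later in the thread, so each total corresponds to a number of complete pages fetched before the loop stopped. A small sketch of that arithmetic; the helper name is illustrative.

```python
PAGE_SIZE = 199  # MAX_COUNT_EACH_FETCH in the reviews.py posted in this thread


def pages_fetched(total_reviews: int, page_size: int = PAGE_SIZE) -> tuple:
    """Return (complete_pages, leftover_reviews) for a given total."""
    return divmod(total_reviews, page_size)


for total in (199, 398, 995):
    print(total, pages_fetched(total))
```

A leftover of zero in every case is consistent with the scrape ending on a page boundary, that is, when the request for the next page fails.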

@myownhoney did you edit the reviews.py file using the fix from @funnan? I just tested it for v1.2.6 and this app id: "com.ingka.ikea.app" and except for hanging on 10950 reviews it works.

AndreasKarasenko avatar Apr 02 '24 09:04 AndreasKarasenko

it works now :) Cheers!

myownhoney avatar Apr 02 '24 10:04 myownhoney

@AndreasKarasenko @myownhoney can you show me your code, please? It still does not work for me either.

RamaDNA avatar Apr 03 '24 13:04 RamaDNA

My code is in the previous comment. Have you tried editing reviews.py? If you're working on Colab, I strongly suggest you run this code before running your scrape code:

import json
from time import sleep
from typing import List, Optional, Tuple

from google_play_scraper import Sort
from google_play_scraper.constants.element import ElementSpecs
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

MAX_COUNT_EACH_FETCH = 199


class _ContinuationToken:
    __slots__ = (
        "token",
        "lang",
        "country",
        "sort",
        "count",
        "filter_score_with",
        "filter_device_with",
    )

    def __init__(
        self, token, lang, country, sort, count, filter_score_with, filter_device_with
    ):
        self.token = token
        self.lang = lang
        self.country = country
        self.sort = sort
        self.count = count
        self.filter_score_with = filter_score_with
        self.filter_device_with = filter_device_with


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]


def reviews(
    app_id: str,
    lang: str = "en",
    country: str = "us",
    sort: Sort = Sort.NEWEST,
    count: int = 100,
    filter_score_with: int = None,
    filter_device_with: int = None,
    continuation_token: _ContinuationToken = None,
) -> Tuple[List[dict], _ContinuationToken]:
    sort = sort.value

    if continuation_token is not None:
        token = continuation_token.token

        if token is None:
            return (
                [],
                continuation_token,
            )

        lang = continuation_token.lang
        country = continuation_token.country
        sort = continuation_token.sort
        count = continuation_token.count
        filter_score_with = continuation_token.filter_score_with
        filter_device_with = continuation_token.filter_device_with
    else:
        token = None

    url = Formats.Reviews.build(lang=lang, country=country)

    _fetch_count = count

    result = []

    while True:
        if _fetch_count == 0:
            break

        if _fetch_count > MAX_COUNT_EACH_FETCH:
            _fetch_count = MAX_COUNT_EACH_FETCH

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            #funnan MOD start
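            # continuation_token is None on the first call, so guard the attribute access
            token = continuation_token.token if continuation_token is not None else None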
            continue
            #MOD end

        for review in review_items:
            result.append(
                {
                    k: spec.extract_content(review)
                    for k, spec in ElementSpecs.Review.items()
                }
            )

        _fetch_count = count - len(result)

        if isinstance(token, list):
            token = None
            break

    return (
        result,
        _ContinuationToken(
            token, lang, country, sort, count, filter_score_with, filter_device_with
        ),
    )


def reviews_all(app_id: str, sleep_milliseconds: int = 0, **kwargs) -> list:
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    continuation_token = None

    result = []

    while True:
        _result, continuation_token = reviews(
            app_id,
            count=MAX_COUNT_EACH_FETCH,
            continuation_token=continuation_token,
            **kwargs
        )

        result += _result

        if continuation_token.token is None:
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    return result

myownhoney avatar Apr 03 '24 15:04 myownhoney

If we run this code before our script, is it still necessary to edit reviews.py first, or is running this code enough? I ask because the @funnan patch worked for me on Jupyter.

HuDHuD0x1 avatar Apr 03 '24 15:04 HuDHuD0x1

So I should run that code first, and then run this code, right? Please help.

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

app_id = 'com.zhiliaoapp.musically'

result = []
continuation_token = None
reviews_count = 5000

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=199
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
print(len(df))

RamaDNA avatar Apr 03 '24 16:04 RamaDNA

@AndreasKarasenko @myownhoney can you show me your code please. it is still did not work for me as well

My code is in the previous comment. Have you tried editing reviews.py? If you're working on colab, I strongly suggest you run this code before running your scrape code `

import json
from time import sleep
from typing import List, Optional, Tuple

from google_play_scraper import Sort
from google_play_scraper.constants.element import ElementSpecs
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

MAX_COUNT_EACH_FETCH = 199


class _ContinuationToken:
    __slots__ = (
        "token",
        "lang",
        "country",
        "sort",
        "count",
        "filter_score_with",
        "filter_device_with",
    )

    def __init__(
        self, token, lang, country, sort, count, filter_score_with, filter_device_with
    ):
        self.token = token
        self.lang = lang
        self.country = country
        self.sort = sort
        self.count = count
        self.filter_score_with = filter_score_with
        self.filter_device_with = filter_device_with


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]


def reviews(
    app_id: str,
    lang: str = "en",
    country: str = "us",
    sort: Sort = Sort.NEWEST,
    count: int = 100,
    filter_score_with: int = None,
    filter_device_with: int = None,
    continuation_token: _ContinuationToken = None,
) -> Tuple[List[dict], _ContinuationToken]:
    sort = sort.value

    if continuation_token is not None:
        token = continuation_token.token

        if token is None:
            return (
                [],
                continuation_token,
            )

        lang = continuation_token.lang
        country = continuation_token.country
        sort = continuation_token.sort
        count = continuation_token.count
        filter_score_with = continuation_token.filter_score_with
        filter_device_with = continuation_token.filter_device_with
    else:
        token = None

    url = Formats.Reviews.build(lang=lang, country=country)

    _fetch_count = count

    result = []

    while True:
        if _fetch_count == 0:
            break

        if _fetch_count > MAX_COUNT_EACH_FETCH:
            _fetch_count = MAX_COUNT_EACH_FETCH

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
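        except (TypeError, IndexError):
            # funnan MOD start
            # The failed call never rebinds `token`, so `continue` retries the
            # same page. Reading continuation_token.token here would raise
            # AttributeError on the first call, when continuation_token is None.
            continue
            # MOD end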

        for review in review_items:
            result.append(
                {
                    k: spec.extract_content(review)
                    for k, spec in ElementSpecs.Review.items()
                }
            )

        _fetch_count = count - len(result)

        if isinstance(token, list):
            token = None
            break

    return (
        result,
        _ContinuationToken(
            token, lang, country, sort, count, filter_score_with, filter_device_with
        ),
    )


def reviews_all(app_id: str, sleep_milliseconds: int = 0, **kwargs) -> list:
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    continuation_token = None

    result = []

    while True:
        _result, continuation_token = reviews(
            app_id,
            count=MAX_COUNT_EACH_FETCH,
            continuation_token=continuation_token,
            **kwargs
        )

        result += _result

        if continuation_token.token is None:
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    return result

`

If we use this code before running our script, is it compulsory to edit reviews.py first, or can we just run this code and that's all? I ask because the @funnan patch worked for me on Jupyter.

If you run this code, you don't need to edit reviews.py; in fact, this code is the edited reviews.py
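One caveat about the `except (TypeError, IndexError)` branch in that code: it retries with no limit, so if Play keeps returning the error, the loop never ends. Here is a minimal sketch of a bounded retry. It is not part of the library; `call_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` only simulates `_fetch_review_items` failing twice before succeeding.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay_seconds: float = 0.0,
    retry_on: Tuple[type, ...] = (TypeError, IndexError),
) -> T:
    """Call fetch() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Linear backoff between attempts; a base of 0 disables sleeping.
            time.sleep(base_delay_seconds * attempt)
    raise RuntimeError("unreachable")


# Hypothetical stand-in for _fetch_review_items: fails twice, then returns a page.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IndexError("simulated PlayDataError response")
    return ["review-1", "review-2"], "next-token"


items, token = call_with_retries(flaky_fetch)
```

Inside `reviews`, the same idea would mean counting failed attempts in the `except` branch and raising (or breaking) once a limit is reached, instead of an unconditional `continue`.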

myownhoney avatar Apr 03 '24 17:04 myownhoney

@AndreasKarasenko @myownhoney can you show me your code, please? It still does not work for me either.

My code is in the previous comment. Have you tried editing reviews.py? If you're working on Colab, I strongly suggest you run this code before running your scraping code `

import json
from time import sleep
from typing import List, Optional, Tuple

from google_play_scraper import Sort
from google_play_scraper.constants.element import ElementSpecs
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

MAX_COUNT_EACH_FETCH = 199


class _ContinuationToken:
    __slots__ = (
        "token",
        "lang",
        "country",
        "sort",
        "count",
        "filter_score_with",
        "filter_device_with",
    )

    def __init__(
        self, token, lang, country, sort, count, filter_score_with, filter_device_with
    ):
        self.token = token
        self.lang = lang
        self.country = country
        self.sort = sort
        self.count = count
        self.filter_score_with = filter_score_with
        self.filter_device_with = filter_device_with


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]


def reviews(
    app_id: str,
    lang: str = "en",
    country: str = "us",
    sort: Sort = Sort.NEWEST,
    count: int = 100,
    filter_score_with: int = None,
    filter_device_with: int = None,
    continuation_token: _ContinuationToken = None,
) -> Tuple[List[dict], _ContinuationToken]:
    sort = sort.value

    if continuation_token is not None:
        token = continuation_token.token

        if token is None:
            return (
                [],
                continuation_token,
            )

        lang = continuation_token.lang
        country = continuation_token.country
        sort = continuation_token.sort
        count = continuation_token.count
        filter_score_with = continuation_token.filter_score_with
        filter_device_with = continuation_token.filter_device_with
    else:
        token = None

    url = Formats.Reviews.build(lang=lang, country=country)

    _fetch_count = count

    result = []

    while True:
        if _fetch_count == 0:
            break

        if _fetch_count > MAX_COUNT_EACH_FETCH:
            _fetch_count = MAX_COUNT_EACH_FETCH

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
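        except (TypeError, IndexError):
            # funnan MOD start
            # The failed call never rebinds `token`, so `continue` retries the
            # same page. Reading continuation_token.token here would raise
            # AttributeError on the first call, when continuation_token is None.
            continue
            # MOD end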

        for review in review_items:
            result.append(
                {
                    k: spec.extract_content(review)
                    for k, spec in ElementSpecs.Review.items()
                }
            )

        _fetch_count = count - len(result)

        if isinstance(token, list):
            token = None
            break

    return (
        result,
        _ContinuationToken(
            token, lang, country, sort, count, filter_score_with, filter_device_with
        ),
    )


def reviews_all(app_id: str, sleep_milliseconds: int = 0, **kwargs) -> list:
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    continuation_token = None

    result = []

    while True:
        _result, continuation_token = reviews(
            app_id,
            count=MAX_COUNT_EACH_FETCH,
            continuation_token=continuation_token,
            **kwargs
        )

        result += _result

        if continuation_token.token is None:
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    return result

`

So I should run this code first, and after that run the code below, right? Help me please.

`from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

app_id = 'com.zhiliaoapp.musically'

result = []
continuation_token = None
reviews_count = 5000

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=199
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
print(len(df))`

Yeah, just run the first one, then the second one.
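One thing to watch when running the two snippets in sequence in the same notebook: the second one starts with `from google_play_scraper import Sort, reviews`, and a `from ... import` rebinds the name `reviews` to the installed library's function. If reviews.py itself was not edited, the patched definition from the first cell would then no longer be the one that gets called. The sketch below only illustrates that rebinding, with the standard library's `json.dumps` standing in for `reviews`; it does not touch the scraper.

```python
import json


# Stand-in for the patched `reviews` defined in the first notebook cell.
def dumps(obj):
    return "patched"


before = dumps({})  # calls the local, "patched" definition

# Stand-in for `from google_play_scraper import Sort, reviews` in the second
# cell: the import rebinds the name to the module's own function.
from json import dumps

after = dumps({})  # now calls json.dumps, not the local definition
```

If that applies to your setup, importing only `Sort` in the second snippet (the first cell already defines `reviews`) should keep the patched function in use.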

myownhoney avatar Apr 03 '24 17:04 myownhoney