Instagram index problem creating index error

Open Jb2817 opened this issue 9 months ago • 0 comments

Describe the bug

An IndexError is raised when trying to access Instagram posts.

How to reproduce

Any attempt to access Instagram posts should produce the error.

# Loop over each post for the current year
for post in tqdm(snsinstagram.InstagramHashtagScraper(query).get_items()):

Expected behaviour

The program should save Instagram information as a pandas DataFrame. However, when trying to access posts I get an IndexError. There's a comment in the snscrape source noting that a change on Instagram's side might cause this error.
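Until the scraper itself is fixed, one way to make the failure easier to diagnose is to wrap the iteration so that an IndexError raised inside the generator is reported instead of aborting the notebook cell. This is a minimal sketch, assuming `get_items()` behaves like an ordinary generator; `collect_posts` and `broken_items` are hypothetical names used only for illustration and are not part of snscrape.

```python
from itertools import islice


def collect_posts(items, limit):
    """Collect up to `limit` items, stopping early if the source raises IndexError.

    Returns (posts, error), where error is None when iteration succeeded.
    """
    posts = []
    try:
        for post in islice(items, limit):
            posts.append(post)
    except IndexError as exc:
        # In this report the error is raised while the initial page is parsed,
        # so no items are produced at all.
        return posts, exc
    return posts, None


def broken_items():
    """Hypothetical stand-in for a scraper generator that fails on first use."""
    raise IndexError("list index out of range")
    yield  # unreachable; its presence makes this function a generator


posts, error = collect_posts(broken_items(), limit=10)
print(len(posts), repr(error))
```

With the real scraper, `items` would be `snsinstagram.InstagramHashtagScraper(query).get_items()`, and the collected list could then be passed to `pandas.DataFrame`.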

Screenshots and recordings

No response

Operating system

macOS Ventura 13.4

Python version: output of python3 --version

Python 3.11.5

snscrape version: output of snscrape --version

snscrape 0.7.0.20230622

Scraper

snscrape.modules.instagram

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

No response

Log output

0it [00:00, ?it/s]
INFO:snscrape.modules.instagram:Retrieving initial data
INFO:snscrape.base:Retrieving https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/
DEBUG:snscrape.base:... with headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
DEBUG:snscrape.base:... with environmentSettings: {'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.instagram.com:443
DEBUG:snscrape.base:Connected to: ('157.240.241.174', 443)
DEBUG:snscrape.base:Connection cipher: ('TLS_CHACHA20_POLY1305_SHA256', 'TLSv1.3', 256)
DEBUG:urllib3.connectionpool:https://www.instagram.com:443 "GET /explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/ HTTP/1.1" 200 None
INFO:snscrape.base:Retrieved https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/: 200
DEBUG:snscrape.base:... with response headers: {'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Set-Cookie': 'csrftoken=YBE20kuKsQmjab47aSnkSn; expires=Sun, 27-Apr-2025 18:27:16 GMT; Max-Age=31449600; path=/; domain=.instagram.com; secure', 'accept-ch-lifetime': '4838400', 'accept-ch': 'viewport-width,dpr,Sec-CH-Prefers-Color-Scheme,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Platform-Version,Sec-CH-UA-Model', 'Link': 'https://www.instagram.com/explore/tags/Advanced%20Micro%20Devices%20OR%20AMD/top/; rel="canonical"', 'reporting-endpoints': 'coop_report="https://www.facebook.com/browser_reporting/coop/?minimize=0", coep_report="https://www.facebook.com/browser_reporting/coep/?minimize=0", default="https://www.instagram.com/error/ig_web_error_reports/?device_level=unknown", permissions_policy="https://www.instagram.com/error/ig_web_error_reports/"', 'report-to': '{"max_age":2592000,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/coop\/?minimize=0"}],"group":"coop_report","include_subdomains":true}, 
{"max_age":86400,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/coep\/?minimize=0"}],"group":"coep_report"}, {"max_age":259200,"endpoints":[{"url":"https:\/\/www.instagram.com\/error\/ig_web_error_reports\/?device_level=unknown"}]}, {"max_age":21600,"endpoints":[{"url":"https:\/\/www.instagram.com\/error\/ig_web_error_reports\/"}],"group":"permissions_policy"}', 'content-security-policy-report-only': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.teststagram.com *.instagram.com static.cdninstagram.com *.google-analytics.com https://translate.google.com/ https://apis.google.com/ https://accounts.google.com/ *.facebook.com *.fbcdn.net *.facebook.net 'unsafe-inline' 'unsafe-eval' blob: data: 'self';style-src *.teststagram.com *.instagram.com static.cdninstagram.com data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com;connect-src *.teststagram.com .instagram.com wss://edge-chat.instagram.com/ connect.facebook.net .facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: ws://localhost: blob: .cdninstagram.com wss://.instagram.com: 'self' https://meta.privacy-gateway.cloudflare.com/relay;font-src *.teststagram.com *.instagram.com static.cdninstagram.com data: *.fbcdn.net *.intern.facebook.com *.facebook.com fonts.gstatic.com;img-src *.teststagram.com *.instagram.com *.facebook.com *.fbcdn.net data: *.igsonar.com *.cdninstagram.com *.google-analytics.com blob: *.fbsbx.com android-webview-video-poster: *.giphy.com;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob: https://*.giphy.com;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data:;worker-src *.instagram.com/static_resources/webworker_v1/init_script/ *.instagram.com/static_resources/webworker/init_script/ *.instagram.com/static_resources/sharedworker/init_script/ *.instagram.com/www-service-worker.js;block-all-mixed-content;report-uri 
https://www.facebook.com/csp/reporting/?minimize=0;", 'content-security-policy': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.teststagram.com *.instagram.com static.cdninstagram.com *.google-analytics.com https://translate.google.com/ https://apis.google.com/ https://accounts.google.com/ *.facebook.com *.fbcdn.net .facebook.net 127.0.0.1: 'unsafe-inline' 'unsafe-eval' blob: data: 'self';style-src *.teststagram.com *.instagram.com static.cdninstagram.com data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com;connect-src *.teststagram.com .instagram.com wss://edge-chat.instagram.com/ connect.facebook.net .facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: ws://localhost: blob: .cdninstagram.com wss://.instagram.com: 'self' https://meta.privacy-gateway.cloudflare.com/relay;font-src *.teststagram.com *.instagram.com static.cdninstagram.com data: *.fbcdn.net *.intern.facebook.com *.facebook.com fonts.gstatic.com;img-src *.teststagram.com *.instagram.com *.facebook.com *.fbcdn.net data: *.igsonar.com *.cdninstagram.com *.google-analytics.com *.whatsapp.net blob: www.gstatic.com *.fbsbx.com android-webview-video-poster: *.oculuscdn.com www.googleadservices.com *.doubleclick.net *.google.com *.google.co.uk *.giphy.com;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob: https://*.giphy.com;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data: www.googleadservices.com *.doubleclick.net *.google.com *.google.co.uk;block-all-mixed-content;upgrade-insecure-requests;", 'document-policy': 'force-load-at-top', 'permissions-policy': 'accelerometer=(self), attribution-reporting=(), autoplay=(), bluetooth=(), camera=(self), ch-device-memory=(), ch-downlink=(), ch-dpr=(), ch-ect=(), ch-rtt=(), ch-save-data=(), ch-ua-arch=(), ch-ua-bitness=(), ch-viewport-height=(), ch-viewport-width=(), ch-width=(), clipboard-read=(), clipboard-write=(self), display-capture=(), 
encrypted-media=(), fullscreen=(self), gamepad=(), geolocation=(self), gyroscope=(self), hid=(), idle-detection=(), keyboard-map=(), local-fonts=(), magnetometer=(), microphone=(self), midi=(), otp-credentials=(), payment=(), picture-in-picture=(self), publickey-credentials-get=(), screen-wake-lock=(), serial=(), usb=(), window-management=(), xr-spatial-tracking=();report-to="permissions_policy"', 'cross-origin-resource-policy': 'same-origin', 'cross-origin-embedder-policy-report-only': 'require-corp;report-to="coep_report"', 'cross-origin-opener-policy': 'same-origin-allow-popups;report-to="coop_report"', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Expires': 'Sat, 01 Jan 2000 00:00:00 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'X-Frame-Options': 'DENY', 'Strict-Transport-Security': 'max-age=31536000; preload; includeSubDomains', 'x-stack': 'www', 'Content-Type': 'text/html; charset="utf-8"', 'X-FB-Debug': 'mQypGwsXYcnBYHVu2sXcPrKTeI1apWlyzIcvhTq92feWnrc4DROmKwi+swdfI2uNPh7w5XTrxe0YvuGaeWLHTQ==', 'Date': 'Sun, 28 Apr 2024 18:27:16 GMT', 'Alt-Svc': 'h3=":443"; ma=86400', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive'} 0it [00:00, ?it/s]
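
The logged URL suggests that the whole query string, including the spaces and the word OR, was sent as a single hashtag name rather than as a search expression. The sketch below uses only the standard library to reproduce the path seen in the log. How snscrape builds the URL internally is an assumption here, and `query` is reconstructed from the log rather than taken from the original notebook.

```python
from urllib.parse import quote, unquote

# Query text as it appears, percent-decoded, in the logged URL.
query = "Advanced Micro Devices OR AMD"

# Assumption: the tag name is percent-encoded into the path unchanged.
url = f"https://www.instagram.com/explore/tags/{quote(query)}/"
print(url)

# Decoding the logged path segment recovers the literal query text.
logged_segment = "Advanced%20Micro%20Devices%20OR%20AMD"
print(unquote(logged_segment))
```

If this reading is right, testing with a single plain tag (no spaces or operators) would show whether the IndexError depends on the query or occurs for every hashtag.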

IndexError                                Traceback (most recent call last)
Cell In[19], line 19
     16 year_end_date = pd.Timestamp('{}-12-31'.format(year))
     18 # Loop over each post for the current year
---> 19 for post in tqdm(snsinstagram.InstagramHashtagScraper(query).get_items()):
     20     if post.date >= year_start_date and post.date <= year_end_date:
     21         if len(posts) >= limit*(year-start_date.year+1):

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/std.py:1182, in tqdm.__iter__(self)
   1179 time = self._time
   1181 try:
-> 1182     for obj in iterable:
   1183         yield obj
   1184         # Update and possibly print the progressbar.
   1185         # Note: does not call self.update(1) for speed optimisation.

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:110, in _InstagramCommonScraper.get_items(self)
    109 def get_items(self):
--> 110     r = self._initial_page()
    111     if r.status_code == 404:
    112         _logger.warning('Page does not exist')

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:78, in _InstagramCommonScraper._initial_page(self)
     76 if self._initialPage is None:
     77     _logger.info('Retrieving initial data')
---> 78     r = self._get(self._initialUrl, headers = self._headers, responseOkCallback = self._check_initial_page_callback)
     79     if r.status_code not in (200, 404):
     80         raise snscrape.base.ScraperException(f'Got status code {r.status_code}')

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/base.py:275, in Scraper._get(self, *args, **kwargs)
    274 def _get(self, *args, **kwargs):
--> 275     return self._request('GET', *args, **kwargs)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/base.py:246, in Scraper._request(self, method, url, params, data, headers, timeout, responseOkCallback, allowRedirects, proxies)
    244         _logger.debug(f'... ... with response headers: {redirect.headers!r}')
    245 if responseOkCallback is not None:
--> 246     success, msg = responseOkCallback(r)
    247     errors.append(msg)
    248 else:

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/snscrape/modules/instagram.py:89, in _InstagramCommonScraper._check_initial_page_callback(self, r)
     87 if r.status_code != 200:
     88     return True, None
---> 89 jsonData = r.text.split('')[0] # May throw an IndexError if Instagram changes something again; we just let that bubble.
     90 try:
     91     obj = json.loads(jsonData)

IndexError: list index out of range
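
The separator string on line 89 of instagram.py appears to have been lost when the traceback was rendered. Assuming the line follows the common `text.split(marker)[1]` pattern, this error is what Python raises when the marker is no longer present in the page: `split` then returns a one-element list, and index 1 does not exist. The sketch below illustrates that behaviour; `MARKER`, `text_after_marker`, and both sample pages are hypothetical and not taken from snscrape.

```python
def text_after_marker(text, marker):
    """Return the text following `marker`, or None if the marker is absent."""
    parts = text.split(marker)
    if len(parts) < 2:
        # split() returned a single-element list, so parts[1] would raise IndexError.
        return None
    return parts[1]


MARKER = "window.data = "  # hypothetical marker, not the one snscrape uses

old_page = 'window.data = {"ok": true}'     # page that still embeds the marker
new_page = "<html>layout changed</html>"    # page after a layout change

print(text_after_marker(old_page, MARKER))
print(text_after_marker(new_page, MARKER))

# The unguarded form fails the same way as the traceback above.
try:
    new_page.split(MARKER)[1]
except IndexError as exc:
    print("IndexError:", exc)
```

A guarded lookup like this would let the caller report that the page layout changed instead of surfacing a bare IndexError.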

Dump of locals

No response

Additional context

No response

Jb2817, Apr 28 '24 18:04