
Python package for testing keyword censorship on Sina Weibo

What is blockedonweibo?

This Python package allows you to automate tests that check whether a keyword is censored on Sina Weibo, a Chinese social media site. It is an updated version of the script used to detect censored keywords for the site http://blockedonweibo.com. It handles interrupted tests, multiple tests per day, and storing results to a database, among other features that simplify testing Weibo censorship at scale. For one-off tests, the researcher merely has to feed the script a list of words; for recurring tests, simply wrap the script with a scheduler, as sketched below.
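For example, a minimal recurring test might look like the following sketch. It assumes the weibo.run() interface documented below and uses the third-party schedule package (pip install schedule), which is not a blockedonweibo dependency:

from blockedonweibo import weibo
import schedule  # third-party scheduler, not a blockedonweibo dependency
import time

def daily_test():
    # a one-off test of a keyword list; see the sections below for
    # cookies, sqlite_file, and other options
    weibo.run([u'刘晓波', 'lxb'], sqlite_file='results.sqlite')

schedule.every().day.at('09:00').do(daily_test)  # run once a day at 9am

while True:
    schedule.run_pending()
    time.sleep(60)  # check the schedule once a minute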


IMPORTANT: Upgrading from v0.1 with an existing database?

The database table has been modified to accommodate tracking of the minimum keyword strings triggering censorship. If you used blockedonweibo v0.1 and stored results to a database, you will need to update your database file.

To migrate your older database file

  1. Move update_db.py to the same directory as your database file ("results.sqlite" if you followed this setup guide)
  2. In a terminal, run python update_db.py and confirm the database file.

Version 0.2 changes

  • now includes a feature to find canonical censored keywords (the minimum set of characters required to trigger the explicit censorship message)
    • to run, pass get_canonical=True along with the rest of your arguments into run()
    • see section 4.8

Table of Contents

1  What is blockedonweibo?
2  Install the blockedonweibo package
3  Adjust your settings
4  Let's start testing!
4.1  Pass a dictionary of keywords to start testing
4.2  Pass in cookies so you can also get the number of results. Pass in sqlite_file to save the results to disk so you can load it later
4.3  If your test gets interrupted or you add more keywords, you can pick up where you left off
4.4  You can attach notes or categorizations to your keywords for easy querying and analysis later
4.5  If you want to test multiple times a day, just pass in the test_number param
4.6  It can skip redundant keywords
4.7  You can also pass in lists if you prefer (though you can't include the source or notes)
4.8  It can detect the canonical (minimum) set of characters in the search query triggering censorship

Install the blockedonweibo package

The github repo for this Weibo keyword testing script is located at https://github.com/jasonqng/blocked-on-weibo.

To begin using this Python package, run the following in your terminal:

pip install blockedonweibo

Alternatively, you can clone the repo, cd into its directory, and manually install the requirements and package:

pip install -r requirements.txt
python setup.py install

To confirm the installation works, open a Python shell (run python from your terminal) and try importing the package:

import blockedonweibo

If you don't get any errors, the package installed successfully. If not, you may need to fiddle with your Python paths and settings to ensure it's being installed to the correct location.
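One quick sanity check is to confirm which interpreter you are actually running and to install against that interpreter explicitly; these are standard Python/pip invocations, not blockedonweibo-specific:

python -c "import sys; print(sys.executable)"
python -m pip install blockedonweibo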

Adjust your settings

Your Python script only requires the following imports; everything else is handled by the package.

from blockedonweibo import weibo
import pandas as pd

You have the option of saving your test results to a file. You'll need to pass a path to a file which will store the results in sqlite format. It can be helpful to set this at the top of your script and pass the variable each time you run a test.

sqlite_file = 'results.sqlite' # name of sqlite file to read from/write to

If you want to erase any existing data you have in the sqlite file defined above, just pass overwrite=True to the create_database function. Otherwise any new results will be appended to the end of the database.

weibo.create_database(sqlite_file, overwrite=True)

This testing script is enhanced if you allow it to log into Weibo: doing so raises your rate limit threshold and lets the script return the number of results each search reports. The script will work without credentials, but supplying them is highly recommended. To do so, edit weibo_credentials.py with your email address and password. The file is git-ignored, so it will not be uploaded by default when you push commits to GitHub. You can inspect the code to verify that the credentials go nowhere except to Weibo.
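For reference, the credentials file might look like the following hypothetical sketch (check the weibo_credentials.py that ships with the repo for the exact variable names it expects):

# weibo_credentials.py -- hypothetical sketch; the real file's
# variable names may differ from these
EMAIL = 'you@example.com'
PASSWORD = 'your-weibo-password'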

Using those credentials, the script logs you in and fetches a cookie for the user session you create. This cookie can be saved to a file by passing the write_cookie parameter in the user_login function.

session = weibo.user_login(write_cookie=True)

There is a helper function to verify that the cookie actually works

cookie = session.cookies.get_dict()
print(weibo.verify_cookies_work(cookie))
True

If you already have the cookie written to disk, you don't need to perform another user_login; instead, you can use the load_cookies function to fetch the cookie from the file. Again, you can verify that it works. Store the cookie's contents (a dictionary) in a variable and pass it to the run function below if you want to test as if you were logged in; otherwise, the script will emulate a search by a logged-out user.

cookie = weibo.load_cookies()
print(weibo.verify_cookies_work(cookie))
True

Let's start testing!

Pass a dictionary of keywords to start testing

sample_keywords_df = pd.DataFrame(
    [{'keyword':'hello','source':'my dataframe'},
     {'keyword':'lxb','source':'my dataframe'},
     {'keyword':u'习胞子','source':'my dataframe'}
    ])
sample_keywords_df
keyword source
0 hello my dataframe
1 lxb my dataframe
2 习胞子 my dataframe
weibo.run(sample_keywords_df,insert=False,return_df=True)
(0, u'hello', 'has_results')
(1, u'lxb', 'censored')
(2, u'\u4e60\u80de\u5b50', 'no_results')
date datetime is_canonical keyword num_results orig_keyword result source test_number
0 2017-09-25 2017-09-25 10:12:45.280812 False hello [] None has_results my dataframe 1
0 2017-09-25 2017-09-25 10:13:00.191900 False lxb None None censored my dataframe 1
0 2017-09-25 2017-09-25 10:13:16.356805 False 习胞子 None None no_results my dataframe 1

Pass in cookies so you can also get the number of results. Pass in sqlite_file to save the results to disk so you can load it later

weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(0, u'hello', 'has_results')
(1, u'lxb', 'censored')
(2, u'\u4e60\u80de\u5b50', 'no_results')
weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None

If your test gets interrupted or you add more keywords, you can pick up where you left off

Let's pretend I wanted to test four keywords in total, but I was only able to complete the first three above. I'll add one more keyword to the test list to simulate an unfinished test.

sample_keywords_df.loc[len(sample_keywords_df.index)] = ['刘晓波','my dataframe']
sample_keywords_df
keyword source
0 hello my dataframe
1 lxb my dataframe
2 习胞子 my dataframe
3 刘晓波 my dataframe
weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(3, u'\u5218\u6653\u6ce2', 'censored')

Neat-o, it was smart enough to start right at that new keyword and not start all over again!

weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None

You can attach notes or categorizations to your keywords for easy querying and analysis later

new_keywords_df = pd.DataFrame(
    [{'keyword':'pokemon','source':'my dataframe',"notes":"pop culture"},
     {'keyword':'jay chou','source':'my dataframe',"notes":"pop culture"},
     {'keyword':u'weibo','source':'my dataframe',"notes":"social media"}
    ])
merged_keywords_df = pd.concat([sample_keywords_df,new_keywords_df]).reset_index(drop=True)
merged_keywords_df
keyword notes source
0 hello NaN my dataframe
1 lxb NaN my dataframe
2 习胞子 NaN my dataframe
3 刘晓波 NaN my dataframe
4 pokemon pop culture my dataframe
5 jay chou pop culture my dataframe
6 weibo social media my dataframe
weibo.run(merged_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(4, u'pokemon', 'has_results')
(5, u'jay chou', 'has_results')
(6, u'weibo', 'has_results')
weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
6 6 2017-09-25 2017-09-25 10:15:28.100418 1 weibo 0 0 0 0 has_results my dataframe None 63401495.0 social media
results = weibo.sqlite_to_df(sqlite_file)
results.query("notes=='pop culture'")
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
results.query("notes=='pop culture'").num_results.mean()
2853070.5

If you want to test multiple times a day, just pass in the test_number param

You can turn off verbose output in case you don't need to troubleshoot anything...

weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie,verbose='none',test_number=2)
weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
6 6 2017-09-25 2017-09-25 10:15:28.100418 1 weibo 0 0 0 0 has_results my dataframe None 63401495.0 social media
7 7 2017-09-25 2017-09-25 10:15:46.214464 2 hello 0 0 0 0 has_results my dataframe None 80454634.0 None
8 8 2017-09-25 2017-09-25 10:16:03.274804 2 lxb 0 0 0 0 censored my dataframe None NaN None
9 9 2017-09-25 2017-09-25 10:16:19.035805 2 习胞子 0 0 0 0 no_results my dataframe None NaN None
10 10 2017-09-25 2017-09-25 10:16:36.021837 2 刘晓波 0 0 0 0 censored my dataframe None NaN None

It can skip redundant keywords

more_keywords_df = pd.DataFrame(
    [{'keyword':'zhongnanhai','source':'my dataframe2',"notes":"location"},
     {'keyword':'cats','source':'my dataframe2',"notes":"pop culture"},
     {'keyword':'zhongnanhai','source':'my dataframe2',"notes":"location"}
    ])
more_keywords_df
keyword notes source
0 zhongnanhai location my dataframe2
1 cats pop culture my dataframe2
2 zhongnanhai location my dataframe2
weibo.run(more_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(0, u'zhongnanhai', 'has_results')
(1, u'cats', 'has_results')
weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
6 6 2017-09-25 2017-09-25 10:15:28.100418 1 weibo 0 0 0 0 has_results my dataframe None 63401495.0 social media
7 7 2017-09-25 2017-09-25 10:15:46.214464 2 hello 0 0 0 0 has_results my dataframe None 80454634.0 None
8 8 2017-09-25 2017-09-25 10:16:03.274804 2 lxb 0 0 0 0 censored my dataframe None NaN None
9 9 2017-09-25 2017-09-25 10:16:19.035805 2 习胞子 0 0 0 0 no_results my dataframe None NaN None
10 10 2017-09-25 2017-09-25 10:16:36.021837 2 刘晓波 0 0 0 0 censored my dataframe None NaN None
11 11 2017-09-25 2017-09-25 10:16:53.766351 1 zhongnanhai 0 0 0 0 has_results my dataframe2 None 109.0 location
12 12 2017-09-25 2017-09-25 10:17:14.124440 1 cats 0 0 0 0 has_results my dataframe2 None 648313.0 pop culture

You can also pass in lists if you prefer (though you can't include the source or notes)

sample_keywords_list = ["cats",'yes','自由亚洲电台','刘晓波','dhfjkdashfjkasdsf87']

See below how it handles connection reset errors: it waits a little extra to make sure your connection clears before continuing testing (a sketch of the general pattern follows the results table below).

weibo.run(sample_keywords_list,sqlite_file=sqlite_file,cookies=cookie)
(0, u'cats', 'has_results')
(1, u'yes', 'has_results')
自由亚洲电台 caused connection reset, waiting 95
(2, u'\u81ea\u7531\u4e9a\u6d32\u7535\u53f0', 'reset')
(3, u'\u5218\u6653\u6ce2', 'censored')
(4, u'dhfjkdashfjkasdsf87', 'no_results')
weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
6 6 2017-09-25 2017-09-25 10:15:28.100418 1 weibo 0 0 0 0 has_results my dataframe None 63401495.0 social media
7 7 2017-09-25 2017-09-25 10:15:46.214464 2 hello 0 0 0 0 has_results my dataframe None 80454634.0 None
8 8 2017-09-25 2017-09-25 10:16:03.274804 2 lxb 0 0 0 0 censored my dataframe None NaN None
9 9 2017-09-25 2017-09-25 10:16:19.035805 2 习胞子 0 0 0 0 no_results my dataframe None NaN None
10 10 2017-09-25 2017-09-25 10:16:36.021837 2 刘晓波 0 0 0 0 censored my dataframe None NaN None
11 11 2017-09-25 2017-09-25 10:16:53.766351 1 zhongnanhai 0 0 0 0 has_results my dataframe2 None 109.0 location
12 12 2017-09-25 2017-09-25 10:17:14.124440 1 cats 0 0 0 0 has_results my dataframe2 None 648313.0 pop culture
13 13 2017-09-25 2017-09-25 10:17:36.205255 1 cats 0 0 0 0 has_results list None 648313.0 None
14 14 2017-09-25 2017-09-25 10:17:54.330039 1 yes 0 0 0 0 has_results list None 28413048.0 None
15 15 2017-09-25 2017-09-25 10:19:47.007930 1 自由亚洲电台 0 0 0 0 reset list None NaN None
16 16 2017-09-25 2017-09-25 10:20:03.491231 1 刘晓波 0 0 0 0 censored list None NaN None
17 17 2017-09-25 2017-09-25 10:20:18.747414 1 dhfjkdashfjkasdsf87 0 0 0 0 no_results list None NaN None
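
The connection-reset handling above roughly follows a common backoff pattern. The sketch below is not the package's actual internals, just the general shape; search_fn stands in for a hypothetical callable that performs one Weibo search:

import time
import requests

def search_with_backoff(query, search_fn, base_delay=15, reset_delay=95):
    # search_fn is a hypothetical callable performing one Weibo search
    try:
        result = search_fn(query)
        time.sleep(base_delay)   # normal politeness delay between queries
        return result
    except requests.exceptions.ConnectionError:
        # wait much longer so the connection clears, and record the
        # keyword as 'reset' so it can be retested later
        print(u'%s caused connection reset, waiting %s' % (query, reset_delay))
        time.sleep(reset_delay)
        return 'reset'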

It can detect the canonical (minimum) set of characters in the search query triggering censorship

Set get_canonical=True when running to find which part of a censored search query is actually triggering the censorship. Note: this only works on explicitly censored search queries.

Finding canonical censored keywords can take a large number of search cycles, especially with longer original queries.
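To give a feel for why this is expensive, here is a simplified sketch of one way to find a minimal censored substring; test_censored is a hypothetical helper that performs one real search and reports whether the explicit censorship message appeared (the package's actual algorithm may differ):

def find_canonical(query, test_censored):
    # try the shortest substrings first so the first hit is minimal
    n = len(query)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            candidate = query[start:start + length]
            if test_censored(candidate):  # one real search per candidate
                return candidate
    return query  # no shorter trigger found; the full query is canonical

The number of candidate substrings grows quadratically with query length, which is why longer queries can take many search cycles.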

weibo.run(['江蛤','江泽民江蛤蟆'],sqlite_file=sqlite_file,cookies=cookie,continue_interruptions=False,get_canonical=True)

If we find a minimum keyword component, we record it as a keyword, set the is_canonical column to True, and record the full search query in orig_keyword. For completeness, we also include the original keyword as its own entry with is_canonical=False.

weibo.sqlite_to_df(sqlite_file)
id date datetime_logged test_number keyword censored no_results reset is_canonical result source orig_keyword num_results notes
0 0 2017-09-25 2017-09-25 10:13:37.816720 1 hello 0 0 0 0 has_results my dataframe None 80454701.0 None
1 1 2017-09-25 2017-09-25 10:13:54.356722 1 lxb 0 0 0 0 censored my dataframe None NaN None
2 2 2017-09-25 2017-09-25 10:14:11.489530 1 习胞子 0 0 0 0 no_results my dataframe None NaN None
3 3 2017-09-25 2017-09-25 10:14:29.667395 1 刘晓波 0 0 0 0 censored my dataframe None NaN None
4 4 2017-09-25 2017-09-25 10:14:49.107078 1 pokemon 0 0 0 0 has_results my dataframe None 5705260.0 pop culture
5 5 2017-09-25 2017-09-25 10:15:09.762484 1 jay chou 0 0 0 0 has_results my dataframe None 881.0 pop culture
6 6 2017-09-25 2017-09-25 10:15:28.100418 1 weibo 0 0 0 0 has_results my dataframe None 63401495.0 social media
7 7 2017-09-25 2017-09-25 10:15:46.214464 2 hello 0 0 0 0 has_results my dataframe None 80454634.0 None
8 8 2017-09-25 2017-09-25 10:16:03.274804 2 lxb 0 0 0 0 censored my dataframe None NaN None
9 9 2017-09-25 2017-09-25 10:16:19.035805 2 习胞子 0 0 0 0 no_results my dataframe None NaN None
10 10 2017-09-25 2017-09-25 10:16:36.021837 2 刘晓波 0 0 0 0 censored my dataframe None NaN None
11 11 2017-09-25 2017-09-25 10:16:53.766351 1 zhongnanhai 0 0 0 0 has_results my dataframe2 None 109.0 location
12 12 2017-09-25 2017-09-25 10:17:14.124440 1 cats 0 0 0 0 has_results my dataframe2 None 648313.0 pop culture
13 13 2017-09-25 2017-09-25 10:17:36.205255 1 cats 0 0 0 0 has_results list None 648313.0 None
14 14 2017-09-25 2017-09-25 10:17:54.330039 1 yes 0 0 0 0 has_results list None 28413048.0 None
15 15 2017-09-25 2017-09-25 10:19:47.007930 1 自由亚洲电台 0 0 0 0 reset list None NaN None
16 16 2017-09-25 2017-09-25 10:20:03.491231 1 刘晓波 0 0 0 0 censored list None NaN None
17 17 2017-09-25 2017-09-25 10:20:18.747414 1 dhfjkdashfjkasdsf87 0 0 0 0 no_results list None NaN None
18 18 2017-11-15 2017-11-15 12:38:32.931313 1 江蛤 0 0 0 1 censored list 江蛤 NaN None
19 19 2017-11-15 2017-11-15 12:38:32.963135 1 江蛤 0 0 0 0 censored list None NaN None
20 20 2017-11-15 2017-11-15 12:40:21.294841 1 江蛤 0 0 0 1 censored list 江泽民江蛤蟆 NaN None
21 21 2017-11-15 2017-11-15 12:40:21.326378 1 江泽民江蛤蟆 0 0 0 0 censored list None NaN None