Mapping Question (twitter-coronavirus homework)
Hello! I am trying to run the mapper for programming task 1 on the twitter-coronavirus homework.
I was wondering how I can tell when the mapping has finished? I don't want to start the other steps until the mapping step is finished so I don't miss any data.
I ran the following and used the nohup command:
lambda-server:~/twitter_coronavirus (master *%=) $ chmod +x run_maps.sh
lambda-server:~/twitter_coronavirus (master *%=) $ nohup ./run_maps.sh
nohup: ignoring input and appending output to 'nohup.out'
You should use a combination of:
- the ps -ef | grep username incantation to check what processes you have running, and
- manually inspecting the outputs of the map.py file to see if they are reasonable.
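As a rough sketch of the first check, the filter below counts how many lines of ps -ef output belong to map.py processes. The function name count_map_procs is made up for illustration. The pattern '[m]ap\.py' is a common trick so that the grep process does not match its own command line.

```shell
# Hypothetical helper: count the `ps -ef` lines (read from stdin) that
# mention map.py. The bracket in '[m]ap\.py' keeps grep from counting itself.
count_map_procs() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable in scripts that run with "set -e".
    grep -c '[m]ap\.py' || true
}

# Example call on the server:
#   ps -ef | count_map_procs
```

A count of 0 only says that no mapper is running right now. It does not distinguish between "finished" and "crashed", which is why inspecting the output files is still needed.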
Thank you! When I run ps -ef | grep sfullerton24, I do not see map.py running:
lambda-server:~/twitter_coronavirus/src (master *=) $ ps -ef | grep sfullerton24
sfuller+ 2621 45218 0 Feb05 ? 00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+ 2639 45218 0 Feb05 ? 00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip :: -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+ 2662 45155 0 Feb05 ? 00:00:25 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id e9cd8d7a885b853faff976fe2ffd997dde30c2de6f329acc907a8df3b23de982 -address /run/user/1360/docker/containerd/containerd.sock
sfuller+ 13738 81139 0 11:47 pts/5 00:00:00 grep sfullerton24
sfuller+ 45174 45155 0 Feb01 ? 00:00:00 rootlesskit --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 45191 45174 0 Feb01 ? 00:01:28 /proc/self/exe --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 70143 45155 0 Feb01 ? 00:01:24 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id fd122a711a977f99833a1ea2ff96abecf82817fc9c8fa6a1c4c012a7dfcdc4e3 -address /run/user/1360/docker/containerd/containerd.sock
root 81076 1767 0 10:26 ? 00:00:00 sshd: sfullerton24 [priv]
sfuller+ 81138 81076 0 10:26 ? 00:00:00 sshd: sfullerton24@pts/5
I manually inspected my outputs and see that I am only receiving .zip.lang files. I was wondering if you could tell me if there is an issue in my mapping process?
This is what I wrote for run_maps.sh:
# file will loop over each file in dataset and run the map.py command
for file in '/data/Twitter dataset/'geoTwitter20*; do
    echo "Processing"
    ./src/map.py --input_path="$file" &
done
echo "Processing complete."
(When I run the script I don't see the echos.)
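One thing worth noting about the loop above: each map.py is launched with &, so the loop returns right away and "Processing complete." is echoed before any mapper has finished. Below is a minimal sketch of a variant that blocks until every background job has exited, using the shell's built-in wait. The run_all function and its calling convention are assumptions for illustration, not part of the homework.

```shell
# Hypothetical helper: run one mapper per input file in parallel, then wait.
# $1 is the mapper command; the remaining arguments are the input files.
run_all() {
    mapper=$1
    shift
    for file in "$@"; do
        "$mapper" --input_path="$file" &   # start each mapper in the background
    done
    wait                                   # returns once all background jobs have exited
    echo "All mappers finished."
}

# Example call (paths taken from the script above):
#   run_all ./src/map.py '/data/Twitter dataset/'geoTwitter20*
```

When a script like this is started with nohup from a terminal, the echo output is appended to nohup.out rather than shown on the screen, so the final message would appear there.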
This is what I changed in the map.py file:
# open the zipfile
with zipfile.ZipFile(args.input_path) as archive:

    # loop over every file within the zip file
    for i,filename in enumerate(archive.namelist()):
        print(datetime.datetime.now(),args.input_path,filename)

        # open the inner file
        with archive.open(filename) as f:

            # loop over each line in the inner file
            for line in f:

                # load the tweet as a python dictionary
                tweet = json.loads(line)

                # convert text to lower case
                text = tweet['text'].lower()

                # search hashtags
                for hashtag in hashtags:
                    lang = tweet['lang']
                    if tweet['place']:
                        country = tweet['place']['country_code']
                        echo "Here1"
                    else:
                        country = None
                    if hashtag in text:
                        counter_lang[hashtag][lang] += 1
                        counter_country[hashtag][country] += 1
                        echo "Here2"
                    counter_lang['_all'][lang] += 1
                    country_country['_all'][country] += 1
The fact that you don't see the map.py processes running when you run ps indicates that there was probably an error (because these should be running for a long time). To diagnose the error, you'll have to figure out where these processes are outputting their results in order to read the error message.
This location will depend on exactly how you ran your commands. If you used the nohup command, it prints the name of the file where it is storing the output of the command you ran (nohup.out in your case). Check that file for error messages.
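A quick way to scan that file is sketched below. The show_errors name is hypothetical, and the search pattern is an assumption: Python error output usually contains a "Traceback" line and an exception name ending in "Error", but not every failure will match.

```shell
# Hypothetical helper: print the lines of a log that look like Python
# errors, prefixed with their line numbers.
# $1 is the log path; it defaults to nohup.out, the file nohup reported.
show_errors() {
    grep -n -E 'Traceback|Error' "${1:-nohup.out}"
}

# Example calls:
#   show_errors              # scan ./nohup.out
#   tail -n 20 nohup.out     # or just read the most recent output
```

If nothing is printed, reading the last few lines of the file directly is still worthwhile, since some problems do not produce a traceback.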
Thank you! I checked nohup.out and I see by its contents that the nohup ./run_maps.sh command is successfully processing. However, when I run ps -ef | grep sfullerton24 I still do not see the map.py status. Is the status in here and I am just overlooking it?
lambda-server:~/twitter_coronavirus (master *%=) $ nohup ./run_maps.sh
nohup: ignoring input and appending output to 'nohup.out'
lambda-server:~/twitter_coronavirus (master *%=) $ ps -ef | grep sfullerton24
sfuller+ 2621 45218 0 Feb05 ? 00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+ 2639 45218 0 Feb05 ? 00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip :: -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+ 2662 45155 0 Feb05 ? 00:00:44 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id e9cd8d7a885b853faff976fe2ffd997dde30c2de6f329acc907a8df3b23de982 -address /run/user/1360/docker/containerd/containerd.sock
root 20167 1767 0 12:55 ? 00:00:00 sshd: sfullerton24 [priv]
sfuller+ 20240 20167 0 12:56 ? 00:00:00 sshd: sfullerton24@pts/7
sfuller+ 45174 45155 0 Feb01 ? 00:00:00 rootlesskit --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 45191 45174 0 Feb01 ? 00:01:30 /proc/self/exe --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 66940 20241 0 14:10 pts/7 00:00:00 grep sfullerton24
sfuller+ 70143 45155 0 Feb01 ? 00:01:43 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id fd122a711a977f99833a1ea2ff96abecf82817fc9c8fa6a1c4c012a7dfcdc4e3 -address /run/user/1360/docker/containerd/containerd.sock
root 71321 1767 0 11:43 ? 00:00:00 sshd: sfullerton24 [priv]
sfuller+ 71383 71321 0 11:43 ? 00:00:00 sshd: sfullerton24@pts/35
If you have a particularly long username, the command above may not work for you as written. The output of ps -ef will truncate usernames that are too long, and so the grep command won't be able to find them. If you use the first 6 characters of your username in the grep command (instead of your full username), then the command should work.
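The workaround can be sketched as a small filter that derives the 6-character prefix automatically. The filter_by_user name is an assumption for illustration, and the sketch assumes the username contains no characters that are special in a grep pattern.

```shell
# Hypothetical helper: keep only the `ps -ef` lines (read from stdin) whose
# owner column starts with the first 6 characters of the given username.
filter_by_user() {
    prefix=$(printf '%s' "$1" | cut -c1-6)
    # "|| true" avoids a failing exit status when nothing matches.
    grep "^$prefix" || true
}

# Example call on the server:
#   ps -ef | filter_by_user "$(id -un)"
```

Anchoring the pattern at the start of the line means only the owner column is matched, so lines that merely mention the username in their command (such as the sshd entries above) are not included.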