Mapping Question (twitter-coronavirus homework)

Open sjanefullerton opened this issue 1 year ago • 3 comments

Hello! I am trying to run the mapper for programming task 1 on the twitter-coronavirus homework.

How can I tell when the mapping has finished? I don't want to start the other steps until the mapping step is done, so that I don't miss any data.

I ran the following and used the nohup command:

lambda-server:~/twitter_coronavirus (master *%=) $ chmod +x run_maps.sh
lambda-server:~/twitter_coronavirus (master *%=) $ nohup ./run_maps.sh
nohup: ignoring input and appending output to 'nohup.out'

sjanefullerton, Feb 07 '24 19:02

You should use a combination of:

  1. the ps -ef | grep username incantation to check what processes you have running, and
  2. manually inspecting the outputs of the map.py file to see if they are reasonable.
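
As a rough sketch of step 1, the check can be narrowed to the command name instead of the username. The helper name `filter_procs` is made up for illustration; `ps -ef` and `grep` are the standard tools named above.

```shell
#!/bin/sh
# filter_procs is a hypothetical helper: it reads `ps -ef`-style lines on
# stdin and keeps those that mention the given text, dropping the line for
# the grep process itself.
filter_procs() {
    grep -F -- "$1" | grep -v -F "grep"
}

# Show any running processes whose command line mentions map.py.
ps -ef | filter_procs "map.py" || echo "no map.py processes found"
```

If this prints nothing but the fallback message shortly after the script was started, the map.py processes have most likely exited early, and the outputs are worth inspecting before moving on.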

mikeizbicki, Feb 07 '24 19:02

Thank you! When I run ps -ef | grep sfullerton24, I do not see map.py running:

lambda-server:~/twitter_coronavirus/src (master *=) $ ps -ef | grep sfullerton24
sfuller+  2621 45218  0 Feb05 ?        00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+  2639 45218  0 Feb05 ?        00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip :: -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+  2662 45155  0 Feb05 ?        00:00:25 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id e9cd8d7a885b853faff976fe2ffd997dde30c2de6f329acc907a8df3b23de982 -address /run/user/1360/docker/containerd/containerd.sock
sfuller+ 13738 81139  0 11:47 pts/5    00:00:00 grep sfullerton24
sfuller+ 45174 45155  0 Feb01 ?        00:00:00 rootlesskit --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 45191 45174  0 Feb01 ?        00:01:28 /proc/self/exe --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 70143 45155  0 Feb01 ?        00:01:24 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id fd122a711a977f99833a1ea2ff96abecf82817fc9c8fa6a1c4c012a7dfcdc4e3 -address /run/user/1360/docker/containerd/containerd.sock
root     81076  1767  0 10:26 ?        00:00:00 sshd: sfullerton24 [priv]
sfuller+ 81138 81076  0 10:26 ?        00:00:00 sshd: sfullerton24@pts/5

I manually inspected my outputs and see that I am only getting .zip.lang files. Could you tell me whether there is an issue in my mapping process?

This is what I wrote for run_maps.sh:

#!/bin/sh
# loop over each file in the dataset and run map.py on it in the background
for file in '/data/Twitter dataset/'geoTwitter20*; do
    echo "Processing $file"
    ./src/map.py --input_path="$file" &
done

# block until every background map.py job has exited
wait
echo "Processing complete."

(When I run the script I don't see the echos.)

This is what I changed in the map.py file:

# open the zipfile
with zipfile.ZipFile(args.input_path) as archive:

    # loop over every file within the zip file
    for i,filename in enumerate(archive.namelist()):
        print(datetime.datetime.now(),args.input_path,filename)

        # open the inner file
        with archive.open(filename) as f:

            # loop over each line in the inner file
            for line in f:

                # load the tweet as a python dictionary
                tweet = json.loads(line)

                # convert text to lower case
                text = tweet['text'].lower()

                # search hashtags
                for hashtag in hashtags:
                    lang = tweet['lang']
                    if tweet['place']:
                        country = tweet['place']['country_code']
                        print("Here1")  # debug output; echo is shell syntax, not Python
                    else:
                        country = None
                    if hashtag in text:
                        counter_lang[hashtag][lang] += 1
                        counter_country[hashtag][country] += 1
                        print("Here2")  # debug output
                    counter_lang['_all'][lang] += 1
                    counter_country['_all'][country] += 1

sjanefullerton, Feb 07 '24 19:02

The fact that you don't see the map.py processes running when you run ps indicates that there was probably an error (because these should be running for a long time). To diagnose the error, you'll have to figure out where these processes are outputting their results in order to read the error message.

This location depends on exactly how you ran your commands. If you used nohup, it prints the name of the file where it stores the output of the command you ran (the "appending output to" message). Check that file for error messages.
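
For instance, once nohup has reported that it is appending output to 'nohup.out', that file can be searched for Python error text. This is a minimal sketch, assuming the file sits in the directory where nohup was started. `show_errors` is an illustrative name, and the search words are only a guess at what a failing Python script prints (tracebacks and exception names ending in "Error").

```shell
#!/bin/sh
# show_errors is a hypothetical helper: print, with line numbers, the lines
# of a log file that look like Python error output.
show_errors() {
    grep -n -E 'Traceback|Error' -- "$1"
}

log="nohup.out"
if [ -f "$log" ]; then
    show_errors "$log" || echo "no error lines found in $log"
    # The last few lines usually show the most recent message.
    tail -n 5 "$log"
else
    echo "$log not found in $(pwd)"
fi
```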

mikeizbicki, Feb 08 '24 03:02

Thank you! I checked nohup.out and I see by its contents that the nohup ./run_maps.sh command is successfully processing. However, when I run ps -ef | grep sfullerton24 I still do not see the map.py status. Is the status in here and I am just overlooking it?

lambda-server:~/twitter_coronavirus (master *%=) $ nohup ./run_maps.sh
nohup: ignoring input and appending output to 'nohup.out'

lambda-server:~/twitter_coronavirus (master *%=) $ ps -ef | grep sfullerton24
sfuller+  2621 45218  0 Feb05 ?        00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+  2639 45218  0 Feb05 ?        00:00:00 /home/sfullerton24/bin/rootlesskit-docker-proxy -proto tcp -host-ip :: -host-port 4444 -container-ip 172.17.0.3 -container-port 5000
sfuller+  2662 45155  0 Feb05 ?        00:00:44 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id e9cd8d7a885b853faff976fe2ffd997dde30c2de6f329acc907a8df3b23de982 -address /run/user/1360/docker/containerd/containerd.sock
root     20167  1767  0 12:55 ?        00:00:00 sshd: sfullerton24 [priv]
sfuller+ 20240 20167  0 12:56 ?        00:00:00 sshd: sfullerton24@pts/7
sfuller+ 45174 45155  0 Feb01 ?        00:00:00 rootlesskit --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 45191 45174  0 Feb01 ?        00:01:30 /proc/self/exe --state-dir=/run/user/1360/dockerd-rootless --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /home/sfullerton24/bin/dockerd-rootless.sh
sfuller+ 66940 20241  0 14:10 pts/7    00:00:00 grep sfullerton24
sfuller+ 70143 45155  0 Feb01 ?        00:01:43 /home/sfullerton24/bin/containerd-shim-runc-v2 -namespace moby -id fd122a711a977f99833a1ea2ff96abecf82817fc9c8fa6a1c4c012a7dfcdc4e3 -address /run/user/1360/docker/containerd/containerd.sock
root     71321  1767  0 11:43 ?        00:00:00 sshd: sfullerton24 [priv]
sfuller+ 71383 71321  0 11:43 ?        00:00:00 sshd: sfullerton24@pts/35

sjanefullerton, Feb 08 '24 22:02

The lab-processes.md file that introduced this technique has an explanation for the behavior you're observing and a solution:

If you have a particularly long username, the command above may not work for you as written. The output of ps -ef will truncate usernames that are too long, and so the grep command won't be able to find them. If you use the first 6 characters of your username in the grep command (instead of your full username), then the command should work.
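
To make the truncation concrete, here is a small sketch that runs grep over two sample lines in ps -ef format. The first line is copied from the output earlier in this thread, where the username column reads `sfuller+`; the second is an invented example of what a map.py process line might look like, since none appears in the thread.

```shell
#!/bin/sh
user="sfullerton24"
# Keep only the first 6 characters of the username, as the lab note suggests.
prefix=$(printf '%s' "$user" | cut -c1-6)
echo "$prefix"    # prints sfulle

# Sample `ps -ef` lines: the sshd line is from this thread, the map.py line
# is hypothetical.
sample='sfuller+ 81138 81076  0 10:26 ?        00:00:00 sshd: sfullerton24@pts/5
sfuller+ 90001 81139  0 11:47 pts/5    00:00:03 python3 ./src/map.py --input_path=/data/example.zip'

# The full username matches only where it happens to appear in the command
# column, so the map.py line is missed.
printf '%s\n' "$sample" | grep -c "$user"      # prints 1
# The 6-character prefix also matches the truncated username column.
printf '%s\n' "$sample" | grep -c "$prefix"    # prints 2
```

This also suggests why the earlier grep still returned some lines: each of them contains the full username somewhere in its command, such as a path under /home/sfullerton24.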

mikeizbicki, Feb 08 '24 22:02