
data over 25G for normalized_batch

Open ralee25 opened this issue 7 months ago • 3 comments

Hello!

I am running into an issue where my normalized_batch data is over 25 GB: it is currently at 47 GB. I was wondering why that is and how to fix it so that it comes in under 25 GB.

$ nohup ./load_tweets_parallel.sh > load_parallel_normalized.out 2>&1 &

I run this to load the data, and it has been running for quite some time now. In the load_tweets_parallel.sh file, I have the denormalized part commented out so that it looks like this:

#time echo "$files" | parallel ./load_denormalized.sh

PYTHONUNBUFFERED=1 time ./load_tweets_batch.py --db "postgresql://postgres:pass@localhost:1334" --inputs $files
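In case it is useful for debugging, a query along these lines should show which tables the space is going to (just a sketch, reusing the connection URL from the script above; it assumes psql can reach the database from wherever you run it):

$ psql postgresql://postgres:pass@localhost:1334 -c "
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC;
"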

This is what the output file looks like for the first command:

load_parallel_normalized.out

nohup: ignoring input
================================================================================
load pg_denormalized
================================================================================
================================================================================
load pg_normalized_batch
================================================================================
NOTICE:  identifier "load_tweets.py --inputs /data/tweets/geoTwitter21-01-01.zip /data/tweets/geoTwitter21-01-02.zip /data/tweets/geoTwitter21-01-03.zip /data/tweets/geoTwitter21-01-04.zip /data/tweets/geoTwitter21-01-05.zip /data/tweets/geoTwitter21-01-06.zip /data/tweets/geoTwitter21-01-07.zip /data/tweets/geoTwitter21-01-08.zip /data/tweets/geoTwitter21-01-09.zip /data/tweets/geoTwitter21-01-10.zip" will be truncated to "load_tweets.py --inputs /data/tweets/geoTwitter21-01-01.zip /da"
2025-05-05 19:34:58.699358 /data/tweets/geoTwitter21-01-10.zip
2025-05-05 19:35:13.965960 insert_tweets i= 0
2025-05-05 19:35:23.903827 insert_tweets i= 1
2025-05-05 19:35:24.445888 insert_tweets i= 2
2025-05-05 19:35:27.285666 insert_tweets i= 3
2025-05-05 19:35:27.820863 insert_tweets i= 4
2025-05-05 19:35:28.438077 insert_tweets i= 5
2025-05-05 19:35:28.997831 insert_tweets i= 6
2025-05-05 19:35:29.545392 insert_tweets i= 7
2025-05-05 19:35:29.996753 insert_tweets i= 8
2025-05-05 19:35:30.449163 insert_tweets i= 9
2025-05-05 19:35:33.142387 insert_tweets i= 10
2025-05-05 19:35:33.705111 insert_tweets i= 11
2025-05-05 19:35:34.246913 insert_tweets i= 12
2025-05-05 19:35:34.781617 insert_tweets i= 13
2025-05-05 19:35:35.313768 insert_tweets i= 14
2025-05-05 19:35:35.847606 insert_tweets i= 15
2025-05-05 19:35:36.417394 insert_tweets i= 16
2025-05-05 19:35:36.983321 insert_tweets i= 17
2025-05-05 19:35:39.837262 insert_tweets i= 18
2025-05-05 19:35:40.320065 insert_tweets i= 19

Any help would be greatly appreciated!

ralee25 commented on May 06 '25 02:05

Most likely you have inserted the data twice and so have duplicate copies of each tweet. Four of the five test cases have a DISTINCT clause inside them, so these duplicate entries will not hurt you and you should still be able to pass them. Test case 4 does not, so you will not be able to pass that one.
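One quick way to confirm the duplicates is to compare the total row count to the distinct count (a sketch; the table and column names here are guesses and may differ from your schema):

$ psql postgresql://postgres:pass@localhost:1334 -c "
    SELECT count(*) AS total_rows,
           count(DISTINCT id_tweets) AS distinct_tweets
    FROM tweets;
"

If total_rows is roughly twice distinct_tweets, the data was loaded twice.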

I will waive the requirement that all of your test cases pass for the normalized batch service. You can move on to focusing on just the runtimes.

mikeizbicki commented on May 06 '25 02:05

My tests are not passing for normalized_batch on all of the test cases. I deleted the services/pg_normalized_batch/schema-indexes.sql file and they are still not passing. Is there a way to find out why they aren't passing?

ralee25 commented on May 06 '25 03:05

The test cases follow the standard postgres structure we used in all of the assignments: the correct outputs are in the expected folder, the results of your commands are stored in the results folder, and you can use this information to debug.
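For example, a loop along these lines will show which tests differ (a sketch; the exact file names under expected/ and results/ depend on how your tests are named):

$ for f in expected/*; do
      diff "$f" "results/$(basename "$f")" > /dev/null || echo "FAILED: $(basename "$f")"
  done

Running diff directly on a failing pair will then show exactly which rows are wrong.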

mikeizbicki commented on May 06 '25 06:05