Request for Scored Output Files from Algorithm Execution
Is your feature request related to a problem? Please describe.
I need aggregated results so I can analyze helpfulness at the note level. As far as I can tell, the only way to get them is to run the algorithm from scratch, so I'm reproducing results using the downloaded data (notes, ratings, note status history, and user enrollment) on a 64-core Intel(R) Xeon(R) Gold 6448H CPU with 500GB of memory (correct me if I am wrong). However, after 20 hours the pre-scoring phase still hasn't completed. It looks like it won't finish within one day, which blocks my further analysis.
Since the algorithm runs every hour or so on the server, may I know:
- would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)?
- and the hardware requirements and expected running time if I want to generate aggregated scores for notes myself?
This would greatly help with research analysis, as running the algorithm locally to aggregate helpfulness scores has been quite challenging.
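A minimal sketch of the note-level aggregation in mind, assuming local copies of the requested files; the paths are placeholders and the exact column names may vary between releases, so they are inspected rather than hard-coded:

```python
import pandas as pd

# Hypothetical local paths; adjust to wherever the scorer's outputs live.
scored_notes = pd.read_csv("scored_notes.tsv", sep="\t")
helpfulness = pd.read_csv("helpfulness_scores.tsv", sep="\t")

# Inspect the available score columns before relying on any of them.
print(scored_notes.columns.tolist())
print(helpfulness.columns.tolist())

# Example: distribution of note statuses, assuming a status-like column exists
# (the exact column name may differ between releases).
status_cols = [c for c in scored_notes.columns if "Status" in c]
if status_cols:
    print(scored_notes[status_cols[0]].value_counts())
```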
Describe the solution you'd like
Would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)? They don't need to be the latest versions; files aligned with the current download page would be fine.
Describe alternatives you've considered
It would be helpful if you could share the hardware requirements and expected running time for generating aggregated scores for notes from scratch, or any intermediate outputs of the process.
Additional context
Thank you so much for your contribution to this amazing project! I am a PhD student working on fact-checking in Natural Language Processing, and I am very happy to explore and contribute more. I am actively working on this, and any help with the above questions would be much appreciated!
Hi - it's not surprising that the job might take that long when run sequentially. Since you seem not to be resource-bound, you could try running with the parallel flag set to True. Let us know if that helps!
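A minimal sketch of launching such a run with multiprocessing enabled; --parallel is the flag confirmed later in this thread, while the working directory and any other flags (e.g. for data paths or the output directory) are assumptions that should be checked against the runner's argument parser:

```python
import subprocess

# Invoke the scorer from a local checkout with the parallel flag enabled.
# Additional arguments may be required depending on where the input TSVs live.
subprocess.run(
    ["python", "main.py", "--parallel"],
    cwd="communitynotes/sourcecode",  # path to a local checkout (assumption)
    check=True,
)
```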
Hi @ashilgard, many thanks for your reply, I will try that! Actually, I do have resource limitations: normally we don't have that much CPU and memory (64GB at most), and I queued for a very long time to run the algorithm. It would be great if it were possible to share the results and descriptions of the output formats.
I also would appreciate more insights about hardware requirements and expected running time.
I've been trying to run it for weeks, but have failed so far.
My latest attempt was on an AWS r5.metal instance, which is a 3rd-gen Intel Xeon with 768GB of RAM, but after running for 9 hours the process died with no forensic information.
This is my attempt output: https://gist.github.com/tuler/02aa42c423e5a627a0ea5fa5b9381f7b
I used --parallel
Could anyone external who has successfully run the algorithm code share their machine, runtime, and how many threads/processes they used, if different from the default? E.g. @avalanchesiqi, I think you may have?
To give a rough estimate, it will likely take in the ballpark of 12 hours if run with default multiprocessing settings (of course, highly dependent on the exact processor).
768GB of RAM is more than we need internally. Could you share any charts of resource usage @tuler? E.g. RAM and CPU usage over time?
I don't have a chart, but I saw memory increasing during pre-processing until it reached around 180GB of usage. And that is single-core, 100% CPU the whole time, for about 5 hours. Then the models start to run in parallel, I see about 8 cores working, and memory drops a lot, to around 30GB. Does it make sense?
Yeah that seems fine. Not sure why it stopped.
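For future runs, a simple way to capture the RAM/CPU-over-time chart asked about above is to sample psutil from a sidecar script while the scorer runs in another terminal. A minimal sketch; psutil is an extra dependency, the sampling interval is arbitrary, and the output file name is a placeholder:

```python
import time
import psutil

# Append one sample per minute of system-wide CPU and RAM usage to a TSV,
# which can later be plotted to show resource usage over the run.
with open("resource_usage.tsv", "w") as f:
    f.write("unix_time\tcpu_percent\tram_used_gb\n")
    while True:
        cpu = psutil.cpu_percent(interval=None)  # percent since the last call
        ram_gb = psutil.virtual_memory().used / 1e9
        f.write(f"{int(time.time())}\t{cpu}\t{ram_gb:.1f}\n")
        f.flush()
        time.sleep(60)
```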
@tuler I checked your log. I think your program had actually finished, but your output folder didn't exist; that is why it stopped. I found this in your log file, located near the end (about one scroll up):
Traceback (most recent call last):
File "/home/ubuntu/communitynotes/sourcecode/main.py", line 31, in <module>
main()
File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 268, in main
return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
File "/home/ubuntu/communitynotes/sourcecode/scoring/pandas_utils.py", line 678, in _inner
retVal = main(*args, **kwargs)
File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 245, in _run_scorer
write_tsv_local(scoredNotes, os.path.join(args.outdir, "scored_notes.tsv"))
File "/home/ubuntu/communitynotes/sourcecode/scoring/process_data.py", line 543, in write_tsv_local
assert df.to_csv(path, index=False, header=headers, sep="\t") is None
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/core/generic.py", line 3967, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1014, in to_csv
csv_formatter.save()
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 251, in save
with get_handle(
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 749, in get_handle
check_parent_directory(str(handle))
File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 616, in check_parent_directory
raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '../output'
Thanks for checking it out @avalanchesiqi. It's strange, because other times I tried to run it with a data subset, the directory got created, even with intermediate results in it. I'll try to run it again. Thanks
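One way to guard against this particular failure is to make sure the output directory exists before the run starts. A minimal sketch; "../output" is just the path seen in the traceback above, so adjust it if you point the runner at a different output directory:

```python
import os

# Create the scorer's output directory up front so a multi-hour run doesn't
# fail at the final write step with "Cannot save file into a non-existent
# directory".
os.makedirs("../output", exist_ok=True)
```

Alternatively, write_tsv_local itself could call os.makedirs(os.path.dirname(path), exist_ok=True) before writing, which would cover any output path.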
We did get it to complete on the umich HPC cluster, using 170GB max memory. I think the max it is set up to offer is 184GB, so we are OK until the memory requirement grows above that.
I think @avalanchesiqi traced the memory issue to some relatively recently added computation involving all pairs of users or something like that. Can you point to exactly where you traced the issue, Siqi?
Providing outputs from your internal runs would certainly be useful to those of us on the outside, though I understand that this may have been a deliberate design decision: make the code and data available, but with a little friction, so that only people who are serious about it reproduce the results.
(BTW: it would also be helpful if your scoring runs recorded the inferred global parameter μ in an output file, rather than just in the logs. We may submit a PR for that.)
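As a rough illustration of the kind of change meant here, the runner could dump inferred global parameters next to the other TSVs. Everything below (the function name, dictionary key, file name, and value) is hypothetical, not the project's actual API:

```python
import pandas as pd

def write_global_params(params: dict, path: str = "global_params.tsv") -> None:
  # Hypothetical helper: persist inferred global parameters (e.g. the global
  # intercept) as a one-row TSV instead of leaving them only in the logs.
  pd.DataFrame([params]).to_csv(path, sep="\t", index=False)

# Hypothetical usage with a made-up value:
# write_global_params({"globalIntercept": 0.03})
```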
Now I have successfully run it. It took 10:30h on an r5.metal AWS instance.
@tuler is a GPU necessary?
Edit: Okay, I suppose the thumbs-down means it's not necessary. Thanks!