
Request for Scored Output Files from Algorithm Execution

Open ruixing76 opened this issue 11 months ago • 11 comments

Is your feature request related to a problem? Please describe.

I need the aggregated results so I can analyze helpfulness at the note level. As far as I can tell, the only way to get them is to run the algorithm from scratch, so I am reproducing the results using the downloaded data (notes, ratings, note status history, and user enrollment) on a 64-core Intel(R) Xeon(R) Gold 6448H CPU with 500GB of memory (correct me if I am wrong). However, after 20 hours the pre-scoring phase still hasn't completed. It looks like it won't finish within a day, which blocks my further analysis.

Since the algorithm runs every hour or so on the server, may I know:

  1. would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)?
  2. and the hardware requirement and expected running time if I want to generate aggregated scores for notes myself?

This would greatly help with research analysis, as running the algorithm locally to aggregate helpfulness scores has been quite challenging.

Describe the solution you'd like: Would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)? They don't need to be the latest versions; files aligned with the current download page would be fine.

Describe alternatives you've considered: It would help to know the hardware requirements and expected running time for generating aggregated note scores from scratch, or any intermediate outputs.

Additional context: Thank you so much for your work on this amazing project! I am a PhD student working on fact-checking in Natural Language Processing, and I am very happy to explore and contribute more. I am actively working on this, and any help with the above questions would be much appreciated!

ruixing76 avatar Dec 08 '24 19:12 ruixing76

hi - it's not surprising that the job might take that long when run sequentially. since you seem not to be resource-bound, you could try running with the parallel flag set to True. Let us know if that helps!

ashilgard avatar Dec 11 '24 22:12 ashilgard

> hi - it's not surprising that the job might take that long when run sequentially. since you seem not to be resource-bound, you could try running with the parallel flag set to True. Let us know if that helps!

Hi @ashilgard, many thanks for your reply, I will try that! Actually, I do have resource limitations; normally we don't have that much CPU or memory (64GB at most), and I queued for a very long time to run the algorithm. It would be great if it were possible to share the results and descriptions of the output formats.

ruixing76 avatar Dec 12 '24 07:12 ruixing76

I also would appreciate more insights about hardware requirements and expected running time. I've been trying to run it for weeks, but failed. My latest attempt used an AWS r5.metal instance, a 3rd-gen Intel Xeon with 768GB of RAM, but after running for 9 hours the process died with no forensic information. This is the output of my attempt: https://gist.github.com/tuler/02aa42c423e5a627a0ea5fa5b9381f7b I used --parallel

tuler avatar Jan 09 '25 17:01 tuler

Could anyone external who has successfully run the algorithm code share their machine, runtime, and how many threads/processes they used if different from the default? E.g. @avalanchesiqi I think you may have?

To give a ballpark: it will likely take around 12 hours if run with the default multiprocessing settings (of course, highly dependent on the exact processor).

768GB of RAM is more than we need internally. Could you share any charts of resource usage @tuler? E.g. RAM and CPU usage over time?

jbaxter avatar Jan 09 '25 21:01 jbaxter

> Could you share any charts of resource usage @tuler? E.g. RAM and CPU usage over time?

I don't have a chart, but I saw memory increasing during pre-processing until it reached around 180GB. And that was single-core, 100% CPU the whole time, for about 5 hours. Then the models start to run in parallel, I see about 8 cores working, and memory drops a lot, to around 30GB. Does that make sense?
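(For future runs, a rough way to capture the RAM-over-time numbers requested above, using only the standard library. This sampler is purely illustrative and not part of the scorer; it records the calling process's own peak RSS, so to profile the scorer you would run a loop like this in a wrapper process, or read /proc/&lt;pid&gt;/status for the scorer's PID instead.)

```python
import resource
import time

def sample_peak_rss(duration_s: float, interval_s: float = 1.0):
    """Sample this process's peak resident set size over time.

    On Linux, ru_maxrss is reported in kilobytes. Returns a list of
    (timestamp, peak_kb) pairs that can be plotted as a memory chart.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while True:
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        samples.append((time.monotonic(), peak_kb))
        if time.monotonic() >= deadline:
            break
        time.sleep(interval_s)
    return samples
```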

tuler avatar Jan 09 '25 21:01 tuler

Yeah that seems fine. Not sure why it stopped.

jbaxter avatar Jan 09 '25 21:01 jbaxter

@tuler I checked your log. I think your program had actually finished scoring; however, your output folder didn't exist, which is why it stopped. I found this in your log file, near the end (about one scroll up):

Traceback (most recent call last):
  File "/home/ubuntu/communitynotes/sourcecode/main.py", line 31, in <module>
    main()
  File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 268, in main
    return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
  File "/home/ubuntu/communitynotes/sourcecode/scoring/pandas_utils.py", line 678, in _inner
    retVal = main(*args, **kwargs)
  File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 245, in _run_scorer
    write_tsv_local(scoredNotes, os.path.join(args.outdir, "scored_notes.tsv"))
  File "/home/ubuntu/communitynotes/sourcecode/scoring/process_data.py", line 543, in write_tsv_local
    assert df.to_csv(path, index=False, header=headers, sep="\t") is None
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/core/generic.py", line 3967, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1014, in to_csv
    csv_formatter.save()
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 251, in save
    with get_handle(
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 749, in get_handle
    check_parent_directory(str(handle))
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 616, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '../output'
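(To illustrate the fix: pandas' to_csv refuses to write into a directory that doesn't exist, which is exactly the OSError above. Creating the output directory up front avoids it. The helper below is a stdlib-only sketch, and the column names are made up, not the scorer's real schema.)

```python
import csv
import os

def write_scored_notes(outdir: str, rows) -> str:
    """Write rows to scored_notes.tsv, creating outdir first.

    os.makedirs is the step missing in the failed run: without it,
    writing to '../output' raises "Cannot save file into a
    non-existent directory".
    """
    os.makedirs(outdir, exist_ok=True)  # create the output directory up front
    path = os.path.join(outdir, "scored_notes.tsv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["noteId", "score"])  # illustrative columns only
        writer.writerows(rows)
    return path
```

Equivalently, a `mkdir -p ../output` before launching the scorer would have let the run complete.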

avalanchesiqi avatar Jan 15 '25 17:01 avalanchesiqi

Thanks for checking it out @avalanchesiqi. It's strange, because other times I tried to run it, with a data subset, the directory got created, even with intermediate results in it. I'll try to run it again. Thanks

tuler avatar Jan 15 '25 17:01 tuler

We did get it to complete on the umich HPC cluster, using 170GB max memory. I think the max it is set up to offer is 184GB, so we are OK until the memory requirement grows above that.

I think @avalanchesiqi traced the memory issue to some relatively recently added computation involving all pairs of users or something like that. Can you point to exactly where you traced the issue, Siqi?

Providing outputs from your internal runs would certainly be useful to us folks on the outside, though I understand this may have been a deliberate design decision: making the code and data available, but with a little friction, so that only people who are serious would reproduce the results.

(BTW: it would also be helpful if your scoring runs recorded the inferred global parameter μ in an output file, rather than just in the logs. We may submit a PR for that.)

presnick avatar Jan 15 '25 17:01 presnick

Now I successfully ran it. It took 10.5 hours to run on an r5.metal AWS instance.

tuler avatar Jan 16 '25 14:01 tuler

@tuler is GPU necessary?

Edit: Okay, I suppose the thumbs-down means it's not necessary. Thanks!

Jacobsonradical avatar Jan 16 '25 21:01 Jacobsonradical