Improve processing for the automatic LLM-based curinator.md to issue conversion
Changes made to the rweekly `scripts` folder for curation team use.
This pull request updates the scripts/process_curinator.R script to enhance its data processing pipeline for extracting and summarizing R-related markdown links. The main improvements include reorganizing the order of library imports, expanding the set of libraries used, and significantly refactoring the data wrangling logic to produce grouped, formatted summaries.
Dependency management:
- Moved the import of `ellmer` after `dplyr` and added new dependencies, `tidyr` and `glue`, to support improved data manipulation and string formatting.
Data processing and summarization:
- Refactored the processing pipeline (see the sketch below) to:
  - Store the initial results in `result_raw` with a new `json_metadata` column.
  - Parse and expand the metadata, filter for R-related entries, format markdown links, group by type, and generate combined summaries for each group using `glue`.
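In outline, the new wrangling step looks roughly like this (a simplified sketch rather than the exact script; the `title` and `url` columns stand in for whatever the extraction step actually produces):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(glue)
library(jsonlite)

# Sketch of the refactored pipeline: result_raw is assumed to carry one row
# per post, with the model's JSON reply stored in `json_metadata`. The
# grouping column follows the `category` field in that JSON.
result_wrangled <- result_raw |>
  mutate(metadata = map(json_metadata, fromJSON)) |>
  unnest_wider(metadata) |>
  filter(is_r_related == "yes") |>
  mutate(link_md = glue("[{title}]({url})")) |>
  group_by(category) |>
  summarise(summary = glue_collapse(link_md, sep = "\n"), .groups = "drop")
```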
Limitations
- Note that some URLs fail to read, which might require manual review, but this approach can still help us automatically check whether an RSS post is R-related and place it in the right category.
- I have limited the markdown converted from website content to 1,000 characters to respect model token limits.
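For reference, the truncation amounts to something like this (a sketch; `stringr::str_trunc()` is one way to do it, though the script may use a different approach):

```r
library(stringr)

# Cap the markdown converted from each page at 1,000 characters to stay
# within model token limits; str_trunc() appends "..." to truncated text.
content_md <- str_trunc(content_md, 1000)
```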
Running this as a test after manually running the curinator today...
- There's a post that errors (with an unusually long URL); I like that it's included in `manual_review_links`, but (if we proceed with this) a final 'report' which prints out `result_wrangled`, `manual_review_links`, and also the `filter(result_temp, is_r_related != "yes")` results for inspection would be helpful.
- I get an error here

  ```r
  result_temp <- result_raw |>
    mutate(metadata = map(json_metadata, ~ fromJSON(.x))) |>
    unnest_wider(metadata)
  # Error in `mutate()`:
  # ℹ In argument: `metadata = map(json_metadata, ~fromJSON(.x))`.
  # Caused by error in `map()`:
  # ℹ In index: 19.
  # Caused by error:
  # ! lexical error: invalid char in json text.
  #   ```json { "is_r_related": "ye
  #   (right here) ------^
  # Run `rlang::last_trace()` to see where the error occurred.
  ```
  because the output (from this post) looks like this:

  ```r
  result_raw[19, ]$json_metadata
  # [1] "```json\n{\n \"is_r_related\": \"yes\",\n \"category\": \"R in Organization\"\n}\n```"
  ```

  It's not clear to me why the code fence is there, but it should probably be stripped prior to attempting `fromJSON()`. This seems to do the trick in this case:

  ```r
  map(json_metadata, ~ fromJSON(gsub("```(json\n)*", "", .x)))
  ```
- In this instance all of the R-related posts were assigned to 'Insights', including 'RcppSimdJson 0.1.14 on CRAN: New Upstream Major'. For me to get behind this (since it cost US$0.34 to perform) I'd like to see a clear demonstration that it does better than assigning everything to the default category. I suspect we could achieve the R-related inference with some keyword matching (a rough sketch below), possibly even by building our own classification model based on the existing issues.
It also misclassified this post as not R-related; I can see why it had trouble with it, but misclassifications add work rather than reduce it.
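To make the keyword idea concrete, here's a minimal zero-cost first pass (my own sketch, not part of the PR; the keyword list and the `title`/`content_md` columns are assumptions):

```r
library(dplyr)
library(stringr)

# Hypothetical first-pass filter: flag a post as R-related if its title or
# (truncated) content mentions common R ecosystem terms. The keyword list
# here is illustrative and would need tuning against past issues.
r_keywords <- c("\\bCRAN\\b", "\\bRStudio\\b", "\\btidyverse\\b",
                "\\bggplot2?\\b", "\\bShiny\\b", "\\bRcpp\\b",
                "\\bR package\\b", "\\{[a-zA-Z0-9.]+\\}")
r_pattern <- str_c(r_keywords, collapse = "|")

posts |>
  mutate(is_r_related_kw = str_detect(
    str_c(title, " ", content_md),
    regex(r_pattern, ignore_case = TRUE)
  ))
```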
That's interesting. I recently switched over to a Claude Sonnet model for our use case. I'll experiment with different models and with changing our system prompt; maybe I need to do a better job teaching it how to decide the post category. Right now I'm just using our wiki page material, which can be confusing. We might even build an eval dataset based on one or more previous issues to measure accuracy. Also, the 1,000 character limit may sometimes not be enough to determine whether a post is R-related.
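For the eval idea, something along these lines could work (a sketch only; the labels file and its `url`/`category_truth` columns are assumptions, with the truth categories taken from previously published issues):

```r
library(dplyr)
library(readr)

# Hypothetical evaluation: compare the LLM-assigned categories in
# result_temp against the categories those links actually received in a
# published issue. File name and column names are illustrative.
truth <- read_csv("issue_labels.csv")  # columns: url, category_truth

result_temp |>
  inner_join(truth, by = "url") |>
  summarise(accuracy = mean(category == category_truth))
```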
I just performed a run to kick off my issue curation for 2025-W40, and my experience was quite similar to @jonocarroll's. My observations:
- I also had the case of a post with a long URL (from the same site) being funneled to manual review.
- Same issue with the code-fenced block appearing in one of the raw post contents, and implementing ````map(json_metadata, ~ fromJSON(gsub("```(json\n)*", "", .x)))```` did the trick.
- In my case, all of the RSS posts were classified as Tutorial. I don't have a huge batch of links (17 RSS posts), but a handful of them definitely belong in other categories.
- At first I was perplexed why `{reticulate}` was involved until I read up on the documentation of `{ragnar}`. Once I figured out how to add Python to my project's Nix configuration via `{rix}`, I was able to resolve the errors. But that's just a caution to other curators who try to use this in the future.
While I am very supportive of finding clever ways to reduce the manual effort of curating an issue, it's pretty clear that more testing is needed to find the optimal combination of prompt and LLM to make this practical for the rest of the team. But this is an excellent starting point.
Sorry for the delay. I finally had some free time today to revisit this. I improved the system prompt, addressed the JSON code fence issue, and switched over to a Quarto report format. You'll see I'm now using a decision-tree sort of format that Claude recommended to me for the system prompt. I increased the character limit to 1,500 too. The results seem much better now: process_curinator_report.html
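For anyone curious, the decision-tree format is along these lines (an abridged, paraphrased illustration, not the actual prompt in the script; the category names are examples):

```r
# Abridged illustration of a decision-tree style system prompt; the real
# prompt in process_curinator.R is longer and more specific.
system_prompt <- paste(
  "You classify blog posts for the R Weekly newsletter.",
  "1. Does the post involve R code, R packages, or the R community?",
  '   - No  -> reply {"is_r_related": "no"}.',
  "   - Yes -> go to 2.",
  "2. Does it announce a package release or new feature?",
  '   - Yes -> category = "Updated Packages" or "New Packages".',
  "   - No  -> go to 3.",
  "3. Does it teach a technique step by step?",
  '   - Yes -> category = "Tutorials"; otherwise "Insights".',
  "Reply with raw JSON only, no code fences.",
  sep = "\n"
)
```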
@rpodcast, I'd love to know how the report compares to your grouping for this upcoming issue. We'll probably need to continue tuning the system prompt to meet our needs. It's getting better though.