polis icon indicating copy to clipboard operation
polis copied to clipboard

Conversation serialization ambiguity

Open jucor opened this issue 3 months ago • 3 comments

The Conversation class in Delphi has two export methods: get_full_data() , and to_dict() which exports more content but also renames the fields to match the Clojure old export (e.g. attribute conversation_id is renamed to zid) ? The latter renames to zid, but its unit test test_conversation.py still checks for conversation_id and thus fails.

Note: get_full_data() does not have tests. But is used in the code and the notebooks. In particular, get_full_data() is user-facing via the route /api/v3/conversations/{conversation_id}

On the other hand to_dict() (the one with the Clojure-compatible renaming) is used in all the tests, and by the ConversationManager serialization tool, and its internal saving tool (when it stores a json to the data_dir -- and I'm unsure where those files are then used).

Thickening the plot, while to_dict() renames conversation_id to zid, from_dict() expects a conversation_id, not a zid .... So I'm not sure if from_dict() is supposed to be callable with the output of get_full_data(), as its use of conversation_id seems to indicate, or that of to_dict(), as its function name seems to indict.

It would be clearer to:

  • Have only one export method, if possible -- I guess it depends on the downstream uses?
  • Or if not possible, at least factorize the duplicate code, and fix the tests for to_dict() and from_dict() and add similar tests for get_full_data() and for the serialization

As a temporary (😅 ) fix, I can try to modify from_dict() so it accepts both conversation_id and zid, as long as only one is present, or as long as both agree if both are present. If neither is present or if they disagree, default to the current default behaviour of empty conversation name 🤷 (ideally I'd raise an error if both are present and disagree, but I'm not sure how robust are the callers and wouldn't want to break the server...).

I'll do the same for any other similar fields which store the exact same content but go by two distinct names (i.e. new name vs renamed to old Clojure's name).

jucor avatar Nov 10 '25 09:11 jucor

The plot thickens even more: Except for the unit tests, the only usage of to_dict() that I can find is in ConversationManager._save_conversation(). That is a no-op unless a data directory was specified. Only unit tests seem to provide one.

Production code, instead uses a completely different serialization mechanism to dictionary, with to_dynamo_dict() (which itself uses zid, unlike get_full_data() which is the user-facing export)

Even in the tests, the data_dir that enables _save_conversation() is only used for testing its own behaviour 😆

There's just one test where to_dict() is possibly genuinely used, it's at the end of test_real_data.py's test_biodiversity_conversation() (which just checks that things run), which stores the data at the end -- but I do not see any code that uses those files, so they may have only been used for manual inspection during the development of delphi.

So, to summarize, we seem to have three Conversation serialization functions: 1/ get_full_data(), not tested, user-facing at /api/v3/conversations/ route 2/ to_dynamo_dict(), not tested that I can see, used for actual persistence in prod. 3/ to_dict(), includes more things than get_full_data(), such as counts data, tid, votes_base, group_clusters, and renames fields for clojure compatibiltiy, also has comments that it optimizes for faster processing than to_dynamodb_dict() but, neither me nor Claude Code seem to see it meaningfully used anywhere... 10:22

As to de-serialization: 1/ from_dict(), which from the name and the tests seems to be the inverse of to_dict(), but is unaware of the Clojure renames done by to_dict() and thus fails the tests 2/ dynamodb.read_math_by_tick() loads the output of Conversation.to_dynamo_db_dict(() from the Dynamo DB, but then modifies it to make it with the current from_dict()'s format --> so it's a fetcher+converter+call to from_dict(). Edited8m

So we have 3 written formats, and 2 read formats, one of which is converted to the other upon reading.

At first I was thinking the to_dict() format was here so we could converse with clojure, but since the reader from_dict() doesn't quite match it, I guess not ?

## Recommendation I think it would be terrific to have only one single format for serialization, if we can. If not possible for performance reason, then at most two: user-facing, and internal, with a lot of factorized code.

Three questions:

  • Q1: Do we feel ready to modify any typescript code that reads convos from DynamoDB, to match the get_full_data() schema?
  • Q2: Or, do we feel OK to modify the user-facing format produced by /api/v3/conversations/ ?

If (Q1 or Q2):, then refactor to the corresponding format. Else keep both but refactor common code.

  • Q3: Do we confirm that to_dict()'s format is legacy? If so, replace that function by a call to get_full_data() ?

jucor avatar Nov 10 '25 13:11 jucor

Claude Code's helpful summary:

Feature get_full_data() to_dynamo_dict()
Key format conversation_id zid
Moderation Nested dict Flat lists
Computed aggregates ❌ No ✅ Yes (votes-base, group-votes, consensus)
Float handling Lists Decimals (DynamoDB-safe)
ID conversion Keep as strings Convert to int
Projections ✅ Included ❌ Separate
Clojure compatibility No Yes
Round-trip with from_dict() ✅ Yes ❌ No (needs read_math_by_tick())

Interestingly, there's a third method zto_dict() that appears to be a hybrid:

  • Starts like get_full_data()
  • Then adds Clojure-specific fields (like to_dynamo_dict())
  • Uses hyphenated keys (user-vote-counts, votes-base) instead of underscores
  • Uses 'A'/'D'/'S' format for vote stats (Agree/Disagree/Sum) instead of 'agree'/'disagree'/'total'
  • Includes projections (unlike to_dynamo_dict()).

This suggests to_dict() is for Clojure compatibility while to_dynamo_dict() is specifically for DynamoDB export.

jucor avatar Nov 10 '25 13:11 jucor

See also #2290 , to address some type conversions of keys in to_dict(). Are they needed? 🤷

jucor avatar Nov 14 '25 16:11 jucor