Conversation serialization ambiguity
The Conversation class in Delphi has two export methods: get_full_data() , and to_dict() which exports more content but also renames the fields to match the Clojure old export (e.g. attribute conversation_id is renamed to zid) ? The latter renames to zid, but its unit test test_conversation.py still checks for conversation_id and thus fails.
Note: get_full_data() does not have tests. But is used in the code and the notebooks. In particular, get_full_data() is user-facing via the route /api/v3/conversations/{conversation_id}
On the other hand to_dict() (the one with the Clojure-compatible renaming) is used in all the tests, and by the ConversationManager serialization tool, and its internal saving tool (when it stores a json to the data_dir -- and I'm unsure where those files are then used).
Thickening the plot, while to_dict() renames conversation_id to zid, from_dict() expects a conversation_id, not a zid .... So I'm not sure if from_dict() is supposed to be callable with the output of get_full_data(), as its use of conversation_id seems to indicate, or that of to_dict(), as its function name seems to indict.
It would be clearer to:
- Have only one export method, if possible -- I guess it depends on the downstream uses?
- Or if not possible, at least factorize the duplicate code, and fix the tests for
to_dict()andfrom_dict()and add similar tests forget_full_data()and for the serialization
As a temporary (😅 ) fix, I can try to modify from_dict() so it accepts both conversation_id and zid, as long as only one is present, or as long as both agree if both are present. If neither is present or if they disagree, default to the current default behaviour of empty conversation name 🤷 (ideally I'd raise an error if both are present and disagree, but I'm not sure how robust are the callers and wouldn't want to break the server...).
I'll do the same for any other similar fields which store the exact same content but go by two distinct names (i.e. new name vs renamed to old Clojure's name).
The plot thickens even more: Except for the unit tests, the only usage of to_dict() that I can find is in ConversationManager._save_conversation(). That is a no-op unless a data directory was specified. Only unit tests seem to provide one.
Production code, instead uses a completely different serialization mechanism to dictionary, with to_dynamo_dict() (which itself uses zid, unlike get_full_data() which is the user-facing export)
Even in the tests, the data_dir that enables _save_conversation() is only used for testing its own behaviour 😆
There's just one test where to_dict() is possibly genuinely used, it's at the end of test_real_data.py's test_biodiversity_conversation() (which just checks that things run), which stores the data at the end -- but I do not see any code that uses those files, so they may have only been used for manual inspection during the development of delphi.
So, to summarize, we seem to have three Conversation serialization functions:
1/ get_full_data(), not tested, user-facing at /api/v3/conversations/ route
2/ to_dynamo_dict(), not tested that I can see, used for actual persistence in prod.
3/ to_dict(), includes more things than get_full_data(), such as counts data, tid, votes_base, group_clusters, and renames fields for clojure compatibiltiy, also has comments that it optimizes for faster processing than to_dynamodb_dict() but, neither me nor Claude Code seem to see it meaningfully used anywhere...
10:22
As to de-serialization:
1/ from_dict(), which from the name and the tests seems to be the inverse of to_dict(), but is unaware of the Clojure renames done by to_dict() and thus fails the tests
2/ dynamodb.read_math_by_tick() loads the output of Conversation.to_dynamo_db_dict(() from the Dynamo DB, but then modifies it to make it with the current from_dict()'s format --> so it's a fetcher+converter+call to from_dict().
Edited8m
So we have 3 written formats, and 2 read formats, one of which is converted to the other upon reading.
At first I was thinking the to_dict() format was here so we could converse with clojure, but since the reader from_dict() doesn't quite match it, I guess not ?
## Recommendation I think it would be terrific to have only one single format for serialization, if we can. If not possible for performance reason, then at most two: user-facing, and internal, with a lot of factorized code.
Three questions:
- Q1: Do we feel ready to modify any typescript code that reads convos from DynamoDB, to match the
get_full_data()schema? - Q2: Or, do we feel OK to modify the user-facing format produced by
/api/v3/conversations/?
If (Q1 or Q2):, then refactor to the corresponding format. Else keep both but refactor common code.
- Q3: Do we confirm that
to_dict()'s format is legacy? If so, replace that function by a call toget_full_data()?
Claude Code's helpful summary:
| Feature | get_full_data() | to_dynamo_dict() |
|---|---|---|
| Key format | conversation_id | zid |
| Moderation | Nested dict | Flat lists |
| Computed aggregates | ❌ No | ✅ Yes (votes-base, group-votes, consensus) |
| Float handling | Lists | Decimals (DynamoDB-safe) |
| ID conversion | Keep as strings | Convert to int |
| Projections | ✅ Included | ❌ Separate |
| Clojure compatibility | No | Yes |
| Round-trip with from_dict() | ✅ Yes | ❌ No (needs read_math_by_tick()) |
Interestingly, there's a third method zto_dict() that appears to be a hybrid:
- Starts like get_full_data()
- Then adds Clojure-specific fields (like
to_dynamo_dict()) - Uses hyphenated keys (
user-vote-counts,votes-base) instead of underscores - Uses
'A'/'D'/'S'format for vote stats (Agree/Disagree/Sum) instead of 'agree'/'disagree'/'total' - Includes projections (unlike
to_dynamo_dict()).
This suggests to_dict() is for Clojure compatibility while to_dynamo_dict() is specifically for DynamoDB export.
See also #2290 , to address some type conversions of keys in to_dict(). Are they needed? 🤷