oppia icon indicating copy to clipboard operation
oppia copied to clipboard

Implement validation job for prod models inheriting from base models

Open ankita240796 opened this issue 10 months ago • 31 comments

NOTE TO CONTRIBUTORS:

  • Please use #22225 as a base for creating any new jobs.
  • For any validation job you implement, the corresponding domain object of the prod model should be updated to do the following in the same PR:
    • Instead of throwing errors using utils.ValidationError("some message"), they should throw typed errors for each message category.
    • These error types should be defined in a new file as per the following example:
      • Domain class: class Exploration, file: exp_domain.py
      • Error class: class ExplorationDomainError(utils.ValidationError), file: exp_domain_errors.py

This issue aims to serve as a checklist for implementing prod validation for all models inheriting from base model.

Here is an umbrella doc describing the design for validation and further elaborating on next steps of migration and fixing root cause. Note that, migration and root cause fixing is out of scope for this issue but contributors can feel free to attempt if they would like to.

Note, for any model that you end up picking, before starting with code, list down the validation you will be implementing as described in the above doc and get it approved from me. Post that you will be assigned to the model in the list. After implementing the job, you should also be submitting a request to test it on backup server.

Implement Audit job for:

  • [ ] BaseModel: @ankita240796
  • [ ] ActivityReferencesModel @ankita240796 - work on sample PR to show how domain errors should be defined and link it as an example
  • [ ] AnswerSubmittedEventLogEntryModel
  • [ ] AppFeedbackReportModel
  • [ ] AppFeedbackReportStatsModel
  • [ ] AppFeedbackReportTicketModel
  • [ ] AuthorBlogPostAggregatedReadingTimeModel
  • [ ] AuthorBlogPostReadsAggregatedStatsModel
  • [ ] AuthorBlogPostViewsAggregatedStatsModel
  • [ ] BeamJobRunModel
  • [ ] BeamJobRunResultModel
  • [ ] BlogAuthorDetailsModel
  • [ ] BlogPostExitedEventLogEntryModel
  • [ ] BlogPostModel
  • [ ] BlogPostReadEventLogEntryModel
  • [ ] BlogPostReadingTimeModel
  • [ ] BlogPostReadsAggregatedStatsModel
  • [ ] BlogPostRightsModel
  • [ ] BlogPostSummaryModel
  • [ ] BlogPostViewedEventLogEntryModel
  • [ ] BlogPostViewsAggregatedStatsModel
  • [ ] BulkEmailModel
  • [ ] ClassroomModel
  • [ ] CollectionProgressModel
  • [ ] CollectionSummaryModel
  • [ ] CommunityContributionStatsModel
  • [ ] CompletedActivitiesModel
  • [ ] CompleteExplorationEventLogEntryModel
  • [ ] CsrfSecretModel
  • [ ] DeletedUserModel
  • [ ] DeletedUsernameModel
  • [ ] EntityTranslationsModel
  • [ ] EntityVoiceoversModel
  • [ ] ExplorationActualStartEventLogEntryModel
  • [ ] ExplorationContextModel
  • [ ] ExplorationIssuesModel
  • [ ] ExplorationOpportunitySummaryModel
  • [ ] ExplorationStatsModel
  • [ ] ExplorationStatsTaskEntryModel
  • [ ] ExplorationUserDataModel
  • [ ] ExplorationVersionHistoryModel
  • [ ] ExplorationVoiceArtistsLinkModel
  • [ ] ExpSummaryModel
  • [ ] ExpUserLastPlaythroughModel
  • [ ] FeatureFlagConfigModel
  • [ ] FirebaseSeedModel
  • [ ] GeneralFeedbackMessageModel
  • [ ] GeneralFeedbackThreadModel
  • [ ] GeneralFeedbackThreadUserModel
  • [ ] GeneralSuggestionModel
  • [ ] IncompleteActivitiesModel
  • [ ] JobModel
  • [ ] LearnerAnswerDetailsModel
  • [ ] LearnerGoalsModel
  • [ ] LearnerGroupModel
  • [ ] LearnerGroupsUserModel
  • [ ] LearnerPlaylistModel
  • [ ] LeaveForRefresherExplorationEventLogEntryModel
  • [ ] MachineTranslationModel
  • [ ] MaybeLeaveExplorationEventLogEntryModel
  • [ ] PendingDeletionRequestModel
  • [ ] PinnedOpportunityModel
  • [ ] PlaythroughModel
  • [ ] PseudonymizedUserModel
  • [ ] QuestionContributionStatsModel
  • [ ] QuestionReviewerTotalContributionStatsModel
  • [ ] QuestionReviewStatsModel
  • [ ] QuestionSkillLinkModel
  • [ ] QuestionSubmitterTotalContributionStatsModel
  • [ ] QuestionSummaryModel
  • [ ] RateExplorationEventLogEntryModel
  • [ ] RoleQueryAuditModel
  • [ ] SentEmailModel
  • [ ] SkillOpportunityModel
  • [ ] SkillSummaryModel
  • [ ] SolutionHitEventLogEntryModel
  • [ ] StartExplorationEventLogEntryModel
  • [ ] StateAnswersModel
  • [ ] StateCompleteEventLogEntryModel
  • [ ] StateCounterModel
  • [ ] StateHitEventLogEntryModel
  • [ ] StoryProgressModel
  • [ ] StorySummaryModel
  • [ ] TopicSimilaritiesModel
  • [ ] TopicSummaryModel
  • [ ] TransientCheckpointUrlModel
  • [ ] TranslationContributionStatsModel
  • [ ] TranslationCoordinatorsModel
  • [ ] TranslationReviewerTotalContributionStatsModel
  • [ ] TranslationReviewStatsModel
  • [ ] TranslationSubmitterTotalContributionStatsModel
  • [ ] UnsentFeedbackEmailModel
  • [ ] UserAuthDetailsModel
  • [ ] UserBulkEmailsModel
  • [ ] UserContributionProficiencyModel
  • [ ] UserContributionRightsModel
  • [ ] UserContributionsModel
  • [ ] UserEmailPreferencesModel
  • [ ] UserGroupModel
  • [ ] UserIdByFirebaseAuthIdModel
  • [ ] UserIdentifiersModel
  • [ ] UsernameChangeAuditModel
  • [ ] UserQueryModel
  • [ ] UserSettingsModel
  • [ ] UserSkillMasteryModel
  • [ ] UserSubscribersModel
  • [ ] UserSubscriptionsModel
  • [ ] VoiceArtistMetadataModel
  • [ ] VoiceoverAutogenerationPolicyModel

ankita240796 avatar Feb 06 '25 18:02 ankita240796

Hi @ankita240796, I'd like to take up the UserContributionsModel. Validation: It only has 2 fields: created_exploration_ids: list of explorations created by the user. edited_exploration_ids: list of explorations edited (the user has made a postive non revert commit to the exp) by the user. Both of these should correspond to valid exploration models and exist in the datastore.

The domain layer validate checks: In user_domain.py: Class UserContributions: validate (Already implemented): Checks that user_id and all exploration ids in created_exploration_ids, edited_exploration_ids are valid.

mon4our avatar Feb 21 '25 06:02 mon4our

Hi @mon4our, sorry, this is currently blocked on implementation of a BaseValidationJob in #21969. I will get back to you here once that is done and discuss what you proposed and assign this then. #21969 should be done early next week.

ankita240796 avatar Feb 21 '25 12:02 ankita240796

@Ashu463 -- @ankita240796 will review your comment. (I cannot reply to everything.)

If you need help with something, writing a debugging doc and being clear about what you are asking will make it easier for others to help you.

seanlip avatar Mar 11 '25 17:03 seanlip

Hi @Ashu463, sorry this one is blocked as mentioned in the issue description. I will remove the blocker by the end of this week and will respond to your comments then. Thanks!

ankita240796 avatar Mar 12 '25 14:03 ankita240796

Hi @Ashu463, sorry this one is blocked as mentioned in the issue description. I will remove the blocker by the end of this week and will respond to your comments then. Thanks!

Cool, till then I'll update the comment by removing code blocks and adding explanation to validation jobs. Thanks.

Ashu463 avatar Mar 12 '25 14:03 Ashu463

Thanks @Ashu463, that sounds great!

ankita240796 avatar Mar 12 '25 14:03 ankita240796

Hey @ankita240796 is this open now ? Suggesting to implement validation jobs for ClassroomModel, here are some few validation methods which I thought to built, before starting to validation jobs I'm listing all attributes of classroomModel and explain them shortly.

- name: Name of classroom.
- url_fragment: End point of classroom.
- course details: Descriptoin of course in classroom.
- teaser_text: Text to provide a summary of the classroom.
- topic_list_intro: Text to provide an introduction for all the topics in the classroom.
- topic_id_to_prerequisite_topic_ids: List of topic IDs which are prerequisites for any particular topic.
- is_published: Either the classroom is published or not.
- diagnostic_test_is_enabled: Whether this classroom's diagnostic test functionality is enabled or not
- thumbnail_filename: Thumbnail filename of classroom.
- thumbnail_bg_color: Thumbnail background color of classroom.
- thumbnail_file_size_in_bytes: Thumbnail file size of classroom.
- banner_filename: The banner filename of the classroom.
- banner_bg_color: The banner bg color of the classroom.
- banner_file_size_in_bytes: The banner file size of the classroom.
- index: Iteratable value of classroom.

Validation Criteria

  • GetInvalidNameJob : A name is considered invalid if it is not of type str, is an empty string, or has a length greater than constants.MAX_CHARS_IN_CLASSROOM_NAME.
  • GetInvalidURLFragmentJob: A url_fragment is invalid if it is not of type str, is an empty string, exceeds constants.MAX_CHARS_IN_CLASSROOM_URL_FRAGMENT, or contains characters not matching the defined regex (constants.VALID_URL_FRAGMENT_REGEX).
  • GetInvalidCourseDetails: course_details is invalid if it is not of type str, is an empty string, or exceeds constants.MAX_CHARS_IN_CLASSROOM_COURSE_DETAILS.
  • GetInvalidTeaserText: teaser_text is invalid if it is not of type str, is an empty string, or exceeds constants.MAX_CHARS_IN_CLASSROOM_TEASER_TEXT.
  • GetInvalidTopicListIntro: topic_list_intro is invalid if it is not of type str, is an empty string, or exceeds constants.MAX_CHARS_IN_CLASSROOM_TOPIC_LIST_INTRO.
  • GetClassroomsHavingCyclicPrerequisitesTopicsIdsJob: A topic_id_to_prerequisite_topic_ids mapping is considered invalid if it contains cyclic dependencies. For example, if Topic A depends on Topic B and Topic B depends on Topic A, it is marked as invalid. To detect cyclic dependencies, we can use the logic implemented in the validate method.
  • GetInvalidBgColor: thumbnail_bg_color and banner_bg_color are invalid if they are not of type str, are empty strings, or contain colors not defined in constants.ALLOWED_THUMBNAIL_BG_COLORS['classroom'].
  • GetOtherInvalidAttrs: Validates all other attributes by comparing their types in the backup/local server with the expected types. Any attribute with a type mismatch is considered invalid.

I took reference from other validation jobs exist in codebase before presenting to you. Lemme know your views on it.

Thanks.

Ashu463 avatar Mar 20 '25 17:03 Ashu463

@ankita240796 freindly ping, I need to show that I could write Beam Jobs along with their tests in my GSoC proposal specifically needed for my project idea, that's why I'm requesting to PTAL.

Should I open draft PR for now ?

Ashu463 avatar Mar 22 '25 12:03 Ashu463

Hi @Ashu463, let's wait for the PR till tomorrow until I get mine reviewed, will be sending it tonight.

Let's discuss a bit more on the structure you described:

  • Can you break the checks down to what is already covered by the domain object of ClassroomModel?
  • Once you do that, we just need to cover checks which cannot happen in domain layer.

Can you create a table like one described here to clarify the same? Thanks!

ankita240796 avatar Mar 22 '25 17:03 ankita240796

Sure @ankita240796 here is the list of checks done in domain object of classroom model:

  • Name validation of classroom.
  • Teaser text validation.
  • Topic list intro validation.
  • Course details validation.
  • URL fragment validation.
  • BG color validation for both thumbnail and banner.
  • Checking cyclic topic ids in prerequisite topic ids.
  • Is_published validation.
  • diagnostic_test_is_enabled validation.

And in strict validation, rest all attributes are validated. Now a question arises in my mind if validation for each attribute is done in this ClassroomModel domain object, then what we would gonna check in validation jobs ?

I'm preparing table for this classroomModel. @ankita240796 I have one more question do I need to propose criteria for all listed validation jobs in my GSoC proposal or I need to pick some of them, similarly with domain objects.

Ashu463 avatar Mar 22 '25 18:03 Ashu463

Thanks @Ashu463!

And in strict validation, rest all attributes are validated. Now a question arises in my mind if validation for each attribute is done in this ClassroomModel domain object, then what we would gonna check in validation jobs ?

You would check for inter-model relationships if any exists. Further, we can check for properties not present in domain object which might be a rare case. I would recommend going through the table in the linked doc which will give you an idea of what checks can exist for model and what for domain objects.

I have one more question do I need to propose criteria for all listed validation jobs in my GSoC proposal or I need to pick some of them, similarly with domain objects.

You need to do it for all models and all domain objects in the format described in here.

ankita240796 avatar Mar 22 '25 18:03 ankita240796

Got it @ankita240796 i.e I need to fill table present in the doc for all listed models in both associated issues, and hence attach it to GSoC proposal. Right ?

Ashu463 avatar Mar 22 '25 19:03 Ashu463

Yes, that's correct @Ashu463.

ankita240796 avatar Mar 22 '25 20:03 ankita240796

@Ashu463 feel free to review #22225 if you are interested (no pressure to review).

ankita240796 avatar Mar 22 '25 20:03 ankita240796

Got it @ankita240796 i.e I need to fill table present in the doc for all listed models in both associated issues, and hence attach it to GSoC proposal. Right ?

Another note, you should also fill the first table for all base models for your proposal as described in the doc.

ankita240796 avatar Mar 22 '25 20:03 ankita240796

@ankita240796 here is that table for ClassroomModel, I can't think validation job for other than model id, as each attribute could be validated in domain layer.

Image

Now the parts which I want your focus is, look at below codeblock, validation for index is done on type int but comment says expected index to be boolean. Isn't this wrong ?

https://github.com/oppia/oppia/blob/9324ef744a3d28014eac71a995bb54c373b3fbf3/core/domain/classroom_config_domain.py#L366

Another one look this codeblock, where thumbnail_data is of type ImageData dict, and error msg says its expectation of thumbnail_data to be string. I think they wanna validate thumbnail_data.filename aren't they ?

https://github.com/oppia/oppia/blob/9324ef744a3d28014eac71a995bb54c373b3fbf3/core/domain/classroom_config_domain.py#L371

Got it @ankita240796 i.e I need to fill table present in the doc for all listed models in both associated issues, and hence attach it to GSoC proposal. Right ?

Another note, you should also fill the first table for all base models for your proposal as described in the doc.

Hey, required_validation part toh be same for both the tables, aren't they ? Then what's the point of writing same things in two different tables ? Either I would add a row to right of second table with Models which inherit these checks ,Howz that?

Ashu463 avatar Mar 22 '25 21:03 Ashu463

@Ashu463 I am unable to check the links you have in the image. Can you put the table down in a comment here (it's okay to have it as bullet points instead of a table if that's easier) or share the doc where you have this info?

I can't think validation job for other than model id, as each attribute could be validated in domain layer.

Does ClassroomModel has any relationships with other prod models? If yes, we need to validate those.

Now the parts which I want your focus is, look at below codeblock, validation for index is done on type int but comment says expected index to be boolean. Isn't this wrong ?

Yes, it is wrong. Good catch! Let's fix that - seems like someone copied over code from another place and missed changing this.

Another one look this codeblock, where thumbnail_data is of type ImageData dict, and error msg says its expectation of thumbnail_data to be string. I think they wanna validate thumbnail_data.filename aren't they ?

Hmm, I think the error message is wrong and we should fix that. Why do you think it was meant to validate thumbnail_data.filename though?

Hey, required_validation part toh be same for both the tables, aren't they ? Then what's the point of writing same things in two different tables ? Either I would add a row to right of second table with Models which inherit these checks ,Howz that?

I am not sure what you mean by required validation is same for both tables. The first table is just for base models defined in here. A separate table is required for it because it will lay down the basis for all other validation jobs. All models which inherit from BaseModel should have validation jobs which inherit from BaseValidationJob. So, this job should capture all common checks which will apply to all children of BaseModel. A separate table will make this easier to read and segregate. You do not need to put base models down in the second table then. Does that clarify?

ankita240796 avatar Mar 23 '25 00:03 ankita240796

@mon4our let me know if you are still interested in working on this. Thanks!

ankita240796 avatar Mar 23 '25 00:03 ankita240796

@ankita240796 here is link of complete doc. Pls refer to classroomModel for this time coz it is under process right now.

  • Yeah this model has relation with topic model, now I think validation jobs could be made on validating that topic ID present in classroom model must present in topic model. Right ?
  • Should I fix this in upcomig PR ? or open a issue regarding it ?
  • If you look closely to validation of filename, they're validating filename to be non-empty string directly insted of validating filename first on its type. And I think we should validate both thumbnail_data as dict and filename as non-empty string.
  • That's my bad I just got confused between two, thanks for clearing it out.

@ankita240796 I noticed one thing, there are some models who doesn't have any validation(e.g. AppFeedbackReportStatsModel, AuthorBlogPostAggregatedReadingTimeModel, etc.) and you'd not added them in domain object issue so do I need to list up validations for them also in table ?

EDITS: Removed doc link.

Ashu463 avatar Mar 23 '25 04:03 Ashu463

Thanks @Ashu463, the ClassroomModel table looks good, feel free to proceed with a PR. I have just one suggestion on validating thumbnail data and banner data. We should have a validate method in the ImageData domain object with all the checks you are mentioning for filename, size and colors. This makes validation logic in the right place and we can directly invoke validate method on these sub-fields. Does that make sense? (If yes, please add it in the table too).

Now for your questions:

Yeah this model has relation with topic model, now I think validation jobs could be made on validating that topic ID present in classroom model must present in topic model. Right ?

Yes, that's correct. We should check that a topic model exists for all topic ids present in any field of the classroom model.

Should I fix this in upcomig PR ? or open a issue regarding it ?

Fix it in the upcoming PR you create for classroom model.

If you look closely to validation of filename, they're validating filename to be non-empty string directly insted of validating filename first on its type. And I think we should validate both thumbnail_data as dict and filename as non-empty string.

I have answered initially on how to validate filename and thumbail data in general.

I noticed one thing, there are some models who doesn't have any validation(e.g. AppFeedbackReportStatsModel, AuthorBlogPostAggregatedReadingTimeModel, etc.) and you'd not added them in domain object issue so do I need to list up validations for them also in table ?

Hmm, I think domain objects exist for them but the naming is confusing. For example, for AuthorBlogPostAggregatedReadingTimeModel, we have this domain object: https://github.com/oppia/oppia/blob/770936014c68d791c1ccc33fe9df0fec268779ca/core/domain/blog_statistics_domain.py#L137

I think it is the same for other examples you mentioned. If you would like, you can go ahead and update the names to align with model names but it's not a requirement.

ankita240796 avatar Mar 24 '25 00:03 ankita240796

Thanks @ankita240796 for all replies, everything LGTM now. Tasks for me:

  • Add validate method for ImageData, and hence update existing methods to use this new method for validating Imagedata attrs.
  • Fix index validation.
  • Write validation job for classroomModel on listed criterions in table.

I'm not gonna changes these names, rather suggest to file the issue for it.

Ashu463 avatar Mar 24 '25 06:03 Ashu463

All sounds good @Ashu463!

Will you be filing the issue or want me to file one?

ankita240796 avatar Mar 24 '25 11:03 ankita240796

Fine @ankita240796 I'll open a new one.

Edit: I think let's do that when I complete design doc, as it helps to find more such cases.

Ashu463 avatar Mar 24 '25 13:03 Ashu463

Hi @ankita240796, very sorry for the late reply. I am very swamped till the end of this quarter so I won't be able to take this up at the moment. Lmk if I can help from Q2 onwards!

mon4our avatar Mar 27 '25 11:03 mon4our

Sure, will let you know @mon4our, thanks!

ankita240796 avatar Mar 27 '25 11:03 ankita240796

@ankita240796, I would like to work for StorySummaryModel. The domain already has the validate function. The model consists of the following fields:

  • story_id: str. The unique id of the story.
  • title: str. The title of the story.
  • description: str. The description of the story.
  • language_code: str. The language code of the story.
  • version: int. The version of the story.node_titles: list(str). The titles of nodes present in the story.
  • thumbnail_bg_color: str|None. The thumbnail background color of the story.
  • thumbnail_filename: str|None. The thumbnail filename of the story.
  • url_fragment: str. The url fragment for the story.
  • story_model_created_on: datetime.datetime. Date and time when the story model is created.
  • story_model_last_updated: datetime.datetime. Date and time when the story model was last updated.

Now, for validation, I could only think of the following:

  • checking model_id to be valid: According to the documentation, "The key of each instance is the story id". So, I think this can be done by retrieving all the storySummaryModels and the storyModels and then checking if each storySummaryModel ID exists in the corresponding storyModel ID list as well.
  • checking story_model_created_on : This should be the same as corresponding story's created_on value.
  • checking story_model_last_updated: This should be the same as corresponding story's last_updated value.
  • checking description: Description should be same as corresponding story's description.
  • checking url_fragment
  • checking node_titles
  • thumbnail_filename
  • thumbnail_bg_color

So basically, since the entire model is dependent on the StoryModel, we need to validate if the corresponding models have all the field's values to be same. Right?

Shriyam-Avasthi avatar Mar 28 '25 04:03 Shriyam-Avasthi

Thanks @Shriyam-Avasthi, a couple follow-ups:

  • For first bullet point, you only need to fetch the story model for a given summary model ID and see if that exists. I am assuming that's what you mean as you are validating one instance of the model at a time and do not retrieve all the instances together at once for this single check.
  • What do you mean by the last 4 bullet points? Are you implying checking that the values match the corresponding ones in the story model?
  • What about title and language code?
  • What checks will be done for version?
  • What checks will you perform for node_titles?

ankita240796 avatar Mar 28 '25 18:03 ankita240796

@ankita240796,

  • For first bullet point, you only need to fetch the story model for a given summary model ID and see if that exists. I am assuming that's what you mean as you are validating one instance of the model at a time and do not retrieve all the instances together at once for this single check.

Actually, I was thinking of fetching all the StoryModels at once. Then, similarly all the StorySummaryModels, and then validate if there are any StorySummaryModels without a corresponding storyModel. Here is the draft code I was thinking of:


class InconsistentStorySummaryModelError(base_validation.BaseValidationError):
    """Error class for inconsistencies between StoryModel and StorySummaryModel."""
    def __init__(self, story_id: str, field: str, expected, actual):
        message = (f"StorySummaryModel {story_id} has mismatched {field}: "
                   f"expected {expected}, found {actual}")
        super().__init__(message)

class MissingStoryModelError(base_validation.BaseValidationError):
    """Error class for missing StoryModel."""
    def __init__(self, story_id: str):
        message = f"StorySummaryModel {story_id} has no corresponding StoryModel."
        super().__init__(message)

class MissingStorySummaryModelError(base_validation.BaseValidationError):
    """Error class for missing StorySummaryModel."""
    def __init__(self, story_id: str):
        message = f"StoryModel {story_id} has no corresponding StorySummaryModel."
        super().__init__(message)

class ValidateStoryModelsJob(base_validation_jobs.BaseValidationJob):
    """Job that validates StoryModel and StorySummaryModel consistency."""

    def run(self) -> beam.PCollection[job_run_result.JobRunResult]:
        story_models_pcoll = (
            self.pipeline
            | 'Get all StoryModels' >> ndb_io.GetModels(story_models.StoryModel.get_all())
            | 'Map StoryModel by ID' >> beam.Map(lambda model: (model.id, model))
        )

        story_summary_models_pcoll = (
            self.pipeline
            | 'Get all StorySummaryModels' >> ndb_io.GetModels(story_models.StorySummaryModel.get_all())
            | 'Map StorySummaryModel by ID' >> beam.Map(lambda model: (model.id, model))
        )

        joined_pcoll = (
            {'story': story_models_pcoll, 'summary': story_summary_models_pcoll}
            | 'CoGroupByKey Story and Summary Models' >> beam.CoGroupByKey()
        )

        validation_results = (
            joined_pcoll
            | 'Validate StorySummaryModels' >> beam.ParDo(ValidateStorySummaryModel())
        )

        return validation_results


class ValidateStorySummaryModel(beam.DoFn):
    """Validates if StorySummaryModel fields match StoryModel fields."""

    def process(self, element):
        story_id, grouped_models = element
        story_models = grouped_models['story']
        summary_models = grouped_models['summary']

        if not story_models:
            yield MissingStoryModelError(story_id)
            return

        if not summary_models:
            yield MissingStorySummaryModelError(story_id)
            return

        story_model = story_models[0]
        summary_model = summary_models[0]

        mismatches = {}
        if story_model.title != summary_model.title:
            mismatches['title'] = (story_model.title, summary_model.title)
        if story_model.description != summary_model.description:
            mismatches['description'] = (story_model.description, summary_model.description)
        if story_model.language_code != summary_model.language_code:
            mismatches['language_code'] = (story_model.language_code, summary_model.language_code)
        if story_model.version != summary_model.version:
            mismatches['version'] = (story_model.version, summary_model.version)
        if story_model.thumbnail_bg_color != summary_model.thumbnail_bg_color:
            mismatches['thumbnail_bg_color'] = (story_model.thumbnail_bg_color, summary_model.thumbnail_bg_color)
        if story_model.thumbnail_filename != summary_model.thumbnail_filename:
            mismatches['thumbnail_filename'] = (story_model.thumbnail_filename, summary_model.thumbnail_filename)
        if story_model.url_fragment != summary_model.url_fragment:
            mismatches['url_fragment'] = (story_model.url_fragment, summary_model.url_fragment)
        if story_model.created_on != summary_model.story_model_created_on:
            mismatches['story_model_created_on'] = (story_model.created_on, summary_model.story_model_created_on)
        if story_model.last_updated != summary_model.story_model_last_updated:
            mismatches['story_model_last_updated'] = (story_model.last_updated, summary_model.story_model_last_updated)

        # Validate node_titles
        expected_node_titles = set()
        for story_content in story_model.story_contents:
            for story_node in story_content.story_nodes:
                expected_node_titles.add(story_node.title)
        
        summary_node_titles = set(summary_model.node_titles)

        if expected_node_titles != summary_node_titles:
            mismatches['node_titles'] = (expected_node_titles, summary_node_titles)

        if mismatches:
            for field, (expected, actual) in mismatches.items():
                yield InconsistentStorySummaryModelError(story_id, field, expected, actual)

  • What do you mean by the last 4 bullet points? Are you implying checking that the values match the corresponding ones in the story model?

I am sorry, I should have been more specific about the points. Yes, we will check for those values to be same as the corresponding story's values.

  • What about title and language code?

Similar to all the other fields, we will check if the values are same as the corresponding story's value.

  • What checks will be done for version?

Again, this should be the same as corresponding story's version.

  • What checks will you perform for node_titles?

We can verify if all the storyModel.story_contents.story_nodes.title are present in the node_titles. (Here we iterate over all the story_contents and further iterate over every story_node being a part of the story_content)

P.S. : I am a bit unsure if this is the correct architecture which is desired according to the master doc. Please guide me if this is not the desired architecture.

Shriyam-Avasthi avatar Mar 28 '25 20:03 Shriyam-Avasthi

Thanks @Shriyam-Avasthi!

Actually, I was thinking of fetching all the StoryModels at once. Then, similarly all the StorySummaryModels, and then validate if there are any StorySummaryModels without a corresponding storyModel. Here is the draft code I was thinking of:

We should not be doing that. Your jobs should follow a particular template extending a base validation job, see implementation in https://github.com/oppia/oppia/pull/22225. So, for the validation of checking if story model exists, we will have a separate validation function doing that which will try to fetch story model for a corresponding summary model.

I am sorry, I should have been more specific about the points. Yes, we will check for those values to be same as the corresponding story's values.

Sounds good, let's list down all criteria again clearly in a list once we have completed the discussion on all the points.

Again, this should be the same as corresponding story's version.

I don't see a version field in the StoryModel.

We can verify if all the storyModel.story_contents.story_nodes.title are present in the node_titles. (Here we iterate over all the story_contents and further iterate over every story_node being a part of the story_content)

That makes sense.

P.S. : I am a bit unsure if this is the correct architecture which is desired according to the master doc. Please guide me if this is not the desired architecture.

Yes, the architecture is not correct. You will have one validation job per model and it should not be mixing the checks of two models by fetching all at once as you proposed. Refer to the linked BaseValidationJob and let me know if anything is unclear.

ankita240796 avatar Mar 30 '25 12:03 ankita240796

@ankita240796

We should not be doing that. Your jobs should follow a particular template extending a base validation job, see implementation in #22225. So, for the validation of checking if story model exists, we will have a separate validation function doing that which will try to fetch story model for a corresponding summary model.

I think something like the following is what you are looking for. Right?

def validate_story_summary_domain(model):

            story_id = model.id

            story = story_fetchers.get_story_by_id(story_id)
            if story is None:
                yield StorySummaryDomainValidationError(
                    story_id,
                    "Missing corresponding Story domain object."
                )
                return

            if story.title != model.title:
                yield StorySummaryDomainValidationError(
                    story_id,
                    f"Title mismatch: expected '{story.title}' but found '{model.title}'."
                )

            # Additional validations can be added here:
            # e.g., comparing description, language_code, etc.
            if story.description != model.description:
                yield StorySummaryDomainValidationError(
                    story_id,
                    f"Description mismatch: expected '{story.description}' but found '{model.description}'."
                )

So, from what I am able to understand, this validation function will be mentioned in the get_validation_fns() method as that is the final list of validation functions which end up executin. But, I am unable to understand what goes in the get_validate_domain_object_fn() method .

Also, the base model loads all the models using :

        all_models = (
            self.pipeline
            | 'Get all models' >> (
                ndb_io.GetModels(datastore_services.query_everything()))
        )

But for any of the subclass, we will only need a few models. How should I be extracting the model I need? In this case, it is the story summary model.

I don't see a version field in the StoryModel.

The field is there. Please refer to the attached screenshot.

Image

Yes, the architecture is not correct. You will have one validation job per model and it should not be mixing the checks of two models by fetching all at once as you proposed. Refer to the linked BaseValidationJob and let me know if anything is unclear.

To be honest, I am still a bit confused about the architecture. It would be great if you could briefly point what I need to do exactly. I think the validation method is fine this time. So, apart from writing the validation method, how am I supposed to define those abstract methods for extending the base class? Also, I am assuming that I won't need to rewrite the run() method for the subclasses.

Shriyam-Avasthi avatar Apr 01 '25 03:04 Shriyam-Avasthi