site-kit-wp

Add “partial data” states infrastructure

Open techanvil opened this issue 1 year ago • 9 comments

Feature Description

Add the full infrastructure for determining and exposing the "partial data" states for audiences, custom dimensions and properties.

See partial data states in the design doc.


Do not alter or remove anything below. The following sections will be managed by moderators only.

Acceptance criteria

  • The Analytics module should have new selectors for detecting whether an audience, custom dimension or Analytics property (referred to as a "resource" in the following points) is in the "partial data" state.
    • A resource is considered to be in the "partial data" state until it has been active for the full duration of the currently selected date range.
    • A resource is also considered to be in the "partial data" state if GA4 itself is in the gathering data state.
  • The partial data state should be determined by retrieving a report for the given resource and comparing the date of the earliest event with the start date of the current date range.
  • Similar to how it's done for the gathering data states, the date of the earliest event, once determined, should be persisted on the server and made available to the client on page load.
    • The persisted date for a given resource, whenever available, should be used instead of making a report request to determine the partial data state in the resolvers of the partial data selectors.
    • Persisted dates for all resources should be reset whenever the Analytics property or measurement ID changes, the Analytics module is deactivated or Site Kit is reset.
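
The two rules above can be sketched as a small pure function. This is only an illustration: the function name and argument shape are hypothetical (not Site Kit's actual API), dates are assumed to be 'YYYY-MM-DD' strings, and treating an undetermined date as partial data is an assumption taken from the implementation brief rather than from the criteria themselves.

```javascript
// Hypothetical helper; the name and argument shape are illustrative only.
// Dates are 'YYYY-MM-DD' strings, which sort lexicographically in date order.
function isResourcePartialData( {
	isGatheringData,
	dataAvailabilityDate,
	startDate,
} ) {
	// GA4 itself is still gathering data: always partial.
	if ( isGatheringData ) {
		return true;
	}
	// Assumption: an undetermined date is treated as partial data.
	if ( ! dataAvailabilityDate ) {
		return true;
	}
	// Partial until the resource has been active for the whole date range.
	return dataAvailabilityDate > startDate;
}
```

With this shape, the result depends only on the stored date and the selected range, so it can be recomputed whenever the date range changes.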

Implementation Brief

Note: the following IB is heavily based on and inspired by Data_Available state for modules and custom dimensions. Any gap in the IB may be filled in by reassessing the implementation and comparing with the aforementioned infrastructure.

PHP

  • [ ] Create class Google\Site_Kit\Modules\Analytics_4\Resource_Data_Availability_Date.
    • [ ] Take Transients $transients in the constructor and initialize as a field.
    • [ ] Use consts VALID_CUSTOM_DIMENSION_SLUGS and VALID_AUDIENCE_SLUGS to store the valid (allowed) custom dimension and audience slugs.
    • [ ] Have RESOURCE_TYPE_** consts for audience, custom dimension and property resources.
    • [ ] Method get_resource_transient_name takes resource name and resource type parameters and returns the computed transient name, i.e. return "googlesitekit_{$resource_type}_{$resource_name}_data_availability_date";
    • [ ] Method get_resource_dates should return an associative array of the data availability dates of resources. This can be a multidimensional array, or the resources can be prefixed with the resource type.
    • [ ] Other methods get_resource_date, set_resource_date, reset_resource_date etc should be implemented similarly to how it is done on Google\Site_Kit\Modules\Analytics_4\Custom_Dimensions_Data_Available class.
  • [ ] In Google\Site_Kit\Modules\Analytics_4 class:
    • [ ] Add $resource_data_available_date field and instantiate it with Resource_Data_Availability_Date in the constructor.
    • [ ] Create a new REST endpoint POST:save-resource-data-availability-date in the Analytics_4 module.
      • [ ] It should check if the passed resource(s) in the $data (audience, customDimension or property) are valid, and then persist the date values as a timestamp in the DB using the $this->resource_data_available_date->set_resource_date method.
    • [ ] Expose the persisted dates of resource data availability to the client using the googlesitekit_inline_modules_data filter in the register method.
    • [ ] Call $resource_data_available_date->reset_resource_date() in on_deactivation method to reset all persisted dates on module deactivation.
    • [ ] Call $resource_data_available_date->reset_resource_date() in the $this->get_settings()->on_change() when property ID or measurement ID is different, similarly to how it's done with $this->custom_dimensions_data_available->reset_data_available() to reset the persisted dates when analytics property/measurement ID changes.

JS

  • [ ] Create assets/js/modules/analytics-4/datastore/partial-data.js file.
    • [ ] Create a fetch store for the aforementioned POST API.
    • [ ] Actions:
      • [ ] saveResourceDataAvailabilityDate takes an array of objects ({ resource name, resource type, date }) and saves it to the server using the fetch store.
    • [ ] Selectors:
      • [ ] getResourceDataAvailabilityDate(resourceName, resourceType): returns the date associated with the given resource if available, otherwise resolves to the first date in the last 90 days that the report data became available using the associated resolver (described below). The 90 days is chosen because that's the longest date range available in Site Kit.
      • [ ] is{Audience|CustomDimension|Property}PartialData(resourceName):
        • [ ] Return true when GA4 is in the gathering data state.
        • [ ] Return false when the dataAvailabilityDate for the resource is the same as or earlier than the startDate of the currently selected date range.
        • [ ] Otherwise, return true. This also handles the case where the dataAvailabilityDate for a given resource cannot be determined due to errors or being in the shared dashboard.
    • [ ] Resolvers
      • [ ] getResourceDataAvailabilityDate:
        • [ ] Get reportArgs for the given resource.
        • [ ] For a property, the reportArgs are similar to those returned by getSampleReportArgs from assets/js/modules/analytics-4/utils/report-args.js, with the following changes:
          • [ ] Start date: creation date of the current GA property.
          • [ ] End date: the reference date.
        • [ ] For audience, the reportArgs will include audienceResourceName as an additional dimension.
          • [ ] This allows a single report to cover all audience resources; the resulting report can then be filtered in JS for a specific audience resource to get its earliest date.
        • [ ] For Custom Dimension, report args should be the following:
          • [ ] Start date: creation date of the current GA property.
          • [ ] End date: the reference date.
          • [ ] The dimension: date for property resource, and customEvent:${ resourceName }
          • [ ] Metric: eventCount.
          • [ ] See the getDataAvailabilityReportOptions selector in assets/js/modules/analytics-4/datastore/custom-dimensions-gathering-data.js and getSampleReportArgs in assets/js/modules/analytics-4/datastore/report.js for a more complete example. These implementations can largely be followed.
        • [ ] Make a simple report request to the given resource using the above report args.
        • [ ] Find the date of the first available report.
        • [ ] If there is any error or the user doesn't have permission (i.e. the property creation date cannot be accessed in the shared dashboard), return null and do not persist anything.
        • [ ] Otherwise, persist the date for the given resource using saveResourceDataAvailabilityDate and return the date.
  • [ ] Add the newly added store partial to assets/js/modules/analytics-4/datastore/index.js.
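
The "find the date of the first available report" step in the resolver could look roughly like the sketch below. It assumes a GA4 Data API style response in which each row carries `dimensionValues`, the first dimension is `date` (formatted YYYYMMDD) and, for audience reports, the second dimension is `audienceResourceName`. The helper name and the sample audience names are hypothetical.

```javascript
// Hypothetical helper; not an existing Site Kit function.
// Assumes rows shaped like { dimensionValues: [ { value }, ... ] } where the
// first dimension is `date` (YYYYMMDD) and, for audience reports, the second
// dimension is `audienceResourceName`.
function getEarliestReportDate( report, audienceResourceName ) {
	const rows = ( report && report.rows ) || [];

	const dates = rows
		// When an audience is given, keep only the rows for that audience.
		.filter(
			( row ) =>
				audienceResourceName === undefined ||
				( row.dimensionValues[ 1 ] &&
					row.dimensionValues[ 1 ].value === audienceResourceName )
		)
		.map( ( row ) => row.dimensionValues[ 0 ].value );

	if ( dates.length === 0 ) {
		// Nothing to report yet; the caller should not persist anything.
		return null;
	}

	// YYYYMMDD strings sort lexicographically in date order.
	return dates.reduce( ( earliest, date ) =>
		date < earliest ? date : earliest
	);
}

// Made-up sample data covering two audiences in a single report.
const sampleReport = {
	rows: [
		{ dimensionValues: [ { value: '20240310' }, { value: 'properties/123/audiences/1' } ] },
		{ dimensionValues: [ { value: '20240305' }, { value: 'properties/123/audiences/1' } ] },
		{ dimensionValues: [ { value: '20240301' }, { value: 'properties/123/audiences/2' } ] },
	],
};

console.log( getEarliestReportDate( sampleReport, 'properties/123/audiences/1' ) ); // 20240305
```

Filtering in JS this way is what lets one report serve every audience, as described in the audience bullet above.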

Test Coverage

  • Add PHPUnit tests for the newly added infrastructure.
  • Add Jest tests for the newly added selectors and actions.

QA Brief

Changelog entry

techanvil avatar Jan 24 '24 17:01 techanvil

AC ✔️

eugene-manuilov avatar Mar 11 '24 22:03 eugene-manuilov

  • Create assets/js/modules/analytics-4/datastore/custom-dimensions-partial-data.js file.

I think the file should be renamed to be more generic, something like partial-data.js because custom-dimensions- prefix refers to the custom dimensions matter which is just one out of three matters of the task.

Add the full infrastructure for determining and exposing the "partial data" states for audiences, custom dimensions and properties.

The "determining" part is missing in IB. We need to add instructions how to detect and save partial data information for all three matters.

eugene-manuilov avatar Mar 18 '24 17:03 eugene-manuilov

Thank you @eugene-manuilov for the review!

I think the file should be renamed to be more generic, something like partial-data.js because custom-dimensions- prefix refers to the custom dimensions matter which is just one out of three matters of the task.

Correct! I've updated the file name accordingly.

The "determining" part is missing in IB. We need to add instructions how to detect and save partial data information for all three matters.

The getResourceDataAvailabilityDate will either determine the first available date with data using a getReport request to the given resource with a 90-day report window (in resolver) or return the persisted date. We then use this date for the current date range in the is{audience|customDimension|Property}PartialData(resourceName) selectors to determine the partial data state. We can't persist the boolean value of this without needlessly complicating this, as this can be different based on the currently selected date range.

My thinking here is that something can be in partial data state for a 28-day range, but still can have all the data it needs for a 7-day range and thus not being in partial data. So by saving the first available date for a 90 day report instead, we can recompute the partial data state for all our supported date range.
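
To make that concrete, here is a rough sketch with made-up dates. The helpers are hypothetical, and the start-date arithmetic is simplified (reference date minus N days), so it may not match exactly how Site Kit derives its date ranges.

```javascript
// Hypothetical helpers for illustration only.
// Simplified assumption: the range starts `days` days before the reference date.
function getStartDate( referenceDate, days ) {
	const date = new Date( `${ referenceDate }T00:00:00Z` );
	date.setUTCDate( date.getUTCDate() - days );
	return date.toISOString().slice( 0, 10 );
}

// Partial when the first date with data falls after the start of the range.
function isPartialForRange( dataAvailabilityDate, referenceDate, days ) {
	return dataAvailabilityDate > getStartDate( referenceDate, days );
}

const exampleReferenceDate = '2024-03-19';
const exampleAvailabilityDate = '2024-03-01';

// 7-day range starts 2024-03-12, after the first date with data.
console.log( isPartialForRange( exampleAvailabilityDate, exampleReferenceDate, 7 ) ); // false
// 28-day range starts 2024-02-20, before the first date with data.
console.log( isPartialForRange( exampleAvailabilityDate, exampleReferenceDate, 28 ) ); // true
```

The same persisted date gives a different answer per range, which is why the date is stored rather than the boolean.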

Let me know what you think!

kuasha420 avatar Mar 19 '24 16:03 kuasha420

My thinking here is that something can be in partial data state for a 28-day range, but still can have all the data it needs for a 7-day range and thus not being in partial data. So by saving the first available date for a 90 day report instead, we can recompute the partial data state for all our supported date range.

Hey @kuasha420 @eugene-manuilov, just chipping in here as I had imagined we'd probably want to take the approach of requesting a report with a start date of the property creation time, that way we could get a definitive first-event-date and not keep requesting reports if say a property's events are all prior to the current 90 window. WDYT?
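
As a rough sketch of that idea, the report options would span from the property creation date to the reference date. The builder below is hypothetical: the option names follow the report args discussed in the IB, the exact dimension and metric shapes expected by the report selector are an assumption, and `propertyCreateTime` is assumed to be anything `new Date()` can parse (an RFC 3339 string or a millisecond timestamp).

```javascript
// Hypothetical builder; illustrative only, not Site Kit's actual selector.
function getAvailabilityReportOptions( {
	propertyCreateTime,
	referenceDate,
	customDimension,
} ) {
	// Start from the day the property was created (UTC), so the report
	// covers every possible event date for the property.
	const startDate = new Date( propertyCreateTime )
		.toISOString()
		.slice( 0, 10 );

	const dimensions = [ 'date' ];
	if ( customDimension ) {
		// Custom dimensions are addressed as `customEvent:<name>`.
		dimensions.push( `customEvent:${ customDimension }` );
	}

	return {
		startDate,
		endDate: referenceDate,
		dimensions,
		// Assumption: metrics are passed as objects with a `name` key.
		metrics: [ { name: 'eventCount' } ],
	};
}

const exampleOptions = getAvailabilityReportOptions( {
	propertyCreateTime: '2023-11-02T10:15:30Z',
	referenceDate: '2024-03-19',
	customDimension: 'googlesitekit_post_type',
} );
console.log( exampleOptions.startDate ); // 2023-11-02
```

Because the window always starts at creation time, the first date found is definitive and no further report requests are needed once it has been persisted.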

techanvil avatar Mar 19 '24 16:03 techanvil

My thinking here is that something can be in partial data state for a 28-day range, but still can have all the data it needs for a 7-day range and thus not being in partial data. So by saving the first available date for a 90 day report instead, we can recompute the partial data state for all our supported date range.

Hey @kuasha420 @eugene-manuilov, just chipping in here as I had imagined we'd probably want to take the approach of requesting a report with a start date of the property creation time, that way we could get a definitive first-event-date and not keep requesting reports if say a property's events are all prior to the current 90 window. WDYT?

Thanks, @techanvil. I think this is a good idea. @kuasha420, could you please update your IB to use what Tom suggests? We also need to make sure that this information is reset when the user changes Analytics settings.

eugene-manuilov avatar Mar 19 '24 19:03 eugene-manuilov

@eugene-manuilov Thank you for the review and @techanvil for the pointers. I've updated the IB accordingly and added an additional point in the AC about resetting the persisted dates (also reflected the addition in IB as well) based on the review and some internal discussion. Let me know what you think. Cheers.

kuasha420 avatar Mar 24 '24 04:03 kuasha420

Thanks, @kuasha420. Mostly looks good to me. Added a few pretty minor comments for you:

  • Method get_resource_data_availability_date_transient_name takes resource name ...
  • Method get_resource_data_availability_dates should return ...
  • Other methods get_resource_data_availability_date, set_resource_data_availability_date, reset_resource_data_availability_date etc ...

There is no need to duplicate data_availability_date in method names unless having it makes a big difference for the method. In other words, if we name the method get_resource_transient_name, it keeps the same meaning and is more concise.

  • get_resource_data_availability_dates -> get_resource_dates
  • get_resource_data_availability_date -> get_resource_date
  • set_resource_data_availability_date -> set_resource_date
  • reset_resource_data_availability_date -> reset_resource_date

... and resource type parameters and returns the computed transient name. ie. return "googlesitekit_custom_dimension_{$resource_type}_{$resource_name}_data_availability_date";

I believe the _custom_dimension_ part is not needed and the template should be googlesitekit_{$resource_type}_{$resource_name}_data_availability_date, right?

eugene-manuilov avatar Mar 26 '24 13:03 eugene-manuilov

Thanks @eugene-manuilov ! Your suggested names are shorter and while a little ambiguous (what date?), makes sense in the broader context, because the methods will be called from the class instance (ie. $this->resource_data_available_date->set_resource_date) so the meaning can be inferred. I've updated the method names and their references accordingly.

I believe the _custom_dimension_ part is not needed and the template should be googlesitekit_{$resource_type}_{$resource_name}_data_availability_date, right?

Yep, that's correct. It was ~~skill issue~~ copy paste error on my part, fixed!

Cheers.

kuasha420 avatar Mar 28 '24 07:03 kuasha420

Thanks, @kuasha420. IB ✔️

eugene-manuilov avatar Mar 28 '24 14:03 eugene-manuilov

QA Update ❌

Great work, @kuasha420. The functionalities work as expected except for an issue regarding removing the transient when disconnecting Analytics.

  • Verified: The test environment was set up successfully, and connections were established to two different Analytics properties:
    • A property with existing data (oi.ie), active for over 7 days.
    • A property recently created without data.
  • isAudiencePartialData Selector
    • Verified: Works as expected on both properties. It correctly identified partial data states based on audience data availability relative to the selected date range.
  • isCustomDimensionPartialData Selector
    • Verified: Works correctly, showing partial data states when the data for the custom dimensions is insufficient.
  • isPropertyPartialData Selector
    • Verified: Accurately reflects the partial data state when the GA4 is still in the gathering data state.
  • getResourceDataAvailabilityDate Selector
    • Verified: The selector successfully retrieves the earliest event dates, and the data is persisted through the POST:save-resource-data-availability-date endpoint to WordPress Transients.
    • Verified: The persistence of data availability dates is independent of the partial data state.
  • Resetting Behavior
    • Verified: All related transients are reset upon Site Kit reset.
    • Verified: The following transients related to data availability dates are correctly reset when changing the account or property or measurement ID.
      • _transient_googlesitekit_audience_**_data_availability_date
      • _transient_googlesitekit_customDimension_**_data_availability_date
      • _transient_googlesitekit_property_**_data_availability
    • Issue Found: _transient_googlesitekit_audience_**_data_availability_date is not being deleted upon disconnecting the Analytics module. However, the other two transients are correctly removed. ❌

hussain-t avatar May 09 '24 07:05 hussain-t

Excellent catch, thank you @hussain-t! The follow-up PR has been merged and this is now back with you for another QA:Eng round.

nfmohit avatar May 09 '24 20:05 nfmohit

QA Verified ✅

Issue Found: _transient_googlesitekit_audience_**_data_availability_date is not being deleted upon disconnecting the Analytics module. However, the other two transients are correctly removed. ❌

  • Verified: _transient_googlesitekit_audience_**_data_availability_date transient and other transients are removed upon disconnecting the Analytics module. ✅

hussain-t avatar May 10 '24 09:05 hussain-t