dataserver icon indicating copy to clipboard operation
dataserver copied to clipboard

creator as non-classic data object

Open abaevbog opened this issue 1 year ago • 6 comments

  • Removed ClassicDataObject as a parent class of Zotero_Creators
  • Removed all load()-ing of Zotero_Creators. An object is constructed with data right away.
  • One join in loadCreators to load all creators on GET requests, instead of sequential $item->load().
  • Some minor refactoring to try to avoid extra queries
  • One bulk-insert query for creators instead of multiple $creator->save() calls on create or update

abaevbog avatar Jul 21 '23 23:07 abaevbog

One of the consequences of removing libraryID from the creators table is that the items from the same shard essentially "share" creators across libraries. So if 2 items are created with the same creator's information, we have 1 creator entry, and 2 itemCreators entries linking this creator to items.

I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an item, we can check if its creator belongs to any other items, and delete them if not. Or, we can run a cron job every 30 minutes or so to fetch and delete creators that are not linked to any items.

I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver

abaevbog avatar Jul 24 '23 16:07 abaevbog

Sorry, should've noted this earlier — we should just do what we do for itemData and inline the values in itemCreators. That'll mean a bit of data duplication, but it'll be faster, and we really don't want to have to purge these in an event.

(The empty itemDataValueHash column in itemData is a holdover from many years ago when we had a separate itemDataValues table like the client.)

dstillman avatar Jul 24 '23 20:07 dstillman

inline the values in itemCreators

Got it. So then, we completely drop the creators table, and the itemCreators will become something like this:

CREATE TABLE itemCreators ( itemID bigint unsigned NOT NULL, firstName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, lastName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, fieldMode tinyint(1) unsigned DEFAULT NULL, creatorTypeID smallint(5) unsigned NOT NULL, orderIndex smallint(5) unsigned NOT NULL, PRIMARY KEY (itemID,orderIndex), KEY creatorTypeID (creatorTypeID), KEY name (lastName(7),firstName(6)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With a many-to-one relationship with the items table

abaevbog avatar Jul 24 '23 20:07 abaevbog

In my last commit, I got rid of creators table, moved all data storage about creators into the itemCreators table, and removed some code dealing with creators that we no longer need. For example, we no longer need the for-loop when we add creators, so I grouped creator insertion into one query. I think there are a few other things can simplify now but that's a start.

abaevbog avatar Jul 24 '23 23:07 abaevbog

Added a php script to move data from creators and itemCreators into the new itemCreators table on a shard-by-shard basis.

To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an old_ prefix. Then, in ApiController.php, a global variable $shardID is set - based on it and a hardcoded array of shardIDs, zotero_autoload in header.php decides to use old_* files for shards that were not updated yet or usual files for updated shards. I had no problems running tests or using the API after migrating one shard out of 2 locally.

abaevbog avatar Jul 25 '23 21:07 abaevbog

Brief summary of changes up to this point:

  • Added migration script to remove creators and tags tables and migrate to schema where all data is stored in itemCreators and itemTags.
  • In header.php, a temporary global variable shards is used to specify which shards have been migrated. Based on this, there is some conditional behavior in ItemController.php and TagController.php, depending on if the current shard was migrated or not.
  • When a method from any affected model is called, auto_loader checks current shard against migrated shards and selects either model/...inc.php - model with recent changes or model/old_...inc.php - model using old schema.
  • Both for creators and tags, load(), init(), and save() functions are removed. All data loading happens in the item model via loadCreators and loadTags directly into objects via constructor, and data saving/updating is done via static bulkInsert and bulkDelete.
  • loadTags, loadCreators, bulkCreate and bulkInsert batch multiple queries into one, so fetching or inserting 10 tags is 1 large transaction instead of 10.
  • changed property is set in loadTags, loadCreators - there is a check if the payload is different from existing tags/creators or not.
  • Zotero_Tag and Zotero_Creator properties now match the SQL columns, so there is some refactoring (e.g. in Cite.inc.php) to account for that.
  • All code for Tag and Creator models that is not used is removed. I thought it's easier to see changes that way.
  • Other minor changes, mainly in SQL joins, to work with new schema

I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators.

Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds. After: tags ~ 40 sec, creators ~ 20 sec. ~30 larger DB queries that took 0.4-0.6 seconds.

So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down.

abaevbog avatar Jul 31 '23 19:07 abaevbog