dataserver
dataserver copied to clipboard
creator as non-classic data object
- Removed ClassicDataObject as a parent class of Zotero_Creators
- Removed all
load()
-ing of Zotero_Creators. An object is constructed with data right away. - One
join
inloadCreators
to load all creators on GET requests, instead of sequential$item->load()
. - Some minor refactoring to try to avoid extra queries
- One bulk-insert query for creators instead of multiple
$creator->save()
calls on create or update
One of the consequences of removing libraryID
from the creators
table is that the items from the same shard essentially "share" creators across libraries. So if 2 items are created with the same creator's information, we have 1 creator
entry, and 2 itemCreators
entries linking this creator to items.
I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an item
, we can check if its creator belongs to any other items
, and delete them if not. Or, we can run a cron job every 30 minutes or so to fetch and delete creators
that are not linked to any items.
I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver
Sorry, should've noted this earlier — we should just do what we do for itemData
and inline the values in itemCreators
. That'll mean a bit of data duplication, but it'll be faster, and we really don't want to have to purge these in an event.
(The empty itemDataValueHash
column in itemData
is a holdover from many years ago when we had a separate itemDataValues
table like the client.)
inline the values in
itemCreators
Got it. So then, we completely drop the creators
table, and the itemCreators
will become something like this:
CREATE TABLE itemCreators
(
itemID
bigint unsigned NOT NULL,
firstName
varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
lastName
varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
fieldMode
tinyint(1) unsigned DEFAULT NULL,
creatorTypeID
smallint(5) unsigned NOT NULL,
orderIndex
smallint(5) unsigned NOT NULL,
PRIMARY KEY (itemID
,orderIndex
),
KEY creatorTypeID
(creatorTypeID
),
KEY name
(lastName
(7),firstName
(6))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
With a many-to-one relationship with the items table
In my last commit, I got rid of creators
table, moved all data storage about creators into the itemCreators
table, and removed some code dealing with creators that we no longer need. For example, we no longer need the for-loop when we add creators, so I grouped creator insertion into one query. I think there are a few other things can simplify now but that's a start.
Added a php script to move data from creators
and itemCreators
into the new itemCreators
table on a shard-by-shard basis.
To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an old_
prefix. Then, in ApiController.php
, a global variable $shardID
is set - based on it and a hardcoded array of shardIDs, zotero_autoload
in header.php
decides to use old_*
files for shards that were not updated yet or usual files for updated shards. I had no problems running tests or using the API after migrating one shard out of 2 locally.
Brief summary of changes up to this point:
- Added migration script to remove
creators
andtags
tables and migrate to schema where all data is stored initemCreators
anditemTags
. - In
header.php
, a temporary global variableshards
is used to specify which shards have been migrated. Based on this, there is some conditional behavior inItemController.php
andTagController.php
, depending on if the current shard was migrated or not. - When a method from any affected model is called,
auto_loader
checks current shard against migrated shards and selects eithermodel/...inc.php
- model with recent changes ormodel/old_...inc.php
- model using old schema. - Both for creators and tags,
load()
,init()
, andsave()
functions are removed. All data loading happens in theitem
model vialoadCreators
andloadTags
directly into objects via constructor, and data saving/updating is done via staticbulkInsert
andbulkDelete
. -
loadTags
,loadCreators
,bulkCreate
andbulkInsert
batch multiple queries into one, so fetching or inserting 10 tags is 1 large transaction instead of 10. -
changed
property is set inloadTags
,loadCreators
- there is a check if the payload is different from existing tags/creators or not. -
Zotero_Tag
andZotero_Creator
properties now match the SQL columns, so there is some refactoring (e.g. inCite.inc.php
) to account for that. - All code for
Tag
andCreator
models that is not used is removed. I thought it's easier to see changes that way. - Other minor changes, mainly in SQL joins, to work with new schema
I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators.
Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds. After: tags ~ 40 sec, creators ~ 20 sec. ~30 larger DB queries that took 0.4-0.6 seconds.
So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down.