nodejs-firestore
Improve read performance by using stale reads
In the documentation, it is mentioned that stale reads may improve the performance of reading from Firestore as data can be just fetched from the nearest replica without having to reconfirm with the leader replica: https://firebase.google.com/docs/firestore/understand-reads-writes-scale#stale_reads
I'm using the following code to perform a stale read:

export const STALE_READ_STALENESS = 60 * 1000; // 1 minute

const random = Math.random();
const useStaleReads = random < USE_STALE_READ_PERCENTAGE;
logger.profile(`stale-read-${random}`);
let snap: DocumentSnapshot<FirebaseFirestore.DocumentData>;
if (useStaleReads) {
  const maxDataStaleness: Date = new Date(
    new Date().getTime() - STALE_READ_STALENESS
  );
  snap = await firestore.runTransaction(
    async t => {
      return t.get(ref);
    },
    {
      readOnly: true,
      readTime: Timestamp.fromDate(maxDataStaleness),
    }
  );
} else {
  snap = await ref.get();
}
logger.profile(`stale-read-${random}`, {
  level: 'info',
  message: 'Read from Firestore',
  meta: {
    useStaleReads,
  },
});
Since the data does not change very often, serving content that is one minute (or even longer) stale is fine for us.
But what we are seeing is that the strong reads are faster than the stale reads:
Query used for analysing the logs
WITH latencies AS (
  SELECT
    timestamp,
    JSON_VALUE(json_payload.metadata.useStaleReads) AS uses_stale_reads,
    SAFE_CAST(JSON_VALUE(json_payload.metadata.profile.durationMs) AS FLOAT64) AS duration_in_ms
  FROM `simpleclub.global._Default._AllLogs` AS logs
  WHERE NORMALIZE_AND_CASEFOLD(logs.resource.type, NFKC) = "cloud_run_revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["revision_name"]), NFKC) = "cloud-run-revision"
    AND NORMALIZE_AND_CASEFOLD(SAFE.STRING(logs.resource.labels["service_name"]), NFKC) = "cloud-run-service"
    AND REGEXP_CONTAINS(SAFE.STRING(logs.json_payload["metadata"]["profile"]["id"]), "stale")
    AND JSON_VALUE(json_payload.metadata.useStaleReads) = "true"
)
SELECT
  STRUCT(
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(5000)] AS percentile_50,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(7500)] AS percentile_75,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9000)] AS percentile_90,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9500)] AS percentile_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9900)] AS percentile_99,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9950)] AS percentile_99_5,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9990)] AS percentile_99_9,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9995)] AS percentile_99_95,
    APPROX_QUANTILES(duration_in_ms, 10000)[OFFSET(9999)] AS percentile_99_99
  ) AS duration_in_ms,
  uses_stale_reads,
  COUNT(*) AS request_count
FROM latencies
GROUP BY uses_stale_reads
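For reference, `APPROX_QUANTILES(x, 10000)` returns 10001 quantile boundaries indexed 0 through 10000, so the offset for the p-th percentile is `p / 100 * 10000`. A small sketch of that mapping, matching the offsets used in the query above (the helper name is my own):

```typescript
// Maps a percentile (e.g. 99.9) to the OFFSET used with
// APPROX_QUANTILES(x, 10000), which returns 10001 boundaries
// indexed 0..10000.
function quantileOffset(percentile: number, buckets = 10000): number {
  return Math.round((percentile / 100) * buckets);
}

console.log(quantileOffset(50)); // 5000
console.log(quantileOffset(99.9)); // 9990
```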
I wanted to share this experience with you, and maybe I'm doing something wrong here... I'm not sure whether increasing the staleness to 60s (instead of 15s) breaks it?
Interesting data:
- We are using Firestore via GRPC (not REST)
- @google-cloud/firestore: v6.8.0
- Firestore database is hosted in eur3 (multi-region)
- Deployed on Cloud Run
- Always on CPU
- CPU start-up boost
- max 40 requests / instance
- 1st gen execution environment
- 1 CPU
- 4GiB memory
A quick test with 15s staleness shows very similar numbers.
There is an unfortunate implementation detail: transactions send a BeginTransaction request before your document get requests. Effectively, that means transactions send multiple requests where a regular get sends one.
We are looking to improve this.
The v1 FirestoreClient allows complete access to the communication protocol, including the ability to set readTime on get document requests. With this, you could achieve improved performance. However, it means taking responsibility for many of the things the regular API surface handles for you. Unless you really need this, I suggest you wait until we improve the regular API surface and/or optimize our handling of transactions with readTime.
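A rough sketch of what that could look like. This is a hypothetical illustration, not the official surface: the helper names and the injected client shape are mine; the assumption is that the v1 GetDocument request takes a full document resource path and accepts a readTime consistency selector as a protobuf-style timestamp.

```typescript
// Protobuf-style timestamp shape used by the low-level v1 API (assumption).
type ProtoTimestamp = {seconds: number; nanos: number};

// Convert a JS Date to the protobuf Timestamp shape.
function toProtoTimestamp(date: Date): ProtoTimestamp {
  const millis = date.getTime();
  return {
    seconds: Math.floor(millis / 1000),
    nanos: (millis % 1000) * 1e6,
  };
}

// `client` would be a low-level v1 FirestoreClient; it is injected here so the
// sketch stays self-contained. Document names use the full resource path, e.g.
// projects/<project>/databases/(default)/documents/foo/bar
async function getDocumentAtReadTime(
  client: {getDocument(req: object): Promise<unknown[]>},
  name: string,
  stalenessMillis: number
): Promise<unknown> {
  // Read at "now minus staleness", mirroring maxDataStaleness above.
  const readTime = toProtoTimestamp(new Date(Date.now() - stalenessMillis));
  const [doc] = await client.getDocument({name, readTime});
  return doc;
}
```

The trade-off mentioned above applies: at this layer you lose the convenience types (DocumentSnapshot, converters) and have to handle retries and resource paths yourself.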
Thank you for the question.
Interest in features like this from the developer community helps inform priorities for SDK development. I will be sure to pass this on. Feel free to tell us why this is important.
@tom-andersen Thanks for the provided details 👌
The reason I'm asking is that we are looking into this particular technique for a latency-sensitive service where we want to improve the latency even more.
We have already looked into and adopted techniques like caching, optimizing business logic, etc.
--
I could imagine the following designs for such a native read-time feature:
const firestore = getFirestore();
firestore.settings({
  readTime: Timestamp.fromDate(pointInTime),
});

(For use cases where you'd want all requests to read at a particular point in time. This would be useful for data recovery scripts, so you don't have to redefine the read time for every request.)
and/or:
getFirestore()
  .doc('foo/bar')
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });

getFirestore()
  .collection('foo')
  .where('bar', '==', true)
  .get({
    readTime: Timestamp.fromDate(maxDataStaleness),
  });
I've quickly implemented a version of this and ran some tests (10k requests) in a Cloud Shell: https://github.com/googleapis/nodejs-firestore/compare/main...simpleclub-extended:nodejs-firestore:feat/support-read-time-on-get
| Metric | With readTime | Without readTime | Improvement |
|---|---|---|---|
| 50th percentile | 16 ⭐ | 17 | -5.88% |
| 75th percentile | 18 | 18 | - |
| 87.5th percentile | 19 ⭐ | 20 | -5% |
| 93.75th percentile | 21 | 21 | - |
| 96.88th percentile | 23 | 23 | - |
| 98.44th percentile | 25 ⭐ | 27 | -7.41% |
| 99.22th percentile | 35 | 32 ⭐ | +8.57% |
| 99.61th percentile | 48 | 45 ⭐ | +6.25% |
| 99.80th percentile | 77 | 70 ⭐ | +9.09% |
| 99.90th percentile | 101 | 86 ⭐ | +14.85% |
| 99.95th percentile | 110 ⭐ | 112 | -1.79% |
| 99.98th percentile | 115 ⭐ | 359 | -67.97% |
| 99.99th percentile | 125 ⭐ | 565 | -77.88% |
| 99.99th percentile | 512 ⭐ | 1326 | -61.39% |
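(Reading the tables: the Improvement column appears to be the latency delta relative to the larger of the two values, negative meaning readTime was faster. That formula is inferred from the numbers above, not stated anywhere, so treat this sketch as an assumption:)

```typescript
// Assumed formula for the Improvement column: the delta between the two
// latencies, relative to the larger one, rounded to two decimals.
// (Inferred from the table values; not stated explicitly in the thread.)
function improvementPercent(
  withReadTime: number,
  withoutReadTime: number
): number {
  const delta = withReadTime - withoutReadTime;
  const pct = (delta / Math.max(withReadTime, withoutReadTime)) * 100;
  return Math.round(pct * 100) / 100;
}

console.log(improvementPercent(16, 17)); // -5.88 (p50 row)
console.log(improvementPercent(35, 32)); // 8.57 (99.22th percentile row)
```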
Test script
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });
  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(new Date().getTime() - 15 * 1000);
    await firestore.doc('always/the/same/document').get({
      readTime: Timestamp.fromDate(maxDataStaleness),
    });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}

run();
Okay, I quickly ran another test that randomly picks a document instead of reading the same document every time (as this may result in different behavior).
| Metric | With readTime | Without readTime | Improvement |
|---|---|---|---|
| 50th percentile | 10 ⭐ | 12 | -16.99% |
| 75th percentile | 12 ⭐ | 13 | -7.69% |
| 87.5th percentile | 13 ⭐ | 14 | -7.14% |
| 93.75th percentile | 14 ⭐ | 15 | -6.67% |
| 96.88th percentile | 16 ⭐ | 17 | -5.88% |
| 98.44th percentile | 18 ⭐ | 20 | -10% |
| 99.22th percentile | 20 ⭐ | 26 | -23% |
| 99.61th percentile | 25 ⭐ | 48 | -47.92% |
| 99.80th percentile | 54 ⭐ | 79 | -31.65% |
| 99.90th percentile | 73 ⭐ | 96 | -23.96% |
| 99.95th percentile | 96 ⭐ | 129 | -25.58% |
| 99.98th percentile | 110 ⭐ | 150 | -26.67% |
| 99.99th percentile | 138 ⭐ | 202 | -31.68% |
| 99.99th percentile | 145 ⭐ | 218 | -33.49% |
Test script
import {Firestore, Timestamp} from '@google-cloud/firestore';
import {createHistogram, performance} from 'perf_hooks';

async function run() {
  const firestore = new Firestore({
    projectId: '<project>',
  });
  const documentIds = await firestore
    .collection('the/test/collection')
    .listDocuments();
  console.log(documentIds.length);
  const histogram = createHistogram();
  for (let i = 0; i < 10000; i++) {
    const start = performance.now();
    const maxDataStaleness: Date = new Date(new Date().getTime() - 15 * 1000);
    const randomDocument =
      documentIds[Math.floor(Math.random() * documentIds.length)];
    await randomDocument.get({
      readTime: Timestamp.fromDate(maxDataStaleness),
    });
    const end = performance.now();
    histogram.record(Math.round(end - start));
  }
  console.log('min', histogram.min);
  console.log('max', histogram.max);
  console.log('mean', histogram.mean);
  console.log('stddev', histogram.stddev);
  console.log('exceeds', histogram.exceeds);
  console.log('percentiles', histogram.percentiles);
}

run();
Note: I don't get those numbers consistently 🤔
Looks like you were able to implement the optimization. This is a good test case, where the only difference is readTime.
Understanding why you see these latencies is a little beyond SDK support. I am sure there are other customer-specific factors in play, such as database size, concurrent writes, and warmup.
You may want to use Firebase support to get answers specific to your use case:
https://firebase.google.com/support/troubleshooter/firestore/queries
Can I help you with anything else?
Follow up for @IchordeDionysos. I asked internally, and was given some explanation:
Stale reads have two main values:
1. Avoiding any waits for pending writes. So if they are comparing strong vs. stale reads on a read-only workload, there is likely little difference.
2. Using the non-primary region for reads. If they are using a regional instance, then this one isn't applicable.
In your case, (2) is applicable.
You should run the workload (a) without transactions and (b) from europe-west4 instead of europe-west1.
@IchordeDionysos The next release of the SDK will include an optimization for transactions with readTime. It reduces the number of requests required and thereby reduces latency. Feel free to run your test again with version 7.3.1 or newer.
https://github.com/googleapis/nodejs-firestore/pull/2002