
State of the Industry Assignment + Extra Credit

Open mikeizbicki opened this issue 1 year ago • 41 comments

Due date: Before you take the final exam

Background: There are many videos linked below. Each video describes how a major company uses postgres, and covers some of their lessons learned. These videos were produced for various industry conferences and represent the state-of-the-art in managing large complex datasets. It's okay if you don't understand 100% of the content of each of the videos, but you should be able to learn something meaningful from each of them.

Instructions:

  1. For each video below that you choose to watch, write 3 facts that you learned from the video. Make a single reply to this post that contains all of the facts.

  2. The assignment is worth 2 points. If you watch:

    • 1 video, you get 1 point
    • 2 videos, 2 points
    • 4 videos, 3 points
    • 8 videos, 4 points

    So you can get up to a 4/2 on this assignment.

Videos:

About postgres:

  1. Scaling Instagram Infrastructure

    https://www.youtube.com/watch?v=hnpzNAPiC0E

  2. The Evolution of Reddit.com's Architecture

    https://www.youtube.com/watch?v=nUcO7n4hek4

  3. Postgres at Pandora

    https://www.youtube.com/watch?v=Ii_Z-dWPzqQ&list=PLN8NEqxwuywQgN4srHe7ccgOELhZsO4yM&index=38

  4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

    https://www.youtube.com/watch?v=BgcJnurVFag

  5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

    https://www.youtube.com/watch?v=4GB7EDxGr_c

  6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

    https://www.youtube.com/watch?v=eZhSUXxfEu0

  7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

    https://www.youtube.com/watch?v=kd-F0M2_F3I

  8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

    https://www.youtube.com/watch?v=M7EWyUrw3XQ&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=6

  9. Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow, another adtech company)

    https://www.youtube.com/watch?v=PzGNpaGeHE4&list=PLlrxD0HtieHjSzUZYCMvqffEU5jykfPTd&index=13

About general software engineering:

  1. Mastering Chaos - a guide to microservices at netflix

    https://www.youtube.com/watch?v=CZ3wIuvmHeM&t=1301s

  2. Why Google Stores Billions of Lines of Code in a Single Repository

    https://www.youtube.com/watch?v=W71BTkUbdqE

    (If you watch this, keep in mind it's an old 2015 video and try to imagine the increase in scale over the last 7 years.)

  3. The kubernetes documentary. Recall that Google developed kubernetes as a more powerful version of docker-compose. (Note that this documentary has 2 parts, and you must watch both parts to count as a single video.)

    https://www.youtube.com/watch?v=BE77h7dmoQU

    https://www.youtube.com/watch?v=318elIq37PE

mikeizbicki avatar Apr 16 '24 15:04 mikeizbicki

  • Why Google Stores Billions of Lines of Code in a Single Repository
  1. There are thousands of commits a week and there are more commits by automated systems than by humans.
  2. All users work in the style of trunk-based development combined with a centralized repo, and there isn't significant use of branching for development
  3. Changes to base libraries are instantly propagated through the dependency chain to the final product to avoid technical debt build up.
  • Lessons learned scaling our SaaS on Postgres to 8+ billion events
  1. The speaker suggested creating the right product before worrying about scale
  2. Moving their Postgres database from a single node to a clustered node database that has a coordinator node and multiple worker nodes helped scale their project for 1 billion events
  3. They decided to clear old, stale data out of hot storage so that users got up-to-date statistics; keeping less data in hot storage also meant lower costs and more profit
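A minimal sketch of what fact 3 might look like in SQL, with an invented table name and retention window (real deployments often drop whole time partitions instead of deleting row by row):

```sql
-- Hypothetical retention job: clear events older than 12 months out of hot storage.
DELETE FROM events
WHERE created_at < now() - interval '12 months';
```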

rachelHoman avatar Apr 20 '24 00:04 rachelHoman

Scaling Instagram Infrastructure

  1. There are two different types of services—storage and computing. Storage servers store global data that needs to be consistent across multiple data centers, whereas computing servers process user requests, are temporary, and can be reconstructed from the global data.
  2. Memcache is very important in scalability and without it the databases would be failing. It is a high-performance key-value store in memory. It provides millions of reads and writes per second.
  3. Each process has two parts of memory—the shared part and the private part. The code, in fact, is a big part of where the memory is used up, so code should be optimized.

Postgres at Pandora

  1. Pg_dump is slow to generate, and it is even slower to restore. Further, it can block other processes, so it’s not great. This is why Pandora moved to pg_basebackup, which nonetheless has its own problems; for example, the connection to the database may be terminated under the load of moving the data.
  2. Pandora originally had SLONY replication, meaning that it only replicated tables and sequences.
  3. Autovacuum runs for more than a week per table, which can be inconvenient due to the locks that it needs.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. The rate at which changes were being committed was increasing rapidly in 2015—today it must be even higher.
  2. In 2015, there were about 15k commits per workday by humans!
  3. The google workflow is usually as follows: the user links their workspace to the repo, they write their code, they review their code, and, finally, they commit (not before their code is reviewed by humans and other software, though)!

The Evolution of Reddit.com’s Architecture

  1. Locks can slow queues down and inhibit us from getting real time data (this happened to them in mid-2012)
  2. Sometimes, when reddit has a parent comment that is very popular, they have to make their own queue for it called a Fastlane because it tends to slow the rest of the site down otherwise.
  3. Sanity checks are unbelievably important—especially when working on a high-traffic tool like reddit, and especially when terminating servers!

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. Adjust, like many other large-scale companies, uses MapReduce to break work down, speeding up processing and keeping it close to real time.
  2. You can create data structures to decrease the number of bytes that data takes up (Adjust does this with country names)
  3. Autovacuum is not triggered nearly enough to keep being efficient, so Adjust changed the requirement at which autovacuum is triggered.

Citus: Postgres at Any Scale

  1. Citus can make tables more efficient by allowing the creation of reference tables, which you can then simply join against for the information needed (see the sketch after this list).
  2. Citus has better COPY performance because it sends the commands to each of the worker nodes in parallel, which speeds everything up.
  3. With three different implementations of INSERT…SELECT (co-located, re-partitioning, and merge step), co-located works at about 100M rows per second, re-partitioning at about 10M rows per second, and merge step at about 1M rows per second.
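A minimal sketch of the reference-table idea from fact 1, with an invented schema; it assumes a Citus cluster where create_distributed_table and create_reference_table are available:

```sql
-- A distributed fact table, sharded by site_id.
CREATE TABLE page_views (site_id bigint, country_code text, viewed_at timestamptz);
SELECT create_distributed_table('page_views', 'site_id');

-- A small lookup table replicated to every worker node.
CREATE TABLE countries (code text PRIMARY KEY, name text);
SELECT create_reference_table('countries');

-- Because the reference table exists on every worker, this join runs locally on each node.
SELECT c.name, count(*)
FROM   page_views v
JOIN   countries c ON c.code = v.country_code
GROUP  BY c.name;
```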

Data modeling, the secret sauce of building and managing a large scale data warehouse

  1. To track Wi-Fi connectivity or audio and video quality, they would use a measure, which is time series data on a Windows device.
  2. One of the reasons they use Citus is called dynamic compute to compute the inner aggregations over precomputing the inner aggregations which can be costly.
  3. They use partial covering indexes. Some tables can have more than 50 indexes, which takes up a lot of storage, so they aim for sequential IO instead of random IO.
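A sketch of what a partial covering index could look like, with an entirely invented schema (the talk doesn't show the real one): only the rows a common dashboard query touches get indexed, and the column it reads is INCLUDEd (PostgreSQL 11+) so the query can be answered from the index.

```sql
CREATE TABLE measures (
    device_id    bigint,
    event_time   timestamptz,
    status       text,
    metric_value double precision
);

-- Partial: only failure rows are indexed; covering: metric_value is stored in the index,
-- so a common "failures per device" query never has to touch the heap.
CREATE INDEX measures_failures_idx
    ON measures (device_id, event_time)
    INCLUDE (metric_value)
    WHERE status = 'failure';
```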

Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. Optimizing from 1 to 1M events was not too hard, as running out of storage wasn’t an issue and queries were already fast, so it just meant creating primary indexes. They found that instead of thinking about optimization while building the product, it's better to focus on building the product first.
  2. The reason they had trouble scaling up to 100M from 1M is because as the events database grew, the amount of data needed to bring into memory for each query was too large for a single node database.
  3. When they wanted to change the analytical tables' primary keys to the big integer column type, they were able to do it within about 30 minutes by creating a new big integer id column, indexing it, updating the old ids, and switching over.
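A rough sketch of the int-to-bigint switch described in fact 3, using invented table and column names; a real migration would backfill in batches and keep the final swap in a short transaction:

```sql
-- 1. Add the new 64-bit column and backfill it (batched in practice).
ALTER TABLE events ADD COLUMN new_id bigint;
UPDATE events SET new_id = id WHERE new_id IS NULL;

-- 2. Build a unique index without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY events_new_id_idx ON events (new_id);

-- 3. Swap the primary key over in one short transaction.
BEGIN;
ALTER TABLE events ALTER COLUMN new_id SET NOT NULL;
ALTER TABLE events DROP CONSTRAINT events_pkey;
ALTER TABLE events ADD CONSTRAINT events_pkey PRIMARY KEY USING INDEX events_new_id_idx;
COMMIT;
```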

epaisano avatar Apr 20 '24 06:04 epaisano

Why Google Stores Billions of Lines of Code in a Single Repository (1)

  1. There are many advantages of a monolithic repository but the main reason why it works and why it's worth investing in is that it is great for collaborative culture.
  2. The Diamond Dependency problem is when it is difficult to build A because A depends on both B and C, while B and C depend on different versions of D (D.1 and D.2); this is a problem when there are a lot of repos to update at the same time.
  3. In order for monolithic repos to work at such a large scale, you also need to consider code health. One example is API visibility: setting it to private by default encourages more consideration and more "hygienic" code.

The Evolution of Reddit.com's Architecture (1)

  1. Permissions are really useful: one Postgres crash they encountered happened because processes were able to write to the secondary database.
  2. Sanity checks and observability are really important; multiple layers of safeguards plus simple-to-understand code are key to preventing problems.
  3. Timers in code are good: they give us cross-sections, as well as p99s, which provide a lot of information that helps us trace the problems/causes of weird cases.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018 (1)

  1. Unless we have multiple TB of data, we're better off optimizing a single instance; this approach is better for dealing with heavy velocity and big data
  2. By default, autovacuum doesn't kick in until we have 50 rows plus 20% of the table being "dead" but if we have a lot of rows, it doesn't keep up
  3. istore is an hstore-like type for integers that lets us model things like time series and supports arithmetic operations on the values

Mastering Chaos - A Netflix Guide to Microservices (1)

  1. Autoscaling is fundamental as it can replace nodes easily, so the loss of a node isn't a problem for us; it also gives us computing efficiency.
  2. A stateful service is a database/cache where the loss of a node is a notable event and may take hours to replace that node. We can deal with this with EVCache, where there are multiple copies.
  3. Some solutions to excessive load include workload partitioning, request-level caching, and secure token fallback, which embeds a token into the device itself; the token hopefully has enough information on the customer to let them access the platform.

myngpog avatar Apr 21 '24 04:04 myngpog

Why Google Stores Billions of Lines of Code in a Single Repository

  1. Back when this talk was presented, on an average workday ⅔ of commits made to the repository are automated, and the other ⅓ of commits are made by humans.
  2. Branching is rarely used as a version control strategy for development at Google; if a branch is created, it is usually because a new feature is being released.
  3. Google has developed many source systems, such as Piper and CitC, which allow the monolithic repository structure to function properly. Piper is Google’s version of GitHub, and CitC is the system in which developers view and edit files across the codebase.

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. Min Wei talked a bit about what Citus Data actually is in this talk, and why Microsoft acquired it. Essentially, from what I gathered Citus is an open-source Postgres extension that distributes data and queries in a way that improves performance and scalability.
  2. In the use case that Min Wei presented, they use a partial covering index. Some “commonly queried tables” can have up to 50 indices on them.
  3. In the data schema for a measure table, which is time series diagnostic data from Windows devices, the rows in the table are separated into top level and secondary dimensions. I’m a bit curious as to how this hierarchical structure is actually implemented within Postgres.

oliver-ricken avatar Apr 22 '24 00:04 oliver-ricken

Why Google Stores Billions of Lines of Code in a Single Repository

  1. One advantage of a monolithic code base is that it reduces technical debt and makes dependencies easier to manage, as library updates are propagated through the entire code base.
  2. A single repository requires significant amounts of tooling to place an emphasis on code health and scalability, which is very expensive but worth the benefits of a single repository.
  3. The code base is a tree, with certain people in charge of subdirectories. It is their responsibility to determine whether something is committed or not.

The Evolution of Reddit.com's Architecture

  1. The speaker emphasizes the importance of timing your code, as it allows one to get a cross section of the code and root out problematic areas within it.
  2. Reddit has something called a Fastlane which allows Reddit to manually mark a thread and give it dedicated processing. A bug was found in it in 2016 due to comments being stuck in the non-fastlane queue.
  3. Peer-reviewed checklists and sanity checks are crucial, especially with “destructive actions.”

ben-smith23 avatar Apr 22 '24 02:04 ben-smith23

Why Google Stores Billions of Lines of Code in a Single Repository

  1. The first thing I learned is how much data is stored in their monolithic repository. 1 billion files, 2 billion lines of code, and 86 terabytes of data.
  2. I learned about the google workflow. All their code is synced, then an engineer writes code. It is then reviewed by humans and automated tests, and finally able to be committed.
  3. Finally, I learned about all the tools that have been built to maintain the code. For example, checking for dead code and unused dependencies.

Scaling Instagram Infrastructure

  1. I learned some of the issues Instagram faced. This includes how an rm -rf operation on a container wiped out a couple of the servers. As well, it is good to have multiple datacenters as they can be unreliable at times.
  2. I learned how Instagram has so many frameworks and databases in their stack. They split it up into computing and storage.
  3. I learned how Instagram minimizes their CPU usage and server usage. They do this by monitoring CPU instructions and seeing how much a new feature increases them.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. There is a 5 minute window between when a request is made and when it shows up on the dashboard.
  2. There are one to two hundred thousand requests coming in a second, and even more during peak times. There are also over 400TB of data to analyze.
  3. The general architecture is: take requests from the internet, write to the backend, materialize to analytics shards, and show on the dashboard. The company uses MapReduce to aggregate data from many servers.

Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

  1. Windows diagnostic data is from an audience of 1.5 billion devices.
  2. A typical windows measure could have data points from 200m devices per day and they get 2 million distinct queries per day
  3. They use Citus because it works with PostgreSQL, has built-in custom data types, and is both scalable and reliable.

rajabatra avatar Apr 22 '24 03:04 rajabatra

1. Scaling Instagram Infrastructure:

  1. There is a concept called ScaleUp, which is the idea of using as few CPU instructions as possible to achieve a goal. This consists of writing good code and using as few servers as possible. Each server would be able to serve more users when this method is implemented.
  2. There are two types of memories: shared and private memory. Code is a big part of this memory, so it is important to reduce the amount of code to free up memory space. Instagram removed dead code and optimized their code to achieve this.
  3. There are risks to everyone working on Master: if people misconfigure, updates could be prematurely released. But according to the speaker, they haven't really seen any downsides.

2. The Evolution of Reddit.com's Architecture

  1. r2 is the original Reddit, and it is monolithic in that each server has the same code. However, each server may be running different parts of the code, but they all have the same general code.
  2. There have been queue issues in the past, although the queue is usually processed quickly. Reddit had issues with this in 2012 with their "vote" feature. By adding timers, they found that locks were what was causing the problems.
  3. Locks are not great in general, and should only be used if you really have to use them. Reddit is trying different data models so they can go "lockless" (as of the time this video was released).

3. Postgres at Pandora:

  1. Pandora decided to transition to Postgres from Oracle because Postgres is free and open source, and they wanted to increase their scale. Postgres offered all the benefits they were looking for.
  2. In terms of monitoring current activity, Pandora was not doing much but they have since increased their emphasis. Now, they are doing more including monitoring current activity, errors, long duration queries and transactions, blocked processes, and more.
  3. Pandora made an in-house developed product called Clustr. It is a non-ACID database, so it is made for Pandora's specific needs. It is intended for sharded databases.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. Their PostgreSQL deployment is VERY big: it handles over 100k requests per second (at minimum). It also has over 400 TB of data to analyze, and a very high data velocity.
  2. There are some challenges regarding data modeling. These include the large volume of data, the fact that most data is new, the fact that there are a lot of btrees, and more.
  3. By default, autovacuum doesn't trigger until 50 rows plus 20% of a table are dead, and this couldn't keep up at their scale. They tried reducing that 20% to 8%; another approach was to change it to 150k rows + 0% -- however, this also caused issues. Autovacuum required significant fine-tuning in order to work well.
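A sketch of how such a per-table autovacuum override looks in Postgres (the table name is invented, and the numbers are just the ones quoted above, not a recommendation):

```sql
-- Trigger autovacuum after ~150k dead rows regardless of table size,
-- instead of the default 50 rows + 20% of the table.
ALTER TABLE requests SET (
    autovacuum_vacuum_threshold    = 150000,
    autovacuum_vacuum_scale_factor = 0.0
);
```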

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

  1. One difference between common stack and a growing stack lies in scaling out. A common stack has one master database, but a growing stack has two standby databases. Scaling out is very important.
  2. pg_dumps are very important. They allow for testing pg_restore, and this tells you whether or not your database is corrupt.
  3. Servers are paired such that each server is backed up by one Barman server. The backups alternate between the pairs, and having multiple Barman servers helps with disk space.

6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  1. In the beginning (for smaller amounts of data), pg_dumps are the best thing to start out with for backups. They are a simple and efficient backup strategy, and they work quickly. It is better to use pg_dump than to spend tons of time coming up with an elaborate backup strategy.
  2. A best practice is to be intentional with work_mem and base it on actual temporary files being created in the logs. You can set it to two or three times the largest file.
  3. You should keep shared buffers to 16-33GB. Going above this will increase checkpoint activity without much actual performance benefit.
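A sketch of applying that sizing advice with ALTER SYSTEM; the values are purely illustrative, and the point from the talk is to derive work_mem from the temporary files you actually see in the logs:

```sql
ALTER SYSTEM SET shared_buffers = '16GB';   -- takes effect only after a server restart
ALTER SYSTEM SET work_mem = '96MB';         -- e.g. ~3x the largest temp file seen in the logs
SELECT pg_reload_conf();                    -- reloads settings that don't need a restart
```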

7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

  1. Citus is an open source extension for PostgreSQL that turns it into a distributed database. Some of the abilities it provides are: distributed tables with co-location, reference tables, and query routing, among others.
  2. Citus is useful when a database is bigger than 100GB or will become bigger, when you are dealing with multi-tenant applications, and for real-time analytics dashboards.
  3. Microsoft Windows is one of the biggest users of Citus. It uses Citus to make ship/no-ship decisions. Citus allows for split-second analytics on billions of JSON events.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

  1. A primary use case of Citus is providing customer experience insights. It allows for data rich decision dashboards and ad hoc queries.
  2. A measure refers to time series data on a specific use case executed on a device running Windows 10 or 11. It has dimension columns and metric columns.
  3. The data Windows is dealing with is very large as a typical Windows measure could easily have data points from 200M devices per day.

meghnapamula avatar Apr 22 '24 03:04 meghnapamula

The Evolution of Reddit.com’s Architecture:

  1. The core data model (called “Thing”) stores accounts, links, and subreddits, and uses Postgres
  2. Thing is r2’s oldest data model
  3. They are starting to move Thing into its own service (from r2). Upside: it is no longer tangled in legacy code

Data modeling, the secret sauce (Citus Con)

  1. Windows collects data from 1.5 billion devices
  2. use hyperloglog to aggregate across millions of devices (hyperloglog is a custom data type built into Citus), for example in finding the denominator of the failure rate (see the sketch after this list)
  3. some common tables have 50+ indexes (very high dimensional)
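As a sketch of the hyperloglog aggregation in fact 2, using the postgresql-hll extension with an invented schema (this assumes the hll extension is installed alongside Citus):

```sql
CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE device_events (day date, device_id bigint);

-- Approximate number of distinct devices per day, without storing every device id.
SELECT day,
       # hll_add_agg(hll_hash_bigint(device_id)) AS approx_distinct_devices
FROM   device_events
GROUP  BY day;
```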

Lessons learned scaling our SaaS on Postgres (Citus Con)

  1. Postgres easy to deploy and scale, but needed to expand past single node databases
  2. Citus Postgres extension to distribute Postgres (sharding)
  3. cluster nodes: coordinator node and multiple worker nodes, 2-4x as fast

Why Google Stores Billions of Lines of Code in a Single Repository

  1. 1 billion+ files, 86 terabytes, 45 thousand commits per workday
  2. CitC cloud-based system allows users to see their local changes overlaid on top of the Piper repository (where the single repo is stored)
  3. trunk-based development; Piper users work at “head”, and all changes immediately update the view of other users (no branching). Creates “one source of truth”

Mastering Chaos - A Netflix Guide to Microservices

  1. Cap theorem: in the presence of a network partition, you must choose between consistency and availability (Netflix: chose availability, “eventual consistency’ through cassandra).
  2. “Stateless service”- not a cache or database, frequently accessed metadata, loss of a node is no issue, survives instance failure after applying Chaos Monkey… vs “stateful service”- databases and caches, custom apps w/ large amounts of data, loss of a node is notable
  3. Conway’s law- any piece of software reflects the organizational structure that produced it (relevant to the idea of integrating new software/breaking away things towards the goal of long term architecture)

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

  1. MapReduce from backend servers to analytics shards, and from shards to client dashboards (usually under 5 minutes)
  2. use istore for integers; useful for modelling time series, and it supports GIN indexing
  3. even in 400TB analytics environment, PostgreSQL still capable of scaling up

Scaling Instagram Infrastructure

  1. Three priorities for scaling: scale out (more hardware, servers) scale up (take full advantage of servers) scale dev team.
  2. use postgres to store user/media/”friendship” data, typically stored with a master copy and several replicas to account for high rates of reading/writing
  3. Scaling up: use as few CPU instructions as possible and reduce CPU demand, and use as few servers as possible. For example, multiple URL generation to allow for differences in devices can be simplified to one function call

PostgreSQL at Pandora

  1. Migrated to PostgreSQL from oracle, primarily due to finances, open source, and high-scale
  2. developed in-house distributed database called Clustr; not ACID compliant, but works well with sharded databases (compromise consistency for availability)
  3. Scaling issues: WAL backlog (150 WAL files per minute); wraparound, vacuum: autovacuum runs for more than a week per table

lbielicki avatar Apr 22 '24 04:04 lbielicki

v1: Scaling Instagram Infrastructure

  1. There are three dimensions of scaling: scale out means adding more servers and data centers; scale up means making each added server count; scale dev team means enabling more developers to move fast without breaking things.
  2. Instagram's tech stack centers around Django, which is stateless Python code that handles requests; Django is the compute. Storage consists of Cassandra, memcache, and other storage systems. Instagram needs to store data in different memcaches and keep the data in sync across data centers.
  3. Instagram monitors CPU activity through Linux, analyzes it with cProfile, and optimizes. After iterations, they realized C code is faster.

v2: The Evolution of Reddit.com's Architecture

  1. Reddit's tech stack consists of the core - R2, api, listing, search, thing, rec, cdn. r2 sits behind a load balancer that separates the systems.
  2. The load balancer separates the functions and minimizes the negative impact of one queue on the overall user experience. For example, when the vote queue is long, users who are only retrieving data through the front-end don't need to wait.
  3. The vote queue has slowed down multiple times throughout the development of Reddit. They first parallelized it and then introduced the load balancer. Reddit tries to log each action to make debugging fast.

v3: Postgres at Pandora

  1. Pandora's architecture revolves around a core nexus, leveraging music metadata and radio functionalities. Their data warehouse, supported by an in-house database like Clustr, forms the backbone for managing vast volumes of user data efficiently, enabling seamless scalability.
  2. The talk highlighted Pandora's robust monitoring system, focusing on detecting and addressing issues promptly. This includes tracking errors, frequency of occurrences, identifying and resolving long-duration queues and transactions, monitoring for blocked processes, and staying vigilant about any errors logged in syslog files.
  3. Pandora emphasizes monitoring changes and the rate of change within their systems. They employ sophisticated alerting mechanisms to detect any deviations from normal operations promptly. This proactive approach allows them to respond swiftly to evolving conditions and maintain service reliability.

v4: PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  1. Citus offers a solution for scaling PostgreSQL databases seamlessly. By serving as an extension, Citus transforms PostgreSQL into a distributed database, capable of handling datasets ranging from 100GB to larger sizes suitable for multi-tenant applications and real-time analytics dashboards.
  2. Citus enables the creation of a data warehouse environment, particularly suited for relatively static queries. This setup allows for efficient querying of vast datasets, enhancing analytical capabilities without compromising performance.
  3. The architecture involves a coordinator node that orchestrates communication with worker nodes. Through commands like create_distributed_table, administrators can seamlessly connect different shards, ensuring data distribution across the cluster for optimal performance and scalability.
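A minimal sketch of that command in use, assuming the Citus extension and an invented table (not the talk's actual schema):

```sql
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE clicks (
    campaign_id bigint NOT NULL,
    clicked_at  timestamptz NOT NULL,
    payload     jsonb
);

-- Shard the table by campaign_id across the worker nodes; the coordinator
-- routes queries that filter on campaign_id to the shard that holds them.
SELECT create_distributed_table('clicks', 'campaign_id');
```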

v5: Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. Following its acquisition by Microsoft, Citus serves various use cases, including handling Windows diagnostic data from an enormous user base of 1.5 billion devices
  2. These use cases often involve managing data-rich environments for decision-making dashboards and accommodating ad hoc queries efficiently.
  3. Citus manages time-series data related to specific Windows 10 use cases. Leveraging PostgreSQL's rich data types, particularly hstore and JSON, administrators can optimize data storage and retrieval. They typically utilize JSON for staging tables to handle flexible data structures and hstore for dynamic columns in reporting tables, ensuring efficient querying and analysis.

v6: Mastering Chaos: a Netflix guide to microservices

  1. Netflix operates under the understanding of the CAP theorem, which states that in the event of a network partition, a system must choose between consistency and availability. Netflix opts for eventual consistency, particularly leveraging Cassandra for its distributed database needs.
  2. Following a significant error on Christmas Eve in the us-east-1 AWS region, where all services were affected due to a single point of failure, Netflix diversified its infrastructure across multiple AWS locations to enhance fault tolerance and resilience.
  3. Auto scaling groups play a vital role in Netflix's infrastructure. They enable dynamic scaling of resources based on demand fluctuations, ensuring optimal resource utilization and performance across their services.

v7: why google stores billions of lines of code in a single repository

  1. Google's repository hosts an enormous amount of data: 1 billion files, 35 million commits, and over 2 billion lines of code, with a staggering 45,000 commits per workday and a daily peak of 800,000 queries per second, including configuration files.
  2. Google opts for a monolithic repository for various reasons, including unified versioning, extensive code sharing and reuse, facilitating atomic changes, enabling collaboration across teams, enhancing code readability, and eliminating ambiguity regarding file versions.
  3. Google follows a trunk-based development model with a single repository. Users typically work at the head, ensuring a single, consistent version, with branching being rare and primarily used for releases.

v8: kubernetes documentary

  1. The DevOps movement evolved from virtualization to cloud computing, with AWS becoming a dominant force in 2013, notably introducing services like S3, revolutionizing infrastructure management.
  2. Kubernetes functions like a post office, managing the logistics of application deployment by abstracting away underlying infrastructure details. Containers are likened to envelopes, encapsulating applications and their dependencies for streamlined deployment and execution.
  3. Embracing open-source principles allows for transparency and collaboration, enabling anyone to view and modify code. This fosters the creation of vibrant communities, driving innovation and collective problem-solving.

giffiecode avatar Apr 22 '24 04:04 giffiecode

Scaling Instagram Infrastructure

  1. Instagram's ability to scale efficiently involved three primary strategies: scaling out by increasing the number of servers to match user growth, scaling up by improving server efficiency to handle more operations, and scaling the development team to ensure that new features could be deployed quickly without compromising the system's stability.
  2. The transition from AWS to Facebook's data centers was pivotal. This move facilitated better disaster recovery preparedness and allowed Instagram to leverage Facebook's advanced monitoring and scaling tools.
  3. Instagram significantly improved performance through software optimizations, such as memory optimization and CPU usage reduction. These improvements were achieved by refining code efficiency, optimizing data storage and retrieval processes, and employing tools to pinpoint and rectify high-cost functions in their codebase.

Postgres at Pandora

  1. Initially, Pandora used Oracle but switched to PostgreSQL due to cost issues and the desire for scalability and open-source flexibility. The transition involved creating predefined DBMS classes to manage numerous PostgreSQL instances effectively.
  2. Pandora faced significant challenges in managing data consistency, backup processes, and replication strategies. Innovations include implementing monitoring tools for database activity, optimizing backup processes with tools like PG dump and PG base backup, and managing replication to enhance data availability and disaster recovery.
  3. To address limitations in high availability and scalability, Pandora developed an in-house solution called "Clustr" ("cluster" without an 'e'). This system allows for more flexible database management, including seamless schema updates and PostgreSQL version upgrades without downtime.

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

  1. The presentation highlights the importance of careful planning in infrastructure management, especially when dealing with large databases and multiple servers. It stresses the need for scaling databases both vertically and horizontally, utilizing cloud services alongside on-premises servers to ensure flexibility and scalability.
  2. High availability is emphasized through strategies like streaming replication and logical replication to ensure data consistency and system reliability. The presenter discusses the implementation of multiple standby databases and replication slots to manage potential data lags and system failures.
  3. Effective backup strategies are crucial for data security and recovery. The talk covers the use of tools like Barman for managing physical backups and PG Dump for logical backups, ensuring data can be restored accurately and efficiently.

Citus: Postgres at any Scale

  1. Citus extends PostgreSQL to manage distributed databases, effectively allowing PostgreSQL to scale horizontally across multiple servers. This extension helps overcome hardware limitations on storage and memory, enabling databases to grow without physical constraints.
  2. The video highlights Citus’s capability to distribute tables across various servers, which helps in managing very large datasets efficiently. Citus supports distributed transactions, ensuring that operations either completely succeed or fail across all involved servers, thus maintaining data integrity.
  3. Citus is particularly effective in handling real-time analytics and multi-tenant SaaS applications. It provides the flexibility to accommodate growing data demands by distributing loads across several nodes.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. Google's approach consolidates all its code into one massive repository, providing a single source of truth. This structure simplifies access and modification of the code, ensuring that all developers work with the most current and correct version of any file.
  2. The monolithic repository facilitates extensive code reuse and sharing across different teams at Google. It significantly simplifies dependency management, allowing changes to any library to be propagated instantly through the entire codebase. This setup prevents the "diamond dependency problem" where different projects might depend on multiple versions of the same package, which can lead to compatibility issues.
  3. Google has developed sophisticated tools to handle the scale of its repository. Tools like Piper, Critique, and Tricorder integrate with the code repository to support code review, static analysis, and automated testing.

Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. ConvertFlow initially chose PostgreSQL due to its compatibility with Ruby on Rails and ease of setup on Heroku. This choice supported their early growth, handling up to a million events without significant issues. This stage emphasized the importance of selecting technology that aligns with the team's familiarity and the project's immediate needs.
  2. As event volumes grew, the original single-node PostgreSQL setup began to struggle, particularly when the service scaled to around 100 million events. The team initially experimented with Firebase to alleviate load issues but eventually transitioned to Citus, a PostgreSQL extension that allows efficient database sharding. This shift was critical in managing larger data volumes effectively, improving query performance by distributing the workload across multiple nodes.
  3. Upon reaching billions of events, further optimizations were necessary. ConvertFlow implemented strategies such as data rollups for frequently accessed reports and periodic clearing of old data from active storage to maintain performance and manage costs.
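A minimal sketch of the data-rollup strategy from fact 3, with an invented schema; it assumes a raw events table with campaign_id and created_at columns:

```sql
CREATE TABLE events_daily_rollup (
    day         date,
    campaign_id bigint,
    views       bigint,
    PRIMARY KEY (day, campaign_id)
);

-- Refresh yesterday's rollup (run periodically); dashboards read this table
-- instead of scanning the raw events.
INSERT INTO events_daily_rollup (day, campaign_id, views)
SELECT created_at::date, campaign_id, count(*)
FROM   events
WHERE  created_at >= current_date - 1
  AND  created_at <  current_date
GROUP  BY 1, 2
ON CONFLICT (day, campaign_id) DO UPDATE SET views = EXCLUDED.views;
```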

Mastering Chaos - a guide to microservices at netflix

  1. One of the biggest challenges in a microservices architecture is managing dependencies and interactions between services. Netflix experienced issues where failures in one service could cascade and impact others, affecting overall system stability. To manage this, Netflix developed Hystrix, a library designed to handle the complexity of inter-service communication by implementing patterns like circuit breakers, fallbacks, and isolation techniques.
  2. To ensure the robustness of their system, Netflix uses a technique called "fault injection testing" which involves deliberately introducing faults into the system to test resilience and confirm that fallback mechanisms function correctly in production.
  3. Netflix emphasizes the importance of identifying and defining "critical" microservices that are essential for basic user functionality, such as browsing and streaming. By focusing testing and reliability strategies on these critical services, they can ensure that the core functions of their platform remain operational even when peripheral services are compromised.

The Evolution of Reddit.com's Architecture

  1. Reddit began with a monolithic architecture centered around a large Python application known as 'r2'. The platform started transitioning towards a more modern architecture by introducing microservices. This shift was primarily motivated by the need to improve scalability and manageability, with new services written in Python to maintain compatibility with the existing codebase.
  2. The front-end engineers at Reddit, frustrated with the outdated aspects of 'r2', developed new, modern front-end applications using Node.js. These applications interact with Reddit's back-end through APIs, reflecting a shift towards separating the user interface from the data processing layers.
  3. Reddit's architecture for handling listings (ordered lists of links) and comments involves complex caching and data normalization strategies to manage the high volume of user interactions efficiently. For listings, Reddit caches IDs and sort data to avoid expensive database queries. For comments, the platform uses a threaded model, where the parent-child relationships are stored to facilitate easier rendering of comment threads.

danzhechen avatar Apr 22 '24 06:04 danzhechen

Mastering Chaos - a guide to microservices at netflix

  1. Fault Injection Testing (FIT) allows Netflix to test their service after they think that it is working. Through it, they can have functional tests through the whole system, testing each microservice individually but also within the context of any adjacent microservices that use it. Additionally, it allows them to test their scalability requirements using a percentage of the real traffic the system would face.
  2. Eventual Consistency is the notion of preferring Availability over Consistency as per the CAP Theorem. Netflix uses Cassandra for this, where a write is usually done to a single node, which then propagates the information to other nodes later on. Information is considered written when a certain fraction of the nodes have received the information.
  3. Putting all your eggs in one basket is always a bad idea. For example, Netflix used to only use the US-East-1 AWS servers, and when it went down, all of Netflix also went down. Hence, they split their load across multiple regional servers so that when one fails, the remaining servers can handle the failed server's load for the meantime.

Scaling Instagram Infrastructure

  1. Instagram uses one main database that they write to from Django, and that database writes to the other "read" replica databases eventually. When Django then needs to perform a read, it reads directly from the "read" replica databases in order to prioritize availability.
  2. In order to scale memcache across regions, Instagram uses cache invalidation. After the Postgres replication is performed, a daemon running on each Postgres replica invalidates the cache in its region. After this, each cache is aligned with its read replica database.
  3. To scale their memory to a 20+% capacity increase, Instagram reduced the amount of code stored by running in optimized mode (-O) and removing dead code. They also moved configuration data into shared memory and disabled garbage collection in Python.

Eshaan-Lumba avatar Apr 23 '24 05:04 Eshaan-Lumba

Scaling Instagram Infrastructure

https://www.youtube.com/watch?v=hnpzNAPiC0E

  1. Initially, when comments are made, they are inserted into the Postgres server and updated to a local memcache.
  2. If users accessing the same media are served from different data centers, there is a problem as the memcache in one data center isn't updated with new comments from the other center. This results in users seeing stale comments.
  3. To address this, instead of updating memcache from Django, they use Postgres replication to update memcache on each Postgres replica, ensuring consistency across regions. This approach forces users to re-read directly from Postgres, avoiding stale cache issues.

The Evolution of Reddit.com's Architecture https://www.youtube.com/watch?v=nUcO7n4hek4

  1. Python application at the heart of Reddit stores accounts, links, subreddits, and comments in postgres with memcache
  2. Listings are an ordered set of links. Instead of directly querying the database, listings are put in memcache, with asynchronous job queues handling updates. Queue partitioning was implemented to address slowdowns caused by lock contention
  3. Reddit has faced challenges transitioning away from postgres, including replication failures and incorrect data in cached listings

abizermamnoon avatar Apr 23 '24 07:04 abizermamnoon

Scaling Instagram Infrastructure

  1. Scaling involves “scaling out” which is adding more servers, “scaling up” which is making sure that these servers are used efficiently and “scaling dev team” which involves expanding the engineering team to be more productive.
  2. Services could be of 2 types: storage or computing. Storage servers store global data and need to be consistent across data centers, while computing servers process requests on a user-demand basis.
  3. There is a need for Instagram to move their data centers closer to where users are as that would help them reduce latency.

The Evolution of Reddit.com’s Architecture

  1. Adding timers helped Reddit identify why the processing of requests was taking a very long time in 2012, and ultimately helped them figure out that locks were the source of these issues.
  2. Looking at the p99s also helps us identify where the problem cases might be, so having a strategy to get info out of them is very important.
  3. Specialized solutions, such as creating a separate queue (Fastlane), are necessary to manage high-traffic areas effectively and prevent slowdowns in the overall system.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. In 2015, Google’s giant monolithic repo contained over 1 billion files, over 35 million total commits and sees about 45 thousand commits on an average workday, with two-thirds of these daily commits coming from automated systems!
  2. This strategy allows for easy code reuse by engineers, collaboration across teams, and allows engineers to make backwards changes easily.
  3. However, there are also some very important costs associated with this model, including codebase complexity, unnecessary dependencies, and inefficiencies in the form of maintaining abandoned projects that remain in the repo.

Lessons Learned Scaling our SaaS on Postgres to 8+ Billion Events

  1. At the very early stages of a startup, scaling shouldn’t be the priority and startup founders should focus on identifying the right customers and markets- “Premature optimization is the root of all evil.”
  2. The speaker stresses the need for startup founders to expect their stack and vendors to change as their company grows and gets more users.
  3. However, he believes that Postgres is a great database for startups to initially start with and scale.

nati-azmera avatar Apr 24 '24 02:04 nati-azmera

Citus: Postgres at any Scale

  1. While not directly programming related, I wasn't aware of the google cloud credits available to startups and those pursuing education.
  2. I learned that postgres was a single-node database. This likely was already covered somewhere in the reading, but this seems to suggest that single-node databases have only one VM.
  3. I learned that the standard (4-byte) integer column type has an upper bound of about 2.1 billion values.
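For reference, the 2.1 billion figure is the ceiling of Postgres's 4-byte integer type; a quick way to see it:

```sql
SELECT 2147483647::integer;      -- the largest value an integer column can hold (~2.1 billion)
SELECT 2147483648::bigint;       -- one past that limit still fits easily in a bigint
-- SELECT 2147483648::integer;   -- would raise: integer out of range
```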

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. A measure is time series data about a specific use case executed on a Windows 10 or Windows 11 device.
  2. Windows uses the partial covering index because storage on the cloud is incredibly expensive.
  3. I may have missed it in the first video, but from this video I learned that Citus works as an extension to postgres, which allows for multi-node usage rather than the typical single-node.

GusAlbach avatar Apr 24 '24 15:04 GusAlbach

1. Scaling Instagram Infrastructure

  • Instagram employs three scaling strategies: scaling out by adding more servers, scaling up by making each server more efficient, and scaling the dev team to stay productive.
  • Memcache enhances Instagram's infrastructure by swiftly delivering frequently accessed data from memory, reducing database load and latency.
  • To reduce latency, Instagram aims to locate data centers closer to its user base.

2. The Evolution of Reddit.com's Architecture

  • Reddit's architecture initially revolved around a monolithic application called 'r2', which served as the backbone of the platform. This monolithic structure deployed the same code across all servers, although different servers handle different parts of code.
  • Reddit initially faced challenges with processing requests in 2012 and eventually identified locks as the source of the slowdowns.
  • Analyzing percentile metrics helps Reddit pinpoint problem areas in their system, enabling effective troubleshooting

3. Postgres at Pandora

  • Pandora migrated from Oracle to PostgreSQL primarily for cost reasons and to take advantage of the open-source nature of PostgreSQL.
  • They developed an in-house distributed database called Clustr, which sacrifices ACID compliance for better scalability and performance.
  • Pandora faced scaling issues such as WAL (Write-Ahead Logging) backlog and vacuum inefficiencies, which required careful management and optimization.

4. PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of postgres)

  • PostgreSQL is utilized at a massive scale, handling over 100k requests per second and managing more than 400 TB of data for analytics purposes, with a significant data velocity.
  • Challenges in data modeling arise due to the large volume of data, frequent influx of new data, and the prevalence of btree structures and this necessitates careful optimization for efficient performance.
  • AutoVacuum requires fine-tuning to address issues such as dead rows in tables.

5. Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All! (AdTech use of postgres)

  • Implementing high availability strategies such as streaming replication and logical replication ensures data consistency and system reliability in large-scale PostgreSQL deployments.
  • Effective backup strategies using tools like Barman are crucial for data security and recovery in large database environments.
  • Utilizing cloud services complemented with on-premises servers ensures flexibility and scalability in managing large databases and multiple servers.

6. Breaking Postgres at Scale (how to configure postgres for scaling from 1GB up to many TB)

  • Proper configuration of PostgreSQL settings is essential for scaling from small to large datasets, requiring adjustments in memory allocation, caching, and parallelism. Tuning PostgreSQL for high-traffic environments involves optimizing settings related to connection pooling and query execution.
  • Regular monitoring and performance testing help identify bottlenecks and inefficiencies.

7. Citus: Postgres at any Scale (Citus is a company specializing in scaling up postgres that Microsoft bought)

  • Citus extends PostgreSQL into a distributed database, allowing for handling datasets from 100GB to larger sizes.
  • Citus supports distributed transactions to ensure data integrity across all involved servers, maintaining consistency in a distributed environment.
  • Citus enables efficient querying of vast datasets, enhancing analytical capabilities without compromising performance.

8. Data modeling, the secret sauce of building & managing a large scale data warehouse (The speaker is the Microsoft employee responsible for purchasing Citus)

  • He emphasized prioritizing product development over scaling in a startup's early stages and recommended choosing well-supported technologies like Postgres.
  • ConvertFlow initially faced performance issues and high costs, leading them to temporarily use Firebase before migrating to Citus for scalable database management.
  • The speaker highlighted the importance of adapting technology stacks as businesses grow, showcasing their migration from Firebase to Citus for scalable database management, which enabled them to efficiently handle billions of events.

tylerting avatar Apr 24 '24 19:04 tylerting

Scaling Instagram Infrastructure

  1. You can use postgres replication to sync up data between servers deployed in different data centers. Each postgres server can also be used to invalidate memcache, but this design increases load on the postgres servers.
  2. Memcache lease is a way to reduce load on the actual postgres servers. Clients use lease-get to request data from the memcache, and they can either get permission to go to the database to get the data themselves, or wait, or use stale data.
  3. There are functions in C and Python that help you profile specific parts of the code (for example, perf_event_open in C and cProfile.Profile() in Python).

Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

  1. Cubing is a type of pre-computation. It can be used to compute the inner parts of a two-level aggregation. If one of the things you are grouping by has high cardinality (many distinct values it can take), this pre-computation can become much more expensive (the curse of dimensionality). I thought this was pretty interesting, but I’m not sure if I got the details right.
  2. hyperloglog is a data structure that can be used to estimate the number of unique items. It is useful if you just want an estimate and care about computational efficiency
  3. An hstore column allows you to add key/value data into postgres where the keys can differ from row to row. This lets you make larger “schema”-like changes to the tables without actually changing the schema. One of the downsides of using hstore is that the values are stored as text, so for indexing you need to use something like GIN, and it’s harder to query for specific properties in the column (got this info from the video and then using ChatGPT).
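A small sketch of the hstore "dynamic columns" idea from fact 3, with an invented schema; the GIN index supports operators like @> so key lookups don't have to scan the whole table:

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE device_report (
    device_id bigint,
    metrics   hstore        -- per-row key/value pairs; new keys need no schema change
);

CREATE INDEX device_report_metrics_idx ON device_report USING gin (metrics);

-- Containment query that can use the GIN index (hstore values are stored as text).
SELECT device_id
FROM   device_report
WHERE  metrics @> 'wifi_drop_count => 3';
```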

echen4628 avatar Apr 24 '24 20:04 echen4628

The evolution of reddit.com’s architecture

  1. R2’s oldest data model stores data in postgres
  2. Referring to items in postgres that didn’t exist caused a lot of errors in ~2011 – the cause of that was R2 trying to remove dead databases
  3. The autoscaler changes the number of servers used based on demand – it does this because each post is tracked by a daemon

Postgresql at 20tb and beyond

  1. MapReduce is used to make things faster
  2. Restarting postgres doesn’t disrupt the backend servers because of how the servers are run
  3. The mapreduce procedure used by the materializer aggregates data and transfers it between servers

Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. The data that is being discussed is collected from an audience of 1.5 billion devices, and it goes through a processing pipeline to make it more usable for the team
  2. There is an inner and an outer query when the team tries to identify device id – this is a query the presenter described as “bread and butter” for their team – they pre-compute the inner query
  3. They use json for staging tables and then hstore for dynamic columns – the hstore type is smaller, so there is less io for fetching the data

Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. At the beginning of their company, they ran on postgres due to that being the database that rails and Heroku used. Scaling up within the early stage was easy because it meant creating indexes
  2. The speaker advised developing a marketable product that you know customers would buy prior to thinking about hypothetical issues of scale
  3. Postgres is cheap and easy to deploy and also scale – it is easy for early stage companies and has been proven to be usable even for very large websites

Scaling Instagram infrastructure

  1. There is a difference between storage and computing servers. Storage is handled using postgres
  2. Cassandra is used to store user activities and is important for scalability
  3. Database replication occurs across regions, but computing resources are contained within one region

Mastering chaos – a Netflix guide to microservices

  1. A problem that they faced early on is that there was a monolithic codebase and also a monolithic database – this caused crashes every time something went down
  2. Things being deeply interconnected means that there is a lot of difficulty in making any changes – this is why this is a bad way to build services today
  3. A microservice is a style of building an application as a collection of small services – each service has its own task. This is a response to monolithic applications

Citus: postgresql at any scale

  1. Citus is a service that creates distributed database abilities within postgresql. Citus distributes tables across clusters
  2. Unique constraints on tables must include the distribution column (within distributed tables) – this is a limitation of distributed tables
  3. There are some missing features in this model, such as replicating triggers and table inheritance

Breaking postgresql at scale

  1. How you use postgres changes operationally as a database grows, but postgres can generally handle any size that you need
  2. For small databases (like 10 gigabytes), postgresql will run quickly, even if you are always doing sequential scans
  3. Don’t just randomly create indexes – they take up space/time. It is better to create indexes in response to specific problems; don’t just add them randomly, because they will do things like slow down insert time
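A small sketch of the "create indexes in response to specific problems" advice (invented table and column): confirm the slow filter first, then build the index concurrently so writes aren't blocked while it builds.

```sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
```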

nliyanage avatar Apr 24 '24 23:04 nliyanage

The Evolution of Reddit.com's Architecture

  1. Reddit uses SQL queries and primary keys just like we are learning now to index their subreddits and listings.
  2. Using print statements helped Reddit determine what was causing voting issues: domain listings had to be split up so the queries could be processed separately, which improved vote processing.
  3. Comment trees can be optimized and organized using offline job processing, which puts them into ordered batches.

Citus: Postgres at any Scale

  1. Postgres is able to plug queries to another extension if it has certain features, in this case create_distributed_table would go through the Citus planner instead.
  2. Complex queries can be fully sent to worker nodes if they filter by the distribution column and the distributed tables are co-located (a sketch follows after this list).
  3. Citus is beneficial as it does not have limits on CPU, memory, storage, I/O capacity and has the full range of PostgreSQL capabilities and extensions at hand.
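
As mentioned in fact 2 above, here is a minimal sketch of co-located distributed tables and a query that Citus can route entirely to one worker. It assumes the Citus extension is installed, and the table names are invented for illustration:

```sql
CREATE TABLE campaigns (
    company_id  bigint NOT NULL,
    campaign_id bigint NOT NULL,
    name        text,
    PRIMARY KEY (company_id, campaign_id)
);

CREATE TABLE events (
    company_id  bigint NOT NULL,
    campaign_id bigint NOT NULL,
    occurred_at timestamptz NOT NULL DEFAULT now()
);

-- Shard both tables on company_id and co-locate them, so matching shards live
-- on the same worker node.
SELECT create_distributed_table('campaigns', 'company_id');
SELECT create_distributed_table('events', 'company_id', colocate_with => 'campaigns');

-- Because the join and the filter both use the distribution column, Citus can
-- send this whole query to a single worker instead of coordinating the cluster.
SELECT c.name, count(*) AS event_count
FROM campaigns c
JOIN events e USING (company_id, campaign_id)
WHERE c.company_id = 42
GROUP BY c.name;
```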

agemeda avatar Apr 25 '24 17:04 agemeda

2/2 points

Scaling Instagram Infrastructure

1 - A sizable portion of people working at Facebook/Instagram have not been there for a long time: 30% of the engineers joined in the last 6 months, and there are interns/bootcampers etc.

2 - When launching new features, Instagram uses gates to control access and gradually releases in the following order: engineers, dogfood, employees, some demographics, world.

3 - Instagram uses no branches in order to enjoy the benefits of continuous integration, collaborate easily, and they can still easily bisect problems and revert when needed.

The Evolution of Reddit.com's Architecture

1 - Reddit had issues with vote queues filling up too quickly and slowing processing down, so they partitioned votes into different queues so that there would be less fighting/waiting at the same lock. Ultimately, they split up the queues altogether so that the votes would no longer contend on the domain listings.

2 - For comment trees, some threads experience a lot of comments, so Reddit created a fastlane for these special cases. However, the fastlane queue filled up quickly, used too much memory, and caused child comments to be processed ahead of parent comments that were not in the fastlane. Reddit used queue quotas to solve this problem.

3 - Reddit faced difficulties in migrating to new servers because their autoscaler terminated the new servers after it was restarted, so they realized there need to be more sanity checks when running commands that make major changes.

baron-zeng avatar Apr 26 '24 07:04 baron-zeng

Scaling Instagram Infrastructure:

  1. I thought it was interesting that the entirety of Instagram is able to run on only 7 data centers worldwide, I would’ve guessed that they used many more before watching this presentation.
  2. There’s a difference between storage and computing services in data centers, storage is used for global data across many data centers whereas computing is temporary and can be reconstructed from global data.
  3. Memcache is a key-value store that can provide millions of read/writes per second but isn’t able to provide consistency across different networks.

The Evolution of Reddit.com’s Architecture:

  1. Reddit uses asynchronous queues to handle expensive actions on their site such as commenting on a post to allow them to process in the background without slowing their main processes.
  2. Reddit at its core is just a list of links and its original purpose can be boiled down to a simple SQL query but as this got too expensive they started using a cache and mutating it.
  3. Reddit is able to manually mark larger threads to be designated to their own queues called the “fastlane queue” so that larger threads don’t take up too much of their processing power at once.

PostgreSQL at Pandora

  1. Pandora decided to use PostgreSQL because it was free and they wanted to be open source.
  2. They use a nexus environment that combines subjective data about the music with objective data and user data to create classes within their Postgres service.
  3. They utilize replication in their production database, to both local and remote storage, to protect against outages.

PostgreSQL at 20TB and Beyond

  1. They utilize two map reduces to aggregate data from other servers and transfer it to their server, usually in about 5 minutes, which can be 100-200k actions a minute.
  2. At a large scale it's hard to change the data types because of issues with rewriting tables and minimizing downtime emphasizing the importance of getting the data types right initially.
  3. Incremental map reduces can help reduce further and aggregate more data on demand.

Large Databases, Lots of Servers, on Premises, in the Cloud

  1. RAID 10 should always be used for servers, even though it is more expensive, since it provides better storage performance than RAID 5 or 6.
  2. Your backup data centers should be located further away from your main data center to ensure that your data will be safe regardless of what happens to the primary data center.
  3. pg_dump backups are important to use because they have many testing options to make sure that your restore is safe and able to be accessed.

Breaking PostgreSQL at Scale

  1. Databases under 10 GB can be queried any way, even if it's a sequential scan, and they will still run fine.
  2. To figure out how much memory a PostgreSQL database needs, check whether the whole database fits into memory; if not, try to fit the top 1-3 largest indexes. More memory doesn't always help performance.
  3. If there's too much data for pg_dump to be used, PITR can be used for backups which can restore to a point in time.

Citus PostgreSQL at any Scale

  1. Citus is best used at 100gb and beyond and for either multi-tenant applications such as SAAS apps or real-time analytics dashboards such as IoT.
  2. Citus has clusters that consist of multiple PostgreSQL servers that appear as only one server with each server being able to handle 100-200 transactions a second.
  3. Citus adds a feature to Postgres called create_distributed_table which distributes a table across a cluster of databases.

Data Modeling, the Secret Sauce of Building & Managing a Large Scale Data Warehouse

  1. Windows collects data from over 1.5 billion devices in their databases.
  2. Windows has about 10 million queries (2 million of them distinct) per day, which poses a technical challenge for their databases.
  3. A Windows measure can have data points from 200 million devices a day and only fails occasionally.

gibsonfriedman avatar Apr 29 '24 07:04 gibsonfriedman

The Evolution of Reddit.com's Architecture

  1. More hardware does not equal more performance. The code also needs to be written to take advantage of the increased hardware.
  2. Permissions should be used appropriately to prevent accidental operations.
  3. Databases shouldn't be expected to have a general solution for every case. Exceptions such as fast lane demonstrate how special cases should be handled when necessary.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. Monolithic repos allow Google developers to work off of the same version of everything.
  2. Monolithic model demands and allows for maintenance of code health. Tools were made to find dead code and unused dependencies can be removed.
  3. Custom repo management tools are required for Google's scale. These tools are the biggest con of monolithic model. A huge investment is required for this model to work.

justinchiao avatar Apr 30 '24 19:04 justinchiao

"Mastering Chaos - A Netflix Guide to Microservices" talk

  1. Designing for Resilience: Emphasizing the importance of designing microservices for resilience to prevent cascading failures, ensuring that issues in one service don’t compromise the entire system.
  2. Decentralized Data Management: Highlighting the need for decentralized data management to enhance scalability and fault tolerance across various microservices.
  3. Continuous Delivery and Automation: Leveraging continuous delivery and automation to efficiently manage, deploy, and scale microservices, allowing for frequent updates with minimal downtime.

Scaling Instagram's Infrastructure

  1. Infrastructure Transition and Scaling: Initially hosted on Amazon Web Services (AWS), Instagram later moved to Facebook data centers, expanding its infrastructure from one to multiple data centers to enhance scalability and fault tolerance.

  2. Database and Data Center Strategy:

    • Utilizes Django for backend interactions with a Postgres database.
    • Data writes are directed to a primary data center, while two additional data centers serve as replicas handling read operations to distribute load and increase read efficiency.
  3. Performance Optimization Tools:

    • Deployed tools like COLLECT for experimentation to assess CPU usage impacts of new features and ANALYZE with Python Cprofile to optimize high-impact functions, aiming to reduce CPU consumption and server requirements.

Overview of Reddit's Backend Service

  1. Software Architecture and Technologies:

    • Reddit’s backend is primarily developed in Python, with a focus on splitting from its legacy codebase. It uses Thrift or HTTP for client interactions and integrates Content Delivery Networks (CDNs) to efficiently route requests.
  2. Load Balancing and Management:

    • Implements Elastic Load Balancers (ELB) to manage and distribute user requests across its infrastructure, isolating application areas to prevent slowdowns in one section from affecting the performance of others.
  3. Database Management and Scalability:

    • Manages data using SQL and memcache for caching mechanisms. Encounters scalability challenges, particularly with database locks in high-traffic scenarios, which can delay vote processing and content updates on popular posts.

Why Google Stores Billions of Lines of Code in a Single Repository

  1. The rate of commits being done to the repo is growing exponentially, and a majority of the commits are done automatically.

  2. Google's workflow is: connect to a workspace, write code, have it reviewed (both by people and by automation), then commit. Each commit has an owner, and if you're trying to commit, you need permission.

  3. A major positive about this is that they don't have to do tedious and time-consuming merging, since the code base is unified and there isn't much confusion about the version of a file.

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale (AdTech use of PostgreSQL)

  1. Their single PostgreSQL databases reach about 20 TB; to expand beyond that, they use multiple servers, which makes the database size of the entire system much bigger.

  2. They aggregate their data, shuffle it, and then map-reduce it to produce the output.

  3. At some point they were running out of 32-bit integers for the advertisement-tracker IDs; solving it by changing the column's data type took them about two months.

Breaking PostgreSQL at Scale

  1. The largest databases in the PostgreSQL community are petabytes in size, and PostgreSQL can handle it; however, what works for a 10 GB DB is different from a petabyte DB.

  2. Basic backups for a 10 GB database are pg_dump runs, scheduled with a cron job every 6 hours, or dumps to S3.

  3. Monitoring can be done by processing logs with pgBadger, by using pg_stat_statements for real-time query performance, or with New Relic, Datadog, and other web app monitoring services.

Lessons learned scaling our SaaS on PostgreSQL to 8+ billion events (ConvertFlow, another adtech company)

  1. Talks about what scaling looks like in a startup, i.e., short-term and long-term goals and how you can scale accordingly.

  2. Used Citus to break up the database into multiple nodes, which results in smaller shards and faster queries.

  3. Reindexing the tables made their storage footprint decrease, since PostgreSQL's indexes have improved over the years; the reindexing also has to be done concurrently, or else the database's queries halt.
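
A small sketch of the reindexing point above, assuming PostgreSQL 12 or newer and a hypothetical index name:

```sql
-- Plain REINDEX takes locks that block activity on the table, so the
-- CONCURRENTLY variant (PostgreSQL 12+) is used instead; a rebuild often
-- shrinks an index that has bloated over years of updates.
REINDEX INDEX CONCURRENTLY events_account_id_idx;

-- Compare the index size before and after the rebuild:
SELECT pg_size_pretty(pg_relation_size('events_account_id_idx'));
```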

Postgres at Pandora

  1. Changed from Oracle as they wanted to use open-source and a much more affordable option.

  2. They implemented monitoring of current activity, looking at PostgreSQL's current logs and going through the error logs and long-running-query logs.

  3. Created an in-house database system called Clustr; it is not ACID compliant and is intended for sharded databases, where each group has two or more databases tied together by the Clustr application.

  4. Dealing with transaction ID wraparound took around two weeks, which is way too long.
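
Related to fact 4, a common way to watch how close each database is getting to transaction ID wraparound (standard catalog columns, so a vacuum can be scheduled before it becomes an emergency):

```sql
-- Age of the oldest unfrozen transaction ID per database, as a fraction of the
-- ~2 billion limit where wraparound protection kicks in.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_toward_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```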

KentaWood avatar May 02 '24 06:05 KentaWood

Scaling Instagram Infrastructure

  1. Facebook conducts drills called "storms" to ensure that Instagram's services can withstand regional failures of data centers.
  2. A "rm -rf" command nearly wiped out some crucial parts of Instagram's data services.
  3. By adding the number of likes to a table, Instagram was able to reduce the query time down from 100 milliseconds to 10 microseconds.
  4. Instagram does not uses branches when doing continuous integration.
  5. Instagram does load testing in a gated production environment.
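
A hypothetical illustration of the denormalized like counter from fact 3; the schema is invented for illustration, not Instagram's actual tables:

```sql
-- Keep a like_count column on the media row instead of counting on every read.
CREATE TABLE media (
    media_id   bigint PRIMARY KEY,
    like_count bigint NOT NULL DEFAULT 0
);

CREATE TABLE likes (
    media_id bigint NOT NULL REFERENCES media (media_id),
    user_id  bigint NOT NULL,
    PRIMARY KEY (media_id, user_id)
);

-- Expensive on a huge table: an aggregate over many rows on every page load.
SELECT count(*) FROM likes WHERE media_id = 42;

-- Cheap: a single-row primary-key lookup of the precomputed count.
SELECT like_count FROM media WHERE media_id = 42;

-- Keep the counter in sync when a like is written (only counts new likes).
WITH ins AS (
    INSERT INTO likes (media_id, user_id)
    VALUES (42, 7)
    ON CONFLICT DO NOTHING
    RETURNING media_id
)
UPDATE media
SET like_count = like_count + 1
WHERE media_id IN (SELECT media_id FROM ins);
```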

Why Google Stores Billions of Lines of Code in a Single Repository

  1. Google's single repository, which contains all of its code, has about 1 billion files, 2 billion lines of code, and has 35 million historic commits.
  2. The number of changes committed to the repository has exhibited exponential growth since 2004, when Gmail was launched. Most of the growth of total commits per week has been driven by automated commits and changes via programs like Rosie.
  3. Just like Instagram, Google does not make significant use of branching for development.

Kubernetes: The Documentary

  1. For a long time, Amazon was dominant in the cloud services department with their service AWS.
  2. Docker used containers to allow scalability to be more accessible to hobbyist computer scientists, including coders with both development and operations background.
  3. Google made the decision to make Kubernetes, which can be likened to a post-office, open source.
  4. The release of Kubernetes was announced at Dockercon on June 10, 2014. Many other container orchestration projects were also released on the same day.
  5. Initially, other companies like Facebook and Netflix chose Docker Swarm and Mesos over Kubernetes. As time went on, larger products like PokemonGo used Kubernetes.

The Evolution of Reddit.com's Architecture

  1. Reddit is the 4th largest website in the U.S. serving 320 million users every month.
  2. Like Instagram uses caches to bypass the need for expensive queries to count likes, Reddit uses caches to bypass the need for expensive queries to count votes.
  3. Like we do, Reddit used a bunch of timers and print statements to understand what was causing large delays in vote processing.
  4. Throughout its existence, Reddit has dealt with a lot of bugs and setbacks in regards to its data maintenance.

henrylong612 avatar May 03 '24 05:05 henrylong612

Scaling Instagram Infrstructure

  • Instagram had 130 million new users sign up in the first half of 2017.
  • The company created a denormalized table for its likes, which sped up index lookups for like counts by 10x.
  • Instagram ships code 40-60 times a day!

The Evolution of Reddit.com's Architecture

  • In 2018, Reddit had 70 million searches each day.
  • In mid 2012, vote queues stacked for hours and hardware upgrades and scaling did not fix the issue. Instead, it was lock competition/contention that caused this; to fix it, they fully split processing after looking at 99th percentile outliers.
  • After "fastlaning" a thread in 2016, the queue filled up with messages and filled the message broker's memory. This issue was caused by a missing parent. Now they use queue quotas to prevent resource hogging.

Postgres at Pandora

  • Pandora switched to Postgres due to significant advantages open source coding provided and the heavy-handedness of Oracle.
  • The company employed people who would analyze a track and generate metadata for it. Crazy how far we've come.
  • For Pandora, a disadvantage of using clustering was that joins, subselects, and more weren't possible, leaving them with essentially a "flat-file database".

PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

  • Adjust tracked over 2 billion data points in 2017
  • They do this with only standard Postgres and some extensions.
  • With data on this scale, it is important to hand-hold new rollouts and features. Small errors and gains will be much larger.
  • The company is confident in scaling Postgres indefinitely.

Breaking Postgres at Scale (PostgreSQL Experts, Inc.)

  • It's hard to do much wrong with a ten gigabyte database, even with sequential scans.
  • When you're above 1TB, many more discontinuities happen and memory must be used sparingly depending on how much you can afford hardware-wise.
  • Additionally, consider "moving to fast local storage from slower SAN-based" options

Mastering Chaos - a guide to microservices at netflix

  • A stateless service is different from a cache or database; it allows frequent access to metadata and is compute-efficient (thanks to the use of on-demand capacity). When there is a DDoS attack, you can absorb the attack while figuring out defenses.
  • ChaosMonkey has really helped Netflix to absorb node losses, especially with the use of stateless services. "Redundancy is fundamental"
  • Avoid storing your business logic and state in one application. It is important to split up your data, especially since losing nodes is a much larger, more consequential event.

Why Google Stores Billions of Lines of Code in a Single Repository

  • Storing Google's code in a single repository removes the difficulty of propagating changes from one codebase to another, but it requires a heavy investment in custom tooling to manage changes at that scale.
  • Tens of thousands of commits occur each day, and the rate of commits is still growing; automated systems submit about twice as many commits as humans do.
  • Google uses trunk-based development, the second pillar of a monolithic development model. This means developers are using the most recent code and avoid using development branches. This is so cool to me, as it means everyone is operating off the same codebase.

Lessons learned scaling our SaaS on Postgres to 8+ billion events (ConvertFlow)

  • "Premature optimization is the root of all evil". It is important to solve the right problem when resources are so thinly spread as a startup.
  • They initially built on Ruby on Rails, Heroku, and Postgres, and didn't really need to devote time to optimization early on, especially since Postgres can scale easily.
  • As you scale as a company, be ready to change your stack and migrate to new vendors.

adamzterenyi avatar May 06 '24 00:05 adamzterenyi

Scaling Instagram Infrastructure

  1. Instagram works to improve in 3 dimensions of scalability:
    • Scaling out: ability to add more hardware
    • Scaling up: more efficient code
    • Scaling the dev team: ability to grow the team without breaking things
  2. Python code can be sped up by "Cythonizing" it (C in python) for functionality that is stable and used extensively.
  3. Instagram uses certain metrics in production to monitor how a new feature performs under load, such as the ratio of 500s HTTP status codes to 200s codes.

The Evolution of Reddit.com's Architecture

  1. Software engineers at large companies such as Reddit still use basic print statements for debugging code.
  2. Reddit scales its usage of AWS services automatically in accordance with fluctuations in site traffic.
  3. You should always be very cautious when terminating servers or migrating services.

tylerheadley avatar May 06 '24 01:05 tylerheadley

(1)Data modeling, the secret sauce of building & managing a large scale data warehouse

  1. Decision dashboards are widely used within engineering teams within Windows. There are 100s of dashboards that are refreshed every few hours.
  2. The Windows database has a set of internal jobs that build tables for consumption, which supports efficient queries.
  3. Device-centric aggregation, also called one device / one vote, is how Windows' engineering team handles noisy devices (i.e. a faulty device producing too much noise, which could drown out the signal about the actual set of quality issues).

(2)Lessons learned scaling our SaaS on Postgres to 8+ billion events

  1. ConvertFlow is a no-code platform for e-commerce marketing teams to create and launch personalized visitor conversion campaigns on their websites without waiting on developers.
  2. Optimizing queries, in the case of this video, meant creating the primary indexes needed to quickly query reports for specific customers and their individual ConvertFlow campaigns (a sketch follows after this list).
  3. To migrate a database off of Citus Cloud to Azure, one can stream changes to a replica database on Azure and then switch over to the Azure connection in a scheduled maintenance window. This included scaling with Citus on Azure.
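
As referenced in fact 2, a hedged sketch of the kind of composite index that serves per-customer, per-campaign report queries; the table and column names are hypothetical, not ConvertFlow's actual schema:

```sql
-- Reports are always filtered by customer and campaign, so the index leads
-- with those columns.
CREATE INDEX CONCURRENTLY events_customer_campaign_time_idx
    ON events (customer_id, campaign_id, created_at);

-- A per-customer, per-campaign report query this index serves well:
SELECT date_trunc('day', created_at) AS day, count(*) AS conversions
FROM events
WHERE customer_id = 1001
  AND campaign_id = 55
GROUP BY day
ORDER BY day;
```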

(3)PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale

  1. The analytics shards described at FOSSASIA: at the time of the video they had 16 shards (approaching 60/64 terabytes).
  2. There are 3 major infrastructure pieces this team was comfortable with and could scale: Postgres, Kafka, and HAProxy.
  3. A data model under heavy load is very painful to change; schema changes have to be designed so they only require a few seconds of table locks in the whole process (as discussed in the video).

(4)Mastering Chaos - a guide to microservices at Netflix

  1. A disaster scenario: when one service fails and there are improper defenses against that failure, it can cascade and take down your entire service for your members (known as cascading failure).
  2. An anti-pattern example is a subscriber service leaning on EVCache too much: online and offline calls (batch recommendation jobs, subscriber lookups, and the real-time call path) all go to the same EVCache cluster, and the fallback when the entire EVCache layer went down was still to call the service and the database directly.
  3. The autonomic nervous system handles the functions our body takes care of automatically; you can set up a similar environment that makes best practices subconscious using a cycle of continuous learning and automation (as Netflix did).

(5)Why Google Stores Billions of Lines of Code in a Single Repository

  1. The Google repository is laid out in a tree structure. Every directory in the tree has a set of owners who can decide whether modifications to the files in their directory are allowed to be committed or not.
  2. Google (in 2015) had the following source systems: Piper and CitC. Piper is custom system that hosts the monolithic repository. CitC is a file system that supports normal code browsing, UNIX tools, etc. without needing to explicitly clone or sync any state locally.
  3. Google uses trunk-based development. A combination of trunk-based development with a centralized repository defines the monolithic model of source code management. For Google, this type of development helps avoid the merges that often happen with long-lived branches.

(6)Postgres at Pandora

  1. Pandora uses replication for a high availability solution. This means Pandora employs a method where data is duplicated across multiple servers or databases.
  2. The Clustr architecture (a non-ACID database) is a high availability system, meaning it compromises consistency for availability.
  3. PostgreSQL provides utilities like pg_dump and pg_restore for backing up and restoring databases. However, pg_dump and pg_restore are impractical to use in most cases once the database has any sort of large size.

(7)The Evolution of Reddit.com's Architecture

  1. Listings is the foundation of Reddit. It's an ordered list of links.
  2. Locks are bad for throughput and if you have to use them, you should partition on the right thing
  3. The comment tree structure is sensitive to ordering; inconsistencies trigger an automatic recompute.

(8)Scaling Instagram Infrastructure

  1. Scale up, in this video's context, means to use as few CPU instructions as possible to achieve something (writing good code to reduce the CPU demand on the infrastructure) and use as few servers as possible.
  2. In terms of memory layout, each server runs N parallel processes, where N is greater than the number of CPU cores on the system.
  3. Instagram ships code at 40-60 rollouts per day (they continuously roll out their master branch whenever a diff is checked in).

sjanefullerton avatar May 06 '24 09:05 sjanefullerton

(1) Scaling Instagram Infrastructure

  1. One thing I learned in this video is the general web structure of Instagram’s backend; there were some familiar terms like PostgreSQL and Django but there were some unfamiliar ones like Cassandra, Celery, and RabbitMQ
  2. Denormalized data, even though it seemed inconvenient in practice in our assignments, is still useful in some scenarios for Instagram
  3. Scaling up to the level of such a big platform has a lot of different challenges; a lot of optimizations need to happen to reduce CPU load

(2) Why Google Stores Billions of Lines of Code in a Single Repository

  1. Due to the sheer size of the repository, there are owners of designated areas of the repository who must approve changes
  2. Google practices “trunk-based development,” where there is little use of branches and changes are made to the repository in a single-file order
  3. Advantages to this are that there is no confusion about which repository people should be working on ("single source of truth"); it also helps avoid the "diamond dependency problem," where a library can be difficult to build due to changes in its dependencies.

(3) Lessons learned scaling our SaaS on Postgres to 8+ billion events | Citus Con 2022

  1. Difficulties with scalability happened around 100 million events; Firebase was used to alleviate traffic with faster querying
  2. Citus (Postgres extension) enabled significant scaling of Postgres database, allowed tables to be “sharded” to lower expenses and speed up queries; also divided server into a cluster of nodes
  3. Data rollups can save a lot on querying operations, helping them scale up 10x to billions of events; they also started clearing out very old, irrelevant data to save on costs.
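
A minimal sketch of the rollup idea in fact 3, with invented table names; dashboards then read the small rollup table instead of scanning billions of raw event rows:

```sql
CREATE TABLE events_daily_rollup (
    customer_id bigint NOT NULL,
    day         date   NOT NULL,
    event_count bigint NOT NULL,
    PRIMARY KEY (customer_id, day)
);

-- Roll up yesterday's raw events; re-running the job simply overwrites the row.
INSERT INTO events_daily_rollup (customer_id, day, event_count)
SELECT customer_id, created_at::date AS day, count(*)
FROM events
WHERE created_at >= date_trunc('day', now() - interval '1 day')
  AND created_at <  date_trunc('day', now())
GROUP BY customer_id, created_at::date
ON CONFLICT (customer_id, day) DO UPDATE SET event_count = EXCLUDED.event_count;
```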

(4) Data modeling, the secret sauce of building & managing a large scale data warehouse | Citus Con 2022

  1. Windows works off "measures", which are time-series data about a specific use case executed on Windows devices
  2. Some tables are high in dimension, up to 50+ indexes, they also collect data from up to 1.5 billion devices
  3. Also uses Citus for similar reasons (scalability and works with Postgres well)

(5) Mastering Chaos - A Netflix Guide to Microservices

  1. Uses Cassandra (again!) because it has a lot of flexibility, it fit how Netflix wanted to handle network partitions, and it provides eventual consistency
  2. Scalability issues included excessive overload; they saw up to 800k to a million requests per second and implemented workload partitioning and workload caching as a solution
  3. Following a huge crash of their US-East AWS server, they implemented a split of their load/traffic into multiple regional servers

(6) The Evolution of Reddit.com's Architecture

  1. Main block of Python code (monolith) is known as r2 and they also employ Cassandra
  2. Voting system was handled with caches and queues; particularly popular posts get their own queue to manage site load/traffic
  3. They work with "Thing" types (lol) that are each represented by a pair of tables in Postgres
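
A rough, simplified sketch of the "Thing" pattern from fact 3; the names and columns here are illustrative rather than Reddit's exact schema:

```sql
-- One table with a few fixed columns per Thing, and a companion table holding
-- everything else as key/value rows.
CREATE TABLE link_thing (
    thing_id bigint PRIMARY KEY,
    ups      integer     NOT NULL DEFAULT 0,
    downs    integer     NOT NULL DEFAULT 0,
    deleted  boolean     NOT NULL DEFAULT false,
    spam     boolean     NOT NULL DEFAULT false,
    created  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE link_data (
    thing_id bigint NOT NULL REFERENCES link_thing (thing_id),
    key      text   NOT NULL,
    value    text,
    PRIMARY KEY (thing_id, key)
);

-- Reassembling one link: the fixed columns plus all of its key/value rows.
SELECT t.ups, t.downs, d.key, d.value
FROM link_thing t
JOIN link_data d USING (thing_id)
WHERE t.thing_id = 42;
```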

(7) PostgreSQL at Pandora

  1. Pandora chose Postgres for its scalability and economic efficiency
  2. Experienced scaling issues with WAL files and streaming replication
  3. Cons: autovacuum ran for more than a week per table; pg_dump and pg_restore are impractical
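
A hedged sketch of per-table autovacuum tuning, the usual response when autovacuum takes days or weeks on very large tables; the table name and numbers are hypothetical, not Pandora's actual settings:

```sql
-- Trigger vacuums on this table after a fixed number of dead rows rather than
-- the default 20% of the table, and let each vacuum do more work per cost cycle.
ALTER TABLE plays SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold    = 1000000,
    autovacuum_vacuum_cost_limit   = 2000
);
```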

(8) PostgreSQL at 20TB and Beyond: Analytics at a Massive Scale- Chris Travers - FOSSASIA 2018

  1. Mapreduces are useful to speed up processes
  2. Autovacuum does not keep up well with tables that are largely scaled up and requires fine tuning/”hand holding”
  3. PostgreSQL is able to support the 400TB environment they’ve built, although there were challenges in data maintenance and a lot of map reducing involved to optimize processes

amyyu116 avatar May 07 '24 03:05 amyyu116

Postgres at Pandora

  • Pandora monitors the frequency of different activities that are damaging to the company's service, like long duration transactions, PostgreSQL and syslog errors, and blocks.
  • The company also captures historical data, both short and long term, to know what metrics are normal for every Postgres instance.
  • Pandora made their own internal application, Clustr, to maintain an architecture of data updates that affect several databases (interestingly, the Clustr system is not ACID compliant).

Mastering Chaos - a guide to microservices at netflix

  • CAP Theorem says that a service with databases spread across different networks must place preference on either consistency or availability: if you cannot perform an update on one database, either stop the update process and return an error, or update the ones that are available. Netflix chose availability and relies on "eventual consistency".
  • Netflix purposefully induces "chaos" into their stateless services, in which nodes are deleted and the system is checked to make sure it still runs as normal.
  • Netflix uses a global cloud management/delivery platform called Spinnaker, which allows them to integrate automated components into their deployment path easily. As the company learns more about how to improve their own service, it can add new automated systems to their deployment.

westondcrewe avatar May 07 '24 05:05 westondcrewe

Mastering Chaos - a guide to microservices at netflix

  • Cap Theorem is choosing between consistency or availability in a network partition which helps system designers understand the trade-offs involved in designing a robust system.
  • Cassandra is a great service to embrace eventual consistency where they don't expect every single write to be read back immediately from any one of the sources the data was written to. Thus, a client might write to one node which writes to other nodes.
  • Conway's Law is where any piece of software reflects the organizational structure that produced it

Why Google Stores Billions of Lines of Code in a Single Repository

  • Although a single repository for a code base of this size (1 billion files, 2 billion lines of code) is expensive to operate and makes it hard to keep the code base healthy, it is worth the benefits: no confusion about which version of things to use and no outdated dependencies lingering in the files
  • Humans make around 15K commits per day, but there are even more commits by automated systems; now only about a third of the commits are made by humans
  • Branching is rare within version control in Google which is similar to Instagram and their development.

Scaling Instagram Infrastructure

  • There are three ways to scale within Instagram: scaling out, which adds more servers or hardware for user growth; scaling up, which makes the servers more efficient; and scaling the development team to make sure there are new features while maintaining the stability of Instagram.
  • Instagram uses PostgreSQL for its primary data storage, with data replicated across "read" replicas to enhance availability. To manage memcache across regions and ensure data consistency, Instagram employs cache invalidation techniques following PostgreSQL replication.
  • Memcache plays a critical role in scaling by acting as a high-performance, in-memory key-value store, handling millions of reads and writes per second. This capability is crucial to prevent database failures due to overload.

The Evolution of Reddit.com's Architecture

  • Reddit began with a monolithic architecture using a large Python application known as 'r2'. To improve scalability and manageability, Reddit has been transitioning towards a microservices architecture.
  • Reddit incorporates timers in their code to diagnose delays in request processing. They specifically look at p99 metrics (99th percentile), which help them identify and trace the root causes of unusual delays or issues within their system.
  • To efficiently manage high volumes of data interactions, Reddit uses complex caching strategies and asynchronous job queues.

Citus: Postgres at any Scale

  • Citus is an open-source extension for PostgreSQL that transforms it into a distributed database system. This allows PostgreSQL to scale horizontally by distributing data and queries across multiple servers, thus enhancing performance and handling larger datasets efficiently.
  • Citus enhances the COPY command performance by executing these commands in parallel across worker nodes. This parallelism significantly speeds up data movement operations, making it highly efficient for large-scale data processes.
  • Citus supports distributed transactions, which ensures that all database transactions either completely succeed or fail across all involved nodes.

Lesson learned scaling our SaaS on Postgres to 8+ billion events

  • The transition of their PostgreSQL database from a single node to a clustered node setup, which includes a coordinator node and multiple worker nodes, was instrumental in scaling the project to handle 1 billion events.
  • The initial choice of PostgreSQL, paired with Ruby on Rails and hosted on Heroku, supported the platform's early growth stages without major issues up to a million events.
  • To further improve scalability and manage larger datasets, the team adopted Citus, a PostgreSQL extension that allows for efficient database sharding.

**Postgres at Pandora**

  • Pandora originally utilized Oracle but switched to PostgreSQL due to its open-source nature and cost-effectiveness.
  • Pandora has significantly ramped up its monitoring of database activities. This includes tracking errors, long duration queries and transactions, and blocked processes.
  • To address specific needs within their infrastructure, Pandora developed an in-house distributed database called Clustr. This system is not ACID compliant, prioritizing scalability and performance over strict consistency, tailored for handling shared databases effectively.

**Data modeling, the secret sauce of building & managing a large scale data warehouse**

  • Citus, as an extension of PostgreSQL, is crucial for managing large-scale data such as Windows', which gets inputs from around 1.5 billion devices.
  • Citus enables Windows to provide data-rich decision dashboards that refresh every few hours
  • Windows employs PostgreSQL's rich data types, such as JSON for staging tables and hstore for dynamic columns in reporting tables. These data types facilitate flexible data structures and efficient querying, which are essential for managing the diverse data collected from billions of devices.

pangsark avatar May 08 '24 11:05 pangsark

Data modeling, the secret sauce of building & managing a large scale data warehouse

  • Citus is a SQL execution engine that works well with real-time queries over large datasets
  • Citus has great scalability as it scales PostgreSQL across multiple machines, (1920 cores, 15TB DRAM, 960TB Premium SSD)
  • Citus is also well designed to handle tables high in dimensions, with up to 50 or more indexes; the warehouse it backs collects data from up to 1.5 billion devices

Mastering Chaos - a guide to microservices at Netflix

  • A microservice architecture is a modern approach to software development, where a single application is built as a collection of small services
  • Companies like Netflix employ fault injection testing to introduce faults to their system, in order to ensure resilience and that their fallback mechanisms function correctly
  • The company emphasizes the importance of identifying and defining critical microservices for basic user functionality, focusing testing and reliability strategies on these services to maintain operational core functions even during service failures.

Why Google Stores Billions of Lines of Code in a Single Repository

  • Google's monolithic repository contains over a billion files, with over 2 billion lines of code, with a data size of roughly 86 terabytes
  • There are about 45 thousand commits per workday, with roughly 2/3 of them being automated and the other 1/3 is by human developers
  • The rate of change to the repository has been increasing exponentially, so much so that on a graph of the rate of change, the data from 2004 is barely visible

Large Databases, Lots of Servers, on Premises, in the Cloud - Get Them All!

  • Uses RAID 10 for disks over RAID 5 or 6, since it provides the best storage performance; the talk also explains how this storage is expensive
  • Combining cloud services with on-premise servers gives better flexibility and scalability when managing large databases and servers
  • Backup strategies utilize tools like Barman for physical backups and pg_dump for logical backups

JTan242 avatar May 08 '24 20:05 JTan242