til icon indicating copy to clipboard operation
til copied to clipboard

Database

Open anitsh opened this issue 4 years ago • 1 comments

Database

image

The two main types of modern databases to choose from are relational and non-relational, also known as SQL or NoSQL (for their query languages). There are a few main differences to be familiar with when deciding which database works best for your needs. SQL databases are known as relational databases, and have a table-based data structure, with a strict, predefined schema required. NoSQL databases, or non-relational databases, can be document based, graph databases, key-value pairs, or wide-column stores. NoSQL databases don’t require any predefined schema, allowing you to work more freely with “unstructured data.” Relational databases are vertically scalable, but usually more expensive, whereas the horizontal scaling nature of NoSQL databases is more cost-efficient.

Resource:

  • https://trello.com/c/BAKP4mf4/805-%F0%9F%93%9D%F0%9F%93%9Amust-know-database-dev
  • https://en.wikipedia.org/wiki/Database_normalization
  • [ ] https://robertovitillo.com/what-every-developer-should-know-about-database-consistency
  • [ ] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html
  • [ ] https://elephantdolphin.blogspot.com/2021/01/confuse-lot-of-those-new-to-structured.html
  • [ ] https://eng.uber.com/schemaless-sql-database
  • [ ] https://vladmihalcea.com/clustered-index

Databases

  • #119 MySQL
  • #263 Postgres
  • #391 MongoDB
  • https://gitlab.com/bramw/baserow

anitsh avatar Jan 24 '21 13:01 anitsh

Database Model

  • [ ] https://en.wikipedia.org/wiki/Data_model

anitsh avatar Apr 30 '21 16:04 anitsh

Relational Database

anitsh avatar Apr 30 '21 18:04 anitsh

Not Only SQL (NoSQL)

The RDBMS has been the de-facto standard for managing data since it first appeared from IBM in the mid-1980s. The RDBMS really exploded in the 1990s with Oracle, Sybase, IBM DB2, Microsoft SQL Server, and other similar databases appearing in the data centers of nearly every enterprise - databases you likely use today. With the first wave of Web applications, open source relational database management systems (RDBMS) such as MySQL and Postgres emerged and became a standard at many companies that desired alternatives to expensive proprietary databases sold by vendors such as Oracle. However, it wasn’t long before things began to change, and the application and data center requirements of key Internet players like Amazon, Facebook, and Google began to outgrow the RDBMS for certain types of applications. The need for more flexible data models that supported agile development methodologies and the requirements to consume large amounts of fast-incoming data from millions of cloud applications users around the globe - while maintaining extreme amounts of performance and uptime - necessitated the introduction of a new data management platform.

Enter NoSQL. Today, with every company utilizing modern cloud applications, the data problems originally encountered by the Internet giants have become common issues for every company, including yours. This means that you and your peers in DevOps and Operations must realize that it is no longer a question of if you will be deploying and managing NoSQL database systems, but when, and how much of your company’s data will eventually be stored on NoSQL platforms.

Types of NoSQL Databases

There are different types of NoSQL databases, with the primary difference characterized by their underlying data model and method for storing data. The main categories of NoSQL databases are:

  • Tabular​ - Also known as wide-column or wide-row stores, these databases store data in rows and users are able to perform some query operations via column-based access. A wide-row store offers very high performance and a highly scalable architecture. Examples include: DSE, HBase, and Google BigTable.
  • Key/Value​ - These NoSQL databases are some of the least complex as all of the data within consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL database. Some tabular NoSQL databases, like DSE, can also service key/value needs.
  • Document​ - Expands on the basic idea of key-value stores where “documents” are more complex, in that they contain data and each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, oftentimes stored as JSON. Examples include MongoDB and CouchDB. Note that some RDBMS and NoSQL databases outside of pure document stores are able to store and query JSON documents.
  • Graph​ - Designed for highly complex and connected data, which outpaces the relationship and JOIN capabilities of an RDBMS. Graph databases are often exceptionally good at finding commonalities and anomalies among large datasets. Examples include: DataStax Enterprise Graph and Neo4J.

Data models for NoSQL databases

Model Characteristics
Document Store Data and metadata are stored hierarchically in JSON-based documents inside the database.
Key Value Store The simplest of the NoSQL databases, data is represented as a collection of key-value pairs.
Wide-Column Store Related data is stored as a set of nested-key/value pairs within a single column.
Graph Store Data is stored in a graph structure as node, edge, and data properties.

Resource

  • [ ] https://stackoverflow.blog/2021/01/14/have-the-tables-turned-on-nosql
  • [ ] https://news.ycombinator.com/item?id=25798925
  • [ ] https://stackoverflow.com/questions/2608103/is-there-any-nosql-data-store-that-is-acid-compliant
  • [ ] https://dba.stackexchange.com/questions/286078/what-exactly-is-an-unstructured-data-and-why-to-use-non-relational-dbms-for-that

anitsh avatar Apr 30 '21 18:04 anitsh

Deciding between RDBMS and NoSQL

How do you decide when to use an RDBMS and when to use a specific type of NoSQL database? In short, an RDBMS is great for centralized applications that need ACID transactions and whose data fits well within the relational data model. The following chart provides a general comparison between the characteristics that point towards an RDBMS vs. those that signal a NoSQL database may be a better choice:

RDBMS NOSQL
Master-slave architecture Masterless architecture
Moderate velocity data High velocity data (devices, sensors, etc.)
Data coming in from one/few locations Data coming in from many locations
Primarily structured data Structured, with semi-structured
Always strongly consistent Tunable consistency (eventual to strong)
Complex/nested transactions Simple transactions
Protect uptime via failover/log shipping Protect uptime via architecture
High availability Continuous availability
Deploy app central location / one server Deploy app anywhere / many servers
Primarily write data in one location Write data everywhere / anywhere
Primary concern: scale reads Scale writes and reads
Scale up fro more users / data Scale out for more users / data
Maintain data volumes with purge High data volumes; retain as needed
Transaction workloads Mixed workloads of transactions and analytics
One datacenter or cloud region Multi-data center, multi-cloud region, or hybrid cloud
Relational data Multi-model (tabular, key/value, document, graph)

Looking at the data model requirements is another tactic to use when evaluating an RDBMS vs. NoSQL. Certain NoSQL databases require the denormalization of data and aren’t concerned with the relationships between data entities whereas others are built to handle complex and very intense data relationship scenarios: image

RDBMS and Graph (NoSQL) databases are at the high end of the data model continuum where the relationships between data are concerned and are somewhat similar in their base characteristics:

Characteristics RDBMS GRAPH DB
An identifiable “something” to keep track of Entity Vertex
A connection or reference between two objects Relationship Edge
A characteristic of an object Attribute Property

One of the key differences between a graph database (NoSQL) and an RDBMS is how relationships between entities/vertexes are prioritized and managed. While an RDBMS uses mechanisms like foreign keys to connect entities in a secondary fashion, edges in a graph database are of first order importance. As such, a graph database is more scalable and performant than an RDBMS when it comes to complex data that is highly connected (e.g. millions or billions of relationships).

Unlike most other ways of displaying data, graphs are foundationally designed to express relatedness. Graph databases can uncover patterns that are difficult to detect when using traditional representations, such as RDBMS tables.

Suggestions for when to use a graph database vs. an RDBMS is the following:

RDBMS GRAPH
Simple to moderate data complexity Heavy data complexity
Hundreds of potential relationships Hundreds of thousands to millions or billions of potential relationships
Moderate JOIN operations with good performance Heavy to extreme JOIN operations required
Infrequent to no data model changes Constantly changing and evolving data model
Static to semi-static data changes Dynamic and constantly changing data
Primarily structured data Structured and unstructured data
Nested or complex transactions Simple transactions
Always strongly consistent Tunable consistency (eventual to strong)
Moderate incoming data velocity High incoming data velocity (e.g. sensors)
High availability (handled with fail-over) Continuous availability (no downtime)
Centralized application that is location dependent (e.g. single location), especially for write operations and not just read Distributed application that is location independent (multiple locations involving multiple data centers and/or clouds) for write and read operations

NoSQL and Hadoop: A Comparison

You’ve no doubt heard about Hadoop and perhaps your company is already using it to handle various new data warehousing projects. Perhaps you’re wondering how Hadoop differs from NoSQL.

Apache HadoopTM is an open source software project that enables the distributed processing of large data sets, and uses a scale-out architecture that stores and processes data across many machines. Hadoop is an ecosystem umbrella term that encompasses many different software components.

In general, Hadoop is not a database, but is instead a framework primarily devoted to handling modern batch analytics “data lake” use cases. Hadoop does offer a NoSQL database as part of its framework (Hbase), but it is used mostly for analytics situations.

By contrast, a NoSQL database is used operational and transactional information and processes. Hadoop, as well as Spark are complementary to DSE. DSE is a mixed workload data platform that can handle both operational / transactional processes used for cloud applications as well as data analysis via its tight integration with both Hadoop and Spark.

Looking at the data model requirements is another tactic to use when evaluating an RDBMS vs. NoSQL. Certain NoSQL databases require the denormalization of data and aren’t concerned with the relationships between data entities whereas others are built to handle complex and very intense data relationship scenarios: image

CAP Theorem

image The theorem states that distributed data systems will offer a trade-off between consistency, availability, and partition tolerance. And, that any database can only guarantee two of the three properties:

  • Consistency. Every node in the cluster responds with the most recent data, even if the system must block the request until all replicas update. If you query a "consistent system" for an item that is currently updating, you'll wait for that response until all replicas successfully update. However, you'll receive the most current data.
  • Availability. Every node returns an immediate response, even if that response isn't the most recent data. If you query an "available system" for an item that is updating, you'll get the best possible answer the service can provide at that moment.
  • Partition Tolerance. Guarantees the system continues to operate even if a replicated data node fails or loses connectivity with other replicated data nodes.

Resource

  • [ ] https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/relational-vs-nosql-data
  • [ ] https://www.imaginarycloud.com/blog/mongodb-vs-mysql
  • [ ] https://memgraph.com/blog/graph-database-vs-relational-database

anitsh avatar Apr 30 '21 18:04 anitsh

Document-Oriented Database

The central concept of a document-oriented database is the notion of a document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard format or encoding. Encodings in use include XML, YAML, JSON, as well as binary forms like BSON (Binary JSON).

Documents in a document store are roughly equivalent to the programming concept of an object. They are not required to adhere to a standard schema, nor will they have all the same sections, slots, parts or keys. Generally, programs using objects have many different types of objects, and those objects often have many optional fields. Every object, even those of the same class, can look very different. Document stores are similar in that they allow different types of documents in a single store, allow the fields within them to be optional, and often allow them to be encoded using different encoding systems.

A document database is a type of non-relational database that is designed to store and query data as JSON-like documents. Document databases make it easier for developers to store and query data in a database by using the same document-model format they use in their application code. The flexible, semi-structured, and hierarchical nature of documents and document databases allows them to evolve with applications’ needs. The document model works well with use cases such as catalogs, user profiles, and content management systems where each document is unique and evolves over time. Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.

Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database (only key-value relationship is known, and the value can be anything so the content in the value cannot be determined), whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often negligible due to tools in the systems, conceptually the document-store is designed to offer a richer experience with modern programming techniques.

Document databases contrast strongly with the traditional relational databases. Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This eliminates the need for object-relational mapping while loading data into the database.

In a relational database, data is first categorized into a number of predefined types, and tables are created to hold individual entries, or records, of each type. The tables define the data within each record's fields, meaning that every record in the table has the same overall form. The administrator also defines the relationships between the tables, and selects certain fields that they believe will be most commonly used for searching and defines indexes on them. A key concept in the relational design is that any data that may be repeated is normally placed in its own table, and if these instances are related to each other, a column is selected to group them together, the foreign key. This design is known as database normalization.

For example, an address book application will generally need to store the contact name, an optional image, one or more phone numbers, one or more mailing addresses, and one or more email addresses. In a canonical relational database, tables would be created for each of these rows with predefined fields for each bit of data: the CONTACT table might include FIRST_NAME, LAST_NAME and IMAGE columns, while the PHONE_NUMBER table might include COUNTRY_CODE, AREA_CODE, PHONE_NUMBER and TYPE (home, work, etc.). The PHONE_NUMBER table also contains a foreign key column, "CONTACT_ID", which holds the unique ID number assigned to the contact when it was created. In order to recreate the original contact, the database engine uses the foreign keys to look for the related items across the group of tables and reconstruct the original data.

In contrast, in a document-oriented database there may be no internal structure that maps directly onto the concept of a table, and the fields and relationships generally don't exist as predefined concepts. Instead, all of the data for an object is placed in a single document, and stored in the database as a single entry. In the address book example, the document would contain the contact's name, image, and any contact info, all in a single record. That entry is accessed through its key, which allows the database to retrieve and return the document to the application. No additional work is needed to retrieve the related data; all of this is returned in a single object.

A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in any database, and those documents can change in type and form at any time. If one wishes to add a COUNTRY_FLAG to a CONTACT, this field can be added to new documents as they are inserted, this will have no effect on the database or the existing documents already stored. To aid retrieval of information from the database, document-oriented systems generally allow the administrator to provide hints to the database to look for certain types of information. These work in a similar fashion to indexes in the relational case. Most also offer the ability to add additional metadata outside of the content of the document itself, for instance, tagging entries as being part of an address book, which allows the programmer to retrieve related types of information, like "all the address book entries". This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables).

In the classic normalized relational model, objects in the database are represented as separate rows of data with no inherent structure beyond that given to them as they are retrieved. This leads to problems when trying to translate programming objects to and from their associated database rows, a problem known as object-relational impedance mismatch. Document stores more closely, or in some cases directly, map programming objects into the store. These are often marketed using the term NoSQL.

Resource

  • [ ] https://en.wikipedia.org/wiki/Document-oriented_database

anitsh avatar Apr 30 '21 18:04 anitsh

Object Storage

Object storage, also known as object-based storage, is a strategy that manages and manipulates data storage as distinct units, called objects. These objects are kept in a single storehouse and are not ingrained in files inside other folders. Instead, object storage combines the pieces of data that make up a file, adds all its relevant metadata to that file, and attaches a custom identifier.

Object storage adds comprehensive metadata to the file, eliminating the tiered file structure used in file storage, and places everything into a flat address space, called a storage pool. This metadata is key to the success of object storage in that it provides deep analysis of the use and function of data in the storage pool.

Resource

  • [ ] https://www.netapp.com/data-storage/storagegrid/what-is-object-storage

anitsh avatar Apr 30 '21 18:04 anitsh

The ACID Consistency Model

Many developers are familiar with ACID transactions from working with relational databases. As such, the ACID consistency model has been the norm for some time. The key ACID guarantee is that it provides a safe environment in which to operate on your data. The ACID acronym stands for:

  • Atomic: All operations in a transaction succeed or every operation is rolled back.
  • Consistent: On the completion of a transaction, the database is structurally sound.
  • Isolated: Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
  • Durable: The results of applying a transaction are permanent, even in the presence of failures.

ACID properties mean that once a transaction is complete, its data is consistent (tech lingo: write consistency) and stable on disk, which may involve multiple distinct memory locations. Write consistency is a wonderful thing for application developers, but it also requires sophisticated locking which is typically a heavyweight pattern for most use cases. When it comes to NoSQL technologies, most graph databases(including Neo4j) use an ACID consistency model to ensure data is safe and consistently stored.

The BASE Consistency Model

For many domains and use cases, ACID transactions are far more pessimistic (i.e., they’re more worried about data safety) than the domain actually requires.

In the NoSQL database world, ACID transactions are less fashionable as some databases have loosened the requirements for immediate consistency, data freshness and accuracy in order to gain other benefits, like scale and resilience.

(Notably, the .NET-based RavenDB has bucked the trend among aggregate stores in supporting ACID transactions.)

Here’s how the BASE acronym breaks down:

Basic Availability: The database appears to work most of the time.

Soft-state: Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.

Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily at read time).

BASE properties are much looser than ACID guarantees, but there isn’t a direct one-for-one mapping between the two consistency models (a point that probably can’t be overstated).

A BASE data store values availability (since that’s important for scale), but it doesn’t offer guaranteed consistency of replicated data at write time. Overall, the BASE consistency model provides a less strict assurance than ACID: data will be consistent in the future, either at read time (e.g., Riak) or it will always be consistent, but only for certain processed past snapshots (e.g., Datomic).

The BASE consistency model is primarily used by aggregate stores, including column family, key-value and document stores.

Navigating ACID vs. BASE Trade-offs

There’s no right answer to whether your application needs an ACID versus BASE consistency model. Developers and data architects should select their data consistency trade-offs on a case-by-case basis – not based just on what’s trending or what model was used previously.

Given BASE’s loose consistency, developers need to be more knowledgeable and rigorous about consistent data if they choose a BASE store for their application. It’s essential to be familiar with the BASE behavior of your chosen aggregate store and work within those constraints.

On the other hand, planning around BASE limitations can sometimes be a major disadvantage when compared to the simplicity of ACID transactions. A fully ACID database is the perfect fit for use cases where data reliability and consistency are essential (banking, anyone?).

anitsh avatar May 02 '21 08:05 anitsh

image

anitsh avatar Jul 29 '22 10:07 anitsh