Data models + Query Languages

Data models are extremely important. They influence

How software is written
How we think about hte problem

For each layer, key question is how is it represented in terms of the next-lower layer?

App developer models real world with data structures + APIs
Data structures expressed in general purpose data model like jSON
Engineers who built your database software decided how to represent jSON in terms of bytes
Hardware engineer figured out how to represent bytes in terms of electrical currents ,etc. May be many more intermediary layers (API built on APIs)

Clean data models allow for abstraction and embody assumptions about its usage

Relational vs Document Model

Relational: SQL

Data organized into relations
Relation: unordered collection of tuples

Birth of NoSQL

latest attempt to overthrow relational model dominance Why NoSQL?
Need for greater scalability
OSS
Specialized query operations
More expressive + dynamic model

Object-Relational mismatch

Impedance mistmatch: most development today done in object-oriented programming
Data in SQL (relational) necessitates awkward translation layer

Object relational mapping (ORM) frameworks like Hibernate

Try to reduce boilerplate for translation layer

Example of resume

Traditional SQL model: put positions, education, contact info in separate tables
- Foreign key reference to users table
Later SQL: support for structured datatype- ulti-valued data in a single row
3rd option: Encode jobs, education, contact info as JSON document, store it on text column in database Or, use JSON- quite appropriate

JSON has better locality than the multi-table schema- no need to make multiple queries

Many-to-one and many-to-many relationships

Why use ids instead of plain text strings? (ie. region_id instead of “Greater Seattle Area”)

Consistent style / spelling
Avoid ambiguity
Ease of updating
Localizaton support- dif. languages
Easy search

Key idea: since ID has no meaning to humans, never needs to change

Anything meaningful to humans may need to change in the fturue
If information is duplicated, also need to change all redundant copies
- Removing this duplication is the key idea behind normalization in databases

Normalizing this data requires many-to-one relationaships, which don’t fit nicely into document model

Would have to emulate a join by making multiple queries

Are document databases repeating history? Back in the day, IMS was a popular database- use da hierarchical data model (tree of records within records)

Made many-to-many relationships difficult
No support for joins

2 prominent solutions were proposed to solve. limitatin sof hierarchical model

Relational model: became SQL
Network model: eventually faded into obscurity

Network model -CODASYL model

Generalization of hierarchical model but where a record could have multiple parents
Allows to model many-to-one and many-to-many relationships
Issue: access paths got complicated

Relational model

Lay data in the open
- No nested structures or access paths

Relational vs document dtabases today

Lots of differences to consider- here we’ll only discuss the differences in the data model

Which leads to simpler application code? If data is doument-like: probably use document model

Could do shredding for relational model, but leads to cmbersome schemas- unnecessarily complicated
Poor support for joins may be a problem depending on the application If application uses many-to-many relationships, document model is less appealing

Schema flexibility in docuemnt model Document databses don’t enforce any schema

Schema-on-read: sructure of data is implicit
- Advantageous if items don’t all have same structure
Similar to dynamic type checking

Schema changes

All those AlLTER and UPDATE commands
Bad reputation for being inefficient, but actually quite fast

Data Locality Document stores as single contiguous string usually Storage locality: all stored near itself

Only advantage if need large parts of document at same time- otherwise, loading for no reason

Convergence of document / relational datbbases Most relational database systems hae also supported XML Some document databases support relaton-like joins

Query Languages for Data

Relation model introduuced new way of querying data

Declarative query language
- Just specify the pattern of the data you want (conditions, data transformations)
- Makes it easier to improve the underlying engine while abstracting away updates
In contrast to previous imperative querying code
- Tells compute to perform operations in a certain order

Declarative Queries on the web Declarative query language also has pros when it comes to web browsers

CSS for example li.selected > p e.g. declares pattern of elements

MapReduce Querying mapReduce- programing model for processing large amount of data in bulk Not declarative nor fully imperative query API Based on map and reduce functions while specific filters that can be specified declaratively Map and reduce must be pure- no side effects

Graph-Like Data Models

Many to many relationships may be common in your data

Starts to bcome more ntural to model your data as a graph
e.g.
- social graph
- web graph
- road / rail network

Some different ways to structure and query dat in graphs Discussed here

Models
- Property graph
- Triple-store
Query languages
- Cypher
- SPARQL
- Datalog
Imperative graph query langauges (e.g. Gremline)
Graph processing frameworks (e.g. Pregel)

Property Graph

Each vertext

Unique ID
Set of outgoing edges
Set of incoming edges
Collection of properties (key-value pairs) Each edge
Unique ID
Vertex where edge starts
Vetex where edge ends
Label (kind of relationship)
Collection of properties

Intuition

2 relational tables- vertices + edges
Query edges table by head_vertex or tail_vertex

Important to note

Any vertex can connect with any vertex
Given any vertex, you can traverse both forward and backewards
Can store multiple kinds of info in a single graph while having a clean data model (different relationships, differnt labels)

Cypher Query Language

Declarative query laguage for property graphs (created for the Neo4j graph database) Each vertex given a symbolic name and other parts of query can use the names to create edges between the vertices using arrow notation

CREATE
(NAmerica:Location {name:'North America', type:'continent'}),
(USA:Location      {name:'United States', type:'country'  }),
(Idaho:Location    {name:'Idaho',         type:'state'    }),
(Lucy:Person       {name:'Lucy' }),
(Idaho) -[:WITHIN]->  (USA)  -[:WITHIN]-> (NAmerica),
(Lucy)  -[:BORN_IN]-> (Idaho)

When all vertices and edges are added to databse, can ask interesting questions

e.g. “find the names of all the people who emigrated from the US to Europe”- quivalent to returning name property for vertices with:
- BORN_IN edge to location in hte US
- LIVING_IN edge to location in Europe
- Query in Cypher language

MATCH
(person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'}) RETURN person.name

cos in https://substack.com/browse/recommendations/post/140918169

Pablo's Reference Notes

Explorer

Data models + Query Languages

Data models + Query Languages

Relational vs Document Model

Query Languages for Data

Graph-Like Data Models

Property Graph

Cypher Query Language

Graph View

Table of Contents

Backlinks