The quick concepts of databases (student)

Cours de Julie Chaumard

Transactional vs analytical

Databases are generally classified into two main categories: transactional and analytical.

📌

Data from transactional (operational) databases (OLTP) are copied and transformed into an analytical database (OLAP) to enable analysis and decision-making without impacting the performance of transactional systems.

Analytical Database: A type of database that stores static data and is used when there is a need to track trends, view statistical data over a long period of time, or make tactical or strategic business projections; it is typically associated with OLAP.

  1. Operational databases (OLTP - Online Transaction Processing)

    These databases are designed to manage real-time transactions and day-to-day operations of a business. They are used in systems requiring frequent reads and writes, such as:

    1. Order management systems
    1. Customer databases (CRM)
    1. Banking systems
    1. E-commerce applications

    They are optimized for fast transaction execution and guarantee data integrity with mechanisms like ACID (Atomicity, Consistency, Isolation, Durability).

  1. Analytical databases (OLAP - Online Analytical Processing)

    These databases are designed for analyzing and exploiting data for decision-making purposes. They are used in systems requiring complex queries on large volumes of data, such as:

    1. Business Intelligence (BI) systems
    1. Data Warehouses
    1. Trend analysis and forecasting
    1. Reporting and dashboards

    They are optimized for reading and analyzing data rather than for frequent updates. They often use techniques such as star modeling and tools for massive data processing.

  1. Examples of tools used
    1. OLTP Databases: MySQL, PostgreSQL, MS SQL Server, Oracle Database
    1. OLAP Databases: Amazon Redshift, Google BigQuery, Snowflake, Apache Hive
  1. 🎶🎶 Example with Spotify: OLTP vs OLAP

    Spotify uses both transactional databases (OLTP) and analytical databases (OLAP) to manage its streaming service and analyze user behaviors.

    1️⃣ Transactional databases (OLTP) at Spotify 🎵

    Spotify's OLTP databases are responsible for managing real-time interactions of users.

    They record millions of instantaneous and frequent transactions such as:

    ✅ Real-time actions:

    1. Playing a song → Which user is listening to which song at what time?
    1. Adding to a playlist → Instant recording of a track addition.
    1. Saving an album → Storage of user preferences.
    1. Premium subscription → Management of payments and subscriptions.
    1. Login and authentication → Verification of user access.

    ➡ This data is stored in fast OLTP databases optimized for real-time transaction management.

    📌 Problem: These OLTP databases are not designed to perform complex analyses on large volumes of data (e.g., which songs are most popular in each country?).

    ➡ This is why Spotify must transfer this data to an OLAP database for analysis and decision-making.

    2️⃣ Data transfer from OLTP to OLAP at Spotify

    Since OLTP databases cannot handle heavy analytical queries, the data is copied and transformed into a data warehouse (OLAP).

    OLAP databases allow Spotify to perform advanced analyses to optimize its service and make strategic decisions.

    ✅ Examples of OLAP analyses:

    1. Top 10 most listened songs by country
    1. Analysis of music trends (hip-hop vs pop vs rock) over the last 6 months
    1. Which artists generate the most engagement in France?
    1. Analysis of listening habits according to time of day
📌

We will begin with the study of Relational Database Management Systems. Later in this course, we will cover databases that are not relational, called NoSQL because SQL language is not used to retrieve or modify data in non-relational databases.

DBMS

DBMS (Database Management System) is a software that allows users to store, organize, manage and query data in a structured (relational RDBMS) or flexible (NoSQL) manner

Functional Dependency and Normalization 

We say that an attribute Y is functionally dependent on an attribute X(denoted X → Y) if, for each value of X, there exists only one corresponding value of Y.

  • Simple Example 
  • Let's take an Employees
    EmployeeIDNameDepartment
    001AliceIT
    002BobHR
    003CharlieIT

    In this table:

    • EmployeeID → Name: Each employee ID corresponds to only one name.
    • EmployeeID → Department: Each employee belongs to only one department.
    • But Name → Department is not always true (several employees can have the same name and be in different departments).
  • General Form
    • If a relation R(A, B, C, ...) contains attributes X and Y, we say that X → Y if and only if for each value of X, there exists only one value of Y.
  • Types of Functional Dependencies
    • Simple Functional Dependency:
      • ID Number → Name (Each ID number corresponds to only one person.)
    • Full Functional Dependency:
      • (EmployeeID, Year) → Salary (The salary depends on the combination of EmployeeID and Year, not just one of them.)
    • Transitive Functional Dependency:
      • EmployeeID → Department → Manager (If an employee ID determines a department, and a department determines a manager, then EmployeeID → Manager) (transitivity).
  • Why is it Important? Functional dependencies are essential for:
    • Avoiding data redundancy.
    • Improving data integrity.
    • Helping with database normalization (avoiding insertion, update, and deletion anomalies).

    For example, if a functional dependency is poorly managed, the same information could be stored multiple times, which can lead to inconsistencies.

  • Role in Normalization Normalization is a process that relies on functional dependencies to organize a database optimally.
    • 1NF (First Normal Form): Eliminates repeating groups.
    • 2NF (Second Normal Form): Eliminates partial functional dependencies.
      • partial functional dependency occurs when an attribute depends on only part of a composite primary key instead of the entire key.
      • Let's imagine an Enrollment table that stores courses taken by students: 
      StudentIDCourseIDStudentNameCourseNameProfessor
      101500AliceMathematicsDr. Dupont
      102500BobMathematicsDr. Dupont
      101600AliceComputer ScienceDr. Martin
      • 📍 Functional Dependency :
        • The primary key is (StudentID, CourseID).
        • StudentName depends only on StudentID → Partial functional dependency ❌
        • CourseName and Professor depend only on CourseID → Partial functional dependency ❌ 

      How to Fix This Problem? 

      📌 Apply 2NF (Second Normal Form) → Separate the data into multiple tables without partial dependency.

      ✅ New normalized structure: 

      Student Table

      StudentIDStudentName
      101Alice
      102Bob

      Course Table

      CourseIDCourseNameProfessor
      500MathematicsDr. Dupont
      600Computer ScienceDr. Martin

      Enrollment Table

      StudentIDCourseID
      101500
      102500
      101600

      ➡ Now, each column depends properly on its complete primary key.

    • 3NF (Third Normal Form): Eliminates transitive functional dependencies.
  • Example of a Problem if Functional Dependency is Poorly Managed

    If we store the following

    EmployeeIDNameDepartmentManager
    001AliceITPaul
    002BobHRSophie
    003CharlieITPaul

    Here, Department → Manager.

    If we change the manager of a department, we must change all concerned rows → risk of inconsistencies.

    ✅ Solution: Separate into two tables:

    • Employees (EmployeeID, Name, Department)
    • Departments (Department, Manager)
Tables, rows, and columns

A database is organized in the form of multiple tables. A table contains a collection of rows. A row is itself a sequence of values, each of a determined type. Generally, a row groups together information concerning an object, an individual, an event, etc., i.e., a concept from the real world (external to computing), which we will sometimes call an entity or a fact.

The CLIENT table containing eight rows [1], each describing a client. In each row, there are four values representing respectively the NAME and ADDRESS of the client, the CITY where they reside, as well as the status of their ACCOUNT. One of these rows represents the following fact:

There exists a client named AVRON, residing at 8, chaussée de la Cure in Toulouse, and whose account has a debit balance of 1,700.

Structure and content of the CLIENT table

'HANSENNE' is the value of the NAME column in the first row. A column will be defined by its name, the type of its values, and their length. The values of the NAME column consist of 1 to 32 characters and those of the ACCOUNT column of 9 numeric positions, with 2 after the decimal point. Specifying the nature of a column's values consists of specifying its type.

It is possible to add rows to a table and to delete them. It is also possible to modify the value of a column in a row. For example, as a bonus, one could add 5% to the ACCOUNT value of rows where ACCOUNT > 500.

The three tables SUPPLIER, OFFER, and PART constitute a small database. The first table, named SUPPLIER, describes suppliers, specifying their number (NUMF), name (NAMEF), and city of residence (CITYF). The second table, named PART, describes spare parts, characterized by their number (NUMP) and type (TYPE). The third table, OFFER, represents suppliers' offers for parts they can deliver. These offers are characterized by the number of the supplier (NUMFL) who can make the delivery, the number of the deliverable part (NUMPL), and the price (PRICE) at which this supplier offers this part.

The null value

Asserting that each row has a value for each column of the table implies that this value exists and is known at the time of recording this row in the table. This is not always the case. Particular circumstances may make it necessary to proceed with this recording even though certain data are not known.

The absence of a column value in a row, for whatever reason, will be indicated by assigning the conventional value nullto this column. Thus, if supplier 81 (Dumont)'s offer for part type 57 (NUT) exists but its price is not yet known, the row (81, 57, null) will be entered in the OFFER table.

OFFER
NUMFLNUMPLPRICE
8157null

We will see on several occasions that null values are delicate to manipulate and that it is preferable to avoid them when possible. null is not a value strictly speaking, but a marker indicating the absence of a value.

It is possible to impose a prohibition on assigning the null value to a column. Such a column will be called a required column. If the null value is allowed, this column will be said to be optional. The DBMS guarantees the required nature of a column. Any attempt to insert a row that does not have a value for a required column would be automatically rejected.

Identifiers

An identifier is a number or a text, or a concatenation of columns that is unique.

A column constitutes an identifier of its table if, at any time, there cannot exist more than one row possessing a determined value of this column. Such is the case for NUMF in the SUPPLIER table and NUMP in the PART table. Declaring that the NUMF column forms the identifier of the SUPPLIER table means imposing that at any time the rows of this table have distinct values of NUMF.

This definition is also valid for a group of columns (known as a composite identifier). If a supplier can make only one offer per type of part, then the columns {NUMF, NUMP} constitute a composite identifier of the OFFER table.

One can define multiple identifiers in a table.

Identifier and uniqueness constraint To correctly play this identification role, it is therefore necessary that the identifier possesses distinct values, a property called the uniqueness constraint. This is guaranteed by the DBMS, which will automatically reject any attempt to insert a row whose identifier value is already present in the table.

Foreign Keys

NUMFL and NUMPL play a reference role in the OFFER table. Indeed, each of their values references a row in another table. The value 152 of the NUMFL column of the row (152, 14, 62) in the OFFER table designates the row (152, MERCIER, Tours) in the SUPPLIER table.

It is through foreign keys that rows in different tables can be related. For each offer, represented by a row in OFFER, it is possible to know the information concerning the supplier (via NUMFL) and that concerning the part (via NUMPL).

Foreign Key and Referential Constraint To correctly play this reference role, it is necessary that the set of values of a foreign key be a subset of the values of the target identifier, and this at any time. This property is called a referential constraint. It is guaranteed by the DBMS as long as the foreign keys have been explicitly declared. Any operation that would lead to violating this constraint would be automatically rejected. For example, the insertion of the row (174, 97, 125) in the OFFER table would be rejected since there is no row in the PART table whose number is 97.

Schema and Content of a Database

Two distinct parts are distinguished in a database: its schema and its content. The schema of a database defines its structure in terms of tables, columns (with the type of values and the required or optional character of each), primary and secondary identifiers, and foreign keys. Its content at a given moment is the set of rows.

Schema (partial) and content of the database

A Representative Database Example
  • Table CLIENT: each row describes a customer; the columns successively describe the customer number (CUSTNUM), name (NAME), address (ADDRESS), location (LOCATION), category (CAT), and account status (ACCOUNT). The primary identifier consists of CUSTNUM. The CAT column is optional.
  • Table PRODUCT: each row describes a product specifying its number (PRODNUM), description (DESCRIPTION), unit price (PRICE), and quantity remaining in stock (QSTOCK). PRODNUM is the primary identifier.
  • Table ORDER: each row describes an order placed by a customer; the order number (ORDNUM), the number of the customer who placed the order (CUSTNUM), and the date of the order (ORDDATE) are specified. ORDNUM is the primary identifier of the table. CUSTNUM is a foreign key to the CLIENT table.
  • Table DETAIL: each row represents a detail of an order specifying the order number of this detail (ORDNUM), the number of the ordered product (PRODNUM), and the ordered quantity (QORD). The primary identifier consists of ORDNUM and PRODNUM. ORDNUM and PRODNUM are each also a foreign key to the ORDER and PRODUCT tables respectively.
  • 🧩 Drawing the tables
Graphical Representation of a Schema
  • an optional column is characterized by the symbol [0-1] which indicates that a row has 0 to 1 value for this column;
  • the primary identifier is specified by the "id:" clause which enumerates its components; in addition, these are underlined in the column compartment;
  • any secondary identifier is specified by a similar clause "id':";
  • a foreign key is specified by a "ref:" clause which enumerates its components; from this clause comes an arrow that points to the referenced identifier (this could be secondary, which we will ignore for now);
    In some cases, a non-graphical representation such as this one will suffice:
CLIENT (CUSTNUM, NAME, ADDRESS, LOCATION, CAT[0-1], ACCOUNT)

Modifications and Integrity Constraints

The constructions of the schema, particularly required columns, identifiers, and foreign keys, impose constraints on the data that must be satisfied at all times. These constraints will therefore come into play during any attempt to modify the data. Adding a row, deleting a row, or modifying a column value of a row are operations that are only allowed if these constraints will still be respected by the data after these operations. If these constraints are violated, the data is said to have lost its integrity. Hence the term integrity constraints by which they are generally designated.

  • Required Columns
    • If a column is declared required, each row must have a value for it. During row creation and modification operations, this column must receive a significant value, excluding the null value. For example, it is permitted to create a CLIENT row without a value for CAT, but not without a value for LOCATION.
  • Uniqueness Constraints (Identifiers)

    An identifier constitutes a uniqueness constraint imposing that at any time the rows of a table possess distinct values for one or more columns. For example, there cannot exist more than one CLIENT row having the same value of CUSTNUM.

    • Creating a row: allowed if there is no row possessing the same value of the identifier.
    • Deleting a row: no constraint.
    • Modifying the identifier of a row: allowed if there is no row possessing the new value of the identifier.
  • Referential Constraints (Foreign Keys)

    A referential constraint specifies that the columns of a table that constitute a foreign key must at all times, for each row, contain values that are found as a primary identifier of a row in another table. Thus, the DETAIL table is subject to two referential constraints. The first indicates that any value of ORDNUM must identify a row of ORDER. The second indicates that any value of PRODNUM must identify a row of PRODUCT.

    Schema for studying referential integrity

    Creating an ORDER row

    The CUSTNUM value of this row must be present in the CUSTNUM column of a CLIENT row

    Deleting an ORDER row

    Performed without restriction

    Modifying the CUSTNUM value of an ORDER row

    The new CUSTNUM value of the row must be present in the CUSTNUM column of a CLIENT row

    Creating a CLIENT row

    Performed without restriction, insofar as CUSTNUM is unique

    Deleting a CLIENT row

    The problem arises when this client has orders; there are then rows in ORDER that reference the CLIENT row to be deleted. There are several possible behaviors that leave the database in a correct state:

    • Blocking. The first consists of refusing to delete the CLIENT row in order to avoid leaving orphanedORDER rows in the database. This mode will be called no action in SQL.
    • Propagation or cascade. The second behavior consists of deleting not only the CLIENT row, but also all ORDER rows that reference it. This mode will be called cascade in SQL.

    Modifying CUSTNUM of a CLIENT row

    If no ORDER row references this CLIENT row, the referential constraint imposes no restriction. If, on the contrary, such rows exist, three behaviors similar to those of the delete operation are allowed: refusing modification (blockingno action), modifying the values of foreign keys that reference this row (propagationcascade) in order to preserve the relationships between the concerned rows.

    It can therefore be seen that the impact of a referential constraint is not unique and that, depending on the reality to be represented and the data management policy, one will be led to choose one or the other of the described behaviors. All, however, guarantee the consistency of the data.

Internal Redundancies in a Table

The tables we have encountered so far each represented a clearly identified set of entities: suppliers, clients, orders, accounts, or purchases. Some tables may present a more complex structure, generally considered undesirable. Such is the case with the BOOK table.

The owner of the BOOK table evidently wishes to record the books in the library. For each book, they have included the number, title, author, ISBN code, purchase date, and its location on the shelves. A book that is in high demand from readers may be acquired in several copies, each of which is the subject of a distinct row in the table.

The BOOK table records information on books available in a library

This representation seems natural but it imposes a very strict constraint: when a book exists in multiple copies, the rows describing these copies include the same values for title, author, and ISBN code.

  1. First of all, redundant data occupies unnecessary space.
  1. Then, subsequent modification of a book's title or author will require the same modification to the rows of all copies of this book, otherwise the data would become inconsistent.
  1. How to ensure that duplicate values will be spelled exactly the same way?
  1. If registering a first copy of a book can be done freely, registering subsequent copies must conform to information already present, which considerably complicates the procedure.

This phenomenon is called information redundancy, since the same information is recorded multiple times. Such a situation violates the founding principle of databases: any relevant fact of the application domain must be recorded once and only once. This redundancy residing in the table itself is qualified as internal.

This analysis therefore shows that the table is subject to an integrity constraint of a new type: if two rows have the same ISBN value, then they have the same values for TITLE and AUTHOR. We will say that the values of TITLE and AUTHOR depend directly on, or are a function of, the ISBN value. This constraint has been given the name functional dependency. It will be noted as:

Determinant ⇒ determined

identifier ⇒ determined

ISBN is the determinant of the dependency while the group (TITLE, AUTHOR) is the determined. There exists, of course, in each table, a functional dependency between the identifier of this table and each of its columns:

The origin of the redundancy we have observed is fairly simple to identify: the values of TITLE and AUTHOR can appear in multiple copies because the values of these two columns depend on those of a column (ISBN) which is not an identifier of the table. As it is normal for the same ISBN value to appear multiple times in the table, it is equally normal that certain values of TITLE and AUTHOR also appear in multiple copies, thus constituting a redundancy.

We draw a practical conclusion: one should avoid structuring a table in such a way that it is the seat of a problematic functional dependency, that is, one whose determinant is not an identifier of the table. If an attribute that is not an identifier (neither a primary key nor a candidate key) determines other values in the table, this can cause redundancies and update anomalies.

One question remains to be resolved: what to do when such a functional dependency exists in a table? We will proceed with normalization.

Normalization by Decomposition

Careful observation of the data contained in the BOOK table shows that it contains information on two categories of entities: the work and the multiple copies. For example, one should distinguish the work Mercury by A. Nothomb (ISBN 2 253 14911 X) from the three copies of it that constitute books 1032, 1067, and 1022. Data specific to a work are common to all its copies and are therefore recorded as many times as this work has copies.

The resolution of this problem involves the decomposition of the BOOK table into two distinct tables, to which we will give new names, to avoid any ambiguity. The first, WORK, contains the description of the works. It includes the ISBN code, title, and author of each of these. The second table, COPY, describes the copies, possibly multiple, of these works. It indicates, for each of them, the copy number, the ISBN code of the work (which constitutes a reference to WORK), the purchase date, and the location on the shelves.

We can draw from our small experience three rules:

  1. The WORK table consists of all columns involved in the problematic functional dependency (ISBN, TITLE, AUTHOR); its identifier is constituted by the determinant of this functional dependency, ISBN.
  1. The COPY table is the original table from which all columns appearing on the right side (determined) of the problematic functional dependency (TITLE, AUTHOR) have been removed.
  1. The columns of the determinant of the problematic functional dependency (ISBN) are found in both tables; it constitutes in COPY a foreign key to the new WORK table.

This process of decomposition to eliminate internal redundancies is called normalization. The WORK and COPY tables in the figure are said to be normalized, whereas the BOOK table is not.

It is clear that the decomposition has not resulted in any loss of information. It is indeed easy to reconstruct the exact content of the BOOK table from the contents of the WORK and COPY tables. These two databases therefore have the same information content. The second, however, has resolved the problems posed by the first. Indeed, we observe that:

  1. as each piece of information is present only once, the volume of data is minimized.
  1. modifying a title or an author requires modifying only a single row of the WORK table;
  1. introducing a copy, whether it is the first or a subsequent one, translates to the simple insertion of a row, provided that referential integrity is preserved;
  1. a work can be listed without corresponding to any copy.

In a normalized database, we distinguish the copies from the works of which they are the materialization. The redundancies present in the BOOK table are thus eliminated

Physical Structures

The tabular form of data corresponds to a relatively complex physical structure that guarantees good performance during query execution. We will briefly describe some of these structures: indexesstorage spaces. It is important to note that the user who consults and modifies the data is unaware of the presence of these physical structures when formulating a query. For example, any correct query can be executed by the DBMS, whether or not indexes exist to accelerate data access and sorting.

  • The term index refers to a mechanism whose objective is comparable to that of a book index such as this one. For a given significant term, the index gives us the list of page numbers where this term appears, without the need to go through the entire book to find these pages. Moreover, the index is ordered in such a way that one can quickly identify the term being searched.
    In the physical schema below, an index is represented by a group of column names prefixed by the symbol 
    acc (access key) or, if this group is an identifier or a foreign key, by the suffix acc. Thus, the CLIENT table has three indexes: (CUSTNUM), (NAME), and (LOCATION). There are several techniques for implementing an index. One of them involves a correspondence table that associates each value of the column, or columns, with the list of corresponding row numbers. In this correspondence table, the column values are arranged in ascending or descending order. Any number of indexes can be declared on each data table. An index can be added or removed at any time.

Physical schema of a database specifying indexes and storage spaces

💡

Accessing a row of a table via an index generally takes 10 to 20 milliseconds. Without an index, accessing this row may require reading the entire table, taking from 1 minute to 1 hour!

  • The rows of tables are stored in files, more specifically called storage spaces (dbspacetablespacespace), which constitute containers implemented in secondary memory. Such a container is divided into pages of fixed size (typically 4,096 bytes) and can accommodate rows from several tables. The reading time of any page is approximately 12 milliseconds. We will graphically denote a storage space by a cylinder containing the list of table names.
Working with Databases

In this Class we are going to use :

  • PHPMyAdmin

In addition you can look at :

  1. The SQLfast software allows very simple execution of SQL queries or scripts (sequence of queries) of any complexity without particular programming knowledge. SQLfast is developed in Python, uses the SQLite DBMS and is currently available on Windows, or Wine under Linux. Its installation, simple and quick, is described in Appendix 29, which also includes a tutorial of the SQLfast language. The software can be downloaded for free from the book's website. This is the approach we recommend. Figure 2.11 shows an elementary but complete example of a SQLfast script, which opens a database (openDB), asks the user (ask) for the name of a city (Figure 2.12), executes a query that extracts (select) and displays on screen information about clients in this city [11](Figure 2.13), then closes the database (closeDB).
    OpenDB CLICOM. db;
    ask
    ville = City name;
    select CUSTNUM, NAME from CLIENT where LOCATION = '$ville$';
    closeDB;
    
  1. Microsoft's Office suite includes the relational DBMS MS Access. The latter constitutes a very user-friendly sandbox for experimenting with database concepts. The SQL language is not immediately visible there but requires some manipulation. On the other hand, one cannot unfortunately recommend the Base module of LibreOffice,whose complex use and delicate behavior may discourage the beginning user.
  1. Lightweight and free DBMS can be found on the web, easy to install and use. For example, one would recommend Firebird (www.firebirdsql.org/), InterBase 6 (www.ibprovider.com/eng) or SQLite (www.sqlite.org).
  1. One will also consider one of the Open source stars: MySQL (www.oracle.com/us/sun/index.htmwww.mysql.com) or PostgreSQL (www.postgresql.org/).
  1. Finally, commercial DBMS often offer free lightweight versions for evaluation or private use. Oracle, DB2 from IBM and SQL Server from Microsoft are often more delicate to install but constitute the leaders of the market.
Exercise 1

We are considering the paper order form below, which we propose to encode in the form of data to be entered into the database below. What do you think about this?

Exercise 2

Check if the following schema is normalized. If necessary, decompose it into normalized tables:

  • SALE
    PRODNUMCUSTOMERDATEQUANTITYADDRESSREPRESENTATIVEREGION

    CUSTOMER ⇒ ADDRESS, REPRESENTATIVE (address and representative depend on customer)

    REPRESENTATIVE ⇒ REGION (region depends on representative)

💚

Agence digitale Parisweb.art
Tout savoir sur Julie, notre directrice de projets digitaux :
https://www.linkedin.com/in/juliechaumard/