1 - Data - Definition and uses

What is data?

According to Le Robert dictionary, data is a “conventional representation of information that allows for automatic processing.”
- "Conventional" means based on a shared agreement, standard, or rule.
- In computing, for a machine to process information, it must be encoded in a format the machine understands.
- The date "May 3, 2025" can be represented as:
  - 03/05/2025 (French format)
  - 05/03/2025 (US format)
  - 2025-05-03 (ISO 8601 international standard)
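
As a small illustration of these conventions, the sketch below uses Python's standard `datetime` module to write the same date in the three formats listed above. It also shows why the convention matters: the same string read with the wrong convention yields a different date. The variable names are chosen for this example only.

```python
from datetime import date, datetime

d = date(2025, 5, 3)  # May 3, 2025

french = d.strftime("%d/%m/%Y")  # day/month/year
us = d.strftime("%m/%d/%Y")      # month/day/year
iso = d.isoformat()              # ISO 8601: year-month-day

print(french, us, iso)

# The same text read with the US convention becomes March 5, not May 3.
misread = datetime.strptime(french, "%m/%d/%Y").date()
print(misread)
```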

Data is thus an informational vector that comes in many forms.
- A data point is called an "informational vector" because it carries information.
- Data is the encoded, machine-processable form in which information can circulate, be stored, or be analyzed.
- It acts as a transmission medium between:
  - a human and a machine,
  - two software programs,
  - two people via a digital medium.

Information	Data format (vector)
A name	A string: `"Marie"`
An image	A file: `image.jpg`
A sound	An audio format: `podcast.mp3`
A temperature	A number with unit: `22.5°C`
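
To make the idea of a "data format" concrete, here is a minimal Python sketch of two rows of the table: a name stored as a string (and its UTF-8 byte encoding), and a temperature stored as a number with its unit. The `Temperature` class is a hypothetical helper introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Temperature:
    """A number together with the unit that gives it meaning."""
    value: float
    unit: str


name = "Marie"
# UTF-8 is one shared convention for turning text into bytes.
name_bytes = name.encode("utf-8")
name_codes = list(name_bytes)

temperature = Temperature(22.5, "°C")

print(name_codes)
print(f"{temperature.value}{temperature.unit}")
```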

Data is essential for analyzing a situation; it is exchanged, shared, and distributed so that individuals or computers can function, communicate, and collaborate.

Images, video, audio, numbers, text, computer code: data is everywhere, allowing us to quantify and qualify each process in modern systems.

In science, business, marketing, finance, or politics, data allows us to quantify and qualify information and make sense of indicators that can lead to strategic decisions.
- For instance, a company deals with masses of information every day: employee attendance and absences, prospects, clients, reviews and their quality, revenue, margins, cash flow, capital... All of these elements are data.
  With this data, a business leader can set a strategy, for example by collecting all the data related to profitability, calculating it, drawing conclusions, and taking action.
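
The profitability example can be sketched in a few lines of Python. This is only an illustration: the monthly figures are invented, and "profitability" is assumed here to mean the margin as a fraction of revenue, which is one possible definition among several.

```python
def profit_margin(revenue: float, costs: float) -> float:
    """Return (revenue - costs) / revenue, i.e. the margin as a fraction."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - costs) / revenue


# Hypothetical monthly figures collected from accounting data.
monthly = [
    {"month": "2025-01", "revenue": 120_000.0, "costs": 90_000.0},
    {"month": "2025-02", "revenue": 100_000.0, "costs": 95_000.0},
]

margins = {m["month"]: profit_margin(m["revenue"], m["costs"]) for m in monthly}

for month, margin in margins.items():
    print(f"{month}: {margin:.0%}")
```

A falling margin from one month to the next is the kind of indicator that can trigger a decision, such as reviewing costs.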

Over the past two decades, with the democratization of the Internet, data has become increasingly vital. The term "Big Data" refers to this explosion of digital data. From this, new roles and disciplines have emerged: Data Engineering, Data Mining, Data Centers, Data Analysis, Web Analytics, Data Processing…

The Uses of Data

Editable, reusable, and accessible, data has become the best friend of internet users. It’s available all over the web and anyone can use it. It appears in different forms. Let’s explore how it can be used in various contexts.

On a social level, data helps citizens make better decisions, find answers to their questions, and solve many of their problems. It also has the power to improve their quality of life. With all types of data available online, it’s possible to find a job, meet people, fill out tax forms, ask questions in forums, or book a cinema ticket.

In the transportation sector, data is essential and will continue to be so in the future. Drivers now use data-processing apps like Waze and Google Maps, which offer optimized routes while alerting users to incidents or speed checks. These systems harness the power of the crowd, as each user contributes to data validation and updates via alerts and corrections.

In the professional world, data is critical. It’s found in every area of business. In HR, for instance, HRIS (Human Resource Information Systems) qualify employees through data (attendance, leave, training...). In customer management, CRM tools (Customer Relationship Management) are used to analyze customer data and develop micro or macro-level commercial strategies.

From an economic standpoint, data helps measure the economic impact of a population or a country through indicators like HDI, purchasing power, GDP and GDP per capita, etc. Increasingly, data also supports economic forecasting using mathematical models to improve decision-making.

In healthcare, data allows medical professionals to profile patients, review their medical history, and coordinate care. In recent years, data has helped manage epidemics—identify clusters, isolate carriers, measure trends, and implement interventions.

There are many more examples of how data is used. For instance, industry has moved from "Industry 3.0" automation to "Industry 4.0" smart factories, and adaptive learning platforms like LinkedIn Learning tailor learning paths using data (and AI). Data usage is everywhere and is creating new disciplines.

Industry 3.0 and 4.0: Historical context of industrial evolution

Revolution	Period	Main characteristic
1.0	Late 18th century	Mechanization, steam engines
2.0	Late 19th – early 20th	Electricity, mass production
3.0	1970s–2000s	Automation via electronics and IT
4.0	Since ~2010	Connected factories, AI, big data, IoT

LinkedIn Learning
LinkedIn Learning is an online training platform launched by LinkedIn (owned by Microsoft since 2016). It offers thousands of video courses in various areas such as:
- personal and professional development,
- office tools and software (Excel, Photoshop, etc.),
- programming, design, marketing, project management, etc.
Courses are created by experts and accessible via subscription, with certificates that can be added to your LinkedIn profile.
Big Data = LinkedIn has a huge amount of data about its users.
AI = LinkedIn uses this data to recommend content and analyze trends.

Adaptive Learning
Adaptive learning is a personalized teaching method that uses technology to adjust to each learner’s pace, knowledge, and needs.
- Content adjusts automatically based on user responses and behavior.
- The learning path is non-linear and evolves.
- The goal is to optimize learning efficiency by delivering the right content at the right time.
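
As a rough sketch of the principle (content adjusting to learner responses), the function below raises or lowers a difficulty level based on recent answers. The thresholds (80% and 40%) and the 1 to 5 scale are assumptions made for this example; real platforms use far richer models.

```python
def next_difficulty(level: int, recent_answers: list[bool],
                    min_level: int = 1, max_level: int = 5) -> int:
    """Pick the next difficulty level from the learner's recent answers."""
    if not recent_answers:
        return level  # no evidence yet: keep the current level
    success_rate = sum(recent_answers) / len(recent_answers)
    if success_rate >= 0.8:    # learner is comfortable: move up
        level += 1
    elif success_rate <= 0.4:  # learner is struggling: move down
        level -= 1
    # Keep the result inside the allowed range.
    return max(min_level, min(max_level, level))


print(next_difficulty(2, [True, True, True, True, True]))
print(next_difficulty(2, [False, False, True, False, False]))
```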

LinkedIn Learning and Adaptive Learning
LinkedIn Learning integrates adaptive learning by:
- recommending courses based on LinkedIn profile skills,
- tracking progress and adjusting proposed content,
- offering personalized paths based on jobs or career goals.
This creates a tailored learning experience linked directly to individual needs and career growth.

Data Disciplines in Companies

Data is essential for a company to operate.

Data professionals carry out significant work before data can be fully utilized. Data-related jobs require deep skills and knowledge in computing, mathematics (probabilities, statistics…), marketing, and finance.

Let’s start with the Data Scientist. This is a data analysis expert with both technical and scientific skills to solve complex data-related problems. A Data Scientist is a blend of mathematician, statistician, and programmer. Their role is to build scientific models (algorithms) to make data usable and valuable.
They analyze, process, model, and interpret data to create action plans. Positioned between researcher and consultant, they are central to many companies and organizations.

Next is the Data Engineer, who programs the systems required for the data processes requested by teams. They develop algorithms to help colleagues collect and use data. This role demands technical skills in managing and designing SQL databases and using programming languages (Python, R...).

Then there's the Data Analyst, who interprets data, analyzes results, and presents them, thanks to expertise in statistics and some programming. They design data analysis processes (dashboards, BI and DataViz tools), interpret insights, and make data understandable. Acting as a bridge between the Data team and others, their strength lies in simplifying complex information.

Data Miner, unlike the roles above, is a data extraction (web scraping) tool designed to help users automatically collect information from web pages and convert it into usable tables (CSV, Excel, Google Sheets…).

Other key roles include the Machine Learning Engineer and the Data Steward, a data quality manager who ensures data is consistent, well-documented, reliable, accessible, and compliant with internal policies and regulations. They play a crucial role in data governance projects.

Depending on the company’s digital maturity, other roles may exist: Chief Data Officer, Big Data Architect, BIM (Business Intelligence Manager), Master Data Manager, or DPO (Data Protection Officer—especially responsible for GDPR compliance).

These roles are still evolving. In many organizations, a single Data Analyst handles multiple responsibilities, coordinating with technical and financial teams to achieve data goals. In other companies, digital marketing teams take charge of data-related aspects. A new profession has emerged: the Web Analyst, who studies web activity and performance (Google Analytics, social media, SEO, etc.).

Data jobs will continue to evolve. As robotics and AI progress, new roles will emerge, and others will disappear. Some companies don’t have data teams but already know how to leverage data effectively. Let's now look at who the main data stakeholders are in companies.

The main data actors are obviously domain experts. They collect, analyze, and interpret data to meet company objectives.

Among these, one of the first key players is the Chief Information Officer (CIO). They provide the technical environment needed to launch data initiatives. The CIO chooses tools and systems while ensuring information security. Technical teams then implement all related systems.

Executives and decision-makers also play a role. They use data to steer the organization. The information collected helps guide decision-making and set new objectives to improve performance. Data supports both strategic and operational management.

Marketing teams rely on data to better understand customer expectations. Improving their strategies heavily depends on it. They also use data to assess company strengths and weaknesses and refine their approach.

Sales divisions use CRM systems to qualify prospects and customers and analyze results by product, category, region, or individual salesperson.
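
Analyzing results by region or product boils down to grouping records and summing a value. Here is a minimal sketch using only the Python standard library; the sales records and field names are invented for the illustration, and a real CRM export would have its own structure.

```python
from collections import defaultdict

# Hypothetical sales records, as they might be exported from a CRM.
sales = [
    {"region": "North", "product": "A", "amount": 1200.0},
    {"region": "South", "product": "A", "amount": 800.0},
    {"region": "North", "product": "B", "amount": 300.0},
]


def total_by(records, key):
    """Sum the 'amount' field of the records, grouped by the given key."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


print(total_by(sales, "region"))
print(total_by(sales, "product"))
```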

Lastly, the finance department leverages data with accounting tools to measure profitability, forecast results and budgets, and evaluate financial risks.

Data is everyone’s concern. Knowing how to use and understand data is as essential as knowing how to read, count, or walk. In small organizations, a data lead may be appointed to champion data use across all business processes.

Exercise 1 - 1

🏋️

Can you create a summary table of the main data-related roles/jobs?

The Data Process in Companies

To help make data more understandable, experts have defined a process. Here is the data processing workflow: from data collection to analysis.

The first stage of the data process is discovery, in other words, understanding the context. This often starts with an internal problem. For example: "We don’t know the type of customer who books trips to Hungary on our website." Teams then work to discover and define the problem.

The second stage is reflection. Teams assess the problem, analyze its internal and external impact, and consider the value of solving it. They also think about the tools and resources (human, financial, technological) needed to collect and analyze the data.

Once the problem is clearly defined and validated, we move on to the third stage: structuring. This means organizing data within a defined framework. Continuing our example, this could involve selecting a time period, online or physical context, price range, etc.—everything needed to guide data collection and analysis.

Despite structuring, data remains raw. It’s essential to clean the data to avoid distorted results. In our example, some clients may have canceled after purchasing, others may have bought for someone else, and some may have deleted their accounts. A cleaning phase is critical.
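
The cleaning step for the travel example might look like the sketch below. The booking records and field names (`status`, `customer_id`, `for_self`) are assumptions made for this illustration; the three rules mirror the three cases mentioned above.

```python
# Hypothetical raw bookings for trips to Hungary.
bookings = [
    {"id": 1, "customer_id": "c1", "status": "confirmed", "for_self": True},
    {"id": 2, "customer_id": "c2", "status": "canceled", "for_self": True},
    {"id": 3, "customer_id": None, "status": "confirmed", "for_self": True},   # account deleted
    {"id": 4, "customer_id": "c4", "status": "confirmed", "for_self": False},  # bought for someone else
]


def is_clean(booking) -> bool:
    """Keep only confirmed bookings made by an existing customer for themselves."""
    return (
        booking["status"] == "confirmed"
        and booking["customer_id"] is not None
        and booking["for_self"]
    )


clean_bookings = [b for b in bookings if is_clean(b)]
print([b["id"] for b in clean_bookings])
```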

After cleaning comes enrichment: adding filters and layers of detail, either manually or automatically. For instance, filters like nationality, age, gender, and average basket value could help better address the original question.
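
Enrichment can be sketched as joining each booking with customer attributes and deriving a new field. The data, the field names, and the age bands below are hypothetical choices for the example.

```python
# Hypothetical customer profiles and bookings.
customers = {
    "c1": {"nationality": "FR", "age": 34},
    "c5": {"nationality": "HU", "age": 61},
}
bookings = [
    {"id": 1, "customer_id": "c1", "amount": 450.0},
    {"id": 5, "customer_id": "c5", "amount": 900.0},
]


def age_band(age: int) -> str:
    """Group ages into coarse bands that are easier to analyze."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30-59"
    return "60+"


def enrich(booking, profiles):
    """Return a copy of the booking with nationality and age band added."""
    profile = profiles[booking["customer_id"]]
    return {
        **booking,
        "nationality": profile["nationality"],
        "age_band": age_band(profile["age"]),
    }


enriched = [enrich(b, customers) for b in bookings]
print(enriched)
```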

Once data is structured, cleaned, and enriched, it must be validated. This means checking that it aligns with the original goal and is logically consistent.
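
A validation pass can be expressed as a list of simple checks that report problems instead of silently dropping rows. The rules below (destination must be Hungary, positive amount, plausible age) are assumptions chosen to match the running example.

```python
def validate(rows):
    """Return a list of (row id, problem) pairs; an empty list means all checks passed."""
    errors = []
    for row in rows:
        if row.get("destination") != "Hungary":
            errors.append((row["id"], "not a Hungary booking"))
        if row.get("amount", 0) <= 0:
            errors.append((row["id"], "non-positive amount"))
        if not 0 < row.get("age", 0) < 120:
            errors.append((row["id"], "implausible age"))
    return errors


# Hypothetical rows after structuring, cleaning, and enrichment.
rows = [
    {"id": 1, "destination": "Hungary", "amount": 450.0, "age": 34},
    {"id": 2, "destination": "Spain", "amount": 300.0, "age": 41},
    {"id": 3, "destination": "Hungary", "amount": -20.0, "age": 150},
]

print(validate(rows))
```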

The final stage is publication. This is the most important step, where data is presented to stakeholders who initiated the process. Publication doesn't mean sending raw data—it means preparing a clear presentation that highlights insights, findings, and conclusions. It’s a step of teaching, storytelling, and connecting the data to company strategies.

Exercise 1-2

🏋️

Create a diagram of this process.

Store Data

❓

Where is data stored?

Hadoop (historical big data foundation)

Type: Distributed file system (HDFS: Hadoop Distributed File System).

Storage: Large files split across multiple machines (clusters).

Data formats: Parquet, Avro, ORC, JSON, CSV, images, videos…

Typical usage: Big data, large-scale analytics, unstructured or semi-structured data.

Access: Via Hive, Spark, Impala, Presto, or directly through MapReduce frameworks.
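
To give a feel for how a large file is split across a cluster, the sketch below estimates the number of blocks and the raw storage used. It assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults but are configurable per cluster; the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (common default, configurable)
REPLICATION = 3      # assumed replication factor (common default, configurable)


def hdfs_footprint(file_size_mb: float) -> dict:
    """Estimate how a file of the given size is laid out under the assumptions above."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION,
        # The last block only occupies the space it needs, so raw storage
        # is the file size multiplied by the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


print(hdfs_footprint(1000))
```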

SQL RDBMS (e.g., MySQL, PostgreSQL, SQL Server…) – the most common

Type: Relational database.

Storage: Tables in a local or network file system, with indexes and a defined schema.

Data formats: Structured (tables with predefined columns and types).

Typical usage: Web apps, ERP, CRM, transactions (OLTP).

Access: SQL queries (SELECT, INSERT, etc.).
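
The sketch below shows a table with a defined schema and the `INSERT` and `SELECT` queries mentioned above. It uses SQLite through Python's built-in `sqlite3` module as a stand-in, because it needs no server; MySQL, PostgreSQL, or SQL Server would require their own drivers and connection settings. The `customers` table is invented for the example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Defined schema: columns and types are declared up front.
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " country TEXT NOT NULL)"
)

# Parameterized INSERT: values are passed separately from the SQL text.
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Marie", "FR"), ("Anna", "HU"), ("Paul", "FR")],
)
conn.commit()

# SELECT with aggregation: number of customers per country.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(rows)

conn.close()
```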

Other modern storage technologies:

Apache Spark (a processing engine that works on top of storage systems, rather than a storage system itself)

Can run locally, on a cluster, or in the cloud (EMR, Databricks).

Compatible with S3, Delta Lake, Hive…

Amazon S3 + Athena / Glue

Scalable cloud storage with query capabilities via Athena or ETL via Glue.

Snowflake / BigQuery / Redshift Spectrum

Auto-scalable cloud data warehouses.

Fast SQL queries on files or databases.

Ideal for BI, ad hoc analysis, data lakes.

Delta Lake / Lakehouse

Combines Data Lake + Data Warehouse.

ACID transactions on raw files (e.g., on S3).

Azure Data Lake

Microsoft’s cloud-based data lake platform.

Google Cloud Storage

Scalable storage service by Google Cloud, often used with BigQuery and Dataflow.

Query Data

There are many tools to query data. The choice depends on the database size, platform architecture (single server or distributed), and the platform itself.

Tool / Platform	Query Language Used	Key Notes
Apache Hive	HiveQL (SQL-like)	Simplified SQL for Hadoop
Apache Spark SQL	Spark SQL (standard SQL)	Can also be used via Python, Scala, etc.
Presto / Trino	Standard SQL	Very fast, often used with S3 or Hive
Amazon Athena	Standard SQL (Presto under the hood)	Serverless, direct queries on S3
Google BigQuery	Standard SQL (with Google extensions)	Very fast queries, supports JSON/struct
Snowflake	Standard SQL	Rich features (CTEs, pivot, UDF, etc.)
Redshift / Spectrum	SQL (PostgreSQL-like)	Compatible with S3 (Spectrum)
Databricks / Delta Lake	Spark SQL	Can be queried from SQL, Python, Scala
ClickHouse	ClickHouse-specific SQL	Very fast for aggregations
Elasticsearch	Query DSL (JSON-based domain-specific language)	Not SQL, uses JSON query syntax
MongoDB	MongoDB Query Language (MQL, JSON-based)	Not SQL, uses aggregation pipelines
Apache Cassandra	CQL (Cassandra Query Language)	SQL-inspired but: no joins, no complex transactions, requires query-driven schema design
Neo4j (graphs)	Cypher	Specific language for graph data
Power BI / Tableau	SQL (via connectors) + internal language (DAX, etc.)	Depends on data source
Google Sheets / Excel	Formulas + sometimes SQL via external connector	For simpler use cases
AWS Glue	SQL (via Glue Data Catalog + Spark) or Python	Used in ETL jobs
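
To compare a few of these languages side by side, the sketch below writes the same question ("names of customers in Hungary older than 30") in SQL, MongoDB's query language, Elasticsearch's Query DSL, and Cypher. Nothing is executed against a real system: the queries are plain Python strings and dictionaries. The collection, label, and field names (`customers`, `Customer`, `country`, `age`, `name`) are hypothetical, and the Elasticsearch `term` filter assumes `country` is indexed as an exact-match (keyword) field.

```python
import json

# SQL (Hive, Spark SQL, Trino, Athena, BigQuery, Snowflake, etc.)
sql = "SELECT name FROM customers WHERE country = 'HU' AND age > 30"

# MongoDB: a filter document, passed to find() on a collection.
mql_filter = {"country": "HU", "age": {"$gt": 30}}

# Elasticsearch: a JSON request body using a bool query with filters.
es_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"country": "HU"}},
                {"range": {"age": {"gt": 30}}},
            ]
        }
    }
}

# Neo4j: Cypher pattern matching on nodes labeled Customer.
cypher = "MATCH (c:Customer) WHERE c.country = 'HU' AND c.age > 30 RETURN c.name"

print(sql)
print(json.dumps(mql_filter))
print(json.dumps(es_query, indent=2))
print(cypher)
```

The SQL and Cypher versions read like sentences, while the MongoDB and Elasticsearch versions are nested JSON structures, which is the main practical difference noted in the table.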

Conclusion

Summary Table: Languages + Tools + Storage Technologies

Tool / Platform	Query Language	Storage Technology Used
Apache Hive	HiveQL (SQL-like)	HDFS (Hadoop)
Apache Spark SQL	Spark SQL	HDFS, S3, Delta Lake, JDBC, etc.
Presto / Trino	Standard SQL	S3, HDFS, Hive, Kafka, JDBC, etc.
Amazon Athena	Standard SQL (Presto)	Amazon S3
AWS Glue	SQL / Python (PySpark)	Amazon S3, JDBC, Redshift, RDS
Google BigQuery	Standard SQL	Native cloud data warehouse (Google infra)
Snowflake	Standard SQL	Cloud data warehouse (Snowflake-native)
Redshift / Spectrum	SQL (PostgreSQL-like)	S3 (Spectrum) + internal Redshift storage
Databricks / Delta Lake	Spark SQL	S3, ADLS, Delta Lake
ClickHouse	Custom SQL	Local disk (columnar format)
Elasticsearch	JSON Query DSL	Lucene index (NoSQL, full-text search)
MongoDB	MQL (Mongo Query Language)	NoSQL document store (BSON/JSON)
Cassandra	CQL (Cassandra Query Language)	Wide-column NoSQL (Cassandra SSTables)
Neo4j	Cypher	Graph database
Power BI / Tableau	SQL via connectors + DAX	Connects to SQL, BigQuery, S3, etc.
Excel / Google Sheets	Formulas + SQL via plugin	XLSX/CSV files or external connections

HDFS: Hadoop Distributed File System

S3 / ADLS: Cloud data lakes (Amazon, Azure)

JDBC: Connectors to traditional relational databases

NoSQL: Cassandra, MongoDB, Elasticsearch, etc.

Cloud warehouse: BigQuery, Snowflake, Redshift

Graph: For complex relational data (e.g., networks)

Exercise 1-3

🏋️

Search online for the creators, companies, or brands and launch dates of all these tools. Make a chronological table.

Big Data

Big Data = Massive Data

The term Big Data refers to a set of data that is so large, varied, and fast to generate that traditional processing tools are no longer sufficient.

Date	Key Event
1997	First documented use of the term “big data” by NASA researchers (Michael Cox & David Ellsworth) to describe datasets too large for standard computers.
Early 2000s	Google publishes its Google File System (2003) and MapReduce (2004) papers; grid computing spreads and the first NoSQL-style systems appear, paving the way for Hadoop
2005	Launch of the Hadoop project, inspired by Google’s MapReduce paper
2008–2010	Explosion in data volume due to social media, cloud computing, IoT
2012	The term “Big Data” becomes a buzzword in media, tech conferences, and corporate strategies

The 5 "V"s of Big Data

V	Meaning	Concrete Example
Volume	Huge amount of data	Logs, videos, IoT, clicks, transactions
Velocity	Data generated continuously	Real-time streaming (Kafka, sensors)
Variety	Heterogeneous data types	Text, images, JSON, CSV, audio…
Veracity	Data quality and reliability	Noisy data, errors, uncertainty
Value	Ability to extract insights	Demand forecasting, fraud detection

Big Data Tools Overview

Category	Examples
Storage	Hadoop (HDFS), Amazon S3, Delta Lake
Processing	Apache Spark, Flink, MapReduce
Querying	Hive, Presto, BigQuery, Athena
Streaming	Apache Kafka, Kinesis
Visualization	Power BI, Superset, Tableau
AI & Machine Learning	TensorFlow, PyTorch, MLlib

Exercise 1-4

This exercise consists of summarizing a lesson in Excel format, then illustrating the key ideas with concrete examples found online, to be presented in a structured Word document or PowerPoint presentation.

Objective

Summarize a lesson as a structured table in Excel.

Identify main ideas.

Find real-world examples online.

Present everything in a clear and illustrated Word document.

Part 1 — Excel Summary

Open Microsoft Excel.

Create a table with the following columns:
Topic / Chapter	Main Idea	Summary (1–2 sentences)	Keywords

For each course section:
- Write down the essential idea.
- Summarize it in one or two sentences.
- Add some associated keywords.

Apply clear formatting: colors, bold text, borders.

Part 2 — Structured Word Document

Open Microsoft Word.

Write a document with the following structure:
- Title Page: title, first name, last name, date
- Introduction: topic of the course
- Body:
  - For each idea from the Excel table:
    - Create a section
    - Summarize the idea
    - Find a real-world example online (website, article, image, video, etc.)
      - Use Google or a search engine like Qwant, with the right keywords
      - Prioritize trustworthy sources
      - Cite your sources
    - Add a personal comment (Why is this useful? What do you take away from it?)
- Conclusion: what you’ve learned from the course

Use proper styles (headings, subheadings, paragraphs).

Internet Research

🔍

How to find reliable sources?

Here are some practical tips to help you identify and use reliable sources, useful for reports, presentations, or research work.

Favor recognized websites

Government websites: end in .gouv.fr, .gov, etc. Examples: data.gouv.fr, nia.nih.gov

Official organizations: INSEE, CNIL, WHO, UN, OECD…

Academic institutions: often end in .edu, .ac.uk. Examples: Harvard, MIT, Sorbonne

Peer-reviewed journals. Examples: Nature, Science, IEEE Transactions, ACM Journal

Use specialized databases

Google Scholar – academic articles

PubMed – medical sciences

IEEE Xplore – engineering, AI

ACM Digital Library – computer science, AI

Cairn, Persée – humanities and social sciences

Be cautious with certain sources

Wikipedia: good starting point, but always verify cited sources.

Blogs, forums, unsourced YouTube videos: beware of reliability.

Commercial or sponsored websites: may be biased by advertising.

Check a source's reliability

Who is the author? Are they identifiable? Are they an expert?

What’s the date? Is it recent or outdated?

Does the source cite other references?

Is the content neutral? Or highly biased/polemical?

Is the style professional? Or sensationalist, with mistakes?

Tip: Targeted Google searches

Use site: to narrow your search to trustworthy domains.

Examples: "machine learning in sports site:.edu", "AI applications in healthcare site:.gouv.fr"
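
If you run many such searches, the query can be assembled programmatically. This is a minimal sketch using Python's standard `urllib.parse`; the helper name `site_search_url` is hypothetical, and it only builds the search URL without sending any request.

```python
from urllib.parse import urlencode


def site_search_url(keywords: str, domain: str) -> str:
    """Build a Google search URL restricted to a domain with the site: operator."""
    query = f"{keywords} site:{domain}"
    # urlencode escapes spaces and the colon so the URL is valid.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("machine learning in sports", ".edu"))
print(site_search_url("AI applications in healthcare", ".gouv.fr"))
```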

💚

Parisweb.art digital agency
Learn all about Julie, our digital projects director:
https://www.linkedin.com/in/juliechaumard/