📕

1 - Data - Definition and uses

Category: Course
Learning order: 2
Status: Prepared
⭐

Course by Julie Chaumard

Chapters | Homeworks | Exercises | Weeks | Dates
What is data? | | | 1 | May 7th
The uses of data | | | 1 | May 7th
Data disciplines in companies | | Exercise 1-1 | 1 | May 7th
The data process in companies | | Exercise 1-2 | 1 | May 7th
Store data | | | 1 | May 7th
Query data | | Exercise 1-3 | 1 | May 7th
 | | Exercise 1-4 | 1 | May 7th
📖

BOOK: Google Data Studio, créer des rapports intuitifs grâce à la datavisualisation, by Youssef Jlidi, Éditions ENI, "Objectif web" collection, 2022

What is data?
  • According to the Le Robert dictionary, data is a “conventional representation of information that allows for automatic processing.”
    • "Conventional" means based on a shared agreement, standard, or rule.
    • In computing, for a machine to process information, it must be encoded in a format the machine understands.
    • The date "May 3, 2025" can be represented as:
      • 03/05/2025 (French format)
      • 05/03/2025 (US format)
      • 2025-05-03 (ISO 8601 international standard)
  • Data is thus an informational vector that comes in many forms.
    • A data point is called an "informational vector" because it carries information.
    • Data is the encoded, machine-processable format in which information can circulate, be stored, and be analyzed.
    • It acts as a transmission support between:
      • a human and a machine,
      • two software programs,
      • two people via a digital medium.
    Information | Data format (vector)
    A name | A string: "Marie"
    An image | A file: image.jpg
    A sound | An audio format: podcast.mp3
    A temperature | A number with unit: 22.5°C
  • Data is a key element for reflection; it is exchanged, shared, and distributed so that individuals or computers can function, communicate, and collaborate.
  • Images, video, audio, numbers, text, or programming languages: data is everywhere, allowing us to quantify and qualify every process in modern systems.
    • In computing, for instance, data is represented in binary (0s and 1s). Machines understand this encoding and use it to perform tasks such as logging into a Windows session or deleting a folder. The binary is then translated back into a human-readable form using letters. Data is thus foundational: it bridges different worlds and serves as an intermediary vector of information.
  • But data is not limited to computing. In science, business, marketing, finance, or politics, data allows us to quantify and qualify information and make sense of indicators that can lead to strategic decisions.
    • For instance, in a company: each day, it deals with masses of information—employee presence and absences, prospects, clients, reviews, review quality, revenue, margins, cash flow, capital... All of these elements form part of the data dimension.
      With this, a business leader can set a strategy, such as calculating profitability by collecting all related data, drawing conclusions, and taking action.
  • Over the past two decades, with the democratization of the Internet, data has become increasingly vital. The term "Big Data" refers to this explosion of digital data. From this, new roles and disciplines have emerged: Data Engineering, Data Mining, Data Centers, Data Analysis, Web Analytics, and Data Processing.
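
The idea of a "conventional representation" from this section can be illustrated with a minimal Python sketch. It writes the date May 3, 2025 in the three formats listed earlier, then shows the name "Marie" as the bits a machine would store. The choice of UTF-8 as the text encoding is an assumption made for the illustration.

```python
from datetime import date

# The date used in the example: May 3, 2025.
d = date(2025, 5, 3)

# One piece of information, three conventional representations.
french = d.strftime("%d/%m/%Y")  # day/month/year
us = d.strftime("%m/%d/%Y")      # month/day/year
iso = d.isoformat()              # ISO 8601: year-month-day

print(french, us, iso)

# The name "Marie" as binary, assuming UTF-8 encoding (one byte per letter here).
name = "Marie"
bits = " ".join(format(byte, "08b") for byte in name.encode("utf-8"))
print(bits)
```

Without knowing which convention applies, a reader cannot tell whether 03/05/2025 means May 3 or March 5, which is why shared formats such as ISO 8601 matter when data is exchanged.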

The Uses of Data
  • Editable, reusable, and accessible, data has become the best ally of internet users. It’s available all over the web and anyone can use it. It appears in different forms. Let’s explore how it can be used in various contexts.
    • On a social level, data helps citizens make better decisions, find answers to their questions, and solve their problems. It also has the power to improve their quality of life. With all types of data available online, it’s possible to find a job, meet people, fill out tax forms, ask questions in forums, or book a cinema ticket.
    • In the transportation sector, data is essential and will continue to be so in the future. Drivers now use data-processing apps like Waze and Google Maps, which offer optimized routes while alerting users to incidents or speed checks. These systems harness the power of the crowd, as each user contributes to data validation and updates via alerts and corrections.
    • In the professional world, data is critical. It’s found in every area of business. In HR, for instance, HRIS (Human Resource Information Systems) qualify employees through data (attendance, leave, training...). In customer management, CRM tools (Customer Relationship Management) are used to analyze customer data and develop micro or macro-level commercial strategies.
    • From an economic standpoint, data helps measure the economic impact of a population or a country through indicators like HDI, purchasing power, GDP and GDP per capita, etc. Increasingly, data also supports economic forecasting using mathematical models to improve decision-making.
    • In healthcare, data allows medical professionals to profile patients, review their medical history, and coordinate care. In recent years, data has helped manage epidemics—identify clusters, isolate carriers, measure trends, and implement interventions.
    • There are many more examples of how data is used. For instance, in industries, we now have "Industry 3.0" and "Industry 4.0" smart factories, or adaptive learning platforms like LinkedIn Learning that tailor learning paths using data (and AI). Data usage is everywhere and is spawning new disciplines.
      • Industry 3.0 and 4.0: Historical context of industrial evolution
        Revolution | Period | Main characteristic
        1.0 | Late 18th century | Mechanization, steam engines
        2.0 | Late 19th – early 20th century | Electricity, mass production
        3.0 | 1970s–2000s | Automation via electronics and IT
        4.0 | Since ~2010 | Connected factories, AI, big data, IoT
      • LinkedIn Learning

        LinkedIn Learning is an online training platform launched by LinkedIn (owned by Microsoft since 2016). It offers thousands of video courses in various areas such as:

        • personal and professional development,
        • office tools and software (Excel, Photoshop, etc.),
        • programming, design, marketing, project management, etc.

        Courses are created by experts and accessible via subscription, with certificates that can be added to your LinkedIn profile.

      • Adaptive Learning

        Adaptive learning is a personalized teaching method that uses technology to adjust to each learner’s pace, knowledge, and needs.

        • Content adjusts automatically based on user responses and behavior.
        • The learning path is non-linear and evolves.
        • The goal is to optimize learning efficiency by delivering the right content at the right time.
      • LinkedIn Learning and Adaptive Learning

        LinkedIn Learning integrates adaptive learning by:

        • recommending courses based on LinkedIn profile skills,
        • tracking progress and adjusting proposed content,
        • offering personalized paths based on jobs or career goals.

        This creates a tailored learning experience linked directly to individual needs and career growth.

Data Disciplines in Companies
  • Data is one of the key elements for ensuring a company's operation.
  • Data professionals carry out significant work before data can be fully utilized. Data-related jobs require deep skills and knowledge in computing, mathematics (probabilities, statistics...), marketing, and finance.
  • Let’s start with the Data Scientist. This is a data analysis expert with both technical and scientific skills to solve complex data-related problems. A Data Scientist is a blend of mathematician, statistician, and programmer. Their role is to build scientific models (algorithms) to make data usable and valuable.
    They analyze, process, model, and interpret data to create action plans. Positioned between researcher and consultant, they are central to many companies and organizations.
  • Next is the Data Engineer, who programs the systems required for the data processes requested by teams. They develop algorithms to help colleagues collect and use data. This role demands technical skills in managing and designing SQL databases and using programming languages (Python, R...).
  • Then there's the Data Analyst, who interprets data, analyzes results, and presents them, thanks to expertise in statistics and some programming. They design data analysis processes (dashboards, BI and DataViz tools), interpret insights, and make data understandable. Acting as a bridge between the Data team and others, their strength lies in simplifying complex information.
  • Data Miner, by contrast, is not a job title here but a data extraction (web scraping) tool designed to help users automatically collect information from web pages and convert it into usable tables (CSV, Excel, Google Sheets...).
  • Other key roles include the Machine Learning Engineer and the Data Steward, a data quality manager who ensures data is consistent, well-documented, reliable, accessible, and compliant with internal policies and regulations. They play a crucial role in data governance projects.
  • Depending on the company’s digital maturity, other roles may exist: Chief Data Officer, Big Data Architect, BIM (Business Intelligence Manager), Master Data Manager, or DPO (Data Protection Officer—especially responsible for GDPR compliance).
  • These roles are still evolving. In many organizations, a single Data Analyst handles multiple responsibilities, coordinating with technical and financial teams to achieve data goals. In other companies, digital marketing teams take charge of data-related aspects. A new profession has emerged: the Web Analyst, who studies web activity and performance (Google Analytics, social media, SEO, etc.).
  • Data jobs will continue to evolve. As robotics and AI progress, new roles will emerge, and others will disappear. Some companies don’t have data teams but already know how to leverage data effectively. Let's now look at who the main data stakeholders are in companies.
  • The main data actors are obviously domain experts. They collect, analyze, and interpret data to meet company objectives.
  • Among these, one of the first key players is the Chief Information Officer (CIO). They provide the technical environment needed to launch data initiatives. The CIO chooses tools and systems while ensuring information security. Technical teams then implement all related systems.
  • Executives and decision-makers also play a role. They use data to steer the organization. The information collected helps guide decision-making and set new objectives to improve performance. Data supports both strategic and operational management.
  • Marketing teams rely on data to better understand customer expectations. Improving their strategies heavily depends on it. They also use data to assess company strengths and weaknesses and refine their approach.
  • Sales divisions use CRM systems to qualify prospects and customers and analyze results by product, category, region, or individual salesperson.
  • Lastly, the finance department leverages data with accounting tools to measure profitability, forecast results and budgets, and evaluate financial risks.
  • Data is everyone’s concern. Knowing how to use and understand data is as essential as knowing how to read, count, or walk. In small organizations, a data lead may be appointed to champion data use across all business processes.

Exercise 1 - 1

🏋️

Create a diagram of these data-related jobs.

The Data Process in Companies

To help make data more understandable, experts have defined a process.

  1. The first stage is discovery, in other words, understanding the context. This often starts with an internal problem, for example: "We don’t know what type of customer books trips to Hungary on our website." Teams then work to discover and define the problem.
  2. The second stage is reflection. Teams assess the problem, analyze its internal and external impact, and consider the value of solving it. They also think about the tools and resources (human, financial, technological) needed to collect and analyze the data.
  3. Once the problem is clearly defined and validated, we move on to the third stage: structuring. This means organizing data within a defined framework. Continuing our example, this could involve selecting a time period, an online or physical context, a price range, etc.: everything needed to guide data collection and analysis.
  4. The fourth stage is cleaning. Despite structuring, data remains raw, and it must be cleaned to avoid distorted results. In our example, some clients may have canceled after purchasing, others may have bought for someone else, and some may have deleted their accounts.
  5. The fifth stage is enrichment: adding filters and layers of detail, either manually or automatically. For instance, filters like nationality, age, gender, and average basket value could help better address the original question.
  6. The sixth stage is validation. Once data is structured, cleaned, and enriched, it must be checked against the original goal and for logical consistency.
  7. The final stage is publication. This is the most important step, where data is presented to the stakeholders who initiated the process. Publication doesn't mean sending raw data; it means preparing a clear presentation that highlights insights, findings, and conclusions. It’s a step of teaching, storytelling, and connecting the data to company strategies.
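
The structuring, cleaning, enrichment, validation, and publication stages can be sketched in a few lines of Python on the Hungary example. This is only an illustration: the bookings, field names, time period, and age brackets below are all invented, and a real project would work on far larger data with dedicated tools.

```python
from collections import Counter
from datetime import date

# Hypothetical raw bookings; every field name and value is invented.
bookings = [
    {"client": "A", "destination": "Hungary", "booked_on": date(2024, 3, 2), "canceled": False, "age": 34, "basket": 820.0},
    {"client": "B", "destination": "Hungary", "booked_on": date(2024, 6, 15), "canceled": True, "age": 52, "basket": 1100.0},
    {"client": "C", "destination": "Italy", "booked_on": date(2024, 4, 10), "canceled": False, "age": 29, "basket": 640.0},
    {"client": "D", "destination": "Hungary", "booked_on": date(2023, 11, 20), "canceled": False, "age": 41, "basket": 900.0},
    {"client": "E", "destination": "Hungary", "booked_on": date(2024, 9, 5), "canceled": False, "age": 27, "basket": 700.0},
]

# Structuring: keep only what fits the chosen framework (destination and period).
start, end = date(2024, 1, 1), date(2024, 12, 31)
structured = [
    b for b in bookings
    if b["destination"] == "Hungary" and start <= b["booked_on"] <= end
]

# Cleaning: canceled bookings would distort the result.
cleaned = [b for b in structured if not b["canceled"]]

# Enrichment: add a layer of detail (an age bracket) to each record.
def age_bracket(age):
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"

enriched = [{**b, "age_bracket": age_bracket(b["age"])} for b in cleaned]

# Validation: check consistency with the original question.
assert all(b["destination"] == "Hungary" for b in enriched)
assert all(b["basket"] > 0 for b in enriched)

# Publication: present a readable summary, not the raw records.
summary = Counter(b["age_bracket"] for b in enriched)
average_basket = sum(b["basket"] for b in enriched) / len(enriched)
for bracket, count in sorted(summary.items()):
    print(f"{bracket}: {count} booking(s)")
print(f"Average basket: {average_basket:.2f}")
```

Each stage produces a new list instead of modifying the previous one, so the result of any stage can be inspected on its own.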

Exercise 1-2

🏋️

Create a diagram of this process.

Store Data
❓

Where is data stored?

Hadoop (historical big data foundation)

  • Type: Distributed file system (HDFS: Hadoop Distributed File System).
  • Storage: Large files split across multiple machines (clusters).
  • Data formats: Parquet, Avro, ORC, JSON, CSV, images, videos...
  • Typical usage: Big data, large-scale analytics, unstructured or semi-structured data.
  • Access: Via Hive, Spark, Impala, Presto, or directly through MapReduce frameworks.
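
To make "large files split across multiple machines" more concrete, here is a simplified Python sketch. It uses HDFS's default block size (128 MB) and default replication factor (3), but the node names are hypothetical and the round-robin placement is an assumption made for readability: real HDFS placement is rack-aware and decided by the NameNode.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size in recent Hadoop versions
REPLICATION = 3      # HDFS default replication factor

def place_blocks(file_size_mb, nodes):
    """Split a file into fixed-size blocks and copy each block to REPLICATION
    distinct nodes. Simplified round-robin, not the real HDFS policy."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for i in range(n_blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

# Hypothetical cluster of four machines storing a 1000 MB file.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(1000, nodes)
for block, holders in placement.items():
    print(f"block {block}: {holders}")
```

Because every block lives on several machines, losing one machine does not lose the file, and different machines can process different blocks in parallel.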

SQL RDBMS (e.g., MySQL, PostgreSQL, SQL Server...) – the most common

  • Type: Relational database.
  • Storage: Tables in a local or network file system, with indexes and a defined schema.
  • Data formats: Structured (tables with predefined columns and types).
  • Typical usage: Web apps, ERP, CRM, transactions (OLTP).
  • Access: SQL queries (SELECT, INSERT, etc.).
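
A relational table with a defined schema, an index, and SQL access can be tried out with Python's built-in sqlite3 module. SQLite is used here only as a lightweight stand-in for MySQL, PostgreSQL, or SQL Server; the table and column names are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Defined schema: columns and types are declared before any data is stored.
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT NOT NULL
    )
""")

# An index speeds up lookups on the indexed column.
conn.execute("CREATE INDEX idx_customers_country ON customers (country)")

# INSERT: add rows (the ? placeholders keep values separate from the SQL text).
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Marie", "France"), ("Anna", "Hungary"), ("Paul", "France")],
)
conn.commit()

# SELECT: query the structured data.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name",
    ("France",),
).fetchall()
print(rows)

conn.close()
```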

Other modern storage and processing technologies:

Apache Spark

  • Can run locally, on a cluster, or in the cloud (EMR, Databricks).
  • Compatible with S3, Delta Lake, Hive...

Amazon S3 + Athena / Glue

  • Scalable cloud storage with query capabilities via Athena or ETL via Glue.

Snowflake / BigQuery / Redshift Spectrum

  • Auto-scalable cloud data warehouses.
  • Fast SQL queries on files or databases.
  • Ideal for BI, ad hoc analysis, data lakes.

Delta Lake / Lakehouse

  • Combines Data Lake + Data Warehouse.
  • ACID transactions on raw files (e.g., on S3).

Azure Data Lake

  • Microsoft’s cloud-based data lake platform.

Google Cloud Storage

  • Scalable storage service by Google Cloud, often used with BigQuery and Dataflow.

Query Data

There are many tools to query data. The choice depends on the database size, platform architecture (single server or distributed), and the platform itself.

Tool / Platform | Query Language Used | Key Notes
Apache Hive | HiveQL (SQL-like) | Simplified SQL for Hadoop
Apache Spark SQL | Spark SQL (standard SQL) | Can also be used via Python, Scala, etc.
Presto / Trino | Standard SQL | Very fast, often used with S3 or Hive
Amazon Athena | Standard SQL (Presto under the hood) | Serverless, direct queries on S3
Google BigQuery | Standard SQL (with Google extensions) | Very fast queries, supports JSON/struct
Snowflake | Standard SQL | Rich features (CTEs, pivot, UDF, etc.)
Redshift / Spectrum | SQL (PostgreSQL-like) | Compatible with S3 (Spectrum)
Databricks / Delta Lake | Spark SQL | Can be queried from SQL, Python, Scala
ClickHouse | ClickHouse-specific SQL | Very fast for aggregations
Elasticsearch | Query DSL (JSON-based domain-specific language) | Not SQL, uses JSON query syntax
MongoDB | MongoDB Query Language (MQL, JSON-based) | Not SQL, uses aggregation pipelines
Apache Cassandra | CQL (Cassandra Query Language) | SQL-inspired, but no joins and no complex transactions; requires query-driven schema design
Neo4j (graphs) | Cypher | Specific language for graph data
Power BI / Tableau | SQL (via connectors) + internal language (DAX, etc.) | Depends on data source
Google Sheets / Excel | Formulas + sometimes SQL via external connector | For simpler use cases
AWS Glue | SQL (via Glue Data Catalog + Spark) or Python | Used in ETL jobs
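
To show how the same question looks in SQL and in the JSON-based languages from the table, here is a small Python sketch. It only builds and prints the queries without connecting to any database. The table, collection, and field names are hypothetical, and the Elasticsearch `term` clause assumes that `country` is indexed as an exact-match (keyword) field.

```python
import json

# Question: which customers from Hungary spent more than 500?

# SQL (Hive, Spark SQL, Presto, Athena, BigQuery, Snowflake...).
sql_query = "SELECT name FROM customers WHERE country = 'Hungary' AND total_spent > 500"

# MongoDB: a filter document, of the kind passed to a find() call.
mongo_filter = {"country": "Hungary", "total_spent": {"$gt": 500}}

# Elasticsearch: a Query DSL body, of the kind sent with a search request.
es_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"country": "Hungary"}},
                {"range": {"total_spent": {"gt": 500}}},
            ]
        }
    }
}

print(sql_query)
print(json.dumps(mongo_filter))
print(json.dumps(es_query, indent=2))
```

The SQL version is one sentence-like string, while the two JSON-based versions are nested structures, which is the main practical difference a newcomer will notice.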

Conclusion

Summary Table: Languages + Tools + Storage Technologies

Tool / Platform | Query Language | Storage Technology Used
Apache Hive | HiveQL (SQL-like) | HDFS (Hadoop)
Apache Spark SQL | Spark SQL | HDFS, S3, Delta Lake, JDBC, etc.
Presto / Trino | Standard SQL | S3, HDFS, Hive, Kafka, JDBC, etc.
Amazon Athena | Standard SQL (Presto) | Amazon S3
AWS Glue | SQL / Python (PySpark) | Amazon S3, JDBC, Redshift, RDS
Google BigQuery | Standard SQL | Native cloud data warehouse (Google infra)
Snowflake | Standard SQL | Cloud data warehouse (Snowflake-native)
Redshift / Spectrum | SQL (PostgreSQL-like) | S3 (Spectrum) + internal Redshift storage
Databricks / Delta Lake | Spark SQL | S3, ADLS, Delta Lake
ClickHouse | Custom SQL | Local disk (columnar format)
Elasticsearch | JSON DSL | Lucene index (NoSQL, full-text search)
MongoDB | MQL (Mongo Query Language) | NoSQL document store (BSON/JSON)
Cassandra | CQL (Cassandra Query Language) | Wide-column NoSQL (Cassandra SSTables)
Neo4j | Cypher | Graph database
Power BI / Tableau | SQL via connectors + DAX | Connects to SQL, BigQuery, S3, etc.
Excel / Google Sheets | Formulas + SQL via plugin | XLSX/CSV files or external connections
  • HDFS: Hadoop Distributed File System
  • S3 / ADLS: Cloud data lakes (Amazon, Azure)
  • JDBC: Connectors to traditional relational databases
  • NoSQL: Cassandra, MongoDB, Elastic, etc.
  • Cloud warehouse: BigQuery, Snowflake, Redshift
  • Graph: For complex relational data (e.g., networks)

Exercise 1-3

🏋️

Search online for the creators, companies, or brands and launch dates of all these tools. Make a chronological table.

Big Data

Big Data = Massive Data

The term Big Data refers to datasets that are so large, so varied, and generated so quickly that traditional processing tools are no longer sufficient.

Date | Key Event
1997 | First documented use of the term “big data” by NASA researchers (Michael Cox & David Ellsworth) to describe datasets too large for standard computers.
Early 2000s | Rise of NoSQL databases and grid computing; Google publishes its Google File System (2003) and MapReduce (2004) papers, which lay the groundwork for Hadoop (later developed largely at Yahoo)
2005 | Launch of the Hadoop project, inspired by Google’s MapReduce paper
2008–2010 | Explosion in data volume due to social media, cloud computing, IoT
2012 | The term “Big Data” becomes a buzzword in media, tech conferences, and corporate strategies

The 5 "V"s of Big Data

V | Meaning | Concrete Example
Volume | Huge amount of data | Logs, videos, IoT, clicks, transactions
Velocity | Data generated continuously | Real-time streaming (Kafka, sensors)
Variety | Heterogeneous data types | Text, images, JSON, CSV, audio
Veracity | Data quality and reliability | Noisy data, errors, uncertainty
Value | Ability to extract insights | Demand forecasting, fraud detection

Big Data Tools Overview

Category | Examples
Storage | Hadoop (HDFS), Amazon S3, Delta Lake
Processing | Apache Spark, Flink, MapReduce
Querying | Hive, Presto, BigQuery, Athena
Streaming | Apache Kafka, Kinesis
Visualization | Power BI, Superset, Tableau
AI & Machine Learning | TensorFlow, PyTorch, MLlib

Exercise 1-4

This exercise consists of summarizing a lesson in Excel format, then illustrating the key ideas with concrete examples found online, to be presented in a structured Word document or PowerPoint presentation.

Objective

  • Summarize a lesson as a structured table in Excel.
  • Identify main ideas.
  • Find real-world examples online.
  • Present everything in a clear and illustrated Word document.

Part 1 — Excel Summary

  1. Open Microsoft Excel.
  2. Create a table with the following columns:
    Topic / Chapter | Main Idea | Summary (1–2 sentences) | Keywords
  3. For each course section:
    • Write down the essential idea.
    • Summarize it in one or two sentences.
    • Add some associated keywords.
  4. Apply clear formatting: colors, bold text, borders...

Part 2 — Structured Word Document

  1. Open Microsoft Word.
  2. Write a document with the following structure:
    • Title Page: title, first name, last name, date
    • Introduction: topic of the course
    • Body:
      • For each idea from the Excel table:
        • Create a section
        • Summarize the idea
        • Find a real-world example online (website, article, image, video, etc.)
          • Use Google or a search engine like Qwant, with the right keywords
          • Prioritize trustworthy sources
          • Cite your sources
        • Add a personal comment (Why is this useful? What do you take away from it?)
    • Conclusion: what you’ve learned from the course
  3. Use proper styles (headings, subheadings, paragraphs)

Internet Research
🔍

How to find reliable sources?

Here are some practical tips to help you identify and use reliable sources, useful for reports, presentations, or research work.

Favor recognized websites

  • Government websites: end in .gouv.fr, .gov, etc. Examples: data.gouv.fr, nia.nih.gov
  • Official organizations: INSEE, CNIL, WHO, UN, OECD...
  • Academic institutions: often end in .edu, .ac.uk. Examples: Harvard, MIT, Sorbonne
  • Peer-reviewed journals. Examples: Nature, Science, IEEE Transactions, ACM Journal

Use specialized databases

  • PubMed – medical sciences

Be cautious with certain sources

  • Wikipedia: good starting point, but always verify cited sources.
  • Blogs, forums, unsourced YouTube videos: beware of reliability.
  • Commercial or sponsored websites: may be biased by advertising.

Check a source's reliability

  • Who is the author? Are they identifiable? Are they an expert?
  • What’s the date? Is it recent or outdated?
  • Does the source cite other references?
  • Is the content neutral? Or highly biased/polemical?
  • Is the style professional? Or sensationalist, with mistakes?

Tip: Targeted Google searches

Use site: to narrow your search to trustworthy domains.

Examples: machine learning in sports site:.edu, AI applications in healthcare site:.gouv.fr
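
For repeated searches, the same targeted query can be turned into a link with a few lines of Python. This is a small sketch: the helper name `search_url` is hypothetical, and it relies only on the standard `q` parameter of Google's search page.

```python
from urllib.parse import urlencode

def search_url(keywords, domain):
    """Build a Google search URL restricted to one domain with the site: operator."""
    query = f"{keywords} site:{domain}"
    # urlencode escapes spaces and special characters so the URL stays valid.
    return "https://www.google.com/search?" + urlencode({"q": query})

url = search_url("machine learning in sports", ".edu")
print(url)
```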

💚

Agence digitale Parisweb.art
Tout savoir sur Julie, notre directrice de projets digitaux :
https://www.linkedin.com/in/juliechaumard/