📕

1 - Data - Definition and uses

What is data?
  • According to Le Robert dictionary, data is a “conventional representation of information that allows for automatic processing.”
    • "Conventional" means based on a shared agreement, standard, or rule.
    • In computing, for a machine to process information, it must be encoded in a format the machine understands.
    • The date "May 3, 2025" can be represented as:
      • 03/05/2025 (French format)
      • 05/03/2025 (US format)
      • 2025-05-03 (ISO 8601 international standard)
  • Data is thus an informational vector that comes in many forms.
    • A data point is called an "informational vector" because it carries information.
    • Data is the encoded and automatically processed format through which information can circulate, be stored, or analyzed.
    • It acts as a transmission medium between:
      • a human and a machine,
      • two software programs,
      • two people via a digital medium.
    Information | Data format (vector)
    A name | A string: "Marie"
    An image | A file: image.jpg
    A sound | An audio format: podcast.mp3
    A temperature | A number with unit: 22.5°C
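The date example above can be made concrete with a short Python sketch. It decodes the three written forms of May 3, 2025 with the standard `datetime` module and checks that they describe the same day. The dictionary name `representations` and the helper `parse` are illustrative choices, not part of the course material.

```python
from datetime import date, datetime

# The same calendar day written under the three conventions listed above.
representations = {
    "French (DD/MM/YYYY)": ("03/05/2025", "%d/%m/%Y"),
    "US (MM/DD/YYYY)": ("05/03/2025", "%m/%d/%Y"),
    "ISO 8601 (YYYY-MM-DD)": ("2025-05-03", "%Y-%m-%d"),
}


def parse(text: str, fmt: str) -> date:
    """Decode a date string using an explicitly stated convention."""
    return datetime.strptime(text, fmt).date()


parsed = {name: parse(text, fmt) for name, (text, fmt) in representations.items()}

for name, value in parsed.items():
    print(f"{name:<24} -> {value.isoformat()}")

# Reading the French string with the US convention raises no error:
# it silently produces a different day (March 5 instead of May 3).
misread = parse("03/05/2025", "%m/%d/%Y")
print("French string read as US format:", misread.isoformat())
```

The last two lines illustrate why the convention matters: without an agreed format, the machine processes the data without complaint but recovers the wrong information.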
  • Data is essential for analyzing a situation; it is exchanged, shared, and distributed so that individuals or computers can function, communicate, and collaborate.
  • Images, video, audio, numbers, text, computer code: data is everywhere, allowing us to quantify and qualify each process in modern systems.
    • In science, business, marketing, finance, or politics, data allows us to quantify and qualify information and make sense of indicators that can lead to strategic decisions.
      • For instance, a company deals with masses of information every day: employee attendance and absences, prospects, clients, reviews, review quality, revenue, margins, cash flow, capital... All of these elements are part of its data.
        With this data, a business leader can set a strategy: for example, calculate profitability by collecting all related data, draw conclusions, and take action.
    • Over the past two decades, with the democratization of the Internet, data has become increasingly vital. The term "Big Data" refers to this explosion of digital data. From this, new roles and disciplines have emerged: Data Engineering, Data Mining, Data Centers, Data Analysis, Web Analytics, Data Processing…
    The Uses of Data
    • Editable, reusable, and accessible, data has become the best friend of internet users. It’s available all over the web and anyone can use it. It appears in different forms. Let’s explore how it can be used in various contexts.
      • On a social level, data helps citizens make better decisions, find answers to their questions, and solve countless problems. It also has the power to improve their quality of life. With all types of data available online, it’s possible to find a job, meet people, fill out tax forms, ask questions in forums, or book a cinema ticket.
      • In the transportation sector, data is essential and will continue to be so in the future. Drivers now use data-processing apps like Waze and Google Maps, which offer optimized routes while alerting users to incidents or speed checks. These systems harness the power of the crowd, as each user contributes to data validation and updates via alerts and corrections.
      • In the professional world, data is critical. It’s found in every area of business. In HR, for instance, HRIS (Human Resource Information Systems) qualify employees through data (attendance, leave, training...). In customer management, CRM tools (Customer Relationship Management) are used to analyze customer data and develop micro or macro-level commercial strategies.
      • From an economic standpoint, data helps measure the economic impact of a population or a country through indicators like HDI, purchasing power, GDP and GDP per capita, etc. Increasingly, data also supports economic forecasting using mathematical models to improve decision-making.
      • In healthcare, data allows medical professionals to profile patients, review their medical history, and coordinate care. In recent years, data has helped manage epidemics—identify clusters, isolate carriers, measure trends, and implement interventions.
      • There are many more examples of how data is used. For instance, industry has moved from "Industry 3.0" automation to "Industry 4.0" smart factories, and adaptive learning platforms like LinkedIn Learning tailor learning paths using data (and AI). Data usage is everywhere and is creating new disciplines.
        • Industry 3.0 and 4.0: Historical context of industrial evolution
          Revolution | Period | Main characteristic
          1.0 | Late 18th century | Mechanization, steam engines
          2.0 | Late 19th – early 20th century | Electricity, mass production
          3.0 | 1970s–2000s | Automation via electronics and IT
          4.0 | Since ~2010 | Connected factories, AI, big data, IoT
        • LinkedIn Learning

          LinkedIn Learning is an online training platform launched by LinkedIn (owned by Microsoft since 2016). It offers thousands of video courses in various areas such as:

          • personal and professional development,
          • office tools and software (Excel, Photoshop, etc.),
          • programming, design, marketing, project management, etc.

          Courses are created by experts and accessible via subscription, with certificates that can be added to your LinkedIn profile.
          Big Data = LinkedIn has a huge amount of data about its users.
          AI = LinkedIn uses this data to recommend content and analyze trends.

        • Adaptive Learning

          Adaptive learning is a personalized teaching method that uses technology to adjust to each learner’s pace, knowledge, and needs.

          • Content adjusts automatically based on user responses and behavior.
          • The learning path is non-linear and evolves.
          • The goal is to optimize learning efficiency by delivering the right content at the right time.
        • LinkedIn Learning and Adaptive Learning

          LinkedIn Learning integrates adaptive learning by:

          • recommending courses based on LinkedIn profile skills,
          • tracking progress and adjusting proposed content,
          • offering personalized paths based on jobs or career goals.

          This creates a tailored learning experience linked directly to individual needs and career growth.
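The general adaptive learning principle described above (content adjusting to the learner's responses) can be sketched in a few lines of Python. This is a deliberately simple toy rule, not how LinkedIn Learning or any specific platform works. The `LEVELS` scale, the `Learner` class, and the `next_level` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

LEVELS = ["beginner", "intermediate", "advanced"]  # hypothetical difficulty scale


@dataclass
class Learner:
    level: int = 0                               # index into LEVELS
    history: list = field(default_factory=list)  # past answers (True/False)


def next_level(learner: Learner, answered_correctly: bool) -> str:
    """Record the answer, then move one level up after a correct answer
    and one level down after a wrong one, staying within the scale."""
    learner.history.append(answered_correctly)
    step = 1 if answered_correctly else -1
    learner.level = max(0, min(len(LEVELS) - 1, learner.level + step))
    return LEVELS[learner.level]


learner = Learner()
path = [next_level(learner, ok) for ok in [True, True, False, True]]
print(path)
```

The resulting path is non-linear: the learner moves up, drops back after a wrong answer, then moves up again. Real platforms typically rely on much richer signals (time spent, skills on the profile, content metadata) than this single rule.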

    Data Disciplines in Companies
    • Data is essential for a company to operate.
    • Data professionals carry out significant work before data can be fully utilized. Data-related jobs require deep skills and knowledge in computing, mathematics (probabilities, statistics…), marketing, and finance.
    • Let’s start with the Data Scientist. This is a data analysis expert with both technical and scientific skills to solve complex data-related problems. A Data Scientist is a blend of mathematician, statistician, and programmer. Their role is to build scientific models (algorithms) to make data usable and valuable.
      They analyze, process, model, and interpret data to create action plans. Positioned between researcher and consultant, they are central to many companies and organizations.
    • Next is the Data Engineer, who programs the systems required for the data processes requested by teams. They develop algorithms to help colleagues collect and use data. This role demands technical skills in managing and designing SQL databases and using programming languages (Python, R...).
    • Then there's the Data Analyst, who interprets data, analyzes results, and presents them, thanks to expertise in statistics and some programming. They design data analysis processes (dashboards, BI and DataViz tools), interpret insights, and make data understandable. Acting as a bridge between the Data team and others, their strength lies in simplifying complex information.
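    • The Data Miner explores large datasets to uncover patterns, correlations, and trends, using statistics and machine learning. The role relies on data extraction tools, including web scraping tools that automatically collect information from web pages and convert it into usable tables (CSV, Excel, Google Sheets…).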
    • Other key roles include the Machine Learning Engineer, who builds and deploys machine learning models, and the Data Steward, a data quality manager who ensures data is consistent, well-documented, reliable, accessible, and compliant with internal policies and regulations. Data Stewards play a crucial role in data governance projects.
    • Depending on the company’s digital maturity, other roles may exist: Chief Data Officer, Big Data Architect, BIM (Business Intelligence Manager), Master Data Manager, or DPO (Data Protection Officer—especially responsible for GDPR compliance).
    • These roles are still evolving. In many organizations, a single Data Analyst handles multiple responsibilities, coordinating with technical and financial teams to achieve data goals. In other companies, digital marketing teams take charge of data-related aspects. A new profession has emerged: the Web Analyst, who studies web activity and performance (Google Analytics, social media, SEO, etc.).
    • Data jobs will continue to evolve. As robotics and AI progress, new roles will emerge, and others will disappear. Some companies don’t have data teams but already know how to leverage data effectively. Let's now look at who the main data stakeholders are in companies.
    • The main data actors are obviously domain experts. They collect, analyze, and interpret data to meet company objectives.
    • Among these, one of the first key players is the Chief Information Officer (CIO). They provide the technical environment needed to launch data initiatives. The CIO chooses tools and systems while ensuring information security. Technical teams then implement all related systems.
    • Executives and decision-makers also play a role. They use data to steer the organization. The information collected helps guide decision-making and set new objectives to improve performance. Data supports both strategic and operational management.
    • Marketing teams rely on data to better understand customer expectations. Improving their strategies heavily depends on it. They also use data to assess company strengths and weaknesses and refine their approach.
    • Sales divisions use CRM systems to qualify prospects and customers and analyze results by product, category, region, or individual salesperson.
    • Lastly, the finance department leverages data with accounting tools to measure profitability, forecast results and budgets, and evaluate financial risks.
    • Data is everyone’s concern. Knowing how to use and understand data is as essential as knowing how to read, count, or walk. In small organizations, a data lead may be appointed to champion data use across all business processes.

    Exercise 1 - 1

    🏋️

    Create a summary table of the main data-related roles and jobs.

    The Data Process in Companies

    To help make data more understandable, experts have defined a process. Here is the data processing workflow: from data collection to analysis.

    1. The first stage of the data process is discovery: in other words, understanding the context. This often starts with an internal problem. For example: "We don’t know the type of customer who books trips to Hungary on our website." Teams then work to discover and define the problem.
    2. The second stage is reflection. Teams assess the problem, analyze its internal and external impact, and consider the value of solving it. They also think about the tools and resources (human, financial, technological) needed to collect and analyze the data.
    3. Once the problem is clearly defined and validated, we move on to the third stage: structuring. This means organizing data within a defined framework. Continuing our example, this could involve selecting a time period, online or physical context, price range, etc.: everything needed to guide data collection and analysis.
    4. The fourth stage is cleaning. Despite structuring, data remains raw, and it must be cleaned to avoid distorted results. In our example, some clients may have canceled after purchasing, others may have bought for someone else, and some may have deleted their accounts.
    5. The fifth stage is enrichment: adding filters and layers of detail, either manually or automatically. For instance, filters like nationality, age, gender, and average basket value could help better address the original question.
    6. The sixth stage is validation. Once data is structured, cleaned, and enriched, it must be checked against the original goal and for logical consistency.
    7. The final stage is publication. This is the most important step, where data is presented to the stakeholders who initiated the process. Publication doesn’t mean sending raw data; it means preparing a clear presentation that highlights insights, findings, and conclusions. It’s a step of teaching, storytelling, and connecting the data to company strategies.
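The structuring, cleaning, enrichment, validation, and publication stages can be sketched as a small Python pipeline applied to the Hungary example. All records, field names, and age brackets below are invented for illustration. The discovery and reflection stages are human work and are not modeled.

```python
from collections import Counter
from datetime import date

# Hypothetical raw booking records; every field name and value is invented.
raw_bookings = [
    {"customer": "C1", "country": "Hungary", "date": date(2024, 6, 2), "age": 34, "status": "paid"},
    {"customer": "C2", "country": "Hungary", "date": date(2024, 7, 15), "age": 61, "status": "canceled"},
    {"customer": "C3", "country": "Spain", "date": date(2024, 7, 20), "age": 27, "status": "paid"},
    {"customer": "C4", "country": "Hungary", "date": date(2023, 3, 9), "age": 45, "status": "paid"},
    {"customer": "C5", "country": "Hungary", "date": date(2024, 9, 1), "age": 29, "status": "paid"},
]


def structure(rows):
    """Structuring: keep only trips to Hungary booked in 2024."""
    return [r for r in rows if r["country"] == "Hungary" and r["date"].year == 2024]


def clean(rows):
    """Cleaning: drop canceled bookings, which would distort the result."""
    return [r for r in rows if r["status"] != "canceled"]


def enrich(rows):
    """Enrichment: add an age bracket derived from the raw age."""
    def bracket(age):
        return "under 30" if age < 30 else "30-59" if age < 60 else "60+"
    return [{**r, "age_bracket": bracket(r["age"])} for r in rows]


def validate(rows):
    """Validation: check the data still matches the original question."""
    assert all(r["country"] == "Hungary" for r in rows)
    assert all(r["status"] == "paid" for r in rows)
    return rows


def publish(rows):
    """Publication: return a summary instead of handing over raw rows."""
    return Counter(r["age_bracket"] for r in rows)


summary = publish(validate(enrich(clean(structure(raw_bookings)))))
for bracket, count in sorted(summary.items()):
    print(f"{bracket}: {count} booking(s)")
```

Each function maps to one stage, so a stage can be changed (a different time period, an extra cleaning rule) without touching the others. A real project would more likely use a database or a data-frame library than plain lists of dictionaries.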

    Exercise 1-2

    🏋️

    Create a diagram of this process.

    Store Data
    ❓

    Where is data stored?

    Hadoop (historical big data foundation)

    • Type: Distributed file system (HDFS: Hadoop Distributed File System).
    • Storage: Large files split across multiple machines (clusters).
    • Data formats: Parquet, Avro, ORC, JSON, CSV, images, videos…
    • Typical usage: Big data, large-scale analytics, unstructured or semi-structured data.
    • Access: Via Hive, Spark, Impala, Presto, or directly through MapReduce frameworks.
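To illustrate "large files split across multiple machines", here is a toy Python simulation of block splitting and replica placement. The 128 MB block size and the replication factor of 3 are HDFS's documented defaults. The round-robin placement, the node names, and both function names are simplifying assumptions, not the actual HDFS placement policy.

```python
BLOCK_SIZE = 128   # MB; default HDFS block size
REPLICATION = 3    # default HDFS replication factor


def split_into_blocks(file_size_mb: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the size (in MB) of each block a file is cut into."""
    full, rest = divmod(file_size_mb, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(n_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified on purpose: real HDFS placement also takes racks and
    node load into account.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + i) % len(nodes)] for i in range(replication)]
        for b in range(n_blocks)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
blocks = split_into_blocks(300)                    # a 300 MB file
placement = place_blocks(len(blocks), nodes)

for block_id, holders in placement.items():
    print(f"block {block_id} ({blocks[block_id]} MB) -> {holders}")
```

Because every block lives on several machines, losing one node does not lose the file, and different machines can process different blocks in parallel.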

    SQL RDBMS (e.g., MySQL, PostgreSQL, SQL Server…) – the most common

    • Type: Relational database.
    • Storage: Tables in a local or network file system, with indexes and a defined schema.
    • Data formats: Structured (tables with predefined columns and types).
    • Typical usage: Web apps, ERP, CRM, transactions (OLTP).
    • Access: SQL queries (SELECT, INSERT, etc.).
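A minimal sketch of the relational model, using Python's built-in `sqlite3` module. SQLite is used here only because it needs no server; it stands in for MySQL, PostgreSQL, or SQL Server, whose SQL differs in details. The `customers` table and its rows are invented for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A defined schema: each column has a name and a type.
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        country TEXT NOT NULL
    )
    """
)
# An index speeds up lookups on the country column.
conn.execute("CREATE INDEX idx_customers_country ON customers (country)")

# INSERT with placeholders, so values are never pasted into the SQL string.
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Marie", "France"), ("Istvan", "Hungary"), ("Anna", "Hungary")],
)
conn.commit()

# SELECT with a filter and a sort order.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name",
    ("Hungary",),
).fetchall()
print(rows)

conn.close()
```

The same `CREATE TABLE`, `INSERT`, and `SELECT` statements would look very similar on a server RDBMS; mainly the connection code and some column types would change.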

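    Other modern storage and processing technologies:

    Apache Spark

    • Distributed processing engine rather than a storage system: it reads from and writes to external storage.
    • Can run locally, on a cluster, or in the cloud (EMR, Databricks).
    • Compatible with S3, Delta Lake, Hive…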

    Amazon S3 + Athena / Glue

    • Scalable cloud storage with query capabilities via Athena or ETL via Glue.

    Snowflake / BigQuery / Redshift Spectrum

    • Auto-scalable cloud data warehouses.
    • Fast SQL queries on files or databases.
    • Ideal for BI, ad hoc analysis, data lakes.

    Delta Lake / Lakehouse

    • Combines Data Lake + Data Warehouse.
    • ACID transactions on raw files (e.g., on S3).

    Azure Data Lake

    • Microsoft’s cloud-based data lake platform.

    Google Cloud Storage

    • Scalable storage service by Google Cloud, often used with BigQuery and Dataflow.
    Query Data

    There are many tools to query data. The choice depends on the database size, platform architecture (single server or distributed), and the platform itself.

    Tool / Platform | Query Language Used | Key Notes
    Apache Hive | HiveQL (SQL-like) | Simplified SQL for Hadoop
    Apache Spark SQL | Spark SQL (standard SQL) | Can also be used via Python, Scala, etc.
    Presto / Trino | Standard SQL | Very fast, often used with S3 or Hive
    Amazon Athena | Standard SQL (Presto under the hood) | Serverless, direct queries on S3
    Google BigQuery | Standard SQL (with Google extensions) | Very fast queries, supports JSON/struct
    Snowflake | Standard SQL | Rich features (CTEs, pivot, UDF, etc.)
    Redshift / Spectrum | SQL (PostgreSQL-like) | Compatible with S3 (Spectrum)
    Databricks / Delta Lake | Spark SQL | Can be queried from SQL, Python, Scala
    ClickHouse | ClickHouse-specific SQL | Very fast for aggregations
    Elasticsearch | Query DSL (JSON-based domain-specific language) | Not SQL, uses JSON query syntax
    MongoDB | MongoDB Query Language (MQL, JSON-based) | Not SQL, uses aggregation pipelines
    Apache Cassandra | CQL (Cassandra Query Language) | SQL-inspired but: no joins, no complex transactions, requires query-driven schema design
    Neo4j (graphs) | Cypher | Specific language for graph data
    Power BI / Tableau | SQL (via connectors) + internal language (DAX, etc.) | Depends on data source
    Google Sheets / Excel | Formulas + sometimes SQL via external connector | For simpler use cases
    AWS Glue | SQL (via Glue Data Catalog + Spark) or Python | Used in ETL jobs

    Conclusion

    Summary Table: Languages + Tools + Storage Technologies

    Tool / Platform | Query Language | Storage Technology Used
    Apache Hive | HiveQL (SQL-like) | HDFS (Hadoop)
    Apache Spark SQL | Spark SQL | HDFS, S3, Delta Lake, JDBC, etc.
    Presto / Trino | Standard SQL | S3, HDFS, Hive, Kafka, JDBC, etc.
    Amazon Athena | Standard SQL (Presto) | Amazon S3
    AWS Glue | SQL / Python (PySpark) | Amazon S3, JDBC, Redshift, RDS
    Google BigQuery | Standard SQL | Native cloud data warehouse (Google infra)
    Snowflake | Standard SQL | Cloud data warehouse (Snowflake-native)
    Redshift / Spectrum | SQL (PostgreSQL-like) | S3 (Spectrum) + internal Redshift storage
    Databricks / Delta Lake | Spark SQL | S3, ADLS, Delta Lake
    ClickHouse | Custom SQL | Local disk (columnar format)
    Elasticsearch | JSON DSL | Lucene index (NoSQL, full-text search)
    MongoDB | MQL (Mongo Query Language) | NoSQL document store (BSON/JSON)
    Cassandra | CQL (Cassandra Query Language) | Wide-column NoSQL (Cassandra SSTables)
    Neo4j | Cypher | Graph database
    Power BI / Tableau | SQL via connectors + DAX | Connects to SQL, BigQuery, S3, etc.
    Excel / Google Sheets | Formulas + SQL via plugin | XLSX/CSV files or external connections
    • HDFS: Hadoop Distributed File System
    • S3 / ADLS: Cloud data lakes (Amazon, Azure)
    • JDBC: Connectors to traditional relational databases
    • NoSQL: Cassandra, MongoDB, Elastic, etc.
    • Cloud warehouse: BigQuery, Snowflake, Redshift
    • Graph: For complex relational data (e.g., networks)

    Exercise 1-3

    🏋️

    Search online for the creators, companies, or brands and launch dates of all these tools. Make a chronological table.

    Big Data

    Big Data = Massive Data

    The term Big Data refers to a set of data that is so large, varied, and fast to generate that traditional processing tools are no longer sufficient.

    Date | Key Event
    1997 | First documented use of the term “big data” by NASA researchers (Michael Cox & David Ellsworth) to describe datasets too large for standard computers.
    Early 2000s | Rise of NoSQL databases and grid computing; Google publishes its Google File System (2003) and MapReduce (2004) papers, which later inspire Hadoop
    2005 | Launch of the Hadoop project, inspired by Google’s MapReduce paper
    2008–2010 | Explosion in data volume due to social media, cloud computing, IoT
    2012 | The term “Big Data” becomes a buzzword in media, tech conferences, and corporate strategies

    The 5 "V"s of Big Data

    V | Meaning | Concrete Example
    Volume | Huge amount of data | Logs, videos, IoT, clicks, transactions
    Velocity | Data generated continuously | Real-time streaming (Kafka, sensors)
    Variety | Heterogeneous data types | Text, images, JSON, CSV, audio…
    Veracity | Data quality and reliability | Noisy data, errors, uncertainty
    Value | Ability to extract insights | Demand forecasting, fraud detection

    Big Data Tools Overview

    Category | Examples
    Storage | Hadoop (HDFS), Amazon S3, Delta Lake
    Processing | Apache Spark, Flink, MapReduce
    Querying | Hive, Presto, BigQuery, Athena
    Streaming | Apache Kafka, Kinesis
    Visualization | Power BI, Superset, Tableau
    AI & Machine Learning | TensorFlow, PyTorch, MLlib
    Exercise 1-4

    This exercise consists of summarizing a lesson in Excel format, then illustrating the key ideas with concrete examples found online, to be presented in a structured Word document or PowerPoint presentation.

    Objective

    • Summarize a lesson as a structured table in Excel.
    • Identify main ideas.
    • Find real-world examples online.
    • Present everything in a clear and illustrated Word document.

    Part 1 — Excel Summary

    1. Open Microsoft Excel.
    2. Create a table with the following columns:
      Topic / Chapter | Main Idea | Summary (1–2 sentences) | Keywords
    3. For each course section:
      • Write down the essential idea.
      • Summarize it in one or two sentences.
      • Add some associated keywords.
    4. Apply clear formatting: colors, bold text, borders.

    Part 2 — Structured Word Document

    1. Open Microsoft Word.
    2. Write a document with the following structure:
      • Title Page: title, first name, last name, date
      • Introduction: topic of the course
      • Body:
        • For each idea from the Excel table:
          • Create a section
          • Summarize the idea
          • Find a real-world example online (website, article, image, video, etc.)
            • Use Google or a search engine like Qwant, with the right keywords
            • Prioritize trustworthy sources
            • Cite your sources
          • Add a personal comment (Why is this useful? What do you take away from it?)
      • Conclusion: what you’ve learned from the course
    3. Use proper styles (headings, subheadings, paragraphs).
    Internet Research
    🔍

    How to find reliable sources?

    Here are some practical tips to help you identify and use reliable sources, useful for reports, presentations, or research work.

    Favor recognized websites

    • Government websites: end in .gouv.fr, .gov, etc. Examples: data.gouv.fr, nia.nih.gov
    • Official organizations: INSEE, CNIL, WHO, UN, OECD…
    • Academic institutions: often end in .edu, .ac.uk. Examples: Harvard, MIT, Sorbonne
    • Peer-reviewed journals. Examples: Nature, Science, IEEE Transactions, ACM Journal

    Use specialized databases

    • PubMed – medical sciences

    Be cautious with certain sources

    • Wikipedia: good starting point, but always verify cited sources.
    • Blogs, forums, unsourced YouTube videos: beware of reliability.
    • Commercial or sponsored websites: may be biased by advertising.

    Check a source's reliability

    • Who is the author? Are they identifiable? Are they an expert?
    • What’s the date? Is it recent or outdated?
    • Does the source cite other references?
    • Is the content neutral? Or highly biased/polemical?
    • Is the style professional? Or sensationalist, with mistakes?

    Tip: Targeted Google searches

    Use site: to narrow your search to trustworthy domains.

    Examples:
    • machine learning in sports site:.edu
    • AI applications in healthcare site:.gouv.fr
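For readers who want to generate such searches programmatically, the sketch below builds a Google search URL that includes the `site:` operator, using only Python's standard library. The function name `site_search_url` is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urlencode


def site_search_url(keywords: str, domain: str) -> str:
    """Build a Google search URL restricted to one domain via the site: operator."""
    query = f"{keywords} site:{domain}"
    # urlencode escapes spaces and the colon so the URL stays valid.
    return "https://www.google.com/search?" + urlencode({"q": query})


url = site_search_url("machine learning in sports", ".edu")
print(url)
```

Pasting the printed URL into a browser runs the same search as typing the query by hand.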

    💚

    Parisweb.art digital agency
    Learn more about Julie, our digital project director:
    https://www.linkedin.com/in/juliechaumard/