1 - Data - Definition and uses
Category | Course |
---|---|
Learning order | 2 |
Status | Prepared |
Course by Julie Chaumard
Chapters | Homeworks | Exercises | Weeks | Dates |
---|---|---|---|---|
What is data? | | | 1 | May 7th |
The uses of data | | | 1 | May 7th |
Data disciplines in companies | | Exercise 1-1 | 1 | May 7th |
The data process in companies | | Exercise 1-2 | 1 | May 7th |
Store data | | | 1 | May 7th |
Query data | | Exercise 1-3 | 1 | May 7th |
Exercise 1-4 | 1 | May 7th |
BOOK: Google Data Studio, créer des rapports intuitifs grâce à la datavisualisation, by Youssef Jlidi, Éditions ENI, Objectif web collection, 2022
What is data?
- According to the Le Robert dictionary, data is a "conventional representation of information that allows for automatic processing."
- "Conventional" means based on a shared agreement, standard, or rule.
- In computing, for a machine to process information, it must be encoded in a format the machine understands.
- The date "May 3, 2025" can be represented as:
- 03/05/2025 (French format)
- 05/03/2025 (US format)
- 2025-05-03 (ISO 8601 international standard)
- Data is thus an informational vector that comes in many forms.
- A data point is called an "informational vector" because it carries information.
- Data is the encoded and automatically processed format through which information can circulate, be stored, or analyzed.
- It acts as a transmission support between:
- a human and a machine,
- two software programs,
- two people via a digital medium.
Information | Data format (vector) |
---|---|
A name | A string: "Marie" |
An image | A file: image.jpg |
A sound | An audio format: podcast.mp3 |
A temperature | A number with unit: 22.5°C |
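To make the idea of a "conventional representation" concrete, here is a minimal Python sketch of the date example above. It only uses the standard `datetime` module; the variable names are illustrative.

```python
from datetime import date, datetime

# One piece of information: May 3, 2025.
d = date(2025, 5, 3)

# Three conventions for encoding it as data.
french = d.strftime("%d/%m/%Y")  # day/month/year
us = d.strftime("%m/%d/%Y")      # month/day/year
iso = d.isoformat()              # ISO 8601: year-month-day

print(french)  # 03/05/2025
print(us)      # 05/03/2025
print(iso)     # 2025-05-03

# The same string means two different dates depending on the convention used to read it.
as_french = datetime.strptime("03/05/2025", "%d/%m/%Y").date()
as_us = datetime.strptime("03/05/2025", "%m/%d/%Y").date()
print(as_french)  # 2025-05-03
print(as_us)      # 2025-03-05
```

The last two lines show why a shared convention matters: without it, the machine cannot know which date was meant.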
- Data is a key element for reflection; it is exchanged, shared, and distributed so that individuals or computers can function, communicate, and collaborate.
- Images, video, audio, numbers, text, or computer languages: data is everywhere, allowing us to quantify and qualify each process in modern systems.
- In computing, for instance, data is represented using binary information (0s and 1s). This coded language is understood by machines and enables them to perform tasks like logging into a Windows session or deleting a folder. The binary code is then translated into a human-readable format using letters. Data thus acts as a matrix: it bridges different worlds and serves as an intermediary vector of information.
- But data is not limited to computing. In science, business, marketing, finance, or politics, data allows us to quantify and qualify information and make sense of indicators that can lead to strategic decisions.
- For instance, in a company: each day, it deals with masses of information, such as employee presence and absences, prospects, clients, reviews, review quality, revenue, margins, cash flow, capital... All of these elements form part of the data dimension.
With this, a business leader can set a strategy, such as calculating profitability by collecting all related data, drawing conclusions, and taking action.
- Over the past two decades, with the democratization of the Internet, data has become increasingly vital. The term "Big Data" refers to this explosion of digital data. From this, new roles and disciplines have emerged: Data Engineering, Data Mining, Data Centers, Data Analysis, Web Analytics, Data Processing...
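As a small illustration of the binary representation mentioned earlier in this section, the sketch below encodes the string "Marie" into 0s and 1s and decodes it back. UTF-8 is an assumption made for the example; other encodings would give other bit patterns.

```python
name = "Marie"

# Encode the text into bytes (UTF-8 is assumed here), then show each byte as 8 bits.
raw = name.encode("utf-8")
bits = " ".join(format(byte, "08b") for byte in raw)
print(bits)  # 01001101 01100001 01110010 01101001 01100101

# Reverse the process: bits -> bytes -> human-readable letters.
decoded = bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")
print(decoded)  # Marie
```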
The Uses of Data
- Editable, reusable, and accessible, data has become the best ally of internet users. It's available all over the web and anyone can use it. It appears in different forms. Let's explore how it can be used in various contexts.
- On a social level, data helps citizens make better decisions, find answers to their questions, and find solutions to their problems. It also has the power to improve their quality of life. With all types of data available online, it's possible to find a job, meet people, fill out tax forms, ask questions in forums, or book a cinema ticket.
- In the transportation sector, data is essential and will continue to be so in the future. Drivers now use data-processing apps like Waze and Google Maps, which offer optimized routes while alerting users to incidents or speed checks. These systems harness the power of the crowd, as each user contributes to data validation and updates via alerts and corrections.
- In the professional world, data is critical. It's found in every area of business. In HR, for instance, HRIS (Human Resource Information Systems) qualify employees through data (attendance, leave, training...). In customer management, CRM tools (Customer Relationship Management) are used to analyze customer data and develop micro or macro-level commercial strategies.
- From an economic standpoint, data helps measure the economic impact of a population or a country through indicators like HDI, purchasing power, GDP and GDP per capita, etc. Increasingly, data also supports economic forecasting using mathematical models to improve decision-making.
- In healthcare, data allows medical professionals to profile patients, review their medical history, and coordinate care. In recent years, data has helped manage epidemics: identifying clusters, isolating carriers, measuring trends, and implementing interventions.
- There are many more examples of how data is used. For instance, in industries, we now have "Industry 3.0" and "Industry 4.0" smart factories, or adaptive learning platforms like LinkedIn Learning that tailor learning paths using data (and AI). Data usage is everywhere and is spawning new disciplines.
- Industry 3.0 and 4.0: Historical context of industrial evolution
Revolution | Period | Main characteristic |
---|---|---|
1.0 | Late 18th century | Mechanization, steam engines |
2.0 | Late 19th – early 20th century | Electricity, mass production |
3.0 | 1970s–2000s | Automation via electronics and IT |
4.0 | Since ~2010 | Connected factories, AI, big data, IoT |
- LinkedIn Learning
LinkedIn Learning is an online training platform launched by LinkedIn (owned by Microsoft since 2016). It offers thousands of video courses in various areas such as:
- personal and professional development,
- office tools and software (Excel, Photoshop, etc.),
- programming, design, marketing, project management, etc.
Courses are created by experts and accessible via subscription, with certificates that can be added to your LinkedIn profile.
- Adaptive Learning
Adaptive learning is a personalized teaching method that uses technology to adjust to each learner's pace, knowledge, and needs.
- Content adjusts automatically based on user responses and behavior.
- The learning path is non-linear and evolves.
- The goal is to optimize learning efficiency by delivering the right content at the right time.
- LinkedIn Learning and Adaptive Learning
LinkedIn Learning integrates adaptive learning by:
- recommending courses based on LinkedIn profile skills,
- tracking progress and adjusting proposed content,
- offering personalized paths based on jobs or career goals.
This creates a tailored learning experience linked directly to individual needs and career growth.
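The principle of content that adjusts to user responses can be sketched with a toy rule. This is a hypothetical illustration only, not how LinkedIn Learning or any real platform works: the function `next_level`, the 1 to 5 difficulty scale, and the sample answers are all assumptions.

```python
def next_level(level: int, correct: bool, lowest: int = 1, highest: int = 5) -> int:
    """Hypothetical rule: go one level up after a correct answer, one down otherwise."""
    step = 1 if correct else -1
    # Keep the level inside the allowed range.
    return max(lowest, min(highest, level + step))


# Invented sequence of learner answers (True = correct).
answers = [True, True, False, True]

level = 2
path = [level]
for correct in answers:
    level = next_level(level, correct)
    path.append(level)

print(path)  # [2, 3, 4, 3, 4]
```

The resulting path is non-linear: it rises and falls with the learner's answers instead of following a fixed sequence.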
Data Disciplines in Companies
- Data is one of the key elements for ensuring a company's operation.
- Data professionals carry out significant work before data can be fully utilized. Data-related jobs require deep skills and knowledge in computing, mathematics (probability, statistics...), marketing, and finance.
- Let's start with the Data Scientist. This is a data analysis expert with both technical and scientific skills to solve complex data-related problems. A Data Scientist is a blend of mathematician, statistician, and programmer. Their role is to build scientific models (algorithms) to make data usable and valuable.
They analyze, process, model, and interpret data to create action plans. Positioned between researcher and consultant, they are central to many companies and organizations.
- Next is the Data Engineer, who programs the systems required for the data processes requested by teams. They develop algorithms to help colleagues collect and use data. This role demands technical skills in managing and designing SQL databases and using programming languages (Python, R...).
- Then there's the Data Analyst, who interprets data, analyzes results, and presents them, thanks to expertise in statistics and some programming. They design data analysis processes (dashboards, BI and DataViz tools), interpret insights, and make data understandable. Acting as a bridge between the Data team and others, their strength lies in simplifying complex information.
- The Data Miner is a data extraction (web scraping) tool designed to help users automatically collect information from web pages and convert it into usable tables (CSV, Excel, Google Sheets...).
- Other key roles include the Machine Learning Engineer and the Data Steward, a data quality manager who ensures data is consistent, well-documented, reliable, accessible, and compliant with internal policies and regulations. Data Stewards play a crucial role in data governance projects.
- Depending on the company's digital maturity, other roles may exist: Chief Data Officer, Big Data Architect, BIM (Business Intelligence Manager), Master Data Manager, or DPO (Data Protection Officer, responsible in particular for GDPR compliance).
- These roles are still evolving. In many organizations, a single Data Analyst handles multiple responsibilities, coordinating with technical and financial teams to achieve data goals. In other companies, digital marketing teams take charge of data-related aspects. A new profession has emerged: the Web Analyst, who studies web activity and performance (Google Analytics, social media, SEO, etc.).
- Data jobs will continue to evolve. As robotics and AI progress, new roles will emerge, and others will disappear. Some companies don't have data teams but already know how to leverage data effectively. Let's now look at who the main data stakeholders are in companies.
- The main data actors are obviously domain experts. They collect, analyze, and interpret data to meet company objectives.
- Among these, one of the first key players is the Chief Information Officer (CIO). They provide the technical environment needed to launch data initiatives. The CIO chooses tools and systems while ensuring information security. Technical teams then implement all related systems.
- Executives and decision-makers also play a role. They use data to steer the organization. The information collected helps guide decision-making and set new objectives to improve performance. Data supports both strategic and operational management.
- Marketing teams rely on data to better understand customer expectations. Improving their strategies heavily depends on it. They also use data to assess company strengths and weaknesses and refine their approach.
- Sales divisions use CRM systems to qualify prospects and customers and analyze results by product, category, region, or individual salesperson.
- Lastly, the finance department leverages data with accounting tools to measure profitability, forecast results and budgets, and evaluate financial risks.
- Data is everyone's concern. Knowing how to use and understand data is as essential as knowing how to read, count, or walk. In small organizations, a data lead may be appointed to champion data use across all business processes.
Exercise 1-1
Create a diagram of these data-related jobs.
The Data Process in Companies
To help make data more understandable, experts have defined a process.
- The first stage of the data process is discovery, in other words, understanding the context. This often starts with an internal problem. For example: "We don't know the type of customer who books trips to Hungary on our website." Teams then work to discover and define the problem.
- The second stage is reflection. Teams assess the problem, analyze its internal and external impact, and consider the value of solving it. They also think about the tools and resources (human, financial, technological) needed to collect and analyze the data.
- Once the problem is clearly defined and validated, we move on to the third stage: structuring. This means organizing data within a defined framework. Continuing our example, this could involve selecting a time period, online or physical context, price range, etc. (everything needed to guide data collection and analysis).
- Despite structuring, data remains raw. It's essential to clean the data to avoid distorted results. In our example, some clients may have canceled after purchasing, others may have bought for someone else, and some may have deleted their accounts. A cleaning phase is critical.
- After cleaning comes enrichment: adding filters and layers of detail, either manually or automatically. For instance, filters like nationality, age, gender, and average basket value could help better address the original question.
- Once data is structured, cleaned, and enriched, it must be validated. This means checking that it aligns with the original goal and is logically consistent.
- The final stage is publication. This is the most important step, where data is presented to the stakeholders who initiated the process. Publication doesn't mean sending raw data; it means preparing a clear presentation that highlights insights, findings, and conclusions. It's a step of teaching, storytelling, and connecting the data to company strategies.
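The structuring, cleaning, enrichment, validation, and publication stages above can be sketched in a few lines of Python on the Hungary example. Everything here is hypothetical: the `bookings` records, their field names, the 2024 time window, and the age brackets are invented for illustration, and a real project would typically rely on a database or a data analysis library.

```python
from collections import Counter
from datetime import date

# Hypothetical raw bookings (all values invented for the example).
bookings = [
    {"customer": "C1", "destination": "Hungary", "booked_on": date(2024, 6, 1), "status": "confirmed", "age": 34, "amount": 800.0},
    {"customer": "C2", "destination": "Hungary", "booked_on": date(2024, 7, 15), "status": "canceled", "age": 52, "amount": 650.0},
    {"customer": "C3", "destination": "Spain", "booked_on": date(2024, 7, 20), "status": "confirmed", "age": 41, "amount": 900.0},
    {"customer": "C4", "destination": "Hungary", "booked_on": date(2023, 3, 9), "status": "confirmed", "age": 27, "amount": 500.0},
    {"customer": "C5", "destination": "Hungary", "booked_on": date(2024, 9, 2), "status": "confirmed", "age": 61, "amount": 1200.0},
]

# Structuring: restrict the data to the framework of the question (destination, time period).
structured = [b for b in bookings if b["destination"] == "Hungary" and b["booked_on"].year == 2024]

# Cleaning: remove records that would distort the result (canceled purchases).
cleaned = [b for b in structured if b["status"] != "canceled"]

# Enrichment: add a layer of detail (an age bracket) to each record.
def age_bracket(age: int) -> str:
    return "under 40" if age < 40 else "40 and over"

enriched = [{**b, "age_bracket": age_bracket(b["age"])} for b in cleaned]

# Validation: check that the data is logically consistent before using it.
assert all(b["amount"] > 0 for b in enriched)

# Publication: present findings, not raw rows.
customers_per_bracket = Counter(b["age_bracket"] for b in enriched)
average_basket = sum(b["amount"] for b in enriched) / len(enriched)
print(dict(customers_per_bracket))  # {'under 40': 1, '40 and over': 1}
print(average_basket)               # 1000.0
```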
Exercise 1-2
Create a diagram of this process.
Store Data
Where is data stored?
Hadoop (historical big data foundation)
- Type: Distributed file system (HDFS: Hadoop Distributed File System).
- Storage: Large files split across multiple machines (clusters).
- Data formats: Parquet, Avro, ORC, JSON, CSV, images, videos...
- Typical usage: Big data, large-scale analytics, unstructured or semi-structured data.
- Access: Via Hive, Spark, Impala, Presto, or directly through MapReduce frameworks.
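The idea of "large files split across multiple machines" can be pictured with a toy sketch. This is not the HDFS API or its actual placement policy: the block size, the node names, and the round-robin assignment are assumptions chosen for readability, and real HDFS uses far larger blocks and also replicates each block on several machines.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's content into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_round_robin(blocks: list, nodes: list) -> dict:
    """Toy placement: hand out block indexes to nodes in turn."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"x" * 1000                                  # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)    # 256 bytes is an arbitrary toy size
placement = assign_round_robin(blocks, ["node-a", "node-b", "node-c"])

print(len(blocks))   # 4
print(placement)     # {'node-a': [0, 3], 'node-b': [1], 'node-c': [2]}
```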
SQL RDBMS (e.g., MySQL, PostgreSQL, SQL Server...): the most common
- Type: Relational database.
- Storage: Tables in a local or network file system, with indexes and a defined schema.
- Data formats: Structured (tables with predefined columns and types).
- Typical usage: Web apps, ERP, CRM, transactions (OLTP).
- Access: SQL queries (SELECT, INSERT, etc.).
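A minimal sketch of a relational table with a defined schema, queried in SQL. SQLite is used here only because it ships with Python; it is a stand-in for the servers named above, and the `customers` table and its rows are invented for the example.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Defined schema: predefined columns and types.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "country TEXT NOT NULL)"
)

# INSERT: add rows (parameters avoid building SQL strings by hand).
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Marie", "France"), ("John", "USA"), ("Zoltan", "Hungary")],
)
conn.commit()

# SELECT: query the table.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("France",)
).fetchall()
print(rows)  # [('Marie',)]

conn.close()
```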
Other modern storage technologies:
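Apache Spark (a processing engine rather than a storage system; it reads from and writes to the storage layers listed here)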
- Can run locally, on a cluster, or in the cloud (EMR, Databricks).
- Compatible with S3, Delta Lake, Hive...
Amazon S3 + Athena / Glue
- Scalable cloud storage with query capabilities via Athena or ETL via Glue.
Snowflake / BigQuery / Redshift Spectrum
- Auto-scalable cloud data warehouses.
- Fast SQL queries on files or databases.
- Ideal for BI, ad hoc analysis, data lakes.
Delta Lake / Lakehouse
- Combines Data Lake + Data Warehouse.
- ACID transactions on raw files (e.g., on S3).
Azure Data Lake
- Microsoft's cloud-based data lake platform.
Google Cloud Storage
- Scalable storage service by Google Cloud, often used with BigQuery and Dataflow.
Query Data
There are many tools to query data. The choice depends on the database size, platform architecture (single server or distributed), and the platform itself.
Tool / Platform | Query Language Used | Key Notes |
---|---|---|
Apache Hive | HiveQL (SQL-like) | Simplified SQL for Hadoop |
Apache Spark SQL | Spark SQL (standard SQL) | Can also be used via Python, Scala, etc. |
Presto / Trino | Standard SQL | Very fast, often used with S3 or Hive |
Amazon Athena | Standard SQL (Presto under the hood) | Serverless, direct queries on S3 |
Google BigQuery | Standard SQL (with Google extensions) | Very fast queries, supports JSON/struct |
Snowflake | Standard SQL | Rich features (CTEs, pivot, UDF, etc.) |
Redshift / Spectrum | SQL (PostgreSQL-like) | Compatible with S3 (Spectrum) |
Databricks / Delta Lake | Spark SQL | Can be queried from SQL, Python, Scala |
ClickHouse | ClickHouse-specific SQL | Very fast for aggregations |
ElasticSearch | DSL (Domain-Specific JSON Language) | Not SQL, uses JSON query syntax |
MongoDB | MongoDB Query Language (MQL, JSON-based) | Not SQL, uses aggregation pipelines |
Apache Cassandra | CQL (Cassandra Query Language) | SQL-inspired but: no joins, no complex transactions, requires query-driven schema design |
Neo4j (graphs) | Cypher | Specific language for graph data |
Power BI / Tableau | SQL (via connectors) + internal language (DAX, etc.) | Depends on data source |
Google Sheets / Excel | Formulas + sometimes SQL via external connector | For simpler use cases |
AWS Glue | SQL (via Glue Data Catalog + Spark) or Python | Used in ETL jobs |
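To illustrate the difference between an SQL query and a JSON-style query from the table above, the sketch below asks the same question twice. The SQL version runs on SQLite (a stand-in, chosen because it ships with Python). The MongoDB-style filter is only shown as a Python dictionary and evaluated by a small hypothetical helper, `matches`, which handles simple equality and is not part of any MongoDB driver. The `customers` data is invented.

```python
import sqlite3

customers = [
    {"name": "Marie", "country": "France"},
    {"name": "Zoltan", "country": "Hungary"},
    {"name": "Anna", "country": "Hungary"},
]

# 1) SQL: the question is written as a sentence-like statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany("INSERT INTO customers (name, country) VALUES (:name, :country)", customers)
sql_names = [
    row[0]
    for row in conn.execute("SELECT name FROM customers WHERE country = 'Hungary' ORDER BY name")
]
conn.close()

# 2) JSON-style filter document: the same question expressed as data.
mongo_style_filter = {"country": "Hungary"}


def matches(document: dict, query: dict) -> bool:
    """Hypothetical helper: True when every field in the query equals the document's value."""
    return all(document.get(field) == value for field, value in query.items())


filter_names = sorted(doc["name"] for doc in customers if matches(doc, mongo_style_filter))

print(sql_names)     # ['Anna', 'Zoltan']
print(filter_names)  # ['Anna', 'Zoltan']
```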
Conclusion
Summary Table: Languages + Tools + Storage Technologies
Tool / Platform | Query Language | Storage Technology Used |
---|---|---|
Apache Hive | HiveQL (SQL-like) | HDFS (Hadoop) |
Apache Spark SQL | Spark SQL | HDFS, S3, Delta Lake, JDBC, etc. |
Presto / Trino | Standard SQL | S3, HDFS, Hive, Kafka, JDBC, etc. |
Amazon Athena | Standard SQL (Presto) | Amazon S3 |
AWS Glue | SQL / Python (PySpark) | Amazon S3, JDBC, Redshift, RDS |
Google BigQuery | Standard SQL | Native cloud data warehouse (Google infra) |
Snowflake | Standard SQL | Cloud data warehouse (Snowflake-native) |
Redshift / Spectrum | SQL (PostgreSQL-like) | S3 (Spectrum) + internal Redshift storage |
Databricks / Delta Lake | Spark SQL | S3, ADLS, Delta Lake |
ClickHouse | Custom SQL | Local disk (columnar format) |
ElasticSearch | JSON DSL | Lucene index (NoSQL, full-text search) |
MongoDB | MQL (Mongo Query Language) | NoSQL document store (BSON/JSON) |
Cassandra | CQL (Cassandra Query Language) | Wide-column NoSQL (Cassandra SSTables) |
Neo4j | Cypher | Graph database |
Power BI / Tableau | SQL via connectors + DAX | Connects to SQL, BigQuery, S3, etc. |
Excel / Google Sheets | Formulas + SQL via plugin | XLSX/CSV files or external connections |
- HDFS: Hadoop Distributed File System
- S3 / ADLS: Cloud data lakes (Amazon, Azure)
- JDBC: Connectors to traditional relational databases
- NoSQL: Cassandra, MongoDB, Elastic, etc.
- Cloud warehouse: BigQuery, Snowflake, Redshift
- Graph: For complex relational data (e.g., networks)
Exercise 1-3
Search online for the creators, companies, or brands and launch dates of all these tools. Make a chronological table.
Big Data
Big Data = Massive Data
The term Big Data refers to a set of data that is so large, varied, and fast to generate that traditional processing tools are no longer sufficient.
Date | Key Event |
---|---|
1997 | First documented use of the term "big data" by NASA researchers (Michael Cox & David Ellsworth) to describe datasets too large for standard computers. |
Early 2000s | Rise of NoSQL databases and grid computing; Google publishes its GFS (2003) and MapReduce (2004) papers, which later inspire Hadoop |
2005 | Launch of the Hadoop project, inspired by Google's MapReduce paper |
2008–2010 | Explosion in data volume due to social media, cloud computing, IoT |
2012 | The term "Big Data" becomes a buzzword in media, tech conferences, and corporate strategies |
The 5 "V"s of Big Data
V | Meaning | Concrete Example |
---|---|---|
Volume | Huge amount of data | Logs, videos, IoT, clicks, transactions |
Velocity | Data generated continuously | Real-time streaming (Kafka, sensors) |
Variety | Heterogeneous data types | Text, images, JSON, CSV, audio... |
Veracity | Data quality and reliability | Noisy data, errors, uncertainty |
Value | Ability to extract insights | Demand forecasting, fraud detection |
Big Data Tools Overview
Category | Examples |
---|---|
Storage | Hadoop (HDFS), Amazon S3, Delta Lake |
Processing | Apache Spark, Flink, MapReduce |
Querying | Hive, Presto, BigQuery, Athena |
Streaming | Apache Kafka, Kinesis |
Visualization | Power BI, Superset, Tableau |
AI & Machine Learning | TensorFlow, PyTorch, MLlib |
Exercise 1-4
This exercise consists of summarizing a lesson in Excel format, then illustrating the key ideas with concrete examples found online, to be presented in a structured Word document or PowerPoint presentation.
Objective
- Summarize a lesson as a structured table in Excel.
- Identify main ideas.
- Find real-world examples online.
- Present everything in a clear and illustrated Word document.
Part 1 - Excel Summary
- Open Microsoft Excel.
- Create a table with the following columns:
Topic / Chapter | Main Idea | Summary (1–2 sentences) | Keywords |
- For each course section:
- Write down the essential idea.
- Summarize it in one or two sentences.
- Add some associated keywords.
- Apply clear formatting: colors, bold text, borders
Part 2 - Structured Word Document
- Open Microsoft Word.
- Write a document with the following structure:
- Title Page: title, first name, last name, date
- Introduction: topic of the course
- Body:
- For each idea from the Excel table:
- Create a section
- Summarize the idea
- Find a real-world example online (website, article, image, video, etc.)
- Use Google or a search engine like Qwant, with the right keywords
- Prioritize trustworthy sources
- Cite your sources
- Add a personal comment (Why is this useful? What do you take away from it?)
- Conclusion: what you've learned from the course
- Use proper styles (headings, subheadings, paragraphs)
Internet Research
How to find reliable sources?
Here are some practical tips to help you identify and use reliable sources, useful for reports, presentations, or research work.
Favor recognized websites
- Government websites: end in .gouv.fr, .gov, etc. Examples: data.gouv.fr, nia.nih.gov
- Official organizations: INSEE, CNIL, WHO, UN, OECD...
- Academic institutions: often end in .edu, .ac.uk. Examples: Harvard, MIT, Sorbonne
- Peer-reviewed journals. Examples: Nature, Science, IEEE Transactions, ACM Journal
Use specialized databases
- Google Scholar: academic articles
- PubMed: medical sciences
- IEEE Xplore: engineering, AI
- ACM Digital Library: computer science, AI
Be cautious with certain sources
- Wikipedia: good starting point, but always verify cited sources.
- Blogs, forums, unsourced YouTube videos: beware of reliability.
- Commercial or sponsored websites: may be biased by advertising.
Check a source's reliability
- Who is the author? Are they identifiable? Are they an expert?
- What's the date? Is it recent or outdated?
- Does the source cite other references?
- Is the content neutral? Or highly biased/polemical?
- Is the style professional? Or sensationalist, with mistakes?
Tip: Targeted Google searches
Use site: to narrow your search to trustworthy domains.
Examples: machine learning in sports site:.edu, AI applications in healthcare site:.gouv.fr
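For readers who want to prepare such searches programmatically, here is a small sketch that builds a search URL with the site: operator. The helper name `google_search_url` is hypothetical; the code only assembles a URL string and does not send any request.

```python
from urllib.parse import urlencode


def google_search_url(keywords: str, site: str) -> str:
    """Build a Google search URL restricted to one domain with the site: operator."""
    query = f"{keywords} site:{site}"
    # urlencode escapes spaces and special characters so the query is safe in a URL.
    return "https://www.google.com/search?" + urlencode({"q": query})


url = google_search_url("machine learning in sports", ".edu")
print(url)  # https://www.google.com/search?q=machine+learning+in+sports+site%3A.edu
```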