Data Literacy (200 Terms)
200 data terms I learned over the years, explained simply
Hey friends - Happy Thursday! 👋
The world of data is full of buzzwords.
If you’re learning data or preparing for interviews, it’s very easy to get confused or mix up terms.
I didn’t learn data terminology in one go.
Over 17 years working with data, I picked it up step by step through real projects.
That’s why I created this cheat sheet.
It explains 200 core data terms in a simple, practical way, and yes, I even made a YouTube video about it 😄
A simple way to build data literacy: understanding data, speaking its language, and using it with confidence.
So before we jump into all these terms, let’s start with one simple question.
What is data literacy?
Data literacy is the ability to understand data, talk about it confidently, and make sense of it in real situations.
It’s knowing what data actually means, where it comes from, and how it’s used in real work.
Data Basics
Data Raw facts collected from systems, applications, and users before any processing.
Raw Data Data taken directly from source systems without any transformation.
Business Data Data prepared and structured to support business decisions.
Structured Data Data organized in a fixed format of rows and columns.
Semi-Structured Data Data with partial structure using formats like JSON or XML.
Unstructured Data Data that does not follow a predefined structure.
Big Data Very large datasets that require specialized tools to process.
Data Volume The amount of data being generated or stored.
Data Velocity The speed at which data is generated and processed.
Data Variety The different types and formats of data being handled.
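To make the structured vs. semi-structured distinction concrete, here is a minimal sketch in Python. The JSON string and the user names are made-up examples, not from any real system:

```python
import json

# Semi-structured data: JSON carries its own (partial) structure,
# but fields could vary from record to record.
raw = '[{"user": "amy", "clicks": 12}, {"user": "ben", "clicks": 7}]'
records = json.loads(raw)

# Structured data: the same facts flattened into fixed rows and columns,
# like a database table or spreadsheet.
columns = ["user", "clicks"]
rows = [[rec[col] for col in columns] for rec in records]
print(rows)  # [['amy', 12], ['ben', 7]]
```

Unstructured data (free text, images, audio) has no such schema at all, which is why it needs different tools entirely.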
Data Roles
Data Analyst Answers business questions by analyzing data and creating reports and dashboards.
Business Analyst Translates business needs into data and reporting requirements.
BI Developer Builds automated dashboards and reporting systems for many users.
Data Engineer Designs and builds data pipelines that move and prepare data at scale.
Analytics Engineer Transforms raw data into analytics-ready models for BI tools.
Data Architect Designs the overall structure and flow of data systems.
Data Scientist Uses statistics and machine learning to predict future outcomes.
ML Engineer Turns models into scalable, production-ready systems.
AI Engineer Builds AI-powered systems using models, APIs, and tools.
MLOps Engineer Automates deployment, monitoring, and lifecycle of machine learning models.
Data Product Manager Owns data products and aligns them with business goals.
Data Steward Ensures data quality, definitions, and standards are followed.
Data Owner Accountable for the correctness and usage of specific datasets.
Data Consumer Uses data products for analysis, reporting, or decision making.
Data Producer Generates data through systems, applications, or processes.
Data Quality Analyst Monitors and improves data accuracy and reliability.
Data Governance Lead Defines policies for data usage, privacy, and compliance.
Data Platform Engineer Builds and maintains shared data infrastructure and tools.
Chief Data Officer (CDO) Owns the data strategy across the organization.
Data Processing
Data Manipulation Modifying data through filtering, sorting, or aggregating.
Data Transformation Converting data into a suitable format for analysis or storage.
Data Cleanup Fixing data issues such as errors, duplicates, and missing values.
Data Enrichment Adding additional information to existing data.
Data Aggregation The process of summarizing data using functions like sum or average.
Data Modeling Designing how data is structured and related within a system.
Data Validation Ensuring data meets defined quality rules.
Data Quality Measuring accuracy, completeness, and consistency of data.
Data Standardization Converting data into a consistent format or convention.
Data Normalization Scaling numeric values into a common range, or restructuring tables to reduce redundancy.
Data Deduplication Identifying and removing duplicate records.
Data Filtering Selecting only relevant data based on conditions.
Data Sorting Ordering data based on one or more fields.
Data Parsing Extracting structured values from raw or semi-structured data.
Data Encoding Converting categorical values into numeric representations.
Data Masking Obscuring sensitive data to protect privacy.
Data Sampling Selecting a subset of data for faster processing or analysis.
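Several of these processing steps chain naturally together. Here is a small stdlib-only sketch of deduplication, cleanup, and aggregation; the order records and amounts are invented for illustration:

```python
from collections import defaultdict

# Hypothetical raw sales records, including a duplicate and a missing value.
raw = [
    {"order_id": 1, "region": "EU", "amount": 100.0},
    {"order_id": 1, "region": "EU", "amount": 100.0},  # duplicate
    {"order_id": 2, "region": "US", "amount": None},   # missing value
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Deduplication: keep only the first record per order_id.
seen, deduped = set(), []
for rec in raw:
    if rec["order_id"] not in seen:
        seen.add(rec["order_id"])
        deduped.append(rec)

# Cleanup: drop records with missing amounts (one simple strategy of many).
clean = [r for r in deduped if r["amount"] is not None]

# Aggregation: total amount per region.
totals = defaultdict(float)
for r in clean:
    totals[r["region"]] += r["amount"]
print(dict(totals))  # {'EU': 150.0}
```

In real pipelines each of these steps would be a configurable, monitored stage rather than a few loops, but the logic is the same.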
Data Architecture
Data Architecture The overall design of how data is collected, stored, integrated, and used across systems.
Data Platform The shared infrastructure, tools, and services used to manage data at scale.
Data Source The original system or application where data is created and collected, such as databases, APIs, or files.
Operational Data Store (ODS) A system that stores near real-time operational data.
Data Warehouse A central system for storing clean, structured data optimized for analytics.
Data Lake A storage system for raw data in any format, structured or unstructured.
Data Lakehouse A platform that combines data lake flexibility with data warehouse structure.
Data Mesh A decentralized architecture where domains own and share data as products.
Data Mart A subject-focused subset of a data warehouse designed for a specific business area.
Data Vault A modeling approach focused on historical tracking, scalability, and auditability.
Medallion Architecture A layered design using bronze, silver, and gold data layers.
Bronze Layer Stores raw ingested data with minimal transformation.
Silver Layer Contains cleaned and standardized data ready for business logic.
Gold Layer Holds curated, business-ready data for analytics and reporting.
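The bronze/silver/gold flow above can be sketched in a few lines. The event records here are invented; in practice each layer would be a table in your lakehouse, not a Python list:

```python
# Bronze: raw ingested events, stored as-is with minimal transformation.
bronze = [
    {"ts": "2024-01-01", "country": " de ", "value": "10"},
    {"ts": "2024-01-01", "country": "DE",   "value": "5"},
    {"ts": "2024-01-02", "country": "fr",   "value": "bad"},  # unparseable
]

# Silver: cleaned and standardized (trimmed, upper-cased, typed).
silver = []
for row in bronze:
    try:
        silver.append({
            "ts": row["ts"],
            "country": row["country"].strip().upper(),
            "value": int(row["value"]),
        })
    except ValueError:
        pass  # a real pipeline would quarantine bad records, not drop them silently

# Gold: curated, business-ready aggregate for analytics and reporting.
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0) + row["value"]
print(gold)  # {'DE': 15}
```

The point of the layering is that each layer has one job: bronze preserves the raw truth, silver fixes quality, gold serves the business.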
Data Engineering
Data Engineering Building systems that collect, move, transform, and store data reliably at scale.
Data Pipeline An automated process that moves data from source to destination.
ETL Extract, Transform, Load data into analytics systems.
ELT Extract, Load, then Transform data inside the target system.
Data Extraction Copying data from source systems without changing it.
Data Transformation Converting data into the correct structure, format, and logic for analysis.
Data Load Writing extracted or prepared data into a target system.
Data Ingestion The process of collecting data from source systems.
Batch Processing Processing data in large chunks at scheduled intervals.
Streaming Processing Processing data continuously in real time.
Full Load Loading the entire dataset from a source system.
Incremental (Delta) Load Loading only new or changed data since the last run.
Schema Evolution Handling changes in data structure over time.
Partitioning Splitting data into smaller parts to improve performance.
Indexing Creating structures that speed up data access.
Orchestration Managing dependencies and execution of data pipelines.
Scheduling Running data jobs at defined times.
Monitoring Tracking pipeline health, failures, and performance.
Data Migration Moving data between systems while preserving accuracy.
Slowly Changing Dimension (SCD) A technique for managing changes in dimension data over time.
SCD Type 1 Updates data by overwriting old values without keeping history.
SCD Type 2 Preserves full history by creating a new record for each change.
SCD Type 3 Stores limited history by adding extra columns for previous values.
Schema Registry A centralized store for managing schema versions.
Schema Drift Unexpected changes in incoming data structure.
Change Data Capture (CDC) Capturing database changes in real time.
Data Freshness How up to date data is compared to when it was generated.
Data Snapshot A copy of data taken at a specific point in time.
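SCD Type 2 is one of the most-asked-about techniques in this section, so here is a minimal sketch of the close-old-row, insert-new-row pattern. The customer data and the `apply_scd2` helper are hypothetical; real warehouses implement this with MERGE statements or tooling like dbt snapshots:

```python
from datetime import date

# Current dimension table: one row per customer version (SCD Type 2).
dim = [
    {"customer": "amy", "city": "Berlin", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer, new_city, change_date):
    """Close the current row and insert a new version (SCD Type 2)."""
    for row in dim:
        if row["customer"] == customer and row["is_current"]:
            if row["city"] == new_city:
                return dim  # no change, nothing to do
            row["valid_to"] = change_date   # close out the old version
            row["is_current"] = False
    dim.append({"customer": customer, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})        # insert the new version
    return dim

apply_scd2(dim, "amy", "Munich", date(2024, 6, 1))
print(len(dim), dim[-1]["city"])  # 2 Munich
```

Compare this with Type 1, which would simply overwrite "Berlin" with "Munich" and lose the history.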
Data Analytics
Data Analytics The practice of analyzing data to answer business questions and support decisions.
Business Intelligence (BI) Turning data into insights for decision making.
Descriptive Analytics Explains what happened using historical data.
Diagnostic Analytics Explains why something happened by analyzing patterns and relationships.
Predictive Analytics Estimates what is likely to happen in the future using data and models.
Prescriptive Analytics Recommends actions based on predicted outcomes.
Dashboard A visual interface showing key metrics and trends.
Report A structured presentation of data and insights.
KPI A key performance indicator used to measure success.
Metric A numeric measurement used to track performance.
Dimension A categorical field used to group and filter data.
Measure A numeric value that can be aggregated for analysis.
Slice and Dice Analyzing data from different perspectives.
Filter Restricting data to a specific subset.
Aggregation Summarizing data using functions like sum or average.
Trend Analysis Identifying patterns and changes over time.
Ad Hoc Analysis A one-time analysis performed to answer a specific question.
Data Visualization Using charts and visuals to communicate insights.
Insight A meaningful finding that supports better decisions.
Exploratory Data Analysis (EDA) Exploring data to understand patterns, distributions, and anomalies.
Drill Down Moving from summary data to more detailed data.
Drill Up Moving from detailed data to higher-level summaries.
Cohort Analysis Analyzing groups of users sharing a common characteristic.
Correlation Measuring how strongly two variables move together.
Outlier A data point that significantly differs from others.
Distribution How values are spread across a dataset.
Percentile A value indicating the relative position within a dataset.
Rolling Average An average calculated over a moving time window.
Self-Service Analytics Allowing users to explore data without technical help.
Storytelling with Data Communicating insights through narrative and visuals.
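As one concrete example from this section, here is a rolling average in plain Python. The daily sales numbers are made up:

```python
from collections import deque

def rolling_average(values, window):
    """Average over a moving window; yields one result per full window."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)           # deque drops the oldest value automatically
        if len(buf) == window:
            out.append(sum(buf) / window)
    return out

daily_sales = [10, 20, 30, 40, 50]  # made-up numbers
print(rolling_average(daily_sales, 3))  # [20.0, 30.0, 40.0]
```

Rolling averages smooth out day-to-day noise, which is why they show up constantly in trend analysis dashboards.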
Data Science
Data Science Using data, statistics, and models to extract insights and make predictions.
Dataset A collection of data used for analysis or modeling.
Exploratory Data Analysis (EDA) Understanding data through statistics and visual exploration.
Machine Learning Algorithms that learn patterns from data automatically.
Supervised Learning Training models using labeled data.
Unsupervised Learning Discovering patterns without labeled outcomes.
Feature An input variable used by a model.
Feature Engineering Creating and improving features to boost model performance.
Model A mathematical representation learned from data.
Algorithm A method used to train a model.
Training Data Data used to teach a model patterns.
Test Data Data used to evaluate model performance.
Prediction An estimated output generated by a model.
Classification Predicting categories or classes.
Regression Predicting continuous numeric values.
Clustering Grouping similar data points together.
Evaluation Metric A measure used to assess model quality.
Accuracy How often a model makes correct predictions.
Precision How many predicted positives are actually correct.
Recall How many actual positives were correctly identified.
Overfitting When a model memorizes data instead of learning patterns.
Underfitting When a model is too simple to capture patterns.
Bias Error caused by overly simple assumptions.
Variance Error caused by sensitivity to training data changes.
Cross Validation Testing model stability using multiple data splits.
Hyperparameter A model setting defined before training.
Hyperparameter Tuning Finding optimal model settings.
Model Interpretability Understanding how a model makes decisions.
Model Drift Performance decline caused by changing data over time.
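Accuracy, precision, and recall are easy to mix up, so here is a tiny worked example with invented labels (1 = positive, 0 = negative):

```python
# Made-up ground truth and model predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
correct = sum(1 for a, p in zip(actual, predicted) if a == p)

accuracy  = correct / len(actual)  # share of all predictions that are right
precision = tp / (tp + fp)         # predicted positives that are truly positive
recall    = tp / (tp + fn)         # actual positives the model found
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

Precision matters when false alarms are costly (spam filters); recall matters when misses are costly (disease screening).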
AI Engineering
AI Engineering Building production systems that apply AI models to real business problems.
AI Engineer A professional who designs, deploys, and maintains AI-powered systems.
AI Model A trained system that performs tasks like prediction, classification, or generation.
Model Inference Running a trained model on new input data to produce outputs.
Pretrained Model A model trained on large datasets and reused for new tasks.
Fine Tuning Adapting a pretrained model to a specific task or domain.
Large Language Model (LLM) A model trained on massive text data to understand and generate language.
Prompt An instruction or input given to an AI model.
Prompt Engineering Designing prompts to get accurate and reliable model responses.
Context Window The amount of information a model can consider at once.
Embedding A numeric representation of data used for similarity and search.
Vector Database A database optimized for storing and searching embeddings.
Semantic Search Finding results based on meaning rather than exact keywords.
RAG (Retrieval Augmented Generation) Combining external data retrieval with model generation.
AI Agent An AI system that can reason, decide, and take actions.
Tool Calling Allowing models to trigger functions, APIs, or external tools.
Orchestration Managing how multiple models, tools, and steps work together.
Inference Pipeline A workflow that processes input data through models and tools.
Latency The time it takes for an AI system to return a response.
Scalability The ability of an AI system to handle growing usage or data.
Model Serving Hosting models so they can be accessed by applications.
API Integration Connecting AI models to applications through APIs.
Evaluation Measuring the quality and reliability of AI outputs.
Hallucination When a model generates incorrect or fabricated information.
Guardrails Rules and controls that limit unsafe or incorrect AI behavior.
Monitoring Tracking performance, errors, and usage of AI systems in production.
Cost Optimization Managing compute and token usage to reduce AI costs.
Security Protecting AI systems from misuse or attacks.
Privacy Ensuring sensitive data is handled safely by AI systems.
Model Governance Policies and practices for managing AI models responsibly.
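Embeddings and semantic search come up in almost every RAG discussion, so here is a toy sketch. The three-dimensional "embeddings" and document names are invented; real embeddings come from a model and have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny made-up "embeddings" for two documents.
docs = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]  # e.g. the embedding of "how do I get my money back?"

# Semantic search: rank by meaning (vector similarity), not keyword overlap.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # refund policy
```

A vector database does exactly this comparison, just at the scale of millions of embeddings with approximate-nearest-neighbor indexes instead of a brute-force `max`.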
Data Governance
Data Governance The framework of policies and processes that ensure data is managed correctly across the organization.
Data Ownership Defining who is accountable for a dataset and its business meaning.
Data Stewardship Operational responsibility for maintaining data quality and standards.
Data Policy Formal rules that define how data should be created, used, and protected.
Data SLA An agreement defining data availability, quality, and timeliness expectations.
Data Completeness Whether all required data is present.
Data Consistency Ensuring data is the same across systems and reports.
Data Availability Whether data can be accessed when needed.
Data Standards Agreed formats, definitions, and conventions for data consistency.
Data Quality Management Processes to measure, monitor, and improve data quality.
Data Lineage Tracking where data comes from and how it changes across systems.
Data Catalog A centralized inventory describing datasets, tables, and columns.
Business Glossary A shared dictionary defining business terms and metrics.
Metadata Management Managing technical and business metadata for transparency.
Data Classification Categorizing data based on sensitivity and usage.
Access Control Defining who can view, edit, or use data.
Role Based Access Control (RBAC) Granting data access based on user roles.
Data Privacy Ensuring personal and sensitive data is handled legally and ethically.
Data Security Protecting data from unauthorized access or breaches.
Compliance Adhering to regulations such as GDPR or internal policies.
Auditability Ability to trace data changes for review and compliance.
Data Retention Rules defining how long data should be stored.
Data Archiving Moving inactive data to long-term storage.
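To close the governance section, here is a minimal RBAC sketch. The roles, users, and permissions are all hypothetical; real systems enforce this in the database or platform layer, not in application code:

```python
# RBAC: permissions are granted to roles, and users get access
# by being assigned roles, never directly.
role_permissions = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
}
user_roles = {"amy": {"analyst"}, "ben": {"engineer"}}

def can(user, action):
    """True if any of the user's roles grants the requested action."""
    return any(action in role_permissions.get(role, set())
               for role in user_roles.get(user, set()))

print(can("amy", "write"), can("ben", "write"))  # False True
```

The payoff is maintainability: promoting amy to engineer is a one-line role change, not an audit of every dataset she touches.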
I hope you like it and I wish you a wonderful day ❤️
Baraa
By the way, today I also released a very important Python video about functions.
This is a must-learn topic and one of the key things that separates professional developers from casual coders.
Also, here are 4 complete roadmap videos if you’re figuring out where to start:
📌 Data Engineering Roadmap
📌 Data Science Roadmap
📌 Data Analyst Roadmap
📌 AI Engineering Roadmap
Hey friends —
I’m Baraa. I’m an IT professional and YouTuber.
My mission is to share the knowledge I’ve gained over the years and to make working with data easier, fun, and accessible to everyone through courses that are free, simple, and easy!