DECISION-DRIVEN ANALYTICS

Data Analysis & Insight Engineering

We wire marketing data into decision mechanisms, not dashboards. KPI tree, dbt modelling, Bayesian MMM, incrementality testing and self-serve analytics — the infrastructure of action, not measurement.

Analytics isn't 'building dashboards'; it's an operating system where every chart triggers a decision.

Most companies drown in 40+ dashboards yet get five different answers to the same question from five different sources. KPIs become debates, decisions get deferred, and the HiPPO (highest-paid person's opinion) wins. Roibase's analytics operation clears this uncertainty through six principles; every principle produces decisions, not dashboards.

Roibase perspective

METHODOLOGY

DIAGNOSE to MODEL to BUILD to AUTOMATE to VALIDATE to EDUCATE

Six layers of the analytics operation; each produces distinct artifacts and feeds the decision loop.

01

DIAGNOSE

Decision inventory + question map

The 30 questions decision-makers ask weekly are listed; source of answer, frequency, SLA and impact are made explicit.

02

MODEL

KPI tree + data model

dbt models + LookML or Metabase semantic layer; versioned, testable, documented.

03

BUILD

Dashboard + alert system

Dashboards organized by decision category (CAC, retention, revenue quality); threshold-based alerts + trigger templates.

04

AUTOMATE

Pipeline + refresh + monitoring

Refresh orchestration via Airflow / Dagster / dbt Cloud; pipeline health + data quality tests + Slack bot.

05

VALIDATE

A/B + incrementality + MMM validation

Model outputs are compared against experiments; calibration via incrementality testing + MMM scenario simulation.

06

EDUCATE

Data council + self-serve enablement

Monthly data council: which question went unanswered, which dashboard went unused, what self-serve training is needed.

— COMPARISON

Where do we differ? Classic BI vs decision-driven analytics

A company can mistake 100 dashboards for 'analytics'. The real value emerges only when every dashboard is tied to a decision and every decision to an action.

Dimension | In-house BI alone | Classic reporting agency | Roibase decision-driven analytics
KPI definition | Overlaps across teams | Agency template | KPI tree + written ownership
Dashboard philosophy | Chart abundance | Quarterly PPT focused | Every chart a decision
Data modelling layer | Ad-hoc SQL + Excel | Platform-native reporting | dbt + versioned + tested
Cohort + LTV engineering | Limited to average metrics | Not delivered as a report | D1-D90 + segment + LTV curve
MMM + incrementality | None | Excel-based attempts | Bayesian MMM + geo-holdout
Anomaly / alert system | Manual checks | None | ML drift detector + Slack/email
Self-serve culture | Data team bottleneck | Report-driven | Business units self-query
Governance + PII | No policy | Unaware | PII tagging + retention + audit

PROOF

Outcomes, measured

30
Decision questions

Strategic questions that become answerable in the first sprint.

-40%
Reporting time saved

Hours reclaimed from the marketing team's manual dashboard prep.

3
MMM refreshes/year

Refresh cadence based on seasonality + channel mix changes.

18-24
Months of historic horizon

Minimum daily data window required for MMM + forecast.

99.2%
Pipeline uptime

dbt + Airflow + monitoring SLA; data quality tests included.

5 days
Dashboard publish time

Average time from brief to live for a new decision panel.

WHAT WE DO

Engagement scope

Every offering is an outcome-based work package. Roibase blends strategy and execution inside a single team — no hand-offs.

01 / 10

KPI tree architecture

Every marketing metric links directly to business output; every metric has an owner, a source, a threshold and a triggered decision.
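A minimal sketch of what one KPI-tree node could look like as a data structure. The class name, field names, the sample metric and the "higher is worse" threshold direction are all illustrative assumptions, not Roibase's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KpiNode:
    """One node of a KPI tree; every field name here is hypothetical."""
    name: str
    owner: str
    source: str        # e.g. the name of a warehouse table (assumed)
    threshold: float   # assumption: values above this are a breach
    decision: str      # action triggered when the threshold is breached
    children: List["KpiNode"] = field(default_factory=list)

    def triggered(self, value: float) -> Optional[str]:
        """Return the decision if the value breaches the threshold, else None."""
        return self.decision if value > self.threshold else None


# Hypothetical example node: blended customer acquisition cost.
cac = KpiNode(
    name="blended_cac",
    owner="growth_lead",
    source="marts.marketing_cac",
    threshold=50.0,
    decision="pause lowest-ROAS campaigns and review bids",
)

print(cac.triggered(62.0))  # breach: prints the decision text
print(cac.triggered(41.0))  # within threshold: prints None
```

Keeping owner, source, threshold and decision on the same record is what lets a panel or alert say who acts and how, rather than only showing a number.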

02 / 10

Decision-tree dashboards

Not charts, decisions: panels designed with 'at this threshold, take this action' logic; each panel for a role, at a frequency.

03 / 10

dbt + warehouse + BI layer

Versioned + testable data models with dbt; on BigQuery / Snowflake / Redshift; surfaced through LookML / Metabase / Lightdash.

04 / 10

Cohort & retention engineering

D1/D7/D30/D90 cohort tables, LTV curves, segment-level churn and resurrection analysis — the real behavior under the average.
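As a rough illustration of how a D1/D7/D30 cohort table is derived, the sketch below computes day-N retention from signup dates and activity events. The function name, the "active exactly N days after signup" definition and the sample data are assumptions; production versions usually live in SQL/dbt models.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple


def day_n_retention(
    signups: Dict[str, date],
    activity: Iterable[Tuple[str, date]],
    n: int,
) -> float:
    """Share of signed-up users who were active exactly n days after signup."""
    active_on_day_n: Set[str] = set()
    for user_id, active_date in activity:
        signup_date = signups.get(user_id)
        if signup_date is not None and (active_date - signup_date).days == n:
            active_on_day_n.add(user_id)
    return len(active_on_day_n) / len(signups) if signups else 0.0


# Hypothetical cohort: four users who all signed up on the same day.
signups = {uid: date(2024, 1, 1) for uid in ("u1", "u2", "u3", "u4")}
activity = [
    ("u1", date(2024, 1, 2)),   # active on D1
    ("u2", date(2024, 1, 2)),   # active on D1
    ("u1", date(2024, 1, 8)),   # active on D7
    ("u3", date(2024, 1, 31)),  # active on D30
]

retention = {f"D{n}": day_n_retention(signups, activity, n) for n in (1, 7, 30)}
print(retention)
```

Splitting the same calculation by segment (channel, plan, country) is what exposes the behavior that an overall average hides.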

05 / 10

Bayesian MMM

Media, promo, seasonality and macro variables modelled together; Robyn + PyMC; quarterly refresh + confidence bands.
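Two transforms sit at the core of most MMM specifications: adstock (carry-over of past spend) and saturation (diminishing returns). The sketch below shows one common form of each, geometric adstock and a Hill curve, in plain Python. It is not the Robyn or PyMC API; the function names and parameter values are illustrative, and in a real model the decay and half-saturation parameters are estimated, not hand-set.

```python
from typing import List


def geometric_adstock(spend: List[float], decay: float) -> List[float]:
    """Carry a fraction `decay` of the previous period's effect forward."""
    carried = 0.0
    out = []
    for x in spend:
        carried = x + decay * carried
        out.append(carried)
    return out


def hill_saturation(x: float, half_saturation: float, shape: float = 1.0) -> float:
    """Map adstocked spend to a 0-1 response with diminishing returns."""
    if x <= 0:
        return 0.0
    return x**shape / (x**shape + half_saturation**shape)


# Hypothetical weekly spend: one burst, then nothing.
weekly_spend = [100.0, 0.0, 0.0]
adstocked = geometric_adstock(weekly_spend, decay=0.5)
response = [hill_saturation(a, half_saturation=200.0) for a in adstocked]
print(adstocked)
print(response)
```

The adstocked series keeps decaying after spend stops, which is why channel contribution in an MMM does not drop to zero the week a campaign ends.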

06 / 10

Attribution modelling

GA4 DDA + multi-touch attribution + Shapley value approaches; a decision model beyond platform-biased reporting.
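To make the Shapley value approach concrete, here is a small sketch that averages each channel's marginal contribution over every ordering of channels. The channel names and the conversions-per-channel-set table are invented for illustration; real inputs would come from path or experiment data.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    channels: List[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Average marginal contribution of each channel over all orderings."""
    totals = {c: 0.0 for c in channels}
    orderings = list(permutations(channels))
    for order in orderings:
        seen: FrozenSet[str] = frozenset()
        for c in order:
            with_c = seen | {c}
            totals[c] += value(with_c) - value(seen)
            seen = with_c
    return {c: t / len(orderings) for c, t in totals.items()}


# Hypothetical conversions observed for each combination of active channels.
conversions = {
    frozenset(): 0.0,
    frozenset({"search"}): 40.0,
    frozenset({"social"}): 20.0,
    frozenset({"search", "social"}): 100.0,
}

attribution = shapley_values(["search", "social"], conversions.__getitem__)
print(attribution)
```

The credits always sum to the value of the full channel set. Enumerating every ordering grows factorially, so with more than a handful of channels the orderings are usually sampled instead.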

07 / 10

Incrementality testing

Geo-holdout + matched-market tests; Meta Lift, GeoLift, in-house framework; the reference accuracy for budget decisions.
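The arithmetic behind a geo-holdout read can be sketched as a ratio-scaled pre/post comparison: the control region's trend sets the expectation for the test region, and the gap is the incremental effect. This is a deliberately simple stand-in, not the method used by Meta Lift or GeoLift, and the function name and figures are hypothetical.

```python
def incremental_conversions(
    test_pre: float,
    test_post: float,
    control_pre: float,
    control_post: float,
) -> float:
    """Test-region conversions beyond what the control trend predicts."""
    if control_pre <= 0:
        raise ValueError("control_pre must be positive")
    expected_test_post = test_pre * control_post / control_pre
    return test_post - expected_test_post


# Hypothetical figures: control grew 10%, test region grew 32%.
lift = incremental_conversions(
    test_pre=1000.0, test_post=1320.0,
    control_pre=800.0, control_post=880.0,
)
print(lift)
```

A real protocol would add matched-market selection, a power calculation and an uncertainty interval around the lift before it informs a budget decision.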

08 / 10

Anomaly detection

ML-based drift detector + forecast band + Slack/email alerts for silently deteriorating metrics; hourly, not morning-after.
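The forecast-band idea can be illustrated with the simplest possible band: the mean of recent history plus or minus k standard deviations. This sketch is a stand-in for the ML-based detector named above, not a description of it; the function name, the default k and the sample values are assumptions.

```python
from statistics import mean, stdev
from typing import List


def is_anomaly(history: List[float], latest: float, k: float = 3.0) -> bool:
    """Flag `latest` if it falls outside mean +/- k standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    return abs(latest - mu) > k * sigma


# Hypothetical hourly conversion counts.
recent = [100.0, 102.0, 98.0, 101.0, 99.0]
print(is_anomaly(recent, 90.0))   # far below the band
print(is_anomaly(recent, 103.0))  # inside the band
```

Running a check like this on every hourly refresh, and posting to Slack or email when it returns True, is the basic loop behind catching a metric hours rather than a day after it starts to deteriorate.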

09 / 10

Self-serve analytics

An environment where business units answer their own questions (Metabase, Lightdash, Hex) + training + mentoring.

10 / 10

Data governance

PII tagging, schema registry, retention policy, data access audit, documentation pack; KVKK + GDPR compliant operation.

— OUTCOME

The decision-side impact of a data operation

The faster, more data-grounded and more repeatable an organization's decisions are, the further ahead it stays in unpredictable market conditions.

3x speed

Decision speed

All 30 strategic questions have answers on the panel; meetings debate action, not data.

Data-driven

HiPPO reduction

Data triggers decisions, not the highest-paid person's opinion; debate is anchored to metrics.

-40% hours

Reporting time saved

The marketing team's manual Excel routines end; reclaimed hours go into strategic analysis.

Hours, not days

Early warning + action

ML drift detector + threshold-based alerts catch deteriorating metrics within hours.

50+ self-serve users

Self-serve culture

Business units answer their own questions without waiting on the data team; the data team focuses on strategic work.

±8% accuracy

MMM + forecast accuracy

With Bayesian MMM + incrementality calibration, forecast deviation stays within ±8%; budget decisions carry a known margin of error.

DELIVERABLES

Monthly + quarterly outputs

Concrete artifacts of the analytics operation; each is handed over to your team, and by month 12 the runbook enables fully independent operation.

  • Decision inventory + 30-question map

    The list of questions decision-makers ask weekly, with source of answer, SLA and missing data needs.

  • KPI tree

    Every metric's source, owner, threshold and triggered decision — a single Miro / FigJam board, versioned.

  • dbt repo + models

    Versioned + testable dbt project; staging / intermediate / marts layers, documentation included.

  • Semantic layer (LookML / Metabase models)

    The shared metric definitions layer behind every question business units will ask.

  • Dashboard pack

    First 15-25 panels organized by decision category (CAC, retention, revenue quality); each by role + frequency.

  • Threshold-based alert system

    ML drift detector + forecast band + Slack/email integration; deteriorating metrics trigger alerts within hours.

  • Cohort + retention report

    D1/D7/D30/D90 tables + LTV curves + churn segment analysis + resurrection rate.

  • MMM model + report

    Bayesian MMM (Robyn/PyMC); channel contribution + saturation + adstock + confidence bands.

  • Incrementality test protocol

    Geo-holdout and matched-market test framework; planning + execution + analysis templates.

  • Data governance runbook

    PII tagging, schema registry, retention policy, access audit — KVKK + GDPR compliant.

  • Monthly data council summary

    Which questions got answered, which didn't, which dashboards got used, and a priority list for next month.

  • Self-serve training material

    Metabase / Lightdash / Hex training videos for business units + SQL / jargon glossary + practice dataset.

— SCOPE

What's included, what isn't?

The boundaries of the analytics operation are clear. Seeing scope up-front removes wrong expectations and scope creep.

What this service covers

  • Decision inventory + 30-question first sprint
  • KPI tree + written ownership + versioned document
  • dbt repo setup + staging/intermediate/marts layers
  • Warehouse integration (BigQuery / Snowflake / Redshift / Databricks)
  • LookML or Metabase semantic layer
  • First 15-25 dashboards + quarterly additions
  • ML-based anomaly detection + threshold-based alerts
  • Cohort + LTV + retention analytics — quarterly refresh
  • Bayesian MMM (3 refreshes per year)
  • Incrementality test protocol + execution
  • Data governance runbook (PII, retention, audit)
  • Monthly data council + self-serve training flow

What's not included (optional extensions)

  • Finance / accounting BI (ERP-side is separate consulting)
  • Warehouse compute / license costs (customer's contract)
  • Custom ML model training (beyond forecasting)
  • Real-time streaming infrastructure (Kafka, Kinesis — separate scope)
  • Data privacy / legal counsel (with a partner lawyer)
  • BI tool license renewals
  • Third-party data purchases (panel, survey)
  • Marketing operations themselves (PPC / SEO / CRO are separate services)

HOW WE WORK

Process: analytics operation from Week 1 diagnosis to Month 6+ governance

01

Weeks 1-2 — Decision inventory + audit

The list of 30 strategic questions, current dashboard inventory, data source health, and SLA diagnosis.

02

Week 3 — KPI tree + schema

Written KPI tree, metric definitions, ownership; warehouse schema + staging layer decisions finalized.

03

Weeks 4-5 — dbt models + first dashboards

dbt staging + intermediate + marts; first 5-8 dashboards publish; stakeholder review.

04

Weeks 6-8 — Alerts + cohorts + refresh

Threshold-based alert system, cohort + retention reports, dbt Cloud / Airflow refresh pipeline.

05

Month 3 — MMM train + first result

Bayesian MMM on 18 months of history; channel contribution + saturation + first budget revision recommendation.

06

Month 4 — Incrementality test protocol

Geo-holdout or matched-market framework; first test goes live, results in 4-6 weeks.

07

Month 5 — Data council + self-serve training

Monthly data council routine starts; Metabase / Lightdash self-serve training flow for business units.

08

Month 6+ — Quarterly refresh + governance

Quarterly MMM refresh, incrementality test cycle, data governance audit; full handover possible at month 12.

— TOOL STACK

From warehouse to BI — the analytics stack

We are tool-agnostic, but at every layer there are clear picks that produce the most value. We adapt to your existing stack.

WAREHOUSE

  • BigQuery (economical, on-demand)
  • Snowflake (enterprise, decoupled compute)
  • Redshift (inside AWS stack)
  • Databricks (ML-heavy workloads)
  • Postgres (small to mid-scale)

MODELLING & TRANSFORM

  • dbt (core + cloud)
  • Dataform (GCP native)
  • Coalesce (visual)
  • Airflow / Dagster (orchestration)
  • Fivetran / Stitch / Airbyte (ingestion)

BI & VISUAL

  • Looker (LookML semantic layer)
  • Metabase (self-hosted self-serve)
  • Lightdash (dbt-native BI)
  • Tableau (enterprise)
  • Hex / Mode (notebook-driven)
  • Looker Studio (quick-win)

ML & MMM

  • Robyn (Meta open-source MMM)
  • PyMC / Pyro (Bayesian modelling)
  • scikit-learn (drift detection)
  • Prophet (forecasting)
  • GeoLift (incrementality)
  • Monte Carlo / Great Expectations (data quality)

QUESTIONS

Frequently asked

For some companies, yes; under 10 dashboards, no cross-table joins, single-channel operations make Looker Studio a practical choice. But once you need 30+ dashboards, versioned data models and role-based access, Looker / Metabase / Lightdash become necessary.

— GLOSSARY

Analytics terminology

When teams use the same term to mean the same thing, debate accelerates the decision; when they don't, doubt accelerates instead.

01
KPI Tree
The hierarchical tree of metrics that cascade down from a core business output; every node is a decision trigger.
02
dbt
Data build tool — an SQL-based, versioned, testable data transformation framework; the standard of analytics engineering.
03
Semantic Layer
The shared metric definitions + business logic layer behind the BI tool; implemented with LookML, Metabase models, Cube and similar.
04
Cohort
A group of users that share a defining property (signup date, acquisition channel); their behavior is analyzed over time.
05
LTV (Lifetime Value)
A customer's total lifetime value; gross margin x retention x order frequency x basket value.
06
Retention
The percentage of acquired users still active in a given time window (D1, D7, D30, M1, M3). In SaaS and mobile games it is a direct read on product-market fit; a cohort curve that flattens out is the signature of a healthy product.
07
Churn
The percentage of users leaving the active customer base in a given time window. In subscription businesses it hits MRR directly; in e-commerce it is the inverse of repeat rate. Split into voluntary (cancelled) and involuntary (payment failure); reduced via onboarding, pricing and lifecycle messaging.
08
MMM (Marketing Mix Modeling)
A statistical model (Bayesian in Roibase's setup) that estimates each channel's contribution to a business outcome from aggregate time-series data; requires 18-24 months of historic data.
09
Incrementality
The extra conversion that wouldn't have happened without a channel; measured via geo-holdout tests, independent of attribution.
10
Anomaly Detection
An umbrella for techniques that automatically flag values outside the expected range in time-series metrics (KPI, conversion, latency, fraud signal). Tools include STL decomposition, Prophet, isolation forests and neural OoD models; the brain behind alerting and observability dashboards.
11
Self-Serve Analytics
An analytics environment where business units answer their own questions without waiting on the data team; delivered via Metabase, Lightdash, Hex.
12
Data Governance
The combined policies for data quality, access control, PII management, retention and audit; KVKK/GDPR compliant.
13
ETL / ELT
Extract → Transform → Load (legacy) vs. Extract → Load → Transform (modern). Approaches to moving data from source to warehouse. ELT relies on cheap cloud-DW compute; dbt + BigQuery/Snowflake/Databricks is today's standard.
14
Data Lake
A central store for all structured and unstructured data (logs, images, video, raw events) without enforcing a schema. Built on S3, GCS or ADLS with Parquet/Iceberg/Delta Lake; complements the warehouse and forms the basis of the lakehouse architecture.
15
Stream Processing
Processing data as a real-time event flow rather than in batches. Common stacks: Kafka + Flink/Spark Streaming/Kinesis + ksqlDB; use cases include fraud detection, real-time personalisation, IoT telemetry and anomaly alerting.
16
Data Contract
A pre-agreed contract between data producers and consumers covering schema, semantics, SLA and ownership. Operated with dbt + Great Expectations + JSON Schema; the most reliable wall against the "a downstream model just broke" surprise.
17
LLM (Large Language Model)
A general-purpose language model with billions of transformer parameters, pre-trained on massive text corpora. GPT-5, Claude, Gemini, Llama; the workhorse for chat, code, summarisation, translation, retrieval and agent tasks — specialised via task-specific fine-tuning or prompt engineering.
18
Transformer
The neural-network architecture introduced in "Attention Is All You Need" (2017) that captures long-range relationships in sequential data via self-attention. The successor to RNN and LSTM; the substrate of every modern LLM (GPT, Claude, Llama, Gemini) and even vision models (ViT).
19
Embedding
A high-dimensional vector representation of a word, sentence, image or user — semantic similarity is measured by vector proximity. The common currency of recommendation, semantic search, RAG, clustering and anomaly detection; OpenAI ada, Cohere and sentence-BERT are common producers.
20
RAG (Retrieval-Augmented Generation)
An architecture where the LLM, before generating an answer, fetches relevant documents from an external knowledge base (vector DB, doc store) and injects them into the context. Reduces hallucination and is the standard way to give the model "open-book" access to fresh/private data — embedding + retriever + LLM triple.
21
Vector Database
A database that stores embeddings in a high-dimensional vector space and finds similar vectors in milliseconds via ANN (Approximate Nearest Neighbor) algorithms. Pinecone, Weaviate, Qdrant, pgvector, Chroma; the real engine of RAG's retrieval layer.
22
Fine-tuning
The process of retraining a pre-trained foundation model on extra (usually small) labelled data for a specific task or domain. Full fine-tune, LoRA/QLoRA and instruction-tuning are the common variants; the substrate of "custom assistant"-style use cases on top of ChatGPT and similar.
23
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that adds small "adapter" matrices instead of updating all foundation-model weights. Trains ~0.1-1% of parameters, cuts GPU memory by 70%+; per-task adapter swap makes multi-task model serving practical.
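The parameter saving follows from simple arithmetic: for one weight matrix of shape d x k, LoRA trains two low-rank factors of shapes d x r and r x k instead of the full matrix. The sketch below works this through for one layer; the dimensions and rank are illustrative, not taken from any specific model.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for rank-r factors of shapes d x r and r x k."""
    return r * (d + k)


# Hypothetical single projection layer and adapter rank.
d, k, r = 4096, 4096, 8
fraction = lora_params(d, k, r) / full_params(d, k)
print(lora_params(d, k, r), full_params(d, k), f"{fraction:.2%}")
```

For this layer the adapter is well under 1% of the full matrix, which is in line with the range quoted above.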
24
RLHF (Reinforcement Learning from Human Feedback)
The final stage of an LLM training pipeline that aligns the model's outputs with human-rater preferences. A reward model trained on those preferences plus a policy-optimisation algorithm such as PPO steers the model toward "helpful, honest, harmless" outputs (DPO pursues the same goal without a separate reward model); the foundation of ChatGPT's alignment.
25
Hallucination
When an LLM confidently fabricates a non-existent source, fact or quote. Caused by the model answering questions outside its training-data distribution with the same confidence as in-distribution ones; mitigated by RAG, citation grounding and self-consistency checks — never fully eliminated.
26
Prompt Engineering
The discipline of systematically designing the prompt (instruction + context + examples + format) so the LLM produces the desired output. Few-shot, chain-of-thought, role assignment, output schema, system prompt; the "how to talk to it" layer of any production AI app.
27
Context Window
The number of tokens (input + output) an LLM can process in a single call. Ranges from 8K-128K (GPT-4) to 200K (Claude) and 1M+ (Gemini); critical capacity for long-document analysis, multi-turn conversation and agent state — RAG is the alternative way to "extend" context.
28
Function Calling / Tool Use
The ability for an LLM to invoke an external function (API, DB query, code runner) via structured JSON instead of free text. OpenAI tools, Anthropic tool_use; the official protocol that lets agents reach into the real world.
29
AI Agent
A software construct that uses an LLM as decision engine and runs multi-step tasks autonomously via tool calling + memory + a plan-execute loop. ReAct, AutoGPT, Claude/GPT agents, LangGraph; the "research → plan → run tools → reach goal" architecture.
30
Foundation Model
A large model pre-trained on broad, diverse internet-scale data and transferable to downstream tasks — LLMs, vision models (CLIP, ViT), multimodal models (GPT-4o, Gemini). Applications are built on top via fine-tuning, prompt engineering or RAG.
31
Multimodal AI
An AI system in which the same model understands and generates across more than one modality — text + image + audio + video. GPT-4o, Gemini, Claude 3.5 vision; the substrate of cross-modal use cases like OCR, image captioning, video Q&A, audio transcription and screen-aware agents.
32
NLP (Natural Language Processing)
The AI sub-discipline focused on a computer's ability to understand, generate and transform natural language (Turkish, English, etc.). Tokenisation, POS tagging, NER, sentiment analysis, machine translation; LLMs are now the most powerful general-purpose tools in this space.
33
Token
The smallest text unit an LLM processes — can be a word, subword or single character. A tokeniser (BPE, WordPiece, SentencePiece) converts text to tokens; OpenAI pricing and context-window limits are denominated in tokens (1 English word ≈ 1.3 tokens).
34
Temperature
The parameter that controls the "randomness" of an LLM's output distribution — 0 = always pick the most likely token (deterministic), 1+ = more creative/diverse. Common picks: 0-0.3 for code/JSON/numerical output, 0.7-1.2 for story/brainstorm; tuned alongside top_p.
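Mechanically, temperature divides the model's logits before the softmax. The sketch below shows the effect on a toy three-token distribution; the logit values are made up, and rejecting a temperature of 0 is a choice of this sketch (greedy argmax decoding covers that case).

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for 0")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 1.2)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At low temperature almost all probability mass sits on the top token; at higher temperature the alternatives get a realistic chance of being sampled.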
35
Semantic Search
A search approach that returns meaning-based results by comparing query and document embeddings instead of matching keywords. Independent of spelling, captures synonyms; the retrieval engine of RAG — implemented with vector DBs + ANN.
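The comparison step can be sketched with cosine similarity over toy vectors. The three-dimensional "embeddings" below are invented for illustration; a real system would get vectors with hundreds of dimensions from an embedding model and use an ANN index instead of a full scan.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document embeddings (hand-written, not model output).
documents: Dict[str, List[float]] = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "careers page": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.2, 0.0]  # stands in for "how do I get my money back?"

best_match = max(documents, key=lambda name: cosine_similarity(query_embedding, documents[name]))
print(best_match)
```

The query shares no keyword with "refund policy", yet it ranks first because the vectors point in nearly the same direction.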
36
Inference
The phase in which a trained AI model produces predictions/generations on live data (the opposite of training). Latency, throughput, cost-per-request and the model-serving stack (vLLM, TGI, Triton) are the levers; this is ~90% of the production side of MLOps.
37
OLTP (Online Transaction Processing)
A database approach optimised for high-volume, row-based, low-latency reads/writes. PostgreSQL, MySQL, MongoDB; the standard backing store for live application backends — e-commerce carts, user sessions, reservations.
38
OLAP (Online Analytical Processing)
A column-based database approach optimised for large-scale analytical queries. BigQuery, Snowflake, Redshift, ClickHouse; scans millions of rows in seconds for aggregation, GROUP BY and time-series — the infrastructure of BI and dashboards.
39
ACID
The four guarantees of transactional databases: Atomicity (all-or-nothing), Consistency (rules never break), Isolation (concurrent ops don't see each other), Durability (committed data persists). The core contract of RDBMS like PostgreSQL, MySQL and Oracle.
40
BASE
The relaxed guarantee set of distributed/NoSQL systems: Basically Available, Soft state, Eventual consistency. The opposite of ACID — accepts brief inconsistency in exchange for availability + scale. The DynamoDB, Cassandra, Riak philosophy.
41
Sharding
Splitting a database by some key (user_id mod 16, time range) and storing each shard on a separate server. The horizontal-scaling method; cross-shard JOINs become impractical, and the shard-key choice is an irreversible architectural decision.
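A small sketch of the `user_id mod 16` routing mentioned above, plus a count of how many keys change shard if the shard count is later doubled. The function name and the range of user IDs are illustrative assumptions.

```python
NUM_SHARDS = 16


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a user to a shard by taking the ID modulo the shard count."""
    return user_id % num_shards


# Hypothetical population of 1,000 sequential user IDs.
user_ids = range(1000)
moved = sum(1 for uid in user_ids if shard_for(uid, 16) != shard_for(uid, 32))

print(shard_for(35))          # shard index for one user
print(moved, len(user_ids))   # keys that would need to move after resharding
```

Roughly half of the keys land on a different shard after doubling, which illustrates why the shard-key and shard-count choice is so costly to revisit.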
42
Replication
Keeping a live copy of a database on multiple servers — to spread read load (read replicas) and provide failover. Async (Postgres streaming) is laggy but fast, sync is consistent but slow; every replication strategy is a tradeoff.
43
Eventual Consistency
In a distributed system, an update needs time to propagate to every replica — for a short window different nodes may return different values. The DynamoDB and Cassandra default; not suitable for banking, ideal for social media.
44
CDC (Change Data Capture)
A pattern that captures INSERT/UPDATE/DELETE events from a database in real time and ships them to downstream systems (warehouse, search index, cache). Debezium, Kafka Connect; built on replication slots + log tailing, the modern alternative to polling.
45
Star Schema
A warehouse-modelling approach where a central fact table (e.g. orders) is surrounded by dimension tables (customer, product, date) in a star shape. BI queries need few JOINs = fast; the canonical architecture for BigQuery, Snowflake.
46
Materialized View
A database object that physically writes a SELECT query's result to disk and caches it. Pre-computes a complex aggregation instead of recomputing it every time; the refresh strategy (manual, scheduled, incremental) is the tradeoff.
47
Normalization
The process of splitting a database schema into related tables to eliminate redundancy and update anomalies (1NF, 2NF, 3NF, BCNF). Standard for OLTP; guarantees every update happens in one place — at the cost of more JOINs.
48
Denormalization
Deliberately merging normalised tables and accepting redundancy in exchange for query performance. Standard for OLAP / data warehouse; cuts JOIN cost, manages the inconsistency risk via ETL/CDC.
49
Time-series Database
A database optimised for high-volume writes of timestamped metrics (CPU usage, IoT sensors, finance tickers) and time-range queries. InfluxDB, TimescaleDB, Prometheus, ClickHouse; downsampling + retention policy are the core features.
50
Iceberg / Hudi / Delta Lake
Open-source projects that add a "table format" layer over object storage (S3, GCS), bringing schema evolution, ACID, time-travel and concurrent-writer support. The three standard table formats of the lakehouse architecture.
51
Data Quality
The discipline of measuring a dataset on accuracy, completeness, consistency, freshness and uniqueness. Tools like Great Expectations, Monte Carlo and Soda automate the tests; the only real defence against the "garbage in, garbage out" problem.
52
Data Lineage
A traceable graph of every transformation step a data point goes through, from source (raw event) to the end user (dashboard KPI). Atlan, OpenMetadata, dbt docs; the deterministic answer to "where does this KPI come from?" plus impact analysis.
53
Data Mesh
A structure of domain-based (marketing, finance, product) self-serve data products instead of a central data team. Built on domain ownership + product thinking + federated governance; the answer to the "data team is a bottleneck" problem at scale.
54
Data Catalog
A central catalogue that indexes every data asset in an organisation (table, dashboard, ML model, column) with search, descriptions and ownership info. Atlan, Collibra, OpenMetadata, Amundsen; the answer to "does this data exist, who owns it?"
55
Schema Evolution
The ability of a data format (Avro, Parquet, JSON) to change over time without breaking existing consumers when fields are added. Demands discipline around backward + forward compatibility, optional fields and default values; critical for CDC, event sourcing and lakehouse workloads.
56
AWS DynamoDB
AWS's serverless NoSQL key-value + document database. Single-digit-ms latency at millions of requests per second, automatic partitioning, point-in-time recovery and global tables (multi-region). Ideal for game backends, IoT telemetry, session storage and leaderboards.
57
GCP Spanner
Google's globally scalable, ACID-compliant, horizontally scaling relational database. SQL syntax + DynamoDB-grade scale + PostgreSQL-grade transactions; multi-region 99.999% uptime; runs Google Ads/Maps and is ideal for fintech.
58
Azure Cosmos DB
Microsoft Azure's global-scale, multi-model NoSQL database. SQL, MongoDB, Cassandra, Gremlin (graph) and Table APIs on the same engine; five consistency levels (strong → eventual); SLA-bound latency and throughput.
59
Prometheus
The metrics layer of the cloud-native monitoring stack. Pull-based scraping collects /metrics from target endpoints; PromQL handles time-series queries; Alertmanager manages alert rules. The de-facto standard for Kubernetes and modern microservice architectures.
60
Grafana
Open-source data-visualisation and dashboard platform. Unifies 100+ data sources (Prometheus, Loki, Elasticsearch, CloudWatch, Postgres…) in a single pane; alerting, annotations, panel templating; an SRE-team staple for NOC screens.
61
Jaeger
A CNCF distributed-tracing platform. Captures every hop of a user request across microservices as spans; visualises latency bottlenecks, missing dependencies and error propagation. 100% compatible with the OpenTelemetry standard.
62
OpenTelemetry (OTel)
A CNCF project that unifies observability (metrics, logs, traces) under a single vendor-neutral standard. SDKs and auto-instrumentation make application code portable across Datadog, New Relic, Honeycomb, Jaeger and others — breaking vendor lock-in.
63
ELK Stack
Elasticsearch + Logstash + Kibana: the open-source log-aggregation, indexing and visualisation stack. Logstash ingests, Elasticsearch indexes for full-text search, Kibana visualises. Loki + Grafana is gaining ground at high scale, but ELK is still ubiquitous.
64
SLI (Service Level Indicator)
A numeric indicator of a service's health — success rate, p99 latency, availability. The measurement base for an SLO; gives an objective answer to questions like "what percent of requests completed under 200 ms?". A core concept from Google's SRE Book.
65
SLO (Service Level Objective)
The internal target value you want an SLI to hit — e.g. "p99 latency under 200 ms for 99.9% of a 30-day window". The engineering team's answer to "how reliable is reliable enough"; the foundation for an error budget.
66
SLA (Service Level Agreement)
An external contract between a service provider and a customer; the legal reflection of an SLO. Breaching an SLA triggers penalties such as refunds or credits. Rule of thumb: the SLA promises less than the internal SLO, so engineering aims tighter than the public guarantee (for example, a 99.9% SLA backed by a 99.95% SLO).
67
Error Budget
The "allowed amount of failure" that falls out of an SLO. 99.9% SLO = 0.1% error budget = ~43 minutes of downtime per month. Budget remaining: take risks (ship new releases). Budget burnt: switch to stabilisation. The SRE balance between innovation and reliability.
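The 43-minute figure is plain arithmetic on the SLO and the length of the window. A short sketch, with an assumed 30-day window and a hypothetical function name:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} SLO -> {error_budget_minutes(target):.1f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the SLO target is a business decision as much as an engineering one.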
68
Diffusion Model
A family of generative models that gradually add noise to data and learn to reverse that process. The core architecture behind modern image/video generators like Stable Diffusion, Midjourney, DALL-E 3 and Sora. Trains far more stably than GANs and produces far more varied output.
69
GAN (Generative Adversarial Network)
A generative model in which two neural networks — a Generator (fakes) and a Discriminator (real-vs-fake judge) — train by competing. Introduced by Ian Goodfellow in 2014; the technology behind early deepfakes, StyleGAN portraits and super-resolution. Today largely overshadowed by diffusion models.
70
CLIP (Contrastive Language-Image Pre-training)
OpenAI's 2021 model that aligns images and their captions in a shared embedding space — the embedding of "a photo of a cat" lands near actual cat-photo embeddings. The text-to-image conditioner inside Stable Diffusion, and the foundation of zero-shot image classification and image search.
71
ControlNet
A 2023 architecture that adds an extra conditioning signal to diffusion models. Steers generation with references like pose, depth map, canny edges or scribbles, enabling specific controls such as "this pose but different clothes". One of the most-used add-ons in the Stable Diffusion ecosystem.
72
Adapter Tuning
A fine-tuning approach where small "adapter" layers are inserted into a large language model instead of retraining all of its parameters. LoRA, QLoRA and IA³ are the popular variants; under 1% of the original parameters get trained, slashing GPU cost dramatically.
73
PEFT (Parameter-Efficient Fine-Tuning)
An umbrella term for approaches that train a small subset of parameters instead of full fine-tuning a 70B-parameter LLM. LoRA, prompt tuning, prefix tuning and adapter tuning are all flavours of PEFT. HuggingFace's peft library is the standard tooling.
74
Quantization (LLM)
A technique that compresses a model's float32/float16 weights down to int8, int4 or even int2. Memory drops 4-8×, inference speeds up 2-3× and quality loss is usually small. Llama.cpp, the GGUF format and the AWQ/GPTQ algorithms are the common tooling.
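One of the simplest schemes, symmetric per-tensor int8 quantization, can be sketched in a few lines: pick a scale so the largest weight maps to 127, round, and multiply back to dequantize. This is an illustration of the idea only; it is not the AWQ or GPTQ algorithm, and the weights are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map the largest magnitude to +/-127."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0] * len(weights), 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]


# Hypothetical weights from one small tensor.
weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
print(q_weights, scale)
print(restored)
```

Each weight now fits in one byte instead of four (float32), and the reconstruction error per weight is bounded by half the scale.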
75
Knowledge Distillation
A technique that transfers the behaviour of a large "teacher" model into a small "student" model. By training on the teacher's soft probability outputs, the student reaches near-teacher accuracy with far fewer parameters. The trick behind DistilBERT and similar compact models.
76
Mixture of Experts (MoE)
An architecture that, instead of one monolithic model, routes each token through a sparse selection (one or two) of small "expert" sub-models. Used in Mixtral 8x7B and DeepSeek, and reportedly in GPT-4; cuts active-parameter count, keeping capacity high while reducing inference cost.
77
Speculative Decoding
A technique that speeds up LLM inference: a small "draft" model proposes several tokens ahead, then the large "target" model verifies them in parallel and accepts the correct ones. Yields 2-3× speed-up with identical output quality. Standard in vLLM and llama.cpp.
78
KV Cache
An optimisation that keeps the Key and Value matrices computed for previous tokens in transformer attention layers in memory. Each new token only computes its own K/V instead of replaying the history. Speeds up inference 10-100×, but becomes the memory bottleneck on long contexts.
79
Attention Head
One of multiple small attention mechanisms running in parallel inside a Transformer. Each head can focus on a different aspect of the input: one may capture syntax, another position, another long-range dependencies. GPT-3 (175B), for example, uses 96 heads per layer; the building block of multi-head attention.
80
BPE Tokenizer (Byte-Pair Encoding)
A tokenisation algorithm that splits text into the most frequent sub-word pieces — e.g. "tokenization" → "token" + "ization". The GPT family, LLaMA and Mistral all use BPE variants (tiktoken, SentencePiece); vocabulary size stays fixed (~32K-128K) and the OOV problem is resolved.
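A minimal sketch of one BPE training step, assuming a toy corpus of three words with made-up frequencies: count adjacent symbol pairs, pick the most frequent one and merge it everywhere. Production tokenizers such as tiktoken and SentencePiece add byte-level handling and many optimisations not shown here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Tie-break on the pair itself so the result is deterministic.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters with a hypothetical count.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
best = most_frequent_pair(words)
merged = merge_pair(words, best)
```

Repeating these two steps until the vocabulary reaches its target size yields the merge table that is later applied to new text.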
81
DPO (Direct Preference Optimization)
A simpler alternative to RLHF. Instead of the reward-model + PPO complexity, it does direct logistic regression on pairs of "preferred vs rejected" responses. Stanford 2023; more stable, fewer hyperparameters, and the alignment method of choice in many models including Llama 3.
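The "logistic regression on pairs" idea can be sketched for a single preference pair as follows. The inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model; the numbers and the `beta` value are illustrative only.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of log-probabilities."""
    # How much more the policy favours the preferred answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: the loss sits at log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability towards the preferred answer: the loss falls.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

No separate reward model appears anywhere in the computation, which is the simplification DPO offers over the reward-model + PPO setup.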
82
Constitutional AI
A method introduced by Anthropic in 2022 that aligns a model with a written "constitution" (a list of ethical principles) instead of human reviewers. The model critiques and improves its own outputs against the constitution; the foundation of Claude's alignment, also known as RLAIF (Reinforcement Learning from AI Feedback).
83
Chain-of-Thought (CoT)
A prompting technique that asks an LLM to "think step by step" and write out intermediate reasoning before the answer. Introduced by a Google paper in 2022; dramatically improves performance on math, logic and multi-step questions. "Let's think step by step" is the magic phrase. Foundation of modern reasoning models (o1, DeepSeek-R1).
84
Few-Shot Prompting
A technique that provides 2-5 examples (input → output pairs) inside the prompt so the LLM applies the same pattern to a new input. Fast adaptation without fine-tuning — "answer like in these examples". The most practical solution for labelled text classification and formatted extraction.
85
Zero-Shot Prompting
A prompting approach where the task is described directly to the LLM with no examples. E.g. "Translate this text into German". Relies entirely on the model's pre-training knowledge; with frontier models (GPT-4, Claude) this is sufficient for most tasks.
86
Grounding (LLM)
A technique that "anchors" an LLM's answer in an external knowledge source — documents, a database or a web search. Retrieved context is used instead of pure parametric memory, dramatically reducing hallucination, enabling citations and keeping knowledge fresh in real time.
87
Structured Output (LLM)
The capability of forcing an LLM's output to conform to a defined JSON schema, Pydantic model or regex. OpenAI structured outputs, Anthropic tool use, vLLM grammar-constrained sampling. The key to moving from free-form text to a deterministic, production-ready data flow.
88
Tool Use (Agent)
An LLM's ability to call external tools — web search, a code interpreter, a calculator, custom APIs. Via the function-calling protocol the model returns "tool name + parameters", the runtime executes it and feeds the result back. The core of agent architectures (Claude Agent SDK, AutoGen, LangGraph).
89
Cross-Modal Embedding
Embeddings that represent different modalities (text, image, audio) in the same vector space. CLIP for image+text, ImageBind for text+image+audio+video+depth+thermal+IMU. Critical for multimodal search ("find marketing copy similar to this photo"), cross-modal retrieval and adding media to RAG.
90
Hybrid Search (BM25 + Vector)
A retrieval strategy that combines classic keyword search (BM25/lexical) with vector similarity. BM25 wins for exact-match queries (numeric IDs, product codes); vectors win on semantic ones ("how do I return this" → "return policy"). The gold standard of modern RAG.
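One common way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below is an assumption about how the merge could be done, not a method named in this entry; the document IDs are hypothetical and `k=60` is simply the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each document scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["sku-123", "faq-returns", "blog-post"]          # lexical results
vector_ranking = ["faq-returns", "policy-refunds", "sku-123"]   # semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that appear high in both lists rise to the top, while a document found by only one retriever still stays in the candidate set.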
91
Data Fabric
An integrated architecture that unifies distributed data sources (cloud, on-prem, SaaS) into a single logical data layer. Metadata-driven and AI-augmented; offers a "centralised integration" alternative to data mesh's "distributed ownership" model. Talend, Informatica and IBM Cloud Pak are key products.
92
Medallion Architecture
A data-lake organisation pattern popularised by Databricks — Bronze (raw), Silver (cleaned, conformed) and Gold (business-ready, aggregated) layers. Each layer builds on the previous one, cleanly separating data lineage, quality and reprocessing concerns.
93
Apache Spark
An in-memory distributed data-processing engine. The 10-100× faster successor to Hadoop MapReduce; combines SQL, streaming, ML (MLlib) and graph (GraphX) under one API. The core of Databricks, managed in AWS EMR, GCP Dataproc and Azure HDInsight; PySpark makes it a data engineer's primary tool.
94
Apache Flink
A true-streaming (event-by-event) processing engine. Versus Spark Streaming's micro-batch model it offers millisecond latency, exactly-once semantics and stateful processing. Powers real-time fraud and anomaly detection at Alibaba, Uber and Netflix.
95
Kafka Connect
Apache Kafka's source/sink connector framework. Brings CDC or batch ingestion from 100+ systems (Postgres, MySQL, S3, Elasticsearch, Snowflake…) into Kafka and streams data back out to external systems. Confluent's catalogue of 1,000+ connectors is the standard reference.
96
Singer
An open-source data-integration protocol from Stitch (now Talend) that moves JSON streams between "taps" (extract) and "targets" (load). A modular, vendor-neutral ELT framework; the core of open-source ELT platforms like Meltano.
97
Apache Airflow
A workflow-orchestration platform whose DAGs (Directed Acyclic Graphs) are defined in Python. Created at Airbnb in 2014 and donated to the Apache foundation. Scheduling, retries, dependency management and a web UI for observability; the de-facto standard in data pipelines.
98
Dagster
A modern asset-based data-orchestration framework. Where Airflow centres on tasks, Dagster centres on "data assets" — with data lineage, type checking, software-defined assets and integrated testing built in. First-class integrations with dbt, Fivetran and Snowflake.
99
Prefect
A modern, Pythonic data-orchestration tool with dynamic DAGs. Solves Airflow's static-DAG limitation — flows can change at runtime — and offers hybrid execution (cloud + self-hosted) plus granular retry policies. Popular for ML pipelines too.
100
Snowflake
A cloud-native managed data warehouse. Compute (warehouse) and storage are fully decoupled and scale independently. SQL-driven querying over semi-structured data (JSON, Parquet), secure data sharing and time travel (up to 90 days); a strong alternative to BigQuery and Redshift.
101
BigQuery
Google Cloud's serverless, columnar, petabyte-scale data warehouse. Priced on demand (per byte scanned) or by reserved slot capacity; SQL-driven ML model training (BQML); the native export target for GA4; built-in geo, JSON and PARTITION/CLUSTER optimisations. The core of the GCP analytics stack.
102
Databricks
A lakehouse platform founded by the creators of Apache Spark. Bundles Bronze/Silver/Gold (medallion) layers, Delta Lake, MLflow, Unity Catalog and notebook-based workspaces in one product. Designed for data-engineer + analyst + ML-engineer collaboration; native on AWS, Azure and GCP.
103
Apache Iceberg
An open table format for petabyte-scale data (originally from Netflix). Adds ACID transactions, schema evolution, time travel, hidden partitioning and branching on top of Parquet. Supported by Snowflake, Databricks, BigQuery and Trino; the standard answer to data-warehouse lock-in.
104
Delta Lake
An open table format developed by Databricks and a rival to Apache Iceberg. ACID, time travel, schema enforcement, MERGE/UPDATE/DELETE; tightest integration is with the Spark ecosystem. The default format for the Databricks side of the lakehouse architecture.
105
Parquet
A columnar storage format — each column is stored in its own blocks. Only the columns needed are read, predicate pushdown is supported and Snappy/Zstd give strong compression. The default file format for Spark, Iceberg, Delta and Snowflake; 10-100× faster analytics than row-based CSV/JSON.
106
Apache Avro
A binary serialization format with JSON-defined schemas. Strong schema evolution (forward/backward compatibility); especially popular for Kafka message payloads. Used together with a Schema Registry; the row-oriented counterpart to Parquet.
107
Schema Registry
A service that stores, versions and compatibility-checks Avro/Protobuf/JSON schemas centrally. Part of Confluent's Kafka stack; enforces the producer-consumer schema contract and catches breaking changes before they hit production.
108
Window Function (SQL)
SQL functions that compute over a set of rows ("window"). ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM/AVG OVER (PARTITION BY…). Unlike GROUP BY, rows aren't collapsed — every row gets its own result. Indispensable for time series, rankings and running totals.
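A small runnable sketch of a running total and a per-customer sequence number, using Python's built-in sqlite3 module. The table and column names are hypothetical, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 2, 5.0), ("b", 1, 7.0), ("a", 3, 2.0)],
)

# Every input row survives; each one gets its own running total and rank.
rows = conn.execute(
    """
    SELECT customer, day, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY day) AS running_total,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY day) AS order_seq
    FROM orders
    ORDER BY customer, day
    """
).fetchall()
```

A GROUP BY on `customer` would return two rows here; the window version returns all four, each annotated with its running total.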
109
ELT (Extract, Load, Transform)
The reverse of classic ETL: raw data is loaded into the warehouse/lake first, then transformed there with SQL/dbt. With cheap cloud-DWH storage and powerful compute, ELT has become the default paradigm; it puts transformation logic closer to data analysts.
110
Feature Store
A platform that centrally stores and serves the features (historical + real-time) consumed by ML models. Solves the training-serving skew problem by deriving offline (batch) and online (low-latency) feature views from a single definition. Feast, Tecton and Hopsworks are the main tools.
111
MLOps
A discipline that automates the develop-train-deploy-monitor-retrain loop of machine-learning models. DevOps applied to ML — experiment tracking (MLflow), model registry, CI/CD for models, drift detection and retraining pipelines.
112
OpenLineage
An open standard for data-lineage events (LF AI & Data). Lets Airflow, Spark, dbt, Flink and others emit lineage events in the same format. Integrated by Marquez, Datakin and Astronomer; the vendor-neutral carrier of metadata flow.
113
Great Expectations
An open-source data-quality / validation framework. Thousands of built-in checks such as "expect_column_values_to_be_unique" and "expect_column_mean_to_be_between"; embeds into Airflow/dbt pipelines and auto-generates HTML data docs.
114
Apache Atlas
An open-source metadata-management and data-governance tool from the Hadoop ecosystem. Tag-based access control, lineage graphs, business glossaries and classifications (PII/PCI). Standard in the Hortonworks/Cloudera enterprise stack; modern alternatives include Amundsen and DataHub.
115
Lambda Architecture (Data)
A data architecture that fuses real-time and batch results. The speed layer (Storm/Flink) produces low-latency approximate results while the batch layer (Spark/Hadoop) computes accurate-but-slow results; the serving layer merges them. Not to be confused with AWS Lambda; today increasingly evolving into Kappa architecture.
116
Differential Privacy
A mathematical framework that allows safe access to population statistics while protecting individual records. Calibrated noise is added to query results so an attacker cannot tell whether a single person's data is in the set. Used by Apple's iOS keyboard, Google Play and the 2020 US Census.
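The "calibrated noise" step can be sketched with the Laplace mechanism for a counting query. This is a minimal illustration under stated assumptions: the query has sensitivity 1 (one person changes the count by at most 1), and the function names and epsilon values are made up for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
noisy = private_count(100, epsilon=0.5, rng=rng)
```

A smaller epsilon means a larger noise scale and stronger privacy; averaged over many releases the noise cancels out, which is why population statistics remain usable.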
117
Federated Learning
A technique that trains the model locally on user devices and only sends gradient or weight updates back to the central server, never the raw data. Google's Gboard auto-suggest, Apple's Siri and privacy-preserving ML on healthcare data are the canonical use cases.
118
On-Chain Analytics
The discipline of extracting insights from a blockchain's public transaction data — wallet activity, token-holder concentration, exchange flow, smart-money tracking, NFT volume. Dune Analytics (SQL on-chain), Nansen (labelled addresses), Glassnode and Arkham are key platforms.
119
Oracle (Blockchain)
A bridge service that delivers trusted off-chain data — prices, weather, sports results, IoT sensors — to on-chain smart contracts. Chainlink is the most widely used; Pyth, Band and RedStone are alternatives. Vital infrastructure for DeFi liquidations, insurance and prediction markets.
120
Brand Lift Study
A study that measures how an ad campaign moves brand metrics — ad recall, brand awareness, message association, purchase intent — by comparing a control group with an exposed group. Meta, YouTube and TikTok offer it natively; the CPM typically runs $5-15.
121
Incrementality Test
A test that compares ad-driven conversions to a "had-it-not-run" baseline to measure how many conversions are truly incremental. Methods include PSA placebo ads, ghost bidding and geo holdouts; it cures classic attribution's "every conversion is mine" illusion. The gold standard for modern paid-media ROI.
122
Geo Holdout Test
A quasi-experiment that measures incremental impact by switching ads off in a specific geography (e.g. New York) while keeping them on in others. Cookie-free, identifier-free and ATT-proof; the matched-markets / synthetic-control method is the modern marketing-science standard.
123
MTA (Multi-Touch Attribution)
A model that distributes weighted credit to every touchpoint (ad, email, organic, direct) that contributed to a conversion. Methods: linear, time-decay, position-based and data-driven. Cookie deprecation and ATT have weakened MTA accuracy; pairing it with MMM and incrementality is the healthier modern stack.
124
Data-Driven Attribution (DDA)
An attribution model that uses machine learning to learn each touchpoint's marginal contribution rather than awarding all credit to last click. The default in Google Ads + GA4; Shapley-value based; gives a fair comparison of channels at the same funnel stage. It has replaced classic rule-based models.
125
View-Through Conversion (VTC)
A conversion from a user who saw an ad — without clicking — and converted later. In display and video campaigns 30-60% of conversions can be VTC; misjudged, this either overstates or understates the channel. The difference from click-only attribution is critical.
126
Attribution Window
The time frame within which a conversion is credited to an ad after a click or view. Meta's old default was 28-day click + 1-day view; after iOS 14.5 and ATT the default became 7-day click + 1-day view. As the window shrinks, channels appear to drive fewer conversions.
127
Retention Curve (S-Curve)
The expected pattern of a cohort's retention plateauing at some point. In a healthy app the curve flattens after ~90 days; in a viral or habit-forming app it stays horizontal; if it keeps falling, PMF is weak. Andrew Chen's "smiling curve" analysis is the modern reference.
128
Activation Rate
The share of newly registered users who complete the first valuable action. Slack tracks "the 40% who send a first message", Notion "the 50% who create a first page", Spotify "the 85% who play a first song". Activation is the most direct indicator of PMF + onboarding quality and correlates strongly with LTV.
129
TTV (Time-to-Value)
The time it takes a user to experience the first real value (the aha moment). Linear is 30 seconds, Figma is 5 minutes, Slack is one week. The shorter the TTV, the higher the retention; the single north star of modern onboarding.
130
Activation Metric (Aha-Moment Metric)
A data-driven threshold of the form "if the user does N actions within T time they retain". The widely quoted examples are Facebook's "7 friends in 10 days", Slack's "2,000 messages" and Twitter's "30 follows". The whole onboarding is then optimised against it; the growth team's north star.
131
pLTV (Predictive LTV)
Using machine learning on the first few events (sign-up, first purchase, day-1 session, an IAP) to predict 30/90/365-day LTV. The standard fix for iOS attribution after SKAdNetwork; AppsFlyer, Adjust and Singular have baked pLTV into their marketing-optimisation stacks.
132
Uplift Modeling
A machine-learning approach that finds in which user segments an intervention — a coupon, push or email — actually creates net extra impact. Identifies the "persuadable" segment so the rest aren't bothered for nothing. Algorithms include T-learner, X-learner and causal forest. Lifts CRM-campaign ROI 2-3×.
133
Crashlytics / Sentry Mobile
Platforms that capture mobile crashes, ANRs and JS errors, then cluster them with stack traces, device data and breadcrumbs. Firebase Crashlytics (Google, free), Sentry, Bugsnag and Embrace are the main options. Crash-Free Users target is 99.5%+; below 99% kills your App Store rating.
134
Mobile APM (Application Performance Monitoring)
A platform that measures real-device app performance: startup time, screen render, network requests, memory, battery use and ANRs. Firebase Performance, New Relic Mobile, Embrace and Datadog Mobile RUM are options. Surfaces UX problems that aren't crashes.
135
Headless BI
An analytics engine without its own visualisation layer that exposes every metric and dimension calculation via API and GraphQL. Cube, GoodData and AtScale lead; the output is consumed by Tableau, Looker, Notion, Hex, Excel or any custom React app. The modern paradigm that breaks BI-tool monogamy.
136
Metric Layer
A metric-only flavour of the semantic layer: an abstraction that holds the company's "single-truth" metric definitions in YAML or SQL. Slack's Spectacles, Airbnb's Minerva and dbt Semantic Layer are examples. If "active user" comes out at 15% in marketing and 10% in finance, this is the layer where that drift gets fixed.
137
Data Activation
The process of pushing insights from the warehouse into operational systems — CRM, ad platforms, support tools, in-app messaging. Reverse ETL is the technical pipe; the bridge between "data analytics" and "marketing automation". Census, Hightouch and Polytomic are the leading tools.
138
Composable CDP
An approach that puts the warehouse (Snowflake, BigQuery) at the centre instead of buying a single-vendor CDP (Segment, mParticle), then bolts on only the layers you need — audience, real-time activation, identity resolution. Hightouch + Census + RudderStack + Snowplow is the typical composable-CDP stack.
139
Operational Analytics
The principle that analytical insights shouldn't live in a dashboard but trigger actions inside operational systems. "This user has been inactive for 7 days" surfaces inside a Klaviyo win-back flow rather than a chart. The business side of reverse ETL — the modern flavour of "actionable analytics".
140
Looker LookML
Looker's YAML-like data-modelling DSL. Tables become "views", relationships become "explores", metrics become "measures"; it's a code-centric BI approach that generates SQL. Every analyst speaks one language, version control and Git workflows just work — the lingua franca of modern data teams.
141
Mode Analytics
A BI platform that fuses SQL, Python notebooks and dashboards in one product (acquired by ThoughtSpot in 2023). The sweet spot for data analysts: SQL for queries, Python for ML, then a shareable dashboard. The power-user counterpart to Tableau's GUI-only approach.
142
Hex (Notebook BI)
A 2020-founded analytics platform that bundles SQL, Python and no-code interactive apps in one place. Notebook UI + Magic AI + shareable app builder; the shared space for data scientists, analysts and business stakeholders. The rising star of modern hybrid BI.
143
Sigma Computing
A modern BI platform that puts a spreadsheet-style interface on top of Snowflake or BigQuery. Users do Excel-grade pivots, formulas and what-if analysis without writing SQL — but the engine remains warehouse-native. A strong Looker rival inside finance and ops teams.
144
Streamlit
A Python-based open-source framework that lets you ship an interactive web app in 100 lines of script (acquired by Snowflake in 2022). The default way for data scientists to ship internal tools, prototypes and ML demos; Plotly Dash and Gradio are close competitors.
145
Snowflake Streams & Tasks
Snowflake's pairing of change-data-capture (Streams) and scheduled SQL execution (Tasks). A Stream queues inserts, updates and deletes from a table by offset; a Task processes them on a cadence. ELT pipelines pick up Snowflake-native automation without needing Airflow.
146
dbt Tests
Data-quality assertions written against dbt models: not_null, unique, accepted_values, relationships and custom SQL. Run in CI; they validate each model's output once it is built, and under dbt build a failing test stops the downstream models from running. The test suite can be extended with dbt-utils and Great Expectations integrations.
147
dbt Snapshots
A dbt-native implementation of Slowly Changing Dimension Type 2. For a mutable source table (e.g. orders.status changes), each snapshot run preserves history via dbt_valid_from/to columns. The foundation for audit history and "what did this look like on that date" queries.
148
Materialization Strategy (Table / View / Incremental / Ephemeral)
How a dbt model is stored in the warehouse. View: cheap but recomputes on every query — fits small data. Table: full rebuild — fits small to medium. Incremental: appends only new rows — fits big data. Ephemeral: inlined as a CTE with no persistent output.
149
SCD (Slowly Changing Dimension)
A pattern for storing the history of slowly changing dimensions like customer, product or employee. Type 1: keep only the latest value; Type 2: insert a new row on every change with valid_from/to (history preserved); Type 3: a single previous-value column. With a modern warehouse + dbt Snapshots, SCD2 is the default.
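The Type 2 behaviour can be sketched in plain Python: when a tracked value changes, the current row is closed and a new one is opened. The row layout (`valid_from`, `valid_to`) and the helper name are assumptions for illustration; dbt Snapshots manage the equivalent columns for you.

```python
def apply_scd2(history, key, new_value, as_of):
    """Close the open row for `key` if its value changed, then append a new row."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["value"] == new_value:
            return history  # unchanged: nothing to record
        current["valid_to"] = as_of  # close the old version
    history.append(
        {"key": key, "value": new_value, "valid_from": as_of, "valid_to": None}
    )
    return history


history = []
apply_scd2(history, "cust-1", "Berlin", "2024-01-01")
apply_scd2(history, "cust-1", "Berlin", "2024-02-01")  # same value, no new row
apply_scd2(history, "cust-1", "Munich", "2024-03-01")  # change, new version
```

The row with `valid_to` set to `None` is always the current version; earlier rows answer "what did this look like on that date" questions.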
150
Idempotent Pipeline
An ETL/ELT pipeline that, run on the same input, produces the same output and has no extra side-effects when re-run. The guarantee that backfills, retries and late-arriving data don't corrupt the dataset. Achieved with MERGE, primary-key deduplication and transactions.
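A minimal sketch of an idempotent load, using SQLite's `INSERT OR REPLACE` as a stand-in for a warehouse MERGE. The table name and the batch are hypothetical; the point is that the primary key plus a single transaction makes a re-run harmless.

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert a batch keyed on `day`; re-running it leaves the table unchanged."""
    with conn:  # one transaction: the batch lands completely or not at all
        conn.executemany(
            "INSERT OR REPLACE INTO daily_revenue (day, revenue) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL)")

batch = [("2024-05-01", 120.0), ("2024-05-02", 80.0)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry or backfill of the same partition

result = conn.execute(
    "SELECT day, revenue FROM daily_revenue ORDER BY day"
).fetchall()
```

A plain `INSERT` in the same position would have doubled the rows on the second run, which is exactly the corruption the pattern guards against.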
151
Backfill Strategy
A plan for re-running a pipeline against historical data. Date ranges are parameterised, partitions get recomputed in batches and idempotent pipelines + atomic writes + concurrency control are mandatory. A wrong backfill is a production data loss — always rehearse in staging first.
152
dbt Layers (Staging / Intermediate / Marts)
The recommended 3-layer modelling pattern for a dbt project. Staging: a 1:1 cleansed table per source (rename, cast, dedup). Intermediate: the building blocks of business logic. Marts: the business-ready dim/fact final layer. Earns consistency, reuse and a clean DAG.
153
Source Freshness
A dbt feature that monitors how long ago each source table was last updated. The "dbt source freshness" command fires warning and error thresholds — e.g. 12 h warn, 24 h error — and catches stale data even when the pipeline didn't break. The operational watchdog.
154
OBT (One Big Table)
A modelling alternative to star schema — denormalise every dimension into the fact table to produce a single wide table of 50-200+ columns. In columnar warehouses like Snowflake or BigQuery joins are expensive; OBT is faster for analysts and often optimal for performance.
155
Cube.js
An open-source headless-BI engine. It generates SQL, caches it, exposes REST/GraphQL APIs and sits on top of Snowflake, BigQuery or Postgres. Lets a front-end developer ship their own dashboards; the developer-friendly alternative to Tableau / Looker.
156
Snowpark
Snowflake's DataFrame API for Python, Scala and Java. Lets you run ML training, complex transforms, UDFs and stored procedures without moving data out of the warehouse. Modin and pandas-on-Snowflake give data scientists a familiar local feel; the modern move that takes data movement to zero.
157
Polars
A multi-threaded, columnar (Arrow-based) DataFrame library written in Rust. 5-30× faster than pandas with lazy evaluation and built-in query optimisation. A modern analyst's pandas replacement; ships with Python, R, JS and Rust bindings.
158
DuckDB
An in-process columnar OLAP database — the analytics counterpart of SQLite, with MotherDuck as the cloud extension. Single file, single process; queries pandas DataFrames or Parquet directly with SQL. Crunches a billion rows on a laptop in 30 seconds; the modern analyst's daily companion.
159
LLM Eval Harness
A test framework that automatically measures an LLM's performance across many tasks. Examples: HELM, lm-eval-harness, BigBench, HELM Lite — they run standard benchmarks like MMLU, HumanEval, GSM8K and ARC. Mandatory infrastructure for any new model launch or regression test.
160
Prompt Eval
A test set that systematically measures the quality of a specific prompt. 50-500 input × expected-output pairs scored automatically (LLM-as-judge, BLEU, ROUGE, exact match). Mandatory to catch regressions when production prompts change; PromptLayer, Langfuse and Braintrust are common tools.
161
Golden Dataset
A manually verified test set used as ground truth. Eval harness inputs and expected outputs live here; after every LLM update, the model is scored against it. A typical golden set has 200-2000 examples vetted by a domain expert.
162
Faithfulness (RAG)
A measure of how faithful a RAG system's answer is to the retrieved context. If the LLM hallucinates outside the context, faithfulness drops; an LLM-as-judge checks each sentence for "is there support in the context?". A core metric in the RAGAS and TruLens frameworks.
163
Answer Relevance (RAG)
A score for how relevant an LLM's answer is to the user's query. Catches answers that are correct but unrelated, such as "Paris is the capital of France" in reply to a question about the weather. Measured with cosine similarity (answer embedding ↔ query embedding) or by an LLM-as-judge.
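The cosine-similarity variant can be sketched as below. The three-dimensional vectors are toy stand-ins for real embeddings, which would come from an embedding model not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]       # hypothetical query embedding
on_topic_vec = [2.0, 0.0, 2.0]    # same direction as the query
off_topic_vec = [0.0, 3.0, 0.0]   # orthogonal to the query

on_topic = cosine_similarity(query_vec, on_topic_vec)
off_topic = cosine_similarity(query_vec, off_topic_vec)
```

An answer pointing in the same direction as the query scores close to 1, while an unrelated answer scores close to 0 regardless of how factually correct it is.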
164
Context Precision / Recall (RAG)
The two metrics for retrieval quality in RAG. Precision: how many of the retrieved chunks were actually relevant. Recall: how many of the truly relevant chunks were retrieved. Low precision = noise, low recall = missing information. Measured automatically in RAGAS, ARES and others.
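The two definitions translate directly into set arithmetic. This sketch uses the plain set-based form described here, with hypothetical chunk IDs; frameworks such as RAGAS may compute rank-aware or LLM-judged variants instead.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four chunks retrieved, three chunks truly relevant, two in common.
precision, recall = context_precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c2", "c4", "c9"],
)
```

Here half of what was retrieved is noise (precision 0.5) and one relevant chunk was missed (recall 2/3).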
165
Model Routing
A smart layer that sends a question to a different LLM based on difficulty, latency or cost budget. Simple questions go to Haiku / 3.5-mini, hard ones to Opus / 4.5. OpenRouter, Portkey and Martian sell routing-as-a-service; lowers average cost 5-20×.
166
Cascading Models
A pipeline where a small/cheap model tries first; if confidence is below threshold or output fails validation, the request is escalated to a bigger/more expensive model. The fail-over variant of model routing; in real LLM apps, 80% of traffic gets resolved at 20% of the cost without quality loss.
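The escalation logic can be sketched as follows. The two model functions are toy stand-ins that return an answer and a confidence score; in a real system they would wrap API calls, and the 0.8 threshold is an arbitrary assumption.

```python
def cascade(prompt, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate when its confidence is too low."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(prompt)
    return answer, "large"


def toy_small(prompt):
    # Hypothetical cheap model: confident only on one easy question.
    return ("4", 0.95) if prompt == "2+2?" else ("not sure", 0.3)


def toy_large(prompt):
    # Hypothetical expensive model: always answers.
    return ("detailed answer", 0.99)


easy = cascade("2+2?", toy_small, toy_large)
hard = cascade("Explain our MMM results", toy_small, toy_large)
```

The confidence check could equally be an output-validation step, such as a schema check, with the same escalation path.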
167
RAG Reranker
A second stage that re-orders the top-50 chunks coming out of vector retrieval using an LLM-as-judge or a cross-encoder. Cohere Rerank, BGE-Reranker and Jina Reranker are common; lifts precision 20-40% and improves the retrieval-faithfulness metric.
168
Chunk Strategy
How a document is split for RAG. Options: fixed-size (e.g. 512 tokens), recursive character (paragraph and sentence boundaries), semantic chunking (embedding-based segmentation) and markdown-aware. Bad chunking equals low retrieval precision; chunk size and overlap directly drive RAG quality.
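The fixed-size option with overlap can be sketched in a few lines. Integers stand in for tokens here, and the size and overlap values are illustrative rather than recommended settings.

```python
def chunk_fixed(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(10))  # stand-in for a tokenised document
chunks = chunk_fixed(tokens, size=4, overlap=1)
```

The shared token at each boundary keeps a sentence that straddles two chunks retrievable from either side, at the cost of some duplicated storage.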
169
Embedding Drift
When the embeddings of real user queries in production drift over time from the embedding distribution of the RAG corpus. New slang, products and terms widen the drift, dragging retrieval recall down. The fix is quarterly embedding regeneration plus a new-data-aware reindex.
170
HNSW Index (Hierarchical Navigable Small World)
The ANN (Approximate Nearest Neighbor) index algorithm used by most vector databases. A multi-layer graph that delivers millisecond latency over billions of embeddings. The default in Pinecone, Weaviate, Qdrant, Milvus and pgvector.
171
ANN (Approximate Nearest Neighbor)
A class of algorithms that finds "good-enough" nearest vectors rather than the exact match, trading accuracy for speed and memory. Examples: HNSW, IVF, PQ and ScaNN; with 95% recall, latency drops up to 1000×. The engine of vector search.
172
Model Card
A standard card (introduced by Google in 2019) documenting an AI model's purpose, training data, performance, limitations, ethical concerns and fair-use scenarios. Now mandatory at any foundation-model launch; the foundation of transparent AI development.
173
AI Observability
A platform that monitors production LLM applications across traces, cost, latency and quality metrics. Tools include Langfuse, LangSmith, Helicone, Arize Phoenix and WhyLabs; every LLM call (prompt, response, tokens, cost, eval score) is logged. The LLM-native successor to classic APM.
174
Matchmaking (ELO / MMR)
The algorithm that pairs players by skill in PvP games. Variants include ELO (chess heritage), Glicko, TrueSkill and MMR (Match-Making Rating). It balances protecting new players from smurfs against relaxing skill limits when queues run long; the heart of League of Legends, Valorant and Dota 2.
175
ARPDAU (Average Revenue Per Daily Active User)
Average revenue per daily active user. Casual mobile games sit at $0.05-0.20, mid-core at $0.20-0.80, hardcore RPGs at $1+. The north-star metric of live-ops decisions; paired with pLTV, it grounds the paid-acquisition budget.
176
Whales / Dolphins / Minnows
Spending segments in F2P games. Whales: top 1% spending $1,000+; Dolphins: 5-10% spending $50-1,000; Minnows: 15-30% spending $1-50; Free-riders: 60-80% who never pay. A Pareto distribution where whales drive 70%+ of revenue — losing them is fatal.
177
Scope 1 / Scope 2 / Scope 3 Emissions
The GHG Protocol's three-bucket classification of carbon emissions. Scope 1: direct emissions (factory boilers, company vehicles). Scope 2: purchased electricity, heat or cooling. Scope 3: supply chain plus product lifetime — typically the biggest slice at 75-85%. The skeleton of ESG reporting.
178
Carbon Footprint
The total greenhouse-gas emissions caused by a person, product, company or event over its lifetime (in CO₂-equivalent). Manufacturing an iPhone is ~70 kg CO₂e; a transatlantic flight is ~1.6 t. In ESG reporting it equals Scope 1 + 2 + 3.
179
Carbon Offset
A project investment that compensates for emissions — afforestation, renewable energy, methane capture, direct air capture. The voluntary carbon market sat at ~$2 B in 2024 but is heavily criticised for greenwashing; Verra, Gold Standard and ICVCM are the quality stamps. A controversial tool on the path to Net Zero.
180
CDP (Carbon Disclosure Project)
A global platform where companies disclose climate, water and forest data in a standard form. 24,000 companies and 1,100 cities reported in 2024; A-D scoring drives pressure from institutional investors and customers. Apple, Microsoft and Unilever lead; supply-chain disclosure mandates are spreading fast.
181
ESG Reporting (Environmental, Social, Governance)
Reporting a company's environmental, social and governance performance in a standard format. CSRD (EU), the SEC Climate Rule (US) and TCFD recommendations form the global umbrella; SASB, GRI and CDP are the working frameworks. From 2024, 50,000+ EU companies are mandated under CSRD.
182
CSRD (Corporate Sustainability Reporting Directive)
EU directive in force from 2024 that mandates sustainability reporting for 50,000+ large companies — banks, insurers, firms with 250+ employees and €40 M+ turnover. Built on ESRS, with double materiality (the company's impact on the environment plus the environment's impact on the company) and third-party assurance.
183
Net Zero
A company- or country-level goal to cut emissions to a minimum and balance the residual through offsets or removals. Validated by Science Based Targets (SBTi); the global target is 2050. Differs from carbon-neutral: Net Zero is stricter — it removes the residual rather than just compensating it.
184
Carbon Neutral vs Net Zero
Carbon-neutral zeroes emissions out via offsets without requiring real reductions; Net Zero first cuts emissions aggressively and then zeroes the rest via removals (not just offsets). Microsoft targets 2030 Carbon Negative, Apple 2030 Net Zero and Google 2030 24/7 carbon-free energy.
185
PUE (Power Usage Effectiveness)
A data-centre electricity-efficiency metric — total facility power divided by IT equipment power. Ideal value is 1.0; 2.0 means one extra unit of cooling or lighting for every unit of IT. Hyperscalers (Google, AWS, Azure) average 1.10-1.15, while on-prem enterprise data centres run 1.5-2.0. A key sustainability KPI.
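The ratio is simple enough to check by hand. The sketch below uses hypothetical kWh figures to reproduce the 2.0 case described above, alongside a leaner 1.1 facility.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


legacy = pue(total_facility_kwh=2000.0, it_equipment_kwh=1000.0)
efficient = pue(total_facility_kwh=1100.0, it_equipment_kwh=1000.0)
overhead_per_it_unit = legacy - 1.0  # cooling, lighting and power losses
```

Subtracting 1.0 from the PUE gives the overhead energy spent per unit of IT energy, which is the figure worth tracking over time.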
186
Green Software Foundation
A Linux Foundation project, founded by Microsoft, Accenture, GitHub and ThoughtWorks, that standardises sustainable software development. Maintains the SCI (Software Carbon Intensity) standard, the Green Software Practitioner certification and a Green Software Patterns catalogue. The sustainability guide for any modern dev team.
187
SCI (Software Carbon Intensity)
The ISO/IEC 21031 standard that measures CO₂-equivalent emissions per functional unit of software. Formula: (energy × the region's carbon intensity + embodied emissions) ÷ functional units. The standard answer to "how much carbon does this API call cost?" and the foundation of modern green-software metrics.
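A short sketch of the SCI calculation, written as operational emissions (energy times grid carbon intensity) plus embodied emissions, divided by the number of functional units. The function name, the units (kWh, gCO₂e) and the sample workload are assumptions chosen for illustration.

```python
def sci(energy_kwh: float,
        carbon_intensity_g_per_kwh: float,
        embodied_g: float,
        functional_units: float) -> float:
    """Software Carbon Intensity in gCO2e per functional unit.

    SCI = ((E * I) + M) / R, where
      E = energy consumed by the software (kWh)
      I = carbon intensity of the electricity used (gCO2e per kWh)
      M = embodied hardware emissions allocated to the software (gCO2e)
      R = number of functional units (API calls, users, jobs, ...)
    """
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * carbon_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


# Hypothetical service: 12 kWh on a 400 g/kWh grid, 1,200 g of embodied
# emissions allocated to the period, serving one million API calls.
per_call = sci(12, 400, 1_200, 1_000_000)
print(f"{per_call:.4f} gCO2e per API call")
```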
188
Renewable Energy Credit (REC)
A tradable certificate that represents 1 MWh of renewable energy. Instead of installing rooftop solar, companies can buy RECs and report their electricity as renewable; Green-e in the US, GO (Guarantees of Origin) in Europe. The main vehicle behind RE100 commitments.
189
PPA (Power Purchase Agreement)
A direct, long-term (10-25 year) fixed-price contract to buy renewable electricity straight from the producer. The backbone of carbon-free-energy strategies at hyperscalers like Google, Amazon and Microsoft; corporate PPA volume in 2024 is estimated above 50 GW worldwide.
190
LCA (Life Cycle Assessment)
The ISO 14040 methodology that quantifies the full environmental impact of a product across raw materials → production → use → end-of-life. Scope can be cradle-to-grave or cradle-to-cradle. Apple's "iPhone has a 70 kg carbon footprint" figure is an LCA output.
191
Circular Economy
An economic model that replaces the linear "make-use-throw away" path by designing products to be reusable, repairable and recyclable from day one. Pioneered by the Ellen MacArthur Foundation; IKEA buyback, Patagonia Worn Wear and Apple Self-Service Repair are concrete examples.
192
Greenwashing
When a company appears greener through marketing than its real emissions performance warrants. The CMA (UK), FTC (US) and EU CSRD now regulate greenwashing legally; Shell, BP and Volkswagen have paid multi-million-dollar fines over the years. The ethical red line of sustainability communication.
193
Carbon Border Adjustment Mechanism (CBAM)
The EU's "carbon import tax", in full force from 2026. Importers of steel, cement, aluminium, fertiliser, hydrogen and electricity into the EU pay what those goods would have paid under the EU ETS if produced inside the EU. The first major tariff that reshapes supply chains by emissions intensity.
194
EPR (Extended Producer Responsibility)
A regulation that holds the producer responsible for the end-of-life waste and recycling costs of its products. Examples include the EU Packaging Directive, France's AGEC law, Germany's VerpackG and Türkiye's Zero Waste regulation. A producer of plastic bottles, garments or electronics pays an environmental fee on every unit sold.
195
Sustainable Procurement
Embedding environmental and social criteria into a company's purchasing decisions. Typical instruments are a supplier Code of Conduct, an EcoVadis sustainability rating, recycled-material requirements and fair-trade certification. Most Scope 3 emissions originate here, which makes it the operational heart of modern CSRD reporting.
196
TCFD (Task Force on Climate-related Financial Disclosures)
A framework released by the G20 Financial Stability Board in 2017 that integrates climate risks and opportunities into financial reporting. Four pillars: Governance, Strategy, Risk Management and Metrics & Targets. The UK PRA, New Zealand and Japan have made it mandatory. The climate leg of ESG reporting.
197
SBTi (Science Based Targets initiative)
An independent body that validates whether a company's emissions-reduction targets are aligned with the Paris Agreement's 1.5°C / well-below-2°C science-based pathway. 5,000+ companies have been validated — Microsoft, IKEA, Unilever, Nike and Maersk among them. The mandatory stamp behind any credible Net-Zero claim.
198
EV Charging Network (Tesla Supercharger / Ionity / Electrify America)
Infrastructure for fast-charging electric vehicles. Tesla's Supercharger network has 50,000+ charging stalls worldwide and uses the NACS standard; Ionity (a joint venture of BMW, Ford, Hyundai, Mercedes-Benz and the Volkswagen Group) covers Europe; Electrify America covers the US. From 2024 Tesla opened NACS to other EV brands, accelerating standard consolidation.
199
North Star Framework
A framework popularised by Sean Ellis and Amplitude that defines the single "value-for-customer" metric of a company. Spotify's is "time spent listening", Airbnb's "nights booked", Slack's "messages sent in active workspaces". The compass of every growth and product decision.
200
Driver Tree
An analysis that fans a target metric (e.g. revenue) into the drivers behind it. A close cousin of the KPI tree but more causal — answers structurally "to lift ARR, do we push new logos or expansion?". A classic problem-solving tool at McKinsey and Bain.
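The "new logos or expansion?" question can be sketched as a small driver tree in Python: ARR is decomposed into its drivers, then each driver is nudged by 10% to see which one moves the target most. The decomposition, the driver names and the baseline numbers are all assumptions for illustration; a real tree would follow the company's own metric definitions.

```python
def arr(drivers: dict) -> float:
    """Ending ARR = existing ARR + new-logo ARR + expansion ARR - churned ARR."""
    new_logo = drivers["new_customers"] * drivers["avg_new_contract"]
    expansion = drivers["existing_arr"] * drivers["expansion_rate"]
    churned = drivers["existing_arr"] * drivers["churn_rate"]
    return drivers["existing_arr"] + new_logo + expansion - churned


def sensitivity(drivers: dict, lift: float = 0.10) -> dict:
    """Change in ARR when each driver (except the starting base) rises by `lift`."""
    base = arr(drivers)
    impact = {}
    for name in drivers:
        if name == "existing_arr":
            continue  # the starting base is an input, not a lever
        bumped = dict(drivers)
        bumped[name] = drivers[name] * (1 + lift)
        impact[name] = round(arr(bumped) - base, 2)
    return impact


# Hypothetical baseline.
baseline = {
    "existing_arr": 1_000_000,
    "new_customers": 20,
    "avg_new_contract": 10_000,
    "expansion_rate": 0.15,
    "churn_rate": 0.08,
}
print(arr(baseline))
print(sensitivity(baseline))
```

With these assumed numbers a 10% lift in new logos is worth more than a 10% lift in expansion; the churn entry comes out negative because a higher churn rate lowers ARR.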
201
Executive Dashboard
A one-page dashboard built for the C-suite and board, featuring 7-12 top metrics. Carries business-decision-grade KPIs — MRR, NRR, CAC, magic number, runway, rule of 40 — and is reviewed weekly. Classic shapes ship in Tableau Executive, Looker C-suite and Mode Reports.
202
Operational Dashboard
A dashboard built for hour-by-hour or day-by-day operational decisions — marketing's CPM trend, support's ticket queue, ops' order backlog. Real-time or near-real-time refresh; alerting and pivot drill-downs are mandatory. Common in Looker Studio, Power BI and Grafana.
203
Drill-Down
A click-through analysis flow from an aggregated metric down to detail — "total revenue" → "by region" → "by product" → "by SKU" → "by transaction". The signature self-service-analytics behaviour of OLAP cubes and modern BI tools like Power BI, Tableau and Looker.
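A rough sketch of what a BI tool does on each drill-down click: keep the filters chosen so far, then re-aggregate by the next dimension. The rows, field names and the `drill_down` helper are hypothetical; a real tool would push this down to the warehouse as SQL.

```python
from collections import defaultdict

# Hypothetical transaction rows.
TRANSACTIONS = [
    {"region": "EMEA", "product": "Pro", "sku": "PRO-M", "revenue": 120},
    {"region": "EMEA", "product": "Pro", "sku": "PRO-Y", "revenue": 300},
    {"region": "EMEA", "product": "Basic", "sku": "BAS-M", "revenue": 80},
    {"region": "APAC", "product": "Pro", "sku": "PRO-M", "revenue": 200},
]


def drill_down(rows, selected, next_dimension):
    """Total revenue by `next_dimension` for rows matching the levels already selected."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[dim] == value for dim, value in selected.items()):
            totals[row[next_dimension]] += row["revenue"]
    return dict(totals)


# Each call corresponds to one click deeper into the hierarchy.
print(drill_down(TRANSACTIONS, {}, "region"))
print(drill_down(TRANSACTIONS, {"region": "EMEA"}, "product"))
print(drill_down(TRANSACTIONS, {"region": "EMEA", "product": "Pro"}, "sku"))
```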
204
Slice & Dice
The act of cutting and inspecting multidimensional data along different dimensions. "Slice" fixes one dimension and analyses the rest; "Dice" filters two or more dimensions together to build a subset. The fundamental behaviour of a pivot table, borrowed from OLAP-cube terminology.
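The difference between the two operations can be shown on a tiny in-memory "cube". This is a sketch under assumed data; the row layout and the helper names `slice_cube` and `dice_cube` are invented for illustration.

```python
# Hypothetical cube: one row per (region, product, quarter) cell.
CUBE = [
    {"region": "EMEA", "product": "Pro", "quarter": "Q1", "revenue": 100},
    {"region": "EMEA", "product": "Basic", "quarter": "Q1", "revenue": 40},
    {"region": "EMEA", "product": "Pro", "quarter": "Q2", "revenue": 130},
    {"region": "APAC", "product": "Pro", "quarter": "Q1", "revenue": 70},
    {"region": "APAC", "product": "Basic", "quarter": "Q2", "revenue": 30},
]


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value and drop it from the result."""
    return [
        {key: val for key, val in row.items() if key != dimension}
        for row in rows
        if row[dimension] == value
    ]


def dice_cube(rows, **allowed):
    """Dice: keep cells whose values fall inside the allowed set on every listed dimension."""
    return [
        row for row in rows
        if all(row[dim] in values for dim, values in allowed.items())
    ]


q1_only = slice_cube(CUBE, "quarter", "Q1")
emea_pro = dice_cube(CUBE, region={"EMEA"}, product={"Pro"})
print(q1_only)
print(emea_pro)
```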
205
Pivot Table
A feature Excel popularised in 1993 (the idea first shipped in Lotus Improv) that lets you drag-and-drop multidimensional data into rows, columns, values and filters. The ancestor of modern BI; Tableau, Power BI, Looker and Hex all carry the pivot-table mental model into their UX. The lingua franca of data analysis.
206
Funnel Visualization
Showing a conversion flow as a step-by-step narrowing funnel chart — Awareness → Consideration → Purchase → Retention — to spot drop-offs at each step. Mixpanel, Amplitude, Heap and GA4 ship native funnel reports; the core visual for CRO, product and marketing teams.
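Behind the chart sits simple arithmetic: each step's user count is divided by the previous step (step conversion) and by the top of the funnel (overall conversion). The sketch below assumes made-up step counts and a hypothetical `funnel_report` helper.

```python
def funnel_report(steps):
    """steps: ordered list of (name, users). Returns per-step conversion metrics."""
    top = steps[0][1]
    previous = top
    report = []
    for name, users in steps:
        report.append({
            "step": name,
            "users": users,
            "step_conversion": users / previous if previous else 0.0,
            "overall_conversion": users / top if top else 0.0,
        })
        previous = users
    return report


# Hypothetical weekly counts.
steps = [
    ("Awareness", 10_000),
    ("Consideration", 2_500),
    ("Purchase", 500),
    ("Retention", 200),
]
report = funnel_report(steps)
for row in report:
    print(f'{row["step"]:<14}{row["users"]:>7}'
          f'{row["step_conversion"]:>8.0%}{row["overall_conversion"]:>8.0%}')

# The weakest transition is the step with the lowest step conversion
# (the first step is skipped because it has no predecessor).
worst = min(report[1:], key=lambda row: row["step_conversion"])
print("biggest drop-off before:", worst["step"])
```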
207
Cohort Heatmap
A matrix that visualises cohort retention (week 0 → week N) through colour intensity. The Y-axis is signup week, the X-axis post-signup week and the colour shows retention. Reveals PMF, onboarding quality and the impact of recent product changes at a glance.
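The matrix behind such a heatmap can be sketched in a few lines: group users by signup week, then count how many are active at each week offset. The event tuples, the integer week indexes and the text "shading" are illustrative assumptions; a BI tool would colour cells instead.

```python
from collections import defaultdict


def retention_matrix(activity):
    """activity: iterable of (user_id, signup_week, active_week), weeks as integers.

    Returns {signup_week: [retention at week 0, week 1, ...]}.
    Offsets a cohort has not reached yet show up as 0.0 in this simple version.
    """
    cohort_users = defaultdict(set)
    active = defaultdict(set)  # (signup_week, weeks_since_signup) -> active users
    for user, signup, week in activity:
        cohort_users[signup].add(user)
        active[(signup, week - signup)].add(user)
    max_offset = max(offset for _, offset in active)
    return {
        signup: [
            len(active.get((signup, offset), ())) / len(users)
            for offset in range(max_offset + 1)
        ]
        for signup, users in sorted(cohort_users.items())
    }


def shade(rate):
    """Map a retention rate in [0, 1] to one of five text shades."""
    return " ░▒▓█"[min(int(rate * 4), 4)]


# Hypothetical activity log.
events = [
    ("a", 0, 0), ("b", 0, 0), ("c", 0, 0), ("d", 0, 0),
    ("a", 0, 1), ("b", 0, 1), ("a", 0, 2),
    ("e", 1, 1), ("f", 1, 1), ("e", 1, 2),
]
matrix = retention_matrix(events)
for signup_week, row in matrix.items():
    print(f"week {signup_week}: " + "".join(shade(rate) for rate in row), row)
```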
208
Sankey Diagram
A visualisation that shows flows — user journeys, energy flow, conversion paths — as proportionally thick ribbons. Ideal for Google Analytics behaviour flow, churn analysis and attribution journeys. Built with d3.js, Plotly or Power BI Sankey custom visuals.
209
Bullet Chart
A minimal chart designed by Stephen Few that shows a KPI target, actual performance and tier bands on a single horizontal row. Far more readable than gauges or speedometers. A classic on executive dashboards; Tableau and Power BI offer custom-visual support.
210
Data Storytelling
A "tell a story, then back it up with data" approach instead of dropping numbers and charts on the audience. Cole Nussbaumer Knaflic's "Storytelling with Data" is the manifesto; closes the "so what?" gap with decision-makers. Implemented through Tableau Story, Power BI bookmarks and Notion narratives.
211
Self-Service Analytics
A model that lets business users build their own queries and dashboards without depending on an analyst. Looker LookML, Tableau Ask Data, Power BI Q&A and ThoughtSpot's search-driven UX lead; a semantic layer plus data governance plus training is required. The "democratisation" goal of modern BI.
212
Power BI
Microsoft's BI platform — deeply integrated with Excel and the most-used enterprise BI tool. Power Query handles ETL, DAX is the formula language and Power BI Service adds cloud collaboration. Microsoft Fabric strengthens its data-engineering and AI Copilot integration.
213
Tableau
The "visual gold standard" of BI — the most powerful drag-and-drop tool for striking charts. Spun out of Stanford in 2003 and bought by Salesforce in 2019 for $15.7 B. The Tableau Desktop + Server + Cloud trio is still more flexible and more artistic than Power BI.
214
ThoughtSpot
The pioneer of search-driven BI — users type a natural-language query like "show me revenue by region last quarter" and the platform builds the SQL and chart. SpotIQ provides ML-powered auto-insights, putting it at the front of AI-augmented BI. Acquired Mode Analytics for $200 M in 2023.
215
Microsoft Fabric
Microsoft's 2023-launched analytics platform that bundles Power BI, Synapse, Data Factory, Real-Time Analytics and Copilot into one SaaS. OneLake aims to be a "lakehouse for the masses" and is a direct rival to Snowflake and Databricks.
216
Real-Time Dashboard
A dashboard that refreshes within seconds, showing "what is happening right now". Built on WebSockets + streaming SQL + push notifications. Used in trading platforms, gaming live ops, real-time support queues and IoT device monitoring. Common stacks: Grafana, Tinybird, Materialize, and ClickHouse + Apache Pinot.
217
Embedded Analytics
Showing BI dashboards directly inside a SaaS application. Sigma, Mode, Looker Embedded and Cube + a custom React frontend lead the space. The infrastructure of any product that has to surface customer-specific data (Shopify analytics, Stripe Sigma, HubSpot reports); a modern PLG feature.
218
Slowly Refreshed Dashboard (Daily / Weekly)
A dashboard that doesn't need real-time and refreshes after a daily or weekly batch ETL — marketing weekly review, finance month-end close, retention cohort reports. The right pick for compute savings and analytical simplicity; the classic answer to the "premature real-time" anti-pattern.
219
Anomaly Alerting
An alert that fires when a metric statistically deviates from its seasonal pattern and trend. Prophet, Datadog Watchdog, Anodot, Monte Carlo and Sigma Anomaly Detection swap manual thresholds for ML-driven dynamic alerts. The central capability of modern data observability.
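To make "deviates from its seasonal pattern" concrete, here is a deliberately simple stand-in for the ML detectors named above: today's value is compared only with earlier values at the same point in the weekly cycle, using a robust z-score (median and median absolute deviation). The function name, the 3.5 threshold and the sample series are assumptions; this is not how any of the listed products work internally.

```python
from statistics import median


def is_anomaly(history, value, season=7, threshold=3.5, min_peers=3):
    """history: past daily values, oldest first; value: today's observation.

    Peers are the observations exactly 1, 2, 3, ... seasons before today.
    """
    peers = history[::-1][season - 1::season]
    if len(peers) < min_peers:
        raise ValueError("not enough seasonal history")
    centre = median(peers)
    mad = median([abs(peer - centre) for peer in peers])
    if mad == 0:
        return value != centre
    robust_z = 0.6745 * (value - centre) / mad
    return abs(robust_z) > threshold


# Hypothetical metric: busy weekdays, quiet weekends, four weeks of data.
week_shape = [100, 102, 98, 101, 99, 40, 42]
history = [base + drift for drift in (0, 1, -1, 2) for base in week_shape]

print(is_anomaly(history, 101))      # a Monday that looks like other Mondays
print(is_anomaly(history, 40))       # a Monday that looks like a Saturday
print(is_anomaly(history[:26], 40))  # the same value on an actual Saturday
```

A fixed threshold of "below 50" would page someone every weekend; comparing like with like is the part that dynamic alerting adds.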
220
Forecasting (Prophet / SARIMA / LSTM)
Predicting future values from historical data. Tools include Prophet (Meta, business-friendly with built-in seasonality), SARIMA (classical statistics), LSTM and Transformer models (deep learning) and the Darts library. The core ML domain for sales forecasting, demand planning and capacity planning.
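Before reaching for any of the models above, it is common to set a baseline they have to beat. The sketch below uses a seasonal-naive forecast (repeat the last full cycle) and MAPE as the error measure; it is a swapped-in baseline technique, not Prophet, SARIMA or an LSTM, and the sample numbers are invented.

```python
def seasonal_naive(history, horizon, season=7):
    """Forecast by repeating the most recent full seasonal cycle."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_cycle = history[-season:]
    return [last_cycle[step % season] for step in range(horizon)]


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily sales for two weeks, then three days of actuals to score against.
history = [100, 110, 105, 120, 130, 60, 55,
           104, 112, 108, 125, 133, 62, 58]
forecast = seasonal_naive(history, horizon=3)
actual = [106, 115, 104]
print(forecast, f"MAPE={mape(actual, forecast):.1%}")
```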
221
Data Catalog (Atlan / Alation / Collibra)
A platform that makes every data asset — tables, dashboards, ML models, metrics — discoverable and documented for the company. Lineage, tags, a business glossary, data quality and ownership land in a single interface. The "Wikipedia" of a modern data team.
222
AI-Powered BI (Copilot / Sigma AI / Tableau Pulse)
A next-generation BI feature set: natural-language queries, automated insights and chart-narrative explanations. Power BI Copilot, Tableau Pulse + Tableau GPT, Sigma AI and ThoughtSpot Sage all answer "why did revenue drop last week?" with automated root-cause analysis — and reshape the analyst role.
223
Edge AI
Running AI models on the device — phone, camera, drone, IoT sensor — instead of in the cloud. Yields low latency, preserved privacy and offline operation; needs a quantised model plus an NPU and a runtime. Powers self-driving cars, AR/VR and smart cameras.
224
TinyML
ML models small enough to fit on MCUs with kilobytes of RAM. Tooling: TensorFlow Lite Micro, Edge Impulse and the Arduino Nano 33 BLE Sense; covers keyword spotting, motion detection and anomaly detection. Brings AI to battery-powered IoT devices that run for years.
225
Digital Twin
A virtual replica of a physical object — jet engine, factory, city, human body — kept in sync with real-time sensor data. Combines simulation, monitoring and predictive maintenance. Siemens, NVIDIA Omniverse, Microsoft Azure Digital Twins and Bentley iTwin lead the platforms.
226
People Analytics
A discipline that applies ML and statistics to employee data. Covers attrition prediction, hiring quality, manager effectiveness, DEI gap analysis and sentiment trends. Visier, ChartHop, Lattice, Culture Amp and Workday Adaptive Planning lead; the data-driven leg of HR.
227
eNPS (Employee Net Promoter Score)
The NPS-style score for "would you recommend this company as a place to work?". Runs from -100 to +100; above +30 is good, above +50 excellent. Delivered through annual surveys plus quarterly pulses on Culture Amp, Officevibe, 15Five and Lattice. The single-question thermometer of engagement.
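The score follows the usual NPS convention: on a 0-10 scale, promoters answer 9-10, detractors 0-6, and the score is the percentage of promoters minus the percentage of detractors. A minimal sketch, with a hypothetical `enps` helper and made-up survey answers:

```python
def enps(scores):
    """Employee NPS from 0-10 answers: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no responses")
    if any(not 0 <= score <= 10 for score in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(score >= 9 for score in scores)
    detractors = sum(score <= 6 for score in scores)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical pulse: 5 promoters, 3 passives, 2 detractors.
responses = [10, 9, 9, 8, 7, 7, 6, 5, 10, 9]
print(enps(responses))
```

Passives (7-8) count in the denominator but in neither group, which is why a team of mostly content employees can still land near zero.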
228
Pulse Survey
The modern successor to the annual engagement survey — a short 5-10-question survey sent weekly or biweekly. A real-time engagement pulse that lands directly on a manager's dashboard. Tools include Officevibe, 15Five, Lattice and Culture Amp; an agile, actionable answer to the classic 80-question annual monster.
229
EHR (Electronic Health Record)
A digital and shareable record of a patient's health: medical history, lab results, imaging and prescriptions. In the US, Epic and Oracle Health (formerly Cerner) together hold the majority of the hospital market; Europe has DocPlanner and Doctolib, while Türkiye runs e-Nabız and MEDULA. Interoperability plus privacy (HIPAA, GDPR, KVKK) sit at the heart of the industry.
230
ClimateTech
Technology solutions aimed at the climate crisis — both mitigation and adaptation. Includes carbon capture (Climeworks DAC), green hydrogen, fusion energy (Commonwealth Fusion, Helion), grid-scale batteries (Form Energy) and climate-risk modelling (Jupiter). Global ClimateTech investment topped $40 B in 2024; Sequoia, Lowercarbon and Breakthrough Energy are the leading funds.
231
Carbon Capture (DAC / CCS)
Technology that captures CO₂ from the atmosphere or directly from industrial flue gas. The two main routes are Direct Air Capture (Climeworks Orca, Carbon Engineering) and Carbon Capture & Storage (CCS) for factory exhaust. Costs run $300-1,000 per tonne; Frontier's $1 B advance market commitment targets bringing that down to $100.

— QUICK DIAGNOSTIC

Are you ready for an analytics operation?

A four-question interactive guide that points to the program level that fits you. Yes / no answers, result in 30 seconds.

01 / 04

Do you currently have more than 10 active dashboards or Excel reports?

Dashboard abundance is a classic symptom of decision deficit.

— LET'S BEGIN

Are your dashboards triggering decisions — or just decoration?

A 60-minute analytics diagnostic: your current KPI inventory, dashboard dependency graph, data source health and a 90-day roadmap — on one panel.