An organization is considered ‘agile’ if it is able to respond quickly and adapt to changing business needs. Agility is about executing programs in an iterative fashion by breaking them down into logical components to deliver tangible outcomes. It prioritizes flexibility, collaboration, and incremental progress over perfection, addressing the limitations of the traditional approach.
An algorithm is a defined set of logical steps that takes inputs and processes them to achieve a desired outcome. Algorithms can be as simple as ‘linear search’, in which each element in a list or array is checked one by one until the target element is found. They can also be as complex as ‘neural networks’ (deep learning), which mimic human intelligence and are used for tasks such as image recognition and natural language processing. Such algorithms are key components of modern-day AI tools and applications.
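As a simple illustration, here is a minimal Python sketch of linear search; the list and target values are made up.

```python
def linear_search(items, target):
    """Return the index of target in items, or -1 if it is not present."""
    for index, value in enumerate(items):
        if value == target:   # check each element one by one
            return index
    return -1

# Example usage with a small, made-up list
prices = [4, 9, 15, 22, 7]
print(linear_search(prices, 15))  # -> 2
print(linear_search(prices, 99))  # -> -1
```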
Amazon Web Services (AWS) is a comprehensive cloud-computing platform offering a vast array of on-demand computing resources and services that include computing power, storage options, databases, networking, analytics, machine learning, and more. It also includes a marketplace to find and engage vendors and third-party providers.
Amazonification refers to the process of businesses adopting business models, strategies, and practices similar to those of Amazon, the multinational ecommerce giant. It often involves focusing on ecommerce, extensive product offerings, efficient logistics, customer-centric approaches, and technological innovations to streamline operations and deliver a seamless customer experience.
Analytics is the practice of using various statistical, mathematical, and computational techniques to examine data sets and identify patterns, trends, correlations, and relationships in order to draw meaningful insights from data.
Apache Hadoop is an open-source software platform that manages data processing and storage for Big Data applications. It enables distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to multiple machines with local computation and storage capacity.
Apache Spark is an open-source unified analytics engine for large-scale data processing, including applications for batch processing, streaming, and machine learning. Spark works by breaking data down into smaller pieces that can be processed in parallel, enabling faster processing. Spark also supports a wide variety of data formats, so it can be used with data from a variety of sources. A minimal sketch of this parallel processing idea is shown below.
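The following PySpark sketch assumes a local Spark installation; the file name and column names (sales.csv, region, amount) are illustrative.

```python
# A minimal PySpark sketch: read a (hypothetical) sales.csv and aggregate in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesTotals").getOrCreate()

# Spark splits the file into partitions that are processed in parallel
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

totals = sales.groupBy("region").sum("amount")  # distributed aggregation
totals.show()

spark.stop()
```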
An API (application programming interface) is a set of defined rules for how two pieces of software can interact with each other. APIs are used to allow different software applications to communicate with each other and share data. APIs enable companies to open their application data and functionality to external third-party developers and business partners, apart from their internal departments.
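A minimal sketch of calling a hypothetical REST API from Python with the requests library; the URL, parameters, and API key below are placeholders, not a real service.

```python
# Calling a (hypothetical) REST API: the server exposes data over defined endpoints.
import requests

response = requests.get(
    "https://api.example.com/v1/orders",            # hypothetical endpoint
    params={"status": "shipped", "limit": 10},      # query parameters defined by the API
    headers={"Authorization": "Bearer MY_API_KEY"}, # placeholder credential
)

if response.status_code == 200:
    for order in response.json():   # data is typically exchanged as JSON
        print(order)
else:
    print("Request failed:", response.status_code)
```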
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think, learn, and perform tasks that typically require human intelligence. These tasks may include problem-solving, reasoning, decision-making, speech recognition, language translation, visual perception, and more. AI systems are designed to process large amounts of data, recognize patterns, and improve their performance over time through a process called machine learning. The goal of AI is to create intelligent machines that can perform tasks autonomously and efficiently, mimicking human cognitive abilities in various domains.
AutoML, short for automated machine learning, refers to the use of automated tools, techniques, and algorithms to streamline and simplify the process of building and deploying machine learning models. It is a comprehensive approach that automates various stages of the machine learning process, from data preprocessing to model deployment.
Behaviour analytics, also known as behavioural analytics or user behaviour analytics (UBA), is a method of data analysis that focuses on understanding and interpreting patterns of human behaviour within digital systems. It involves tracking and analysing user interactions, actions, and activities to gain insights into their behaviour and preferences.
Blockchain is a decentralized and secure digital ledger that records transactions across multiple computers in a network. Transactions are grouped into ‘blocks’, and each block is linked to the previous one, forming a chain of blocks, hence the name blockchain.
Business intelligence (BI) refers to a technology-driven process of collecting, analysing, and transforming raw data into meaningful and actionable insights to support data-driven decision-making. It involves using various tools and technologies to gather data from multiple sources, process and consolidate it, and present it in the form of reports, dashboards, or visualizations.
Cluster analysis is a tool for exploratory data analysis, specifically used for identification of patterns and relationships in data that would not be apparent otherwise. Cluster analysis provides clusters or groups of data points that are more similar to each other than they are to data points in other groups. It is primarily used to segment customers, identify fraud, and make predictions.
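A minimal sketch of cluster analysis using k-means from scikit-learn; the synthetic ‘customer’ data and the choice of three clusters are assumptions made purely for illustration.

```python
# Cluster analysis with k-means on synthetic two-feature "customer" data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Columns could stand for, say, annual spend and visit frequency (illustrative).
customers = np.vstack([
    rng.normal([20, 2], 2, size=(50, 2)),
    rng.normal([60, 10], 3, size=(50, 2)),
    rng.normal([100, 25], 4, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)   # each customer is assigned to one cluster

print(kmeans.cluster_centers_)           # the "typical" customer in each segment
print(labels[:10])
```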
A CRM (customer relationship management) database is a database containing customer information that is collected, governed, transformed, and shared across an organization. It traditionally has been known for its marketing and sales reporting tools that are useful for leading sales and marketing campaigns and increasing customer engagement.
A dashboard is a visual representation of important data, metrics, or key performance indicators (KPIs) presented in a single, consolidated view. It provides a user-friendly and real-time snapshot of essential information, by using charts, graphs, tables, and other visual elements to present data in a concise and easily understandable format, enabling quick and informed decision-making.
Data can be defined as any factual information in digital form, used as a basis for reasoning, discussion, or calculation. It includes facts, numbers, or any other form of information that is generated. Data can broadly be classified as structured data, which is organized and formatted data stored in a predefined manner, and unstructured data, which has no specific structure or format, such as documents, videos, audio files, posts on social media, and emails.
Data deluge is a situation in which the sheer volume of data being generated not only puts huge pressure on the technological capacities of organizations but also becomes overwhelming for them to deal with while leveraging it for decision-making.
Data democratization is the process of making data appropriately available to all individuals with seamless, anytime access, while empowering them with the knowledge and self-service tools to find and use it.
Data fabric is an architectural approach for integrating data from multiple sources without bringing all the data together in one place. These architectures loosely couple data, through a strong data-cataloguing layer, with the applications that need it. Data is stored in the storage system best suited to the type of data and integrated based on the context of the desired use case.
Data-first world refers to the data age where data takes centre stage and is a source of significant opportunity and advantage. In a data-first world, winners have data at the core of their business model.
Data harmonization, also known as data standardization or data normalization, refers to the process of bringing together and unifying data from different sources or formats to create a consistent and standardized representation.
Data lake is a centralized storage repository that holds vast amounts of raw data in its original format until it is needed for analytics applications. It serves as a platform for Big Data analytics and other data-science applications that require large volumes of data, and involves advanced analytics techniques, such as data mining, predictive modelling and machine learning.
Data literacy is the ability to explore, understand, and interact with data in a meaningful way. It is the ability to ask questions about data, to interpret the answers, and to use those answers to make the best possible decisions, resulting in business value or outcomes.
The data management value chain represents the end-to-end journey of data, encompassing a series of interconnected activities that contribute to maximizing the value of data. These activities can be broadly categorized into eight distinct stages. It starts with data acquisition, which involves capturing the right data, followed by data ingestion, which processes and transforms data for storage. The next stage, data enrichment, enhances the data. The data storage stage ensures the data is stored in the right systems for retrieval. In the insights generation stage, business-intelligence tools are leveraged to generate insights from the data. Orchestration involves the automated execution of tasks and workflows so that insights can be utilized. At the action stage, insights are translated into specific actions. Finally, the impact stage demonstrates the tangible business results achieved through the process.
A data mart is a type of data warehouse. It can be considered as a subset of a data warehouse that is focused on a particular line of business, department, or subject area.
Data mesh is a data-integration approach that promotes decentralizing of data ownership, by creating a connected network of domain-specific data products, managed by domain teams. This approach enables individual teams or domains to take ownership of their data and its entire lifecycle.
Data monetization is the process of transforming raw data into valuable products or services that organizations can package and deliver to generate revenue. This involves packaging data or converting data into actionable insights, analytics, or other forms of information that address specific market needs. By doing so, organizations can establish new sources of income, tap into previously unexplored market potentials, and strengthen their competitive edge. Data monetization maximizes the value of data assets by converting them into marketable products or services, ultimately driving financial gains and strategic advantages for businesses.
A paradox is a statement or proposition that appears contradictory at first glance, but when investigated may prove to be well-founded or true. The world of data is full of such paradoxes as well, both at organization and individual levels. One such key Data Paradox is data deluge vs insights drought, which refers to a situation where organizations or individuals are overwhelmed with the influx of data, but at the same time, are unable to draw meaningful insights from the data they already have. Another one is the paradox of data infrastructure vs problem solving, where the process of building an elaborate infrastructure to capture the explosively growing data becomes an unending exercise while problem-solving often gets ignored. Similarly, many other paradoxes exist, such as getting stuck either in the short term or the long term, democratizing data vs security and so on.
A data pipeline is a series of automated and interconnected processes that facilitate the smooth and efficient flow of data from various sources to its destination, often a data warehouse, database, or analytical system.
Data products are digital assets built as a vertical slice, integrating various elements across the data stack to deliver business use cases in a repeatable manner and accelerate the data to insights to action cycle. For example, a recommendation engine leverages vast amounts of varied data on customers, analyses it to identify patterns, and provides relevant suggestions in real time. This data product can be used for multiple use cases, such as product recommendation, channel or content recommendation, etc.
Data quality refers to the extent to which data is fit for purpose. It should meet specific standards and criteria, ensuring its suitability and reliability for effective decision-making and analysis. Beyond the nine critical dimensions—that is, completeness, timeliness, consistency, validity, accuracy, precision, uniqueness, contemporariness, and conformity—data quality should also encompass the broader context or outcome that the business aims to achieve. This allows for a more comprehensive and adaptive approach to ensuring data quality.
Data-scarce world refers to the pre-internet age where there was limited data available for organizations. Access to data was considered a source of power.
Data science is a specialized field of analytics that involves extracting knowledge and insights from data through specialized programming, advanced techniques, artificial intelligence and machine learning with the aim of answering complex questions and solving intricate problems.
Data visualization is a way to represent information graphically, highlighting patterns and trends in data to help users gather quick insights. It also enables the exploration of data through manipulation of chart images, with the colour, brightness, size, shape and motion of visual objects representing aspects of the dataset being analysed.
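A minimal sketch of data visualization with matplotlib, plotting a made-up monthly sales trend as a line chart.

```python
# A simple line chart highlighting a trend in illustrative monthly figures.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 172, 190]    # made-up values

plt.plot(months, sales, marker="o")       # the line makes the upward trend visible
plt.title("Monthly sales (illustrative)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```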
Data warehouse is a relational database-management system intended to provide integrated, enterprise-wide, historical data. It stores data in a structured format and acts as a central repository of preprocessed data that helps in analytics and generating business intelligence. Its architecture is designed to hold data extracted from transaction systems, operational data stores and external sources.
Databricks is a cloud-based platform that simplifies and accelerates data-analytics and machine-learning tasks. It provides a collaborative environment for data professionals to work together on data-driven projects. Databricks offers powerful tools for data processing, data exploration, and building machine learning models. It enables users to efficiently process and analyse large volumes of data.
Deductive logic is a traditional method of analysis, which starts with comprehensive data collection, subsequently analysing that data, and drawing conclusions based on the analysis. This approach hinges on the assumption that consolidating all relevant data will likely yield meaningful insights to help solve the problem at hand.
Deep learning is a specialized subset of artificial intelligence that involves training artificial neural networks on vast amounts of data to automatically identify patterns, features, and representations, allowing machines to learn, make predictions, and perform complex tasks without explicit programming. It has been particularly successful in areas like computer vision, natural language processing, and speech recognition.
Descriptive analytics is a type of data analytics that looks at historical data to provide an understanding of past events and trends. Results are summarized and presented in the form of easily understood statistics, graphs, charts, reports, or visualizations such as dashboards.
Diagnostic analytics is the area of data analytics that is concerned with identifying the root cause of problems or issues. It is focused on understanding why something has happened and is characterized by techniques such as drill-down, data discovery, data mining and correlations.
Digital age, also referred to as the fourth industrial revolution or Industry 4.0, is the recent phase in the digitization of businesses driven by disruptive trends including the rise of data and connectivity, analytics, human-machine interaction, and improvements in artificial intelligence. It is said to have begun in the 1970s with the introduction of the personal computer, with subsequent technologies providing the ability to transfer information freely and quickly.
Digital channels are online platforms like social media, search engines, and websites used to reach and engage target audiences, enabling product sales, brand building, and industry positioning.
Digital detox is a deliberate and temporary period during which individuals consciously disconnect from their digital devices and platforms, such as smartphones, computers, tablets, and social media platforms, to reduce digital distractions and promote mindfulness and well-being.
Digital-native companies, also known as born-digital companies or digital-first companies, are organizations that were founded in the digital age and have grown and operated primarily in the digital realm. Unlike traditional companies that may have transitioned from offline operations to online platforms, digital-native companies were born with a focus on digital technologies and have digitalization at the core of their business models.
Digital technologies refer to a wide range of technologies, tools, and applications that facilitate activities by electronic means, to create, store, process, transmit and display information. Broadly, digital technologies include the use of personal computers, digital television, radio, mobile phones, etc. Some of the more prominent and advanced digital technologies today are machine learning, artificial intelligence, augmented reality and 5G connectivity.
Edge computing is a distributed computing paradigm that brings computation and data storage closer to the location where it is needed, typically near the edge of the network or closer to the devices generating data. By processing and analysing data locally, edge computing reduces the need for transferring large amounts of data to centralized data centres or the cloud. This results in faster response times, reduced network latency, and improved efficiency for applications and services that require real-time data processing, such as Internet of Things (IoT) devices, autonomous vehicles, and industrial automation.
In computing and data integration, extract, transform, load (ETL) is a commonly used three-phase process for transferring and preparing data from its source to a destination. Sources can be varied, such as transactional databases, flat files, and social media platforms. The destination is typically a data warehouse or data lake.
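A minimal ETL sketch in Python using pandas and SQLite; the source file, column names, and destination table are hypothetical stand-ins for real systems.

```python
# Extract from a (hypothetical) CSV, transform it, and load it into a local database.
import sqlite3
import pandas as pd

# Extract: read raw data from a source file
orders = pd.read_csv("orders.csv")            # hypothetical source

# Transform: clean and reshape for analysis
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders = orders.dropna(subset=["customer_id"])
daily_revenue = (
    orders.groupby(orders["order_date"].dt.date)["amount"].sum().reset_index()
)

# Load: write the prepared data to the destination
conn = sqlite3.connect("warehouse.db")        # stand-in for a data warehouse
daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)
conn.close()
```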
In the context of machine learning, ‘features’ are pieces of data that are used to train and evaluate ML models. A ‘feature store’ is a repository for storing and managing pre-computed, reusable, and high-quality features. It helps to improve the efficiency and accuracy of machine-learning models.
The General Data Protection Regulation (GDPR) is a regulation in European Union (EU) law on data protection and privacy in the European Economic Area. The GDPR is widely considered a landmark regulation that changed the world's mindset on how companies worldwide collect and use the personal data of EU citizens.
Generative AI is a transformative branch of artificial intelligence that uses neural networks to produce outputs that resemble human-generated content. It is focused on creating systems that generate original content, like images, text, music, and virtual environments. Unlike traditional AI models, which focus on pattern recognition and classification, generative AI creates new content, and its adoption offers opportunities in various domains, including customer support, creative content generation, and generative engineering. One key advantage is its ability to help organizations scale AI efforts to truly become AI-first.
Traditional AI needs large and diverse data for training and validation, making data acquisition and preparation challenging. In contrast, generative AI models are built on massive amounts of publicly available internet data; GPT-3, for instance, was trained on a staggering 45 terabytes of text data. Because these models are pre-trained on such data, the need for extensive data acquisition and processing is reduced.
Geospatial data refers to information about objects or events tied to specific physical locations on or near the earth's surface, including details like coordinates, addresses, cities, or zip codes. Geospatial technologies encompass tools like GPS, remote sensing, and geofencing, enabling precise mapping, spatial analysis, and real-time location-based services.
Google Cloud Platform is a suite of public cloud-computing services offered by Google. The platform includes a range of hosted services for compute, storage and application development that run on Google's own hardware, such as data centres and server hubs that host virtual resources (such as virtual machines), among many other offerings.
Hockey stick growth is a pattern in a line chart that shows sudden and extremely rapid growth after a long period of linear growth. The line connecting the data points resembles the shape of a hockey stick.
Hyper-personalization takes personalization a step further by leveraging real-time and dynamic data to offer highly individualized experiences on a one-to-one basis. It considers multiple data points, behaviours, and contextual factors to deliver highly relevant and customized experiences. This approach enables a ‘segment of one’, making each engagement highly relevant to the individual’s specific context.
Hyperscalers are companies that operate large-scale cloud computing platforms and have the ability to rapidly and efficiently scale their infrastructure to meet high-demand requirements. These companies, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, IBM, Oracle and Alibaba Cloud, provide services and resources for storing, processing, and analysing massive amounts of data.
Inductive logic is a method of analysis or problem-solving that is iterative in nature. It starts with formulating an initial set of hypotheses followed by data collection to test that set of hypotheses iteratively and arrive at a conclusion. Multiple iterations lead to convergence towards the best solution.
Internet of Things (IoT) is the network of physical objects that contain embedded technology to sense and communicate or interact with their internal states or the external environment. These devices range from ordinary household objects to sophisticated industrial sensors and/or tools.
Apache Kafka is an open-source platform for real-time data streaming and scalable data integration. It is designed to collect, process, and analyse real-time streaming data, handling data streams from multiple sources and delivering them to multiple consumers. It is widely used across industries for building data pipelines, event streaming, and real-time data-processing applications.
Key performance indicator or KPI is a measurable and quantifiable metric used to track progress towards a specific goal or objective.
A large language model (LLM) is a type of artificial intelligence algorithm that uses deep-learning techniques and massively large data sets to understand, summarize, predict and generate new content. These models can process and analyse text, learn patterns, and generate coherent and contextually relevant responses. Generative AI solutions have LLMs as their foundational model.
The law of large numbers is a theorem in probability and statistics which states that as a sample size grows, its mean gets closer to the average of the whole population. Intuitively, a larger sample reflects the characteristics of the entire population more accurately, so its mean fluctuates less around the true population average.
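A small simulation, assuming a fair six-sided die, shows the sample mean drifting towards the true mean of 3.5 as the sample grows.

```python
# Law of large numbers: the running sample mean approaches the population mean (3.5).
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, rolls[:n].mean())
# The printed means move towards 3.5 as n increases.
```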
Legacy data stack is the collection of tools and technologies in an enterprise used to gather, store, process, and analyse data that are not well-equipped to handle the complexities of the Big Data world (volume, velocity and variety of data). They lack the flexibility and scalability required to handle Big Data and are based on a monolithic architecture. They may also fall short in other aspects of a modern data stack.
Linux, a computer operating system created in the early 1990s by Finnish software engineer Linus Torvalds, building on tools from the Free Software Foundation's (FSF) GNU project, is one of the most prominent examples of free and open-source software collaboration.
Machine learning is a subset of artificial intelligence that focuses on development of algorithms and statistical models that enable computers to learn from data without being explicitly programmed. The algorithms can also adapt in response to new data and experiences to improve their efficacy over time. Machine learning has numerous advanced applications such as speech recognition, natural language processing and recommendation systems.
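A minimal scikit-learn sketch of the idea: a model learns from labelled examples (here, the bundled iris dataset) rather than being explicitly programmed with rules.

```python
# Train a simple classifier on labelled data and evaluate it on unseen examples.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                       # learn patterns from the training data

print("Accuracy on unseen data:", model.score(X_test, y_test))
```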
The metaverse is a virtual and interconnected 3D digital universe that encompasses a vast, shared virtual space where users can interact with each other and digital objects in real-time. It is an immersive and persistent virtual reality environment that merges elements of augmented reality, virtual reality, and the internet.
Microsoft Azure is a cloud-computing platform which offers a set of services, including virtual machines, storage, databases, networking, artificial intelligence, and more. It helps businesses and developers build, deploy, and manage applications and services on the cloud.
ML Ops, short for machine learning operations, is a set of practices and methodologies that focus on streamlining and optimizing the end-to-end machine-learning lifecycle, which ensures the reliability, scalability, and efficiency of machine-learning models in production environments.
A modern data stack is a cloud-hosted, comprehensive set of tools and technologies that brings together the different layers of the data management value chain, starting from acquiring data from various sources to translating them into meaningful insights, delivered to the end user for consumption. These layers are loosely coupled yet tightly connected, enabling rapid scaling, adaptability and seamless interoperability, driving the data to insights to action cycles at speed.
Mutually-exclusive-collectively-exhaustive (MECE) is a principle of grouping information (ideas, topics, issues, solutions) into elements that are mutually exclusive—that is, groups with no overlapping and each item having a place in one group—and collectively exhaustive—that is, all groups, thus created, would include all possible items relevant to the context.
Natural language processing (NLP) refers to the branch of artificial intelligence that gives computers the ability to understand text and spoken words in much the same way human beings can. The technology involves the ability to turn text or audio speech into encoded, structured information, based on an appropriate ontology. It powers tools such as voice-operated GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, etc.
Netflixization is a term used to describe the impact of Netflix’s innovative streaming model on the entertainment and media industries. It refers to the trend of companies and platforms adopting similar streaming services to provide on-demand content to consumers, disrupting traditional media distribution models.
Neural networks are a type of artificial intelligence model inspired by the human brain’s neural structure. They consist of interconnected nodes (also called neurons) organized in layers to process and learn from data. Each node receives inputs, processes them using mathematical functions, and produces outputs that contribute to making predictions or classifications. They are particularly well-suited for tasks that involve recognizing patterns, classifying objects, making predictions, or solving problems based on input data. They power technologies such as natural language processing and speech recognition.
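A toy NumPy sketch of the forward pass through a tiny, untrained network, intended only to show how inputs flow through weighted layers and activation functions.

```python
# Forward pass of a tiny neural network: each layer multiplies its inputs by weights,
# adds a bias, and applies a non-linear activation. Weights here are random (untrained).
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0, x)

x = rng.random(4)                            # 4 input features (illustrative values)

W1, b1 = rng.random((3, 4)), rng.random(3)   # hidden layer: 4 inputs -> 3 neurons
W2, b2 = rng.random((1, 3)), rng.random(1)   # output layer: 3 inputs -> 1 neuron

hidden = relu(W1 @ x + b1)                   # each hidden neuron combines all inputs
output = W2 @ hidden + b2                    # the output feeds a prediction

print(output)
```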
NoSQL (not only SQL) is a type of database-management system that is designed to handle and store large volumes of unstructured and semi-structured data. NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
Omnichannel experience refers to a seamless and integrated customer experience across all channels and touchpoints, allowing customers to interact and transition between various channels (example, online, mobile, in-store) effortlessly.
OLAP (online analytical processing) is a computing technique that allows users to extract and query data in a way that facilitates analysis from multiple perspectives. It is commonly used in business intelligence for tasks like trend analysis, financial reporting, sales forecasting, budgeting, and planning, etc.
An operational data store (ODS) is a real-time database used to store data from transactional systems like customer relationship management (CRM) and enterprise resource planning (ERP) systems. It serves as a type of data warehouse but with the distinction of being smaller and updated more frequently, making it suitable for real-time operational needs.
Oracle Autonomous Data Warehouse (Oracle ADW) is a fully managed cloud database service released in 2019. It automates administration of a data warehouse, allowing users to completely focus on data applications.
Personalization is a process that creates a relevant, individualized interaction between two parties designed to enhance the experience of the recipient. The term is often used for businesses curating and delivering customized products or services to other businesses (B2B) or end customers (B2C).
Personally identifiable information (PII) is any data or combination of data that can be used to identify, contact, or locate an individual. It includes both direct identifiers (example, name, social security number, email address) and indirect identifiers (example, birthdate, location data) that, when combined, can link the information to a specific person.
Microsoft Power BI is a business-intelligence tool that helps organizations analyse and visualize their data in a user-friendly way. It allows users to connect to various data sources, such as databases, spreadsheets, and cloud services, and transform that data into interactive reports and dashboards. Power BI also provides collaboration and sharing capabilities.
Predictive analytics is a branch of advanced analytics that makes predictions about future outcomes and trends using historical data combined with statistical modelling, data mining techniques and machine learning. Predictive analytics is utilized to find patterns in data to help organizations make informed predictions about future events, behaviours, or probabilities.
Prescriptive analytics is the use of advanced processes and tools to analyse data and content, to recommend the optimal course of action or strategy moving forward. Often the techniques used are graph analysis, simulation, complex event processing, neural networks, recommendation engines, heuristics, and machine learning.
Proprietary knowledge encompasses the accumulation of unique assets, codified tacit knowledge, and specialized methods, processes, techniques, and algorithms developed over time by organizations. For individuals, it translates into wisdom which is acquired from accumulated information and experiences over time. This valuable knowledge can become a source of durable competitive advantage for both organizations and individuals.
Real-time data refers to information or data that is generated, processed, and made available for use immediately or with minimal delay as it occurs or is collected.
A recommendation engine is a machine learning algorithm that gives customers recommendations based upon their behaviour patterns and similarities to people who might have shared preferences. These systems use statistical modelling, machine learning, and behavioural and predictive analytics algorithms to personalize the user experience.
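A minimal sketch of the underlying idea, using user-to-user cosine similarity on a tiny, made-up ratings matrix; real recommendation engines use far richer data and models.

```python
# Recommend an unrated item to a user by weighting other users' ratings by similarity.
import numpy as np

# Rows = users, columns = items, 0 = not yet rated (illustrative values)
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                   # recommend for the first user
scores = np.zeros(ratings.shape[1])
for other in range(ratings.shape[0]):
    if other == target:
        continue
    sim = cosine(ratings[target], ratings[other])
    scores += sim * ratings[other]           # similar users count for more

# Suggest the unrated item with the highest weighted score
unrated = np.where(ratings[target] == 0)[0]
print("Recommend item:", unrated[np.argmax(scores[unrated])])
```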
Snowflake is a cloud-based data platform that helps organizations store, manage, and analyse large amounts of data efficiently. It is designed to handle structured and semi-structured data, and provides a flexible and scalable solution for data storage and processing. It eliminates the need for upfront hardware investment and offers on-demand scalability.
SQL stands for structured query language, a computer language for storing, manipulating and retrieving data stored in a relational database. SQL was developed in the 1970s by IBM computer scientists and became a standard of the American National Standards Institute (ANSI) in 1986 and the International Organization for Standardization (ISO) in 1987.
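A minimal sketch using Python's built-in sqlite3 module to run SQL statements against an in-memory database; the table and data are illustrative.

```python
# Store, manipulate, and retrieve relational data with SQL via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Asha", "IN"), ("Ben", "US"), ("Chen", "SG")],
)

for row in cur.execute("SELECT name FROM customers WHERE country = ? ORDER BY name", ("IN",)):
    print(row)

conn.close()
```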
Structured data is a dataset that has a standardized format (also known as schema) that is helpful for efficient access, either by software or a manual user. It is tabular, with rows and columns that clearly define data attributes.
Triangulation refers to validation of data through cross verification from more than two sources. It involves utilization of multiple datasets, methods, theories, and/or investigators to answer a research question. It’s a research strategy that enhances the validity and credibility of findings and reduces the effect of research biases.
Two-speed execution is an agile approach focused on delivering short-term business impact while ensuring that long-term capabilities are built in the process. Speed 1 is about identifying high-impact use cases that solve an immediate or specific customer or business problem and deliver quick and direct impact, while Speed 2 focuses on building long-term capabilities, typically by bringing together Speed 1 initiatives. This enables organizations to quickly experiment, learn and iterate through Speed 1 initiatives, and to build capabilities that are more comprehensive and critical in the long run.
Unstructured data is information with no set data model, or data that has not yet been ordered in a predefined way. A few common examples of unstructured data are text files, video files, email and images.
VUCA is an acronym listing qualities that make a situation difficult to analyse, respond to or plan for: volatility, uncertainty, complexity and ambiguity.
First used in the US Army War College in 1987 and publicly published in 1991, VUCA was applied to the conditions following the end of the Cold War. However, this term is nowadays applied to current business scenarios that are significantly disrupted by digital advancements, especially data technologies.
A web API is an application programming interface (API) for either a web server or a web browser.
Zero trust is a security approach that verifies every user and device accessing an organization’s resources, irrespective of location or network. Access is continuously monitored and strictly controlled based on multiple factors like user identity and device health, preventing unauthorized access and enhancing security even within trusted networks.