Genesis: the Google Knowledge Graph
Every one of us has experienced graph technology in action through Google Search. That search panel with an answer, supporting information, and related topics could feature, for example, an actress with her origins, previous work, future projects, and co-stars, or a location in Canada with its population, weather, geographical information, and even some travel suggestions. This view of interconnected people, objects, virtual goods, geographical places, and so on feels natural to us because it resembles how humans think every day. Google was among the pioneers, basing its whole product – search – on a graph structure early on.
The Era of Graphs
Graph concepts, ontologies, taxonomies, and triples have accompanied us for a very long time, first as sketches and later embedded into data. Suddenly, graphs are everywhere, from visualizations of web 3.0 spider nets to analyses of the spread of the COVID-19 pandemic. So why is everyone using graphs to analyze and visualize data and events?
As with the explosion of machine learning a decade ago, the reasons are volumes of diverse data, a new generation of graph stores, and new algorithms.
- Data: many kinds of newly collected data capture not only values but also the relations between data streams and the dynamics of the interconnections between them.
- Graph stores: new graph databases and lakes can store highly relational data compactly across many dimensions, while traditional relational databases usually capture only one dimension of relations. Graph data stores impose no limit on the model’s size, allowing more flexible management of relations and entities.
- New-generation models: novel algorithms, such as Graph Neural Networks, can detect similar patterns and alignments in a much more straightforward way.
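To make the "no fixed schema" point concrete, here is a minimal sketch of a schema-free graph store as a set of subject–predicate–object triples. The `TripleStore` class, entity names, and relation names are all illustrative assumptions, not any particular product's API; a real graph database adds indexing, persistence, and a query language on top of this idea.

```python
# A minimal in-memory triple store: facts as (subject, predicate, object).
# Illustrative sketch only - not a production graph database.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subj, pred, obj):
        # No schema: any entity can gain new relation types at any time,
        # with no migration step.
        self.triples.add((subj, pred, obj))

    def query(self, subj=None, pred=None, obj=None):
        # None acts as a wildcard, like a variable in a SPARQL triple pattern.
        return [t for t in self.triples
                if (subj is None or t[0] == subj)
                and (pred is None or t[1] == pred)
                and (obj is None or t[2] == obj)]

store = TripleStore()
store.add("Alice", "works_on", "SupplyChainDashboard")
store.add("SupplyChainDashboard", "reads_from", "orders_table")
store.add("orders_table", "stored_in", "warehouse_db")
# A brand-new relation type, added on the fly:
store.add("Alice", "mentors", "Bob")

print(store.query(subj="Alice"))
```

A relational schema would need a new table or column for each new relation type; here the model simply grows as new entities and relations appear.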
The Five “W”s
Graphs can help us answer any data-related question – what, who, where, why, and even how: what data do you have internally or could source externally, where can you find it, how is it being used and by whom, how could it be used, and why. Additional traits of graphs that give us even more value include the analysis of clusters, cohorts, and groups: monitoring a data signal with its context, groups of signals, and their dynamics within and across peer processes.
Search over Knowledge Graph
The most immediate value derived from graphs lies in data discovery. Naturally, analysts only look for things they are familiar with. In our experience, even the most advanced usage of data sources does not exceed 20% of the internal data available. Imagine what could happen if we used 100% of our data. Graph structure is contextual and integrated, letting us see the connections and relationships of the data we already know and “use the underutilized.” It reveals what was hidden and what questions could be asked around it.
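One way to "use the underutilized" is to let graph context surface datasets that are rarely queried themselves but sit right next to heavily used ones. The sketch below assumes a hypothetical adjacency map of datasets that share entities and illustrative usage counts; the thresholds are arbitrary parameters, not a prescribed method.

```python
# Sketch: surfacing "underutilized" data via graph context.
# Edges link datasets that share entities; usage counts are illustrative.
edges = {
    "orders": {"customers", "shipments"},
    "customers": {"orders", "support_tickets"},
    "shipments": {"orders", "carrier_rates"},
}
usage = {"orders": 950, "customers": 800, "shipments": 40,
         "support_tickets": 5, "carrier_rates": 2}

def underutilized_neighbors(graph, usage, popular_min=500, quiet_max=50):
    """Datasets rarely queried themselves but adjacent to popular ones."""
    found = set()
    for node, neighbors in graph.items():
        if usage.get(node, 0) >= popular_min:      # a well-known dataset
            for n in neighbors:
                if usage.get(n, 0) <= quiet_max:   # a neglected neighbor
                    found.add(n)
    return found

print(underutilized_neighbors(edges, usage))  # e.g. shipments, support_tickets
```

The same one-hop idea extends to multi-hop exploration, which is exactly what interactive graph browsing gives analysts for free.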
Remember users’ gut feeling telling them the analytics were wrong? They can tell from their domain expertise that something isn’t right, but the data and analytics they see don’t justify that claim. With graphs, domain expertise and tribal knowledge about data and processes can finally be materialized in a coherent and comprehensive structure:
- Data discovery graphs are intuitive and surface inefficiencies in existing processes, such as procurement, supply chain, and manufacturing, across silos. Instead of capturing domain-specific data in separate databases, schemas, or tables, all entities are interconnected. In a graph structure, it is much easier to perform complex multi-domain queries and detect inefficiencies, often already at the graph-construction stage. Connecting silos into one semantic layer is a major principle of the data fabric, in contrast with the data mesh.
- Another aspect is that every element of the graph is accessed at the cost of one step, whereas the cost of reaching an element stored in a table grows with the size of the data.
- Exploring graphs comes naturally to employees with a wide range of skills. It is intuitive and doesn’t require coding experience. Moreover, users with an operational background can spot logic faults right away or “reverse engineer” an analysis they were presented with in a recent dashboard but couldn’t understand.
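The one-step-access point can be sketched in a few lines: in a flattened relation table without an index, finding related records means scanning every row, while in a graph's adjacency-list form each neighbor set is one lookup away. The table and entity names below are made up for illustration.

```python
# Sketch: access-cost contrast between a row scan and an adjacency hop.
rows = [("orders", "customers"), ("orders", "shipments"),
        ("invoices", "customers")]  # flattened relation table

def related_by_scan(table, entity):
    # Cost grows linearly with the number of rows.
    return [b for a, b in table if a == entity]

adjacency = {}  # graph form: entity -> directly connected entities
for a, b in rows:
    adjacency.setdefault(a, set()).add(b)

def related_by_hop(graph, entity):
    # Cost is one dictionary lookup, independent of total graph size.
    return graph.get(entity, set())

print(related_by_scan(rows, "orders"))
print(related_by_hop(adjacency, "orders"))
```

Real databases narrow this gap with indexes, but graph stores make the constant-cost hop the default for every relation, which is what makes deep traversals cheap.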
The case for a unified data and analytics language, or the end of “lost in translation”
All of the above adds orders of magnitude to improvements in data management and analytics. Still, the most crucial aspect of graphs is the ability to serve as the ground truth of the terminology and logic.
If graphs are well designed, they do not allow ambiguities in definitions. Knowledge bases finally connect. Entities, their properties, relations, metrics, and the processes embedded in those relations are unique. If two teams define an “inactive user” differently, the graph will expose it as a conflicting node or relation.
So if graph hygiene is maintained, the organization gains a unified, unambiguous data and analytics language that all departments can use, both as data producers and as data consumers. Business users can easily define the calculations they would like to perform on top of this terminology, and technical teams get a clear picture of the data streams to build ETL pipelines around, knowing how to maintain data lakes and which data to deprecate.
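The “conflicting node” idea can be sketched as a merge of per-team vocabularies that refuses to silently overwrite a term. The function and the two team definitions below are hypothetical, standing in for what a knowledge graph would do when two departments register the same entity with different definitions.

```python
# Sketch: flagging conflicting term definitions when merging vocabularies,
# instead of letting one team's definition silently win.

def merge_vocabularies(*vocabs):
    merged, conflicts = {}, {}
    for vocab in vocabs:
        for term, definition in vocab.items():
            if term in merged and merged[term] != definition:
                # Same node, two incompatible definitions: a conflict.
                conflicts.setdefault(term, {merged[term]}).add(definition)
            else:
                merged.setdefault(term, definition)
    return merged, conflicts

marketing = {"inactive user": "no login in 30 days"}
billing   = {"inactive user": "no paid invoice in 90 days"}
merged, conflicts = merge_vocabularies(marketing, billing)
print(conflicts)  # the graph would surface "inactive user" as conflicted
```

In a real knowledge graph the same check runs over entity and relation definitions, and the conflict becomes a visible node to resolve rather than a buried discrepancy between dashboards.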
Centralized management, decentralized consumption
This concept of a data and analytics “spider-net encyclopedia” allows centralized data management with decentralized data consumption, where analysts, developers, business operations, and citizen analysts can self-serve their needs by using graphs as data models, data panels, feature stores, or metric stores.
Centralized data management should include:
- A data model – the graph ontology is a data model across all your data sources. You do not need to model all the relations and hierarchies – just use a subset of the ontology as the data model for a specific analysis, or use it in full underneath your BI.
- Data governance – it is much easier to certify and maintain a unique, unified terminology than dispersed, non-unique data stores.
- Alerting – natural language processing (NLP) and graph techniques, such as graph alignment, help find conflicting entity or relationship definitions.
- Impact analysis – if part of the graph is compromised, for example data deleted by mistake, the impact (1st degree, 2nd degree, and so on) is detected immediately, explaining which logic was hurt by the change and which applications and users were affected.
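Impact analysis by degree is a breadth-first walk over the dependency graph. The sketch below assumes a hypothetical lineage map from tables to metrics to applications; the degree of each downstream asset tells you how directly it was hurt.

```python
# Sketch: impact analysis over a lineage graph. If an asset is compromised,
# walk downstream edges breadth-first to find affected assets by degree.
from collections import deque

downstream = {
    "warehouse_db": ["orders_table"],
    "orders_table": ["revenue_metric", "churn_metric"],
    "revenue_metric": ["exec_dashboard"],
    "churn_metric": ["exec_dashboard", "retention_app"],
}

def impact(graph, compromised):
    """Return {asset: degree of separation from the compromised node}."""
    degrees, queue = {compromised: 0}, deque([compromised])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in degrees:
                degrees[nxt] = degrees[node] + 1
                queue.append(nxt)
    return degrees

print(impact(downstream, "orders_table"))
```

Here the metrics are 1st-degree impacts and the dashboard and app are 2nd-degree, while upstream assets such as the warehouse are untouched.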
As suggested in “Top Trends in Data and Analytics for 2021: Graph Relates Everything” by Gartner (2021), by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision-making across the enterprise.
Now that we have graph-facilitated central data and analytics management, what does it mean to have decentralized consumption? The answer falls into the following six categories:
- In terms of onboarding: we already mentioned self-service visual graph exploration, especially for business users, which can serve as an onboarding vehicle for new employees to understand their domain better.
- In terms of explainability: users can track the logic and reverse-engineer the metrics behind a number on a dashboard, to make sure they understand them completely and can match them with their expertise.
- In terms of no-code logic development: no-code tools and editors could be used by non-technical users to model new logic using existing graph entities and structures.
- In terms of observability: knowledge graphs that include usage patterns could be the perfect observability vehicle for trends in data and analytics: which areas are covered, which are compromised, what is the impact, and what is the velocity.
- In terms of application development: code-based application development can also leverage a graph as a data model – no need to define new data structures in a database; just use the logic concepts from the graph, abstracted from the data.
- In terms of analytics: graph-based logic discovery prevents reinventing the wheel and rewriting already existing queries, while blocking implementations of a new business question that contradict governed and certified terms, queries, and metrics. Faster, more focused search and reusable logic components speed up analytics creation.
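Logic discovery before logic creation can be sketched as a lookup against a metric registry derived from the graph. The registry, metric names, and spec fields below are hypothetical; the point is that an exact match on the governed definition means "reuse" rather than "rewrite".

```python
# Sketch: logic discovery before creating a new metric. The registry maps
# metric names to the governed entities, events, and windows they use.

registry = {
    "weekly_active_users": {"entity": "user", "event": "login",
                            "window_days": 7},
    "monthly_revenue": {"entity": "invoice", "event": "paid",
                        "window_days": 30},
}

def find_existing(registry, spec):
    """Return names of registered metrics with an identical definition."""
    return [name for name, existing in registry.items() if existing == spec]

proposed = {"entity": "user", "event": "login", "window_days": 7}
matches = find_existing(registry, proposed)
print(matches)  # an exact match means: reuse the certified metric
```

Real systems would also score near-matches (same entity, different window) to catch definitions that conflict with certified ones rather than duplicate them.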
Graphs form the foundation of many modern data and analytics capabilities, driven by maturing graph solutions and the need to answer increasingly complex business questions. Data and analytics leaders must plan to adopt graph technologies and raise awareness of them to respond to such opportunities.