March 1, 2023
As more and more businesses move their applications and associated data to the cloud, managing all that information becomes more complicated.
IT no longer has complete control over, or insight into, every aspect of the datastore. Instead, as organizations adopt multiple cloud providers and endpoint data is served to and collected from far-flung users and workstations, you’re likely to run into compatibility and versioning issues between various databases and storage platforms. The data management problem grows even larger as multicloud, the Internet of Things, and Big Data initiatives rise in popularity and real-world applicability.
Three ways to get all your ever-growing databases and datastores on the same page are data federation, data hubs, and data lakes. What are the differences among them, and what are the pros and cons of each?
A federated database, also known as a virtual database, uses a software abstraction layer to combine various database sources into a single view. The system accepts queries and presents itself as a single database, while the software is actually querying the production databases or data warehouses in real time. Each underlying source database is a “federate.”
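As a rough illustration of that abstraction layer, here is a minimal sketch in Python. It assumes two in-memory SQLite databases standing in for the federates; the `FederatedView` class, the `customers` table, and the source names `crm` and `billing` are all hypothetical and not taken from any particular federation product.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one federate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


class FederatedView:
    """Hypothetical abstraction layer: one query interface over many sources."""

    def __init__(self, federates):
        self.federates = federates  # mapping of source name -> connection

    def query(self, sql, params=()):
        results = []
        # The data is never copied centrally; each source is queried live.
        for name, conn in self.federates.items():
            for row in conn.execute(sql, params):
                results.append((name,) + tuple(row))
        return results


crm = make_source([(1, "Ada", "EU"), (2, "Grace", "US")])
billing = make_source([(3, "Linus", "EU")])

view = FederatedView({"crm": crm, "billing": billing})
eu_customers = view.query(
    "SELECT id, name FROM customers WHERE region = ?", ("EU",)
)
```

The caller issues one query and gets back one combined result set, tagged here with the name of the federate each row came from. A real federation layer would also have to translate between SQL dialects and schemas, which this sketch sidesteps by giving both sources the same table.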
With a virtual database, the source data is not moved or copied into a central repository. Every query has to reach out to the source systems, so operations can be more costly and time-consuming than they would be against a consolidated store. However, it is relatively easy to configure a federated database on top of existing databases in different environments and locations.
A federated database can harmonize data, but only on the fly, as query results are returned. (Data harmonization means taking data from all sources, comparing similar records and combining them where possible, throwing out bad data, and presenting the most accurate items as a whole.)
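To make harmonization on return more concrete, here is a small sketch. The rules it applies are assumptions chosen for illustration: records are matched on a normalized `email` field, a record without a plausible email counts as bad data, and newer non-empty values win over older ones. The `harmonize` function and the field names are hypothetical.

```python
def harmonize(records):
    """Merge records returned by several federates into one clean list."""
    merged = {}
    # Process oldest first so that newer values overwrite older ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # throw out bad data
        combined = dict(merged.get(email, {}))
        # Combine similar records: keep older values where the newer is empty.
        combined.update({k: v for k, v in rec.items() if v is not None})
        combined["email"] = email
        merged[email] = combined
    return sorted(merged.values(), key=lambda r: r["email"])


rows_from_federates = [
    {"email": "Ada@Example.com", "phone": "555-0100", "updated": "2023-01-10"},
    {"email": "ada@example.com ", "phone": None, "updated": "2023-02-01"},
    {"email": "not-an-email", "phone": "555-0199", "updated": "2023-02-15"},
]
clean = harmonize(rows_from_federates)
```

Because this work happens each time results come back, the cost is paid on every query rather than once at load time.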
Virtual databases do not index data. (Indexing involves the creation of an index, which is saved separately on the storage media and allows quicker record retrieval and analytics.) Federated databases instead rely on the underlying siloed databases to do the indexing, and any retrieval or analytics request is performed on the source database that hosts the data.
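The sketch below shows what relying on the source database for indexing can look like, using an in-memory SQLite database as a stand-in for one federate. The `orders` table, the index name, and the `query_plan` helper are hypothetical; the point is only that the index is created on the source, and the source decides how to use it.

```python
import sqlite3

# One source database; the federation layer would only forward queries to it.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, i * 1.5) for i in range(1000)],
)


def query_plan(conn, sql):
    """Return SQLite's description of how it would execute the query."""
    # The fourth column of EXPLAIN QUERY PLAN output is the readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


lookup = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = query_plan(source, lookup)  # full table scan, no index yet

# The index lives in the source database, not in the federation layer.
source.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan_after = query_plan(source, lookup)  # now searches via the index
```

Printing `plan_before` and `plan_after` shows the source switching from a scan to an index search. The exact wording of the plan varies between SQLite versions.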
Federated databases can run into trouble if you run a query that one of the source systems does not recognize. Because your federation platform is tied to the source databases, it can take a lot of effort to integrate them, and making changes is also development-heavy. If any of the source systems are degraded, the entire federation might be inaccessible.
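The following sketch illustrates that failure mode under a simple assumption: the federation layer fails the whole query as soon as any federate errors out. The `federated_query` function, the `FederationError` class, and the two stand-in sources are hypothetical.

```python
class FederationError(Exception):
    """Raised when any federate cannot answer a query."""


def federated_query(federates, sql):
    """Run sql against every federate; fail if any single one fails."""
    rows = []
    for name, run_query in federates.items():
        try:
            rows.extend(run_query(sql))
        except Exception as exc:
            # One degraded or incompatible source takes down the whole query.
            raise FederationError(f"federate {name!r} failed: {exc}") from exc
    return rows


def healthy(sql):
    # Stand-in for a source that answers normally.
    return [("order-1", 19.99)]


def degraded(sql):
    # Stand-in for a source that is down or rejects the query.
    raise ConnectionError("source timed out")


try:
    federated_query(
        {"orders_eu": healthy, "orders_us": degraded},
        "SELECT id, total FROM orders",
    )
    outcome = "ok"
except FederationError as err:
    outcome = str(err)
```

A federation layer could instead return partial results and flag the missing source, but then callers have to cope with incomplete answers. Which behavior is appropriate depends on the workload.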
One advantage of federation is real-time access to all of your data, rather than waiting for it to be moved into a central location and then harmonized and indexed. Scaling can be difficult, but for real-time, web-service-style workloads, federation can be a good option.