Leading Big Data Innovations in the 5G and AI Era

In the new era of 5G and AI, information collection has become ubiquitous and data is being generated at an unprecedented pace. As a result, more and more industries are adopting big data technologies to unleash the full power of their data. For example, to streamline operations and make more accurate investment decisions, operators have deployed big data infrastructures for operational analytics. In the financial industry, big data analytics helps institutions respond to customer needs more effectively through personalised recommendations and precision marketing. Governments across the globe have also used big data technologies to optimise city administration, saving taxpayers billions of dollars.

The challenges of conventional big data architectures in the new era

These new systems and novel ways of exploring data have already created tremendous value for customers. However, as business grows and new use cases are discovered daily, the previous generation of architecture must evolve as well, and its new requirements pose new challenges for system designers.

Through years of evolution, many operators have built industry-leading big data systems covering operational analytics, network optimisation and planning, call detail records (CDRs), and log retention. Over time, however, many IT organisations have begun to hit roadblocks such as data silos, inflexible expansion, and low disk utilisation.

Many of the challenges can be traced back to the conventional Hadoop architecture, which has a tightly coupled storage-compute relationship. The main drawbacks are:

  • Many big data vendors require dedicated HDFS clusters for their subsystems, meaning each big data subsystem (often from a single vendor) can only connect to its own HDFS. Subsystems from the same vendor must be co-located on the same nodes, which creates a closed architecture and data silos.
  • Computing and storage resources must be expanded together. However, their usage is hard to predict and rarely grows at the same rate, so expanding both at the same pace wastes whichever resource is not the bottleneck.
  • Disk utilisation is low. Open source HDFS implementations mainly store data with traditional three-replica protection, which translates to a disk utilisation of roughly 33 per cent (see the sketch after this list). This low ratio means more physical drives are required, increasing the overall storage cost.

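To make the replica overhead concrete, here is a minimal sketch of how much raw disk a three-replica layout consumes for a given amount of usable data. The 1 PB target is an illustrative figure, not taken from any deployment described in this article:

```java
public class ReplicaOverhead {
    public static void main(String[] args) {
        double usablePb = 1.0;               // illustrative target: 1 PB of usable data
        int replicas = 3;                    // HDFS default replication factor
        double rawPb = usablePb * replicas;  // every block is stored three times
        double utilisation = usablePb / rawPb;
        System.out.printf("Raw capacity needed: %.1f PB (utilisation %.0f%%)%n",
                rawPb, utilisation * 100);   // prints 3.0 PB, 33%
    }
}
```
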
The storage industry is seeking new ways to solve these problems, and the decoupled storage-compute architecture is a very promising solution.

OceanStor distributed storage decouples compute and storage

Huawei has long promoted the application of big data. Globally, it ranked third in code contributions to the Hadoop community, and first among all IT device vendors. In 2019, Huawei launched its Decoupled Storage-Compute Big Data Solution, powered by OceanStor distributed storage, to lead big data innovation in the cloud and AI era.

The OceanStor D series next-generation intelligent distributed storage sits at the core of this new solution. It provides remote HDFS interfaces that replace local HDFS storage in Hadoop, so compute nodes and storage nodes can form separate resource pools, as shown in the following figure.

The following four enhancements show how the decoupled storage-compute solution improves efficiency and reduces costs in multiple areas:

  1. Independent expansion of storage and computing resources

The decoupled storage-compute architecture allows storage and computing resources to be expanded independently, each at its own pace and only as needed, so expansion does not strand resources the way the coupled scheme does.

  2. Independent cloud resource pools for improved resource utilisation and data sharing efficiency

In the decoupled storage-compute big data solution, computing and storage resources can each be burst to the cloud separately. This lets both be used efficiently, and one set of big data storage can serve multiple applications simultaneously. Fig. 2 shows a use case in which siloed data sets are consolidated into one large storage pool that is shared by many applications, including cloud applications.

  3. Elastic EC algorithm for huge storage utilisation boost

OceanStor distributed storage uses erasure coding (EC) for data protection. The D series platform supports up to 22+2 EC, improving storage utilisation from 33 per cent to 91 per cent while providing enterprise-grade features such as automatic tiering of hot, warm, and cold data. Compared with traditional HDFS, this can bring substantial savings for D series users. Fig. 3 illustrates the difference between three-replica and 22+2 EC. Beyond 22+2, a spectrum of EC configurations is available, letting users trade off cost against performance, as illustrated in the sketch below.
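
As a rough illustration of that trade-off, the sketch below computes the usable fraction of raw capacity for a few k+m erasure-coding layouts alongside three-replica storage. The k+m combinations other than 22+2 are examples chosen for comparison, not a statement of what the D series supports:

```java
public class EcUtilisation {
    // Usable fraction for an EC layout with k data fragments and m parity fragments.
    static double ecUtilisation(int k, int m) {
        return (double) k / (k + m);
    }

    public static void main(String[] args) {
        System.out.printf("3-replica : %.1f%%%n", 100.0 / 3);      // ~33.3%
        int[][] layouts = {{4, 2}, {8, 2}, {12, 3}, {22, 2}};       // illustrative k+m choices
        for (int[] km : layouts) {
            System.out.printf("EC %d+%d : %.1f%%%n",
                    km[0], km[1], 100 * ecUtilisation(km[0], km[1]));
        }
        // Larger k raises utilisation (22+2 -> ~91.7%) but spreads each stripe
        // across more nodes, which is the cost/performance trade-off noted above.
    }
}
```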

  4. Native HDFS interface, no plug-in required, and fully compatible with mainstream big data platforms

OceanStor distributed storage provides native HDFS interfaces and is fully compatible with mainstream big data platforms such as FusionInsight, Cloudera, Hortonworks, and Transwarp. If a customer already runs a coupled storage-compute environment, the D series can be added alongside the existing storage, with no service disruption to the existing environment during the process.
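
Because the interface is standard HDFS, a compute cluster can in principle be pointed at the new storage through ordinary Hadoop client configuration. The sketch below uses the stock Hadoop FileSystem API; the namespace URI hdfs://oceanstor-ns is a placeholder, and the exact endpoint naming depends on the actual deployment:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting fs.defaultFS in core-site.xml.
        // "hdfs://oceanstor-ns" is a placeholder for the storage system's HDFS endpoint.
        conf.set("fs.defaultFS", "hdfs://oceanstor-ns");

        FileSystem fs = FileSystem.get(conf);
        Path probe = new Path("/tmp/decoupled-probe");
        fs.mkdirs(probe);                        // plain HDFS calls, no vendor plug-in
        System.out.println("Connected to: " + fs.getUri());
    }
}
```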

Case Study: China Telecom combined decoupled storage-compute and local HDFS solutions for higher effective capacity

China Telecom Hebei adopted the Huawei OceanStor decoupled storage-compute solution to expand the capacity of its operational analytics platform. Elastic EC improved disk utilisation from 33 per cent to 91 per cent, which translated into a more than 60 per cent increase in effective (usable) capacity. With ViewFS, the distributed storage and the local HDFS work together under one namespace, balancing reads and writes between them, and there is no need to upgrade the existing big data platform version or migrate existing data.
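
ViewFS is standard Hadoop client-side functionality, so combining the two back ends comes down to mount-table entries in the client configuration. The sketch below is illustrative only: the cluster name, mount paths, and namespace URIs are placeholders, not China Telecom Hebei's actual settings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ViewFsMountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs://bigdata");
        // Existing data stays on the local HDFS cluster...
        conf.set("fs.viewfs.mounttable.bigdata.link./warehouse",
                 "hdfs://local-hdfs/warehouse");
        // ...while new writes land on the decoupled OceanStor pool (placeholder URI).
        conf.set("fs.viewfs.mounttable.bigdata.link./expansion",
                 "hdfs://oceanstor-ns/expansion");

        FileSystem fs = FileSystem.get(conf);
        // Applications see one namespace; ViewFS routes each path to the right back end.
        System.out.println("Unified namespace: " + fs.getUri());   // viewfs://bigdata
    }
}
```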

Case Study: A Singapore ISP adopted decoupled storage-compute architecture to replace open source Hadoop software

The customer had built an open source Hadoop environment to store R&D log data. A key requirement was to maximise data density so that operating costs could be minimised.

After several rounds of evaluation, the customer chose Huawei OceanStor distributed storage over open source HDFS, raising disk utilisation from 66 per cent to 91 per cent. As a result, the usable capacity of a single cabinet increased by 140 per cent, the number of cabinets fell from 15 to 8, and OPEX dropped by 40 per cent.
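
The sizing arithmetic behind results like this can be sketched with a simple helper that converts raw cabinet capacity into usable capacity under a given protection scheme. All figures below are hypothetical illustration values, not the ISP's actual capacities:

```java
public class CabinetSizing {
    // Cabinets needed to deliver a target usable capacity, given the raw capacity
    // per cabinet and the utilisation of the chosen protection scheme.
    static int cabinetsNeeded(double usablePb, double rawPbPerCabinet, double utilisation) {
        return (int) Math.ceil(usablePb / (rawPbPerCabinet * utilisation));
    }

    public static void main(String[] args) {
        double usablePb = 10.0;        // hypothetical target usable capacity
        double rawPbPerCabinet = 2.0;  // hypothetical raw capacity per cabinet
        System.out.println("At 66% utilisation: "
                + cabinetsNeeded(usablePb, rawPbPerCabinet, 0.66) + " cabinets");
        System.out.println("At 91% utilisation: "
                + cabinetsNeeded(usablePb, rawPbPerCabinet, 0.91) + " cabinets");
        // Higher utilisation (plus denser storage nodes) is what shrinks the
        // cabinet count and the associated OPEX in practice.
    }
}
```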