The vision of a highly flexible smart factory producing customer-specific products at low additional cost and with a short time to market is becoming a reality. Big data is the fuel of this fourth Industrial Revolution. The driving force behind it is the ever-increasing capability to analyze data and interact with cyber-physical systems for commercial gain. This article defines the term in the context of the steel industry and explores challenges as well as potential benefits.


Everything we do is increasingly leaving a digital trace. When we browse the internet or shop online, our location and payment information are tracked and recorded, creating a profile of who we are and what we do. The same is true for the material we produce. During production, a vast amount of data is captured from sensors, generating a digital twin of the physical piece of material. Relating data from individual process steps generates even bigger data sets describing not only the current state but also the entire genealogy of the product. Considering the huge number of products that are manufactured, the amount of data aggregated over a given timeframe is larger than what can be analyzed by humans or commonly used software tools, and this is when the label “big data” is used.

Definition: Big Data — Big data describes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time.


The main benefit to be gained from big data analysis is the detection of patterns and a better understanding of correlations and dependencies, as well as the derivation of predictive models. Applications in the steel industry include, for example, root-cause analysis of defects detected by a surface inspection system at the hot mill, tracing them back to events at the caster. Training artificial intelligence (AI) algorithms on historical big data also enables predictive analytics. Monitoring incoming data in real time can then trigger alarms and allow for corrective action once such a pattern is detected again. High-speed networks and integrated long-term data storage make plantwide big data integration feasible.
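As a minimal sketch of the real-time monitoring idea: the snippet below raises an alarm once a pattern learned from historical data — here simplified to "too many readings above a threshold" — recurs in a rolling window. The window size, threshold and sensor values are hypothetical, chosen only to make the example self-contained.

```python
from collections import deque

def make_pattern_monitor(window_size, threshold, min_hits):
    """Return a monitor that raises an alarm when at least `min_hits`
    readings in the last `window_size` samples exceed `threshold` --
    a stand-in for a defect pattern learned from historical data."""
    window = deque(maxlen=window_size)

    def on_reading(value):
        window.append(value)
        hits = sum(1 for v in window if v > threshold)
        return hits >= min_hits  # True -> trigger corrective action

    return on_reading

monitor = make_pattern_monitor(window_size=5, threshold=80.0, min_hits=3)
alarms = [monitor(v) for v in [70, 85, 90, 60, 88, 75]]
# the alarm fires once three of the last five readings exceed 80.0
```

In a real plant, the "pattern" would be a trained model scoring a stream of sensor values rather than a simple threshold count.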


Structured data is located in a fixed field within a defined record, e.g., in a spreadsheet or a relational database. Order, customer and financial data are examples. As the name suggests, this kind of data is usually stored according to a predefined data model and this kind is also used in traditional data analysis.

Unstructured and semi-structured data, and their analysis, are among the defining characteristics of big data. An estimated 80% of business-relevant information is unstructured. Examples are images, videos, uncategorized websites and documents.

Another way of categorizing is between data that the business currently owns or generates, and therefore has and controls access to, denoted as internal data, and data that is generated and exists outside of the business, denoted as external data. Sales statistics, human resources records and bank account transactions, but also closed-circuit television footage recorded on-premise, are examples of internal data. External data is all data outside of the business, and the amount is almost infinite. It can be either public (anyone can obtain it free of charge with little effort) or private (behind a paywall or restricted access, and usually obtained through a third party). Examples of external data are weather data, social media posts, geolocation and navigation services, as well as government census data.1

The most common data in steel mills is internal and structured: order information, setpoints of equipment and data captured from sensors. Often unstructured data is transformed into structured data. Images (unstructured) from surface inspection systems are analyzed to detect, classify and categorize defects on coils, and stored in relational databases according to a data model (structured).
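The transformation from unstructured images to structured records described above can be sketched as follows. The `DefectRecord` fields, the `classify_image` stand-in and its toy scoring rule are illustrative assumptions, not the actual data model of any inspection system.

```python
from dataclasses import dataclass, asdict

@dataclass
class DefectRecord:
    # structured fields, as a relational table row might define them
    coil_id: str
    defect_class: str
    severity: int
    position_m: float  # position along the strip, in meters

def classify_image(coil_id, raw_pixels):
    """Hypothetical classifier standing in for the surface inspection
    system: turns an unstructured image into a structured record."""
    severity = min(9, sum(raw_pixels) // 100)  # toy scoring rule
    return DefectRecord(coil_id, "scratch", severity, 12.5)

# the resulting dict maps directly onto a relational database row
row = asdict(classify_image("C-1001", [50, 60, 70]))
```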


Challenges for big data analytics are the capture, aggregation, validation, storage and provisioning of large amounts of data. The results of data mining and data analytics improve with the quality of the data, but the larger the data set, the more susceptible it is to flaws.

Every day, live data is captured from credit card transactions, smartphones and fitness trackers that come equipped with location tracking, microphones that allow recording of conversations, cameras to take photos, and videos as well as gyroscopes and biometric sensors.

In manufacturing, the capturing computer system needs to be able to connect to a variety of data sources from different vendors (databases, sensors, programmable logic controllers, etc.). Data validation rules can help keep the data free from flaws and avoid “garbage-in/garbage-out” scenarios in the analytics. Devices for data storage must be large and fast. Such systems have recently become affordable, and capable data warehouses can now be implemented as on-premise solutions. Cloud storage or cloud-on-premise hybrid solutions are other options.2
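A minimal sketch of such data validation rules, assuming hypothetical field names and plausible ranges:

```python
def validate_reading(reading, rules):
    """Apply simple presence/range rules; return a list of violations."""
    errors = []
    for field, (lo, hi) in rules.items():
        value = reading.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

# hypothetical rules for a hot mill sensor feed
RULES = {"temp_c": (0.0, 1600.0), "speed_mpm": (0.0, 1500.0)}

ok = validate_reading({"temp_c": 1450.0, "speed_mpm": 900.0}, RULES)
bad = validate_reading({"temp_c": 2100.0}, RULES)  # out of range + missing field
```

Readings that fail validation would be quarantined or flagged rather than fed into the analytics, keeping "garbage" out of the models.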


Traditional data analytics (data analytics not using big data) usually relies on human expert knowledge in combination with statistical methods. The four characteristics of big data make the analysis of big data vastly different:

  •     Variety (structured, semi-structured and unstructured).

  •     Velocity (batch, streaming and real time).

  •     Volume (terabytes to zettabytes).

  •     Veracity (cleanliness or messiness).

Definition: Big Data Analytics — Big data analytics is the process of analyzing larger data sets with the aim of uncovering useful information and testing models and hypotheses. The results can lead to new revenue opportunities, improved operational efficiency, more efficient marketing and other business benefits.

Definition: Data Mining — Data mining is the process of analyzing data from different viewpoints and summarizing it into useful information. Techniques include detecting abnormalities in records, cluster analysis of data files and sequential pattern mining, using machine learning, statistical models, A/B testing (also known as split testing), deep learning, natural language processing, and image/video analytics to uncover hidden patterns.
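Cluster analysis, one of the techniques named in the definition above, can be illustrated with a minimal one-dimensional k-means. The deterministic quantile initialization and the toy gauge readings are assumptions made purely for a self-contained, reproducible example.

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: group scalar measurements into k clusters."""
    svals = sorted(values)
    # deterministic quantile-based initialization (an assumption for
    # reproducibility; real implementations often use random restarts)
    centers = [svals[(len(svals) * (2 * i + 1)) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # move each center to the mean of its assigned group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

# two well-separated groups of toy gauge readings
data = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.3, 9.9]
centers = kmeans_1d(data, k=2)  # converges near the two group means
```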


Visualization of the results of data mining and analytics helps in understanding the insights they create. Reports based on traditional data analytics use line charts, pie charts, scatterplots and heat maps. To visualize results from big data, a vast variety of software tools supporting all kinds of charts has been created. One now-established method is the implementation of management dashboards: concise, decision-supporting user interfaces that display all mission-critical information.


  1.     B. Marr, Big Data: Using SMART Big Data, Analytics and Metrics to Make Better Decisions and Improve Performance, John Wiley & Sons, 2015.
  2.     F. Provost and T. Fawcett, Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, 1st Ed., O’Reilly Media Inc., 2013.
  3.     M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable and Maintainable Systems, O’Reilly Media, 2017.



Big data refers to the handling of extremely large data sets, which require a scalable architecture to make the storage, manipulation and analysis of this information efficient. 

Analytics is the systematic computational analysis of data or statistics. Organizations may apply analytics to business data to describe, predict and improve business performance. Specifically, areas within analytics include predictive analytics, prescriptive analytics, enterprise decision management, descriptive analytics, cognitive analytics and big data analytics.

These data come from countless sources: smartphones and social media posts, sensors, point of sale terminals, cameras, computers and programmable logic controllers (PLCs), among others.

Great potential and opportunities for different industrial sectors are hidden in these data. In-depth analysis can provide companies with a wealth of information that gives them a competitive advantage and improves their decision-making. To access these benefits, companies need qualified professionals with the skills to extract valuable information, in the process known as data mining.

Big data is characterized by what is known as the 3 Vs:

  1.  Volume. With the current growth in data generation, it is estimated that by 2025 the digital universe will reach 175 zettabytes; that is, 175 followed by 21 zeros. The main challenge with data volume is not so much storage, but how to identify relevant data within giant data sets and make good use of them.

  2.  Velocity. Data is being generated at an increasingly rapid rate. The challenge for data scientists is to find ways to collect, process and use large amounts of data.

  3.  Variety. The data comes in different forms, mainly classified as structured and unstructured. Structured data is data that can be arranged in an orderly manner within the columns of a database. This type of data is relatively easy to enter, store, consult and analyze. Unstructured data is more difficult to sort and extract value from. Examples of unstructured data include emails, social media posts, word processing documents, audio, video and photo files, webpages, etc.


The term “big data” first appeared around 2005, when it was coined by O’Reilly Media. However, the use of big data and the need to understand all available data have been around for much longer.

In 2006, Hadoop was created at Yahoo!, building on Google's MapReduce; its goal was to index the entire World Wide Web. Today the open-source Hadoop framework is used by many organizations to process large amounts of data.


Hadoop is an open-source framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any type of data, enormous processing power, and the ability to handle virtually unlimited tasks or jobs. Among its characteristics are:

  • The ability to store and process large amounts of any type of data quickly. With constantly increasing volumes and variety of data, especially when it comes to social media and the Internet of Things, this is a key consideration.

  • Processing power. Hadoop's distributed computing model processes big data quickly: the more computing nodes are used, the more processing power is available.

  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure that distributed computing doesn’t fail. Multiple copies of all data are automatically stored.

  •  Flexibility. Unlike traditional relational databases, there is no need to pre-process the data before storing it. The user can store as much data as they like and decide how to use it later. This includes unstructured data such as text, images and video.

  • Compatibility. Different open-source computing frameworks, such as Apache Spark, can run on top of it; Spark's developers present it as “a fast and general engine for large-scale data processing.”

  • Low cost. The open-source framework is free and uses commodity hardware to store large amounts of data.

  • Scalability. The user can easily grow the system to handle more data simply by adding nodes. Little administration is required.
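The map/shuffle/reduce phases that Hadoop distributes across a cluster can be sketched in miniature. This single-process Python version only illustrates the programming model (counting hypothetical defect classes); the framework's value lies in running these same phases across many nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) pairs -- here, one per defect occurrence."""
    for record in records:
        yield record["defect"], 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's values into a final result."""
    return {key: sum(values) for key, values in grouped.items()}

records = [{"defect": "scratch"}, {"defect": "hole"}, {"defect": "scratch"}]
counts = reduce_phase(shuffle(map_phase(records)))  # {"scratch": 2, "hole": 1}
```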


Industrial sensors allow large amounts of information to be obtained from industrial processes. The following case is a development for a steel company that sought to create machine-learning models for the prediction of defects in the product at the exit of a certain line. For this project, 18 months of historical information was required from different sources, including data collected by the plant supervisory control and data acquisition (SCADA) system, process logs, Level 2 systems, personal digital assistants and other systems. The total number of variables considered was in excess of 700, so a big data work scheme was proposed as the best way to do the analysis.

The architecture used is shown in Fig. 1. A cluster of three machines was used in a parallel processing environment. Apache Spark was used for the analysis that led to the selection of variables and information necessary for the training of machine learning models.
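The Spark-based variable selection can be illustrated conceptually with a plain-Python correlation ranking. The variable names and values here are invented, and a real job over 700+ variables would run distributed via Spark's APIs rather than in a single process.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_variables(columns, target, top_n):
    """Rank candidate variables by |correlation| with the defect label."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# invented example columns; a real run would cover 700+ variables
columns = {
    "furnace_temp": [1, 2, 3, 4, 5],
    "line_speed":   [5, 1, 4, 2, 3],
}
defect = [0, 0, 1, 1, 1]  # toy defect label per sample
best = rank_variables(columns, defect, top_n=1)
```

Correlation ranking is only one simple filter; in practice, variable selection typically combines several statistical and model-based criteria.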

The selected variables and the corresponding data were used to train a machine-learning system that predicts defects on the line. The variables used in the analysis come both from an area upstream of the line and from the line itself.

The majority (70%) of the data was used to train the AI system, leaving a portion (30%) to test the model with data that was not part of the learning and tuning of the system.
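The 70/30 split might be sketched as follows, assuming a seeded shuffle for reproducibility; for process data with a time dimension, a chronological split is often preferred to avoid leaking future information into training.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle and split the history into training and hold-out test sets."""
    shuffled = rows[:]  # copy so the original order is preserved
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 18 months of coil records
train, test = train_test_split(rows)  # 70 training rows, 30 test rows
```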

A basic conceptual schema of this process is shown in Fig. 2.

One of the key aspects is to pair business or process experts with data scientists; that combination improves the chances of reaching better results.











Michael F. Peintinger
Managing Director, QuinLogic (SMS group), Cincinnati, Ohio, USA

Ed LaBruna
Partner, Janus Automation, Bridgeville, Pa., USA