RAULI POIKELA, SENIOR BUSINESS CONSULTANT, September 6, 2021
In modern systems, detailed monitoring of the supply chain is essential. In the real world, however, we cannot rely on modern systems alone. In complex and large environments, monitoring is an art that saves money and time by reducing disruptions and speeding up the resolution of any issues.
In logistics, the interface between an information system and the physical world culminates in the waybills needed for loading. If the relevant waybills are not available for an order, the order can’t be loaded. But to print out the proper documents, you don’t just need a working printer; you need the correct information at the right time from your sales, warehousing, and logistics systems. If the cargo documents fail to be released, trucks will not start their journey, working days will be stretched out, and deliveries will become tangled.
The problem may lie in the data of an individual truck, in the systems of a particular terminal, or in a nationwide disruption. When documents cannot be printed, a race against time begins. Action needs to be taken quickly because, in extensive operations, unplanned supply disruptions cause huge problems. In a worst-case scenario, the damage can run to hundreds of thousands of euros.
The monitoring of faults is often limited to a single system and its functionality. When an interruption is detected in a chain, and a vital process no longer works, looking for the problem often means checking several systems, one by one. This takes a tremendous amount of time. Sometimes no fault is found in any of the systems, but the process still does not work.
While analyzing major system failures in a logistics operations environment last year, we discovered that, on top of the time needed to correct a fault, it took an unacceptable amount of precious time to determine where exactly the fault was located. It took just as long to determine the extent of the fault, whether it was a minor issue or a substantial disruption, which delayed the alert. Fortunately, such situations can be significantly improved by monitoring the state of business processes throughout a customer’s information system.
With such monitoring, faults can be automatically detected before they are visible to users, thus preventing large-scale disturbances. Not every potential fault in a complex environment can be eliminated, but effective monitoring significantly reduces the time spent on detection and repair work.
The challenges of implementing effective interruption monitoring in a complex environment are obvious. There are many different systems, and their degrees of maturity vary. Some systems might be nearing the end of their life cycle, and their renewal is underway, but it is slow and meticulous work. Old systems often work properly despite their peculiarities because the most significant faults have already been fixed over the years. Some systems have quirks of their own; for example, they may only work on specific servers, which brings its own challenges.
The challenge of implementing monitoring is to find the flaws in the process without affecting the process itself or changing the systems involved.
The solution we developed uses an existing product and tools to deploy tracking agents. These agents collect data into a Centralized Controller, where the data can be further processed and analyzed to trigger warnings and alerts.
Figure: Central Management is a SaaS component implemented as a cloud service where the collected data is stored. Different types of agents allow data to be collected from the available sources.
Ensuring that a process works by monitoring individual error situations is very laborious as it is difficult, if not impossible, to prepare for every case. However, monitoring messaging traffic has proven to be an effective way to detect system problems.
Problems with the receiving system are easily identified: incoming traffic gets stuck or ends up in an error. The tracking system monitors the contents of key directories and detects when files accumulate in them. Ten messages waiting in the folder at peak hours of the workday is usually not a sign of a problem, but in the middle of the night, it is cause for alarm.
With the help of our tools’ statistical algorithms, we can consider the typical volumes of business traffic on different days of the week and at various times of the day. This way, single deviations are ignored, and we only deal with actual error situations. For example, we can determine that a minimum of four departures from the standard baseline is cause for alarm and that alarming situations must meet a certain threshold, based on the most recent measurements, to require a response.
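The idea of comparing a folder's backlog with what is typical for that weekday and hour can be illustrated with a minimal Python sketch. It is not the product's actual algorithm: the class name `BacklogMonitor`, the three-standard-deviation band, and the six-measurement window are assumptions made for illustration. Only the "at least four deviations" rule is taken from the example above.

```python
from collections import defaultdict, deque
from datetime import datetime
from statistics import mean, pstdev


class BacklogMonitor:
    """Flags a watched folder when its backlog is unusual for the weekday and hour."""

    def __init__(self, sigmas=3.0, window=6, min_deviations=4):
        self.sigmas = sigmas                  # assumed width of the "normal" band
        self.min_deviations = min_deviations  # deviations needed before alerting
        self.history = defaultdict(list)      # (weekday, hour) -> past file counts
        self.recent = deque(maxlen=window)    # deviation flags of the latest samples

    def learn(self, when, count):
        """Record a file count observed during normal operation."""
        self.history[(when.weekday(), when.hour)].append(count)

    def is_deviation(self, when, count):
        """True when count is above the typical level for this time slot."""
        samples = self.history[(when.weekday(), when.hour)]
        if len(samples) < 2:
            return False  # not enough data to judge this slot
        # Use at least one file of slack so very quiet slots are not oversensitive.
        limit = mean(samples) + self.sigmas * max(pstdev(samples), 1.0)
        return count > limit

    def observe(self, when, count):
        """Feed a new measurement; return True when an alert should be raised."""
        self.recent.append(self.is_deviation(when, count))
        return sum(self.recent) >= self.min_deviations


# Four Mondays of history: busy at 14:00, nearly empty at 03:00.
monitor = BacklogMonitor()
for week in range(4):
    monitor.learn(datetime(2021, 9, 6 + 7 * week, 14), 10 + week)
    monitor.learn(datetime(2021, 9, 6 + 7 * week, 3), week % 2)
```

With this history, ten waiting files at 14:00 on a Monday stay within the normal band, while ten files at 03:00 count as a deviation. A single deviation does not raise an alert; only repeated ones within the recent window do.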
Problems with the sending system can be detected by identifying a lack of traffic. Here, the collection of statistics on the normal state of traffic is essential. When we identify sent messages, whether based on archive folders, log files, or database rows, our tool determines the base level of traffic. This allows us to then identify situations where there is no traffic when there should be.
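Detecting missing traffic can be sketched in the same spirit. The functions below are hypothetical and only illustrate the principle: a baseline of sent messages per weekday and hour is built from historical timestamps (which could come from archive folders, log files, or database rows), and silence is treated as suspicious only in time slots where traffic is normally present. The thresholds `min_expected` and `max_gap` are assumed values.

```python
from collections import Counter
from datetime import datetime, timedelta


def build_baseline(sent_times, weeks):
    """Average messages per (weekday, hour) slot over `weeks` weeks of history."""
    counts = Counter((t.weekday(), t.hour) for t in sent_times)
    return {slot: n / weeks for slot, n in counts.items()}


def is_suspiciously_silent(baseline, last_sent, now,
                           min_expected=5.0, max_gap=timedelta(minutes=30)):
    """True when this slot normally has traffic but nothing was sent recently."""
    expected = baseline.get((now.weekday(), now.hour), 0.0)
    if expected < min_expected:
        return False  # quiet slot: silence is normal here
    return now - last_sent > max_gap


# Four Mondays with ten messages each between 09:00 and 09:45.
history = [
    datetime(2021, 9, 6, 9, minute * 5) + timedelta(weeks=week)
    for week in range(4)
    for minute in range(10)
]
baseline = build_baseline(history, weeks=4)
```

The quiet-slot check is what keeps a silent night from raising an alarm while a silent Monday morning does.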
The key to identifying the severity of an error situation is the classification of the errors. Inevitable mistakes happen here and there: someone’s fingers may have slipped on the keyboard, and now, they are trying to sell a non-existent product or a product that is not ready to be shipped. In some older systems, it is difficult to distinguish these types of errors from system errors, such as faulty message formats or an unresponsive database. We can solve this issue by collecting the error frequency rate and setting a baseline for it. Once the baseline has been determined based on a set period of data, it is easy to detect abnormal peaks in the error rate and get the relevant parties to review the situation.
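As a rough illustration of the error-rate baseline, the following sketch flags measurement periods whose error rate rises well above the level seen during the baseline period. The function names and the three-standard-deviation limit are assumptions for this example, not part of the tooling described here.

```python
from statistics import mean, pstdev


def error_rate(errors, total):
    """Share of failed messages in one measurement period."""
    return errors / total if total else 0.0


def find_error_peaks(baseline_rates, new_rates, sigmas=3.0):
    """Return the indexes of new_rates that exceed the baseline limit."""
    limit = mean(baseline_rates) + sigmas * pstdev(baseline_rates)
    return [i for i, rate in enumerate(new_rates) if rate > limit]


# Baseline period: a few percent of orders fail because of ordinary user mistakes.
baseline_rates = [error_rate(2, 100), error_rate(3, 100),
                  error_rate(2, 100), error_rate(3, 100)]
# New periods: the second one has far more errors than user mistakes explain.
new_rates = [error_rate(3, 100), error_rate(25, 100), error_rate(2, 100)]
peaks = find_error_peaks(baseline_rates, new_rates)
```

The everyday user errors are absorbed into the baseline, so only the period with an abnormal share of failures is singled out for review.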
The end-user rarely finds out immediately where and why a process was interrupted. Sometimes, the cause can be an error in the user’s own data; sometimes, it is a system error. To alleviate the situation, we have created an “order tracking” solution that supports the user by providing visibility into the state of the process and the stage where it was interrupted. This enables the user to understand whether the issue is a user-specific problem or a broader failure.
Our solution identifies which messages are relevant to the process and configures the middleware software to produce copies of them. This allows us to freely process messages when modeling event chains without compromising the business process itself.
Once the process’s phases, messages, and disturbances have been modeled in this way, and the relevant events have been collected in the Central Management, they can also be presented through analytic tools, bringing a new kind of visibility and understanding to the whole.
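The order tracking idea can be illustrated with a simplified model of an event chain. The stage names, the event format, and the `min_orders` threshold below are hypothetical; a real process would have its own phases and message types. The sketch takes the copied messages for one order and reports the stage at which the order stopped, then checks whether several orders are stuck at the same stage, which hints at a broader failure rather than a user-specific problem.

```python
from collections import Counter

# Hypothetical process stages, in the order they are expected to occur.
STAGES = ["order_received", "picked", "waybill_created", "waybill_printed"]


def order_status(events):
    """events: (stage, ok) tuples derived from copied messages for one order."""
    completed = []
    for stage in STAGES:
        outcome = next((ok for name, ok in events if name == stage), None)
        if outcome is None:
            return {"completed": completed, "stopped_at": stage,
                    "reason": "no message seen"}
        if not outcome:
            return {"completed": completed, "stopped_at": stage,
                    "reason": "error reported"}
        completed.append(stage)
    return {"completed": completed, "stopped_at": None, "reason": None}


def systemic_stages(statuses, min_orders=3):
    """Stages where at least `min_orders` orders are stuck at the same time."""
    stuck = Counter(s["stopped_at"] for s in statuses if s["stopped_at"])
    return {stage for stage, n in stuck.items() if n >= min_orders}


status = order_status([("order_received", True), ("picked", True)])
```

Because the model works on copies of the messages, a view like this can be recalculated or extended without touching the business process itself.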
Thanks to this solution, we have managed to reduce serious incidents and false alarms. It has also made it possible to use common views and alerts and exchange data between different actors when dealing with problem situations. This enables the smooth cooperation of a network of established contacts and significantly reduces the resources spent on troubleshooting.
We monitor business processes where the need is the greatest. As this type of environment has been in operation since computers were first introduced to the industry, much needs to be done. A typical case is iterative. First, we need to understand what is happening, what key issues we need to monitor, and how to watch them. Test environments rarely have enough real-life information, so we usually have to work directly in the production environment.
After a preliminary analysis, we implement the prototype to obtain the most critical measurements. Once the measurements work, we start getting information about traffic flow, and primary data collection begins. Next, we need to define the triggers: what is expected and what is unusual? What kind of errors require examining the system, and what level signifies that something is badly wrong?
Finding the right balance is difficult. When the mailboxes and phones of an on-duty team are flooded with unnecessary warnings, the real alarms drown in the flood. On the other hand, raising the alarm threshold too high may cause real dangers to be missed. Our on-duty team regularly discusses the latest alerts and close calls, and changes the rules, if needed, to ensure the best response.
Through tracking, we have been able to look at the state of business processes with new eyes and have learned to understand them better. We find problems now and then: messages that no one needs, folders that collect messages just to fill up disk space, or message receivers that act suspiciously slowly. False positives must be removed to keep monitoring reliable. Often, our actions lead to correcting all kinds of small problems, making the customer environment more stable and unambiguous.
Needless to say, the development of business process monitoring will never be fully complete. Process monitoring is not just about the tools to be installed and configured. We need to understand the processes and analyze their systems to identify the best methods and places to study the message traffic. After all, the situation is constantly evolving. Processes change over time, as do the underlying systems. Modern systems will replace old ones and, with that, monitoring must change as well. The most important thing is to stay up to date.