Article dated 03/04/2023 - Written by Di Muro Francesco
Knowing and using the right tools for a job is essential to guarantee a correct result and to make the work easier for whoever carries it out.
Let's take the processing of a text file as an example:
if the file contains little information, you can process it manually, extracting, reformatting and modifying the information needed to produce the desired result.
But what happens when the file contains a large amount of complex information, and reformatting it by hand would take far too long?
This is why it is important to know which tools to use, based on the result we are looking for.
Let's talk about a practical example.
A few days ago, a customer had a malfunction in the services responsible for logging alarms and events in their monitoring system; these logs are essential for analyzing anomalies and keeping track of what happens in the system.
Fortunately, text-file logging of alarms and events was enabled, so the system kept recording what was happening during the anomaly, which lasted approximately three days.
To get this data into the alarm and event database, it was necessary to extract the useful information and format it according to specific structures, so that the resulting output file could be recognized by the service responsible for importing it.
Text file before formatting
Which tools were used to make this process as simple and fast as possible and, above all, to produce an optimal result?
First of all, when dealing with large data sets, in any field, it is important to apply a process that eliminates useless data, reducing the size of the data set and allowing us to work with "clean" data: data cleaning.
In this specific case, data cleaning was applied to remove all the records (alarms and events) that had no use within our dataset. There are a number of tools for data cleaning, each suited to a particular scenario.
Since I had to work with a text file in which each line corresponds to a record with a well-defined structure, I thought of using a tool as powerful as it is intimidating, due to its seemingly cryptic syntax: regular expressions.
Thanks to regular expressions, I was able to remove all the data that was not critical for an audit trail or historical alarm report, leaving a smaller and "cleaner" dataset to work with.
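As a rough illustration, a minimal Python sketch of this kind of regex-based cleaning might look like the following; the line format and the keywords to keep are assumptions, not the actual patterns used on the customer's file.

```python
import re

# Assumed keywords: keep only the lines relevant for an audit trail
# or historical alarm report, and drop everything else.
KEEP_PATTERN = re.compile(r"\b(ALARM|EVENT|ACK)\b", re.IGNORECASE)

def clean_log(input_path, output_path):
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Skip blank lines and lines that do not match the keywords
            if line.strip() and KEEP_PATTERN.search(line):
                dst.write(line)

clean_log("raw_log.txt", "cleaned_log.txt")
```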
Data cleaning with regular expressions
Using a feature of Notepad++, a text editor with advanced features compared to Notepad (the default text editor in Windows), I was able to remove all the empty lines of the file (the lines containing useless data) in a single click, reducing the total number of rows from 1011 to 13.
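For reference, in Notepad++ this can be done via Edit > Line Operations > Remove Empty Lines, or with a regular-expression replacement along the lines of ^[ \t]*\r?\n replaced with nothing (the exact pattern depends on the file's line endings).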
Text file after the first regular expression for data cleaning
Removing trailing spaces at the end of each line
Replacing two or more spaces with a single tab for the timestamp
Timestamp formatting
Highlighting of acknowledged alarms, capturing information about who acknowledged the alarm and the alarm tag
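The screenshots above show these steps as Notepad++ find-and-replace operations; a rough Python equivalent, with assumed patterns and an assumed timestamp format, could be sketched like this:

```python
import re

def format_line(line):
    # Remove trailing spaces at the end of the line
    line = line.rstrip()

    # Replace runs of two or more spaces with a single tab,
    # so that the timestamp becomes its own column
    line = re.sub(r" {2,}", "\t", line)

    # Reformat the timestamp (assumed source format DD/MM/YYYY HH:MM:SS)
    line = re.sub(
        r"(\d{2})/(\d{2})/(\d{4}) (\d{2}:\d{2}:\d{2})",
        r"\3-\2-\1 \4",
        line,
    )
    return line

# Capture who acknowledged an alarm and the alarm tag
# (the wording of the acknowledgement message is an assumption)
ACK_PATTERN = re.compile(r"Alarm (?P<tag>\S+) acknowledged by (?P<user>\S+)")
```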
Once the data cleaning of the text file was done, I applied another principle common to many programming environments (especially in Data Science): data formatting, the practice of defining the structure of data in a database or other "container", within which the data can be described in more detail.
There are a number of tools for data formatting as well, and in this case I chose one of the most widely used tools for giving shape to data: Microsoft Excel.
Thanks to the table-like structure of its worksheets, I was able to view the information to be processed in a more structured and cleaner way, and to define the steps for processing the file.
Data formatting in Microsoft Excel
After quickly separating the alarms from the events, I was able to arrange the necessary information according to the structures required by the service responsible for importing the files into the alarm and event database, generating two files in CSV format (one for alarms, one for events).
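The split itself was done in Excel; purely as an illustration of the same step in code, and assuming the record type sits in the third column of the cleaned file, a Python sketch could look like this:

```python
import csv

# Assumed input: the cleaned, tab-separated file produced earlier,
# with the record type (alarm or event) in the third column.
with open("cleaned_log.txt", encoding="utf-8") as src, \
     open("alarms.csv", "w", newline="", encoding="utf-8") as alarms_out, \
     open("events.csv", "w", newline="", encoding="utf-8") as events_out:
    reader = csv.reader(src, delimiter="\t")
    alarm_writer = csv.writer(alarms_out, delimiter=";")
    event_writer = csv.writer(events_out, delimiter=";")
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        if "ALARM" in row[2].upper():
            alarm_writer.writerow(row)
        else:
            event_writer.writerow(row)
```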
File formatted in Microsoft Excel
Since the CSV format depends on the regional settings of the operating system (at least on Windows), I decided to transform the CSV file into an XML file, whose structure remains human-readable but does not have strict rules like the CSV format.
Alarm CSV file
Alarm XML file
For this task, I decided to use the Python programming language, whose simple syntax and multitude of methods available for each data type allowed me to transform a CSV file into an XML file in a few lines of code.
Python script for converting file from CSV to XML
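The actual script is the one shown in the screenshot; as a minimal sketch of a comparable CSV-to-XML conversion using only the Python standard library (the file names, XML tags, delimiter and the assumption of a header row are illustrative), it could look like this:

```python
import csv
import xml.etree.ElementTree as ET

def csv_to_xml(csv_path, xml_path, root_tag, record_tag):
    root = ET.Element(root_tag)
    with open(csv_path, newline="", encoding="utf-8") as src:
        # ';' as delimiter because of the regional settings mentioned above;
        # the CSV is assumed to have a header row
        reader = csv.DictReader(src, delimiter=";")
        for row in reader:
            record = ET.SubElement(root, record_tag)
            # One child element per CSV column
            # (column headers are assumed to be valid XML tag names)
            for column, value in row.items():
                ET.SubElement(record, column).text = value
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

csv_to_xml("alarms.csv", "alarms.xml", "Alarms", "Alarm")
```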
Once this process was completed, I imported the XML file into the alarm and event database through dedicated services, restoring the data that had not been historicized and allowing the customer to generate the reports needed for their business purposes.
Historical Alarm Report containing imported data