The response to these drawbacks is the proposal of interconnecting clouds, whose collaboration would ensure availability of services by aggregating idle resources of partners. While the approach improves the availability and enhances scalability, it also lacks standardization and introduces high heterogeneity and complexity. A problem of one component is further propagated to the entire system and might lead to different types of failures. Detecting issues and finding their source represent a challenging task to be fulfilled by a robust monitoring solution, which needs to observe if the execution of system components follows their blueprint. The correctness of the monitors influences the reliability of the whole system. False notifications can trigger unnecessary adaptation plans, while ignoring faulty execution can lead to larger downtime and misbehavior. ^The main contribution of the thesis is the elaboration and evaluation of a formal specification of a monitoring solution for Cloud-Enabled Large-Scale Distributed Systems (CELDS), which is responsible for identifying failures of system components and reporting them to the adaptation services. I address the problem of monitoring failures of the aforementioned systems with a specific interest on crash and performance failures. I acknowledge the possibility of misbehavior of the monitoring component itself and therefore emphasize the need of redundancy, reflected in the structure of the monitoring services. By following a formal approach for elaborating my solution, I target the subsequent main objectives: to define a solid structure for the monitoring layer able to detect problems of the execution nodes and to overcome own failures, to elaborate a formal model for the monitoring processes, to validate and verify the model according to a rigorous software engineering method. While there are many implementations of monitoring solutions, I focus on the analysis and design phase of the software systems development life cycle and on the delivery of a sound model. Formal methods for CELDS are still not well documented, leading to the request of improving this aspect and enhancing the current state-of-the-art in the area. The preciseness and correctness of the models are essential especially for critical systems deployed in CELDS. My work focuses on two main directions. Firstly, monitors carry out specific processes (ping requests, data collection and analysis, logging) to identify abnormal execution of system components. The information to collect and process is correlated with a small monitoring ontology which serves as the knowledge base component of my approach. If a problem is detected by the monitors, the monitoring layer will submit the issue and the corresponding information to the adaptation component, which would execute reconfiguration plans. The workflow of the monitors is expressed in terms of control state diagrams and follows closely the requirements of the system. Secondly, I propose a set of design and control decisions to overcome misbehavior of the monitoring services. Each node of the system is assigned a set of monitors, whose number depends on how critical the system is, but which is higher than a minimum imposed value. Each set of monitors is assigned a leader component to coordinate them. The leader handles issues reported by monitors and consults them all for building a collaborative assessment. Monitors are assigned a confidence value, which reflects their accuracy. The deviation of the data provided by a monitor in comparison with the information provided by the other monitors reflects its correctness. Thus, the higher the number of false reports or inaccurate data submitted, the lower the accuracy. When the value of the confidence degree of a monitor falls below a minimum set value, the monitor stops its executions and is reinitialized by the system. The goal of these operations is to increase the reliability of the monitoring solution. The dynamic setting of CELDS and their high complexity and heterogeneity represent an important challenge for following a formal approach. Among the existing techniques, I chose the Abstract State Machine (ASM) method to specify the monitoring related processes. The proposal follows the requirements of the system and translates them to ground models and control state diagrams. The multi-agent specification is then further refined to encompass more information. The model is validated through a set of scenarios and its properties are verified with the aid of NuSMV model checker integrated by the ASMETA toolset.