Fondazione GRINS
Growing Resilient,
Inclusive and Sustainable
Galleria Ugo Bassi 1, 40121, Bologna, IT
C.F/P.IVA 91451720378
Funded by the National Recovery and Resilience Plan (PNRR), Mission 4 (Infrastructure and Research), Component 2 (From Research to Enterprise), Investment 1.3 (Extended Partnerships), Theme 9 (Economic and financial sustainability of systems and territories).



Open Access
Data quality is a critical prerequisite for the effective use of Energy Performance Certificates (EPCs), known in Italian regulation as Attestati di Prestazione Energetica (APE), for research, monitoring activities, and policy evaluation. Large administrative EPC datasets often exhibit heterogeneous quality, with potential inaccuracies or inconsistencies that may compromise subsequent analyses.
To address this issue, we introduce a data quality control framework designed to evaluate the reliability of each certificate through a comprehensive set of checks applied at the individual-record level. The outcome of this process is a global penalty index that quantifies the extent of detected issues: higher values denote poorer data quality. This approach offers a systematic and reproducible way to summarize data reliability and to support the identification of certificates that warrant correction or additional review.
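The aggregation into a global penalty index can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the check names, weights, and field names are all hypothetical, and the only assumption taken from the text is that each failed check increases the record's score.

```python
# Hypothetical sketch of a record-level penalty index: each check returns
# True (pass) or False (fail), and failures accumulate illustrative weights.
# Check names, weights, and field names are assumptions for illustration.

def penalty_index(record, checks, weights):
    """Sum the weights of all failed checks for a single certificate."""
    score = 0.0
    for name, check in checks.items():
        if not check(record):  # check returns True when the record passes
            score += weights.get(name, 1.0)
    return score

# Two toy checks on a simplified EPC record.
checks = {
    "floor_area_positive": lambda r: r["floor_area_m2"] > 0,
    "epi_in_range": lambda r: 0 < r["energy_perf_kwh_m2y"] < 1000,
}
weights = {"floor_area_positive": 2.0, "epi_in_range": 1.0}

record = {"floor_area_m2": -5, "energy_perf_kwh_m2y": 120}
print(penalty_index(record, checks, weights))  # 2.0 — only the area check fails
```

A higher score simply means more (or more heavily weighted) checks failed; a record passing every check scores zero.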
Within the proposed framework, we distinguish between two broad families of data quality checks. The first group consists of deterministic, rule-based validation checks, which rely on explicit logical constraints derived from the regulatory definition of EPCs and from basic physical relationships. These checks verify that required fields are present, that reported values lie within admissible domains, and that predefined relationships between fields are exactly satisfied. The second group comprises statistical, data-driven diagnostics, which assess the plausibility of the reported information in relation to the empirical distributions and multivariate patterns observed in the dataset. These diagnostics are designed to flag records that are formally admissible according to the deterministic rules but appear highly atypical or improbable, and are therefore at increased risk of measurement, coding, or transcription errors. Taken together, deterministic and statistical checks provide complementary perspectives on data quality, enabling a more comprehensive characterization of potential issues at the level of each individual certificate.
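The three kinds of deterministic rules described above (required fields present, values within admissible domains, exact relationships between fields) can be illustrated with a short sketch. Field names, the admissible class labels, and the consumption decomposition are hypothetical examples, not the framework's actual rule set.

```python
# Illustrative deterministic checks on a single EPC record. All field names
# and rules are assumptions chosen to mirror the three rule families:
# presence, admissible domain, and exact relationships between fields.

REQUIRED_FIELDS = ["building_type", "floor_area_m2", "energy_class"]
VALID_CLASSES = {"A4", "A3", "A2", "A1", "B", "C", "D", "E", "F", "G"}

def check_required(record):
    """Presence check: every required field must be filled in."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def check_domain(record):
    """Domain check: the energy class must be an admissible label."""
    return record.get("energy_class") in VALID_CLASSES

def check_relationship(record):
    """Relationship check: total consumption must equal the sum of its
    reported components (here: heating + hot water), up to rounding."""
    parts = record.get("heating_kwh", 0.0) + record.get("hot_water_kwh", 0.0)
    return abs(record.get("total_kwh", 0.0) - parts) < 1e-6

record = {"building_type": "residential", "floor_area_m2": 80.0,
          "energy_class": "H",  # not an admissible label
          "total_kwh": 100.0, "heating_kwh": 70.0, "hot_water_kwh": 30.0}
print([c(record) for c in (check_required, check_domain, check_relationship)])
# [True, False, True]
```

Each rule is binary and auditable: a failure points to a specific field or relationship, which is what makes this family of checks directly interpretable.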
Within both the deterministic and statistical components, we distinguish between univariate and multivariate checks. Univariate checks focus on individual fields, such as detecting values outside admissible ranges or identifying unusually extreme observations relative to the empirical distribution. Multivariate checks evaluate the joint coherence of two or more variables, for example verifying mutually exclusive categories that should admit only one valid option, or detecting atypical combinations that deviate from expected multivariate patterns.
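The contrast between the two statistical layers can be sketched on synthetic data. This is an assumed illustration, not the framework's actual diagnostics: a z-score threshold stands in for the univariate layer, and a Mahalanobis distance for the multivariate one, showing how a combination of fields can be atypical even when each value is plausible on its own.

```python
import numpy as np

# Synthetic dataset (assumed for illustration): floor area and annual
# energy use, positively correlated as one would expect in EPC data.
rng = np.random.default_rng(0)
area = rng.normal(100, 20, 1000)
energy = 50 * area + rng.normal(0, 500, 1000)
X = np.column_stack([area, energy])

def univariate_flags(x, z_thresh=3.0):
    """Univariate layer: flag values with an extreme z-score."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > z_thresh

def multivariate_flags(X, d_thresh=3.5):
    """Multivariate layer: flag rows far from the bulk of the data
    in Mahalanobis distance, which accounts for correlation."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return d > d_thresh

# A small dwelling with high consumption: each value is individually
# plausible, but the combination deviates from the joint pattern.
suspect = np.array([[70.0, 7000.0]])
print(univariate_flags(np.append(area, 70.0))[-1])     # False
print(multivariate_flags(np.vstack([X, suspect]))[-1])  # True
```

The design point is that the multivariate layer catches exactly the records the univariate layer cannot: the suspect area and consumption are each well within their marginal ranges, but their combination lies far from the observed area-consumption relationship.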
The underlying rationale of the proposed system is to examine each certificate through multiple complementary perspectives, enabling the detection of a wide range of data quality issues. Simple rule-based checks are designed to capture explicit inconsistencies, such as violated logical constraints or values falling outside admissible domains, that typically signal coding errors or clearly implausible entries. In parallel, both univariate and multivariate statistical diagnostics are used to identify more subtle anomalies that cannot be revealed through logical rules alone. These include values that are individually plausible but highly atypical in the context of the empirical distribution, as well as combinations of fields that deviate from expected multivariate patterns. By integrating these different layers of validation, the framework is capable of identifying issues that span from straightforward incongruities to complex, hard-to-detect anomalies arising from the joint behavior of multiple fields.
It is important to emphasize that the resulting penalty score does not represent a deterministic measure of data quality. Because the framework combines rule-based validations with statistical diagnostics, the score should be interpreted as an indicator of the likelihood that a certificate may contain inaccuracies, rather than a definitive classification of records as valid or invalid. Deterministic checks contribute directly to the score by flagging explicit inconsistencies, whereas statistical checks reflect deviations from empirical patterns that may arise from genuine heterogeneity as well as from potential errors. As a consequence, higher scores identify certificates that warrant closer examination, but they do not imply that the underlying data are necessarily incorrect. The penalty score therefore functions as a screening tool that prioritizes potentially problematic records and supports more informed decisions on data cleaning, verification, or subsequent exclusion.
ACKNOWLEDGEMENTS
This study was funded by the European Union - NextGenerationEU, in the framework of the GRINS - Growing Resilient, INclusive and Sustainable project (GRINS PE00000018). The views and opinions expressed are solely those of the authors and do not necessarily reflect those of the European Union, nor can the European Union be held responsible for them.