Applications of machine learning in animal and veterinary public health surveillance

slaughtering or the mining of free text in electronic health records from veterinary practices for the purpose of sentinel surveillance. However, ML is also being applied to tasks that had usually been tackled with traditional statistical data analysis. Statistical models have extensively been used to infer relationships between predictors and disease to inform risk-based surveillance and, increasingly, ML algorithms are being used for prediction and forecasting of animal diseases in support of more targeted and efficient surveillance. While ML and inferential statistics can accomplish similar tasks, they have different strengths, making one or the other more or less appropriate in a given context.

Keywords
Animal health – Infectious disease – Machine learning – Surveillance – Veterinary public health.

What is machine learning?
The advancement in computing technology and power and the explosion of data generation and storage capability in the last decades have seen the increased use of machine learning (ML) in many areas.
ML is a collection of methods built upon statistics, mathematics and computer science that enable automated pattern discovery and model building at scale. Many introductory articles describing the various ML techniques have been produced, targeting researchers and scientists in different fields [1,2,3,4,5,6,7,8,9,10,11]. We do not intend to reproduce those efforts but aim to put ML methods in the context of their techniques and purposes in comparison to traditional statistical data analysis, and to present ML solutions to specific surveillance tasks that cannot effectively be addressed by traditional statistical data analysis. In this section we will contrast unsupervised ML with the use of descriptive statistics, and supervised ML with the use of statistical modelling (inferential statistics), to highlight the similarity in the approaches they use and the differences in their purposes.
41_2_24_Guitian_preprint 3/29
between continuous variables and between categorical variables, while similarity measures such as Euclidean distance or Manhattan distance summarise likeness between observations. Although it is possible to comprehend these descriptive statistics in smaller settings, the information can quickly become difficult to synthesise as the number of variables and observations increases. Unsupervised ML techniques essentially explore and further process these descriptive statistics to discover hidden patterns and groupings in the data and to extract useful features from the data. The main tasks for which unsupervised ML is used are dimension reduction, clustering and association rule mining.
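The two similarity measures named above can be illustrated with a minimal sketch. The herd records and their three variables are hypothetical and serve only to show the arithmetic; in practice, variables would usually be standardised first so that no single scale dominates the distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical observations: (herd size in hundreds, mortality %, mean age in years)
herd_a = (1.0, 2.0, 3.0)
herd_b = (4.0, 6.0, 3.0)

print(euclidean(herd_a, herd_b))  # 5.0
print(manhattan(herd_a, herd_b))  # 7.0
```

Clustering algorithms typically start from a matrix of such pairwise distances, which is why the choice of measure can change the groupings that are found.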
Dimension reduction techniques such as principal components analysis are frequently used to reveal hidden patterns in high-dimensional interrelated data [12]. They are used, for example, to summarise a large number of correlated bioclimatic variables or to assist visualising population structure in genetic variation [13,14].
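As a rough illustration of how principal components analysis summarises correlated variables, the sketch below centres a small data set, builds its covariance matrix and extracts the first principal component by power iteration. The data are invented (three hypothetical variables, two of which are perfectly correlated), and power iteration is chosen here only to keep the example dependency-free; statistical software normally uses a full eigen or singular value decomposition.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, share of total variance) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise; the vector converges to the dominant eigenvector
    # (assuming the starting vector is not orthogonal to it).
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the variance along the component.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Hypothetical data: the second variable is exactly twice the first,
# and the third is small unrelated noise.
rows = [
    (1.0, 2.0, 0.1),
    (2.0, 4.0, -0.1),
    (3.0, 6.0, 0.1),
    (4.0, 8.0, -0.1),
    (5.0, 10.0, 0.0),
]

direction, variance_share = first_principal_component(rows)
print(direction)
print(variance_share)
```

In this toy case a single component captures almost all of the variance, which is the sense in which a large set of correlated variables can be replaced by a few components.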

Machine learning in animal and veterinary public health
The scope of artificial intelligence (AI) in the context of public health has recently been reviewed by Schwalbe and Wahl [23], who identified four categories of AI-driven health interventions: 1) diagnosis, 2) mortality and morbidity risk assessment, 3) disease outbreak prediction and surveillance, and 4) health policy and planning.
It is possible to identify recent contributions of ML to animal and veterinary public health that broadly fall within these categories, as well as others that would not clearly fit within any of them.
As in healthcare medical applications, signal processing methods in combination with ML can be used to enhance the performance of diagnostic or classification systems in animals or herds. Promising results have been obtained for example when convolutional neural networks were used to recognise and quantify specific lesions on digital images captured during routine slaughtering of pigs [24].
Improvements to diagnostic performance by applying ML are not limited to imaging data; classification tree analysis has been shown to enhance the sensitivity of the classification regime on which the eradication programme for bovine tuberculosis in the United Kingdom (UK) is based [25]. Decision trees are a method of supervised learning that can be used for regression or classification tasks. They consist of a tree-like structure in which each node represents a single input feature and, for numeric features, its split value. The final nodes, after which no further splits take place, are referred to as the leaves of the tree and represent the output that is used to classify or predict. The best feature and threshold value with which to split the data are identified so as to generate the most homogeneous sub-nodes with respect to the outcome of the tree. Decision trees are one of the most widely used ML methods and the key component of other algorithms such as random forests.
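The search for the best feature and threshold described above can be sketched as follows. This is a minimal illustration of a single split, assuming Gini impurity as the homogeneity measure (other criteria, such as entropy, are also common); the animals, features and labels are hypothetical and are not taken from the study cited.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a perfectly homogeneous node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Hypothetical animals: (test reaction size in mm, age in years); 1 = infected.
rows = [(1.0, 2), (2.0, 7), (3.0, 3), (6.0, 8), (7.0, 4), (9.0, 6)]
labels = [0, 0, 0, 1, 1, 1]

print(best_split(rows, labels))  # (0, 4.5, 0.0)
```

A full tree is grown by applying the same search recursively to each sub-node until a stopping rule is met; random forests repeat this over many resampled data sets and random subsets of features.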
With regard to the second domain of application, ML has been used, for example, to attempt to predict cases of lameness in dairy cows based on milk production and conformation traits [26]. The predictive performance of the classifiers built in this study was suboptimal, but as acknowledged by the authors, it could possibly be improved by expanding the spectrum of data with which the models were trained.
Indeed, this study illustrates how the capacity of ML algorithms to accurately predict presentation of a multifactorial condition, such as lameness in dairy cattle, relies on them being trained on data that captures the wide array of disease determinants. The ability of ML algorithms to generate real-time risk predictions based on a broad range of risk factors was the motivation for the use of ML to expand conventional risk prediction approaches and generate daily predictions for highly pathogenic avian influenza risk for poultry farms in the Republic of Korea [27], an application that falls within the third category of AI-driven interventions listed above.
As for applications for health policy and planning, we are not aware of the use of ML algorithms to support allocation of resources for animal disease surveillance in the same way they have been used in public health resource allocation [28]. On the other hand, ML has been used to generate information in order to support animal health surveillance planning and outbreak response. In a recent example, to address the lack of comprehensive and accurate poultry population data in the United States of America (USA), Patyk et al. developed an automated machine learning process to locate commercial poultry operations and predict their size and type in the USA. The authors used a supervised ML algorithm to detect poultry operations from aerial imagery [29].
In recent years, there has been a rapid expansion in the application of ML to very diverse challenges in animal health, some of which do not entirely fall within the above areas of application, which mostly refer to the use of supervised algorithms for the purpose of prediction or classification. Unsupervised ML methods have been used, for example, to discover underlying structure in poultry condemnation data to uncover potential indicators for broiler chicken health and welfare surveillance (cluster detection and association rule mining) [21, 30] and to classify cattle herd types to inform control and surveillance of endemic diseases (dimension reduction) [31]. Supervised ML methods for regression/classification have also been applied to animal and veterinary public health challenges beyond the four domains identified by Schwalbe and Wahl [23], a recent example being the identification of carnivore and bat species not recognised as reservoirs of rabies with trait profiles suggesting their capacity to be or become reservoirs [32].
An important emerging area of application of ML algorithms in the context of animal health surveillance is the analysis and extraction of information from clinical records for the purpose of syndromic surveillance [33]. Recent studies have shown the potential of applying machine learning algorithms to automate mining of free-text data in clinical and post-mortem reports, an application that can greatly facilitate the adoption of animal health syndromic surveillance [34,35,36]. At farm level, precision technologies are providing farmers and veterinarians with large amounts of data, the analysis of which can greatly support health and production management. Machine learning algorithms are central to the analysis of such data and, as for text data, they can enable their use for the purpose of syndromic surveillance [37,38].
In summary, due to their diversity and versatility, ML algorithms are being applied to an increasing range of tasks in animal and veterinary public health. In addition to broad domains of application analogous to those recognised in the field of global health, more specific uses of ML to address particular tasks continue to emerge. In the following section,

Examples of application in animal and veterinary public health surveillance
Use of machine learning to maximise probability of pathogen detection
As described above, ML can be applied for different purposes within the context of animal and veterinary public health. In the area of surveillance, it can be used to determine the likelihood of pathogen detection. This allows researchers to prioritise samples or cases that have the highest probability of being positive, ensuring that resources and laboratory capacity are focused on these samples, and to assist in the design of any future surveillance programmes. Such approaches have been applied to animal disease, but also to foodborne disease [39] and plant diseases [40], and make the most of the metadata associated with the biological samples or cases, such as geographical location, type/age of host, etc.
As an example, Walsh et al. [41] used gradient boosted trees, which is an extension of a classification tree (
Another example of application of ML to generate insights into potential reservoirs of disease is the study by Wardeh et al. [50]. Predictions of associations between known viruses and potential reservoirs of disease (zoonotic and non-zoonotic) were obtained with an ensemble of six models using a large data set of mammal-pathogen interactions. The results highlighted that current knowledge is likely to heavily underestimate the number of existing associations, particularly in wild and semi-domesticated mammals.
An application of ML in the context of disease surveillance that deserves special mention is the exploration of genome sequencing data.
The characteristics of these data (large, complex and hiding patterns that would be challenging to determine via other means) make ML methodologies ideal for their analysis. Furthermore, sequencing data is now becoming more readily available due to reduction in cost and the increase in through-put within veterinary and public health institutes.
Examples of tasks relevant for the design and implementation of infectious disease surveillance that have been successfully accomplished by applying ML to whole genome sequencing data include source attribution [51], assessment of pathogenicity [52], prediction of antibiotic resistance phenotypes [53] and prediction of clinical outcomes [54].

Consider training a meta-model
Each ML method will have different strengths and weaknesses, and it is difficult to know a priori which approach will work best for any given problem. Users can also apply a stacking ensemble approach, where multiple methods are applied in parallel and the final model is a weighted combination of the predictions from all the models.
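The idea of a weighted combination of model predictions can be sketched as below. This is a simplified illustration rather than a full stacking procedure: the base-model probabilities and outcomes are invented, the weights are found by a coarse grid search that minimises the Brier score, and the weights are constrained to be non-negative and to sum to one. In a real application the meta-model would usually be fitted on out-of-fold predictions so that it is not trained on data the base models have already seen.

```python
from itertools import product


def brier(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def combine(base_preds, weights):
    """Weighted average of the base models' predictions, one value per observation."""
    n_obs = len(base_preds[0])
    return [sum(w * preds[i] for w, preds in zip(weights, base_preds))
            for i in range(n_obs)]


def fit_weights(base_preds, outcomes, steps=10):
    """Grid-search convex weights that minimise the Brier score of the combination."""
    best_weights, best_score = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(base_preds)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to one
        weights = [c / steps for c in combo]
        score = brier(combine(base_preds, weights), outcomes)
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Hypothetical held-out outcomes (1 = disease present) and the predicted
# probabilities from three hypothetical base models.
outcomes = [1, 0, 1, 0, 1, 0]
base_preds = [
    [0.9, 0.2, 0.6, 0.4, 0.7, 0.1],
    [0.6, 0.4, 0.9, 0.1, 0.5, 0.5],
    [0.7, 0.3, 0.5, 0.5, 0.9, 0.2],
]

weights, stacked_score = fit_weights(base_preds, outcomes)
print(weights, stacked_score)
print([brier(p, outcomes) for p in base_preds])
```

Because the grid includes the cases where all weight is placed on a single model, the combined score on these data can never be worse than that of the best individual model; whether the gain carries over to new data has to be checked on observations not used to fit the weights.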

Consider the transparency of the approach
A disadvantage of ML algorithms when compared to statistical methods such as regression is the limited 'interpretable' information they provide beyond their immediate task (e.g. classifying observations). For example, neural networks have been found to be very effective at making predictions where there are complex non-linear relationships between variables, but they might be unsuitable for identifying individual risk factors from which to target farms for surveillance or control measures.

Consider whether the main objective is to explain or to predict
As the main focus of ML algorithms is on prediction rather than explanation, there can be differences in the variables that are included in the final predictive model between ML and classical statistics. While explanation is not the primary aim of ML methods, some of the factors that are found to be important for prediction by ML algorithms could be the target of further investigation and could shed light on causative explanations, even where they were not found to be statistically significant by classical statistics approaches. If the intention is to build an ML model in order to predict disease occurrence, the algorithm should ideally be trained on data that capture the array of disease determinants. This is particularly important when trying to predict the occurrence of multifactorial conditions.

Consider the balance of domain expertise and machine learning expertise
For example, Sperschneider [40] suggests that 5% of the time will be spent training the model but 95% selecting the most appropriate features, which needs biological/epidemiological expertise. Likewise, ML expertise is needed to ensure that the correct model is applied for the available data, and that overfitting is avoided.