AbstractAthlete engagement is influenced by several factors, including cohesion, passion and mental toughness. Machine learning methods are frequently employed to construct predictive models as a result of their high efficiency. In order to comprehend the effects of cohesion, passion and mental toughness on athlete engagement, this study utilizes the relevant methods of machine learning to construct a prediction model, so as to find the intrinsic connection between them. The construction and comparison methods of predictive models by machine learning algorithms are investigated to evaluate the level of predictive models in order to determine the optimal predictive model. The results show that the PSO-SVR model performs best in predicting athlete engagement, with a prediction accuracy of 0.9262, along with low RMSE (0.1227), MSE (0.0146) and MAE (0.0656). The prediction accuracy of the PSO-SVR model exhibits an obvious advantage. This advantage is mainly attributed to its strong generalization ability, nonlinear processing ability, and the ability to optimize and adapt to the feature space. Particularly noteworthy is that the PSO-SVR model reduces the RMSE (7.54%), MSE (17.05%), and MAE (3.53%) significantly, while improves the R2 (1.69%), when compared to advanced algorithms such as SWO. These results indicate that the PSO-SVR model not only improves the accuracy of prediction, but also enhances the reliability of the model, making it a powerful tool for predicting athlete engagement. In summary, this study not only provides a new perspective for understanding athlete engagement, but also provides important practical guidance for improving athlete engagement and overall performance. By adopting the PSO-SVR model, we can more accurately identify and optimise the key factors affecting athlete engagement, thus bringing far-reaching implications for research and practice in sport science and related fields. IntroductionMachine learning, a powerful interdisciplinary technology that incorporates multiple disciplines, has the remarkable capability to learn autonomously from experience.It constructs predictive models using “learning algorithms” to make accurate, personalized predictions for new scenarios. As a core component of artificial intelligence, machine learning plays a vital role by extracting valuable insights from diverse and complex data sets. Its applications extend to numerous fields, including medicine1, bioinformatics2 and automation technology3.As a core concept in the field of positive psychology, engagement refers to an individual’s persistent mental state filled with positive emotions, which is an important factor in influencing an individual’s behavioral performance and the direction of behavioral development. Although previous research in the area of sport performance has focused on burnout mostly as leading to maladaptation and sport dropout4,5. However, with the growth of the positive psychology movement, athlete engagement has received much attention and related research has become increasingly prominent6. Early identification of low-engagement individuals allows for intervention before the risk of burnout and disengagement occurs7. It needs to be specified that a state of non-burnout is not the same as engagement, and that low engagement is not burnout. Positive feedback can be obtained when we view athlete performance in a positive manner. Therefore, athlete engagement, as an important indicator of positive psychology of athletes, plays an important and positive role in improving athletes’ competitiveness and athletic performance. A high level of engagement is conducive to stimulating positive qualities that promote the development and maturation of athletes8, laying the foundation for enhancing their athletic ability9 and ultimately translating into significant improvements in athletic performance and achievement7.In sports, the application of machine learning and AI technologies is continually broadening the boundaries of sports talent development, athlete performance enhancement and event decision-making. Through these advanced technologies, coaches and sports teams are able to achieve accurate predictions of athlete performance, driving a shift in the decision-making process from intuitive experience to quantitative data-based analysis10. Researchers have used smart technologies to play an significant role in several aspects of selection of basketball players, analysis of game activity, prediction of results and so on, improving team competitiveness significantly11,12. The soccer club case further demonstrates the practical application of sports data analytics in predicting injury prevention, assessing optimal performance and optimizing training13. It is clear that predictions utilizing machine learning that enable timely assessment of athletes’ reaction and performance to different scenarios are critical for improving skills, promoting team synergy and guiding sports teams towards success. This highlights the significant value of sports prediction modeling in contemporary sports.Therefore, previous research has utilized predictive models of athletic performance as a benchmark for guiding athlete training, management14, and strategic planning in sports teams15. For example, researchers have built such models to evaluate motor control and stability to provide runners with precise training recommendations16. In addition, studies that predicted sports performance from a four-dimensional perspective found that the interest in seeking knowledge was a key factor influencing sports performance17. In China, although the exploration of artificial intelligence in the field of sports performance continues to deepen, the research in this area started later than abroad. We have notyet deeply explored its influence mechanism and established corresponding prediction models. Athlete engagement as a positive predictor and key influencing factor is in dire need of further research and development18,19.Currently, traditional statistical methods often struggle to capture complex nonlinear relationships when analyzing the relationship between multidimensional psychological traits and athlete performance. For example, while linear regression models provide basic predictions, they are limited in their explanatory power and predictive accuracy. In contrast, machine learning methods show remarkable advantages when dealing with complex datasets. It is not only capable of handling large amounts of high-dimensional data, but also has the ability to identify patterns and associations automatically, capturing the complex relationships between variables effectively. Particularly in the field of sport psychology, machine learning has been able to explore the influence of psychological traits on behavior in greater depth. Machine learning can identify the relationship between underlying psychological traits and athlete engagement through sophisticated data analysis. This provides a new methodological basis for quantitative sport psychology and improves the science and accuracy of prediction of athlete engagement. However, the multiple psychological factors involved in athlete engagement, such as passion20, burnout21 and thriving22, are often subjective and difficult to standardize and quantify. Machine learning requires large amounts of high-quality, structured data for training, but the diversity and complexity of subjective questionnaires bring challenges to data processing. Therefore, we propose to implement a comprehensive set of strategies covering preprocessing steps, recursive feature elimination for cross-validation (RFECV), sensitivity analysis, SHAP and LIME interpretive techniques, data averaging and downsampling and data standardization and normalization. We adopt a comprehensive strategy covering recursive feature elimination for cross-validation (RFECV), sensitivity analysis, SHAP and LIME interpretation techniques, data averaging, downsampling, and standardization. This aims to improve the quality of the dataset and the predictive performance of the model significantly, ensuring model reliability and accuracy.This study aims to predict the level of athlete engagement by examining both team and individual dimensions. At the team level, cohesion is one of the most frequently cited concepts in sports science research. It refers to the solidarity and sense of belonging that individuals feel in a team, which is a key factor influencing team performance. Strong team cohesion predicts the level of athlete engagement. In sporting events, cohesion provides crucial social support for athletes23. Teams with high levels of cohesion usually result in higher levels of member engagement and increased support and trust in each other. Research suggests that cohesion is associated with satisfying psychological needs, promoting intrinsic motivation, and elevating athlete engagement levels. Increased team cohesion is an effective way to increase athlete engagement among athletes. It can be concluded that cohesion not only affects the emotional experience of athletes, but also plays an significant role in their behavioral performance. At the individual level9, passion and mental toughness play crucial roles in influencing athlete engagement. Hegel once asserted, "Nothing great in the world has ever been accomplished without passion." Passion is not only a potential indicator of teacher engagement among secondary school24 but also significantly influences students’ learning engagement25. In sports, passion, internalized into personal identity26, fosters strong engagement, enabling athletes to persist in their activities. Research suggests that we should focus on promoting harmonious passion tendencies within achievement situations in order that there is an increased probability of adaptive behavioral outcomes ensuing and being able to promote healthy sport engagement27. Also, the two types of passion (harmonious and obsessive) should lead to engagement that, in turn, should lead to improved sport performance. Thus with passion, people are able to focus completely on the task at hand, give their full commitment to the sport, and experience positive results from their participation. Mental toughness is indispensable in both protecting and enhancing athlete engagement, representing a crucial psychological asset for achieving exceptional sporting performance. Research suggests that mental toughness is essential for athletes to cope with difficulties successfully, avoid the effects of stress and maintain their performance development28. It not only has a significantly positive predictive influence on athlete engagement29 but also correlates positively with an individual’s sports performance. Athletes with a high level of mental toughness can participate in the process of training and competing with a positive state of mind and body, and have a high level of athlete engagement (e.g., a high level of mental toughness). Athletes with high mental toughness have a strong degree of confidence in their own abilities and respond positively when they encounter difficulties, so they tend to approach training and competition with a more positive physical and mental condition30 and have a high level of athlete engagement. Therefore, mental toughness plays a central role in the foundation of sport success and is a proximal factor in the athlete engagement.This study develops and compares several prediction models, including Linear Regression (LR), k-Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), Decision Tree Regression (DTR), Random Forest Regression (RFR) and Support Vector Regression (SVR). To enhance the predictive capabilities of these base models, we employ four efficient optimization algorithms: Random Grid Search (RGS)31, Genetic Algorithm (GA)32, Spider Web Optimization (SWO)33 and Particle Swarm Optimization (PSO)34. Additionally, to ensure the model’s reliability and robustness, we utilize five-fold cross-validation35 to assess the model’s stability. Through these integrated methods, the study aims to evaluate the accuracy, stability, and overall performance of the models in predicting athlete participation.Methods and dataData sourceWritten informed consent was signed by each team representative and all participants were informed that they could withdraw from the study at any time. All methods were carried out in accordance with relevant guidelines and regulations. This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Zhejiang Normal University (ZSRT2022079).In order to obtain sufficient data sets, this study distributed and collected questionnaires to athletes in multiple professional sports teams, high-level national training teams from Zhejiang, Heilongjiang, and Liaoning provinces by means of Wenjuanxing. From 1 June to 1 October 2021, a total of 445 questionnaires were distributed. Of these, 175 (53.7%) were male participants, a higher rate than females, and the average age of the athletes was between 17 and 22 years old and had about three years of training experience. Sports include: basketball (74), volleyball (36), soccer (63), cricket (47), ice hockey (20), curling (11), team aerobics (47) and others (28). All questionnaire surveys were conducted after sports training sessions. The questionnaire scale covered in this study can be divided into three parts: in the first part, we made it clear that the survey was based on the principles of voluntariness and anonymity; the responses to the questionnaire were used only for the researchers; they should not be used for any kind of profit or other purposes. The second part was to collect basic information about the athletes. In the third part, we set up the scales required for this study, including cohesion (Group Environment Questionnaire), passion (Sports Passion Questionnaire), mental toughness (Sports Mental Toughness Questionnaire) and athlete engagement (Athlete Engagement Questionnaire).Group environment questionnaire (GEQ)The GEQ compiled by Carron (2010) and translated by Ma Hongyu36 was adopted. This questionnaire represents a specialized tool for assessing cohesion in sports, which has good reliability and validity in Chinese use. The questionnaire comprises 15 items, encompassing the dimensions of Attraction to Group Task (ATG-T), Attraction to Group Social (ATG-S), Group Interaction Consistency (GI-T), and Group Social Interaction Consistency (GI-S). The Cronbach ɑ coefficient of test–retest reliability is 0.869, which meets the needs of this study.Sports passion questionnaire (SPQ)The SPQ was compiled by Vallerand (2010)37. The use of Chinese athletes yielded favorable results in terms of reliability and validity. The scale comprises 16 items, encompassing three dimensions: general passion, harmonious passion and compulsive passion. The Cronbach’s alpha coefficient for this scale is 0.941, which meets the research requirements.Sports mental toughness questionnaire (SMTQ)The SMTQ was compiled by Sheard, Golby and Wersch (2009)38. The Chinese version of the questionnaire demonstrated good reliability and validity in its application within China. There are 14 items in total, including three dimensions of self-confidence, constancy and control. A higher score indicates greater mental toughness in the athlete. The Cronbach’s ɑ coefficient for the questionnaire in this study was 0.849, signifying its strong reliability.Athlete engagement questionnaire (AEQ)The AEQ was compiled by Lonsdale et al.39(2007) and revised by Ye Lv’s translation29. The questionnaire comprises 16 items, encompassing the four dimensions of self-confidence, dedication, vigor and enthusiasm. A higher score indicates a greater level of engagement from the athlete. The Cronbach’s alpha coefficient for this scale is 0.959, which meets the research requirements.Machine learning model construction processFeature selection of machine learning model. The study decided to select the features for the machine learning model from the athletes’ cohesion, passion and mental toughness. Cohesion consists of four scores: ATG-T, ATG-S, GI-T and GI-S; Passion consists of three scores: passion criteria, harmonious passion and obsessive passion; Mental toughness consists of three scores: confidence, constancy and control.We employ the cross-validation recursive feature elimination (RFECV) method40 for feature selection, removing features with lower weights iteratively until the model’s performance ceases to improve. Features are ranked by importance, and RFECV identifies the optimal number of features to retain automatically, ensuring the model includes the most predictive ones.Machine learning model establishment process. This research employed Python version 3.9.1 and utilized the Sklearn library to construct a machine learning prediction model for athlete engagement. The specific methodology is outlined as follows:Data collection and preprocessingThe study participants included professional athletes and high-level college athletes, selected from the national training team and various regions in China, such as Zhejiang, Heilongjiang and Liaoning. These athletes have had sports careers spanning over three years and have shown exceptional performance in their fields. Questionnaires were administered after athletic training sessions. Researchers obtained consent from both coaches and athletes, thoroughly explaining the study’s objectives and confidentiality protocols. Each team representative signed an informed consent form to ensure respondents were fully aware of any potential risks. To reduce the influence of coaches, the questionnaires were given to team captains for distribution. Participants took about 12 min to complete the questionnaire. The collected data included basic information, cohesion, passion, mental toughness and athlete engagement. Of the 445 precious data collected, we performed data preprocessing with the aim of identifying and dealing with outliers and outliers. In this research, we identified outliers in two specific scenarios: first, where all responses to the questionnaire were identical, suggesting a lack of variability in the data; and second, where the time taken to complete the questionnaire was exceptionally brief which amounting to less than 3 min. Following a stringent vetting process, we removed those questionnaires in which such conditions existed. This meticulous approach resulted in a dataset comprising 326 high-quality, valid entries, achieving a response rate of 73.3%. Importantly, the dataset was devoid of any missing values, which reinforces the integrity and dependability of our study findings.Given that athletes might be affected by various circumstances, including physical exhaustion and meteorological conditions. These factors may lead to noise or fluctuations in the statistics. By calculating an average of every 8 data points, the impact of short-term variations can be reduced effectively and unwanted noise is minimized. This facilitates the observation of long-term trends and alterations. In some cases, athlete data may show higher densities, complicating analysis. High frequency data can be reduced to lower frequencies to make the analysis process more efficient41. Data preprocessing is an essential step in tailoring the dataset to meet the unique requirements of various machine learning algorithms. It plays a pivotal role in assessing the predictive capabilities of different models.Data preprocessing and normalizing techniques are employed to enhance the dataset’s quality. These methods aim to optimize the performance of machine learning algorithms by minimizing data noise and decreasing data dimensions42. The input dataset comprises the cumulative scores for cohesion, passion and mental toughness, whereas the output dataset comprises the overall score for athlete engagement. Input and output data normalized in the range 0 to 1. Equation (1) was used for normalization of data.$$begin{array}{*{20}c} {N = frac{{left( {X - X_{min} } right)}}{{left( {X_{max} - X_{min} } right)}} } end{array}$$ (1) N represents the normalized data, X represents the measured value, ({X}_{min}) and ({X}_{max}) represent the minimum and maximum values, respectively.Division of training set and test setTo validate the performance of the model, the available database was divided into two distinct groups: a training set for training predictive models and a testing set for evaluating its performance. After comparing the experimental results of different data partition ratios, we found that the 8:2 partition ratio has the lowest RMSE, as detailed in Table 1. During the model training phase, we improve the parameters continuously and use the test set to evaluate the performance of the model for further refinement. The test set not only evaluates the data, but also confirms the reliability and generalization ability of the model. In addition, employing five-fold cross-validation35, we guarantee the ability of model to generalize and successfully mitigate the danger of overfitting.Table 1 Comparative analysis of RMSE for predicting different feature combinations under different machine learning algorithms.Full size tableMachine learning model algorithm testIn this study, we conducted a comprehensive analysis of the dataset with the aim of evaluating the applicability of different machine learning algorithms in constructing a prediction model for athlete participation based on their characteristics and advantages. We selected LR, KNN, GBRT, DTR, RFR, SVR and models combining four optimization algorithms (RGS, GA, SWO and PSO) to be trained and tested with the aim of identifying the most suitable models for predicting machine learning models for athlete engagement. During the course of our research, we found that each algorithm and its optimization strategy face unique challenges and potential limitations:LR struggles to capture the complex nonlinear relationships of psychological features and has limited predictive power. KNN has reduced accuracy on small datasets and is sensitive to feature scaling. GBRT and DTR are susceptible to overfitting by noise and outliers. RFR performs poorly when dealing with highly correlated features. SVR has stringent kernel function selection and preprocessing requirements. RGS is computationally expensive in high dimensional spaces. GA requires fine tuning of parameters and multiple iterations to converge31,32. SWO as a new algorithm, lacks empirical guidance for parameter tuning and may not be as effective as mature algorithms33. PSO is easy to fall into local optimum in high dimensional space, and is very sensitive to the initial setting and parameter adjustment34.By considering the advantages and disadvantages of these algorithms and optimization strategies, we found that the SVR model combined with PSO performs the best in predicting athlete participation. The PSO-SVR model optimizes the parameters of the SVR through the PSO algorithm, which effectively improves the predictive performance of the model and reduces the risk of overfitting. Compared to other models, it performs better in handling nonlinear relationships and complex feature interactions. This result indicates that the PSO-SVR model has higher accuracy and stronger generalization ability in the task of athlete engagement prediction, which provides a valuable reference for future research and practice.Feature selection of machine learning modelIn the construction of machine learning models, the RFECV method was used for feature selection. As an efficient wrapper-based feature selection technique, RFECV combined the advantages of Recursive Feature Elimination (RFE) and Cross-Validation (CV), enabling the automatic filtering of feature subsets that had the greatest impact on model performance.RFE is a feature selection technique that relies on a specific machine learning model to achieve a predefined number of features by iteratively eliminating features with lower weights. It sorts features based on the model’s coefficient weights or importance scores, recursively removing a certain proportion of redundant or irrelevant features at each iteration, ultimately retaining the subset of features that contributes most to the model’s predictive capability. However, a limitation of the traditional RFE method is the need to manually set a threshold for retaining the number of features, which could lead to the loss of important features or the retention of unnecessary ones. To overcome this drawback, this study adopted RFECV. This method evaluates the performance of each feature subset and selects the feature combination with the highest score, thereby ensuring that the chosen feature subset maximizes its contribution to model performance and effectively avoids the accuracy loss that may result from manually setting thresholds43. Using the RFECV method, we screened out features with higher weights and determined that the three features of cohesion, passion and mental toughness were critical for model performance by comparing different combinations of features, as detailed in Table 2.Table 2 Comparative analysis of predicting different feature combinations under different machine learning algorithms.Full size tableOutput machine learning model prediction resultsTrain the training set and evaluate the prediction performance of each model.Machine learning model evaluationIn order to determine the best technology, we use the Root Mean Square Error (RMSE), Mean Square Error (MSE), Mean Absolute Error (MAE) and coefficient of determination (({R}^{2})) as evaluation indicators41. In the following formula, ({X}_{i}) represents the predicted value of the (i) th element, while ({Y}_{i}) denotes the actual values of the (i) th element. The regression method forecasts the ({Y}_{i}) value corresponding to the ({X}_{i}) value in a real dataset. The average of the true values44. Its formula is expressed in Eq. (2).$$begin{array}{*{20}l} {overline{Y} = frac{1}{m}mathop sum limits_{i = 1}^{m} Y_{i} } end{array}$$ (2) To assess the predictive performance of each model, four types of metrics were used: RMSE MSE, MAE and ({R}^{2}).The coefficient of determination ({R}^{2}), is used to quantify the extent to which the predictive model can capture variations in outcomes and to determine which model excels in fitting the data. The calculation of R squared is described by Eq. (3). The preferred metric, as illustrated in Eq. (4), is the RMSE, a quantitative measure used to evaluate the effectiveness of regression models. It quantifies the average of the squared differences between the observed and predicted values of the target variable. A lower RMSE value indicates superior model performance, signifying a smaller average deviation between the predicted and actual values. Additionally, as calculated in Eq. (5), the MSE can be included as a complementary metric, which measures the average of the squared errors in a set of predictions. The last metric is the MAE, which quantifies the differences between observed and predicted values, providing a measure of the typical error magnitude. It is calculated according to Eq. (6), where m represents the sample size, ({Y}_{i}) denotes the actual values, and (widehat{{Y}_{i}}) represents the forecasted values.$$begin{array}{*{20}l} {R^{2} = 1 - frac{{mathop sum nolimits_{i = 1}^{m} left( {Y_{i} - widehat{{Y_{i} }}} right)^{2} }}{{mathop sum nolimits_{i = 1}^{m} left( {Y_{i} - overline{Y}} right)^{2} }}} end{array}$$ (3) $$begin{array}{*{20}l} {RMSE = sqrt {frac{1}{m}mathop sum limits_{i = 1}^{m} left( {Y_{i} - widehat{{Y_{i} }}} right)^{2} } } end{array}$$ (4) $${text{MSE }} = { }frac{1}{m}mathop sum limits_{i = 1}^{m} left( {Y_{i} - widehat{{Y_{i} }}} right)^{2}$$ (5) $$begin{array}{*{20}l} {MAE = frac{1}{m}mathop sum limits_{i = 1}^{m} left| {Y_{i} - widehat{{Y_{i} }}} right|} end{array}$$ (6) Machine learning algorithm theoryLR algorithmSupervised learning models can be classified into regression and classification techniques when used to predict continuous and discrete target variables. LR is a typical regression model that predicts targets through the following equation45 . Its formula is expressed in Eq. (7).$$begin{array}{*{20}c} {hat{y}left( {w,x} right) = omega_{0} + omega_{1} x_{1} + cdots + omega_{c} x_{c} + b} end{array}$$ (7) In this model, the weight vector (omega) can be obtained through the “coef_” attribute of the model, while the intercept (b) is obtained through the “intercept_” attribute. And LR models usually don’t require much tuning of the hyperparameters; the performance of a linear model depends largely on the degree to which the problem or data follow a linear distribution46. If the data do exhibit a linear relationship, then LR models tend to provide better predictions. However, if the true distribution of the data is nonlinear, more complex models may need to be considered, or appropriate preprocessing of the data might be required to improve the performance of the LR model.KNN algorithmThe KNN algorithm is applicable not only to classification tasks but also to regression tasks. In regression tasks, the goal is to predict continuous values rather than discrete labels as in classification tasks. Specifically, for each test sample to be predicted, its predicted value is determined by examining the k nearest neighbors found in the training set and averaging the values of these neighbors.Suppose the training set (T={left({x}_{1},{y}_{1}right),cdots ,left({x}_{N},{y}_{N}right)}), ({x}_{i}) is the feature vector of an instance and ({y}_{i}in {{c}_{1},{c}_{2},cdot cdot cdot ,{c}_{n}}) is the class of that instance, (i =(text{1,2},cdot cdot cdot ,n)), for a test instance x, its class y can be represented in47 Eq. (8).$$begin{array}{*{20}c} {hat{y}left( x right) = frac{{mathop sum nolimits_{{x_{i} in N_{k} left( x right)}} omega left( {x,x_{i} } right)y_{i} }}{{mathop sum nolimits_{{x_{i} in N_{k} left( x right)}} omega left( {x,x_{i} } right)}} } end{array}$$ (8) where (omega left(x,{x}_{i}right)) is the weight function, usually defined as (omega left(x,{x}_{i}right)=frac{1}{{dleft(x,{x}_{i}right)}^{p}}), d is the distance function and p is the power parameter of the Minkowski distance.GBRT algorithmGBRT is an ensemble algorithm with a powerful learning strategy. Although it was initially designed to address classification problems, it has been successfully applied to the field of regression48. In the gradient boosting process, each step aims to minimize the loss function by adding elementary tree that reduce the loss function at each step, ultimately minimizing the loss function. In this way, GBRT is able to learn complex nonlinear relationships within the data.Our goal is to determine a function, (Fleft(xright)), which minimizes a specific loss function (L(y,Fleft(xright))). GBRT approaches the optimal solution (widehat{F}(x)) step by step by weighted combination of these simple regression trees (h({x}_{t})). In GBRT, each basic regression tree (h({x}_{t})) is constructed based on the input variable X and the negative gradient of the loss function. Together, these trees allow the model to capture the complex nonlinear relationships in the data more accurately. The model is formulated as follows49 Eqs. (9) and (10).$$begin{array}{*{20}c} {F_{0} left( x right) = arg begin{array}{*{20}c} {min} y end{array} mathop sum limits_{t = 1}^{N} Lleft( {y_{t} ,gamma } right)} end{array}$$ (9) $$begin{array}{*{20}c} {F_{m} left( x right) = F_{m - 1} left( x right) + gamma_{m} h_{m} left( x right)} end{array}$$ (10) where ({F}_{m}(x)) is the integral of the predicted values of the basic regression tree. ({h}_{m}(x)) is the ({m}_{th}) regression tree and ({y}_{m}) is the weighting factor of the ({m}_{th}) regression tree. The GBRT is optimized using the steepest descent method, the negative gradient ({z}_{m}(x)) is given by Eq. (11).$$begin{array}{*{20}c} {z_{m} left( {x_{t} } right) = - frac{{partial Lleft( {y_{t} ,F_{m - 1} left( {x_{t} } right)} right)}}{{partial F_{m - 1} left( {x_{t} } right)}}} end{array}$$ (11) The regression tree ({h}_{m}(x)) is constructed by modeling ({z}_{m}(x)) and x. The weighting factors can be obtained by multiplying the importance of each feature by the corresponding coefficient of the model. Its formula is expressed in Eq. (12).$$begin{array}{*{20}l} {y_{m} = arg begin{array}{*{20}c} {min} y end{array} mathop sum limits_{ - l = 1}^{N} Lleft( {y_{t} ,F_{m - 1} left( {x_{t} } right) - y_{m} h_{m} left( {x_{t} } right)} right)} end{array}$$ (12) DTR algorithmDTR is both an effective classification method and a commonly used technique in regression analysis. As a fundamental machine learning algorithm, it is characterized by strong self-learning capabilities, good model interpretability and high learning efficiency50. The structure of a decision tree consists of leaf nodes, branches and a root node. It integrates all variables, subdivides nodes into child nodes, and selects the splitting method that generates the most homogeneous child nodes during this process, with predictions determined by the path leading to the leaf nodes. DTR is not only easy to learn and understand but also directly visualizable. It constitutes one of the most direct methods for revealing key interactions between variables.During the construction process, the decision tree relies on a specific minimization criterion, such as Mean Square Error (MSE) or Gini impurity, to pick the best cut-off feature j and cut-off point s. The following equation represents the use of MSE as the criterion, as shown in Eq. (13).$$begin{array}{*{20}l} {left( {j,s} right)^{*} = {}_{j,s }^{argmin} left[ {mathop sum limits_{{x_{i} in R_{1} left( {j,s} right)}} left( {y_{i} - c_{1} } right)^{2} + mathop sum limits_{{x_{i} in R_{2} left( {j,s} right)}} left( {y_{i} - c_{2} } right)^{2} } right]} end{array}$$ (13) Here ({(j,s)}^{*}) represents the selected best feature (j) and the cut point (text{s}), while ({R}_{1} (j,s)) and ({R}_{2} (j,s)) denote the two subregions obtained after the cut respectively. ({y}_{i}) is the target value of the ith sample, ({c}_{1}) and ({c}_{2}) are the average of the target values in the two subregions, which are calculated using the formula in Eq. (14).$$begin{array}{*{20}l} {c_{m} = frac{1}{{N_{m} }}mathop sum limits_{{x_{i} in R_{m} left( {j,s} right)}} y_{i} ,m = 1,2} end{array}$$ (14) where ({N}_{m}) is the number of samples in sub-region ({R}_{m}(j,s)). The data is divided into regions by selected ((j,s)) and the corresponding output values are determined as shown in Eq. (15).$$begin{array}{*{20}l} {R_{1} left( {j,s} right) = left. x right|x_{j} le s,R_{2} left( {j,s} right) = left. x right|x_{j} > s} end{array}$$ (15) This process is recursively applied to both subregions until the stopping condition is satisfied. Eventually, the input space is partitioned into M regions R and a decision tree (fleft(xright)) is generated, as shown in Eq. (16).$$begin{array}{*{20}l} {fleft( x right) = mathop sum limits_{m = 1}^{M} hat{c}_{m} Ileft( {x in R_{m} } right)} end{array}$$ (16) where ({widehat{c}}_{m}) is the predicted value of the (m) th region and (I(xin {R}_{m})) is an indicator function to determine whether (x) belongs to region ({R}_{m}) . In this way, the decision tree can predict and analyze the data effectively.RFR algorithmThe RFR algorithm is an ensemble learning algorithm that combines multiple decision trees for prediction, and is particularly suited for handling high-dimensional and nonlinear complex datasets. This method effectively prevents data overfitting and allows for rapid training on test data. As shown in Fig. 1, the RFR algorithm consists of many decision trees, each constructed based on data samples from the training set. When performing regression tasks, the predictions from each tree are averaged to obtain the final prediction value. Typically, deep decision trees tend to overfit, but random forests can effectively mitigate this issue by training each tree with feature and sample subsets41. Features and samples were randomly selected from the input data, with each decision tree trained independently. The predictions of these trees were then aggregated to achieve more accurate predictions than a single decision tree model51.Fig. 1Algorithm for making predictions using random forest.Full size imageSVR algorithmSVR is a regression algorithm based on the Support Vector Machine (SVM) algorithm. SVR maps the input data to a high-dimensional feature space and constructs a hyperplane that maximizes the boundary between the predicted and actual values. In the given sample set (T={left({x}_{1},{y}_{1}right),cdots ,left({x}_{N},{y}_{N}right)}), there are N input variables ({x}_{t}) and corresponding output variables ({y}_{t},t=1,cdots ,N). The form of SVR can be expressed as follows Eq. (17).$$begin{array}{*{20}l} {fleft( x right) = w^{T} varphi left( x right) + b} end{array}$$ (17) where (w) is an element in the high-dimensional space, (varphi left(xright)) is a nonlinear mapping function, and b is a threshold. Then, the above regression problem seeks to minimize the risk of f by minimizing the Euclidean norm of w to reduce the complexity as follows Eq. (18).$$begin{array}{*{20}l} {R_{reg} = frac{1}{2}left| {left| w right|} right|^{2} + C times frac{1}{n}mathop sum limits_{i = 1}^{n} left| {y_{i} - fleft( x right)} right|_{varepsilon } } end{array}$$ (18) where ({|left|wright||}^{2}) is the Euclidean norm of the weight vector, which is used to control the complexity of the model. ({R}_{reg}) is the empirical risk, (C) is the penalty parameter used to balance the weights of the descriptive function and the empirical risk, and ({left|{y}_{i}-fleft(xright)right|}_{varepsilon }) is the insensitive loss function used to measure the difference between the predicted value (fleft({x}_{i}right)) and the true value ({y}_{i}). By introducing the non-negative slack variable ξ and the Lagrangian function, the function (fleft(xright)) is described as follows using the Karush–Kuhn–Tucker (KKT) condition Eq. (19).$$begin{array}{*{20}l} {fleft( x right) = mathop sum limits_{i = 1}^{n} left( {a_{i} - a_{i}^{*} } right)kleft( {x_{i} ,x} right) + b} end{array}$$ (19) ({a}_{i}), ({a}_{i}^{*}) is the Lagrange multiplier and (kleft({x}_{i},xright)) is defined as the kernel function. There are a number of different kernels used to generate the inner product of the machine to be constructed in the input space. Different kernels will have different nonlinear decision surfaces and thus give different results. The kernel was chosen according to the characteristics of the input. In general, a common example of kernel function is Gaussian Radial Basis Function (RBF). In this paper, we also select RBF as kernel function, which is defined as follows Eq. (20).$$begin{array}{*{20}l} {fleft( x right) = mathop sum limits_{i = 1}^{n} left( {a_{i} - a_{i}^{*} } right)exp left( { - frac{{left| {|x_{i} - x} right||^{2} }}{{2sigma^{2} }}} right) + b} end{array}$$ (20) where ({x}_{i}) is an element of the training set, x is an element of the test set, and σ is a parameter of the RBF. However, the parameters of SVM are predetermined. Cross-validation is used to determine the parameter values. Whereas, SVM with fixed parameters are not applicable to changing athlete data.PSO-SVR algorithmThe SVR with fixed parameters struggled to adapt to changes in athlete data, and online learning consumed a substantial amount of computational resources. Additionally, the execution speed did not meet the requirements. A threshold θ was set, and once the estimation error exceeded this threshold, the model was retrained with the latest data. To improve parameter optimization, Particle Swarm Optimization (PSO) was introduced into SVR.In our proposed method, as shown in Fig. 2, PSO is used to optimize the parameters (C and g) of SVR and the data is divided into training and validation sets. We select the best parameters by optimizing the prediction accuracy of the validation set. The cost function is presented in Eq. (21).$$begin{array}{*{20}l} {fleft( {C,g} right) = min arg mathop sum limits_{m}^{M} mathop sum limits_{n}^{N} Lossleft( {y_{i} - mathop ylimits_{i} left( {c_{m} ,g_{n} } right)} right)} end{array}$$ (21) where ({y}_{i}) is the truth value of the ith sample and (mathop ylimits_{i} left( {c_{m} ,g_{n} } right)) is a function of C and g. We terminate the optimization process when the predictive model yields the model we need.Fig. 2Process of adaption in SVR.Full size imageThe choice of PSO-SVR over integrated learning is mainly based on its advantages in terms of interpretability, applicability to small datasets and strong generalization capabilities. This choice is suitable for scenarios particularly where accurate predictions need to be extracted from limited data. In contrast, integrated learning models tend to be more complex, demand higher computational resources, and may perform poorly when dealing with small sample data.ResultsModel feature selection resultsIn this study, RFECV was employed to identify the optimal number of features for the model. The input features considered included mental toughness, cohesion, passion, sports satisfaction and psychological collectivism.With the RFECV method, we successfully filtered the optimal feature subset for the machine learning model based on cohesion, passion and mental toughness. Figure 3 demonstrates that the model was effective in predicting athlete engagement using three feature subsets. These selected features comprised cohesion, passion and mental toughness, covering dimensions such as ATG-T, ATG-S, GI-T and GI-S; passion criteria, harmonious passion and obsessive passion; and confidence, constancy and control. The synergy of these features contributed to the model’s strong performance in predicting athlete engagement.Fig. 3Selection of cross-validation scores based on machine learning feature models.Full size imageThe results of the SHAP analysis show that the three features cohesion, passion and mental toughness have a decisive influence on the predictions of model clearly. The data in Fig. 4 demonstrate the average SHAP values for the features Mental Toughness (MT), Passion (Pa), Cohesion (Co), Sports Satisfaction (SS) and Psychological Collectivism (PC), which visualise the average influence of each feature in the model predictions. It can be observed from the Fig. 4 that features MT, Pa and Co have higher average SHAP values, indicating that they occupy an significant position in model prediction. We further refine these to the level of a single sample, revealing the relationship between the feature values and their SHAP values for this sample which demonstrating the high prediction accuracy of the model by comparing the model predicted values with the actual values, as shown in Fig. 5. In this sample, the positive effect of feature MT is significant particularly, while the effects of features PC and SS are minor relatively. Combining the analyses in Figs. 4 and 5, we conclude that in the study sample, feature MT is the most critical factor influencing the prediction results of model, while features Pa and Co also have a significant impact. However, the impact of features PC and SS is weak relatively. These findings provide valuable information for an in-depth understanding of the decision-making mechanism of the model and also indicate the key features to focus on during data analysis and model optimisation.Fig. 4Average effect of SHAP values on model outputs.Full size imageFig. 5Machine Learning Model Prediction Results and Key Characterization.Full size imageThrough Table 3, the pso-SVR model was identified as the best hyperparameter combination. As shown in Fig. 6, the best hyperparameters for this machine learning model with respect to the features of cohesion, passion and mental toughness were C = 6.6636 and g = 0.0436. With this hyperparameter combination, the model achieved a prediction accuracy of 92.62%, demonstrating its effectiveness in predicting athlete engagement.Table 3 Comparative analysis of optimization algorithms for various machine learning models.Full size tableFig. 6Optimal hyperparameters for machine learning feature models.Full size imageThe hyperparameters of the LR model include: copy_X (default True, prevents modification of the original data), fit_intercept (default True, calculates the intercept), n_jobs (default None, does not compute in parallel), and positive (default False, does not force the coefficients to be positive). For the optimal configuration of SWO combined with GBR, DTR, RFR and KNNR, the following hyperparameters are explained: in the GBR model, the learning rate is set to 0.1837, a lower value that helps the model to perform more fine-grained weight adjustments during the training process, thus improving the accuracy of the prediction; the maximum depth is 1.0, which indicates that we adopt a simpler tree structure to prevent overfitting; the number of estimators is 19.13, indicating that the predictive ability of the model is enhanced by adding more trees. In the DTR model, the optimal maximum depth is about 4.46, which helps the model to capture complex patterns in the data while avoiding the overfitting problem caused by too deep a tree structure. The number of estimators in the RFR model is 18.85, which indicates that the generalisation of model is enhanced by integrating more than one decision tree; the maximum depth is 37.85, which allows the model to learn more complex feature relationships; and the minimum sample leaf is 7.62, ensuring that the leaf nodes contain sufficient information. For the KNNR model, the optimal number of neighbours is 16, which indicates that the model makes predictions by considering the data points of the 16 nearest neighbours; the distance metric is set to “distance”, which indicates that the model takes into account the distances between the data points when calculating the predicted values.As for the combination of PSO and SVR, the following is an explanation of the optimal configuration of the hyperparameters: the regularization parameter C was set to 6.66, a choice that highlights the fact that we struck a good balance between model complexity and training error. This aims to prevent model overfitting while ensuring the model’s ability to fit the training data efficiently; the coefficient of the kernel function is 0.0436, and this low gamma value implies that the model prefers to use smoothed decision boundaries, which helps to capture the overall structural characteristics of the data; and the kernel type is chosen to be “rbf”, the radial basis function kernel, which specializes in revealing nonlinear relationships in the data, which is essential for enhancing the predictive power of the model. By carefully optimizing these key hyperparameters, the prediction accuracy of the SVR model is significantly improved. This makes it stand out in the comparison of many optimization algorithms and show excellent performance.Comparison of prediction models under different machine learning algorithmsFigure 7 displayed the prediction accuracy of regression models constructed using different machine learning algorithms. By comparing the R2 scores, the overall effectiveness of the different prediction models could be determined. The R2 score, also known as the coefficient of determination, is an important metric for evaluating the quality of regression models. It measures the extent to which the model explains the variability in the dependent variable, reflecting the accuracy of the model’s predictions. In model evaluation, a higher R2 score is preferred, indicating that the model fits the data better.Fig. 7Different machine learning algorithms predict the outcome of R2.Full size imageThe accuracy of model predictions is a key indicator of regression model performance, achieved by comparing the prediction accuracy of models built with different machine learning algorithms. Among the seven algorithms, the PSO-SVM algorithm performs the best, with a prediction accuracy of 0.9262, followed by the GA-SVR algorithm with a prediction accuracy of 0.9226. The RSG-SVR algorithm comes in third place, with a relatively high prediction accuracy as well. While LR, SWO-GBR, PSO-GBR, GA-GBR and SWO-DTR algorithms had prediction accuracies of 0.9103, 0.9079, 0.9069, 0.9045 and 0.8861, respectively. Therefore, based on the R2 scores, the PSO-SVM model demonstrated the highest accuracy in predicting athlete engagement, meeting the research requirements most effectively.Figure 8 presented the statistical results of seven machine learning algorithm models from the perspective of RMSE. RMSE is an important metric for evaluating the prediction accuracy of different machine learning models. It is a commonly used indicator that measures the difference between the model’s predicted values and the actual observed values, reflecting the magnitude of prediction errors. By comparing these seven machine learning algorithms, we find the model with the lowest RMSE, that is, the one with the smallest prediction error. The PSO-SVR algorithm has the best performance in terms of RMSE which is 0.1227; followed by the GA-SVR algorithm, which has an RMSE result of 0.1236; and the RSV-SVR algorithm, which has a relatively low RMSE, which is 0.128. The other four algorithms had higher RMSE values, indicating that their prediction errors were larger and thus, they were not suitable as the optimal models.Fig. 8Different machine learning algorithms predict the outcome of the RMSE.Full size imageFigures 9 and 10 displayed the MSE and MAE of seven machine learning algorithms. Both MSE and MAE were crucial indicators for assessing model quality, directly reflecting the accuracy of the model’s predictions. By comparing these error metrics, the overall effectiveness of the different prediction models could be determined. Before assessing model performance, lower MSE and MAE values indicated that the model’s predictions were closer to the actual data. Based on the results in Fig. 9, we can see that the PSO-SVR model has the best performance with an MSE of 0.046. This is followed by the GA-SVR model with an MSE of 0.0153. The LR model has an MSE of 0.0177. According to the results in Fig. 10, the LR model has an MAE of 0.0618 which is the best performance. This is followed by the PSO-SVR model with an MAE of 0.0656. The MAE of the RFR model was 0.0708, while the MSE and MAE values of the other four algorithms were higher, indicating larger prediction errors. Therefore, based on MSE and MAE values, the PSO-SVR model demonstrated the best performance in predicting athlete engagement, with the lowest error metrics, meeting the research requirements. This suggested that the PSO-SVR model was more effective in predicting athlete engagement while maintaining lower prediction errors.Fig. 9Different machine learning algorithms predict the outcome of the MSE.Full size imageFig. 10Different machine learning algorithms predict the outcome of the MAE.Full size imageTable 1 presented a summary of the predictive accuracy, RMSE, MSE and MAE of the prediction models constructed using seven different machine learning algorithms. Figure 11 displayed a radar chart generated from Table 3, which was used to assess the overall effectiveness of the prediction models established by these algorithms. To visually compare the performance of the different machine learning algorithms’ prediction models in terms of statistical accuracy, the radar chart was used to show the models’ combined performance across four dimensions: RMSE, MSE, MAE and predictive accuracy. In this radar chart, PSO-SVR and LR perform well on all assessment metrics, especially their R2 values are close to 1, showing high predictive accuracy. Although GBR performs well on R2, its RMSE and MSE are not as good.Fig. 11Radar plot of overall evaluation level of different machine learning algorithms.Full size imageOverall results indicated that the PSO-SVR algorithm performed the best in terms of accuracy, and RMSE and MSE scores are relatively low (see Table 3). It can be concluded that this model not only has high predictive accuracy but also exhibits low prediction errors. Compared to PSO-SVR, the LR algorithm has lower accuracy. However, LR outperforms PSO-SVR in terms of MAE, possibly because the errors in LR are symmetrically distributed, whereas the increased complexity of PSO-SVR may result in asymmetric error distribution. GA explores the solution space in depth through crossover and mutation so that better Mae metrics can be found. The stochastic nature of RSG can be an advantage in complex parameter spaces, especially when optimizing MSE.The other four algorithms showed lower accuracy and higher errors, indicating that these models are not as effective as PSO-SVR in terms of prediction performance. In addition, the computational cost is related closely to the training and inference time of the model, which is especially obvious in the performance of different optimization algorithms. For example, the GBR model with GA takes as much as 184.3377 s, while the GBR model with PSO takes only 35.9754 s; similarly, applying PSO optimization on the SVR model takes only 5.7664 s. This study suggests that the PSO-SVR model is one of the best models for predicting athlete engagement based on cohesion, passion and mental toughness. It demonstrates high reliability and validity in predicting athlete engagement.DiscussionIn the field of sports, the utilization of machine learning technology has become a key strategy for data exploration and extraction, leveraging its interdisciplinary nature to integrate the advantages of sports science, computer science and statistics. This field not only demonstrates multi-layered and multi-domain characteristics but also highlights the importance of data analysis in specific training and competition scenarios, reflecting the overall trend toward precise sports training and management52. Big data analysis in the field of physical activity using AI techniques is justified as one of the important and effective tools to improve the quality of health and sports performance of professional athletes. As a fundamental application of artificial intelligence, machine learning, with its integration of various interdisciplinary domains and excellent performance, has emerged as a prominent topic in sports training method research.Previously, researchers have delved into the factors affecting athlete performance from a number of perspectives, encompassing not only the traditional technical, tactical and physical aspects, but also cutting-edge areas such as exercise fatigue, recovery mechanisms and psychological factors. Therefore, analyzing and predicting athlete performance is a complex and dynamic endeavor that involves multiple considerations. And machine learning technology plays an important role in analyzing data on athlete performance. The application of this technology is changing the way athletes are monitored in terms of health and performance, thus impacting their training, competition and career management. However, in previous studies, many researchers and coaches have focused on reducing exercise fatigue and burnout, so that they overlooking the fact that athletes who maintain a high level of engagement with training and competition are the ones who respond positively to the stress of competition and perform well in training or competition. Engagement represents a sustained, shared affective-cognitive state of concentration on any object, event, person or behavior. When athletes have a sense of control over their ego state, it helps them achieve their goals. Athlete engagement studied in this paper serves as an important indicator of positive athletic performance and can reflect positive and healthy mental states of athletes. Athletes’ mental health is a major resource throughout their athletic career and beyond. However, the diversity and complexity of sports performance and the nonlinear relationships between psychological traits complicate the task of analysis and prediction using traditional methods. Therefore, the use of effective machine learning models may help to achieve highly accurate predictions of athletic performance.As an algorithmic discipline, machine learning analyzes and assimilates large volumes of pre-existing or synthesized datasets to make predictions and assessments. This provides sports scientists, coaches and athletes with an objective analysis tool for evaluating their performance, thus assisting them in making optimal decisions. Through personalized predictions from “learning algorithms”, it is possible to conduct in-depth analysis of each athlete’s performance and develop customized training plans to help them achieve optimal athletic performance. This allows athletes to gain a more comprehensive understanding of their physical and mental health, enabling timely adjustments and improvements in their performance during training and competitions. Coaches, for their part, can use predictive modeling to monitor athletes engagement levels regularly. Based on the prediction results, real-time information on the athlete’s true athletic performance can be obtained, and measures can be taken timely, such as adjusting the training load or adding psychological counseling. Changes in athlete engagement before and after training can also be analyzed to assess the effectiveness of training interventions and optimize future training strategies continuously. Such data-driven decision-making is often more reliable than subjective judgment. Sports managers can leverage insights from machine learning research to better support coaches and athletes, optimize resource allocation, and enhance the overall competitiveness of sports teams or programs. For example, predictive modeling of athletic engagement is utilized to develop scientific selection criteria to ensure that athlete reserves are recruited that match the team’s culture; Counseling and support resources can be allocated according to the psychological needs of athletes to ensure that athletes receive the support they need during training and competition.Regarding the predictive accuracy of various models, the PSO-SVR model achieved the highest accuracy at 92.62%. The variables chosen to represent this study are cohesion, passion and mental toughness, which are significant potential factors influencing athlete engagement, although other factors may also affect their engagement. The PSO-SVR model excels in handling cohesion, passion and mental toughness due to its adaptability to data features and patterns, strong ability to handle high-dimensional features, and effectiveness in managing small samples and category imbalances.Therefore, the potential for implementing machine learning in sports is extensive. The application of machine learning in sports prediction offers significant opportunities for China to play a leading role on the global sports stage, having a profound impact on enhancing the overall competitiveness of the sports industry and advancing scientific excellence in sports.Limitations and future researchMachine learning-based predictive models do show great potential in providing personalized strategies for athlete training and enhancing athlete engagement. However, this predictive model faces a number of challenges. For example, the PSO algorithm may encounter the problem of local optimal solutions when applied, the choice of algorithm parameters has a significant impact on the final results, and also its search efficiency may decrease when dealing with high-dimensional feature spaces. At the same time, the PSO-SVR model’s dependence on self-reported data may lead to biased predictions, as these data are often influenced by the subjective feelings of individual athletes. In addition, data privacy and security issues are critical when deploying models. We need to take measures such as data desensitization and encryption to protect the privacy of research subjects and to improve model transparency and interpretability.To go beyond these limitations, future research should consider introducing more physiological and behavioral data or employing finer classification techniques to reveal the true state of the athlete. In order for the model to be of broad benefit, we suggest that the data output from the model can be integrated with the data fed back from wearable devices. Instant feedback is then provided to coaches through devices such as coaching dashboards in order to tailor training programs to the individual needs of the athlete. By monitoring changes in athletes throughout the training cycle, we can gain a more dynamic and comprehensive view of the data. In addition, the integrated data can provide data-driven decision support for team management and optimize resource allocation. Multi-model integration and adaptive learning techniques are used to extend the model and ensure its performance in complex environments. Besides that, incorporating advanced machine learning techniques such as deep learning is expected to further improve the accuracy of the predictions. Exploring hybrid optimization strategies such as PSO combined with Nutcracker Optimization Algorithm (NOA) or Spider Web Optimization (SWO) is an innovative way to improve the performance of Support Vector Machines (SVM). These research paths will not only advance computer science and psychology research, but will also bring deeper insights and greater value to the field of exercise science by making these models more sophisticated and effective in terms of theoretical exploration and practical application.ConclusionThis study comprehensively considered key characteristics such as cohesion, passion and mental toughness, and developed a machine learning model aimed at predicting athlete engagement. Through an in-depth comparison and analysis of the six base algorithms and four optimization algorithms, we observe that the PSO-SVR model outperforms the other models in terms of prediction accuracy significantly. In particular, it has demonstrated a remarkable ability to deal with non-linearities, adapt and optimize the feature space, and provide more accurate and reliable predictions of athlete engagement. The PSO-SVR model performed impressively in predicting athlete engagement with a prediction accuracy of 0.9262. In contrast, other models such as GA-SVR, RGS-SVR, SVR, LR and SWO-GBR are less predictive slightly. The results of this research not only highlight the great potential of the PSO-SVR model in predicting athlete engagement, but also provide a solid scientific basis for the field of athletic training and management. In summary, the wide application of the model is expected to provide scientific decision support in the field of sports training and management, and thus improve the individual performance of athletes and the overall competitiveness of sports teams. Data availability The data used to support the findings of this study are available from the corresponding author upon request. ReferencesMA Mohammed 2023 Adaptive secure malware efficient machine learning algorithm for healthcare data J. CAAI Trans. Intell. Technol. 2 1 12MATH Google Scholar X Yizi 2024 Bioinformatics combined with machine learning to identify early warning markers for severe dengue J. China Med Univ. 53 7 583 590MATH Google Scholar W Zhaohua L Jie W Bo D Nana N Fuhua 2024 Research on mining and applications of individual heterogeneity factors in resident demand response by integrating machine learning and SHAP value algorithm J. Syst. Eng. theory Pract. 44 7 2247 2259 Google Scholar K Goodger T Gorely D Lavallee C Harwood 2007 Burnout in sport: A systematic review J. Sport Psychol. 21 127 151Article Google Scholar M Sorkkila A Tolvanen K Aunola TV Ryba 2019 The role of resilience in student-athletes’ sport and school burnout and dropout: A longitudinal person-oriented study Scand. J. Med. Sci. Sports 29 7 1059 1067Article PubMed Google Scholar J Chuantong R Norsilawati Abdul R Nelfianty Mohd 2023 A Preliminary psychometric investigation of a Chinese version of athlete engagement questionnaire (AEQ) Int. J. Hum. Mov. Sports Sci. 11 1 103 111 Google Scholar C Lonsdale K Hodge TD Raedeke 2007 Athlete engagement: I. A qualitative investigation of relevance and dimensions Int. J. Sport Psychol. 38 451 470MATH Google Scholar Z Zhongqiu 2012 Research, development and application of sports psychology in the field of competitive sports J. Tianjin Univ. Phys. Educ. 27 3 7MATH Google Scholar S Gu L Xue 2022 Relationships among sports group cohesion, psychological collectivism, mental toughness and athlete engagement in Chinese team sports athletes Int. J. Environ. Res. Public Health 19 9 4987Article PubMed PubMed Central

A machine learning model the prediction of athlete engagement based on cohesion ...

Abstract Therefore, previous research has utilized predictive models of athletic performance as a benchmark for guiding athlete training, management14, and strategic planning in sports teams15. For example, researchers have built such models to evaluate motor control and stability to provide runners with precise training recommendations16. In addition, studies that predicted sports performance from a four-dimensional […]

Abstract

Therefore, previous research has utilized predictive models of athletic performance as a benchmark for guiding athlete training, management¹⁴, and strategic planning in sports teams¹⁵. For example, researchers have built such models to evaluate motor control and stability to provide runners with precise training recommendations¹⁶. In addition, studies that predicted sports performance from a four-dimensional perspective found that the interest in seeking knowledge was a key factor influencing sports performance¹⁷. In China, although the exploration of artificial intelligence in the field of sports performance continues to deepen, the research in this area started later than abroad. We have notyet deeply explored its influence mechanism and established corresponding prediction models. Athlete engagement as a positive predictor and key influencing factor is in dire need of further research and development^18,19.

Introduction

Google Scholar
S Sanjaykumar K Udaichi G Rajendiran M Cretu Z Kozina 2024 Cricket performance predictions: a comparative analysis of machine learning models for predicting cricket player’s performance in the one day international (ODI) world cup 2023 J. Health Sport Rehab. 10 1 6 19This process is recursively applied to both subregions until the stopping condition is satisfied. Eventually, the input space is partitioned into M regions R and a decision tree (fleft(xright)) is generated, as shown in Eq. (16).Article
MATH

Google Scholar
MATH

Google Scholar
Sorry, a shareable link is not currently available for this article.iashchynskyi, P., & Liashchynskyi, P. Grid search, random search, genetic algorithm: a big comparison for NAS. arXiv preprint arXiv. 1912, 06059 (2019).

Methods and data

Data source

Article

Google Scholar
Average effect of SHAP values on model outputs.

Group environment questionnaire (GEQ)

Article
MATH

Google Scholar

Sports passion questionnaire (SPQ)

(a_i), (a_i^*) is the Lagrange multiplier and (kleft(x_i,xright)) is defined as the kernel function. There are a number of different kernels used to generate the inner product of the machine to be constructed in the input space. Different kernels will have different nonlinear decision surfaces and thus give different results. The kernel was chosen according to the characteristics of the input. In general, a common example of kernel function is Gaussian Radial Basis Function (RBF). In this paper, we also select RBF as kernel function, which is defined as follows Eq. (20).

Sports mental toughness questionnaire (SMTQ)

Article
MATH

Google Scholar

Athlete engagement questionnaire (AEQ)

The SPQ was compiled by Vallerand (2010)³⁷. The use of Chinese athletes yielded favorable results in terms of reliability and validity. The scale comprises 16 items, encompassing three dimensions: general passion, harmonious passion and compulsive passion. The Cronbach’s alpha coefficient for this scale is 0.941, which meets the research requirements.

Machine learning model construction process

The RFR algorithm is an ensemble learning algorithm that combines multiple decision trees for prediction, and is particularly suited for handling high-dimensional and nonlinear complex datasets. This method effectively prevents data overfitting and allows for rapid training on test data. As shown in Fig. 1, the RFR algorithm consists of many decision trees, each constructed based on data samples from the training set. When performing regression tasks, the predictions from each tree are averaged to obtain the final prediction value. Typically, deep decision trees tend to overfit, but random forests can effectively mitigate this issue by training each tree with feature and sample subsets⁴¹. Features and samples were randomly selected from the input data, with each decision tree trained independently. The predictions of these trees were then aggregated to achieve more accurate predictions than a single decision tree model⁵¹.Article
ADS
CAS
MATH

Google Scholar
Machine learning model establishment process. This research employed Python version 3.9.1 and utilized the Sklearn library to construct a machine learning prediction model for athlete engagement. The specific methodology is outlined as follows:

Data collection and preprocessing

Therefore, the potential for implementing machine learning in sports is extensive. The application of machine learning in sports prediction offers significant opportunities for China to play a leading role on the global sports stage, having a profound impact on enhancing the overall competitiveness of the sports industry and advancing scientific excellence in sports.DW Ariani 2021 The relationship of passion, burnout, engagement, and performance: An analysis of direct and indirect effects among indonesian students J. Behav. Sci. 16 2 86 98In order to obtain sufficient data sets, this study distributed and collected questionnaires to athletes in multiple professional sports teams, high-level national training teams from Zhejiang, Heilongjiang, and Liaoning provinces by means of Wenjuanxing. From 1 June to 1 October 2021, a total of 445 questionnaires were distributed. Of these, 175 (53.7%) were male participants, a higher rate than females, and the average age of the athletes was between 17 and 22 years old and had about three years of training experience. Sports include: basketball (74), volleyball (36), soccer (63), cricket (47), ice hockey (20), curling (11), team aerobics (47) and others (28). All questionnaire surveys were conducted after sports training sessions. The questionnaire scale covered in this study can be divided into three parts: in the first part, we made it clear that the survey was based on the principles of voluntariness and anonymity; the responses to the questionnaire were used only for the researchers; they should not be used for any kind of profit or other purposes. The second part was to collect basic information about the athletes. In the third part, we set up the scales required for this study, including cohesion (Group Environment Questionnaire), passion (Sports Passion Questionnaire), mental toughness (Sports Mental Toughness Questionnaire) and athlete engagement (Athlete Engagement Questionnaire).

$$beginarray*20c {N = fracleft( X – X_min right){{left( {X_{max} – X_min } right)}} } endarray$$

(1)

Written informed consent was signed by each team representative and all participants were informed that they could withdraw from the study at any time. All methods were carried out in accordance with relevant guidelines and regulations. This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Zhejiang Normal University (ZSRT2022079).

Division of training set and test set

where (x_i) is an element of the training set, x is an element of the test set, and σ is a parameter of the RBF. However, the parameters of SVM are predetermined. Cross-validation is used to determine the parameter values. Whereas, SVM with fixed parameters are not applicable to changing athlete data.

Table 1 Comparative analysis of RMSE for predicting different feature combinations under different machine learning algorithms.

Full size table

Machine learning model algorithm test

The results of the SHAP analysis show that the three features cohesion, passion and mental toughness have a decisive influence on the predictions of model clearly. The data in Fig. 4 demonstrate the average SHAP values for the features Mental Toughness (MT), Passion (Pa), Cohesion (Co), Sports Satisfaction (SS) and Psychological Collectivism (PC), which visualise the average influence of each feature in the model predictions. It can be observed from the Fig. 4 that features MT, Pa and Co have higher average SHAP values, indicating that they occupy an significant position in model prediction. We further refine these to the level of a single sample, revealing the relationship between the feature values and their SHAP values for this sample which demonstrating the high prediction accuracy of the model by comparing the model predicted values with the actual values, as shown in Fig. 5. In this sample, the positive effect of feature MT is significant particularly, while the effects of features PC and SS are minor relatively. Combining the analyses in Figs. 4 and 5, we conclude that in the study sample, feature MT is the most critical factor influencing the prediction results of model, while features Pa and Co also have a significant impact. However, the impact of features PC and SS is weak relatively. These findings provide valuable information for an in-depth understanding of the decision-making mechanism of the model and also indicate the key features to focus on during data analysis and model optimisation.Article

Google Scholar
D Fletcher M Sarkar 2012 A grounded theory of psychological resilience in Olympic champions J. Psychol. Sport Exerc. 13 5 669 678

Feature selection of machine learning model

GBRT is an ensemble algorithm with a powerful learning strategy. Although it was initially designed to address classification problems, it has been successfully applied to the field of regression⁴⁸. In the gradient boosting process, each step aims to minimize the loss function by adding elementary tree that reduce the loss function at each step, ultimately minimizing the loss function. In this way, GBRT is able to learn complex nonlinear relationships within the data.Y Kittichotsatsawat N Tippayawong KY Tippayawong 2022 Prediction of Arabica coffee production using artificial neural network and multiple linear regression techniques J. Sci. Rep. 12 1 14488

Table 2 Comparative analysis of predicting different feature combinations under different machine learning algorithms.

Full size table

Output machine learning model prediction results

The regression tree ({h}_{m}(x)) is constructed by modeling ({z}_{m}(x)) and x. The weighting factors can be obtained by multiplying the importance of each feature by the corresponding coefficient of the model. Its formula is expressed in Eq. (12).

Machine learning model evaluation

H Liu W Hou I Emolyn Y Liu 2023 Building a prediction model of college students’ sports behavior based on machine learning method: combining the characteristics of sports learning interest and sports autonomy Sci. Rep. 13 1 15628

$$beginarray{*20l} {overline{Y} = frac{1}{m}mathop sum limits_{i = 1}^{m} Y_i } endarray$$

(2)

Xin Zhang and Zhikang Lin contributed equally to this work and should be regarded as co first authors.Article
MATH

Google Scholar

$$beginarray{*20l} {R^{2} = 1 – frac{{mathop sum nolimits_{i = 1}^{m} left( {Y_i – widehat{{Y_i }}} right)^{2} }}{{mathop sum nolimits_{i = 1}^{m} left( {Y_i – overline{Y}} right)^{2} }}} endarray$$

(3)

$$beginarray{*20l} {RMSE = sqrt {frac{1}{m}mathop sum limits_{i = 1}^{m} left( {Y_i – widehat{{Y_i }}} right)^{2} } } endarray$$

(4)

$${text{MSE }} = { }frac{1}{m}mathop sum limits_{i = 1}^{m} left( {Y_i – widehat{{Y_i }}} right)^{2}$$

(5)

$$beginarray{*20l} {MAE = frac{1}{m}mathop sum limits_{i = 1}^{m} left| {Y_i – widehat{{Y_i }}} right|} endarray$$

(6)

Machine learning algorithm theory

LR algorithm

D Chicco MJ Warrens G Jurman 2021 The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation J. PeerJ Comput. Sci. 7 623

$$beginarray*20c {hat{y}left( {w,x} right) = omega_{0} + omega_{1} x_{1} + cdots + omega_{c} x_{c} + b} endarray$$

(7)

MA Mohammed 2023 Adaptive secure malware efficient machine learning algorithm for healthcare data J. CAAI Trans. Intell. Technol. 2 1 12

KNN algorithm

A Dorothea Wahyu 2021 The relationship of passion, burnout, engagement, and performance: an analysis of direct and indirect effects among Indonesian students J. Behav. Sci. 16 2 86 98Figure 7 displayed the prediction accuracy of regression models constructed using different machine learning algorithms. By comparing the R² scores, the overall effectiveness of the different prediction models could be determined. The R² score, also known as the coefficient of determination, is an important metric for evaluating the quality of regression models. It measures the extent to which the model explains the variability in the dependent variable, reflecting the accuracy of the model’s predictions. In model evaluation, a higher R² score is preferred, indicating that the model fits the data better.

$$beginarray*20c {hat{y}left( x right) = frac{{mathop sum nolimits_{{x_i in N_{k} left( x right)}} omega left( {x,x_i } right)y_i }}{{mathop sum nolimits_{{x_i in N_{k} left( x right)}} omega left( {x,x_i } right)}} } endarray$$

(8)

You can also search for this author in
PubMed Google Scholar

GBRT algorithm

Article

Google Scholar
CM Bishop NM Nasrabadi 2006 Pattern Recognition and Machine Learning Springer New York 738

$$beginarray*20c {F_{0} left( x right) = arg beginarray*20c min y endarray mathop sum limits_{t = 1}^{N} Lleft( {y_{t} ,gamma } right)} endarray$$

(9)

$$beginarray*20c {F_{m} left( x right) = F_{m – 1} left( x right) + gamma_{m} h_{m} left( x right)} endarray$$

(10)

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

$$beginarray*20c {z_{m} left( {x_{t} } right) = – frac{{partial Lleft( {y_{t} ,F_{m – 1} left( {x_{t} } right)} right)}}{{partial F_{m – 1} left( {x_{t} } right)}}} endarray$$

(11)

You can also search for this author in
PubMed Google Scholar

$$beginarray{*20l} {y_{m} = arg beginarray*20c min y endarray mathop sum limits_{ – l = 1}^{N} Lleft( {y_{t} ,F_{m – 1} left( {x_{t} } right) – y_{m} h_{m} left( {x_{t} } right)} right)} endarray$$

(12)

DTR algorithm

In the construction of machine learning models, the RFECV method was used for feature selection. As an efficient wrapper-based feature selection technique, RFECV combined the advantages of Recursive Feature Elimination (RFE) and Cross-Validation (CV), enabling the automatic filtering of feature subsets that had the greatest impact on model performance.Article

Google Scholar

$$beginarray{*20l} {left( {j,s} right)^* = {}_{j,s }^{argmin} left[ {mathop sum limits_{{x_i in R_{1} left( {j,s} right)}} left( {y_i – c_{1} } right)^{2} + mathop sum limits_{{x_i in R_{2} left( {j,s} right)}} left( {y_i – c_{2} } right)^{2} } right]} endarray$$

(13)

Article

Google Scholar

$$beginarray{*20l} {c_{m} = frac{1}{{N_{m} }}mathop sum limits_{{x_i in R_{m} left( {j,s} right)}} y_i ,m = 1,2} endarray$$

(14)

N represents the normalized data, X represents the measured value, ({X}_min) and ({X}_{max}) represent the minimum and maximum values, respectively.

$$beginarray{*20l} {R_{1} left( {j,s} right) = left. x right|x_{j} le s,R_{2} left( {j,s} right) = left. x right|x_{j} > s} endarray$$

(15)

The hyperparameters of the LR model include: copy_X (default True, prevents modification of the original data), fit_intercept (default True, calculates the intercept), n_jobs (default None, does not compute in parallel), and positive (default False, does not force the coefficients to be positive). For the optimal configuration of SWO combined with GBR, DTR, RFR and KNNR, the following hyperparameters are explained: in the GBR model, the learning rate is set to 0.1837, a lower value that helps the model to perform more fine-grained weight adjustments during the training process, thus improving the accuracy of the prediction; the maximum depth is 1.0, which indicates that we adopt a simpler tree structure to prevent overfitting; the number of estimators is 19.13, indicating that the predictive ability of the model is enhanced by adding more trees. In the DTR model, the optimal maximum depth is about 4.46, which helps the model to capture complex patterns in the data while avoiding the overfitting problem caused by too deep a tree structure. The number of estimators in the RFR model is 18.85, which indicates that the generalisation of model is enhanced by integrating more than one decision tree; the maximum depth is 37.85, which allows the model to learn more complex feature relationships; and the minimum sample leaf is 7.62, ensuring that the leaf nodes contain sufficient information. For the KNNR model, the optimal number of neighbours is 16, which indicates that the model makes predictions by considering the data points of the 16 nearest neighbours; the distance metric is set to “distance”, which indicates that the model takes into account the distances between the data points when calculating the predicted values.

$$beginarray{*20l} {fleft( x right) = mathop sum limits_{m = 1}^{M} hat{c}_{m} Ileft( {x in R_{m} } right)} endarray$$

(16)

Athlete engagement is influenced by several factors, including cohesion, passion and mental toughness. Machine learning methods are frequently employed to construct predictive models as a result of their high efficiency. In order to comprehend the effects of cohesion, passion and mental toughness on athlete engagement, this study utilizes the relevant methods of machine learning to construct a prediction model, so as to find the intrinsic connection between them. The construction and comparison methods of predictive models by machine learning algorithms are investigated to evaluate the level of predictive models in order to determine the optimal predictive model. The results show that the PSO-SVR model performs best in predicting athlete engagement, with a prediction accuracy of 0.9262, along with low RMSE (0.1227), MSE (0.0146) and MAE (0.0656). The prediction accuracy of the PSO-SVR model exhibits an obvious advantage. This advantage is mainly attributed to its strong generalization ability, nonlinear processing ability, and the ability to optimize and adapt to the feature space. Particularly noteworthy is that the PSO-SVR model reduces the RMSE (7.54%), MSE (17.05%), and MAE (3.53%) significantly, while improves the R² (1.69%), when compared to advanced algorithms such as SWO. These results indicate that the PSO-SVR model not only improves the accuracy of prediction, but also enhances the reliability of the model, making it a powerful tool for predicting athlete engagement. In summary, this study not only provides a new perspective for understanding athlete engagement, but also provides important practical guidance for improving athlete engagement and overall performance. By adopting the PSO-SVR model, we can more accurately identify and optimise the key factors affecting athlete engagement, thus bringing far-reaching implications for research and practice in sport science and related fields.

RFR algorithm

J Bi 2023 PSOSVRPos: WiFi indoor positioning using SVR optimized by PSO Expert Syst. Appl. 222 119778

SVR algorithm

S Gu L Xue 2022 Relationships among sports group cohesion, psychological collectivism, mental toughness and athlete engagement in Chinese team sports athletes Int. J. Environ. Res. Public Health 19 9 4987

$$beginarray{*20l} {fleft( x right) = w^{T} varphi left( x right) + b} endarray$$

(17)

Suppose the training set (T={left(x_{1},{y}_{1}right),cdots ,left(x_{N},{y}_{N}right)}), (x_i) is the feature vector of an instance and ({y}_iin {{c}_{1},{c}_{2},cdot cdot cdot ,{c}_{n}}) is the class of that instance, (i =(text{1,2},cdot cdot cdot ,n)), for a test instance x, its class y can be represented in⁴⁷ Eq. (8).

$$beginarray{*20l} {R_{reg} = frac{1}{2}left| {left| w right|} right|^{2} + C times frac{1}{n}mathop sum limits_{i = 1}^{n} left| {y_i – fleft( x right)} right|_{varepsilon } } endarray$$

(18)

Stander, F.W., Leon, T. D. B., Stander, M., Mostert, K. & Coxen, L. A strength-based approach to athlete engagement: An exploratory study. J. South Afri. J. Res. Sport Phys. Educ. Recreat. 39, 165–175 (2017).

$$beginarray{*20l} {fleft( x right) = mathop sum limits_{i = 1}^{n} left( {a_i – a_i^* } right)kleft( {x_i ,x} right) + b} endarray$$

(19)

The choice of PSO-SVR over integrated learning is mainly based on its advantages in terms of interpretability, applicability to small datasets and strong generalization capabilities. This choice is suitable for scenarios particularly where accurate predictions need to be extracted from limited data. In contrast, integrated learning models tend to be more complex, demand higher computational resources, and may perform poorly when dealing with small sample data.

$$beginarray{*20l} {fleft( x right) = mathop sum limits_{i = 1}^{n} left( {a_i – a_i^* } right)exp left( { – frac{{left| {|x_i – x} right||^{2} }}{{2sigma^{2} }}} right) + b} endarray$$

(20)

C Lonsdale K Hodge TD Raedeke 2007 Athlete engagement: I. A qualitative investigation of relevance and dimensions Int. J. Sport Psychol. 38 451 470

PSO-SVR algorithm

X Yizi 2024 Bioinformatics combined with machine learning to identify early warning markers for severe dengue J. China Med Univ. 53 7 583 590With the RFECV method, we successfully filtered the optimal feature subset for the machine learning model based on cohesion, passion and mental toughness. Figure 3 demonstrates that the model was effective in predicting athlete engagement using three feature subsets. These selected features comprised cohesion, passion and mental toughness, covering dimensions such as ATG-T, ATG-S, GI-T and GI-S; passion criteria, harmonious passion and obsessive passion; and confidence, constancy and control. The synergy of these features contributed to the model’s strong performance in predicting athlete engagement.

$$beginarray{*20l} {fleft( {C,g} right) = min arg mathop sum limits_{m}^{M} mathop sum limits_{n}^{N} Lossleft( {y_i – mathop ylimits_i left( {c_{m} ,g_{n} } right)} right)} endarray$$

(21)

M Abdel-Basset R Mohamed M Jameel 2023 Spider wasp optimizer: A novel meta-heuristic optimization algorithm Artif. Intell. Rev. 56 11675 11738

Article
MATH

Google Scholar

Results

Model feature selection results

MATH

Google Scholar
MATH

Google Scholar

Algorithm for making predictions using random forest.

Download references

Table 3 Comparative analysis of optimization algorithms for various machine learning models.

Full size table

By considering the advantages and disadvantages of these algorithms and optimization strategies, we found that the SVR model combined with PSO performs the best in predicting athlete participation. The PSO-SVR model optimizes the parameters of the SVR through the PSO algorithm, which effectively improves the predictive performance of the model and reduces the risk of overfitting. Compared to other models, it performs better in handling nonlinear relationships and complex feature interactions. This result indicates that the PSO-SVR model has higher accuracy and stronger generalization ability in the task of athlete engagement prediction, which provides a valuable reference for future research and practice.where ({N}_{m}) is the number of samples in sub-region ({R}_{m}(j,s)). The data is divided into regions by selected ((j,s)) and the corresponding output values are determined as shown in Eq. (15).

Comparison of prediction models under different machine learning algorithms

where ({widehat{c}}_{m}) is the predicted value of the (m) th region and (I(xin {R}_{m})) is an indicator function to determine whether (x) belongs to region ({R}_{m}) . In this way, the decision tree can predict and analyze the data effectively.

As a core concept in the field of positive psychology, engagement refers to an individual’s persistent mental state filled with positive emotions, which is an important factor in influencing an individual’s behavioral performance and the direction of behavioral development. Although previous research in the area of sport performance has focused on burnout mostly as leading to maladaptation and sport dropout^4,5. However, with the growth of the positive psychology movement, athlete engagement has received much attention and related research has become increasingly prominent⁶. Early identification of low-engagement individuals allows for intervention before the risk of burnout and disengagement occurs⁷. It needs to be specified that a state of non-burnout is not the same as engagement, and that low engagement is not burnout. Positive feedback can be obtained when we view athlete performance in a positive manner. Therefore, athlete engagement, as an important indicator of positive psychology of athletes, plays an important and positive role in improving athletes’ competitiveness and athletic performance. A high level of engagement is conducive to stimulating positive qualities that promote the development and maturation of athletes⁸, laying the foundation for enhancing their athletic ability⁹ and ultimately translating into significant improvements in athletic performance and achievement⁷.J Chuantong R Norsilawati Abdul R Nelfianty Mohd 2023 A Preliminary psychometric investigation of a Chinese version of athlete engagement questionnaire (AEQ) Int. J. Hum. Mov. Sports Sci. 11 1 103 111

In this study, we conducted a comprehensive analysis of the dataset with the aim of evaluating the applicability of different machine learning algorithms in constructing a prediction model for athlete participation based on their characteristics and advantages. We selected LR, KNN, GBRT, DTR, RFR, SVR and models combining four optimization algorithms (RGS, GA, SWO and PSO) to be trained and tested with the aim of identifying the most suitable models for predicting machine learning models for athlete engagement. During the course of our research, we found that each algorithm and its optimization strategy face unique challenges and potential limitations:

X Zhan S Zhang WY Szeto X Chen 2020 Multi-step-ahead traffic speed forecasting using multi-output gradient boosting regression tree J. Intell. Transp. Syst. 24 2 125 141

Process of adaption in SVR.
Google Scholar

Discussion

X.Z.: Conceptualization, writing and review. Z.K.L.: Analysis of experimental data and writing. S.G.: Research resource collection and experiment design. All authors reviewed the manuscript.Zhang, X., Lin, Z. & Gu, S. A machine learning model the prediction of athlete engagement based on cohesion, passion and mental toughness.
Sci Rep 15, 3220 (2025). https://doi.org/10.1038/s41598-025-87794-yYe, L. The Impacting Factors and Mechanism of Athlete Engagement. (D. Central China Normal University, 2014).MATH

Google Scholar
Radar plot of overall evaluation level of different machine learning algorithms.

Limitations and future research

In this model, the weight vector (omega) can be obtained through the “coef_” attribute of the model, while the intercept (b) is obtained through the “intercept_” attribute. And LR models usually don’t require much tuning of the hyperparameters; the performance of a linear model depends largely on the degree to which the problem or data follow a linear distribution⁴⁶. If the data do exhibit a linear relationship, then LR models tend to provide better predictions. However, if the true distribution of the data is nonlinear, more complex models may need to be considered, or appropriate preprocessing of the data might be required to improve the performance of the LR model.You can also search for this author in
PubMed Google Scholar

Conclusion

Z Zhongqiu 2012 Research, development and application of sports psychology in the field of competitive sports J. Tianjin Univ. Phys. Educ. 27 3 7

Data availability

RJ Vallerand 2010 On passion for life activities: The dualistic model of passion Adv. Exp. Soc. Psychol. 42 97 193

References

where ({y}_i) is the truth value of the ith sample and (mathop ylimits_i left( {c_{m} ,g_{n} } right)) is a function of C and g. We terminate the optimization process when the predictive model yields the model we need.Houyuan, Z., Qi, Y., Shengren, C. & Bonan, Z. Research on prediction model of amateur runners’ sports performance based on FMS. C. Sports Biomech. Branch. 449–451 (2023).
Article

Google Scholar
DF Gucciardi S Hanton S Gordon CJ Mallett P Temby 2015 The concept of mental toughness: Tests of dimensionality, nomological network, and traitness J. Personal. 83 1 26 44
JM Keller MR Gray JA Givens 1985 A fuzzy k-nearest neighbor algorithm IEEE Trans. Syst. Man Cybern. 4 580 585This study aims to predict the level of athlete engagement by examining both team and individual dimensions. At the team level, cohesion is one of the most frequently cited concepts in sports science research. It refers to the solidarity and sense of belonging that individuals feel in a team, which is a key factor influencing team performance. Strong team cohesion predicts the level of athlete engagement. In sporting events, cohesion provides crucial social support for athletes²³. Teams with high levels of cohesion usually result in higher levels of member engagement and increased support and trust in each other. Research suggests that cohesion is associated with satisfying psychological needs, promoting intrinsic motivation, and elevating athlete engagement levels. Increased team cohesion is an effective way to increase athlete engagement among athletes. It can be concluded that cohesion not only affects the emotional experience of athletes, but also plays an significant role in their behavioral performance. At the individual level⁹, passion and mental toughness play crucial roles in influencing athlete engagement. Hegel once asserted, “Nothing great in the world has ever been accomplished without passion.” Passion is not only a potential indicator of teacher engagement among secondary school²⁴ but also significantly influences students’ learning engagement²⁵. In sports, passion, internalized into personal identity²⁶, fosters strong engagement, enabling athletes to persist in their activities. Research suggests that we should focus on promoting harmonious passion tendencies within achievement situations in order that there is an increased probability of adaptive behavioral outcomes ensuing and being able to promote healthy sport engagement²⁷. Also, the two types of passion (harmonious and obsessive) should lead to engagement that, in turn, should lead to improved sport performance. Thus with passion, people are able to focus completely on the task at hand, give their full commitment to the sport, and experience positive results from their participation. Mental toughness is indispensable in both protecting and enhancing athlete engagement, representing a crucial psychological asset for achieving exceptional sporting performance. Research suggests that mental toughness is essential for athletes to cope with difficulties successfully, avoid the effects of stress and maintain their performance development²⁸. It not only has a significantly positive predictive influence on athlete engagement²⁹ but also correlates positively with an individual’s sports performance. Athletes with a high level of mental toughness can participate in the process of training and competing with a positive state of mind and body, and have a high level of athlete engagement (e.g., a high level of mental toughness). Athletes with high mental toughness have a strong degree of confidence in their own abilities and respond positively when they encounter difficulties, so they tend to approach training and competition with a more positive physical and mental condition³⁰ and have a high level of athlete engagement. Therefore, mental toughness plays a central role in the foundation of sport success and is a proximal factor in the athlete engagement.
DOI: https://doi.org/10.1038/s41598-025-87794-y
Google Scholar
SVR is a regression algorithm based on the Support Vector Machine (SVM) algorithm. SVR maps the input data to a high-dimensional feature space and constructs a hyperplane that maximizes the boundary between the predicted and actual values. In the given sample set (T={left(x_{1},{y}_{1}right),cdots ,left(x_{N},{y}_{N}right)}), there are N input variables (x_{t}) and corresponding output variables ({y}_{t},t=1,cdots ,N). The form of SVR can be expressed as follows Eq. (17).Different machine learning algorithms predict the outcome of the MAE.
Xin Zhang & Song Guwhere (w) is an element in the high-dimensional space, (varphi left(xright)) is a nonlinear mapping function, and b is a threshold. Then, the above regression problem seeks to minimize the risk of f by minimizing the Euclidean norm of w to reduce the complexity as follows Eq. (18).
Different machine learning algorithms predict the outcome of R².Supervised learning models can be classified into regression and classification techniques when used to predict continuous and discrete target variables. LR is a typical regression model that predicts targets through the following equation⁴⁵ . Its formula is expressed in Eq. (7).
Previously, researchers have delved into the factors affecting athlete performance from a number of perspectives, encompassing not only the traditional technical, tactical and physical aspects, but also cutting-edge areas such as exercise fatigue, recovery mechanisms and psychological factors. Therefore, analyzing and predicting athlete performance is a complex and dynamic endeavor that involves multiple considerations. And machine learning technology plays an important role in analyzing data on athlete performance. The application of this technology is changing the way athletes are monitored in terms of health and performance, thus impacting their training, competition and career management. However, in previous studies, many researchers and coaches have focused on reducing exercise fatigue and burnout, so that they overlooking the fact that athletes who maintain a high level of engagement with training and competition are the ones who respond positively to the stress of competition and perform well in training or competition. Engagement represents a sustained, shared affective-cognitive state of concentration on any object, event, person or behavior. When athletes have a sense of control over their ego state, it helps them achieve their goals. Athlete engagement studied in this paper serves as an important indicator of positive athletic performance and can reflect positive and healthy mental states of athletes. Athletes’ mental health is a major resource throughout their athletic career and beyond. However, the diversity and complexity of sports performance and the nonlinear relationships between psychological traits complicate the task of analysis and prediction using traditional methods. Therefore, the use of effective machine learning models may help to achieve highly accurate predictions of athletic performance.K Goodger T Gorely D Lavallee C Harwood 2007 Burnout in sport: A systematic review J. Sport Psychol. 21 127 151
Through Table 3, the pso-SVR model was identified as the best hyperparameter combination. As shown in Fig. 6, the best hyperparameters for this machine learning model with respect to the features of cohesion, passion and mental toughness were C = 6.6636 and g = 0.0436. With this hyperparameter combination, the model achieved a prediction accuracy of 92.62%, demonstrating its effectiveness in predicting athlete engagement.Here ({(j,s)}^*) represents the selected best feature (j) and the cut point (text{s}), while ({R}_{1} (j,s)) and ({R}_{2} (j,s)) denote the two subregions obtained after the cut respectively. ({y}_i) is the target value of the ith sample, ({c}_{1}) and ({c}_{2}) are the average of the target values in the two subregions, which are calculated using the formula in Eq. (14).
C Lonsdale K Hodge S Jackson 2007 Athlete engagement: II. Development and initial validation of the athlete engagement questionnaire Int. J. Sport Psychol. 38 4 471 492This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Zhejiang Normal University (ZSRT2022079). All methods were carried out in accordance with relevant guidelines and regulations. Informed consent has been obtained from the participants. We express our great gratitude to all the authors of the scales and all the participants in the present study.
RFE is a feature selection technique that relies on a specific machine learning model to achieve a predefined number of features by iteratively eliminating features with lower weights. It sorts features based on the model’s coefficient weights or importance scores, recursively removing a certain proportion of redundant or irrelevant features at each iteration, ultimately retaining the subset of features that contributes most to the model’s predictive capability. However, a limitation of the traditional RFE method is the need to manually set a threshold for retaining the number of features, which could lead to the loss of important features or the retention of unnecessary ones. To overcome this drawback, this study adopted RFECV. This method evaluates the performance of each feature subset and selects the feature combination with the highest score, thereby ensuring that the chosen feature subset maximizes its contribution to model performance and effectively avoids the accuracy loss that may result from manually setting thresholds⁴³. Using the RFECV method, we screened out features with higher weights and determined that the three features of cohesion, passion and mental toughness were critical for model performance by comparing different combinations of features, as detailed in Table 2.AC St-Louis M Rapaport PL Chénard RJ Vallerand S Dandeneau 2021 On emotion regulation strategies and well-being: The role of passion J. Happiness Stud. 22 1791 1818
T Zhou L He J Wu F Du Z Zou 2019 Prediction of surface roughness of 304 stainless steel and multi-objective optimization of cutting parameters based on GA-GBRT J. Appl. Sci. 9 18 3684AV Carron P Chelladurai 2010 The dynamics of group cohesion in sport J. Sport Psych. 3 2 123 139
MATH

Google Scholar
Feature selection of machine learning model. The study decided to select the features for the machine learning model from the athletes’ cohesion, passion and mental toughness. Cohesion consists of four scores: ATG-T, ATG-S, GI-T and GI-S; Passion consists of three scores: passion criteria, harmonious passion and obsessive passion; Mental toughness consists of three scores: confidence, constancy and control.
Google Scholar
Article

Google Scholar
Article

Google Scholar
Article

Google Scholar
To validate the performance of the model, the available database was divided into two distinct groups: a training set for training predictive models and a testing set for evaluating its performance. After comparing the experimental results of different data partition ratios, we found that the 8:2 partition ratio has the lowest RMSE, as detailed in Table 1. During the model training phase, we improve the parameters continuously and use the test set to evaluate the performance of the model for further refinement. The test set not only evaluates the data, but also confirms the reliability and generalization ability of the model. In addition, employing five-fold cross-validation³⁵, we guarantee the ability of model to generalize and successfully mitigate the danger of overfitting.Article
MATH

Google Scholar
JA Segovia JF Toaquiza JR Llanos DR Rivas 2023 Meteorological variables forecasting system using machine learning and open-source software J. Electron. 12 4 1007Article

Google Scholar
Zhang, L. W. & Mao, Z. X. Evaluation Manual of Psychological Scales Commonly Used in Sports Science. J. Beijing Sports Univ. Press. 378, 7-81100-226-4 (2004).
M Sheard J Golby A Wersch van 2009 Progress toward construct validation of the sports mental toughness questionnaire (SMTQ) Eur. J. Psychol. Assess. 25 3 186 193 https://doi.org/10.1027/1015-5759.25.3.186Figures 9 and 10 displayed the MSE and MAE of seven machine learning algorithms. Both MSE and MAE were crucial indicators for assessing model quality, directly reflecting the accuracy of the model’s predictions. By comparing these error metrics, the overall effectiveness of the different prediction models could be determined. Before assessing model performance, lower MSE and MAE values indicated that the model’s predictions were closer to the actual data. Based on the results in Fig. 9, we can see that the PSO-SVR model has the best performance with an MSE of 0.046. This is followed by the GA-SVR model with an MSE of 0.0153. The LR model has an MSE of 0.0177. According to the results in Fig. 10, the LR model has an MAE of 0.0618 which is the best performance. This is followed by the PSO-SVR model with an MAE of 0.0656. The MAE of the RFR model was 0.0708, while the MSE and MAE values of the other four algorithms were higher, indicating larger prediction errors. Therefore, based on MSE and MAE values, the PSO-SVR model demonstrated the best performance in predicting athlete engagement, with the lowest error metrics, meeting the research requirements. This suggested that the PSO-SVR model was more effective in predicting athlete engagement while maintaining lower prediction errors.
Data preprocessing and normalizing techniques are employed to enhance the dataset’s quality. These methods aim to optimize the performance of machine learning algorithms by minimizing data noise and decreasing data dimensions⁴². The input dataset comprises the cumulative scores for cohesion, passion and mental toughness, whereas the output dataset comprises the overall score for athlete engagement. Input and output data normalized in the range 0 to 1. Equation (1) was used for normalization of data.
Google Scholar
W Zhaohua L Jie W Bo D Nana N Fuhua 2024 Research on mining and applications of individual heterogeneity factors in resident demand response by integrating machine learning and SHAP value algorithm J. Syst. Eng. theory Pract. 44 7 2247 2259Correspondence to
Song Gu.
S Kent K Kingston KF Paradis 2018 The relationship between passion, basic psychological needs satisfaction and athlete burnout: Examining direct and indirect effects J. Clin. Sport Psychol. 12 75 96Optimal hyperparameters for machine learning feature models.
In sports, the application of machine learning and AI technologies is continually broadening the boundaries of sports talent development, athlete performance enhancement and event decision-making. Through these advanced technologies, coaches and sports teams are able to achieve accurate predictions of athlete performance, driving a shift in the decision-making process from intuitive experience to quantitative data-based analysis¹⁰. Researchers have used smart technologies to play an significant role in several aspects of selection of basketball players, analysis of game activity, prediction of results and so on, improving team competitiveness significantly^11,12. The soccer club case further demonstrates the practical application of sports data analytics in predicting injury prevention, assessing optimal performance and optimizing training¹³. It is clear that predictions utilizing machine learning that enable timely assessment of athletes’ reaction and performance to different scenarios are critical for improving skills, promoting team synergy and guiding sports teams towards success. This highlights the significant value of sports prediction modeling in contemporary sports.Table 1 presented a summary of the predictive accuracy, RMSE, MSE and MAE of the prediction models constructed using seven different machine learning algorithms. Figure 11 displayed a radar chart generated from Table 3, which was used to assess the overall effectiveness of the prediction models established by these algorithms. To visually compare the performance of the different machine learning algorithms’ prediction models in terms of statistical accuracy, the radar chart was used to show the models’ combined performance across four dimensions: RMSE, MSE, MAE and predictive accuracy. In this radar chart, PSO-SVR and LR perform well on all assessment metrics, especially their R² values are close to 1, showing high predictive accuracy. Although GBR performs well on R², its RMSE and MSE are not as good.
Article
PubMed
PubMed Central
MATH

Google Scholar
where ({|left|wright||}^{2}) is the Euclidean norm of the weight vector, which is used to control the complexity of the model. ({R}_{reg}) is the empirical risk, (C) is the penalty parameter used to balance the weights of the descriptive function and the empirical risk, and ({left|{y}_i-fleft(xright)right|}_{varepsilon }) is the insensitive loss function used to measure the difference between the predicted value (fleft(x_iright)) and the true value ({y}_i). By introducing the non-negative slack variable ξ and the Lagrangian function, the function (fleft(xright)) is described as follows using the Karush–Kuhn–Tucker (KKT) condition Eq. (19).
Article

Google Scholar
AB Mahmoud WD Reisel A Berman L Fuxman 2024 Unlocking athletic potential: Harnessing people analytics for optimal performance and talent management in sports A Mansurali P Mary Jeyanthi D Hack-Polay AB Mahmoud Eds Sports analytics: Data-driven sports and decision intelligence Springer Nature Switzerland Cham
The study was funded by Zhejiang Provincial Philosophy and Social Sciences Work Office, 23NDJC118YBThis study comprehensively considered key characteristics such as cohesion, passion and mental toughness, and developed a machine learning model aimed at predicting athlete engagement. Through an in-depth comparison and analysis of the six base algorithms and four optimization algorithms, we observe that the PSO-SVR model outperforms the other models in terms of prediction accuracy significantly. In particular, it has demonstrated a remarkable ability to deal with non-linearities, adapt and optimize the feature space, and provide more accurate and reliable predictions of athlete engagement. The PSO-SVR model performed impressively in predicting athlete engagement with a prediction accuracy of 0.9262. In contrast, other models such as GA-SVR, RGS-SVR, SVR, LR and SWO-GBR are less predictive slightly. The results of this research not only highlight the great potential of the PSO-SVR model in predicting athlete engagement, but also provide a solid scientific basis for the field of athletic training and management. In summary, the wide application of the model is expected to provide scientific decision support in the field of sports training and management, and thus improve the individual performance of athletes and the overall competitiveness of sports teams.
LR struggles to capture the complex nonlinear relationships of psychological features and has limited predictive power. KNN has reduced accuracy on small datasets and is sensitive to feature scaling. GBRT and DTR are susceptible to overfitting by noise and outliers. RFR performs poorly when dealing with highly correlated features. SVR has stringent kernel function selection and preprocessing requirements. RGS is computationally expensive in high dimensional spaces. GA requires fine tuning of parameters and multiple iterations to converge^31,32. SWO as a new algorithm, lacks empirical guidance for parameter tuning and may not be as effective as mature algorithms³³. PSO is easy to fall into local optimum in high dimensional space, and is very sensitive to the initial setting and parameter adjustment³⁴.ZA Sejuti MS Islam 2023 A hybrid CNN–KNN approach for identification of COVID-19 with 5-fold cross validation J. Sens. Int. 4 100229
MATH

Google Scholar
C Tedi 2023 Coach leadership style, coach-athlete relationships and engagement in team sports: The mediating role of athlete burnout Revista iberoamericana de psicología del ejercicio y el deporte. 18 5 687 692The SMTQ was compiled by Sheard, Golby and Wersch (2009)³⁸. The Chinese version of the questionnaire demonstrated good reliability and validity in its application within China. There are 14 items in total, including three dimensions of self-confidence, constancy and control. A higher score indicates greater mental toughness in the athlete. The Cronbach’s ɑ coefficient for the questionnaire in this study was 0.849, signifying its strong reliability.
As an algorithmic discipline, machine learning analyzes and assimilates large volumes of pre-existing or synthesized datasets to make predictions and assessments. This provides sports scientists, coaches and athletes with an objective analysis tool for evaluating their performance, thus assisting them in making optimal decisions. Through personalized predictions from “learning algorithms”, it is possible to conduct in-depth analysis of each athlete’s performance and develop customized training plans to help them achieve optimal athletic performance. This allows athletes to gain a more comprehensive understanding of their physical and mental health, enabling timely adjustments and improvements in their performance during training and competitions. Coaches, for their part, can use predictive modeling to monitor athletes engagement levels regularly. Based on the prediction results, real-time information on the athlete’s true athletic performance can be obtained, and measures can be taken timely, such as adjusting the training load or adding psychological counseling. Changes in athlete engagement before and after training can also be analyzed to assess the effectiveness of training interventions and optimize future training strategies continuously. Such data-driven decision-making is often more reliable than subjective judgment. Sports managers can leverage insights from machine learning research to better support coaches and athletes, optimize resource allocation, and enhance the overall competitiveness of sports teams or programs. For example, predictive modeling of athletic engagement is utilized to develop scientific selection criteria to ensure that athlete reserves are recruited that match the team’s culture; Counseling and support resources can be allocated according to the psychological needs of athletes to ensure that athletes receive the support they need during training and competition.
In the field of sports, the utilization of machine learning technology has become a key strategy for data exploration and extraction, leveraging its interdisciplinary nature to integrate the advantages of sports science, computer science and statistics. This field not only demonstrates multi-layered and multi-domain characteristics but also highlights the importance of data analysis in specific training and competition scenarios, reflecting the overall trend toward precise sports training and management⁵². Big data analysis in the field of physical activity using AI techniques is justified as one of the important and effective tools to improve the quality of health and sports performance of professional athletes. As a fundamental application of artificial intelligence, machine learning, with its integration of various interdisciplinary domains and excellent performance, has emerged as a prominent topic in sports training method research.
MATH

Google Scholar
The GEQ compiled by Carron (2010) and translated by Ma Hongyu³⁶ was adopted. This questionnaire represents a specialized tool for assessing cohesion in sports, which has good reliability and validity in Chinese use. The questionnaire comprises 15 items, encompassing the dimensions of Attraction to Group Task (ATG-T), Attraction to Group Social (ATG-S), Group Interaction Consistency (GI-T), and Group Social Interaction Consistency (GI-S). The Cronbach ɑ coefficient of test–retest reliability is 0.869, which meets the needs of this study.
Faculty of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, ChinaDownload citation
In order to determine the best technology, we use the Root Mean Square Error (RMSE), Mean Square Error (MSE), Mean Absolute Error (MAE) and coefficient of determination (({R}^{2})) as evaluation indicators⁴¹. In the following formula, ({X}_i) represents the predicted value of the (i) th element, while ({Y}_i) denotes the actual values of the (i) th element. The regression method forecasts the ({Y}_i) value corresponding to the ({X}_i) value in a real dataset. The average of the true values⁴⁴. Its formula is expressed in Eq. (2).Zhikang Lin
Article

Google Scholar
Given that athletes might be affected by various circumstances, including physical exhaustion and meteorological conditions. These factors may lead to noise or fluctuations in the statistics. By calculating an average of every 8 data points, the impact of short-term variations can be reduced effectively and unwanted noise is minimized. This facilitates the observation of long-term trends and alterations. In some cases, athlete data may show higher densities, complicating analysis. High frequency data can be reduced to lower frequencies to make the analysis process more efficient⁴¹. Data preprocessing is an essential step in tailoring the dataset to meet the unique requirements of various machine learning algorithms. It plays a pivotal role in assessing the predictive capabilities of different models.Our goal is to determine a function, (Fleft(xright)), which minimizes a specific loss function (L(y,Fleft(xright))). GBRT approaches the optimal solution (widehat{F}(x)) step by step by weighted combination of these simple regression trees (h(x_{t})). In GBRT, each basic regression tree (h(x_{t})) is constructed based on the input variable X and the negative gradient of the loss function. Together, these trees allow the model to capture the complex nonlinear relationships in the data more accurately. The model is formulated as follows⁴⁹ Eqs. (9) and (10).
Machine learning-based predictive models do show great potential in providing personalized strategies for athlete training and enhancing athlete engagement. However, this predictive model faces a number of challenges. For example, the PSO algorithm may encounter the problem of local optimal solutions when applied, the choice of algorithm parameters has a significant impact on the final results, and also its search efficiency may decrease when dealing with high-dimensional feature spaces. At the same time, the PSO-SVR model’s dependence on self-reported data may lead to biased predictions, as these data are often influenced by the subjective feelings of individual athletes. In addition, data privacy and security issues are critical when deploying models. We need to take measures such as data desensitization and encryption to protect the privacy of research subjects and to improve model transparency and interpretability.Reprints and permissions
Anyone you share the following link with will be able to read this content:Selection of cross-validation scores based on machine learning feature models.
As for the combination of PSO and SVR, the following is an explanation of the optimal configuration of the hyperparameters: the regularization parameter C was set to 6.66, a choice that highlights the fact that we struck a good balance between model complexity and training error. This aims to prevent model overfitting while ensuring the model’s ability to fit the training data efficiently; the coefficient of the kernel function is 0.0436, and this low gamma value implies that the model prefers to use smoothed decision boundaries, which helps to capture the overall structural characteristics of the data; and the kernel type is chosen to be “rbf”, the radial basis function kernel, which specializes in revealing nonlinear relationships in the data, which is essential for enhancing the predictive power of the model. By carefully optimizing these key hyperparameters, the prediction accuracy of the SVR model is significantly improved. This makes it stand out in the comparison of many optimization algorithms and show excellent performance.Regarding the predictive accuracy of various models, the PSO-SVR model achieved the highest accuracy at 92.62%. The variables chosen to represent this study are cohesion, passion and mental toughness, which are significant potential factors influencing athlete engagement, although other factors may also affect their engagement. The PSO-SVR model excels in handling cohesion, passion and mental toughness due to its adaptability to data features and patterns, strong ability to handle high-dimensional features, and effectiveness in managing small samples and category imbalances.
Article

Google Scholar
The authors declare no competing interests.
Different machine learning algorithms predict the outcome of the RMSE.The AEQ was compiled by Lonsdale et al.³⁹(2007) and revised by Ye Lv’s translation²⁹. The questionnaire comprises 16 items, encompassing the four dimensions of self-confidence, dedication, vigor and enthusiasm. A higher score indicates a greater level of engagement from the athlete. The Cronbach’s alpha coefficient for this scale is 0.959, which meets the research requirements.
Machine Learning Model Prediction Results and Key Characterization.
G Aurélien 2017 Hands-on machine learning with scikit-learn & tensorflow Geron Aurelien 134 145 150Received: 07 August 2024
Overall results indicated that the PSO-SVR algorithm performed the best in terms of accuracy, and RMSE and MSE scores are relatively low (see Table 3). It can be concluded that this model not only has high predictive accuracy but also exhibits low prediction errors. Compared to PSO-SVR, the LR algorithm has lower accuracy. However, LR outperforms PSO-SVR in terms of MAE, possibly because the errors in LR are symmetrically distributed, whereas the increased complexity of PSO-SVR may result in asymmetric error distribution. GA explores the solution space in depth through crossover and mutation so that better Mae metrics can be found. The stochastic nature of RSG can be an advantage in complex parameter spaces, especially when optimizing MSE.Different machine learning algorithms predict the outcome of the MSE.
Article

Google Scholar
Alibrahim, H. & Ludwig, S. A. Hyperparameter Optimization: Comparing Genetic Algorithm against Grid Search and Bayesian Optimization. In 2021 IEEE Congress on Evolutionary Computation (CEC), Kraków, Poland, 1551–1559 (2021).
H Zhaorigetu 2022 Prediction of sports performance combined with deep learning model and analysis of influencing factors J. Sci. Program. https://doi.org/10.1155/2022/4082906N Louw SJ Steel 2006 Variable selection in kernel Fisher discriminant analysis by means of recursive feature elimination J. Comput. Stat. Data Anal. 51 3 2043 2055
S Kannappan 2023 Perceived organizational support and employee engagement: Mediating role of work passion of the employees Manag. Br. J. Adm. Manag. 58 156 210 212Article
MathSciNet
MATH

Google Scholar
Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Machine learning, a powerful interdisciplinary technology that incorporates multiple disciplines, has the remarkable capability to learn autonomously from experience.It constructs predictive models using “learning algorithms” to make accurate, personalized predictions for new scenarios. As a core component of artificial intelligence, machine learning plays a vital role by extracting valuable insights from diverse and complex data sets. Its applications extend to numerous fields, including medicine¹, bioinformatics² and automation technology³.
Article
MATH

Google Scholar
JO Emily H James M David 2010 Identifying important practice behaviors for the development of high-level youth athletes: Exploring the perspectives of elite coaches J. Psychol. Exerc. 11 6 433 443
College of Physical Education and Health Sciences, Zhejiang Normal University, Jinhua, 321004, ChinaIn this study, RFECV was employed to identify the optimal number of features for the model. The input features considered included mental toughness, cohesion, passion, sports satisfaction and psychological collectivism.
To assess the predictive performance of each model, four types of metrics were used: RMSE MSE, MAE and ({R}^{2}).

M Sorkkila A Tolvanen K Aunola TV Ryba 2019 The role of resilience in student-athletes’ sport and school burnout and dropout: A longitudinal person-oriented study Scand. J. Med. Sci. Sports 29 7 1059 1067

Acknowledgements

The SVR with fixed parameters struggled to adapt to changes in athlete data, and online learning consumed a substantial amount of computational resources. Additionally, the execution speed did not meet the requirements. A threshold θ was set, and once the estimation error exceeded this threshold, the model was retrained with the latest data. To improve parameter optimization, Particle Swarm Optimization (PSO) was introduced into SVR.

Funding

Train the training set and evaluate the prediction performance of each model.

Author information

M Oytun C Tinazci B Sekeroglu C Acikada HU Yavuz 2020 Performance prediction and evaluation in female handball players using machine learning models IEEE Access 8 116321 116335

Authors and Affiliations

Accepted: 22 January 2025Currently, traditional statistical methods often struggle to capture complex nonlinear relationships when analyzing the relationship between multidimensional psychological traits and athlete performance. For example, while linear regression models provide basic predictions, they are limited in their explanatory power and predictive accuracy. In contrast, machine learning methods show remarkable advantages when dealing with complex datasets. It is not only capable of handling large amounts of high-dimensional data, but also has the ability to identify patterns and associations automatically, capturing the complex relationships between variables effectively. Particularly in the field of sport psychology, machine learning has been able to explore the influence of psychological traits on behavior in greater depth. Machine learning can identify the relationship between underlying psychological traits and athlete engagement through sophisticated data analysis. This provides a new methodological basis for quantitative sport psychology and improves the science and accuracy of prediction of athlete engagement. However, the multiple psychological factors involved in athlete engagement, such as passion²⁰, burnout²¹ and thriving²², are often subjective and difficult to standardize and quantify. Machine learning requires large amounts of high-quality, structured data for training, but the diversity and complexity of subjective questionnaires bring challenges to data processing. Therefore, we propose to implement a comprehensive set of strategies covering preprocessing steps, recursive feature elimination for cross-validation (RFECV), sensitivity analysis, SHAP and LIME interpretive techniques, data averaging and downsampling and data standardization and normalization. We adopt a comprehensive strategy covering recursive feature elimination for cross-validation (RFECV), sensitivity analysis, SHAP and LIME interpretation techniques, data averaging, downsampling, and standardization. This aims to improve the quality of the dataset and the predictive performance of the model significantly, ensuring model reliability and accuracy.
Published: 25 January 2025Cornman, A., Spellman, G. & Wright, D. Machine learning for professional tennis match prediction and betting. Working Paper, Stanford University. (2017).

Authors

Xin Zhang

View author publications

Figure 8 presented the statistical results of seven machine learning algorithm models from the perspective of RMSE. RMSE is an important metric for evaluating the prediction accuracy of different machine learning models. It is a commonly used indicator that measures the difference between the model’s predicted values and the actual observed values, reflecting the magnitude of prediction errors. By comparing these seven machine learning algorithms, we find the model with the lowest RMSE, that is, the one with the smallest prediction error. The PSO-SVR algorithm has the best performance in terms of RMSE which is 0.1227; followed by the GA-SVR algorithm, which has an RMSE result of 0.1236; and the RSV-SVR algorithm, which has a relatively low RMSE, which is 0.128. The other four algorithms had higher RMSE values, indicating that their prediction errors were larger and thus, they were not suitable as the optimal models.
Zhikang Lin

View author publications

where ({F}_{m}(x)) is the integral of the predicted values of the basic regression tree. ({h}_{m}(x)) is the ({m}_{th}) regression tree and ({y}_{m}) is the weighting factor of the ({m}_{th}) regression tree. The GBRT is optimized using the steepest descent method, the negative gradient ({z}_{m}(x)) is given by Eq. (11).
Song Gu

View author publications

The accuracy of model predictions is a key indicator of regression model performance, achieved by comparing the prediction accuracy of models built with different machine learning algorithms. Among the seven algorithms, the PSO-SVM algorithm performs the best, with a prediction accuracy of 0.9262, followed by the GA-SVR algorithm with a prediction accuracy of 0.9226. The RSG-SVR algorithm comes in third place, with a relatively high prediction accuracy as well. While LR, SWO-GBR, PSO-GBR, GA-GBR and SWO-DTR algorithms had prediction accuracies of 0.9103, 0.9079, 0.9069, 0.9045 and 0.8861, respectively. Therefore, based on the R² scores, the PSO-SVM model demonstrated the highest accuracy in predicting athlete engagement, meeting the research requirements most effectively.

Contributions

MATH

Google Scholar

Corresponding author

Ahmed, N. Understanding Sports Data Analytics Simplified. Hevodata. Saatavilla: https://hevodata.com/learn/sports-data-analytics/. (2022).

Ethics declarations

Competing interests

The coefficient of determination ({R}^{2}), is used to quantify the extent to which the predictive model can capture variations in outcomes and to determine which model excels in fitting the data. The calculation of R squared is described by Eq. (3). The preferred metric, as illustrated in Eq. (4), is the RMSE, a quantitative measure used to evaluate the effectiveness of regression models. It quantifies the average of the squared differences between the observed and predicted values of the target variable. A lower RMSE value indicates superior model performance, signifying a smaller average deviation between the predicted and actual values. Additionally, as calculated in Eq. (5), the MSE can be included as a complementary metric, which measures the average of the squared errors in a set of predictions. The last metric is the MAE, which quantifies the differences between observed and predicted values, providing a measure of the typical error magnitude. It is calculated according to Eq. (6), where m represents the sample size, ({Y}_i) denotes the actual values, and (widehat{{Y}_i}) represents the forecasted values.

Additional information

Publisher’s note

This study develops and compares several prediction models, including Linear Regression (LR), k-Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), Decision Tree Regression (DTR), Random Forest Regression (RFR) and Support Vector Regression (SVR). To enhance the predictive capabilities of these base models, we employ four efficient optimization algorithms: Random Grid Search (RGS)³¹, Genetic Algorithm (GA)³², Spider Web Optimization (SWO)³³ and Particle Swarm Optimization (PSO)³⁴. Additionally, to ensure the model’s reliability and robustness, we utilize five-fold cross-validation³⁵ to assess the model’s stability. Through these integrated methods, the study aims to evaluate the accuracy, stability, and overall performance of the models in predicting athlete participation.

Rights and permissions

Article
PubMed

Google Scholar

DTR is both an effective classification method and a commonly used technique in regression analysis. As a fundamental machine learning algorithm, it is characterized by strong self-learning capabilities, good model interpretability and high learning efficiency⁵⁰. The structure of a decision tree consists of leaf nodes, branches and a root node. It integrates all variables, subdivides nodes into child nodes, and selects the splitting method that generates the most homogeneous child nodes during this process, with predictions determined by the path leading to the leaf nodes. DTR is not only easy to learn and understand but also directly visualizable. It constitutes one of the most direct methods for revealing key interactions between variables.

About this article

Cite this article

The KNN algorithm is applicable not only to classification tasks but also to regression tasks. In regression tasks, the goal is to predict continuous values rather than discrete labels as in classification tasks. Specifically, for each test sample to be predicted, its predicted value is determined by examining the k nearest neighbors found in the training set and averaging the values of these neighbors.L Yang A Shami 2020 On hyperparameter optimization of machine learning algorithms: Theory and practice J. Neurocomput. 415 295 316

MATH

Google Scholar

RI Andrianova DV Fedoseev 2021 Digital Scouting Technologies in Game Sports EG Popkova VN Ostrovskaya AV Bogoviz Eds Socio-economic Systems: Paradigms for the Future. Studies in Systems, Decision and Control Springer Cham

During the construction process, the decision tree relies on a specific minimization criterion, such as Mean Square Error (MSE) or Gini impurity, to pick the best cut-off feature j and cut-off point s. The following equation represents the use of MSE as the criterion, as shown in Eq. (13).