Updated 1 week ago Research
By Ming Xuan Samuel Tan

Class Recommendation with Archetypes

Archetype‑based class matching study

Executive Summary

  • We investigated the feasibility of building a similarity-based class recommendation system by examining whether users who engage in multiple sports prefer sports similar to their preferred exercise types (archetypes: primary_exercise and secondary_exercises).
  • 78 unique sports from 499 users were collected from the test period. Sports were compared along Structure, Location, Objective, Intensity, and Skill Required to generate an exercise type similarity matrix.
  • 34.9% of users recorded at least one sport from the 10 sports most similar to their primary_exercise. 82.7% of users recorded at least one sport from the 5 sports most similar to their primary_exercise or the 5 sports most similar to their secondary_exercise. Both observed proportions are statistically significant (p-value: less than 0.0005).
  • Users tend to engage in sports similar to their preferred sport. Sport similarity measures can be combined with exercise preference archetypes to design personalized sport and exercise recommendation systems.

Introduction

Sports preferences are shaped by a complex interplay of personal, social, and environmental factors. Previous research highlights that an individual’s history of sports participation plays a crucial role in determining their current and future sports interests. Moreover, the specific characteristics of the sports one has participated in, such as intensity, skill requirements, structure, and setting, also influence their preferences. These factors are further shaped by cultural norms and environmental accessibility.

Recent advancements in wearable technology and smartphone-based fitness tracking have generated an unprecedented volume of granular data on individual exercise habits. Millions of users now regularly log not just their physical activity levels, but also the exact sports or fitness activities they engage in. This offers a unique opportunity to empirically examine patterns in sports selection.

Given that past participation and the characteristics of previous sports strongly influence future preferences, we hypothesize that analyzing users’ logged sports activities can reveal underlying patterns, specifically that users tend to engage in sports with similar attributes. This leads to the potential of leveraging such data to recommend new sports or exercise classes aligned to an individual’s preferences, thereby enhancing engagement and diversification in physical activity.

In this report, we construct a comprehensive sports taxonomy based on the similarity between sports in five key characteristics: Structure, Location, Objective, Intensity, and Skill Required. Using this taxonomy, we identify sports similar to a user’s most preferred sports as captured by the archetypes primary_exercise and secondary_exercises. We then compare this list of similar sports against the user’s exercise records to test the hypothesis that individuals tend to engage in sports sharing similar characteristics.

Methods

Dataset

Exercise sessions logged between October 2024 and January 2025 were compiled. 78 unique sports and exercise types were identified. Seven were excluded as non-specific: other, preparation and recovery, stretching, cooldown, walking, workout, and flexibility. This gave a final set of 71 unique sports.

A total of 12,852 sessions from 499 unique users were logged during the assessment period. To ensure meaningful activity patterns, we kept only users who logged at least 10 exercise sessions during this period, resulting in a subset of 11,805 exercise sessions from 249 users. The mean number of sessions logged per user was 47.4 (SD: 128.6).
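The filtering step can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `(user_id, sport)` tuple layout, the function name `filter_active_users`, and the toy log are hypothetical, and dropping the non-specific sport labels before counting sessions is an assumption, since the report does not state the order of the two steps.

```python
from collections import Counter
from statistics import mean

# The seven non-specific labels listed in the Dataset section.
EXCLUDED = {
    "other", "preparation and recovery", "stretching",
    "cooldown", "walking", "workout", "flexibility",
}
MIN_SESSIONS = 10  # threshold stated in the report


def filter_active_users(sessions, min_sessions=MIN_SESSIONS, excluded=EXCLUDED):
    """Return (kept_sessions, sessions_per_user) for sufficiently active users.

    `sessions` is an iterable of (user_id, sport) pairs (hypothetical layout).
    Assumption: excluded labels are dropped before sessions are counted.
    """
    specific = [(u, s) for u, s in sessions if s not in excluded]
    counts = Counter(u for u, _ in specific)
    active = {u for u, n in counts.items() if n >= min_sessions}
    kept = [(u, s) for u, s in specific if u in active]
    return kept, {u: counts[u] for u in active}


# Hypothetical toy log, not data from the study.
toy_sessions = (
    [("u1", "running")] * 12
    + [("u2", "yoga")] * 3
    + [("u3", "tennis")] * 10
    + [("u3", "walking")] * 5
)
kept, per_user = filter_active_users(toy_sessions)
print(len(kept), sorted(per_user), mean(per_user.values()))
```

In the toy log, u2 falls below the threshold and u3's walking sessions are discarded as non-specific, so only u1 and u3 remain.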

Sport Similarity Scoring

Sports were compared across 5 dimensions, and similarity within each dimension was scored on a scale of 0 to 1:

  • Structure: Similarity in format and rules.
  • Location: Whether they are typically performed indoors or outdoors, in specific venues, etc.
  • Objective: Competitive vs recreational, individual vs team-based.
  • Intensity: Physical exertion level required.
  • Skill required: Technical skill level or learning curve.

Scores across all dimensions were averaged to yield an overall similarity score. We then performed hierarchical clustering with Ward’s linkage on the sports to generate a dendrogram to aid visualization.
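The averaging step can be illustrated with a short sketch. The dimension keys, function names, and per-dimension scores below are hypothetical placeholders rather than values from the study, and an unweighted mean is assumed because the report does not mention weights. The Ward clustering step is not shown here.

```python
# Dimension keys are illustrative spellings of the five dimensions above.
DIMENSIONS = ("structure", "location", "objective", "intensity", "skill_required")


def overall_similarity(dim_scores):
    """Unweighted mean of the five per-dimension scores, each in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in dim_scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= dim_scores[d] <= 1.0:
            raise ValueError(f"{d} score must lie in [0, 1]")
    return sum(dim_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def build_similarity_matrix(pair_scores):
    """Build a symmetric nested-dict matrix; a sport is fully similar to itself."""
    sim = {}
    for (a, b), dim_scores in pair_scores.items():
        score = overall_similarity(dim_scores)
        sim.setdefault(a, {a: 1.0})[b] = score
        sim.setdefault(b, {b: 1.0})[a] = score
    return sim


# Hypothetical scores for two sport pairs, for illustration only.
pair_scores = {
    ("running", "cycling"): {
        "structure": 0.8, "location": 0.9, "objective": 0.8,
        "intensity": 0.7, "skill_required": 0.6,
    },
    ("running", "yoga"): {
        "structure": 0.2, "location": 0.1, "objective": 0.3,
        "intensity": 0.2, "skill_required": 0.4,
    },
}
sim = build_similarity_matrix(pair_scores)
print(round(sim["running"]["cycling"], 2), round(sim["running"]["yoga"], 2))
```

A nested dictionary keeps the lookup `sim[a][b]` readable for a few dozen sports; a dense array would serve equally well for the 71 sports in the final set.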

Hypothesis Testing

This similarity score was used to identify sports most similar to users’ preferred sport as captured in archetypes primary_exercise and secondary_exercise. Two methods for identification were tested:

  • Method 1: Using the 10 sports most similar to a user’s primary_exercise.
  • Method 2: Using the 5 sports most similar to primary_exercise and 5 sports most similar to secondary_exercise.

For both methods, we calculated the proportion of users whose list of recorded sports contained at least one overlap with the list of similar sports identified.
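The two identification methods and the overlap proportion can be sketched as below. Everything in the block is illustrative: the similarity values, the user records, and the function names are hypothetical. Two details are assumptions the report does not settle: each user is treated as having a single secondary exercise, and the archetype sport itself is never counted as one of its own similar sports.

```python
def symmetric(pairs):
    """Expand {(a, b): score} into a symmetric nested dict."""
    sim = {}
    for (a, b), score in pairs.items():
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def top_k_similar(sim, sport, k):
    """The k sports most similar to `sport`, excluding the sport itself."""
    others = [(s, score) for s, score in sim[sport].items() if s != sport]
    others.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by name
    return [s for s, _ in others[:k]]


def similar_list(sim, user, method, k):
    """Method 1: top k of primary. Method 2: top k of primary plus top k of secondary."""
    similar = set(top_k_similar(sim, user["primary"], k))
    if method == 2:
        similar |= set(top_k_similar(sim, user["secondary"], k))
    elif method != 1:
        raise ValueError("method must be 1 or 2")
    return similar


def overlap_proportion(sim, users, method, k):
    """Share of users who logged at least one sport from their similar list."""
    hits = sum(1 for u in users if similar_list(sim, u, method, k) & u["logged"])
    return hits / len(users)


# Hypothetical toy data; the report uses k=10 (Method 1) and k=5 (Method 2).
sim = symmetric({
    ("running", "cycling"): 0.8, ("running", "yoga"): 0.2,
    ("running", "pilates"): 0.3, ("cycling", "yoga"): 0.1,
    ("cycling", "pilates"): 0.2, ("yoga", "pilates"): 0.9,
})
users = [
    {"primary": "running", "secondary": "yoga", "logged": {"running", "cycling"}},
    {"primary": "yoga", "secondary": "running", "logged": {"yoga", "cycling"}},
]
m1 = overlap_proportion(sim, users, method=1, k=1)
m2 = overlap_proportion(sim, users, method=2, k=1)
print(m1, m2)
```

In the toy data the second user logs cycling, which is close to their secondary exercise but not their primary one, so they count as a hit only under Method 2.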

Permutation testing was performed to assess whether the observed overlap between recorded sports and similar sports was statistically significant. 5,000 iterations were performed. In each iteration, sports were randomly selected without replacement for each user to simulate the null hypothesis (H0: Users engage in sports independently of sport similarity). A final p-value was calculated as the proportion of iterations where overlap was greater than or equal to the observed overlap.
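A minimal sketch of this permutation procedure follows. It assumes each user's similar-sport list is replaced by a random list of the same size drawn without replacement from the full set of sports; the function name, arguments, and toy data are hypothetical, and the report does not say whether the user's own archetype sport was kept in the sampling pool.

```python
import random


def permutation_p_value(logged_sets, all_sports, list_size, observed,
                        n_iter=5000, seed=0):
    """Estimate P(null overlap proportion >= observed) by random resampling.

    logged_sets: one set of recorded sports per user.
    list_size:   number of sports drawn per user in each iteration.
    observed:    overlap proportion obtained with the similarity-based lists.
    """
    rng = random.Random(seed)
    pool = sorted(all_sports)  # fixed order keeps runs reproducible
    at_least_as_extreme = 0
    for _ in range(n_iter):
        hits = 0
        for logged in logged_sets:
            # Random list drawn without replacement, ignoring similarity.
            random_list = set(rng.sample(pool, list_size))
            if random_list & logged:
                hits += 1
        if hits / len(logged_sets) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_iter


# Hypothetical toy setup: 20 sports, 30 users who each logged two sports.
all_sports = [f"sport_{i}" for i in range(20)]
toy_rng = random.Random(1)
logged_sets = [set(toy_rng.sample(all_sports, 2)) for _ in range(30)]
p = permutation_p_value(logged_sets, all_sports, list_size=5,
                        observed=0.9, n_iter=500)
print(p)
```

When no iteration reaches the observed proportion the estimate is exactly 0, which is normally reported as an upper bound rather than as zero, as the report does.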

Results

Sport Taxonomy

Using the similarity scores obtained from the sport similarity scoring, we constructed a taxonomy of sports observed in the logs. This taxonomy can be represented as a dendrogram.

Figure 1: Dendrogram of sports encountered in data logs

Permutation Test Results

We observed that 87 users (34.9%) logged at least one sport from the list of similar sports identified using Method 1. In the permutation test, no iteration achieved a higher proportion than the observed proportion (p-value: less than 0.0005).

206 users (82.7%) logged at least one sport from the list of similar sports identified using Method 2. In the permutation test, no iteration achieved a higher proportion than the observed proportion (p-value: less than 0.0005).

Under both methods, the observed proportion of users who engaged in at least one similar sport is statistically significant.

Figure 2: Density plot of proportions

Density plot of the proportions obtained with randomly selected sports (blue solid line) and the observed proportion when similarity scores are used (red dotted line). The observed proportion far exceeds the expected proportion under the null hypothesis in both Method 1 and Method 2.

Discussion

In this study, we examined the hypothesis that users tend to engage in sports sharing similar characteristics. We constructed a taxonomy of sports encountered in our logs and identified sports similar to users’ preferred sports as captured in archetypes primary_exercise and secondary_exercises.

We tested two methods of identifying similar sports. We found that 87 users (34.9%) logged at least one sport from the list of similar sports identified using Method 1 and 206 users (82.7%) logged at least one sport from the list identified using Method 2. Both outcomes are statistically significant (p-value: less than 0.0005).

This supports our hypothesis that users tend to engage in sports sharing similar characteristics and that a user’s preferred sport as captured in archetypes can be used to predict other sports they might engage in or be interested in engaging in.

These findings suggest that a recommendation system using sports similarity is viable. Such a system could suggest new class offerings to users based on their engagement history, potentially improving engagement, user retention, and satisfaction.

Future work could combine sports similarity with customer usage patterns to further refine recommendations.

Conclusion

This study supports our hypothesis that users tend to engage in sports similar to their preferred sports, so sport similarity carries predictive signal about the sports a user engages in. The findings can be combined with Sahha archetypes to build recommendation systems that offer new sports and exercise classes to users.

Key Takeaways

  • When primary and secondary exercise archetypes are combined (Method 2), 82.7% of users logged at least one sport from their list of similar sports. This is an overlap rate, not a prediction accuracy.
  • Five dimensions define sport similarity: Structure, Location, Objective, Intensity, and Skill Required.
  • Personalized fitness class recommendations based on user archetypes can improve engagement and retention.
  • Results are statistically significant (p less than 0.0005) across both testing methods.