Klim Kostyuk
| Comparison | Before filtering | After filtering |
|---|---|---|
| 1 | 4782 DEGs (17.25%) | 1229 DEGs (4.43%) |
| 2 | 6706 DEGs (24.19%) | 1225 DEGs (4.42%) |
| 3 | 6413 DEGs (23.14%) | 2176 DEGs (7.85%) |
| 4 | 6619 DEGs (23.88%) | 4007 DEGs (14.46%) |
| 5 | 4949 DEGs (17.85%) | 161 DEGs (0.58%) |
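The slides do not state the filtering criteria; below is a minimal sketch of a typical DEG filter, assuming a DESeq2-style results frame with `log2FoldChange` and `padj` columns and the common |log2FC| >= 1, padj < 0.05 cutoffs (both thresholds are assumptions):

```python
import pandas as pd

def filter_degs(res: pd.DataFrame,
                lfc_threshold: float = 1.0,     # assumed effect-size cutoff
                padj_threshold: float = 0.05    # assumed significance cutoff
                ) -> pd.DataFrame:
    """Keep genes passing both the effect-size and significance cutoffs."""
    mask = (res['log2FoldChange'].abs() >= lfc_threshold) \
        & (res['padj'] < padj_threshold)
    return res[mask]
```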
* The subset of differentially expressed genes gives the best clustering of the PAM50 subtypes.
* The first two PCA components explain only ~16% of the variance.
* t-SNE distinguishes the Luminal A and Luminal B subtypes better.
* Running PCA first (with `n_components` chosen to explain 80% of the variance) and applying t-SNE to the result gave a more clearly separated representation of the subtypes (see the sketch below).
* UMAP captures global structure better than t-SNE.
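A minimal sketch of the PCA-then-t-SNE step: the 80% variance threshold comes from the slide, while the variable names and t-SNE settings are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# With 0 < n_components < 1, scikit-learn keeps as many principal
# components as needed to explain that fraction of the variance.
pca = PCA(n_components=0.80, random_state=42)
X_pca = pca.fit_transform(X)  # X: samples x genes expression matrix (assumed)

# Embed the PCA-reduced data in 2D (perplexity is an assumption).
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
```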
With all genes as features:
| Classifier | Accuracy | Precision | Recall | F1-score | ROC AUC |
|---|---|---|---|---|---|
| Gradient Boosting | 0.830 | 0.789 | 0.830 | 0.809 | 0.900 |
| Random Forest | 0.805 | 0.764 | 0.805 | 0.778 | 0.931 |
| Logistic Regression | 0.774 | 0.767 | 0.774 | 0.765 | 0.926 |
| SVM | 0.767 | 0.731 | 0.767 | 0.744 | 0.920 |
| Neural Network | 0.761 | 0.726 | 0.761 | 0.743 | 0.855 |
| AdaBoost | 0.704 | 0.702 | 0.704 | 0.700 | 0.889 |
| Naive Bayes | 0.692 | 0.673 | 0.692 | 0.680 | 0.773 |
| KNN | 0.623 | 0.639 | 0.623 | 0.548 | 0.771 |
With only DEGs as features:
| Classifier | Accuracy | Precision | Recall | F1-score | ROC AUC |
|---|---|---|---|---|---|
| Gradient Boosting | 0.818 | 0.779 | 0.818 | 0.797 | 0.916 |
| Random Forest | 0.805 | 0.764 | 0.805 | 0.780 | 0.938 |
| SVM | 0.792 | 0.758 | 0.792 | 0.772 | 0.921 |
| Neural Network | 0.774 | 0.737 | 0.774 | 0.751 | 0.865 |
| Logistic Regression | 0.761 | 0.724 | 0.761 | 0.741 | 0.909 |
| Naive Bayes | 0.748 | 0.746 | 0.748 | 0.739 | 0.823 |
| AdaBoost | 0.730 | 0.694 | 0.730 | 0.711 | 0.850 |
| KNN | 0.660 | 0.642 | 0.660 | 0.611 | 0.827 |
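The metrics in both tables look like support-weighted averages; a minimal sketch of how such a comparison loop might be computed, looping over a classifier dict like the `classifiers_degs` defined below (the weighting and one-vs-rest choices, and the `X_test`/`y_test` names, are assumptions):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

results = {}
for name, clf in classifiers_degs.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred,
                                     average='weighted', zero_division=0),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1-score': f1_score(y_test, y_pred, average='weighted'),
        # Multiclass ROC AUC: one-vs-rest, weighted by class support.
        'ROC AUC': roc_auc_score(y_test, y_proba,
                                 multi_class='ovr', average='weighted'),
    }
```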
```python
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

classifiers_degs = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    'Gradient Boosting': HistGradientBoostingClassifier(random_state=42),
}
```
Gradient Boosting:
```python
from sklearn.model_selection import GridSearchCV

param_space_gb = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_iter': [100, 200, 300],
    'max_leaf_nodes': [21, 31, 41],
}
grid_search_gb = GridSearchCV(
    classifiers_degs['Gradient Boosting'],
    param_space_gb,
    n_jobs=-1,
    scoring='f1_weighted',
    verbose=1,
)
grid_search_gb.fit(X_train, y_train)
print(grid_search_gb.best_params_)
print(grid_search_gb.best_score_)
```
Random Forest:
```python
param_space_rf = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 10, 20, 30],
}
grid_search_rf = GridSearchCV(
    classifiers_degs['Random Forest'],
    param_space_rf,
    n_jobs=-1,
    scoring='f1_weighted',
    verbose=1,
)
grid_search_rf.fit(X_train, y_train)
print(grid_search_rf.best_params_)
print(grid_search_rf.best_score_)
```
SVM:
```python
param_space_svm = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}
grid_search_svm = GridSearchCV(
    classifiers_degs['SVM'],
    param_space_svm,
    n_jobs=-1,
    scoring='f1_weighted',
    verbose=1,
)
grid_search_svm.fit(X_train, y_train)
print(grid_search_svm.best_params_)
print(grid_search_svm.best_score_)
```
```python
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tqdm import tqdm

# Candidate models with the best hyperparameters found above.
classifiers = {
    'rf1': RandomForestClassifier(
        max_depth=None, max_features=None, n_estimators=100),
    'rf2': RandomForestClassifier(
        max_depth=None, max_features=None, n_estimators=300),
    'gb': HistGradientBoostingClassifier(
        learning_rate=0.1, max_iter=200, max_leaf_nodes=31),
    'svm': make_pipeline(StandardScaler(), SVC(
        C=0.1, gamma='scale', kernel='linear')),
}

rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# Compare the candidates with repeated 5-fold cross-validation.
scores = {}
for name, clf in (pbar := tqdm(classifiers.items())):
    pbar.set_description(name)
    scores[name] = cross_val_score(
        clf, X_train, y_train, cv=rkf,
        n_jobs=-1, verbose=2,
        scoring='f1_weighted')
```
We choose the Random Forest with 100 estimators: its performance is not significantly worse than that of the 300-estimator forest or of Gradient Boosting, and it is less computationally expensive (a possible significance check is sketched below).
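The slides do not say how significance was assessed; one common option for paired cross-validation scores is a Wilcoxon signed-rank test (a sketch, not necessarily the author's method). Scores from repeated k-fold are not fully independent, so the p-values are only indicative:

```python
from scipy.stats import wilcoxon

# Fold-by-fold paired comparison of the weighted-F1 scores collected above;
# pairing is valid because every model was evaluated on the same rkf splits.
for other in ('rf2', 'gb', 'svm'):
    stat, p_value = wilcoxon(scores['rf1'], scores[other])
    print(f'rf1 vs {other}: p = {p_value:.3f}')
```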
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Basal | 0.92 | 0.92 | 0.92 | 25 |
| Her2 | 0.67 | 0.75 | 0.71 | 8 |
| LumA | 0.84 | 0.93 | 0.88 | 68 |
| LumB | 0.90 | 0.84 | 0.87 | 51 |
| Normal | 0.50 | 0.14 | 0.22 | 7 |
| accuracy | | | 0.86 | 159 |
| macro avg | 0.76 | 0.72 | 0.72 | 159 |
| weighted avg | 0.85 | 0.86 | 0.85 | 159 |
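The table above matches the layout of scikit-learn's `classification_report`; a minimal sketch, assuming the chosen 100-estimator Random Forest and a held-out `X_test`/`y_test` split:

```python
from sklearn.metrics import classification_report

clf = classifiers['rf1'].fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```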
* The impact of correlated features was reduced by clustering the features with Spearman correlation (choosing the number of clusters by silhouette score) and retaining only a core feature from each cluster; a sketch of this step follows.
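A minimal sketch of the decorrelation step. The distance definition, the cluster-count search range, and the medoid rule for picking the "core" feature are all assumptions; only the use of Spearman correlation and silhouette scores comes from the slide:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.metrics import silhouette_score

# Distance between features: 1 - |Spearman correlation|.
corr, _ = spearmanr(X_train)             # (n_features, n_features)
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0)
Z = linkage(squareform(dist, checks=False), method='average')

# Choose the number of clusters by silhouette score (search range assumed).
best_k = max(range(2, 51), key=lambda k: silhouette_score(
    dist, fcluster(Z, k, criterion='maxclust'), metric='precomputed'))
labels = fcluster(Z, best_k, criterion='maxclust')

# "Core" feature per cluster: the medoid, i.e. the feature with the
# smallest mean distance to the other members of its cluster.
core_idx = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    core_idx.append(idx[np.argmin(dist[np.ix_(idx, idx)].mean(axis=1))])
```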
**TPX2**, **CENPA**, **ESR1**, **KRT14**, **SGOL1**, and **KRT5** were ranked among the top 10 features in at least two feature importance evaluation techniques.
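One way the "top 10 in at least two techniques" consensus might be computed; the two importance methods shown (impurity-based and permutation importance) are assumptions, as are the `clf` (fitted Random Forest) and `gene_names` (index-to-symbol mapping) variables:

```python
import numpy as np
from sklearn.inspection import permutation_importance

rankings = {
    'impurity': np.argsort(clf.feature_importances_)[::-1][:10],
    'permutation': np.argsort(permutation_importance(
        clf, X_test, y_test, n_repeats=10, random_state=42,
        scoring='f1_weighted').importances_mean)[::-1][:10],
}
# Genes appearing in the top 10 of every (here: both) technique(s).
top_sets = [set(gene_names[i] for i in idx) for idx in rankings.values()]
consensus = set.intersection(*top_sets)
```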
Luminal vs other:
**KRT5**, **ESR1**, **TPX2**, **KRT14**, **MLPH**, **SGOL1**, and **NEIL3** were ranked among the top 10 features in at least two feature importance evaluation techniques.

Luminal A vs B:
Only **ESR1** was ranked among the top 10 features in at least two feature importance evaluation techniques.
Figure: ESR1 and HER2 expression in various breast carcinoma subtypes defined by the PAM50 assay (Tishchenko et al., 2016).
**MLPH**, **FOXA1**, **ESR1**, **TTC6**, and **HJURP** were ranked among the top 10 features in at least two feature importance evaluation techniques.
Figure: MLPH and FOXA1 expression in various breast carcinoma subtypes defined by the PAM50 assay (Tishchenko et al., 2016).
**ESR1**, **TPX2**, **KRT14**, and **KIF20A** were ranked among the top 10 features in at least two feature importance evaluation techniques.
Genes already being used in the clinic:
Genes being investigated as potential biomarkers:
SFRP1 (Wu et al., 2020; Baharudin et al., 2020),
EGFR (Kjær et al., 2020),
LINC00504 (Hou et al., 2021),
STARD3 (Lodi et al., 2023; Korucu and Inandiklioglu, 2024).