Lecture 6 - Cross-Validation#

Learning goals#

  • Use KFold and cross_val_score for fair evaluation.

  • Tune hyperparameters with GridSearchCV.

  • Explain predictions using feature importance and permutation importance.



1. Dataset and setup#

Why this dataset?

We will reuse the same table from Lecture 5. It includes SMILES plus targets such as Melting Point and Toxicity. We compute four lightweight descriptors that we already trust: MolWt, LogP, TPSA, NumRings. They are fast to compute, easy to interpret, and good enough to demonstrate model selection and cross‑validation.

1.1 Imports and environment#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# RDKit import, with a fallback install (e.g., on Colab)
try:
    from rdkit import Chem
    from rdkit.Chem import Draw, Descriptors, Crippen, rdMolDescriptors, AllChem
except Exception:
    try:
        %pip install rdkit
        from rdkit import Chem
        from rdkit.Chem import Draw, Descriptors, Crippen, rdMolDescriptors, AllChem
    except Exception:
        print("RDKit is not available in this environment. Drawing and descriptors will be skipped.")
        Chem = None


from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV, validation_curve, learning_curve
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression, lasso_path
from sklearn.svm import SVR, SVC
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

1.2 Load data and build descriptors#

Data source

We will use this C-H oxidation dataset we saw before:
https://raw.githubusercontent.com/zzhenglab/ai4chem/main/book/_data/C_H_oxidation_dataset.csv

url = "https://raw.githubusercontent.com/zzhenglab/ai4chem/main/book/_data/C_H_oxidation_dataset.csv"
df_raw = pd.read_csv(url)

def calc_descriptors_row(smiles):
    m = Chem.MolFromSmiles(smiles)
    if m is None:
        return pd.Series({"MolWt": np.nan, "LogP": np.nan, "TPSA": np.nan, "NumRings": np.nan})
    return pd.Series({
        "MolWt": Descriptors.MolWt(m),
        "LogP": Crippen.MolLogP(m),
        "TPSA": rdMolDescriptors.CalcTPSA(m),
        "NumRings": rdMolDescriptors.CalcNumRings(m),
    })


desc_df = df_raw["SMILES"].apply(calc_descriptors_row)
df_reg_mp = pd.concat([df_raw, desc_df], axis=1)
df_reg_mp.head()
| | Compound Name | CAS | SMILES | Solubility_mol_per_L | pKa | Toxicity | Melting Point | Reactivity | Oxidation Site | MolWt | LogP | TPSA | NumRings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3,4-dihydro-1H-isochromene | 493-05-0 | c1ccc2c(c1)CCOC2 | 0.103906 | 5.80 | non_toxic | 65.8 | 1 | 8,10 | 134.178 | 1.7593 | 9.23 | 2.0 |
| 1 | 9H-fluorene | 86-73-7 | c1ccc2c(c1)Cc1ccccc1-2 | 0.010460 | 5.82 | toxic | 90.0 | 1 | 7 | 166.223 | 3.2578 | 0.00 | 3.0 |
| 2 | 1,2,3,4-tetrahydronaphthalene | 119-64-2 | c1ccc2c(c1)CCCC2 | 0.020589 | 5.74 | toxic | 69.4 | 1 | 7,10 | 132.206 | 2.5654 | 0.00 | 2.0 |
| 3 | ethylbenzene | 100-41-4 | CCc1ccccc1 | 0.048107 | 5.87 | non_toxic | 65.0 | 1 | 1,2 | 106.168 | 2.2490 | 0.00 | 1.0 |
| 4 | cyclohexene | 110-83-8 | C1=CCCCC1 | 0.060688 | 5.66 | non_toxic | 96.4 | 1 | 3,6 | 82.146 | 2.1166 | 0.00 | 1.0 |

What these features mean

  • MolWt: molecular weight (sum of atomic masses).

  • LogP: hydrophobicity descriptor (octanol/water partition).

  • TPSA: topological polar surface area.

  • NumRings: ring count.
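As a quick sanity check, we can compute these four descriptors for a single familiar molecule. A minimal sketch, assuming RDKit is available; toluene is chosen purely for illustration:

m = Chem.MolFromSmiles("Cc1ccccc1")  # toluene
print("MolWt:   ", Descriptors.MolWt(m))              # sum of atomic masses
print("LogP:    ", Crippen.MolLogP(m))                # positive = hydrophobic
print("TPSA:    ", rdMolDescriptors.CalcTPSA(m))      # 0.0, no polar atoms
print("NumRings:", rdMolDescriptors.CalcNumRings(m))  # one aromatic ring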


2. Quick Exploratory Data Analysis (EDA)#

2.1 Descriptor histograms#

Goal

Check scale, spread, and any extreme values. If a feature is highly skewed, consider transforms or standardization before training.

features = ["MolWt", "LogP", "TPSA", "NumRings"]
ax = df_reg_mp[features].hist(bins=20, figsize=(5,4))
plt.tight_layout()
plt.show()
[Figure: histograms of the four descriptors (MolWt, LogP, TPSA, NumRings)]
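If a descriptor is strongly skewed or on a very different scale from the others, one option is to standardize it inside a Pipeline, so the scaler is fit on training data only and applied consistently at prediction time. A minimal sketch using the classes imported above (the alpha value here is just a placeholder):

# Scale the descriptors, then fit Lasso, as one estimator
demo = df_reg_mp[features + ["Melting Point"]].dropna()
scaled_lasso = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=1.0, max_iter=5000)),
])
scaled_lasso.fit(demo[features], demo["Melting Point"])
print(scaled_lasso.named_steps["lasso"].coef_)  # coefficients on the scaled features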

2.2 Pairwise relationships#

pd.plotting.scatter_matrix(df_reg_mp[features], figsize=(5,5))
plt.suptitle("Descriptor scatter matrix", y=1.02)
plt.show()
[Figure: descriptor scatter matrix]

As you can see above, a pairwise plot (also called a scatterplot matrix) shows the relationship between every possible pair of features. For example, with four descriptors (MolWt, LogP, TPSA, and NumRings) the matrix draws 12 off-diagonal scatterplots, while the 4 diagonal panels show each feature's own distribution.

Why look at pairwise plots?

Spot correlations: if two descriptors are strongly correlated (for example, heavier molecules often have higher LogP), you can see it at a glance. Highly correlated pairs may be redundant.

Detect outliers: points that sit far from the general cloud of data stand out.

2.3 Redundancy demo and correlation#

To illustrate what redundancy looks like, the following example adds a synthetic feature called MolWt_half, created simply by multiplying molecular weight by 0.5. In the scatter matrix, MolWt_half and MolWt form a perfect straight-line relationship, highlighting the redundancy.

In real-world datasets, you might encounter similar situations where two measured properties track one another closely because they arise from the same underlying physical or chemical trend. In such cases, redundancy reduces the effective dimensionality of the dataset.

This visual impression can be confirmed by calculating the correlation coefficient numerically. A correlation near 1.0 (or –1.0) indicates strong redundancy, just like we see between MolWt and MolWt_half.

# Add demo redundant feature
df_reg_mp["MolWt_half"] = df_reg_mp["MolWt"] * 0.5

# Choose features to visualize
features = ["MolWt", "LogP", "TPSA", "NumRings", "MolWt_half"]

# Numerical correlation matrix
print("\nCorrelation Matrix (rounded to 3 decimals):")
corr_matrix = df_reg_mp[features].corr().round(3)
print(corr_matrix)

# Scatter matrix
pd.plotting.scatter_matrix(df_reg_mp[features], figsize=(6,6), diagonal="hist")
plt.suptitle("Descriptor Scatter Matrix", y=1.02)
plt.tight_layout()
plt.show()
Correlation Matrix (rounded to 3 decimals):
            MolWt   LogP   TPSA  NumRings  MolWt_half
MolWt       1.000  0.595  0.447     0.759       1.000
LogP        0.595  1.000 -0.268     0.605       0.595
TPSA        0.447 -0.268  1.000     0.294       0.447
NumRings    0.759  0.605  0.294     1.000       0.759
MolWt_half  1.000  0.595  0.447     0.759       1.000
[Figure: descriptor scatter matrix including the redundant MolWt_half feature]
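Once redundancy is confirmed numerically, a simple remedy is to drop one feature from each highly correlated pair. A minimal sketch, assuming an (arbitrary) absolute-correlation threshold of 0.95:

# Keep only the upper triangle so each pair is checked once
corr_abs = df_reg_mp[features].corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print("Redundant features to drop:", to_drop)  # expect ['MolWt_half']
features_kept = [f for f in features if f not in to_drop]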

3. Introduction to hyperparameters#

Lasso refresher

Lasso solves a linear regression with an L1 penalty. The loss (on a dataset of size n) is

$$\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 + \alpha \sum_j |w_j|$$

When $\alpha$ is larger, more coefficients are driven to exactly zero.
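To see this shrinkage directly, we can fit Lasso at a small and a large alpha and compare the learned coefficients. A minimal sketch using the descriptor columns built in Section 1:

# Coefficients move toward zero as the L1 penalty grows
X_demo = df_reg_mp[["MolWt", "LogP", "TPSA", "NumRings"]]
y_demo = df_reg_mp["Melting Point"]
for a in [0.1, 100]:
    w = Lasso(alpha=a, max_iter=5000).fit(X_demo, y_demo).coef_
    print(f"alpha={a:>5}: coef={np.round(w, 3)}")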

3.1 Single split evaluation#

When we fit a model, we often think about weights (w) that are learned from the data. But some knobs are not learned automatically — they are chosen by us. These are called hyperparameters. One of the most important hyperparameters in Lasso is alpha. As you may recall from last lecture, it controls the strength of the L1 penalty.

# Define X (features) and y (target)
X = df_reg_mp[["MolWt", "LogP", "TPSA", "NumRings"]]
y = df_reg_mp["Melting Point"]

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Define models in a dictionary
models = {
    "Linear Regression": LinearRegression(),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1, max_iter=500),
    "Lasso (alpha=1)":   Lasso(alpha=1,   max_iter=500),
    "Lasso (alpha=10)":  Lasso(alpha=10,  max_iter=500),
    "Lasso (alpha=100)": Lasso(alpha=100, max_iter=500),
    "Lasso (alpha=500)": Lasso(alpha=500, max_iter=500)
}

# Collect results
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse_val = mean_squared_error(y_test, y_pred)
    mae_val = mean_absolute_error(y_test, y_pred)
    r2_val  = r2_score(y_test, y_pred)

    results.append([name, mse_val, mae_val, r2_val])

# Put results in a nice DataFrame
df_results = pd.DataFrame(results, columns=["Model", "MSE", "MAE", "R2"])
df_results
| | Model | MSE | MAE | R2 |
|---|---|---|---|---|
| 0 | Linear Regression | 398.206541 | 16.006228 | 0.816101 |
| 1 | Lasso (alpha=0.1) | 396.546775 | 15.975604 | 0.816868 |
| 2 | Lasso (alpha=1) | 389.603474 | 15.816549 | 0.820074 |
| 3 | Lasso (alpha=10) | 390.347484 | 15.855184 | 0.819731 |
| 4 | Lasso (alpha=100) | 399.273984 | 16.080242 | 0.815608 |
| 5 | Lasso (alpha=500) | 423.683879 | 16.757040 | 0.804335 |

3.2 Effect of alpha (plot)#

# Extract results for plotting
lasso_results = df_results[df_results["Model"].str.contains("Lasso")]
alphas = [0.1, 1, 10, 100, 500]
r2_scores = lasso_results["R2"].values

# Linear Regression baseline
linear_r2 = df_results.loc[df_results["Model"] == "Linear Regression", "R2"].values[0]

# Plot R² vs alpha
plt.figure(figsize=(7,5))
plt.semilogx(alphas, r2_scores, marker="o", label="Lasso R²")
plt.axhline(linear_r2, color="k", linestyle="--", label=f"Linear Regression R² = {linear_r2:.3f}")

plt.xlabel("Alpha (log scale)")
plt.ylabel("R² Score")
plt.title("Effect of Alpha on Lasso Performance")
plt.legend()
plt.grid(True, which="both", linestyle="--", alpha=0.6)
plt.show()
[Figure: R² vs alpha (log scale) for Lasso, with the Linear Regression baseline]

4. Cross validation#

So far we evaluated models with a single 80/20 split, which can give performance estimates that depend strongly on the random split. To obtain a more reliable measure, we can use 4-fold cross-validation: the dataset is split into 4 equal parts, and each part is used once as the test set while the other 3 parts are used for training. The metrics are then averaged across the 4 folds.

Why CV?

A single split can overestimate or underestimate performance. K-fold CV repeats the split several times and averages the metric, giving a more stable estimate.
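To make the fold mechanics concrete, the sketch below prints how KFold partitions the rows; cross_val_score automates exactly this loop plus the scoring:

# Each row lands in the test set exactly once across the 4 folds
kf_demo = KFold(n_splits=4, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(kf_demo.split(X)):
    print(f"Fold {i}: {len(train_idx)} train rows, {len(test_idx)} test rows")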

alphas = [0.1, 1, 10, 100]
kf = KFold(n_splits=4, shuffle=True, random_state=1)

for a in alphas:
    model = Lasso(alpha=a, max_iter=500)
    scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
    print(f"alpha={a}, Scores on each fold: {scores} \n mean R²={scores.mean():.5f} \n")
alpha=0.1, Scores on each fold: [0.80086458 0.8800675  0.85024608 0.81400961] 
 mean R²=0.83630 

alpha=1, Scores on each fold: [0.8012391  0.88025613 0.85027326 0.81361778] 
 mean R²=0.83635 

alpha=10, Scores on each fold: [0.80123962 0.88012751 0.85111665 0.81276335] 
 mean R²=0.83631 

alpha=100, Scores on each fold: [0.79943866 0.87767314 0.85590684 0.80952064] 
 mean R²=0.83563 

Exercise 4.1

Change the alphas to [0.05, 0.5, 1, 2, 5, 15, 30, 100].
Use 5-fold CV and the neg_mean_squared_error scorer.
Print the per-fold values and the mean MSE.


5. Model selection, evaluation and prediction#

Plan

Pick an alpha using CV, fit on train, evaluate on held‑out test, then apply to new molecules.

from sklearn.model_selection import KFold, cross_validate

# Define X and y again
X = df_reg_mp[["MolWt", "LogP", "TPSA", "NumRings"]]
y = df_reg_mp["Melting Point"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Use 4-fold cross validation
kf = KFold(n_splits=4, shuffle=True, random_state=1)

# Same models as before
models = {
    "Linear Regression": LinearRegression(),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1, max_iter=500),
    "Lasso (alpha=1)":   Lasso(alpha=1,   max_iter=500),
    "Lasso (alpha=10)":  Lasso(alpha=10,  max_iter=500),
    "Lasso (alpha=30)": Lasso(alpha=30, max_iter=500),
    "Lasso (alpha=100)": Lasso(alpha=100, max_iter=500)
}

results_cv = []

for name, model in models.items():
    cv_results = cross_validate(
        model, X_train, y_train,
        cv=kf,
        scoring=["r2", "neg_mean_squared_error", "neg_mean_absolute_error"],
        return_train_score=False
    )

    # Compute mean and std across folds
    mean_r2 = cv_results["test_r2"].mean()
    mean_mse = -cv_results["test_neg_mean_squared_error"].mean()
    mean_mae = -cv_results["test_neg_mean_absolute_error"].mean()

    results_cv.append([name, mean_mse, mean_mae, mean_r2])

# Put results in a DataFrame
df_results_cv = pd.DataFrame(results_cv, columns=["Model", "MSE", "MAE", "R2"])
print(df_results_cv)
               Model         MSE        MAE        R2
0  Linear Regression  410.132745  15.969645  0.832197
1  Lasso (alpha=0.1)  409.638038  15.958751  0.832326
2    Lasso (alpha=1)  408.902710  15.918428  0.832658
3   Lasso (alpha=10)  408.189050  15.900646  0.833132
4   Lasso (alpha=30)  408.099029  15.906188  0.833208
5  Lasso (alpha=100)  408.969334  15.961744  0.832932

Based on the table above, alpha = 30 works best for this case. Let’s assign this value, build the model, fit it to the training data, and evaluate the outcome, which takes only three lines.

alpha_optimized_model = Lasso(alpha=30, max_iter=500)
fitted_lasso = alpha_optimized_model.fit(X_train, y_train)

print(f"Test R² = {r2_score(y_test, fitted_lasso.predict(X_test)):.3f} when we use alpha = 30")
Test R² = 0.818 when we use alpha = 30

Test R² = 0.818 tells us how good the fit is. Note this R² is calculated on the final hold-out test set, so it differs from the mean R² we saw during CV (0.833). The fact that the two values are close is a good sign: the model has never seen any datapoint from X_test before, yet it still predicts well.

Now we can use this fitted model to predict points that appear in neither the train nor the test set. For example, we can predict the melting points of two molecules by listing their descriptors.

Predict new points

We can predict from raw descriptors, or compute descriptors from SMILES first (see example below).

fitted_lasso.predict(pd.DataFrame([[100, 1.5, 10, 2],
                                   [150, 2.0, 20, 3]],
                                  columns=["MolWt", "LogP", "TPSA", "NumRings"]))
array([70.44917852, 95.30123863])
smiles_list = ["C1CCCC(C)C1", "CCCO"]

# Explicitly call your function for each SMILES
desc_rows = [calc_descriptors_row(s) for s in smiles_list]

# Build DataFrame
new_desc = pd.DataFrame(desc_rows)
new_desc
| | MolWt | LogP | TPSA | NumRings |
|---|---|---|---|---|
| 0 | 98.189 | 2.5866 | 0.00 | 1.0 |
| 1 | 60.096 | 0.3887 | 20.23 | 0.0 |
# Predict
two_new_preds = fitted_lasso.predict(new_desc)

# Show results
results = pd.DataFrame({"SMILES": smiles_list, "Prediction": two_new_preds})
print(results)
        SMILES  Prediction
0  C1CCCC(C)C1   69.345037
1         CCCO   51.000708

If you combine all of the above, the workflow looks like this:

  1. Define the features (MolWt, LogP, TPSA, NumRings) and the target (Melting Point).

  2. Split data: 80% train, 20% test.

  3. Model selection:

  • Last lecture: fit on train directly when no hyperparameter tuning is needed.

  • Lasso Regression (this lecture): use 4-fold cross-validation on train to pick the best alpha, then refit on the full train set.

  4. Evaluate the chosen model on the test set for generalization.

  5. Apply the final model to new molecules for prediction.
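Step 3 can also be automated with GridSearchCV, which runs the same CV loop over a grid of candidate values and refits the best model on the full training set. A minimal sketch consistent with the settings above (the alpha grid is the one we searched manually):

# Automate the alpha search with GridSearchCV
grid = GridSearchCV(
    Lasso(max_iter=500),
    param_grid={"alpha": [0.1, 1, 10, 30, 100]},
    cv=kf, scoring="r2", refit=True
)
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_["alpha"])
print(f"Test R² = {r2_score(y_test, grid.predict(X_test)):.3f}")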



7. Explaining the model with feature importance#

Two views

  • Coefficients: direction and weight (Lasso shrinks some to zero).

  • Permutation importance: how much test R² drops when a feature is shuffled.

# --- Coefficient-based importance (Lasso) ---
coef_importance = pd.Series(
    fitted_lasso.coef_,
    index=fitted_lasso.feature_names_in_
).sort_values(key=abs, ascending=False)

print("Lasso coefficients (feature importance):")
print(coef_importance)

# Plot coefficients
coef_importance.plot(kind="barh", figsize=(6,4))
plt.title("Feature Importance from Lasso Coefficients")
plt.xlabel("Coefficient Value")
plt.gca().invert_yaxis()
plt.show()

# --- Permutation importance (model agnostic) ---
perm_importance = permutation_importance(
    fitted_lasso, X_test, y_test,
    scoring="r2", n_repeats=20, random_state=1
)

perm_series = pd.Series(
    perm_importance.importances_mean,
    index=fitted_lasso.feature_names_in_
).sort_values(ascending=True)

print("\nPermutation importance (mean decrease in R²):")
print(perm_series)

# Plot permutation importance
perm_series.plot(kind="barh", figsize=(6,4))
plt.title("Permutation Importance (on Test Set)")
plt.xlabel("Mean Decrease in R²")
plt.show()
Lasso coefficients (feature importance):
MolWt       0.492808
TPSA        0.021167
LogP       -0.000000
NumRings   -0.000000
dtype: float64
[Figure: bar chart of Lasso coefficient importance]
Permutation importance (mean decrease in R²):
LogP        0.000000
NumRings    0.000000
TPSA        0.002218
MolWt       1.607811
dtype: float64
[Figure: bar chart of permutation importance on the test set]
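Under the hood, permutation importance is just "shuffle one column, re-score, record the drop." A minimal by-hand sketch for a single feature (MolWt), using the fitted model and test split from Section 5:

# One shuffle of MolWt; permutation_importance repeats this n_repeats times
rng = np.random.default_rng(1)
base_r2 = r2_score(y_test, fitted_lasso.predict(X_test))
X_shuffled = X_test.copy()
X_shuffled["MolWt"] = rng.permutation(X_shuffled["MolWt"].values)
drop = base_r2 - r2_score(y_test, fitted_lasso.predict(X_shuffled))
print(f"R² drop after shuffling MolWt: {drop:.3f}")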

8. Glossary#

Lasso Regression#

Linear regression with L1 regularization that can set some coefficients to exactly zero (feature selection).

Alpha (λ)#

Penalty strength in Lasso. Larger alpha increases regularization and may reduce the number of active features.

Cross validation (CV)#

Split the data into folds, train on k‑1 folds and validate on the held‑out fold; repeat and average.

R² (coefficient of determination)#

Measures how much variance the model explains: $R^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$. A value of 1 is perfect; 0 matches the mean predictor.

MSE#

Mean of squared errors. Sensitive to large errors.

MAE#

Mean of absolute errors. More robust to outliers than MSE.

Permutation importance#

Test‑time feature shuffle to estimate performance drop for each feature.

Coefficient importance#

Weights in a linear model. In Lasso, many are shrunk to 0.

Feature redundancy#

Two or more features carry nearly the same information (high correlation).

GridSearchCV#

Scikit‑learn tool that evaluates parameter grids with CV and returns the best estimator.


9. Quick reference#

Common scikit‑learn patterns

  • Split: train_test_split(X, y, test_size=0.2, random_state=...)

  • CV: KFold(n_splits=4, shuffle=True, random_state=...)

  • Score: cross_val_score(model, X, y, cv=kf, scoring="r2")

  • Grid search: GridSearchCV(model, {"alpha":[...]}, cv=kf, scoring="r2", refit=True)

  • Metrics: mean_squared_error, mean_absolute_error, r2_score

Diagnostic plots

  • Parity: predicted vs true (should hug the diagonal).

  • Residuals: residual vs prediction (should center near 0 without pattern).

  • CV curve: metric vs hyperparameter; pick the stable peak.
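The CV curve in the last bullet can be produced with validation_curve, which we imported in Section 1.1. A minimal sketch sweeping Lasso's alpha (the grid and max_iter are illustrative choices):

# CV mean R² as a function of alpha
alphas = np.logspace(-2, 3, 10)
train_scores, val_scores = validation_curve(
    Lasso(max_iter=5000), X, y,
    param_name="alpha", param_range=alphas,
    cv=KFold(n_splits=4, shuffle=True, random_state=1), scoring="r2"
)
plt.semilogx(alphas, val_scores.mean(axis=1), marker="o")
plt.xlabel("alpha (log scale)"); plt.ylabel("CV mean R²")
plt.title("CV curve for Lasso alpha")
plt.show()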


10. In‑class activity#

Each task uses the functions and concepts above. Fill in the ... lines.

Q1. Prepare log‑solubility and visualize#

Prepare log-solubility, split 80/20 (seed=15), and visualize.

  1. Create y_log = log10(Solubility_mol_per_L + 1e-6),

  2. Set X = [MolWt, LogP, TPSA, NumRings],

  3. Split 80/20 with random_state=15,

  4. Plot a histogram of y_log, then a pairwise scatter matrix of the four descriptors.

Hint: after splitting, keep the split (for example as X_tr, X_te, y_tr, y_te) so you can reuse it in Q2 and Q3.

# TO DO
# y_log = ...
# X = ...
# X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=15)

# Plots
# ...

Q2. Tune Ridge on the training split#

  • Search alphas on np.logspace(-2, 3, 12) using 5‑fold CV on (X_tr, y_tr)

  • Report best alpha and mean CV R², then refit Ridge(alpha=best) on train

# TO DO
# cv = KFold(n_splits=..., shuffle=..., random_state=...)
# alphas = ...
# ...
# ...
# ridge_best = Ridge(alpha=...).fit(..., ...)
# print(f"best alpha={...:.4f}  CV mean R2={max(...):.3f}")

Q3. Evaluate Ridge on test and draw diagnostics#

  • Compute MSE, MAE, and R² on (X_te, y_te)

  • Plot a parity plot and a residual plot

# TO DO
# y_hat = ...
# print(f"Test MSE={...:.4f}  MAE={...:.4f}  R2={...:.3f}")
# Parity and residual plots
# ...

Q4. Predict two new molecules (Ridge)#

Use the trained Ridge to predict log‑solubility for:

  • [135.0, 2.0, 9.2, 2]

  • [301.0, 0.5, 17.7, 2]

# TO DO
# X_new = np.array([[135.0, 2.0, 9.2, 2],
#                   [301.0, 0.5, 17.7, 2]])
# y_new = ridge_best.predict(X_new)
# print(pd.DataFrame({"MolWt": X_new[:,0], "LogP": X_new[:,1], "TPSA": X_new[:,2], "NumRings": X_new[:,3],
#                     "Predicted log10(solubility)": y_new}))

Q5. Interpret Ridge with coefficients and permutation importance#

  • Show coefficient magnitudes and signs

  • Compute permutation importance on the test split and plot

# TO DO
# feat = ["MolWt","LogP","TPSA","NumRings"]
# coef_ser = ...
# coef_ser.plot(kind="barh"); ...

# perm = permutation_importance(...)
# perm_ser = ...
# perm_ser.plot(kind="barh"); ...

11. Solutions#

11.1 Solution Q1#

# Log target
df_reg_mp = df_reg_mp.copy()
y_log = np.log10(df_reg_mp["Solubility_mol_per_L"] + 1e-6)

# Features and split
X = df_reg_mp[["MolWt","LogP","TPSA","NumRings"]].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y_log, test_size=0.2, random_state=15)

# Plots
plt.figure(figsize=(5,3))
plt.hist(y_log, bins=30, alpha=0.85)
plt.xlabel("log10(Solubility + 1e-6)"); plt.ylabel("Count"); plt.title("Log-solubility")
plt.show()

pd.plotting.scatter_matrix(df_reg_mp[["MolWt","LogP","TPSA","NumRings"]], figsize=(5.5,5.5))
plt.suptitle("Descriptor scatter matrix", y=1.02); plt.show()
[Figures: log-solubility histogram; descriptor scatter matrix]

11.2 Solution Q2#

cv = KFold(n_splits=5, shuffle=True, random_state=1)
alphas = np.logspace(-2, 3, 12)
means = [cross_val_score(Ridge(alpha=a), X_tr, y_tr, cv=cv, scoring="r2").mean() for a in alphas]
best_a = float(alphas[int(np.argmax(means))])
ridge_best = Ridge(alpha=best_a).fit(X_tr, y_tr)
print(f"best alpha={best_a:.4f}  CV mean R2={max(means):.3f}")
best alpha=1.8738  CV mean R2=0.968

11.3 Solution Q3#

y_hat = ridge_best.predict(X_te)
print(f"Test MSE={mean_squared_error(y_te,y_hat):.4f}  "
      f"MAE={mean_absolute_error(y_te,y_hat):.4f}  "
      f"R2={r2_score(y_te,y_hat):.3f}")

plt.figure(figsize=(4.2,4))
plt.scatter(y_te, y_hat, alpha=0.7)
lims = [min(y_te.min(), y_hat.min()), max(y_te.max(), y_hat.max())]
plt.plot(lims, lims, "k--", lw=1)
plt.xlabel("True log-solubility"); plt.ylabel("Predicted"); plt.title("Parity — Ridge"); plt.show()

resid = y_te - y_hat
plt.figure(figsize=(4.2,4))
plt.scatter(y_hat, resid, alpha=0.7); plt.axhline(0, color="k", ls=":")
plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals — Ridge"); plt.show()
Test MSE=0.0160  MAE=0.1024  R2=0.970
[Figures: parity plot and residual plot for Ridge]

11.4 Solution Q4#

X_new = np.array([
    [135.0,  2.0,  9.2, 2],   # Molecule A
    [301.0,  0.5, 17.7, 2]    # Molecule B
])  # descriptors: [MolWt, LogP, TPSA, NumRings]

y_new = ridge_best.predict(X_new)
print(pd.DataFrame({
    "MolWt": X_new[:,0], "LogP": X_new[:,1], "TPSA": X_new[:,2], "NumRings": X_new[:,3],
    "Predicted log10(solubility)": y_new
}))
   MolWt  LogP  TPSA  NumRings  Predicted log10(solubility)
0  135.0   2.0   9.2       2.0                    -1.195502
1  301.0   0.5  17.7       2.0                    -0.501762

11.5 Solution Q5#

feat = ["MolWt","LogP","TPSA","NumRings"]

coef_ser = pd.Series(ridge_best.coef_, index=feat).sort_values(key=np.abs, ascending=False)
print("Ridge coefficients:\n", coef_ser)
coef_ser.plot(kind="barh"); plt.gca().invert_yaxis()
plt.xlabel("Coefficient"); plt.title("Ridge coefficients"); plt.show()

perm = permutation_importance(ridge_best, X_te, y_te, scoring="r2", n_repeats=30, random_state=1)
perm_ser = pd.Series(perm.importances_mean, index=feat).sort_values()
perm_ser.plot(kind="barh"); plt.xlabel("Mean decrease in R²"); plt.title("Permutation importance on test"); plt.show()
Ridge coefficients:
 LogP       -0.437668
NumRings   -0.002761
TPSA       -0.000699
MolWt       0.000260
dtype: float64
[Figures: Ridge coefficient bar chart; permutation importance on the test set]