ECTS: 3

Course Outline

The aim of the course is to develop students’ analytical thinking for extracting information from complex datasets and solving problems using data-driven methods and Machine Learning techniques. Upon successful completion of the course, students will be able to identify, select, and apply appropriate methods from multivariate data analysis and Machine Learning to problems requiring supervised and unsupervised learning.

In supervised learning, students will be able to apply methods to (i) regression problems, using algorithms such as Least Squares Regression and Partial Least Squares Regression, and (ii) classification problems, using algorithms such as Logistic Regression, Multinomial Logistic Regression, k-Nearest Neighbors, and Decision Trees. In unsupervised learning, students will be able to apply methods to (i) dimensionality reduction problems and (ii) clustering problems, using well-known algorithms from the literature.

Students are expected to acquire practical skills in implementing these methodologies and techniques in the open-source programming language R, developing code within the RStudio Integrated Development Environment (IDE). The goal is to extract useful information hidden in raw data and to evaluate the fitted models, thereby contributing to the assessment of claims and conclusions in data-driven solutions.

Upon successful completion of the course, students will be able to:

  • Understand the fundamentals of Machine Learning, including supervised and unsupervised learning, basic terminology, and the workflow for knowledge discovery in environmental chemistry, biology, and life and health sciences.
  • Perform data preprocessing, including cleaning, transformation, normalization, and handling of missing values, and visualize univariate and multivariate data using R (histograms, density plots, boxplots, scatterplots, parallel coordinates, Chernoff faces, star plots, etc.).
  • Apply supervised learning methods to regression problems, including simple and multiple linear regression: fit models, check model assumptions, perform diagnostics, and interpret results in R.
  • Apply supervised learning methods for classification problems, including Logistic Regression, Multinomial and Ordinal Logistic Regression, Linear Discriminant Analysis, Naïve Bayes classifiers, Decision Trees, k-Nearest Neighbors, and Support Vector Machines, with practical case studies in R.
  • Evaluate predictive performance using appropriate metrics for regression (errors, absolute error, relative error) and classification (accuracy, precision, recall, F-measure), and validate models using cross-validation (hold-out, k-fold, leave-one-out, leave-p-out), bootstrapping, and graphical methods such as regression error characteristic curves and ROC curves.
  • Apply unsupervised learning techniques for clustering, including partitioning methods (k-means, k-medoids, EM) and hierarchical clustering (agglomerative and divisive algorithms) in R.
  • Apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Correspondence Analysis to multivariate datasets.
  • Combine multiple models using ensemble methods, including ensemble averaging for regression, majority voting for classification, and bootstrap aggregating (bagging), with applications in R.
  • Conduct experimental comparisons of algorithms, extract conclusions, and apply inferential statistics to assess model performance and generalization.
  • Understand the principles of Explainable Artificial Intelligence (XAI), including the “black-box” problem, interpretability, explainability, transparency, and global/local explanation techniques, and apply these methods to real-world datasets in environmental chemistry, biology, and life and health sciences using R.

Professors

Name Title email
Nikolaos Mittas Associate Professor nmittas@chem.duth.gr