Teaching activities are held in the form of graduate courses as well as annual courses run in collaboration with European experts. Our teaching complies with the guidelines of the Bologna convention, and we make extensive use of distance-learning methods. We offer one introductory course, Bioinformatics; the remaining courses are part of the Computational Biology module, which covers advanced bioinformatics tools and methods.
Communication with students and distribution of course materials take place through our bioinformatics educational portal, BEPo. The portal is based on the “Moodle” e-learning system and has been in use for over a decade.
1. Introduction, biological databases, network resources and using specific databases.
2. Basic sequence alignment tools, substitution matrices and dynamic programming.
3. Search by sequence similarity, heuristic methods
4. Tools and methods for multiple sequence alignment
5. Basic phylogenetics
6. Structure databases and modeling, visualization of biological macromolecules
7. Basics of functional genomics, DNA sequencing, genomic bioinformatics
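As an illustration of the dynamic programming topic in the list above, the following is a minimal Python sketch of global alignment scoring in the style of Needleman-Wunsch. It is not taken from the course materials: the function name and the simple match/mismatch/gap scores are assumptions chosen for brevity, whereas real tools typically use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not course parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: align two characters, or open a gap in a or b.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

For example, `global_alignment_score("ACGT", "AGT")` aligns three matching characters and one gap under this scoring scheme.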
Students work in groups and, with help from the professor and teaching assistants, carry out research on specific topics.
While working on their projects, students search biological databases to investigate the epidemiology, genetics and molecular mechanisms of the disease they are studying. After a literature search, they search sequence databases and learn to manipulate sequence information with different bioinformatics tools to obtain further information. Students complete their projects by performing phylogenetic analysis and classification and by presenting their results visually.
• A.M. Campbell, L.J. Heyer (2002) Discovering Genomics, Proteomics and Bioinformatics. J.H.Wiley & Sons
• N. C. Jones, P. A. Pevzner (2004) An Introduction to Bioinformatics Algorithms. MIT Press
• D.W. Mount (2004) Bioinformatics: Sequence and Genome Analysis 2nd ed. CSHL Press
• S. B. Primrose, R.M. Twyman (2003) Principles of Genome Analysis and Genomics 3rd ed. Blackwell Publishing
• R. Durbin, S. Eddy, A. Krogh, G. Mitchison (1998) Biological Sequence Analysis. Cambridge University Press
• P. Baldi, S. Brunak (2002) Bioinformatics: A Machine Learning Approach. MIT Press
Algorithms and programming
1. short history of the R programming language and the reasons behind its growing use in biology and data science;
2. overview of the basic integrated development environment (IDE) bundled with the R programming language;
3. getting acquainted with the RStudio IDE;
4. writing R scripts;
5. good programming practices (code formatting and writing code comments);
6. variable types in R;
7. writing basic functions in R; return value of the function;
8. control structures in R (if, if-else, for, while);
9. vectorized operations in R and their advantages over non-vectorized operations;
10. data visualization by using the functions in R base package;
11. using basic Linux command line tools (cd, ls, du, grep, chmod, find);
12. regular expressions; using regular expressions to search through and manipulate text data;
13. using external R packages;
14. packages microbenchmark and dplyr;
15. debugging in the RStudio IDE
In the practical part of the course, students are expected to apply the knowledge gained in classes to solve six groups of exercises. Each student’s solution is graded and counts toward the final grade for the class. Each student receives individual feedback with comments on every solved exercise. All groups of exercises are solved in the RStudio IDE using the R programming language, except for one (hours 21 to 25), which is solved in the Linux console.
1.-5. writing simple scripts; creating sequences; creating simple functions; creating simple line plots; vector variables; recursive functions; writing readable program code;
6.-10. using functions for generating random numbers from a variety of standard distributions, calculating probability density and cumulative distribution; creating histograms; matrix data type; data frame; list; replacement functions;
11.-15. loading data from a CSV-formatted text file; modifying and searching text data using regular expressions; loading data from unformatted text files; simulating stochastic events;
16.-20. infix functions; advanced manipulations of data frames; using the dplyr package to manipulate data on intron positions in the genome; expression data analysis;
21.-25. connecting to the Linux console of a remote computer; working with the Linux command line; data manipulation using the Linux command line;
26.-30. programming complex functions; using the RStudio IDE for debugging; breaking a complex problem into smaller parts (functions); comparing execution times with the microbenchmark package;
• N. Matloff (2011) The Art of R Programming 1st ed. No Starch Press
• R. Kabacoff (2011) R in Action 1st ed. Manning Publications
Machine learning and statistics
1. Populations and samples. Measurement scales. Categorical and numerical variables.
2. Descriptive statistics. Numerical analysis of distributions: measures of location and spread. Graphical analysis of distributions: pie charts, bar plots, histograms, box-and-whisker plots.
3. Confidence intervals, hypothesis testing, p-values. Student’s t-test.
4. Nonparametric statistical tests. Contingency table analysis.
5. Simple and multiple linear regression.
6. Analysis of variance.
7. Linear models for classification.
8. Regularization methods (ridge regression, LASSO, elastic net)
9. Model assessment and selection – resampling methods (bootstrap, cross-validation), feature selection methods.
10. Tree-based methods (regression and classification trees, bagging, boosting, Random Forests)
11. Support vector machines
12. Hidden Markov models
13. Principal components analysis (PCA)
14. Clustering methods (hierarchical clustering, k-means clustering)
1. Analysing distributions in the R statistical environment. Calculating the measures of location and spread implemented in the base package.
2. Analysing distributions in the R statistical environment. Functions for graphical analysis of samples.
3. Calculating confidence intervals. Hypothesis testing and errors in hypothesis testing (Type I and Type II errors). Power of a statistical test.
4. Comparison of different types of Student’s t-test (based on the dependence of the observations in different samples, the size of the samples and the equality of variance of the samples) in the R statistical environment.
5. Implementation of Fisher’s test, chi-squared test and hypergeometric test in the R statistical environment.
6. Analysis of linear regression models. Interactions. Qualitative predictors. Transformation of non-linear regression models. Analysis of residuals.
7. Implementation of analysis of variance in the R statistical environment.
8. Comparison of various linear models for classification (logistic regression, linear discriminant analysis, quadratic discriminant analysis) in the R statistical environment.
9. Comparison of ridge regression, LASSO and elastic net in the R statistical environment (glmnet package). Choosing the optimal value of λ.
10. Implementation of resampling methods in the R statistical environment (k-fold cross-validation, leave-one-out cross-validation).
11. Tree-based methods in the R statistical environment (tree, gbm and randomForest packages).
12. Support vector machines in the R statistical environment (e1071 package). Using cross-validation to select the optimal model parameters.
13. Using Hidden Markov Models to analyse biological sequences in the R statistical environment.
14. Principal components analysis in the R statistical environment (prcomp() function).
15. Clustering methods in the R statistical environment (hclust() and kmeans() functions). Choosing the optimal number of clusters.
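The resampling exercise in the list above (k-fold and leave-one-out cross-validation) rests on partitioning the sample indices into folds. The practicals themselves use R; the following Python sketch is only an illustration of the idea, and the function name `k_fold_indices` and its `seed` parameter are assumptions rather than part of any course material.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Setting k equal to n gives leave-one-out cross-validation.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")

    # Shuffle once so that folds do not depend on the original sample order.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]

    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each observation appears in exactly one test fold, so averaging the test error over the folds uses every observation for validation once.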
• G. James, D. Witten, T. Hastie and R. Tibshirani, “An Introduction to Statistical Learning, with Applications in R” (Springer, 2013)
• D. Montgomery and G. Runger, “Applied Statistics and Probability for Engineers” (John Wiley & Sons, Inc., 2003)
1. – 2. Introduction to computational genomics. The history of genome sequencing. Sequencing of the human genome. Definition of a gene.
3. – 4. Next-generation sequencing methods. Roche 454/pyrosequencing, Illumina GAII, Illumina HiSeq, ABI SOLiD, Helicos, Pacific Biosciences. Advantages and disadvantages of various sequencing platforms.
5. De novo genome assembly methods. Greedy algorithms, overlap-layout-consensus methods, de Bruijn graphs.
6. Genome annotation. Identification of genes, non-coding regions and regulatory patterns. Functional genome annotation.
7. Pre-processing of short reads obtained using next-generation sequencing. Quality control, adapter removal, filtering.
8. Mapping short reads to previously assembled genomes.
9. – 10. Methods for transcriptome analysis based on next-generation sequencing (RNA-Seq).
11. – 12. Methods for analyzing interactions of proteins and DNA (chromatin immunoprecipitation combined with next-generation sequencing, ChIP-Seq).
13. – 14. Determining the three-dimensional structure of the genome (3C, 4C, 5C and Hi-C methods).
15. Review of the material covered in the course and an overview of selected recent scientific publications based on next-generation sequencing methods.
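To make the de Bruijn graph approach to assembly from the list above more concrete, here is a minimal Python sketch that builds such a graph from a set of reads. It is an illustrative assumption, not the course's implementation: nodes are (k-1)-mers, each k-mer contributes one edge, and sequencing errors, reverse complements and the actual traversal of the graph are all ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-mer in a read adds one edge from its prefix to its suffix;
    repeated k-mers therefore produce repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

An assembler would then look for a path that uses every edge of this graph, which is what distinguishes the approach from overlap-layout-consensus methods.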
1. Introduction to genome browsers (UCSC Genome Browser).
2. Introduction to Bioconductor, a repository of R packages used for analysing biological data.
3. Manipulating and analysing strings with the Biostrings package.
4. Pattern matching.
5. – 6. Interval operations. IRanges and GenomicRanges packages.
7. – 8. Retrieving annotations from publicly available databases. GenomicFeatures package.
9. – 10. Reading and analysing short reads obtained using next-generation sequencing methods. ShortRead and Rsamtools packages.
11. Filtering and quality control of next-generation sequencing data.
12. Normalization of next-generation sequencing data.
13. Analysis of differentially expressed genes. DESeq, edgeR and DEXSeq packages.
14. Gene set enrichment analysis and Gene Ontology enrichment analysis.
15. Methods for analysing ChIP-Seq data. Determining the genomic location of protein binding sites.
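The interval operations covered in the list above come down to testing whether genomic ranges overlap. The practicals use the IRanges and GenomicRanges packages in R; the Python sketch below only illustrates the underlying test on closed intervals, and the function name `find_overlaps` and the tuple representation are assumptions made for this example. It compares every pair of intervals, which is fine for small inputs but much slower than the indexed approaches real packages use.

```python
def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs of overlapping intervals.

    Intervals are (start, end) tuples with both ends included.
    """
    hits = []
    for qi, (q_start, q_end) in enumerate(query):
        for si, (s_start, s_end) in enumerate(subject):
            # Two closed intervals overlap when each starts
            # no later than the other one ends.
            if q_start <= s_end and s_start <= q_end:
                hits.append((qi, si))
    return hits
```

With `query = [(1, 5), (10, 12)]` and `subject = [(5, 8), (13, 20)]`, only the first query interval and the first subject interval share a position (position 5).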
• R. C. Deonier, S. Tavare, M. S. Waterman, Computational Genome Analysis: An Introduction. Springer 2005.
• R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer 2006.
• N. Cristianini, M. Hahn, Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press, 2007.