How to Squeeze Machine Learning for EEG Data Analysis
First of all, thanks a lot for all the comments on my last blog post! I would like to dive a bit deeper into the topic, so, as announced, I will try to illustrate with some examples how to tackle the intrinsic variability of EEG data with the machinery offered by Machine Learning (also known as Pattern Recognition, Data Mining, or Computational Intelligence, among other names).
Pattern Recognition stages and EEG variability
Machine learning systems typically consist of several stages, which can also be found when analyzing EEG data with these techniques. First we have to acquire the data. Good knowledge of the sensors is a plus, although this stage is often neglected. In the case of EEG data acquisition you also have to be very careful during the experimental campaign, as described by my colleague Alejandro Riera in this blog post. Further stages of the system are signal pre-processing, feature extraction, classification, and performance evaluation.
Variability can be tackled at different stages. For instance, inter-subject variability is handled at the data acquisition stage. Here we need to ensure that the cognitive condition or task is the same for all subjects, so that no discriminative features appear in the signal by chance. Data pre-processing includes artifact detection and correction, filtering in the band of interest, and noise reduction, among others. In this stage we target intra-subject temporal variability. For instance, one usual method is to divide the signals into smaller time intervals, denoted epochs or trials. Working with different time intervals also allows us to estimate this temporal variability and/or to discard the intervals most contaminated by artifacts.
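The pre-processing steps above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and SciPy, with made-up channel counts, sampling rate, and band limits; a real pipeline would also reject artifact-laden epochs:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs, band=(8.0, 30.0), epoch_sec=2.0):
    """Band-pass filter a (channels x samples) EEG array and cut it into epochs."""
    low, high = band
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)  # zero-phase filtering in the band of interest
    # Split the continuous recording into non-overlapping epochs
    n = int(epoch_sec * fs)
    n_epochs = filtered.shape[1] // n
    epochs = filtered[:, : n_epochs * n].reshape(filtered.shape[0], n_epochs, n)
    return epochs.transpose(1, 0, 2)  # (epochs, channels, samples)

# Hypothetical recording: 8 channels, 10 s of data at 256 Hz
fs = 256
eeg = np.random.randn(8, 10 * fs)
epochs = preprocess(eeg, fs)
print(epochs.shape)  # (5, 8, 512): five 2-second epochs
```

Once the data is in epoch form, per-epoch statistics give you an estimate of the temporal variability, and the noisiest epochs can simply be dropped.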
Features and Classifiers for EEG
Feature extraction can decrease variability in all three domains. You will have to find the right features after visualizing the raw and pre-processed data, in order to find out which variability sources you would like the computed features to be free of. For instance, if you observe variability in alpha band power linked to variability in beta band power, you could decide to work with their ratio. Other feature extraction approaches in EEG data analysis use trainable spatial filters, particularly to tackle variability in the spatial domain, i.e. the variability of signals when jumping from one electrode to another. Although such approaches use classification procedures, even supervised ones, I think they have to be considered feature extraction approaches. This kind of feature extraction stage is very specific to EEG data analysis. Recently it has even been extended to tackle variability in the temporal, spectral, and spatial domains at once. I am not aware of such trainable feature extraction stages being used in any other application field of Machine Learning.
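The alpha/beta ratio mentioned above is easy to compute from the power spectral density. A minimal sketch, assuming SciPy and a synthetic single-channel signal (the band limits and sampling rate are illustrative choices, not prescriptions):

```python
import numpy as np
from scipy.signal import welch

def band_power(signal, fs, band):
    """Average power spectral density within a frequency band."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[mask].mean()

def alpha_beta_ratio(signal, fs):
    # Ratio of alpha (8-12 Hz) to beta (13-30 Hz) power: a single feature
    # that cancels out variability common to both bands (e.g. overall scaling)
    return band_power(signal, fs, (8, 12)) / band_power(signal, fs, (13, 30))

fs = 256
t = np.arange(4 * fs) / fs
# Synthetic signal: a 10 Hz alpha component stronger than a 20 Hz beta one
x = 2.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
print(alpha_beta_ratio(x, fs) > 1.0)  # True: alpha dominates
```

Because the ratio divides the two powers, a gain change that scales both bands equally leaves the feature untouched, which is exactly the kind of variability we wanted to remove.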
And so we come to classification, the king of machine learning, albeit an overrated one in my opinion, given that you will never achieve the right performance without good features. I think the overrated importance of classifiers in a machine learning system is due to the ease of publishing on the topic. Still, there are differences among classifiers, so you should select a robust one, and this is something you can only decide based on experimental work (see below). One way of approaching the intrinsic variability of EEG data is to use classifier ensembles at this stage. This way you can decompose your classification problem into several smaller ones, where each classifier in the ensemble takes care of a less variable data subset. This has been very well elucidated in a recent approach for P300 detection in the traditional speller application.
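A generic ensemble is straightforward to set up with scikit-learn. This is a hedged sketch on synthetic features, not the P300 approach referenced above: there, each base classifier would be trained on its own, less variable data subset (e.g. one per recording session), whereas here all three simply see the same toy data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Made-up feature matrix: 200 trials x 16 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable toy labels

# Soft voting averages the class probabilities of the base classifiers,
# so no single classifier's errors dominate the decision
ensemble = VotingClassifier(
    estimators=[
        ("lda", LinearDiscriminantAnalysis()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.score(X, y))  # high on this easy toy problem
```

To mirror the subset idea, you would fit each base classifier on its own partition of the trials before combining their votes.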
Don't forget to measure performance
Performance evaluation is the very last stage of a Machine Learning application, and one of the most important ones, given the lack of a priori superiority of any classifier or feature approach. As stated by the amusingly named "No Free Lunch" and "Ugly Duckling" theorems, you cannot establish the general superiority of one classifier or feature approach over another without experimental work. These theorems establish pattern recognition as an experimental science by linking the performance of an approach to the particular datasets being analyzed. This means you have to test your system. It is also essential to find a good performance measure, both for the datasets at hand and for generalization to future data from the same problem that the system has not seen before. It is important to improve performance, but it is just as important to get the actual picture, by which I mean the picture of the performance in a real-world application. Here you have to be as honest as possible with yourself, trying not to twist the results and following a well-designed test procedure. Otherwise you could end up finding that dead salmon are capable of distinguishing emotional states in human faces. This is a real challenge given the great number of parameters a machine learning system presents.
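The dead-salmon trap is easy to demonstrate. In this sketch (scikit-learn, entirely synthetic data with random labels, so there is genuinely nothing to learn) the training accuracy looks spectacular while honest cross-validation stays at chance level:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: 120 trials of 8 features with *random* labels --
# our "dead salmon", where any structure found is pure chance
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 8))
y = rng.integers(0, 2, size=120)

clf = KNeighborsClassifier(n_neighbors=1)
train_acc = clf.fit(X, y).score(X, y)  # 1.0: the classifier memorizes the training set

# Stratified k-fold keeps the class balance in every fold and only ever
# scores data the classifier has not seen
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(clf, X, y, cv=cv).mean()  # hovers around the 50% chance level

print(f"train: {train_acc:.2f}, cross-validated: {cv_acc:.2f}")
```

Only the cross-validated number is the "actual picture"; reporting the training score here would be twisting the results.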
photo credit: Sharon Drummond via photopin cc