Machine Learning for Natural Language Generation

My main interest and PhD topic is the summarisation of time-series data, that is, the task of automatically generating text from variables whose values change over time. In particular, I’m interested in data-driven methods for content determination, the task of choosing what to say, i.e. which information to include in a report. Content selection decisions based on trends in time-series data determine which useful and important variables should be conveyed in a summary. The decision to select a variable can be influenced by other variables whose values are correlated with it, by the presence or absence of other variables in the summary, and by the variable’s behaviour over time. Moreover, some variables may have to be discussed together in order to achieve a communicative goal, such as providing explanations or motivating users.

I have developed two approaches to time-series data summarisation:

  • Multi-label classification, which is successful when the generation goal is to replicate the gold standard.
  • Content selection optimisation, which is useful when the goal is to increase user satisfaction.
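
To make the multi-label framing concrete, here is a minimal sketch under stated assumptions; it is an illustration only, not the implementation from the publications listed on this page. Each variable is one label ("mention it in the summary" or not), each variable's behaviour over time is reduced to a trend (least-squares slope), and the labels for a new record are chosen by a per-label majority vote among the k nearest training records. The variable names (marks, hours_studied, attendance), the toy data and the helper names are all hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Record = Dict[str, Sequence[float]]  # variable name -> values over time
Example = Tuple[Record, Set[str]]    # (time series, variables mentioned in the gold summary)


def slope(series: Sequence[float]) -> float:
    """Least-squares slope of the values against time steps 0..n-1."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den if den else 0.0


def featurise(record: Record, variables: Sequence[str]) -> List[float]:
    """Describe each variable's behaviour over time by its trend."""
    return [slope(record[v]) for v in variables]


def select_content(train: Sequence[Example], query: Record,
                   variables: Sequence[str], k: int = 3) -> Set[str]:
    """Pick the variables that a majority of the k nearest examples mention."""
    q = featurise(query, variables)

    def distance(example: Example) -> float:
        x = featurise(example[0], variables)
        return sum((a - b) ** 2 for a, b in zip(x, q))

    neighbours = sorted(train, key=distance)[:k]
    return {
        v for v in variables
        if 2 * sum(v in labels for _, labels in neighbours) > len(neighbours)
    }


# Hypothetical toy data: three variables observed over three time steps.
VARIABLES = ["marks", "hours_studied", "attendance"]
TRAIN: List[Example] = [
    ({"marks": [50, 60, 70], "hours_studied": [2, 4, 6], "attendance": [5, 5, 5]},
     {"marks", "hours_studied"}),
    ({"marks": [70, 60, 50], "hours_studied": [6, 4, 2], "attendance": [5, 4, 3]},
     {"marks", "attendance"}),
    ({"marks": [60, 60, 60], "hours_studied": [2, 4, 6], "attendance": [5, 5, 5]},
     {"hours_studied"}),
    ({"marks": [50, 58, 66], "hours_studied": [4, 4, 4], "attendance": [5, 5, 5]},
     {"marks"}),
]
QUERY: Record = {"marks": [40, 52, 64], "hours_studied": [1, 3, 5], "attendance": [4, 4, 4]}

print(sorted(select_content(TRAIN, QUERY, VARIABLES)))  # ['hours_studied', 'marks']
```

The sketch treats the labels independently and leaves the features unscaled, so the variable with the largest slopes dominates the distance. A realistic system would normalise the features and could also model dependencies between labels, such as variables that should be discussed together.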

Dimitra Gkatzia, Helen Hastie and Oliver Lemon. Comparing Multi-label classification with Reinforcement Learning for Summarisation of Time-series data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL). Baltimore, USA, 2014 (2014 acceptance rate: 26.2%). [pdf]

User-adaptive Systems: Addressing known and unknown users’ preferences

I’m investigating ways of adapting a system to different users simultaneously. Traditional adaptive systems tailor their output to several users by employing user modelling techniques or by developing different systems for different user groups (e.g. doctors and patients or relatives). In contrast, I’m looking into ways of finding the middle ground between the preferences of different user groups. Further investigation into when such systems are feasible and useful could reduce development effort and time. Automatic feedback generation is one concrete example where such a system can be useful, as teachers and students need access to the same feedback summary in order to discuss it further. On the other hand, such systems might not be useful in the medical domain, as clinicians need access to detailed physiological and medical data, which patients might not be able to understand.

Dimitra Gkatzia, Helen Hastie and Oliver Lemon. Multi-adaptive Natural Language Generation using Principal Component Regression. In Proceedings of the 8th International Natural Language Generation Conference (INLG). Philadelphia, USA, 2014. [pdf]

Health Informatics

Dimitra Gkatzia, Verena Rieser, Alexander McSporran, Alistair McGowan, Alasdair Mort and Michaela Dewar. Generating Verbal Descriptions from Medical Sensor Data: A Corpus Study on User Preferences. BCS Health Informatics Scotland. Glasgow, UK, 2014. [pdf]

Assisting University Students through Automatic Feedback Generation

Dimitra Gkatzia, Helen Hastie, Srinivasan Janarthanam and Oliver Lemon. Generating student feedback from time-series data using Reinforcement Learning. In Proceedings of the 14th European Workshop on Natural Language Generation (ENLG), Sofia, Bulgaria, 2013. [pdf] [ppt]

Computer-Human Interaction: Location-based apps

I’m interested in using machine learning approaches to predict how suitable an area is for running, given its location features. My research visit to Aalto University in November 2015 concerned this task.