Overview of Research for the Ultrax2020 project

Technological support in Speech and Language Therapy is currently sparse. In the previous Ultrax project we developed technology that enhances the ultrasound image of the tongue, making it clearer and easier to interpret. We showed that by using ultrasound to provide real-time visual feedback of the tongue, children can learn to produce speech sounds that were previously impossible for them.

This follow-on project, Ultrax2020, takes that work further by developing the ultrasound tongue tracker into a tool for diagnosing specific types of speech sound disorders (SSDs) and by evaluating how easy it is to use ultrasound in NHS clinics. Specifically, the aim is to develop tools that support both initial diagnosis and objective monitoring of progress throughout the therapy term.

As with the preceding Ultrax project, this project comprises two complementary research tracks:

Technology Track

Technological challenges

The key objective of Ultrax2020 is to develop an ultrasound-based diagnostic assessment for identifying and categorising speech errors in children with SSDs. As a tool for clinical practice, this will alleviate the practical problems associated with time-consuming hand analysis, as well as the accuracy problems that inevitably come from relying on live subjective judgments, as is currently done. To achieve this goal, our research programme will address multiple technical challenges, most significantly:

1. Automatic identification of speech segments of interest. Assessment begins with recordings of word lists containing certain phonetic properties. Since word lists should be randomised to avoid priming effects, it is time-consuming for clinicians to identify words and frames of interest in continuous recordings for analysis. Our first challenge is to develop automated front-end processing of ultrasound and audio data for SLT assessment and report generation.

2. Tracking of tongue shapes for diagnosis. While the image enhancements we implemented in the previous Ultrax project show that ultrasound visual feedback offers an effective intervention, further work is required to meet the challenge of tongue tracking (both offline and real-time) for the purpose of extracting diagnostic data. A significant part of our research work will be devoted to streamlining and optimising our previously developed tongue tracking method.

3. Classification of tongue shapes for differential diagnosis and quantification of progress in therapy. Two types of classification are needed for this:

 A) Within-speaker shape classification – to determine, for example, whether children with perceptual homophony produce two (or more) phonemes in an identical manner or whether a covert contrast is present. Likewise, we aim to determine whether tongue shapes have changed post-therapy, thereby quantifying improvement;

 B) Cross-speaker classification – we will develop a method not only for classifying tongue shapes as disordered (i.e. different from those occurring in typical children), but also for identifying the nature of the abnormality. 

In addition, work will be undertaken with the view to making classification available for real-time use.
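The first challenge above, locating words within a continuous recording, can be illustrated with a very simple energy-based detector on the audio channel. This is only a sketch under stated assumptions: the text does not say which segmentation method the project will use, and the function names, frame length and threshold below are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_segments(samples: List[float], frame_len: int,
                  threshold: float, min_frames: int = 2) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frame energy exceeds
    a non-negative threshold for at least min_frames consecutive frames."""
    segments = []
    run_start = None
    energies = frame_energies(samples, frame_len)
    # A trailing zero-energy sentinel closes any run still open at the end.
    for i, energy in enumerate(energies + [0.0]):
        if energy > threshold and run_start is None:
            run_start = i
        elif energy <= threshold and run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    return segments


# Synthetic signal (invented): silence, a "word", silence, a shorter "word", silence.
signal = [0.0] * 40 + [0.5, -0.5] * 20 + [0.0] * 40 + [0.5, -0.5] * 10 + [0.0] * 20
print(find_segments(signal, frame_len=10, threshold=0.01))
```

In practice the detected spans would still need to be matched to the randomised word list and aligned with the ultrasound frames, which is where most of the difficulty lies.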

Classification approaches

The project objectives call for analysis of ultrasound tongue patterns in well-defined ways, such as: does a given tongue pattern conform to the intended phone class or to a different one? Can two tongue shapes be classified as belonging to the same phone class? How similar are two tongue patterns, in terms of the standard articulation of known phone classes? Such questions can readily be tackled using machine-learning models for classification. 

Attempts at quantification, comparison and (to some extent) classification of tongue patterns hitherto reported in the literature have largely relied upon expert-derived ad hoc metrics computed from hand-labelled tongue contours. Though we will implement selected examples of this approach as a baseline for our other approaches, most research effort will be invested in developing machine learning models. These will include both generative density models, such as mixture density networks and variants of the Neural Autoregressive Distribution Estimator (NADE), and discriminative models, such as Deep Convolutional Neural Networks (DCNNs). We will investigate competing models which work with differing input data:

1. Image-based classification: “deep” machine learning models have frequently outperformed the previous state-of-the-art shallow models over the past five years. In shallow modelling, expert-derived algorithms first extract features from raw signals (e.g. images), and those features are then modelled separately. In deep modelling, raw signals are instead fed directly into a deep-structured machine learning model, which may then learn internal hierarchical representations of the data that best suit the task in hand. Inspired by recent successes in speech and image processing in this direction, we will develop classification models using raw ultrasound images as input data. This will make separate feature extraction (i.e. contour labelling) unnecessary. Image-based classification models will also have access to richer information than the tongue surface contour alone, as a full ultrasound frame contains significant extra information, including: transient appearance of the soft palate and sublingual floor; shadows of the mandible and hyoid bone; tongue tissue structures; and artefacts of ultrasound reflections due, for example, to retroflex configuration of the tongue. This offers image-based classification models a significant potential advantage.

2. Contour-based classification: application of DCNNs in other domains tells us they typically require a lot of training data to perform well. It may be that we will not have enough data available within this project to adequately model the range of phones and vocal tract physiologies. Only empirical experimentation can answer this. However, we shall mitigate this potential risk by developing competing classification models in parallel. These will use comparable deep neural network models but take tongue contours as their input features instead.
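To make the contour-based framing concrete: each tongue contour can be treated as a fixed-length vector, and a classifier maps that vector to a phone class. The sketch below uses a nearest-centroid classifier as a deliberately simple stand-in, not the deep neural network models described above. The contour format, class labels and all numeric values are invented for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical format: tongue-surface heights sampled at fixed positions.
Contour = Sequence[float]


def centroid(contours: List[Contour]) -> List[float]:
    """Element-wise mean of a list of equal-length contours."""
    n = len(contours)
    return [sum(c[i] for c in contours) / n for i in range(len(contours[0]))]


def train(labelled: Dict[str, List[Contour]]) -> Dict[str, List[float]]:
    """Build one mean contour (centroid) per phone class."""
    return {label: centroid(cs) for label, cs in labelled.items()}


def squared_distance(a: Contour, b: Contour) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(model: Dict[str, List[float]], contour: Contour) -> str:
    """Return the label whose centroid is closest to the given contour."""
    return min(model, key=lambda label: squared_distance(model[label], contour))


# Invented training contours for two phone classes.
training = {
    "t": [[1.0, 2.0, 3.0, 2.0], [1.2, 2.1, 3.1, 1.9]],
    "k": [[3.0, 2.5, 1.5, 1.0], [3.1, 2.4, 1.4, 1.1]],
}
model = train(training)
print(classify(model, [1.1, 2.0, 2.9, 2.0]))
```

An image-based model would follow the same train/classify pattern but take the full ultrasound frame as input, so no contour needs to be extracted first.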

Final evaluation

By the end of the project we will be in a position to compare four methods for classification of lingual articulation: i) baseline contour-based metrics from the literature; ii) contour-based machine learning classification models; iii) image-based machine learning models; iv) “gold standard” results provided by the SLTs and the project’s Clinical Partner.
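One way such a comparison could be scored is by measuring how well each automatic method agrees with the gold-standard labels. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic. Its use here is an assumption for illustration, since the text does not name the evaluation measure, and the labels are invented.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Proportion of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:  # both sources used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented example: gold-standard labels versus one classifier's output.
gold = ["typical", "typical", "error", "error"]
predicted = ["typical", "error", "error", "error"]
print(cohens_kappa(gold, predicted))
```

The same function could be applied to each of methods i) to iii) in turn, with method iv) supplying the reference labels.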

Clinical Track – data collection, testing, analysis

Data collection

We are fortunate to have excellent data available from previous ultrasound-based projects, so work on tongue tracking and shape classification can begin immediately. However, we will also supplement this with data recorded by our NHS partners throughout the project. This will both increase the amount of data available for training models generally and allow us to vary the type of data recorded in response to findings as the project progresses.

Children aged 5-18 with all types of SSDs (primary and secondary) will be recruited by our clinical partners in NHS Lothian, NHS Grampian and NHS Greater Glasgow and Clyde. Each partner will recruit the children most likely to benefit from ultrasound visual feedback on an ongoing basis during the middle 24 months of the project. With a target of recruiting 20 children per centre per year (three centres over two years), we aim to collect data from and assess a total of 120 children. Therapy will take place over a minimum of six sessions per child.

Each clinic will run a series of case studies, allowing us to responsively update the ultrasound interface throughout the project. Thus, newly developed technology can be introduced for evaluation by SLTs as it becomes available. At all stages in this process we will collect consistent assessment data. Collecting this data at the beginning and end of therapy will allow us to develop the diagnostic ultrasound technology so that it can identify types of errors and quantify progress.

Testing and analysis

To evaluate whether we have met our technological objectives, our NHS partners will evaluate the ultrasound systems live in the clinics.  We will analyse the data thus collected systematically off-line to answer multiple questions.

In-clinic analysis: Some previous research has suggested it is possible for SLTs to identify errors using visual inspection alone. Our own group, however, has found that even highly trained SLTs who are experienced with ultrasound visual feedback often give erroneous live feedback to clients. A key task will therefore be to test this objectively and so determine to what extent SLTs can identify errors in real time in the clinic. We will devise an assessment format for our clinical partners to record their initial observations from the live ultrasound. This will allow us to compare initial subjective observations both with the off-line quantitative analysis and, later, with the tongue-shape classifiers developed in the Technology Track.

Off-line, objective analysis: We will systematically analyse the ultrasound data by hand-labelling segments of interest using our partner’s Articulate Assistant Advanced (AAA) software, after which quantitative analysis will be undertaken to identify error types. The duration of the segments will be annotated and splines automatically fitted to every frame. Multiple splines will be exported to a workspace to allow comparison of tongue shapes. We will do this within-speaker, so that tongue shapes can be compared both qualitatively and quantitatively (using curve comparison methods).
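As an illustration of what a quantitative curve comparison might look like, the sketch below computes a symmetric mean nearest-neighbour distance between two tongue splines, each represented as a list of (x, y) points. The choice of measure, the point format and the values are assumptions for illustration. The text does not specify which curve comparison methods will be used.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # hypothetical spline sample, e.g. in millimetres


def nearest_distance(p: Point, curve: List[Point]) -> float:
    """Distance from point p to the closest point on the other curve."""
    return min(math.dist(p, q) for q in curve)


def mean_nn_distance(curve_a: List[Point], curve_b: List[Point]) -> float:
    """Symmetric mean nearest-neighbour distance between two curves.

    Averages the A-to-B and B-to-A mean distances so that the result
    does not depend on which curve is given first.
    """
    a_to_b = sum(nearest_distance(p, curve_b) for p in curve_a) / len(curve_a)
    b_to_a = sum(nearest_distance(q, curve_a) for q in curve_b) / len(curve_b)
    return (a_to_b + b_to_a) / 2.0


# Invented splines: the second is the first shifted up by one unit.
spline_1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
spline_2 = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(mean_nn_distance(spline_1, spline_2))
```

A distance near zero between two target phonemes would be consistent with identical productions, whereas a consistently larger distance could point towards a covert contrast.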

End goal

The ultimate goal of our research is that technology developed during the Ultrax2020 project will be used by Speech and Language Therapists (SLTs) to assess and diagnose SSDs automatically, leading to quicker, more targeted and cost-effective intervention.