Overview of Research for the Ultrax2020 project

Technological support in Speech and Language Therapy is currently sparse. In the previous Ultrax project we developed technology that enhances the ultrasound image of the tongue, making it clearer and easier to interpret. We showed that by using ultrasound to provide visual feedback of the tongue in real-time, children can learn to produce speech sounds which were previously impossible for them.

This follow-on project, Ultrax2020, takes that work further by developing the ultrasound tongue tracker into a tool for diagnosing specific types of speech sound disorders (SSD) and evaluating how easy it is to use ultrasound in NHS clinics. Specifically, the aim is to develop tools to support both initial diagnosis and objective monitoring of progress throughout the therapy term.

As with the preceding Ultrax project, this project comprises two complementary research tracks:

Technology Track

Technological challenges

The key objective of Ultrax2020 is to develop an ultrasound-based diagnostic assessment for identifying and categorising speech errors in children with SSDs. As a tool for clinical practice, this will alleviate the practical problems associated with time-consuming hand analysis, as well as the accuracy problems that inevitably come from relying on live subjective judgments, as is currently done. To achieve this goal, our research programme will address multiple technical challenges, most significantly:

1. Automatic identification of speech-segments of interest. Assessment begins with recordings of wordlists containing certain phonetic properties. Since word lists should be randomised to avoid priming effects, it is time consuming for clinicians to identify words and frames of interest from continuous recordings for analysis. Our first challenge is to develop automated frontend processing of ultrasound and audio data for SLT assessment and report generation.

2. Tracking of tongue-shapes for diagnosis. While the image enhancements we implemented in the previous Ultrax project show that ultrasound visual feedback offers an effective intervention, further work is required to meet the challenge of tongue tracking (both offline and realtime) for the purpose of extracting diagnostic data. A significant part of our research work will focus on ways to streamline and optimise our previously developed tongue tracking method.

3. Classification of tongue-shapes for differential diagnosis and quantification of progress in therapy. For this, two types of classification are in fact needed:

 A) Within-speaker shape classification – to determine, for example, whether children with perceptual homophony produce two (or more) phonemes in an identical manner or whether there is a covert contrast present. Likewise, we aim to determine whether tongue shapes have changed post-therapy, thereby quantifying improvement;

 B) Cross-speaker classification – we will develop a method not only for classifying tongue shapes as disordered (i.e. different from those occurring in typical children), but also for identifying the nature of the abnormality. 

In addition, work will be undertaken with a view to making classification available for real-time use.
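As an illustration of how the within-speaker question in (A) might be framed, the sketch below applies a two-sample permutation test to a single per-token measurement. This is only one possible framing and is not the project's method: it assumes each recorded token has already been reduced to one number (for example a hypothetical tongue-dorsum height measure), and the function name, parameters and values are invented for illustration.

```python
import random


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means of a and b.

    a, b: per-token measurements for two target phonemes (hypothetical).
    Returns an estimated p-value; a small value suggests the speaker
    produces the two targets differently (a possible covert contrast).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy numbers only, not project data.
target_1 = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4]
target_2 = [12.0, 12.2, 11.9, 12.1, 12.3, 11.8]
p_value = permutation_test(target_1, target_2)
```

A test of this kind makes no assumption about the distribution of the measurements, which may be convenient when only a handful of tokens per child is available.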

Classification approaches

The project objectives call for analysis of ultrasound tongue patterns in well-defined ways, such as: does a given tongue pattern conform to the intended phone class or to a different one? Can two tongue shapes be classified as belonging to the same phone class? How similar are two tongue patterns, in terms of the standard articulation of known phone classes? Such questions can readily be tackled using machine-learning models for classification. 

Attempts at quantification, comparison and (to some extent) classification of tongue patterns hitherto reported in the literature have largely relied upon expert-derived ad hoc metrics from hand-labelled tongue contours. Though we will implement selected examples of this approach as a baseline for our other approaches, most research effort will be invested in developing machine learning models. This will include both generative density models such as mixture density networks and variants of NADE, and discriminative models such as Deep Convolutional Neural Networks (DCNNs). We will investigate competing models which work with differing input data:

1. Image-based classification: “deep” machine learning models have frequently outperformed the previous state-of-the-art shallow models over the past 5 years. In shallow modelling, expert-derived algorithms first extract features from raw signals (e.g. images), and these features are then modelled separately. In deep modelling, raw signals are instead fed directly into a deep-structured machine learning model, which may then learn internal hierarchical representations of the data which most suit the task in hand. Inspired by recent successes in speech and image processing in this direction, we will develop classification models using raw ultrasound images as input data. This will make separate feature extraction (i.e. contour labelling) unnecessary. Image-based classification models will also have access to richer information than the tongue surface contour alone, as a full ultrasound frame contains significant extra information, including: transient appearance of the soft palate and sublingual floor; shadows of the mandible and hyoid bone; tongue tissue structures; and artefacts of ultrasound reflections due, for example, to retroflex configuration of the tongue. This offers image-based classification models a significant potential advantage.

2. Contour-based classification: application of DCNNs in other domains tells us they typically require a lot of training data to perform well. It may be that we will not have enough data available within this project to adequately model the range of phones and vocal tract physiologies. Only empirical experimentation can answer this. However, we shall mitigate this potential risk by developing competing classification models in parallel. Using comparable deep neural network models, these shall take tongue contours as their input features instead.
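To make the idea of contour-based input concrete, here is a deliberately simple sketch. It swaps in a nearest-centroid classifier for the deep neural networks described above, purely to show the shape of the data flow: a contour is represented as a fixed-length vector and assigned to the closest class. The class labels, the four-point contour representation and all values are assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train(labelled):
    """labelled: dict mapping class label -> list of contour vectors."""
    return {label: centroid(vs) for label, vs in labelled.items()}


def classify(model, contour):
    """Return the label whose centroid is nearest (Euclidean) to contour."""
    return min(model, key=lambda label: math.dist(model[label], contour))


# Toy contours: tongue height sampled at four points, back to front.
# These numbers are invented and are not project data.
training = {
    "alveolar": [[0.2, 0.5, 0.9, 1.4], [0.3, 0.6, 1.0, 1.5]],
    "velar": [[0.4, 1.3, 1.0, 0.5], [0.5, 1.4, 1.1, 0.6]],
}
model = train(training)
predicted = classify(model, [0.3, 0.5, 1.0, 1.4])
```

An image-based model would replace the four-number vector with the pixels of a full ultrasound frame, which is where the larger training-data requirement discussed above comes from.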

Final evaluation

By the end of the project we will be in the position of comparing 4 methods for classification of lingual articulation: i) baseline contour-based metrics from the literature; ii) contour-based machine learning classification models; iii) image-based machine learning models; iv) “gold standard” results provided by the SLTs and the project’s Clinical Partner.

Clinical Track – data collection, testing, analysis

Data collection

We are fortunate to have excellent data available from previous ultrasound-based projects, so work on tongue tracking and shape classification can begin immediately. However, we will also supplement this with data recorded by our NHS partners throughout the project. This will both increase the amount of data available for training models generally and allow us to vary the type of data recorded in response to findings as the project progresses.

Children aged 5-18 with all types of SSDs (primary and secondary) will be recruited by our clinical partners in NHS Lothian, NHS Grampian and NHS Greater Glasgow and Clyde. Each partner will recruit children most likely to benefit from ultrasound visual feedback on an ongoing basis in the middle 24 months of the project. With a target of recruiting 20 children per centre per year, we aim to collect data from and assess a total of 120 children. Therapy will take place over a minimum of six sessions per child.

Each clinic will run a series of case studies, allowing us to responsively update the ultrasound interface throughout the project. Thus, newly developed technology can be introduced for evaluation by SLTs as it becomes available. At all stages in this process we will collect consistent assessment data at the beginning and end of therapy. This will allow us to develop the diagnostic ultrasound technology to identify types of errors and to quantify progress.

Testing and analysis

To evaluate whether we have met our technological objectives, our NHS partners will evaluate the ultrasound systems live in the clinics.  We will analyse the data thus collected systematically off-line to answer multiple questions.

In-clinic analysis: Some previous research has suggested it is possible for SLTs to identify errors using visual inspection alone.  Our own group, however, has found that even highly-trained SLTs who may be experienced with ultrasound visual feedback often give erroneous feedback live to clients. A key question then will be to test this objectively and so determine to what extent SLTs can identify errors in real-time in the clinic. We will devise an assessment format for our clinical partners to record their initial observations from the live ultrasound. This will allow us to compare initial subjective observations with both the off-line quantitative analysis and later with the tongue-shape classifiers developed in the Technology Track.

Off-line, objective analysis: We will systematically analyse the ultrasound data by hand-labelling segments of interest using our partner’s Articulate Assistant Advanced (AAA) software, after which quantitative analysis will be undertaken to identify error types. The duration of the segments will be annotated and splines automatically fitted to every frame. Multiple splines will be exported to a workspace to allow comparison of tongue shapes. We will do this within-speaker, so that tongue shapes can be compared both qualitatively and quantitatively (using curve comparison methods).
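The text does not specify which curve comparison methods will be used. As one simple possibility, the sketch below computes a symmetric mean nearest-neighbour distance between two tongue splines, each assumed to be exported as a list of (x, y) points. The function name and the example coordinates are hypothetical.

```python
import math


def mean_nn_distance(curve_a, curve_b):
    """Symmetric mean nearest-neighbour distance between two point lists.

    For each point on one curve, find the closest point on the other,
    average those distances, then average the two directions.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(curve_a, curve_b) + one_way(curve_b, curve_a))


# Toy splines only: the second is the first raised by one unit.
spline_pre = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
spline_post = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
distance = mean_nn_distance(spline_pre, spline_post)
```

A distance of zero means the two splines coincide at the sampled points. Larger values indicate greater separation, in the same units as the exported coordinates.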

End goal

The ultimate goal of our research is that technology developed during the Ultrax2020 project will be used by Speech and Language Therapists (SLTs) to assess and diagnose SSDs automatically, leading to quicker, more targeted and cost-effective intervention.


Our findings so far include:

1. Development of a software prototype for SLT use. Through an iterative cycle of technology integration and testing with SLT partners, we are preparing a prototype clinical software package.

2. Collection of ultrasound tongue imaging data in an SLT context. New data is required to complement the ultrasound data we have now, reflecting the new diagnostic aim of this project. The SLT partners are collecting this, and it is being annotated, analysed, and used for machine learning and signal processing.

3. Formulation of an assessment and data collection protocol. To support the SLT data collection work, we have devised a protocol for assessment and data collection. This includes assessment wordlists, shape templates, and a therapy manual. In addition, we have held training and feedback days for the Ultrax2020 ultrasound visual biofeedback methodology.

4. Development of automatic identification of speech segments of interest. We have developed signal processing and machine learning algorithms to identify segments of interest from ultrasound and speech recordings.

5. Provision of advice and training for ultrasound-based speech therapy. We are giving advice and training on clinical use of ultrasound to our clinical partners, and to the wider SLT community as a general policy of advocacy.

6. Provision of data as a resource to the research community. We have released the Ultrasuite repository of ultrasound and acoustic data from child speech therapy sessions. This also includes a set of annotations, some manual and some automatically produced, and tools to process, transform and visualise the data.

7. Development of new signal processing and machine learning algorithms for the processing of ultrasound and acoustic data from child speech therapy sessions. This includes algorithms for speaker segmentation, for speech recognition, for probe geometry identification, and for automatic alignment of ultrasound and audio data.


The ultimate goal of our research is that Ultrax2020 will be used by Speech and Language Therapists (SLTs) to assess and diagnose SSDs automatically, leading to quicker, more targeted intervention. There are three large groups set to benefit from our research:

1. Children and Adults with Speech Sound Disorders

2. Speech and Language Therapists

3. Students of phonetics or Speech and Language Therapy

Our approach to impact in this project is through working with four NHS trusts and with an SME, Articulate Instruments.

Our work with the NHS trusts is focused on requirements capture, data collection, and the use of the ultrasound technology in therapy. Our primary interactions with the NHS trusts have come through the ongoing collaboration, particularly with one of the project researchers, as well as through annual workshops bringing together project staff and project partners. These have enabled us to understand better the requirements of clinical partners, and have informed the design of our research.

We work very closely with Articulate Instruments at a technical level, with collaborations on implementing new algorithms in their commercial product. This is the main route through which research advances made in the project can reach users.