Audio Notes: Blending Azure AI Speech and Azure Language to Create a Micro SaaS
by: jamie_maguire


In an earlier blog post, I introduced Audio Notes.

 

This is a new SaaS experiment that uses artificial intelligence, speech-to-text, and text analytics to automatically summarise audio and create concise notes from your recordings.

 

You can find more information about the project in these earlier blog posts.  These set the scene and show which Azure AI services make this end-to-end solution possible:

  1. Introduction
  2. Using Azure AI Speech to Perform Continual Speech to Text
  3. Transcription Using Azure AI Language to Perform Document Summarization

 

In this blog post (Part 4), you will see how I blend Azure AI Speech and Azure AI Language services to create the micro SaaS Audio Notes: a solution that automatically creates concise notes from the spoken word.

Read on to see how all this hangs together.

~

Process and Overview

To recap, the entire end-to-end process consists of the following steps:

  1. Clicking Start Recognition in an ASP.NET web page activates Azure AI Speech Services using the JavaScript SDK. Several events are subscribed to.
  2. The microphone is activated and anything that is said is sent to Azure AI Speech Services for transcription. Spoken words or phrases are then captured and injected into a div in the user interface.
  3. A service class, TextAnalyticsService, encapsulates the core document summarization capabilities. The transcribed audio is passed to this class as a string.
  4. The service class invokes Azure AI Language using the Azure.AI.TextAnalytics SDK, passing in the transcribed audio.
  5. The transcribed audio is then summarized by Azure AI Language, which returns an object containing a list of sentences that represent the main points mentioned in the speech transcript.

 

The sequence diagram below shows each step and the interactions between the main components that form the solution:

~

User Interface

A simple web interface is used. It lets you start and stop the process, watch the transcription happen in real time in a console, and view the final output of the transcription process.

You can also generate summarized content from the audio that was transcribed and stored in the Recognized Text textbox.

You can see the web page here:

A description of each of the main elements now follows.
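First, based on the element ids referenced by the scripts throughout this post, here is a rough sketch of the markup involved. This is illustrative only; the element types are assumptions inferred from how the JavaScript uses them, not the actual Razor view:

<!-- Sketch of the main page elements referenced by the JavaScript in this post -->
<button id="startRecognizeOnceAsyncButton">Start Recognition</button>
<button id="stopRecognizeOnceAsyncButton" disabled>Stop Recognition</button>

<!-- Audio console: interim results from the recognizing event are appended here -->
<textarea id="phraseDiv" rows="6"></textarea>

<!-- Recognized Text: the full transcript captured by the recognized event -->
<textarea id="recognizedDiv" rows="6"></textarea>

<button id="btnSummarize">Summarise Transcript</button>

<!-- The summarised content returned by the AJAX call is rendered here -->
<div id="mainPageContainer"></div>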

 

Start Recognition and Stop Recognition Buttons

A method in the Azure AI Speech Services JavaScript SDK is called to start continuous speech recognition when the Start Recognition button is clicked:

speechRecognizer.startContinuousRecognitionAsync();
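For context, the speechRecognizer instance used here is created from a speech configuration and the default microphone. A minimal sketch, assuming the Speech SDK for JavaScript is loaded as SpeechSDK (the key, region, and language values are placeholders):

// Assumed setup for the speechRecognizer used above.
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<your-speech-key>", "<your-region>");
speechConfig.speechRecognitionLanguage = "en-US";

// Capture audio from the default microphone.
const audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

// The recognizer that the start/stop calls and event handlers in this post are attached to.
const speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);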

 

Similarly, when the Stop Recognition button is clicked, the recognition mechanism is stopped using the following method in the JavaScript SDK:

stopRecognizeOnceAsyncButton.addEventListener("click", function () {

  // Make the following call at some point to stop recognition:
  speechRecognizer.stopContinuousRecognitionAsync();

  startRecognizeOnceAsyncButton =  document.getElementById("startRecognizeOnceAsyncButton");
  stopRecognizeOnceAsyncButton = document.getElementById("stopRecognizeOnceAsyncButton");

  startRecognizeOnceAsyncButton.disabled = false;
  stopRecognizeOnceAsyncButton.disabled = true;
});

 

Audio Console

When words are processed and detected by the recognizing event, the text is concatenated in real-time and displayed in the audio console:

The following JavaScript makes this possible:

speechRecognizer.recognizing = (s, e) => {
  console.log(`RECOGNIZING: Text=${e.result.text}`);
  phraseDiv.value += e.result.text + '\r\n';
};

 

You can see a demo of this in action in this earlier blog post.

 

Recognized Text

When speech has been completely recognized and the entire transcript has been captured, the output is displayed in the Recognized Text textbox:

 

The following JavaScript makes this possible:

speechRecognizer.recognized = (s, e) => {
  // Indicates that recognizable speech was not detected, and that recognition is done.
  if (e.result.reason === SpeechSDK.ResultReason.NoMatch) {
    var noMatchDetail = SpeechSDK.NoMatchDetails.fromResult(e.result);

    console.log('(recognized event NO MATCH)  Reason: ' + SpeechSDK.ResultReason[e.result.reason] + ' NoMatchReason: ' + SpeechSDK.NoMatchReason[noMatchDetail.reason]);
  }
  else {
    console.log('RECOGNIZED: ' + SpeechSDK.ResultReason[e.result.reason] + ' Text: ' + e.result.text);
    var recognizedDiv = $("#recognizedDiv");
    recognizedDiv.append(e.result.text + '\r\n');
  }
};

 

Some error handling is included in the above JavaScript to detect if no match was possible.  At this point, the transcribed audio is ready to be sent for document summarization.

 

Summarise Transcript

The Summarise Transcript button is used to send the recognized speech result to Document Summarization for processing:

The following JavaScript is used to perform this:

<script>
  $(document).on("click", "#btnSummarize", function (e) {
    var data = $("#recognizedDiv").val();

    $.ajax({
      url: "/SpeechToText/SummarizeText",
      data: {
        document: data
      },
      success: function (result) {
        console.log(result);
        var summarized = '';

        // Build the summary output, one extracted sentence per line.
        for (var i = 0; i < result.sentences.length; i++) {
          console.log(result.sentences[i].text);
          summarized = summarized + result.sentences[i].text + '.' + '<br/>';
        }
        $("#mainPageContainer").html(summarized);
      },
      error: function () {
        console.log('error');
      }
    });
  });
</script>

 

Summarised Content

The response from the AJAX call contains the summarized content.  This is displayed in the div at the bottom of the page:

The earlier script simply loops through the list of summarized sentences:

for (var i = 0; i < result.sentences.length; i++) {
  console.log(result.sentences[i].text);
  summarized = summarized + result.sentences[i].text + '.' + '<br/>';
}
$("#mainPageContainer").html(summarized);

~

A Closer Look Under the Hood

We can take a closer look at the code that is invoked when the AJAX call is made.  Specifically, we look at the controller and the service class.

Controller

Here we can see the ASP.NET controller endpoint that is invoked:

public async Task<JsonResult> SummarizeTextAsync(string document)
{
    try
    {
        var MyConfig = new ConfigurationBuilder().AddJsonFile("appsettings.json").Build();
        var TextAnalyticsEndpoint = MyConfig.GetValue<string>("AppSettings:TextAnalyticsEndpoint");
        var TextAnalyticsAPIKey = MyConfig.GetValue<string>("AppSettings:TextAnalyticsAPIKey");

        TextAnalyticsService ta = new TextAnalyticsService(TextAnalyticsEndpoint, TextAnalyticsAPIKey);

        var reply = await ta.SummarizeSpeechTranscript(document);

        return Json(reply);
    }
    catch (Exception ex)
    {
        return new JsonResult("Error:" + ex.Message);
    }
}

 

We read the Azure AI Language endpoint and API key from the appsettings.json file.  The transcribed text arrives as the document parameter and is passed on to a service class, TextAnalyticsService.
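For reference, the relevant section of appsettings.json would look something like the following sketch (both values are placeholders; the endpoint format assumes an Azure AI Language resource):

{
  "AppSettings": {
    "TextAnalyticsEndpoint": "https://<your-language-resource>.cognitiveservices.azure.com/",
    "TextAnalyticsAPIKey": "<your-api-key>"
  }
}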

Service Class

The class TextAnalyticsService encapsulates all required logic that’s responsible for performing document summarization.
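The controller constructs this class with the endpoint and API key. A minimal sketch of the constructor, assuming the Azure.AI.TextAnalytics client library (the _client field is the one used by the summarization method below):

using System;
using Azure;
using Azure.AI.TextAnalytics;

public class TextAnalyticsService
{
    private readonly TextAnalyticsClient _client;

    public TextAnalyticsService(string endpoint, string apiKey)
    {
        // Create the Azure AI Language (Text Analytics) client from the values read by the controller.
        _client = new TextAnalyticsClient(new Uri(endpoint), new AzureKeyCredential(apiKey));
    }

    // SummarizeSpeechTranscript(string document) is shown below.
}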

The summarization itself is done in a method named SummarizeSpeechTranscript.

You can see this here:

public async Task<ExtractiveSummarizeResult?> SummarizeSpeechTranscript(string document)
{
    var batchInput = new List<string>
    {
        document
    };

    TextAnalyticsActions actions = new TextAnalyticsActions()
    {
        ExtractiveSummarizeActions = new List<ExtractiveSummarizeAction>() { new ExtractiveSummarizeAction() }
    };

    // Start analysis process.
    AnalyzeActionsOperation operation = await _client.StartAnalyzeActionsAsync(batchInput, actions);

    await operation.WaitForCompletionAsync();
    // View status.
    Console.WriteLine($"AnalyzeActions operation has completed");
    Console.WriteLine();

    // View results.
    await foreach (AnalyzeActionsResult documentsInPage in operation.Value)
    {
        IReadOnlyCollection<ExtractiveSummarizeActionResult> summaryResults = documentsInPage.ExtractiveSummarizeResults;

        foreach (ExtractiveSummarizeActionResult summaryActionResults in summaryResults)
        {
            if (summaryActionResults.HasError)
            {
                Console.WriteLine($"  Error!");
                Console.WriteLine($"  Action error code: {summaryActionResults.Error.ErrorCode}.");
                Console.WriteLine($"  Message: {summaryActionResults.Error.Message}");
                continue;
            }

            foreach (ExtractiveSummarizeResult documentResults in summaryActionResults.DocumentsResults)
            {
                if (documentResults.HasError)
                {
                    Console.WriteLine($"  Error!");
                    Console.WriteLine($"  Document error code: {documentResults.Error.ErrorCode}.");
                    Console.WriteLine($"  Message: {documentResults.Error.Message}");

                    continue;
                }

                Console.WriteLine($"  Extracted the following {documentResults.Sentences.Count} sentence(s):");
                Console.WriteLine();

                foreach (ExtractiveSummarySentence sentence in documentResults.Sentences)
                {
                    Console.WriteLine($"  Sentence: {sentence.Text}");
                    Console.WriteLine();
                }

                return documentResults;
            }
        }
    }
    return null;
}

 

An object of type ExtractiveSummarizeResult is returned.  This contains, among other things, a list of summarized sentences that best explain what the entire transcript was referring to.

We can see the properties and methods this object contains here:

public partial class ExtractiveSummarizeResult : TextAnalyticsResult
{
    private readonly IReadOnlyCollection<ExtractiveSummarySentence> _sentences;

    /// <summary>
    /// Initializes a successful <see cref="ExtractiveSummarizeResult"/>.
    /// </summary>
    internal ExtractiveSummarizeResult(
        string id,
        TextDocumentStatistics statistics,
        IList<ExtractiveSummarySentence> sentences,
        IList<TextAnalyticsWarning> warnings)
        : base(id, statistics)
    {
        _sentences = (sentences is not null)
            ? new ReadOnlyCollection<ExtractiveSummarySentence>(sentences)
            : new List<ExtractiveSummarySentence>();

        Warnings = (warnings is not null)
            ? new ReadOnlyCollection<TextAnalyticsWarning>(warnings)
            : new List<TextAnalyticsWarning>();
    }

    /// <summary>
    /// Initializes an <see cref="ExtractiveSummarizeResult"/> with an error.
    /// </summary>
    internal ExtractiveSummarizeResult(string id, TextAnalyticsError error) : base(id, error) { }

    /// <summary>
    /// The warnings that resulted from processing the document.
    /// </summary>
    public IReadOnlyCollection<TextAnalyticsWarning> Warnings { get; } = new List<TextAnalyticsWarning>();

    /// <summary>
    /// The collection of summary sentences extracted from the input document.
    /// </summary>
    public IReadOnlyCollection<ExtractiveSummarySentence> Sentences
    {
        get
        {
            if (HasError)
            {
                throw new InvalidOperationException(
                    $"Cannot access result for document {Id}, due to error {Error.ErrorCode}: {Error.Message}");
            }
            return _sentences;
        }
    }
}

 

The property we’re interested in is Sentences:

public IReadOnlyCollection<ExtractiveSummarySentence> Sentences
{
    get
    {
        if (HasError)
        {
            throw new InvalidOperationException(
                $"Cannot access result for document {Id}, due to error {Error.ErrorCode}: {Error.Message}");
        }
        return _sentences;
    }
}

 

It’s this property, serialized as sentences in the JSON response, that we loop through when the AJAX call returns data to the front-end:

for (var i = 0; i < result.sentences.length; i++) {
  console.log(result.sentences[i].text);
  summarized = summarized + result.sentences[i].text + '.' + '<br/>';
}

 

You can find the Azure AI Language SDK here.

~

Demo

We can see a demo of the entire solution in action here:

Next Steps

With everything in place, the prototype can now be published to Azure.

I’ll be repurposing the existing process I developed when publishing the online journal and mood tracker tool Daily Tracker.

~

Summary

In this blog post, we’ve seen how to blend Azure AI Speech and Language Services to create a solution that automatically creates concise notes from the spoken word.

We’ve seen how to:

  • capture speech in real time
  • perform real-time, continuous speech-to-text transcription
  • use document summarisation to generate concise notes from the spoken word

 

Stay tuned for an announcement when this prototype is published.

 

Further Reading and Resources

The following resources will give you more information about how this solution was developed and show you how the previous SaaS, Daily Tracker, was created.
