Language Tutor


The idea from Heather is a language tutor that interactively coaches a student to continually improve their language skills. This is certainly not a new idea, but it sits squarely in Heather’s area of expertise: how to effectively communicate concepts to students & teachers, and how to get them to learn & retain information.

Product Ideas

  1. It should be multimodal, using pictures, audio, text, & video to iteratively test & assist the learner.
  2. It should know which learner it is dealing with so that it continually tailors its training.
  3. It should be engaging, e.g. maybe it could teach you about anything that you want to know about while it also teaches you fluency in some language. Of course, this instruction should also be multimodal.
  4. It should be able to target specific skill areas for more rapid development, i.e. if the student really wants to learn how to order at restaurants as soon as possible, then the instruction should attempt to get the student there as quickly as possible.
  5. It should be able to have a particular conversation with the student, e.g. “Teacher, I’d like to talk about classical architecture with you in Spanish.” The teacher would then start a conversation with the student, continuing where that particular student left off last time or starting from the beginning if need be. Where the student struggled, the teacher would assist & offer repetition later in the conversation or in other, unrelated conversations where applicable (a rough sketch of the per-student state this implies follows this list).
  6. It should be able to take in a video or image & teach the observer how to describe & discuss it.
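A minimal sketch of the per-student state that idea 5 implies, assuming a simple keyed-by-topic layout; the class & field names are placeholders, not a settled schema.

from dataclasses import dataclass, field


@dataclass
class ConversationRecord:
    # One topical conversation, e.g. classical architecture in Spanish.
    topic: str
    language: str
    transcript: list[str] = field(default_factory=list)  # alternating teacher/student turns
    trouble_spots: list[str] = field(default_factory=list)  # words & phrases the student struggled with


@dataclass
class StudentProfile:
    # Everything the tutor needs to resume where a student left off.
    student_id: str
    target_language: str
    skill_focus: list[str] = field(default_factory=list)  # e.g. ["ordering at restaurants"]
    conversations: dict[str, ConversationRecord] = field(default_factory=dict)  # keyed by topic

    def resume_or_start(self, topic: str, language: str) -> ConversationRecord:
        # Continue an existing conversation on this topic or open a new one.
        return self.conversations.setdefault(topic, ConversationRecord(topic, language))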

Product Questions

What should its name be? Teacher, …

What device would this teacher work through? iPad, laptop, phone, echo show type device with all mediums possible, chromebook, …

Is the original intention for use in the classroom? Which medium is the best one for the classroom?

How could we obtain customers? What do the potential customers want?

What is the simplest product that has demonstrable value? Is it audio questioning with feedback on the response?

Development Questions

What are the building blocks?

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     
 β”‚  Transcribe  β”‚        Translate  
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          Text

                         Translate                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           Audio                            β”‚   Topical    β”‚
                                                            β”‚ Conversation β”‚
                                                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Conversation in a different language. Conversation with a different historical figure.

Which order should things be built in?

What is the simplest product we could get working quickly?

How could we encode & compress the history of each student in order to iteratively build on that history toward mastery? If a 30 minute conversation in a new language is ~5k words & we can fit 1M words into context, then we can build up to ~100 hours of conversation into the model prompt. It’s likely that we can keep only the last X hours of conversation as context. This implies tradeoffs between perfect previous information & speed, cost, etc…

How fast can a local model be? How small can a local model be? Local models really change the cost game. Is it worth trying to become an expert in what Ollama is doing in order to be able to customize everything to run locally where possible?
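A rough sketch of the “keep only the last X hours” idea, assuming ~5k words per half hour of conversation; a real version would likely summarize older turns instead of dropping them.

def trim_history(turns, max_words=50_000):
    # Keep the most recent turns that fit in a word budget
    # (~50k words is roughly 5 hours at ~5k words per half hour).
    kept, total = [], 0
    for turn in reversed(turns):
        words = len(turn.split())
        if total + words > max_words:
            break
        kept.append(turn)
        total += words
    return list(reversed(kept))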

What is the online chat bot competition called? What can I learn from it?

Transcribe

We’ll need to be able to compare what was said against the correct word or phrase. Will we be able to use probabilities from the model to say how good the pronunciation was? Transcription is one way to do this: get a transcription, keep track of the level of confidence & which words are less certain, & feed the transcription to another model that can score it against the target & offer feedback.
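A rough sketch of that pipeline using openai-whisper for the transcription step; whisper reports per-word probabilities when word_timestamps is enabled, and the 0.5 threshold below is just a guess to be tuned.

import whisper  # openai-whisper


def transcribe_with_confidence(path):
    # Transcribe the learner's audio and keep per-word confidence scores.
    model = whisper.load_model("base")
    result = model.transcribe(path, word_timestamps=True)
    words = [
        (w["word"].strip(), w["probability"])
        for segment in result["segments"]
        for w in segment.get("words", [])
    ]
    return result["text"], words


def flag_uncertain_words(words, threshold=0.5):
    # Low-probability words are candidates for pronunciation feedback;
    # the transcript plus these flags can be handed to a second model for scoring.
    return [word for word, prob in words if prob < threshold]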

Japanese

Start with Japanese. How would you, as an agent, teach someone Japanese if they knew nothing about it? Could you lead them through an immersive tale that requires them to complete language tasks in order to progress?
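One way the immersive-tale idea could be framed as a system prompt; the wording, the OpenAI client, & the model name are placeholder assumptions, not a tested setup.

from openai import OpenAI

SENSEI_PROMPT = (
    "You are a Japanese tutor leading a complete beginner through an immersive story. "
    "Narrate mostly in English, but at each step the learner must use a small Japanese "
    "phrase (a greeting, counting, ordering food) to move the story forward. Introduce "
    "one new phrase at a time, repeat earlier phrases naturally, and gently correct "
    "mistakes before continuing."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {"role": "system", "content": SENSEI_PROMPT},
        {"role": "user", "content": "I'm ready to start."},
    ],
)
print(reply.choices[0].message.content)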

Audio

What size are audio samples, i.e. can they be sent quickly & cheaply to & from APIs?
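Back-of-the-envelope arithmetic for the raw PCM settings used in the recording code below; compressed formats (e.g. Opus) would be much smaller.

channels, bytes_per_sample, rate = 2, 2, 44_100        # matches the pyaudio settings below
bytes_per_second = channels * bytes_per_sample * rate  # 176,400 bytes ≈ 172 KiB/s
three_second_clip = 3 * bytes_per_second               # 529,200 bytes ≈ 517 KiB uncompressed
print(bytes_per_second, three_second_clip)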

Are audio models small enough to run on user devices? For reference, Whisper checkpoints range from ~39M parameters (tiny) to ~1.5B (large), so the smaller ones could plausibly run on-device.

from io import BytesIO
import wave  # needed below to package raw frames into a WAV container

import pyaudio
 
 
def record_frames(
    chunk=1024,  # Record in chunks of 1024 samples
    sample_format=pyaudio.paInt16,  # 16 bits per sample
    channels=2,
    rate=44100,  # Record at 44100 samples per second
    seconds=10,
):
    client = pyaudio.PyAudio()
 
    print("START recording...")
    audio_in_stream = client.open(
        format=sample_format,
        channels=channels,
        rate=rate,
        frames_per_buffer=chunk,
        input=True,
    )
 
    for i in range(0, int(rate / chunk * seconds)):
        yield audio_in_stream.read(chunk)
 
    audio_in_stream.stop_stream()
    audio_in_stream.close()
    client.terminate()
    print("STOP recording...")
 
 
def record_stream(
    stream,
    chunk=1024,  # Record in chunks of 1024 samples
    sample_format=pyaudio.paInt16,  # 16 bits per sample
    channels=2,
    rate=44100,  # Record at 44100 samples per second
    seconds=10,
):
    client = pyaudio.PyAudio()
 
    print("START recording...")
    audio_in_stream = client.open(
        format=sample_format,
        channels=channels,
        rate=rate,
        frames_per_buffer=chunk,
        input=True,
    )
 
    for i in range(0, int(rate / chunk * seconds)):
        stream.write(audio_in_stream.read(chunk))
 
    audio_in_stream.stop_stream()
    audio_in_stream.close()
    client.terminate()
    print("STOP recording...")
 
    stream.seek(0)
    return stream
 
 
def audio_frames_out(
    frames,
    sample_format=pyaudio.paInt16,  # 16 bits per sample
    channels=2,
    rate=44100,  # playback sample rate
):
 
    p = pyaudio.PyAudio()
    print("START writing to audio out..")
    audio_out_stream = p.open(
        format=sample_format,
        channels=channels,
        rate=rate,
        output=True,
    )
 
    for frame in frames:
        audio_out_stream.write(frame)
    print("STOP writing to audio out...")
 
    audio_out_stream.close()
    p.terminate()
 
 
def audio_stream_out(
    audio_stream,
    sample_format=pyaudio.paInt16,  # 16 bits per sample
    channels=2,
    rate=44100,  # playback sample rate
    chunk=1024,
):
 
    p = pyaudio.PyAudio()
    print("START writing to audio out..")
    audio_out_stream = p.open(
        format=sample_format,
        channels=channels,
        rate=rate,
        output=True,
    )
 
    while len(data := audio_stream.read(chunk)):
        audio_out_stream.write(data)
    print("STOP writing to audio out...")
    audio_stream.seek(0)
 
    audio_out_stream.close()
    p.terminate()
 
 
from gradio_client import Client
 
 
def speech_to_speech(frames, origin="English", target="Japanese"):
    client = Client(
        "https://josh-ramer-seamless-m4t-v2-large.hf.space/--replicas/cvmts/"
    )
    result = client.predict(frames, origin, target, api_name="/s2st")
    print(result)
    return result
 
 
def speech_to_text(frames, origin="English", target="Japanese"):
    pass
 
 
def text_to_speech(text, origin="English", target="Japanese"):
    pass
 
 
def talk_to_sensei(predictor, audio_stream):
    prediction = predictor.predict(audio_stream)
    return prediction
 
 
audio_stream = record_stream(BytesIO(), seconds=3)
# audio_stream_out(audio_stream)
# `predictor` is the SageMaker predictor created in the deployment section below.
talk_to_sensei(predictor, audio_stream.getvalue())
audio_stream.close()
 
frames = list(record_frames(seconds=3))
# audio_frames_out(frames)
wav_stream = BytesIO()
with wave.open(wav_stream, "wb") as ws:
    ws.setnchannels(2)
    ws.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    ws.setframerate(44100)
    ws.writeframes(b"".join(frames))
 
wav_stream.seek(0)
talk_to_sensei(
    predictor,
    {"inputs": wav_stream.read(), "parameters": {"tgt_lang": "jpn"}},
)

Deploy models to AWS using the SageMaker SDK.

import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, HuggingFacePredictor
from sagemaker.serializers import DataSerializer
 
role = "SageMaker-Ai-Engineer"
 
# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "facebook/seamless-m4t-v2-large",
    "HF_TASK": "audio-to-audio",
}
 
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=hub,
    role=role,
)
 
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type="ml.g4dn.xlarge",  # ec2 instance type
    serializer=DataSerializer(content_type="audio/x-audio"),
)
 
# Or attach to an endpoint that is already deployed instead of redeploying.
predictor = HuggingFacePredictor(
    endpoint_name="huggingface-pytorch-inference-2024-04-20-04-30-12-337",
    serializer=DataSerializer(content_type="audio/x-audio"),
    # deserializer=DataDeserializer(content_type=)
)
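
Worth remembering once the experiment is over, since a g4dn.xlarge endpoint bills while idle; delete_endpoint is part of the SageMaker Predictor API.

predictor.delete_endpoint()  # tear the endpoint down so it stops billing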