Smart Speaker Basic Features

Smart speakers are intelligent devices with a variety of functions. Here are some of their common features:

Voice Interaction Capability


Smart speakers can be activated through specific wake-up phrases (such as ‘Xiao Ai Tongxue’, ‘Xiao Du Xiao Du’, ‘Tian Mao Jingling’, etc.). Once the user utters the wake-up phrase, the speaker begins to receive voice commands. This feature utilizes voice recognition technology to accurately capture the wake-up phrase amidst complex environmental sounds. For example, in a living room environment with television noise and conversation, the user only needs to clearly state the wake-up phrase for the smart speaker to respond.


Voice Command Recognition and Comprehension


A smart speaker can understand a wide range of voice commands, from simple ones like ‘play music’, ‘check the weather’, and ‘set an alarm’ to more complex ones such as ‘play Jay Chou’s “Qi Li Xiang” at 50% volume’. This relies on the speaker’s built-in Natural Language Processing (NLP) technology, which performs syntactic analysis and semantic understanding of the command. For instance, when a user says ‘I want to listen to a certain song by a certain singer’, the smart speaker extracts the singer’s name and the song title from the command, then searches its music library and plays the result.
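The extraction step described above can be illustrated with a minimal Python sketch. It assumes the speech has already been converted to text and handles only one command family with a regular expression; real products use trained NLP models instead. The pattern, the `parse_play_command` function, and the `play_music` intent label are all hypothetical names chosen for this example.

```python
import re
from typing import Optional, TypedDict


class PlayCommand(TypedDict):
    intent: str
    artist: Optional[str]
    song: Optional[str]
    volume: Optional[int]


# Hypothetical pattern for one command family:
#   "play <artist>'s <song> [at N% volume]"
PLAY_PATTERN = re.compile(
    r"^play\s+(?P<artist>.+?)'s\s+(?P<song>.+?)"
    r"(?:\s+at\s+(?P<volume>\d{1,3})%\s+volume)?$",
    re.IGNORECASE,
)


def parse_play_command(text: str) -> Optional[PlayCommand]:
    """Extract artist, song, and optional volume from recognized text."""
    # Normalize curly apostrophes so the pattern only has to handle "'".
    normalized = text.strip().replace("\u2019", "'")
    match = PLAY_PATTERN.match(normalized)
    if match is None:
        return None  # Not a command this sketch understands.
    volume = match.group("volume")
    return {
        "intent": "play_music",
        "artist": match.group("artist"),
        "song": match.group("song").strip("'\""),
        # Clamp the requested volume to the valid 0-100 range.
        "volume": min(int(volume), 100) if volume is not None else None,
    }
```

Calling `parse_play_command("Play Jay Chou's Qi Li Xiang at 50% volume")` yields the artist, the song title, and a volume of 50, which the speaker could then pass to its music service and volume control.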


Voice Response and Feedback


Smart speakers answer user queries and confirm operations by voice. For example, when a user asks ‘What’s the weather like today?’, the speaker will reply ‘Today is sunny, with temperatures ranging from 20 to 25 degrees Celsius’. Voice feedback lets users obtain information when their hands are busy or when it is inconvenient to look at a screen (if the speaker has one).


Multimedia Functions


Music Playback


Smart speakers can connect to various music streaming services (such as QQ Music, Kugou Music, Spotify, etc.), allowing users to play their favorite music through voice commands. They support multiple music playback modes, such as single song loop, shuffle play, and playlist play. For example, a user can say ‘Play my favorite playlist’, and the smart speaker will play according to the playlist previously set up in the music app. At the same time, it can also recommend music based on the user’s listening history and preferences.


Radio Broadcasting


A smart speaker can play various radio programs, including local radio stations (such as traffic radio, news radio, etc.) and online radio. Users can switch stations or select specific programs through voice commands. For example, when a user says ‘Play local traffic radio station’, the smart speaker will search for and play the corresponding station, allowing users to stay informed about traffic conditions and more.


Audiobook Playback


For book lovers who prefer listening, the smart speaker is a great assistant. It can play a variety of audiobooks, such as novels, biographies, and children’s stories. Users can search for the desired audiobook by voice, specifying the book title, author, or category. For instance, when a user says ‘Play the audio version of “Ordinary World”’, the smart speaker will search its audiobook library and start playing, making it convenient to listen to books while doing housework or resting.



Daily Life Services


Weather Queries


The smart speaker can provide real-time weather information and forecasts for the coming days. It can offer weather data for the user’s location (obtained through phone positioning or a location set by the user). For example, when a user wakes up in the morning and asks ‘What’s the weather like today?’, the smart speaker will quickly look up and report the day’s weather conditions, helping the user decide on clothing and travel plans.


Alarms and Reminders


Alarms and reminders can be set via voice commands, for example ‘Set an alarm for 7 a.m. tomorrow’ or ‘Remind me of a meeting at 3 p.m.’ The smart speaker will issue the reminder at the specified time, using a voice announcement, a ringtone, or a combination of both. For instance, when the meeting reminder time arrives, the smart speaker will say ‘You have a meeting at 3 p.m., please arrange your time accordingly’, ensuring users do not miss important events.
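One way to keep track of pending alarms and reminders is a priority queue ordered by due time. The Python sketch below is only an illustration under that assumption; the `Reminder` and `ReminderQueue` names are hypothetical, and a real speaker would also persist reminders and handle time zones and repeats.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(order=True)
class Reminder:
    due: datetime
    # The message does not take part in ordering; only the due time does.
    message: str = field(compare=False)


class ReminderQueue:
    """Min-heap of reminders, earliest due time first."""

    def __init__(self) -> None:
        self._heap: List[Reminder] = []

    def add(self, due: datetime, message: str) -> None:
        heapq.heappush(self._heap, Reminder(due, message))

    def pop_due(self, now: datetime) -> List[str]:
        """Remove and return the messages of all reminders due at or before now."""
        fired: List[str] = []
        while self._heap and self._heap[0].due <= now:
            fired.append(heapq.heappop(self._heap).message)
        return fired
```

The speaker’s main loop would call `pop_due(datetime.now())` periodically and announce each returned message by voice or ringtone.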


Smart Home Control


Smart home control is a significant feature of the smart speaker. It can connect to and interact with various smart home devices (such as smart lights, smart sockets, smart curtains, smart air conditioners, etc.). Users can control these devices by voice, for example turning them on and off, adjusting brightness (for smart lights), and regulating temperature (for smart air conditioners). When a user comes home and says ‘Turn on the lights’, the connected smart lights will turn on; saying ‘Set the air conditioner to 26 degrees’ will cause the smart air conditioner to adjust its temperature accordingly, providing a convenient home control experience.
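The two commands in the example above can be mapped to device state changes with a small dispatcher. This Python sketch is a simplified assumption of how that mapping might look: the `Light`, `AirConditioner`, and `Home` classes are hypothetical in-memory stand-ins, whereas real devices would be reached over a smart home protocol or a vendor cloud API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Light:
    on: bool = False


@dataclass
class AirConditioner:
    on: bool = False
    target_celsius: int = 24


@dataclass
class Home:
    light: Light = field(default_factory=Light)
    air_conditioner: AirConditioner = field(default_factory=AirConditioner)

    def handle(self, command: str) -> bool:
        """Apply a recognized text command; return False if it is not understood."""
        text = command.strip().lower()

        switch = re.fullmatch(r"turn (on|off) the lights?", text)
        if switch:
            self.light.on = switch.group(1) == "on"
            return True

        setpoint = re.fullmatch(r"set the air conditioner to (\d{2}) degrees", text)
        if setpoint:
            # Setting a temperature also switches the unit on in this sketch.
            self.air_conditioner.on = True
            self.air_conditioner.target_celsius = int(setpoint.group(1))
            return True

        return False
```

Returning `False` for unknown commands lets the caller fall back to a spoken ‘Sorry, I didn’t understand that’ style response.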


Knowledge Queries


The smart speaker can answer various encyclopedia-style questions, such as ‘What is the diameter of the Earth?’ or ‘What is photosynthesis?’. It draws on pre-integrated knowledge bases (such as Baidu Baike, Wikipedia, etc.) or online knowledge resources to answer users. For example, when a user asks ‘Which is the highest mountain in the world?’, the smart speaker will respond ‘Mount Everest, with an elevation of approximately 8,848 meters (snow height)’, helping users gain knowledge.


Translation Services


The smart speaker provides simple language translation. Users can say ‘Translate “我爱你” into English’, and the smart speaker will reply ‘In English, “我爱你” is “I love you”.’ It supports translation between multiple languages, which is useful in scenarios such as learning and traveling.


News and Information


The smart speaker can provide users with news and information, filtering and broadcasting it according to the user’s interests (such as sports news, technology news, entertainment news, etc.). Users can request specific content through voice commands such as ‘Play sports news’. For example, the smart speaker will then broadcast sports news such as the result and score of today’s football match, allowing users to keep up with the latest news.


Decomposition of smart speaker


Structurally, a smart speaker can be broken down into the following parts:


Shell: Protects internal components and is usually made of materials such as plastic, metal or wood, with certain aesthetics and durability.


Speaker: Converts electrical signals into sound and is one of the core components of a smart speaker. The quality and performance of the speaker directly affect the sound quality.


Microphone: Used to receive users’ voice commands. Usually, multiple microphones are used to form an array to improve the accuracy of voice recognition.


Motherboard: Integrates core components such as the control circuit, audio processing chip and wireless communication module of the smart speaker and is the control center of the smart speaker.


Power supply: Provides power support for the smart speaker, usually by using an internal battery or an external power adapter.


Other components: May also include other auxiliary components such as indicator lights, buttons and sensors to realize various functions of the smart speaker.


Smart speakers of different brands and models may have differences in structure, but the above are the basic structural components of a smart speaker.


Decomposition of smart speaker from the perspective of embedded hardware:


Microphone array


The microphone array is the key component through which a smart speaker receives users’ voice commands. It typically consists of 2 to 7 microphones arranged in a fixed geometry, such as a line or a circle. Its main function is to capture sound arriving from different directions and distances. Through beamforming technology, the microphone array can enhance sound coming from the user’s direction and suppress noise from other directions.


For example, when a user speaks in a noisy environment, beamforming combines the sound signals received by the individual microphones so that the smart speaker hears the user’s voice more clearly. It is like giving the smart speaker an ‘auditory spotlight’ that can be pointed at the source of the user’s voice.


Beamforming can be analog or digital. Analog beamforming processes sound at the analog signal stage, while digital beamforming digitizes the sound signals first and then processes them. Digital beamforming offers higher flexibility and precision and can better adapt to complex acoustic environments.
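A minimal Python sketch of the simplest digital beamformer, delay-and-sum, is given here for illustration. It assumes the per-microphone arrival delays toward the desired direction are already known as whole numbers of samples; the `delay_and_sum` function name and this integer-delay simplification are assumptions of the example, and production systems use fractional delays and adaptive filtering.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align each microphone channel by its arrival delay, then average.

    delays[i] is how many samples later the target sound reaches
    microphone i compared with the reference microphone.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only output samples for which every shifted channel has data.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    if usable <= 0:
        return []
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(usable)
    ]
```

When the delays match the true direction of the speaker, the voice adds up coherently across channels, while sound from other directions is misaligned and partially cancels. That is the ‘auditory spotlight’ effect.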


Speaker System: The speaker is the component of a smart speaker that converts electrical signals into sound signals. It is mainly composed of parts such as a diaphragm, voice coil, and permanent magnet. When an audio current passes through the voice coil, the voice coil vibrates under the action of Lorentz force in the magnetic field of the permanent magnet, and then drives the diaphragm to vibrate and produce sound.


The speaker system of a smart speaker usually includes one or more speaker units so that different frequency ranges can be reproduced well. For example, some smart speakers adopt a two-way design, with a woofer dedicated to low frequencies and a tweeter responsible for mid and high frequencies, which provides richer, better-quality sound. By sound generation principle, speakers can be divided into electro-dynamic speakers (the most common), electromagnetic speakers, piezoelectric speakers, etc.
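To illustrate how a two-way design divides the signal between woofer and tweeter, the Python sketch below splits audio samples with a first-order low-pass filter and its complement. This is an assumption made for illustration: real crossovers are usually higher-order and may be implemented in analog circuitry or on the DSP, and the `two_way_crossover` name and the 2000 Hz crossover frequency used in the example are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def two_way_crossover(
    samples: Sequence[float], crossover_hz: float, sample_rate_hz: float
) -> Tuple[List[float], List[float]]:
    """Split samples into (woofer_band, tweeter_band) around crossover_hz."""
    if crossover_hz <= 0 or sample_rate_hz <= 0:
        raise ValueError("frequencies must be positive")
    # Smoothing factor of a first-order RC low-pass filter.
    rc = 1.0 / (2.0 * math.pi * crossover_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)

    low: List[float] = []
    high: List[float] = []
    state = 0.0
    for x in samples:
        state += alpha * (x - state)  # low-pass output follows slow changes
        low.append(state)             # sent to the woofer
        high.append(x - state)        # remainder is sent to the tweeter
    return low, high
```

Because the high band is defined as the input minus the low band, the two outputs always sum back to the original signal, for example with `low, high = two_way_crossover(samples, 2000.0, 48000.0)`.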


The advantages of electro-dynamic speakers are good sound quality and a wide power range, which can meet the requirements of smart speakers for different volumes and sound qualities.



Main Control Chip: The main control chip is the ‘brain’ of a smart speaker. It is responsible for coordinating and controlling the work of various hardware components. It integrates functional units such as a central processing unit (CPU) and a digital signal processor (DSP). The CPU is mainly used to run the operating system and various applications of the smart speaker, such as software for voice recognition and natural language processing.


The DSP is used to process audio signals and voice signals. For example, it preprocesses the voice signal received by the microphone and optimizes the audio signal to be played. The main control chip is also responsible for managing other hardware resources such as the storage unit and communication interface of the smart speaker. For example, it can control the read and write operations of data in memory and exchange data with the external network through the Wi-Fi interface.


Common chip examples: Smart speaker chips like those from MediaTek have high-performance CPU and DSP cores and can support multiple voice interaction and audio playback functions. These chips also support multiple communication protocols such as Wi-Fi and Bluetooth, making it convenient for smart speakers to connect to the network and other devices.



Storage Unit: The storage unit is used to store data such as the operating system, applications, voice models, and audio files of the smart speaker. It includes random access memory (RAM) and flash memory (Flash Memory). RAM is mainly used to store temporary data during the operation of the smart speaker, such as the running program code and intermediate results of voice recognition. When the smart speaker is powered off, the data in RAM will be lost.


Flash memory is used for long-term storage of system software, user configuration information, audio resources, and other data. For instance, the voice recognition models and operating system update files of smart speakers are stored in flash memory, ensuring that these data are not lost even when the speaker is powered off. The RAM capacity of smart speakers typically ranges from several hundred MB to several GB, depending on the complexity of the smart speaker’s functions.


Flash memory capacity usually varies from several GB to tens of GB, accommodating a larger number of audio files and other data. The read and write speeds of the storage unit can also affect the performance of smart speakers; for example, fast flash memory read and write speeds can enable smart speakers to launch applications and load audio files more quickly.



Communication modules mainly consist of Wi-Fi and Bluetooth modules. Wi-Fi modules allow smart speakers to connect to home wireless networks, thereby accessing various services and resources on the internet. They communicate wirelessly with wireless routers, adhering to the IEEE 802.11 standard protocol (such as 802.11n, 802.11ac, etc.) for data transmission. Bluetooth modules are primarily used for short-range device connections.


For example, smart speakers can pair with smartphones via Bluetooth, allowing audio content from the phone (such as music, voice calls, etc.) to be transmitted to the smart speaker for playback. Bluetooth modules communicate according to Bluetooth technology standards (such as Bluetooth 4.0, Bluetooth 5.0, etc.).


Wi-Fi connections are suitable for smart speakers to obtain cloud-based voice recognition services, content resources (such as music, audiobooks, etc.), and to perform software updates. Bluetooth connections are more convenient for users to quickly transmit audio from personal devices to the smart speaker for playback without a Wi-Fi network or when they prefer not to use Wi-Fi, and Bluetooth connections can also be used for some simple device control scenarios.



Power management modules are responsible for providing stable power supply to the various hardware components of smart speakers. They mainly include power adapter interfaces, battery charging circuits (if there is an internal battery), voltage regulation circuits, etc. When a smart speaker is connected to the mains via a power adapter, the power management module converts the mains power into a direct current voltage suitable for the internal components of the smart speaker.


If the smart speaker has an internal battery, the power management module also manages battery charging, such as monitoring battery levels, charging status, and preventing overcharging and over-discharging. Additionally, voltage regulation circuits ensure that the voltage supplied to each hardware component remains stable despite changes in battery levels or fluctuations in external power supplies.
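The overcharge and over-discharge protection described above boils down to comparing the battery voltage against limits. The Python sketch below shows that decision logic only; in practice it is handled by a dedicated charger or protection IC. The 4.2 V and 3.0 V defaults are typical figures for a single lithium-ion cell and are used here as illustrative assumptions, and the `BatteryAction` and `decide_battery_action` names are hypothetical.

```python
from enum import Enum


class BatteryAction(Enum):
    CHARGE = "charge"
    STOP_CHARGING = "stop_charging"
    RUN_ON_BATTERY = "run_on_battery"
    CUT_OFF_LOAD = "cut_off_load"


def decide_battery_action(
    cell_voltage: float,
    adapter_connected: bool,
    full_voltage: float = 4.2,    # assumed full-charge limit per cell
    cutoff_voltage: float = 3.0,  # assumed over-discharge limit per cell
) -> BatteryAction:
    """Pick the protective action for the current battery state."""
    if adapter_connected:
        # Prevent overcharging once the cell reaches its full voltage.
        if cell_voltage >= full_voltage:
            return BatteryAction.STOP_CHARGING
        return BatteryAction.CHARGE
    # On battery power: prevent over-discharge below the cutoff voltage.
    if cell_voltage <= cutoff_voltage:
        return BatteryAction.CUT_OFF_LOAD
    return BatteryAction.RUN_ON_BATTERY
```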



Battery Types and Endurance Features (if applicable): Smart speakers typically use lithium batteries, which have the advantages of high energy density and low self-discharge rates. The endurance time varies depending on the power consumption and battery capacity of the smart speaker, generally ranging from several hours to dozens of hours. For instance, some compact smart speakers can achieve a battery life of around 6-8 hours when playing music at medium volume.
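The relationship between battery capacity, power consumption, and endurance can be estimated with simple arithmetic: energy in watt-hours divided by average power draw. The Python sketch below shows the calculation; the 2600 mAh capacity, 3.7 V nominal voltage, 1.3 W average draw, and 85% conversion efficiency are hypothetical figures chosen for illustration, not measurements of any particular product.

```python
def estimate_runtime_hours(
    capacity_mah: float,
    nominal_voltage: float,
    average_power_w: float,
    efficiency: float = 0.85,  # assumed losses in voltage conversion
) -> float:
    """Estimate playback time from battery capacity and average power draw."""
    if average_power_w <= 0:
        raise ValueError("average power must be positive")
    energy_wh = capacity_mah / 1000.0 * nominal_voltage
    return energy_wh * efficiency / average_power_w


# Hypothetical compact speaker playing music at medium volume.
hours = estimate_runtime_hours(2600, 3.7, 1.3)
print(f"Estimated runtime: {hours:.1f} h")  # about 6.3 h
```

Doubling the volume typically raises the average power draw considerably, which is why the same battery lasts noticeably less at high volume.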


Buttons and Indicator Lights: Buttons are one of the interfaces for users to physically interact with smart speakers. Common buttons include power, volume adjustment, and microphone mute. The power button is used to turn on and off the smart speaker, the volume adjustment button allows users to manually adjust the volume of the audio playback, and the microphone mute button temporarily blocks the microphone when users do not want the smart speaker to receive voice commands.


Indicator lights are mainly used to display the working status of the smart speaker. For example, when the smart speaker is booting up, the indicator light may flash; when it successfully connects to the Wi-Fi network, the indicator light will display a specific color or flash pattern; and when the microphone is muted, there will also be corresponding indicator light prompts. These indicator lights convey the status information of the smart speaker to users through different colors, flash frequencies, and other means.


The design of buttons and indicator lights usually takes into account user convenience and aesthetics. They are generally located on the top, side, or bottom of the smart speaker and have clear labels for easy operation by users. Some smart speakers also use touch-sensitive buttons, which have a more streamlined appearance and can achieve multiple functions through different touch methods, such as tapping and long pressing.
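Distinguishing a tap from a long press usually comes down to measuring how long the touch lasts. The Python sketch below illustrates the idea; the `classify_touch` function and the 30 ms and 600 ms thresholds are assumptions for this example, and actual firmware values differ between products.

```python
def classify_touch(
    press_ms: int,
    release_ms: int,
    debounce_ms: int = 30,      # assumed: shorter contacts are treated as noise
    long_press_ms: int = 600,   # assumed long-press threshold
) -> str:
    """Classify one touch by its duration: 'ignored', 'tap', or 'long_press'."""
    duration = release_ms - press_ms
    if duration < 0:
        raise ValueError("release time must not precede press time")
    if duration < debounce_ms:
        return "ignored"
    if duration >= long_press_ms:
        return "long_press"
    return "tap"
```

A speaker could then map `tap` to play/pause and `long_press` to, say, starting Bluetooth pairing, so a single touch area serves several functions.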



Smart Speaker Implementation Solutions: A smart speaker with simple voice recognition and control functions can be implemented on an STM32 microcontroller. For complex voice interaction, additional services such as Baidu voice recognition, large language models, and speech synthesis are required.


A smart speaker based on ESP32 can use open-source voice recognition libraries, such as ESP-Skainet, for wake word detection and simple voice recognition. This library, based on deep learning technology, can achieve a certain level of voice recognition locally on the ESP32, reducing dependence on cloud services and improving response speed and privacy. As with STM32, complex voice interaction still requires services such as Baidu voice recognition, large language models, and speech synthesis. ESP-ADF supports integration with various voice services, such as Baidu DuerOS and Amazon Alexa.

