AIRE Club #44: Multimodal AI Combines Multiple Data Types
At the AIRE Club held at sTARTUp Day, Martin Rebane (AI Lead) and Ida Maria Orula (AI Developer) from Sparkup at Tartu Science Park explained how multimodal artificial intelligence works.
To begin, the speakers outlined the key difference between traditional and multimodal artificial intelligence.
Traditional artificial intelligence works with one modality, meaning a single input type and a single processing pipeline.
Traditional AI:
- Processes one data type, whether image, text, or audio
- Clear focus, such as detection or classification
- Attention to detail, making it excellent for precise, focused analysis
- Easier to manage, with simpler training and evaluation
- Fast, making it well suited for real-time use
Traditional AI is ideal when quick, accurate responses are needed and the input is clearly defined.
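To make the single-modality pattern concrete, here is a minimal sketch (not from the talk) of a traditional image-only classifier in PyTorch; the architecture, class names, and class count are illustrative assumptions.

```python
# Minimal sketch of a traditional, single-modality AI model:
# one input type (an image), one clear task (classification).
# Illustrative only; architecture and class count are assumptions.
import torch
import torch.nn as nn

class ImageDefectClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Small convolutional feature extractor for a single modality (images).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Single, focused output: class scores (e.g. "ok" vs "defect").
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.features(image).flatten(1)
        return self.classifier(x)

# One data type in, one prediction out.
model = ImageDefectClassifier()
logits = model(torch.randn(1, 3, 64, 64))  # batch of one RGB image
print(logits.shape)  # torch.Size([1, 2])
```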
Multimodal Artificial Intelligence
Multimodal AI analyzes multiple inputs at once, such as image, text, and audio together.
The system processes different data types in parallel, then integrates them through a so-called fusion layer. The result is not simply classification or detection, but rather reasoning, decision-making, and comprehensive understanding.
Multimodal AI strengths:
- Processes image, text, and audio simultaneously
- Fusion layer combines knowledge from different modalities
- Generates answers, decisions, predictions, and reasoning
- Understands situations from multiple perspectives, providing broader context
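As a rough illustration of the fusion-layer idea described above, the sketch below (not from the talk) encodes image, text, and audio inputs separately and then concatenates their embeddings in a fusion layer before a shared decision head. All module names, input sizes, and embedding dimensions are assumptions.

```python
# Rough sketch of a multimodal model: separate encoders per modality,
# a fusion layer that combines their embeddings, and a shared head
# that produces the final decision. Sizes and names are assumptions.
import torch
import torch.nn as nn

class MultimodalModel(nn.Module):
    def __init__(self, embed_dim: int = 64, num_classes: int = 2):
        super().__init__()
        # One encoder per modality, each mapping its input to a shared embedding size.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, embed_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(300, embed_dim), nn.ReLU())   # e.g. a text embedding vector
        self.audio_encoder = nn.Sequential(nn.Linear(128, embed_dim), nn.ReLU())  # e.g. audio features
        # Fusion layer: concatenate the three embeddings and mix them.
        self.fusion = nn.Sequential(nn.Linear(3 * embed_dim, embed_dim), nn.ReLU())
        # Shared head for the final answer or decision.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, image, text, audio):
        fused = self.fusion(torch.cat([
            self.image_encoder(image),
            self.text_encoder(text),
            self.audio_encoder(audio),
        ], dim=-1))
        return self.head(fused)

model = MultimodalModel()
logits = model(
    torch.randn(1, 3, 32, 32),  # image
    torch.randn(1, 300),        # text embedding
    torch.randn(1, 128),        # audio features
)
print(logits.shape)  # torch.Size([1, 2])
```

In practice the per-modality encoders would be far larger (vision, language, and audio models), but the structure is the same: encode each modality, fuse, then reason over the combined representation.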
Real-World Applications in Manufacturing
The session highlighted that multimodal AI is most beneficial in manufacturing and industrial environments:
- For quality control, visual inspection can be combined with other data
- Pattern analysis to find anomalies
- Decision-making based on different data sources
Multimodal AI could, for example, combine thermal cameras with standard cameras: assessing a product's visual appearance through the standard camera while evaluating its temperature distribution through the thermal one.
Such integration helps catch defects that single-modality systems would miss. For example, a product may look visually fine but show temperature fluctuations that indicate a potential quality issue.
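The toy sketch below (an illustration, not an AI model from the talk) shows why combining the two signals matters: a product that passes the visual check alone can still be flagged by an uneven temperature profile. The field names and thresholds are assumptions.

```python
# Hypothetical sketch of combining two modalities for quality control:
# a visual inspection score from a standard camera and a temperature
# profile from a thermal camera. Thresholds and field names are assumptions.
from dataclasses import dataclass

@dataclass
class Inspection:
    visual_defect_score: float   # 0.0 (looks fine) .. 1.0 (clearly defective)
    temperatures_c: list[float]  # per-region temperatures from the thermal camera

def assess(item: Inspection,
           visual_threshold: float = 0.5,
           temp_spread_threshold: float = 5.0) -> str:
    temp_spread = max(item.temperatures_c) - min(item.temperatures_c)
    if item.visual_defect_score >= visual_threshold:
        return "reject: visible defect"
    if temp_spread >= temp_spread_threshold:
        # Looks fine visually, but uneven temperature suggests a hidden issue.
        return "flag for review: abnormal temperature distribution"
    return "pass"

# A product that looks fine but has a large temperature spread gets flagged.
print(assess(Inspection(visual_defect_score=0.1,
                        temperatures_c=[41.0, 42.5, 49.5, 40.8])))
```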
Capabilities Still Developing
An audience member asked whether multimodal AI can read body language. The answer: it is still in development. The technology attempts to interpret human gestures, but this capability is still maturing.