IBM

Multimodal AI - Speech - MSc and PhD Summer Internship 2026 - Research Lab

Job Description

Introduction

At IBM, work is more than a job; it's a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you've never thought possible. Are you ready to lead in this new era of technology and solve some of the world's most challenging problems? If so, let's talk.

Your Role and Responsibilities

If you're a student excited about the intersection of large language models with speech and audio analysis, and you want to contribute to research with both academic and industrial impact, this internship is for you.

Our team at IBM Research develops models, algorithms, and technologies that drive IBM products and advance the broader AI community. We publish papers, release open-source models, and file patents based on our work.

As an intern, you'll tackle real-world problems using cutting-edge deep learning methods to advance the state of the art in speech understanding and generation. You'll collaborate closely with researchers, leverage large-scale GPU compute, and focus on one of the following areas:

  • Speech and Audio: Advancing the recognition, analysis, and generation of natural speech and audio for more expressive, human-like interaction. Research spans generative and conversational AI, speech synthesis, and multimodal representation learning.
  • Multimodal and Foundation Models: Exploring large-scale, unified models that jointly learn from text and audio. Topics include self-supervised learning, realistic data synthesis, expressive speech generation, and tokenization strategies.

The goal of the internship is to produce a high-quality research outcome and publish it at a leading AI venue (e.g., ICLR, Interspeech, NeurIPS, ACL, ICML).

This is a 3-month, full-time summer internship at our Haifa or Tel Aviv research sites (flexible).

Sample of 2025 Publications by the Group

  • Granite Speech, ASRU 2025
  • ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models, COLM 2025
  • Spoken question answering for visual queries, Interspeech 2025
  • Continuous Speech Synthesis using per-token Latent Diffusion, ASRU 2025
  • A Non-autoregressive Model for Joint STT and TTS, ICASSP 2025

Required Technical and Professional Expertise

  • M.Sc. or Ph.D. student with knowledge of machine learning and multimodal large language models.
  • Strong background in modern methods and deep knowledge of the recent literature; prior CV/ML/DL/LLM publications are an advantage.
  • Strong Python coding skills.
  • Experience with Transformers and LLMs is an advantage.
  • A team player with great social skills and a willingness to collaborate.

Please attach your grade sheet to your application.

Preferred Technical and Professional Experience

  • One or more publications at top-tier peer-reviewed conferences or journals.