Word learning is basic to foreign language acquisition, however time consuming and not always successful. Empirical studies have shown that traditional (visual) word learning can be enhanced by gestures. The gesture benefit has been attributed to depth of encoding. Gestures can lead to depth of encoding because they trigger semantic processing and sensorimotor enrichment of the novel word. However, the neural underpinning of depth of encoding is still unclear. Here, we combined an fMRI and a behavioral study to investigate word encoding online. In the scanner, participants encoded 30 novel words of an artificial language created for experimental purposes and their translation into the subjects native language. Participants encoded the words three times: visually, audiovisually, and by additionally observing semantically related gestures performed by an actress. Hemodynamic activity during word encoding revealed the recruitment of cortical areas involved in stimulus processing. In this study, depth of encoding can be spelt out in terms of sensorimotor brain networks that grow larger the more sensory modalities are linked to the novel word. Word retention outside the scanner documented a positive effect of gestures in a free recall test in the short term.