Web data, news, and textbooks offer informative but unstructured multimodal text. Translating such text into a semantic representation that is amenable to further reasoning is a fundamental problem in modern AI. In this project we design systems that understand and use multimodal text through four interconnected components: semantic interpretation, multimodal alignment, knowledge acquisition, and reasoning.