Video platforms provide access to a vast and rapidly growing number of music videos. Faced with such a large data volume, textual search reaches its limits when music videos of a certain type are in demand but the title is not specific enough. Most music videos are official music videos or professionally filmed concert videos; in the last decade, however, new types such as user-recorded concert videos and lyric videos have found their way onto video platforms. Video tags can aid the search for videos, but they require the type to be assigned manually to every video, which is a tedious and expensive task. By utilizing multimedia content analysis methods, music videos can be categorized automatically and the predicted types assigned to the videos.

The fundamental modalities of a music video are its audio and video streams, but various features can be extracted from them, and contextual information is often available in the form of metadata. Despite the richness of information provided by music videos, most previous approaches focused on a single modality only. This thesis aims to improve on the prediction accuracies of the individual modalities by combining them in the developed multimodal music video classification system. For the underlying supervised machine learning task, a labeled music video dataset was carefully assembled. The multimodal nature of music videos allowed audio, applause, video motion, video structure, image, text, and metadata features to be extracted. To reduce the processing requirements, the calculated features were evaluated and only those most suitable for music video classification were selected. For every modality, several classification algorithms and their parameters were evaluated using grid search, and the best-performing algorithm was subsequently used to classify the music videos in the test dataset.

Based on the results of the individual modalities, the performance of the combined modalities could be evaluated. To combine the features and predictions of all modalities, several fusion techniques were applied, and the insights gained from their evaluation led to a novel fusion setup. The results show that classification with the proposed fusion setup is superior to using individual modalities as well as to any other combination approach: the best single-modality result was outperformed by 4.44%, yielding an almost perfect music video classification accuracy of 98.33%.
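To illustrate the feature selection step mentioned above, the following is a minimal sketch assuming scikit-learn; the concrete selection criterion used in the thesis is not stated in this abstract, so a univariate ANOVA F-test filter stands in for it, and the matrices `X_train` and labels `y_train` are hypothetical.

```python
# Sketch of the feature evaluation/selection step (illustrative only).
# Assumption: a univariate F-test filter; the thesis's actual criterion
# is not specified in the abstract.
from sklearn.feature_selection import SelectKBest, f_classif

def select_features(X_train, y_train, k=50):
    """Keep only the k features most discriminative for the video types."""
    selector = SelectKBest(score_func=f_classif, k=k)
    X_reduced = selector.fit_transform(X_train, y_train)
    return X_reduced, selector
```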
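The per-modality model selection via grid search could look like the following sketch, again assuming scikit-learn; the candidate classifiers and parameter grids shown here are placeholders, not the ones evaluated in the thesis.

```python
# Sketch of per-modality classifier selection with grid search.
# Assumption: SVM and random forest candidates with example grids;
# the actual algorithms and parameters are thesis-specific.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

candidates = [
    (SVC(probability=True), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 20]}),
]

def select_best_model(X_train, y_train):
    """Grid-search every candidate and return the best estimator found."""
    best_score, best_model = -1.0, None
    for estimator, grid in candidates:
        search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        search.fit(X_train, y_train)
        if search.best_score_ > best_score:
            best_score, best_model = search.best_score_, search.best_estimator_
    return best_model, best_score
```

The winning estimator per modality would then classify the held-out test set, matching the evaluation procedure described above.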
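Finally, a sketch of the two generic fusion strategies the abstract alludes to, feature-level (early) and decision-level (late) fusion; the weighted averaging shown here is only one common technique and is not the novel fusion setup developed in the thesis.

```python
# Illustrative early and late fusion of modalities (not the thesis's setup).
import numpy as np

def early_fusion(feature_matrices):
    """Concatenate per-modality feature vectors into one matrix."""
    return np.hstack(feature_matrices)  # (n_samples, total feature dims)

def late_fusion(probas, weights):
    """Weighted average of per-modality class probabilities.

    probas:  list of (n_samples, n_classes) arrays, one per modality.
    weights: per-modality weights, e.g. individual validation accuracies.
    """
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                        # normalize to sum to 1
    stacked = np.stack(probas)                      # (n_modalities, n_samples, n_classes)
    fused = np.tensordot(weights, stacked, axes=1)  # weighted average over modalities
    return fused.argmax(axis=1)                     # predicted class per sample
```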