Software is an indispensable part of modern technology and has become pervasive in our daily lives. As a consequence, more and more software is being developed, and often less time can be allocated to the individual development phases. A major contributor to this is the growing demand for customized software solutions tailored to specific customer requirements. To keep up with this development, techniques like Software Product Line Engineering (SPLE) have been devised, in which engineers cope with the increasing need for software not by building each product anew, but by maximizing the reuse of available assets to build product families. Besides SPLE, other techniques for handling variable software with different levels of upfront investment have been proposed and, in some cases, applied in industry. Nevertheless, what all variable systems have in common is that the number of potential feature interactions grows combinatorially with the number of features. This is a problem when working with variable software, because software engineers have to keep these interactions in mind in every development phase. An area where this is especially challenging is testing variable software, because it is usually not feasible to test all variants of a system; therefore, not all potential feature interactions can be tested. To deal with this dilemma, techniques like Combinatorial Interaction Testing (CIT) have been applied to variable software systems. A main goal of testing approaches for variable software is to select a set of variants that covers the relevant feature interactions. In this thesis, the main focus is on finding ways to improve the development phases of Software Product Lines (SPLs), with a particular emphasis on testing. The first objective during the work on this thesis was to support an incremental process for developing a variable software system without the high upfront investment of a full-blown SPL.
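To illustrate the core idea of CIT, the following is a minimal sketch of greedy pairwise (2-wise) sampling over a hypothetical feature model of three independent binary features; the feature names are illustrative and the greedy strategy is only one of several used by CIT tools.

```python
from itertools import combinations, product

# Hypothetical feature model: three optional, independent features.
features = ["Logging", "Encryption", "Compression"]

# All 2^3 = 8 variants (the full configuration space).
all_variants = [dict(zip(features, values))
                for values in product([True, False], repeat=len(features))]

def pairs_covered(variant):
    """All (feature, value) pairs that a single variant covers."""
    return {((f1, variant[f1]), (f2, variant[f2]))
            for f1, f2 in combinations(features, 2)}

# Greedy pairwise sampling: repeatedly pick the variant that covers
# the most still-uncovered pairs until every pair is covered.
required = set().union(*(pairs_covered(v) for v in all_variants))
sample, covered = [], set()
while covered != required:
    best = max(all_variants, key=lambda v: len(pairs_covered(v) - covered))
    sample.append(best)
    covered |= pairs_covered(best)

print(len(sample), "of", len(all_variants), "variants suffice for pairwise coverage")
```

Even in this tiny example, four variants cover all pairwise feature-value combinations, so half of the configuration space need not be tested; the gap widens rapidly as the number of features grows.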
In the process of this work, we were able to identify where features and feature interactions are implemented by analyzing source code. We are convinced that such information is also useful for testing variable software. To gain better insight into the existing research on testing variable software, we performed a Systematic Mapping Study in which we reviewed publications on applying CIT to SPLs. This helped us understand the state of the art for testing variable systems and consider ways to expand the rich body of work on the topic. Moreover, the experience we gained from performing this study helped us in our collaboration with an industry partner, a company that maintains a variable software system for heavy-duty machinery. We gathered variability information that was scattered across various organizational units in the company to help identify relevant software variants to test. To further improve testing approaches for variable software, we planned to look into the implementation of publicly available SPLs, with the goal of better understanding how features and their interactions are realized. For this purpose, we performed an empirical study in which we analyzed the source code of these SPLs and gained interesting information about their structure. One finding was that an interaction seemed more likely to exist in source code if interactions with the same features but of lower order also existed. The order of an interaction refers to the number of features in the interaction minus one. Based on this, we performed a study to see whether we could predict higher-order interactions from known interactions of the next lower order, and we discussed the implications for testing. During the work on all these topics, we were repeatedly made aware of the necessity of comparing different testing approaches with one another.
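The order definition and the prediction idea can be sketched as follows; the feature names and the set of observed interactions are invented for illustration, and the plausibility check is a simplified stand-in for the actual prediction approach.

```python
from itertools import combinations

# Hypothetical data: feature interactions found in source code, each
# represented as a frozenset of feature names.
found = {frozenset({"A", "B"}), frozenset({"A", "C"}),
         frozenset({"B", "C"}), frozenset({"A", "B", "C"})}

def order(interaction):
    # The order of an interaction is the number of its features minus one.
    return len(interaction) - 1

def next_lower_subinteractions(interaction):
    """All sub-interactions of the next lower order."""
    return {frozenset(c) for c in combinations(interaction, len(interaction) - 1)}

# Simplified heuristic: a candidate higher-order interaction is deemed
# plausible if all of its next-lower-order sub-interactions were observed.
candidate = frozenset({"A", "B", "C"})
plausible = next_lower_subinteractions(candidate) <= found
print(order(candidate), plausible)
```

Here the candidate is a second-order (three-feature) interaction, and all three of its first-order (pairwise) sub-interactions appear in the observed set, so the heuristic flags it as plausible.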
While some benchmarks exist for this purpose for non-variable software, comparing the effectiveness of testing approaches for variable software is still done rather arbitrarily. Therefore, we set out to create a process that could help generate the necessary data for such a benchmark. The main goal was to be able to compare the fault-detection capabilities of different testing approaches.