Use A/B testing to compare two design versions.
Based on the original Geo-fence design, an engineer gave feedback that he could not figure out how to preview the map or what the eye icon represented. To solve these problems, I created “Design B.” The new design looks quite different from the original, so I wanted to compare the two to see which is more user-friendly. Our product aims to save users time during operation, so time on task and misclick rate are two important indicators for evaluating the designs.
The purpose of this experiment is to verify the following: (1) that Design B performs better than Design A in map preview, (2) that Design A and Design B perform equally well in checking devices and editing information, and (3) that Design B has greater overall usability than Design A.
The participants were twenty-one native English speakers aged 30 to 50. This was a between-subjects design, meaning each participant interacted with only one user interface: ten participants used Design A and the other eleven used Design B. After finishing the experiment, each participant received $3 for their participation.
lihi.io is a URL-shortening service that supports A/B testing. It randomly assigns participants to different destination URLs, and repeat visitors return to their previously assigned destination page.
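The sticky assignment behavior described above (a repeat visitor always lands on the same page) can be sketched with a hash-based split. This is a minimal illustration of the general technique, not lihi.io's actual implementation; the participant IDs and variant labels are hypothetical.

```python
import hashlib

VARIANTS = ["Design A", "Design B"]  # hypothetical variant labels

def assign_variant(participant_id: str) -> str:
    """Deterministically assign a participant to a variant.

    Hashing the ID means the same visitor always maps to the same
    destination, so repeat visits return the assigned page without
    any server-side state.
    """
    digest = hashlib.sha256(participant_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# The same ID always yields the same assignment.
print(assign_variant("participant-07") == assign_variant("participant-07"))  # True
```

Because the split is a function of the ID alone, the two groups stay stable for the whole study even if participants revisit the link.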
Maze is an online user-testing platform that runs tasks on interactive prototypes and records metrics such as task success, completion time, and misclicks.
(3) System Usability Scale (SUS)
The System Usability Scale (SUS) is a reliable way to measure usability. It consists of a 10-item questionnaire answered on a Likert scale. However, due to a limitation in Maze’s questionnaire design, we shortened ours to 5 items.
Each participant received a lihi.io URL that randomly assigned them to either the Design A or the Design B questionnaire. Each questionnaire had two sections. In the first, the participant completed a series of tasks using the interface prototype. In the second, the participant rated 5 system-usability items on a Likert scale ranging from “Strongly Agree” to “Strongly Disagree.”
In the map preview experiment, participants were asked to navigate Geo-fence's index page and preview a specific place's map, and their interactions were recorded. Design B performed much better on success rate, average time, and misclick rate. The results are presented below.
| Name | Success rate | Avg. time | Misclick rate |
|---|---|---|---|
| Design A | 71.4% | 17 sec | 86% |
| Design B | 87.5% | 6 sec | 38% |
In the device checking experiment, participants were asked to navigate Geo-fence's index page to check a specific location's devices. They were then asked whether a specific device belonged to that location, to test whether they could find the relevant information. According to the results, Design B performed better on success rate and average completion time, while Design A performed better on misclick rate and correct rate.
| Name | Success rate | Avg. time | Misclick rate | Correct rate |
|---|---|---|---|---|
| Design A | 18.2% | 22 sec | 30.5% | 100% |
| Design B | 30% | 8 sec | 36.5% | 60% |
In the information editing experiment, participants were asked to navigate Geo-fence's index page and edit a specific place's information, and their interactions were recorded. Design A produced far better results than Design B on success rate, average time, and misclick rate. The results are presented below.
| Name | Success rate | Avg. time | Misclick rate |
|---|---|---|---|
| Design A | 90.9% | 4.5 sec | 18% |
| Design B | 30% | 6.6 sec | 30.3% |
The questionnaire consisted of 5 items derived from the System Usability Scale. Q2 and Q3 are reverse-worded, so I reverse-coded their responses following the standard SUS scoring rules. The results are presented in the following table.
According to the results, participants rated Design B as having better overall usability, specifically on Q1, Q2, Q4, and Q5; Design A scored better only on Q3. However, given the small participant pool, the difference between the two designs is not statistically significant.
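With samples this small (10 and 11 participants), a permutation test is one reasonable way to check for a significant difference without assuming normality. The sketch below compares group means on a per-participant metric; the data are hypothetical, invented only to show the procedure.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Repeatedly shuffles the pooled observations into two groups of the
    original sizes and counts how often the shuffled mean difference is
    at least as large as the observed one. The returned fraction is an
    approximate p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical per-participant completion times (seconds), 10 vs 11.
times_a = [15, 20, 18, 14, 22, 16, 19, 17, 13, 21]
times_b = [7, 5, 8, 6, 9, 4, 6, 7, 5, 8, 6]
print(permutation_test(times_a, times_b))
```

A p-value near 1 means the shuffled differences routinely exceed the observed one, i.e. no evidence of a real effect; with a pool of only twenty-one participants, even sizeable mean differences can fail to reach significance.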
Design A's SUS score is 42.2 while Design B's is 48.5. Both fall below the average SUS score of 68. Participants attributed the low scores to a lack of knowledge, familiarity, and confidence when using the system.
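The scoring used above can be sketched as follows. It applies the standard SUS rule (positive items contribute response − 1, reverse items contribute 5 − response) and rescales the 5-item sum to 0–100; the 5-item rescaling is my assumption, since the official SUS multiplier of 2.5 applies to the full 10-item form, and the sample responses are hypothetical.

```python
# Items answered on a 1-5 Likert scale (1 = Strongly Disagree,
# 5 = Strongly Agree). Q2 and Q3 are the reverse-worded items.
REVERSE_ITEMS = {2, 3}

def sus_score(responses):
    """Score a shortened 5-item SUS questionnaire on a 0-100 scale.

    Positive items contribute (response - 1) and reverse items
    contribute (5 - response), so each item yields 0-4; the sum
    (0-20 for 5 items) is then scaled up to 0-100.
    """
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (5 - r) if i in REVERSE_ITEMS else (r - 1)
    return total * (100 / (4 * len(responses)))

print(sus_score([4, 2, 3, 4, 3]))  # hypothetical participant -> 65.0
```

Reverse-coding before summing is what lets agreement always mean "more usable," so the per-question comparisons between the designs stay directionally consistent.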
Though this is a pilot study with a small number of participants, Design B performed better overall than Design A: it solves the map-previewing issue and gives participants a better impression of the system. However, Design B still has room for improvement. First, its presentation of devices is not clear enough, which explains its lower correct rate when identifying a specific device. Second, the “more” icon is not visible enough, which explains its worse performance in the information-editing task.
On the System Usability Scale, the system received a below-average score because participants felt they needed background knowledge to use it and lacked confidence while doing so. Therefore, I recommend introducing the product to first-time users with text, a video, or a workshop. Additionally, it would be useful to provide a “help” service to support users when they are unsure how to proceed.