A/B Testing and Eye Tracking
eyetracking cover

Context
This project was made for CSCI 1300: User Interfaces and User Experiences, a class at Brown University taught by Jeff Huang. Our goal was to research how affordances affect user behavior by analyzing data through A/B testing and eye tracking.
My teammates were Miranda Mo, Jennifer He, and Milanca Wang.

Taxi Booking Site
For this project, we created two new and improved versions of a taxi booking site, as shown in Image 1. For Version A, we used a vertical layout. We added grid boxes around each taxi option and a drop shadow to emphasize the different sections. We also used different font sizes for the taxi name and the description to distinguish the different pieces of information. For Version B, we used a horizontal layout that placed all of the taxi options next to each other, contained within a single screen, and again added grid boxes around each taxi to make the content more readable. We gave both versions the same color palette so that color was not a factor in how users interacted with the site. Instead, we wanted to focus on how our two different layouts would affect behavioral trends such as click-through rate, time to click, dwell time, and return rate.
The two versions of our site can be viewed here (either Version A or Version B will be displayed each time the page is refreshed or opened).
original site design
Part 1: A/B Testing
First, we made a series of hypotheses on how the two versions of our site would perform based on four metrics (click-through rate, time to click, dwell time, and return rate). Then, we tested our hypotheses by having users navigate through one of our two sites and analyzed the data to determine whether our results were statistically significant, using either a Chi-squared test or a t-test.

Hypotheses
Click through rate
    Null: The click-through rate of Version A will be equal to that of Version B.
    Alternative: The click-through rate of Version A will be greater than that of Version B because the layout of Version A is cleaner and entices the user to select a service offered.
Time to Click
    Null: The time to click will be the same in Version A and Version B.
    Alternative: The time to click a button on Version A will be shorter than that of Version B because the button on Version A is larger, and there is less content on the initial screen for Version A compared to Version B.
Dwell time
    Null: The dwell time will be the same in Version A and Version B.
    Alternative: The dwell time for Version A will be shorter than that of Version B because users of Version B see all taxi companies on the page at once and are more likely to click on a company that they will stick with.
Return Rate
    Null: The return rate of Version A will be equal to that of Version B.
    Alternative: The return rate for Version A will be higher than the return rate for Version B because users of Version B see more content on the screen and are more likely to read the information for all of the taxi companies before clicking on the reserve button. On the other hand, users of Version A see less content on the screen and are more likely to click on the reserve button before fully reading information for the other taxi companies, thus causing them to return back to the web page.

version A redesign
version B redesign


Data Analysis
Click through rate
We began by creating a pivot table in Excel with the count of page load times (how many times the page was loaded) and the max of click time (0 if the user didn't click on any links), tallied the entries with a max click time of 0, and subtracted them from the total number of entries to find the percentage of people who clicked.
We chose to use the Chi-squared test for click through rate because this is categorical data (yes or no for each user), as opposed to having to take the average of some data.
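As a sketch, the tally described above could also be done in Python with pandas; the file name and column names here (user_id, version, click_time) are assumptions about the log format, not our actual schema.

    import pandas as pd

    # Hypothetical log: one row per page load; click_time == 0 means no click.
    logs = pd.read_csv("ab_logs.csv")  # assumed columns: user_id, version, click_time

    # A user counts as a click if the max of their click times is nonzero,
    # mirroring the "max of click time" pivot-table column.
    per_user = logs.groupby(["version", "user_id"])["click_time"].max()
    summary = (per_user > 0).groupby("version").agg(["sum", "count"])
    summary["ctr"] = summary["sum"] / summary["count"]
    print(summary)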
Observed
                Click   No click   Total
Interface A        29          5      34
Interface B        28         10      38
Total              57         15      72

Expected
                Click   No click   Total
Interface A      26.9        7.1      34
Interface B      30.1        7.9      38

Final Calculation
Interface A   85.3%
Interface B   73.7%

The calculated chi-squared value was 1.4897 with 1 degree of freedom; the critical value at p = 0.05 is 3.84.
We did not reject the null hypothesis, which was that the click-through rate would be the same for Version A and Version B, because our calculated value (1.4897) was less than the critical value (3.84) obtained for p = 0.05 and df = 1. The results of the Chi-squared test show that the difference between the click-through rates of Version A and Version B was not statistically significant.
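As a check, the same test can be reproduced with scipy (a sketch; chi2_contingency derives the expected table from the margins, as above, and correction=False matches the uncorrected statistic we report):

    from scipy.stats import chi2_contingency

    observed = [[29, 5],   # Interface A: click, no click
                [28, 10]]  # Interface B: click, no click
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(chi2, dof)  # ~1.49 with df = 1; 1.49 < 3.84, so not significant
    print(expected)   # matches the Expected table above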

Time to Click
We calculated the average time to click by removing users with no clicks, subtracting the page load time from the click time for each click entry, and then averaging users' results for Versions A and B.
Final Calculation
Interface A   8847.86 ms
Interface B   9119.82 ms

We chose to use a t-test to see if our data was statistically significant because we are comparing the average time to click for interface A with the average time to click for interface B.
We did not reject the null hypothesis, which was that the average time to click would be the same for Version A and Version B, because the absolute value of our calculated t-statistic (0.1490) was less than the critical value (1.673) obtained for p = 0.05 and df = 55. In addition, the calculated 95% confidence interval for the difference (-271.9594 ± 3052.7042 ms) contains 0, which is consistent with this conclusion: the difference between the average time to click for Version A and Version B was not statistically significant.
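A sketch of the equivalent test in scipy; the arrays below are random placeholders standing in for the 29 and 28 real per-user times (df = 29 + 28 - 2 = 55 matches a pooled two-sample t-test):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    times_a = rng.normal(8848, 5000, 29)  # placeholder for Version A times (ms)
    times_b = rng.normal(9120, 5000, 28)  # placeholder for Version B times (ms)

    # One-tailed pooled t-test: the alternative is that A's mean time is smaller.
    t_stat, p_value = ttest_ind(times_a, times_b, equal_var=True, alternative="less")
    print(t_stat, p_value)  # significant only if p_value < 0.05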

Dwell Time
We first removed all users who did not click a link on the website and return to the page. For each user who clicked a link and returned, we subtracted the first click time from the next page load time to calculate the dwell time of their first navigation away from our website.
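A sketch of that extraction with pandas, under the same assumed log format as above (one row per event, with page_load_time and click_time columns; these names are assumptions):

    import pandas as pd

    events = pd.read_csv("ab_logs.csv")  # assumed columns: user_id, page_load_time, click_time

    def first_dwell(rows):
        # Dwell time = the page load following the first click, minus that click.
        first_click = rows.loc[rows["click_time"] > 0, "click_time"].min()
        later_loads = rows.loc[rows["page_load_time"] > first_click, "page_load_time"]
        return later_loads.min() - first_click if not later_loads.empty else None

    # Users who never clicked or never returned yield None and are dropped.
    dwell = events.groupby("user_id").apply(first_dwell).dropna()
    print(dwell.mean())  # average dwell time for this version's users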
Final Calculation
Interface A 24.3863 seconds
Interface B 76.3807 seconds

We chose to use a t-test to see if our data was statistically significant because we are comparing the average dwell time for interface A with the average dwell time for interface B.
We calculated the t-statistic and got t = 0.7360. Looking up the value for df = 36 with p = 0.05 in a t-table, we would need a t-value of at least 1.688 for our data to be statistically significant. Since the absolute value of our t-value is lower than 1.688, our data is not statistically significant, meaning that there is not a significant difference between the dwell times for Interface A and B. Therefore, we did not reject the null hypothesis that the dwell time would be the same in Version A and Version B.
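The critical values we looked up in tables (1.688, 1.673, 3.84) can also be reproduced with scipy's inverse CDFs:

    from scipy.stats import t, chi2

    print(t.ppf(0.95, df=36))    # ~1.688: one-tailed t critical value, dwell-time test
    print(t.ppf(0.95, df=55))    # ~1.673: one-tailed t critical value, time-to-click test
    print(chi2.ppf(0.95, df=1))  # ~3.84: Chi-squared critical value, CTR and return rate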

Return Rate
We found the total number of people who clicked on a button for each version and subtracted the maximum click time from the maximum page load time; a positive result indicates that the user returned, and for negative results we manually checked whether they returned. We then divided the number of people who returned by the total number of people who clicked on a button to obtain the return rate.
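A sketch of that return check with pandas, again assuming the hypothetical log columns used above (the manual check for negative differences is not shown):

    import pandas as pd

    events = pd.read_csv("ab_logs.csv")  # assumed columns: user_id, version, page_load_time, click_time
    per_user = events.groupby(["version", "user_id"]).agg(
        last_load=("page_load_time", "max"),
        last_click=("click_time", "max"),
    )
    clickers = per_user[per_user["last_click"] > 0]
    # A page load after the last click means the user returned to the page.
    returned = clickers["last_load"] > clickers["last_click"]
    print(returned.groupby("version").mean())  # return rate per interface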
We chose to use the Chi-squared test to decide whether the difference between Versions A and B is statistically significant because we are checking the percentage of users who return after clicking on a button; this is categorical data.
Observed
                Clicked and Returned   Clicked, Did Not Return   Total Clicked
Interface A                       18                        11              29
Interface B                       20                         8              28
Total                             38                        19              57

Expected
                Clicked and Returned   Clicked, Did Not Return
Interface A                     19.3                       9.7
Interface B                     18.7                       9.3

Final Calculation
Interface A   62.1%
Interface B   71.4%

With 1 degree of freedom, the critical value of the Chi-squared distribution at p = 0.05 is 3.84. Since our calculated value of 0.5616 is not greater than 3.84, the difference between Versions A and B is not statistically significant. Thus, we do not reject the null hypothesis.

Part 2: Eyetracking
For the eye tracking portion of this project, we used an eye tracker on two participants, one for each version of the site. Then, we used JavaScript and Python to create a heat map and an animated replay of the users' eye movements based on the data collected from each eye tracking session (shown below).
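A minimal sketch of how a gaze heat map like ours can be generated in Python (not our exact script; the file names, column layout, and screen size are assumptions): bin the gaze samples into a 2D histogram, blur it, and overlay it on a screenshot of the page.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.ndimage import gaussian_filter

    gaze = np.loadtxt("gaze_points.csv", delimiter=",")  # assumed columns: x, y per sample
    width, height = 1920, 1080                           # assumed screen resolution

    # 2D histogram of gaze samples, indexed [row, column] = [y, x].
    heat, _, _ = np.histogram2d(gaze[:, 1], gaze[:, 0],
                                bins=(height, width),
                                range=[[0, height], [0, width]])
    heat = gaussian_filter(heat, sigma=30)  # smear each sample into a fixation blob

    plt.imshow(plt.imread("version_a.png"))  # assumed page screenshot
    plt.imshow(heat, cmap="jet", alpha=0.5)  # translucent heat overlay
    plt.axis("off")
    plt.show()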
Hypotheses
We instructed the user to book the cheapest taxi. Our hypothesis is that users of Version A will focus on the descriptions and spend more time searching for the taxi they want to click, since they have to scroll through the options. Meanwhile, users of Version B can view all of the information at once since it is contained on one screen, so they will not look at the information as closely and will click a button in a shorter amount of time. Since our instruction for the eye-tracking experiment differed from that of the A/B testing data collection (where users were simply asked to choose a taxi), our hypotheses differ slightly from Part 1.

Version A: Eye Tracking
version a eyetracking
Version B: Eye Tracking
version b eyetracking
Version A: Heat Map
version a heatmap
Version B: Heat Map
version b heatmap
Interpretation of Visualizations
The eye tracking data shows that the user for Version A focused on the description of each taxi option as they scrolled through the site, while the user for Version B did not focus on the descriptions as much and looked at the images before clicking a button. Additionally, the number of data points shows that the user for Version A spent a longer time on the site than the user for Version B. Thus, our findings support our hypothesis.

Part 3: Comparison
Based on the data analysis and eye-tracking results, we propose that Memphis Taxis Co. use Interface A due to its higher click-through rate and lower time to click. The click-through rate of Interface A (85.3%) is higher than that of Interface B (73.7%), which might be because the cleaner design simplifies information retrieval and the decision-making process. Moreover, the average time to click for Interface A (8847.86 ms) is lower than that of Interface B (9119.82 ms), which suggests that Interface A's vertical layout helps people consider options individually and make quicker decisions, unlike Interface B, where users read and compare each option before deciding. The eye-tracking heat map shows that Interface A's user focused on several descriptions before making her final decision and had a longer click time than Interface B's user, contrary to the results of our A/B test; this is likely due to the different instruction given to the eye-tracking users. Therefore, despite the results that advocate for Interface A, we should conduct more tests to achieve statistically significant results and use the same instructions for A/B data collection and eye tracking.
When comparing the data for eye tracking and A/B testing, we noticed that eye tracking has no measurement for behavioral trends such as return rate and dwell time, because once the user clicks on a button, the eye tracking session ends. The advantage of A/B testing over eye tracking is that you can collect more data related to time and to when users leave the site, such as return rate and dwell time. On the other hand, the advantage of eye tracking over A/B testing is that you can find where on the screen users are looking, which reveals the areas that grab users' attention most and thus informs particular design choices. The visual representation of the heat map can be used to make direct comparisons between versions to show whether certain changes have an effect on eye movement.
One metric that could be used unethically is click-through rate, if images or text are intentionally misleading in order to get users to click on them. Another metric that could be used unethically is conversion rate, the number of conversions (people who completed a desired goal) divided by the total number of visitors. This can be used to push users into unwanted purchases without considering what is most desirable for the user.

Conclusion
This project effectively combines design with data collection. We used the eye tracker to create a heat map, which offered a different data visualization from the A/B testing results. Both provided insight into how users interact with the two different sites. To further build on this project, we would keep our hypotheses for A/B testing and eye tracking more consistent and conduct tests on a larger group of users outside our class.



