One paper that caught my attention a while ago was “Selective attention in hypothesis-driven data analysis” by Itai Yanai and Martin Lercher. In their study, students who were given specific hypotheses to test were much less likely to notice an obvious “gorilla in the data” than students who explored the data freely.

Specifically, this is how the data can be loaded and plotted:

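library(tidyverse)

# NOTE: the file names in these URLs look swapped relative to the variable
# names; check which file holds which group before relying on the labels.
data_m <- read_table("https://www.dropbox.com/s/685pkte3n3879mn/data9b_w.txt/?dl=1")
data_w <- read_table("https://www.dropbox.com/s/r3wyn2ex20glsoa/data9b_m.txt/?dl=1")

# Label each table with its gender before combining
data_m <- data_m %>%
  mutate(gender = "male")

data_w <- data_w %>%
  mutate(gender = "female")

data <- bind_rows(data_m, data_w)

# Scatter plot of BMI against steps, colored by gender
data %>%
    ggplot(aes(x = steps, y = bmi, color = gender)) +
    geom_point()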

In this study, 119 of the 164 undergraduate students received the following instructions:

Download the two files data9b_w.txt and data9b_m.txt. Each row in both files contains for one person (women in data9b_w.txt, men in data9b_m.txt) the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females. Consider the following (alternative, not null) hypotheses:

  1. There is a difference in the mean number of steps between women and men.

  2. The correlation coefficient between steps and bmi is negative for women.

  3. The correlation coefficient between steps and bmi is positive for men.

Think about which test to use and calculate the corresponding P-value.

Which other conclusions can you draw from the data?
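For reference, the three hypotheses map onto standard tests in base R. The following is a minimal sketch, not the analysis used in the paper: it assumes a data frame shaped like `data` above (columns `steps`, `bmi` and `gender`), and the helper name `test_hypotheses` is hypothetical. The one-sided `alternative` arguments match the directional hypotheses about the correlations.

```r
# Hypothetical helper: runs the three tests from the assignment on a data
# frame with columns steps, bmi and gender ("male"/"female").
test_hypotheses <- function(df) {
  women <- df[df$gender == "female", ]
  men   <- df[df$gender == "male", ]

  list(
    # H1: mean steps differ between women and men (Welch two-sample t-test)
    steps_diff = t.test(women$steps, men$steps)$p.value,
    # H2: the steps/bmi correlation is negative for women
    cor_women  = cor.test(women$steps, women$bmi, alternative = "less")$p.value,
    # H3: the steps/bmi correlation is positive for men
    cor_men    = cor.test(men$steps, men$bmi, alternative = "greater")$p.value
  )
}

# Usage with the combined data frame built earlier:
# test_hypotheses(data)
```

Every one of these tests returns a P-value regardless of what the scatter plot looks like, which is exactly how a student can complete the assignment without ever seeing the gorilla.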

The remaining 45 students were provided these instructions:

Download the two files data9b_w.txt and data9b_m.txt. Each row in both files contains for one person (women in data9b_w.txt, men in data9b_m.txt) the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.

Examine the data appropriately! What do you notice? What conclusions can you draw from the data?

I use large language models relatively often to assist with smaller portions of my daily bioinformatics work and have been interested in studying their ability to perform complete bioinformatics analyses. A key part of any analysis is exploratory data analysis (EDA), and I wondered how well large language models would perform at this task. This naturally raises the question: are large language models able to notice the “gorilla in the data” when given the same prompts as the human students?

I decided to test this by asking both ChatGPT 4o (responses in green) and Claude 3.5 Sonnet (responses in orange) to examine the data using their data analysis tools. I ended up giving both models only the second, open-ended prompt.

ChatGPT 4o

Each row in this file contains for one person the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.

Examine the data appropriately! What do you notice? What conclusions can you draw from the data?

ChatGPT 4o first provided these three plots:

And provided this response:

Observations and Conclusions:

  1. Gender Distribution:
  • The dataset contains 921 males and 865 females, which is relatively balanced.
  2. Steps Distribution:
  • The average number of steps per day is higher for females (8014 steps) compared to males (6891 steps).
  • Males have a higher standard deviation, meaning their step counts are more varied.
  • Some individuals have zero steps recorded, which could indicate missing or inactive data.
  3. BMI Distribution:
  • The average BMI for males is 25.66, which is slightly higher than the average BMI for females (24.91).
  • Both males and females show a wide range of BMI values, from about 15 to 32.
  • The BMI distribution appears normal for both genders, though slightly skewed.
  4. Relationship Between Steps and BMI:
  • The scatter plot does not show a strong visible correlation between steps and BMI.
  • While there may be a trend where higher step counts correspond to lower BMI, the relationship is not clearly linear.
  • Further statistical analysis (e.g., correlation coefficient or regression analysis) would be needed to determine the strength of this relationship.

Would you like me to perform a correlation analysis or further statistical tests? 

The model seems to primarily focus on the data’s summary statistics. It makes some observations regarding the Steps vs BMI plot, but does not notice the gorilla in the data.
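For reference, summary statistics of this kind take only a few lines to compute. The sketch below is an illustration in base R, not the code the model ran. It assumes a data frame with the `steps`, `bmi` and `gender` columns built earlier, and the helper name `summarise_by_gender` is hypothetical.

```r
# Hypothetical helper: per-gender counts, means, standard deviations and the
# Pearson correlation between steps and bmi.
summarise_by_gender <- function(df) {
  groups <- split(df, df$gender)
  rows <- lapply(names(groups), function(g) {
    d <- groups[[g]]
    data.frame(
      gender     = g,
      n          = nrow(d),
      mean_steps = mean(d$steps),
      sd_steps   = sd(d$steps),
      mean_bmi   = mean(d$bmi),
      sd_bmi     = sd(d$bmi),
      r          = cor(d$steps, d$bmi)
    )
  })
  # Stack the one-row data frames into a single summary table
  do.call(rbind, rows)
}

# Usage with the combined data frame built earlier:
# summarise_by_gender(data)
```

None of these numbers captures the shape the points trace out, which is why a table like this can look unremarkable even when the scatter plot draws a picture.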

I asked the model to look closely at the plot, and also uploaded a PNG of the plot it had generated.

Look at the scatter plot again.

It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data. This might be due to the specific values and their alignment in the dataset.

Would you like me to:

  1. Regenerate the scatter plot with a different visualization technique (e.g., adding trend lines or jittering points)?
  2. Perform a correlation analysis to quantify the relationship between steps and BMI?
  3. Check for potential data anomalies that might be causing this pattern?

Let me know how you’d like to proceed!

When a PNG is uploaded directly, the model is better able to notice that some strange pattern is present in the data. However, it still does not recognize the pattern as a gorilla. It again seems to want to prioritize quantitative analyses, such as a correlation analysis.

Claude 3.5 Sonnet

Each row in this file contains for one person the number of steps that this person took on a particular day (steps) and the body mass index (bmi). Assume that both traits are normally distributed for males and for females.

Examine the data appropriately! What do you notice? What conclusions can you draw from the data?

Claude 3.5 Sonnet provided this response:

I’ll help analyze this dataset using the analysis tool first to understand the data distribution and relationships, then create visualizations to illustrate the findings.

Let me create a visualization to help illustrate these findings.

Based on the analysis of the data, here are the key findings:

  1. Sample Size and Gender Distribution:
    • The dataset contains 1,786 individuals
    • 921 males (51.6%)
    • 865 females (48.4%)
  2. Steps Analysis:
    • Females take more steps on average (8,014 steps) compared to males (6,891 steps)
    • The standard deviation is lower for females (3,255 steps) than males (3,674 steps)
    • The distributions appear roughly normal for both genders, as shown in the visualization
  3. BMI Analysis:
    • Males have a slightly higher average BMI (25.66) compared to females (24.91)
    • The standard deviation is similar for both genders (males: 4.82, females: 4.97)
    • Both distributions appear approximately normal, though with some right skew
    • The average BMI for males falls into the “overweight” category (BMI > 25), while the female average is in the “normal” range
  4. Relationship between Steps and BMI:
    • There is a negative correlation between steps and BMI for both genders
    • The correlation is stronger for males (r = -0.31) than females (r = -0.23)
    • This suggests that individuals who take more steps tend to have lower BMI values
    • However, the correlations are relatively weak, indicating that other factors likely play important roles in determining BMI

The visualization shows three key plots:

  1. A histogram of BMI distribution by gender
  2. A histogram of steps distribution by gender
  3. A scatter plot showing the relationship between steps and BMI

These findings suggest that while physical activity (measured by steps) has some relationship with BMI, it’s not a strong determinant on its own. The gender differences in both steps and BMI distributions might reflect different lifestyle patterns or biological factors between males and females in this sample.

Would you like me to perform any additional analyses or focus on specific aspects of the data?

And provided this dashboard:

Code
import React, { useState, useEffect } from 'react';
import { ScatterChart, Scatter, XAxis, YAxis, CartesianGrid, Tooltip, Legend, ResponsiveContainer, BarChart, Bar } from 'recharts';
import Papa from 'papaparse';

const BMIAnalysis = () => {
  const [data, setData] = useState([]);
  
  useEffect(() => {
    const fetchData = async () => {
      try {
        const response = await window.fs.readFile('bmidata.csv', { encoding: 'utf8' });
        const result = Papa.parse(response, {
          header: true,
          dynamicTyping: true,
          skipEmptyLines: true
        });
        setData(result.data);
      } catch (error) {
        console.error('Error reading file:', error);
      }
    };
    fetchData();
  }, []);

  const getBinData = (data, field, binSize) => {
    const males = data.filter(d => d.gender === 'male');
    const females = data.filter(d => d.gender === 'female');
    
    const createBins = (items, field) => {
      const min = Math.floor(Math.min(...items.map(d => d[field])));
      const max = Math.ceil(Math.max(...items.map(d => d[field])));
      const bins = {};
      
      for (let i = min; i < max; i += binSize) {
        bins[i] = 0;
      }
      
      items.forEach(item => {
        const binIndex = Math.floor(item[field] / binSize) * binSize;
        bins[binIndex] = (bins[binIndex] || 0) + 1;
      });
      
      return Object.entries(bins).map(([bin, count]) => ({
        bin: Number(bin),
        male: items === males ? count : 0,
        female: items === females ? count : 0
      }));
    };
    
    const maleBins = createBins(males, field);
    const femaleBins = createBins(females, field);
    
    const combinedBins = {};
    [...maleBins, ...femaleBins].forEach(item => {
      if (!combinedBins[item.bin]) {
        combinedBins[item.bin] = { bin: item.bin, male: 0, female: 0 };
      }
      combinedBins[item.bin].male += item.male;
      combinedBins[item.bin].female += item.female;
    });
    
    return Object.values(combinedBins).sort((a, b) => a.bin - b.bin);
  };

  return (
    <div className="space-y-8">
      <div className="space-y-2">
        <h2 className="text-xl font-bold">BMI Distribution by Gender</h2>
        <ResponsiveContainer width="100%" height={300}>
          <BarChart data={getBinData(data, 'bmi', 2)}>
            <CartesianGrid strokeDasharray="3 3" />
            <XAxis dataKey="bin" label={{ value: 'BMI', position: 'bottom' }} />
            <YAxis label={{ value: 'Count', angle: -90, position: 'insideLeft' }} />
            <Tooltip />
            <Legend />
            <Bar dataKey="male" fill="#8884d8" name="Male" />
            <Bar dataKey="female" fill="#82ca9d" name="Female" />
          </BarChart>
        </ResponsiveContainer>
      </div>

      <div className="space-y-2">
        <h2 className="text-xl font-bold">Steps Distribution by Gender</h2>
        <ResponsiveContainer width="100%" height={300}>
          <BarChart data={getBinData(data, 'steps', 1000)}>
            <CartesianGrid strokeDasharray="3 3" />
            <XAxis dataKey="bin" label={{ value: 'Steps', position: 'bottom' }} />
            <YAxis label={{ value: 'Count', angle: -90, position: 'insideLeft' }} />
            <Tooltip />
            <Legend />
            <Bar dataKey="male" fill="#8884d8" name="Male" />
            <Bar dataKey="female" fill="#82ca9d" name="Female" />
          </BarChart>
        </ResponsiveContainer>
      </div>

      <div className="space-y-2">
        <h2 className="text-xl font-bold">Steps vs BMI Scatter Plot</h2>
        <ResponsiveContainer width="100%" height={300}>
          <ScatterChart>
            <CartesianGrid strokeDasharray="3 3" />
            <XAxis type="number" dataKey="steps" name="Steps"
                   label={{ value: 'Steps', position: 'bottom' }} />
            <YAxis type="number" dataKey="bmi" name="BMI"
                   label={{ value: 'BMI', angle: -90, position: 'insideLeft' }} />
            <Tooltip cursor={{ strokeDasharray: '3 3' }} />
            <Legend />
            <Scatter name="Male" data={data.filter(d => d.gender === 'male')}
                    fill="#8884d8" />
            <Scatter name="Female" data={data.filter(d => d.gender === 'female')}
                    fill="#82ca9d" />
          </ScatterChart>
        </ResponsiveContainer>
      </div>
    </div>
  );
};

export default BMIAnalysis;

First, the model performs the analysis using JavaScript, which will probably make any bioinformatician shudder 🙈. Furthermore, the model does not notice the obvious pattern in its own visualization, and seems to focus its analysis on the data’s summary statistics.

Look at the scatter plot closely, what do you see?

Looking at the scatter plot more carefully, I notice something concerning: there appear to be some unlikely or potentially erroneous values in the data. Let me analyze this in more detail.

Looking at the data more carefully, I notice several important patterns:

  1. Bounded Data:
    • The BMI values appear to be artificially bounded between 15 and 32
    • The steps are bounded between 0 and 15,000 steps
  2. Step Distribution:
    • There are individuals with 0 steps, which could represent either:
      • Days when people didn’t wear their step counters
      • Device malfunctions
      • Truly sedentary days
    • The upper limit of 15,000 steps seems artificially imposed, as many people regularly exceed this in real life
  3. BMI Distribution:
    • The BMI range of 15-32 appears to be artificially truncated
    • Real-world BMI values can be both lower than 15 (severe underweight) and higher than 32 (severe obesity)
    • This truncation could significantly affect our earlier analysis of the BMI distribution and its correlation with steps
  4. Data Quality Issues:
    • The bounded nature of both variables suggests this dataset has been preprocessed or filtered
    • This preprocessing could have removed both actual outliers and valid extreme values
    • The artificial bounds affect the validity of our earlier assumptions about normal distributions

These observations significantly impact our earlier conclusions:

  1. The correlations we found between steps and BMI might be understated due to the truncation of both variables
  2. The apparent normality of the distributions is likely artificially induced by the truncation
  3. The gender differences we observed might be affected by these data limitations

Would you like me to perform additional analyses focusing on the data within these bounds, or would you prefer to discuss the implications of these data limitations for the conclusions we can draw?

Like 4o, the model notices some sort of anomaly in the data, but it does not seem able to see the gorilla pattern in its own generated plot.

I then uploaded the BMI vs steps scatter plot from 4o (as I couldn’t easily export Sonnet’s plot) and asked the model to look at it.

Here is the steps vs bmi plot. What do you notice?

Ah, now I see something very striking that I missed in my previous analysis – there appears to be a clear pattern in the data points that looks artificial. The data points form distinct curves and lines across the plot, which is highly unusual for what should be natural, continuous biological measurements.

Specifically:

  1. There are multiple horizontal bands/lines where BMI values cluster (e.g., around BMI values of 17.5, 25, 27, 29, and 31)
  2. The data points form curved patterns that look almost like parabolas or other mathematical functions
  3. These patterns appear equally for both males and females
  4. The spacing between points often looks regular and artificial
  5. There are clear “gaps” between these lines where no data points exist

This pattern strongly suggests that this data has been artificially generated or heavily processed. Real BMI and steps data would typically show much more random variation and natural scatter. The regular geometric patterns we see here are characteristic of:

  • Either synthetic data generated using mathematical functions
  • Or real data that has been somehow quantized or processed through some kind of algorithm

This revelation means we should be very cautious about any conclusions drawn from this dataset, as it doesn’t appear to represent natural, real-world measurements. The correlations and distributions we calculated earlier are likely artifacts of whatever process was used to generate or process this data, rather than reflecting true relationships between steps and BMI.

Would you like to explore how these artificial patterns might have been generated, or should we discuss what this means for the validity of any analyses on this dataset?

Again, the model is able to notice the artificial patterns in the data when prompted to look at a PNG of the plot, but does not recognize the pattern as a gorilla.
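A claim like “horizontal bands where BMI values cluster” can be checked numerically rather than taken at face value. The sketch below is one rough way to do that, under the assumption that banding would show up as a handful of rounded values holding a large share of the points. The helper name `top_value_share` and its defaults are hypothetical.

```r
# Hypothetical helper: fraction of observations that fall on the k most
# common values of x after rounding to `digits` decimal places.
top_value_share <- function(x, digits = 1, k = 5) {
  counts <- sort(table(round(x, digits)), decreasing = TRUE)
  as.numeric(sum(head(counts, k)) / length(x))
}

# Usage with the combined data frame built earlier:
# top_value_share(data$bmi)
```

A share far above what evenly spread values would give hints at quantization; a small share suggests the “bands” are an impression from the plot rather than a property of the values themselves.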

Thoughts

As the idea of using LLMs/agents to perform different scientific and technical tasks becomes more mainstream, it will be important to understand their strengths and weaknesses. The most glaring current weakness, in my opinion, is the discrepancy between their pattern recognition capabilities when creating initial visualizations compared to when analyzing uploaded PNG files of visualizations. While these models do a good job of creating visualizations, they seemingly do not explicitly “see” them unless prompted. Furthermore, their data analysis capabilities seem to focus much more on quantitative metrics and summary statistics, and less on the visual structure of the data. In some ways, this could be seen as a feature rather than a bug. While humans are hard-wired to see faces in clouds and trends in random noise, these models appear to err in the opposite direction.
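To make the gap between summary statistics and visual structure concrete, here is a purely illustrative sketch that rasterises a scatter plot into a coarse character grid, so that the shape of the point cloud is present in plain text. I have not tested whether this helps either model; the helper name `ascii_scatter` and the grid size are assumptions.

```r
# Hypothetical helper: turn a scatter plot into a character grid with one
# "#" per occupied cell. Returns one string per grid row, top row first.
ascii_scatter <- function(x, y, nx = 60, ny = 25) {
  # Drop points with missing or non-finite coordinates
  ok <- is.finite(x) & is.finite(y)
  x <- x[ok]
  y <- y[ok]

  # Map each point to a column (x) and row (y) index on the grid
  col <- cut(x, breaks = nx, labels = FALSE)
  row <- cut(y, breaks = ny, labels = FALSE)

  grid <- matrix(" ", nrow = ny, ncol = nx)
  grid[cbind(row, col)] <- "#"

  # Flip the rows so that larger y values are printed at the top
  apply(grid[ny:1, , drop = FALSE], 1, paste, collapse = "")
}

# Usage with the combined data frame built earlier:
# cat(ascii_scatter(data$steps, data$bmi), sep = "\n")
```

The grid throws away density and color, but unlike a mean or a correlation coefficient it keeps the outline of the points.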

I have a few thoughts on potential implications:

First, it suggests that current LLMs might be particularly valuable in domains where avoiding confirmation bias is critical. They could serve as a useful check against our tendency to over-interpret data, especially in fields like genomics or drug discovery where false positives are costly. (But it’s also not like LLMs are immune to their own form of confirmation bias.)

However, this same trait makes them potentially problematic for exploratory data analysis. The core value of EDA lies in its ability to generate novel hypotheses through pattern recognition. The fact that both Sonnet and 4o required explicit prompting to notice even dramatic visual patterns suggests they may miss crucial insights during open-ended exploration.

I’d next like to explore how to optimize collaborative human-LLM workflows and develop better prompting strategies that enhance LLMs’ pattern recognition capabilities while preserving their resistance to cognitive biases, and also to better understand what scientific biases they may introduce. I think that using LLMs as agents in bioinformatics workflows could speed up many repetitive tasks, but their behavior needs to be studied more deeply in this context.
