Jaccard Coefficient Calculations
Activity: The table shows the pathological test results for three individuals.
Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
---|---|---|---|---|---|---|---|
Jack | M | Y | N | P | N | N | A |
Mary | F | Y | N | P | A | P | N |
Jim | M | Y | P | N | N | N | A |
Calculate Jaccard coefficient for the following pairs:
- (Jack, Mary)
- (Jack, Jim)
- (Jim, Mary)
Solution: The Jaccard coefficient, also known as the Jaccard index or Jaccard similarity coefficient, is a statistic used in clustering and other forms of data analysis to measure the similarity and diversity of sample sets and it is given by:
Jaccard = (f01 + f10) / (f01 + f10 + f11)
-
To calculate the Jaccard coefficient, we first convert the asymmetric variables to binary values and re-write the table. Since Gender is a symmetric variable (that is, male, female have the same weight), it is not converted.
-
So, let Y & P = 1; N & A = 0 Let’s recalculate it using the binary values:
-
Now, we’ll rewrite the table with these binary values:
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4 Jack M 1 0 1 0 0 0 Mary F 1 0 1 0 1 0 Jim M 1 1 0 0 0 0 -
For (Jack, Mary)
Jack: (1, 0, 1, 0, 0, 0)
Mary: (1, 0, 1, 0, 1, 0)
Jaccard = (1+0)/(1+0+2)= 0.33
-
For (Jack, Jim)
Jack: (1, 0, 1, 0, 0, 0)
Jim: (1, 1, 0, 0, 0, 0)
Jaccard = (1+1)/(1+1+1)= 0.67
- For (Jim, Mary)
Jim: (1, 1, 0, 0, 0, 0)
Mary: (1, 0, 1, 0, 1, 0)
Jaccard = (2+1)/(2+1+1)= 0.75
Conclusions
-
Jack and Mary have a Jaccard coefficient of 0.33, which means that they have a relatively low degree of similarity in their test results. Only 33% of the binary attributes are the same between them.
-
Jack and Jim have a Jaccard coefficient of 0.67, indicating a higher degree of similarity compared to Jack and Mary. 67% of their binary attributes match, suggesting a moderate level of similarity.
-
Jim and Mary have the highest Jaccard coefficient among the pairs, with a value of 0.75. This suggests a relatively high degree of similarity in their test results, with 75% of their binary attributes being the same.
Note that, in practical applications, a Jaccard coefficient of 1 is often used to represent complete similarity or a perfect match between two sets.