Execute Fisher’s exact test with T-SQL

Fisher’s probability is function depend on cut-off value. In this article, you could draw scatter plot ROC curve and get Fisher’s probability with query below. In ROC curve, you would plot false positive rate on horizontal axis and sensitivity on vertical axis. You could plot cut-off value on horizontal axis and Fisher’s probability on vertical axis. You could get best cut-off value with minimum probability in line graph if minimum probability would be smaller than 0.05.

CREATE TABLE [dbo].[T_DATA]
    (    ID       nchar(8)         NOT NULL, PRIMARY KEY
    ,    Test     decimal(4, 2)    NOT NULL
    ,    Outcome  nchar(1)         NOT NULL
    );
GO
ALTER TABLE T_DATA ADD CONSTRAINT CK_Outcome CHECK (Outcome = '0' OR Outcome = '1');
GO

Run the code below to create function, converting from factorial to sum of logarithm.

CREATE FUNCTION [dbo].[LOG_FACT](
    @SrcNumber FLOAT
)
RETURNS FLOAT
BEGIN
    DECLARE @DesNumber   FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    RETURN @DesNumber
END
GO

Run the code below to create stored procedure that creates cross table from T_DATA.

CREATE PROCEDURE [dbo].[sp_Cut_by_Test]
    @CutOff  decimal(4, 2)
AS
BEGIN
WITH Cross_Table AS
(
  SELECT COUNT(*) AS 'N'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'c'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'd'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a+c'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b+d'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff                          THEN 1 ELSE 0 END) AS 'a+b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff                          THEN 1 ELSE 0 END) AS 'c+d'
    FROM T_DATA
)
  SELECT @CutOff
    ,    Cross_Table.[N]
    ,    Cross_Table.[a]
    ,    Cross_Table.[b]
    ,    Cross_Table.
    ,    Cross_Table.[d]
    ,    Cross_Table.[a+c]
    ,    Cross_Table.[b+d]
    ,    Cross_Table.[a+b]
    ,    Cross_Table.
    ,    Cross_Table.[a]/Cross_Table.[a+c] AS 'Sensitivity'
    ,    Cross_Table.[d]/Cross_Table.[b+d] AS 'Specificity'
    ,    1 - Cross_Table.[d]/Cross_Table.[b+d] AS 'FalsePositive'
    FROM Cross_Table;
END
GO

Run the code below to create stored procedure that calculates Fisher’s probability. @Start means minimum value of test, @End means maximum value of test and @Step means step value of test from @Start to @End. For example, @Start 2.0, @End 4.0 and @Step 0.1, respectively.

CREATE PROCEDURE [dbo].[FisherExactTest]
    (    @Start  decimal(4, 2)
    ,    @End    decimal(4, 2)
    ,    @Step   decimal(4, 2)
    )
AS
BEGIN
    CREATE TABLE #Result
    (    [CutOff]    decimal(4, 2)    NOT NULL
    ,    N    int    NOT NULL
    ,    a    int    NOT NULL
    ,    b    int    NOT NULL
    ,    c    int    NOT NULL
    ,    d    int    NOT NULL
    ,    [a+c]    int    NOT NULL
    ,    [b+d]    int    NOT NULL
    ,    [a+b]    int    NOT NULL
    ,        int    NOT NULL
    ,    Sensitivity    FLOAT    NOT NULL
    ,    Specificity    FLOAT    NOT NULL
    ,    FalsePositive    FLOAT    NOT NULL
    )
    DECLARE @CutOff  decimal(4, 2)
    SET @CutOff = @Start
    WHILE @CutOff <= @End
    BEGIN
       INSERT INTO #Result EXEC sp_Cut_by_Test @CutOff
       SET @CutOff = @CutOff + @Step
    END
  SELECT CutOff
    ,    N
    ,    a
    ,    b
    ,    c
    ,    d
    ,    [a+c]
    ,    [b+d]
    ,    [a+b]
    ,    [c+d]
    ,    Sensitivity
    ,    Specificity
    ,    FalsePositive
    ,    EXP(dbo.LOG_FACT([a+b])
         +   dbo.LOG_FACT([c+d])
         +   dbo.LOG_FACT([a+c])
         +   dbo.LOG_FACT([b+d])
         -   dbo.LOG_FACT(N)
         -   dbo.LOG_FACT(CASE WHEN a = 0 THEN 1 ELSE a END)
         -   dbo.LOG_FACT(CASE WHEN b = 0 THEN 1 ELSE b END)
         -   dbo.LOG_FACT(CASE WHEN c = 0 THEN 1 ELSE c END)
         -   dbo.LOG_FACT(CASE WHEN d = 0 THEN 1 ELSE d END)) AS [FisherExact_Test]
    FROM #Result
   ORDER BY CutOff
END

References;
How to calculate Fisher’s exact test with logarithm?
How to calculate four numbers from marginal total in cross tabulation?
Stored procedure to calculate factorial with natural logarithm

T-SQLでFisherの直接確率検定を行う

 Fisher の確率はカットオフ値の関数であると以前書きました.今回は下記のクエリで ROC 曲線を描き,Fisherの直接確率検定を行います.ROC 曲線では横軸に偽陽性率を取り,縦軸に感度を取ります.カットオフ値を横軸に取り,Fisher の確率を縦軸にグラフを描くと,グラフの最小値に該当するカットオフ値が求めるものとなります.ただし,p < 0.05 であることが条件です.

CREATE TABLE [dbo].[T_DATA]
    (    ID       nchar(8)         NOT NULL, PRIMARY KEY
    ,    Test     decimal(4, 2)    NOT NULL
    ,    Outcome  nchar(1)         NOT NULL
    );
GO
ALTER TABLE T_DATA ADD CONSTRAINT CK_Outcome CHECK (Outcome = '0' OR Outcome = '1');
GO

 下記の関数は階乗を対数の和に変換します.

CREATE FUNCTION [dbo].[LOG_FACT](
    @SrcNumber FLOAT
)
RETURNS FLOAT
BEGIN
    DECLARE @DesNumber   FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    RETURN @DesNumber
END
GO

 下記のストアドプロシージャはカットオフ値を変数化して T_DATA からクロス表を作成します.

CREATE PROCEDURE [dbo].[sp_Cut_by_Test]
    @CutOff  decimal(4, 2)
AS
BEGIN
WITH Cross_Table AS
(
  SELECT COUNT(*) AS 'N'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'c'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'd'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a+c'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b+d'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff                          THEN 1 ELSE 0 END) AS 'a+b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff                          THEN 1 ELSE 0 END) AS 'c+d'
    FROM T_DATA
)
  SELECT @CutOff
    ,    Cross_Table.[N]
    ,    Cross_Table.[a]
    ,    Cross_Table.[b]
    ,    Cross_Table.
    ,    Cross_Table.[d]
    ,    Cross_Table.[a+c]
    ,    Cross_Table.[b+d]
    ,    Cross_Table.[a+b]
    ,    Cross_Table.
    ,    Cross_Table.[a]/Cross_Table.[a+c] AS 'Sensitivity'
    ,    Cross_Table.[d]/Cross_Table.[b+d] AS 'Specificity'
    ,    1 - Cross_Table.[d]/Cross_Table.[b+d] AS 'FalsePositive'
    FROM Cross_Table;
END
GO

 下記のストアドプロシージャはクロス表から Fisher の確率を求めます.@Start とは test の最小値であり, @End は test の最大値, @Step は @Start から @End までの増分のことです.例えばそれぞれ @Start を 2.0 とし,@End を 4.0 とし,@Step を 0.1 とするなどです.

CREATE PROCEDURE [dbo].[FisherExactTest]
    (    @Start  decimal(4, 2)
    ,    @End    decimal(4, 2)
    ,    @Step   decimal(4, 2)
    )
AS
BEGIN
    CREATE TABLE #Result
    (    [CutOff]    decimal(4, 2)    NOT NULL
    ,    N    int    NOT NULL
    ,    a    int    NOT NULL
    ,    b    int    NOT NULL
    ,    c    int    NOT NULL
    ,    d    int    NOT NULL
    ,    [a+c]    int    NOT NULL
    ,    [b+d]    int    NOT NULL
    ,    [a+b]    int    NOT NULL
    ,        int    NOT NULL
    ,    Sensitivity    FLOAT    NOT NULL
    ,    Specificity    FLOAT    NOT NULL
    ,    FalsePositive    FLOAT    NOT NULL
    )
    DECLARE @CutOff  decimal(4, 2)
    SET @CutOff = @Start
    WHILE @CutOff <= @End
    BEGIN
       INSERT INTO #Result EXEC sp_Cut_by_Test @CutOff
       SET @CutOff = @CutOff + @Step
    END
  SELECT CutOff
    ,    N
    ,    a
    ,    b
    ,    c
    ,    d
    ,    [a+c]
    ,    [b+d]
    ,    [a+b]
    ,    [c+d]
    ,    Sensitivity
    ,    Specificity
    ,    FalsePositive
    ,    EXP(dbo.LOG_FACT([a+b])
         +   dbo.LOG_FACT([c+d])
         +   dbo.LOG_FACT([a+c])
         +   dbo.LOG_FACT([b+d])
         -   dbo.LOG_FACT(N)
         -   dbo.LOG_FACT(CASE WHEN a = 0 THEN 1 ELSE a END)
         -   dbo.LOG_FACT(CASE WHEN b = 0 THEN 1 ELSE b END)
         -   dbo.LOG_FACT(CASE WHEN c = 0 THEN 1 ELSE c END)
         -   dbo.LOG_FACT(CASE WHEN d = 0 THEN 1 ELSE d END)) AS [FisherExact_Test]
    FROM #Result
   ORDER BY CutOff
END

参照記事
対数を用いてFisherの直接確率検定を計算するには
周辺度数からクロス表を作成するには
対数により階乗を計算するストアドプロシージャを作成する

How to calculate four numbers from marginal total in cross tabulation?

When you execute Fisher’s exact test with cross tabulation, the marginal total is constant. Therefore, you could calculate the remaining three numbers if you could get ‘a’, the number of true positive. Because all four numbers are 0 or greater than 0, the range of ‘a’ is between 0 and the smaller one of either the number of ‘TRUE’ or the number of ‘positive’.

Fisher’s probability is function dependent on ‘a’. Fisher’s probability follows super geometric distribution. If ‘positive’ and ‘negative’ are separated by cut-off value in continuous variable, changing cut-off value makes a change in numbers of positive, negative and true-positive. The numbers of true and false never change regardless of change in cut-off value. Therefore, Fisher’s probability is function dependent on cut-off value.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

At first you know only about numbers of ‘N’ meaning of grand total, ‘T’ meaning of true and ‘P’ meaning of positive.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE      
Marginal total T   N 

Next, you can calculate numbers ‘N – P’ meaning of negative and ‘N – T’ meaning of false. Then you have got marginal total.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE     N – P
Marginal total T N – T N 

If you could get ‘a’, you would get false negative ‘T – a’ and false positive ‘P – a’.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a   N – P
Marginal total T N – T N 

At last, you could get true negative ‘a + N – P – T’.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

周辺度数からクロス表を作成するには

 クロス表から Fisher の直接確率検定を行う場合,周辺度数 (marginal total) が一定ですので,真陽性の度数 a さえ決まれば残りは自動的に決まります.下記の四分表はそれを示しています.すべての度数は 0 以上の整数ですから,a の取りうる範囲は 0 から T または P のいずれか小さい方までです.

 ここで重要なことは,Fisher の直接確率検定による確率 p は a の関数になっていることです.これを超幾何分布と言います.仮に POSITIVE と NEGATIVE とが何らかの連続変数の閾値によって分けられている場合,閾値を変化させることで周辺度数である陽性の度数 P,陰性の度数 N – P  および真陽性の度数 a も変化します.真の度数 T および偽の度数 N – T は閾値によって変化することはありません.つまり Fisher の直接確率検定による確率 p は閾値の関数になっています.

 通常ですと論文には総数の N, 真の度数 T, 陽性の度数 P, 更に感度と特異度が記載されており,ここから四分表を再現できるようになっています.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

 最初に分かっているのは下記のように総数 N,陽性の度数 P,真の度数 T のみです.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE      
Marginal total T   N 

 次に陰性の度数 N-P,偽の度数 N-T を計算で求めます.これで周辺度数 (marginal total) が得られます.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE     N – P
Marginal total T N – T N 

  a を与えると,偽陰性と偽陽性が求められます.a は TRUE に感度をかけて求めます.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a   N – P
Marginal total T N – T N 

  最後に真陰性が得られます.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

How to calculate Fisher’s exact test with logarithm?

Chi-square test is known to compare between ratios with two-by-two table. But you couldn’t use chi-square test if total number was smaller than 20 or expected value was smaller than 5.

Even if you couldn’t use chi-square test, you could use Fisher’s exact test and calculate accurate p-value. Although the test has reliability, it requires huge amount of calculation with factorial function and software may overflow. You would easily calculate it with conversion to the logarithm first. Next, you could add or subtract the logarithm. At last, you could convert the result to the power of e, the base of natural logarithm.

  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 
\displaystyle \begin{array} {rcl} P &=& \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{N!a!b!c!d!}\vspace{0.2in}\\&=& \exp \left[ LN \left( \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{N!a!b!c!d!} \right) \right]\vspace{0.2in}\\ &=& \exp [ LN((a+b)!) + LN((c+d)!) + LN((a+c)!) + LN((b+d)!)\vspace{0.2in}\\& & - LN(N!) - LN(a!) - LN(b!) - LN(c!) - LN(d!) ]\end{array}

対数を用いてFisherの直接確率検定を計算するには

 四分表を用いて2つの比率が等しいか異なるかを検定する方法としてはχ二乗検定が知られています.しかしχ二乗検定は総数が20未満の場合や期待値が5未満のマスがある場合には用いてはならないとされています.

 そのような場合でも,Fisherの直接確率検定を用いれば正確なp値が求まります.ただし階乗計算を要するために,オーバーフローを起こす可能性があります.実際 EXCEL のワークシート関数でも階乗の引数は 170 までです.これは EXCEL に限った問題ではなく,数値のデータ型の定義域の限界と考えられます.

 この問題法は対数を用いて乗除の計算を加減に変換することです.

  1. 階乗の乗除の対数をとる
  2. 対数により乗除が加減に変換されるため実際の計算をここで行う
  3. 得られた結果を指数に戻す
  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 
\displaystyle \begin{array} {rcl} P &=& \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{N!a!b!c!d!}\vspace{0.2in}\\&=& \exp\left[LN \left( \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{N!a!b!c!d!} \right) \right]\vspace{0.2in}\\ &=& \exp [LN((a+b)!) +LN((c+d)!) +LN((a+c)!) +LN((b+d)!)\vspace{0.2in}\\& & -LN(N!) -LN(a!) -LN(b!) -LN(c!) -LN(d!) ]\end{array}

How to calculate Odds ratio, relative risk and 95% confidence interval by two-by-two table?

You can determine statistically significance with Odds ratio, relative risk and 95% confidence interval by two-by-two table which treats such binomial values which called as outcome as true or false, for example, enabled or disabled and death or survival and such binomial values which called as predictor as positive or negative, for example, with intervention or without, with exposure risk factor or without and active drug or placebo. 

Odds ratio (OR) is used for case control study. When OR is 1, the predictor has no effect. When OR is greater than 1, the predictor has positive effect to outcome. If it is smaller than 1, the predictor has opposite effect to outcome.

Relative risk (RR) is used for prospective cohort study and randomized controlled trials. It means that when RR is greater than 1 the predictor has positive effect to outcome and when RR is smaller than 1 the predictor has negative effect to outcome.

If 1 is between lower bound and upper bound of 95% confidence interval of Odds ratio and relative risk, it is not statistically significant.

  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 

You could calculate Odds ratio and 95% confidence interval with formula below;

\displaystyle OR=\frac{\frac{a/(a+c)}{c/(a+c)}}{\frac{b/(b+d)}{d/(b+d)}}=\frac{ad}{bc}

 

\displaystyle 95\%C.I.=\exp\left(\ln OR \pm 1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right)

You could calculate relative risk and 95% confidence interval with formula below;

\displaystyle RR=\frac{a/(a+b)}{c/(c+d)}

 

\displaystyle 95\%C.I. = \exp\left(\ln{RR}\pm1.96\sqrt{\frac{1}{a}+\frac{1}{c}-\frac{1}{a+b}-\frac{1}{c+d}} \right)

四分表からオッズ比と相対危険度および95%信頼区間を計算する

 有効か無効か,生存か死亡かなどの値が TRUE か FALSE の二項分布する名義変数をアウトカムとし,危険因子の暴露ありかなしか,介入ありかなしか,実薬か偽薬かなどの値が POSITIVE か NEGATIVE の二項分布する名義変数を予知因子とする場合,四分表により予知因子がアウトカムに及ぼす影響が統計的有意か否かをオッズ比や相対危険度の95%信頼区間により判定できます.

 オッズ比は症例対照研究に用いられ,オッズ比が1だと予知因子は何の効果もないことになります.オッズ比が1より大きいと予知因子によるアウトカムの発生が増えることになります.逆に1より小さければアウトカムの発生を抑制することになります.

 相対危険度は前向きコホート試験やランダム化比較試験で用いられ,1より大きければ予知因子はアウトカムの発生を高め,1より小さければ予知因子はアウトカムの発生が低くなることになります.

 オッズ比,相対危険度共に95%信頼区間が1を跨いでいる場合には統計的有意とはいえません.

  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 

オッズ比ORは以下の式で求まります.

\displaystyle OR=\frac{\frac{a/(a+c)}{c/(a+c)}}{\frac{b/(b+d)}{d/(b+d)}}=\frac{ad}{bc}

オッズ比の95%信頼区間は以下の式で求まります.

\displaystyle 95\%C.I.=\exp\left(\ln OR \pm 1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right)

相対危険度RRは以下の式で求まります.

\displaystyle RR=\frac{a/(a+b)}{c/(c+d)}

相対危険度の95%信頼区間は以下の式で求まります.

\displaystyle 95\%C.I. = \exp\left(\ln{RR}\pm1.96\sqrt{\frac{1}{a}+\frac{1}{c}-\frac{1}{a+b}-\frac{1}{c+d}} \right)