

REVIEW ARTICLE 

Year : 2015 | Volume : 1 | Issue : 3 | Page : 136-141

Basics of biostatistics for understanding research findings
Satyanarayana Labani, Smita Asthana
Division of Epidemiology and Biostatistics, Institute of Cytology and Preventive Oncology, Indian Council of Medical Research, Noida, Uttar Pradesh, India
Date of Web Publication: 30-Sep-2015
Correspondence Address: Satyanarayana Labani, Division of Epidemiology and Biostatistics, Institute of Cytology and Preventive Oncology, Indian Council of Medical Research, I-7, Sector-39, Noida, Uttar Pradesh, India
Source of Support: Nil. Conflict of Interest: There are no conflicts of interest.
DOI: 10.4103/2394-7438.166310
The aim of this communication is to give an overview of basic biostatistical procedures that are helpful in understanding medical research findings. Several books now cover this topic, and many articles have been written in reputed journals, either on individual topics of interest or as series of chapter articles. This article, in contrast, attempts to summarize basic biostatistics in a descriptive manner so as to provide the reader with the essentials of research methodology. It may also be useful to medical undergraduate/postgraduate (UG/PG) students and junior biomedical faculty in following advances in their areas of specialization. As such, this article is not a complete treatment of the subject; the reader is advised to consult reference books for more details.
Keywords: Basics, biostatistics, research methodology
How to cite this article: Labani S, Asthana S. Basics of biostatistics for understanding research findings. MAMC J Med Sci 2015;1:136-41.
Introduction   
In medical practice, a few patients out of a group with a similar condition may not be cured by, or respond to, a chosen treatment regimen. This failure to respond, which leads to uncertainty in the cure of patients, may be due to inherent natural variation or to sampling variation among the patients included in the assessment. Research is conducted in any discipline, including medicine, to investigate unknown facts through the collection of qualitative and quantitative information called “data.” In research, data are often collected from a fraction of subjects, called a “sample,” drawn from a large target group called the “population.” The sample is required to be a random sample from the target population to ensure its representativeness. A random or probability sample fairly supports the generalization of findings obtained from the sample to the entire population. In random sampling, the inclusion of a particular subject in the sample cannot be predicted; the process is similar to a lottery.
Research findings obtained from collected data are communicated in scientific journals, and the evidence that emerges from research is subsequently included in medical textbooks. The findings, mostly in numerical form, are represented by the commonly used term, in the plural, statistics. The term statistics in its singular form denotes a science involving activities such as planning, data collection, analysis of data, and interpretation of findings to draw valid conclusions. The application of statistics to the management of uncertainties in diagnosis or prognosis that emerge from biomedical data is called biostatistics. This communication is an overview of biostatistics to help in understanding medical research findings and to appreciate the use of research methodology as a tool in medical advancement. A broader view of research methodology and of the biostatistical and epidemiological tools required to understand research findings is presented.
Research Questions and Designs of Studies   
Research is performed on topical or novel research questions that interest the research community. These can take the form of (i) a confusing phenomenon that requires a solution, (ii) an unsolved mystery, (iii) the development of a new technology, and (iv) an alternative or better solution to an existing problem. Prior requirements for designing a study are: (i) framing a research question or hypothesis and (ii) detailing the methods needed to answer the research question. Research questions arise from accumulated experience in the related area or from a literature search in related fields to find research gaps. A study design is a structured approach to address a specific research question. Studies examining patterns of disease follow a descriptive design, whereas studies determining suspected causes of disease follow an analytical design; both are observational in nature. On the other hand, experimental studies or clinical trials that compare treatment modalities follow an interventional approach. The formats of these designs are illustrated in [Table 1] and [Table 2]. The impact of natural variation on medical decisions can be controlled by using a design. Designs are scientific plans to collect and compile evidence; their main thrust is that the sample of observations is sufficient and free from bias. Recruitment of subjects into a sample depends on the study design chosen to answer the research question.
Sampling of Subjects in Various Designs   
In observational studies, sampling from the target population is performed to ensure randomness in the selection of the subjects included, while in clinical trials random allocation, or randomization, of subjects to the different intervention groups under comparison is performed. The sampling method varies according to the specific type of observational study: descriptive, cross-sectional, cohort, or case-control.
Sampling in descriptive design
For a descriptive or cross-sectional study, the method of sampling could be simple, systematic, stratified, cluster, multistage, or a combination of these. Simple random sampling gives each unit or individual an equal chance of being included in the sample, on the basis of random numbers generated by a computer. In systematic sampling, individuals are selected at regular intervals determined by the sampling fraction. Suppose we want to select a sample of 50 from 500 units; the sampling fraction is 50/500 or 1/10, that is, one unit out of every 10 is to be selected. One number (the random start) is randomly selected out of the first 10, and 10 is then systematically added each time; for example, if the first number is 5, the others are 15, 25, 35, etc. When data need to be collected separately for different categories or strata, simple random sampling is performed within each stratum; this is called stratified random sampling. This method is used in a survey when adequate representation of categories is needed, such as rural and urban areas or low-, middle-, and high-income groups of a community. For random numbers to use in the selection of subjects, a random number generator website may be consulted.^{[1]}
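The systematic sampling rule described above can be sketched in a few lines of Python; the function name and the seed value are illustrative, not part of any standard package.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """One random start in the first interval, then fixed jumps."""
    rng = random.Random(seed)
    interval = population_size // sample_size   # sampling interval, e.g., 10
    start = rng.randint(1, interval)            # random start in 1..interval
    return [start + i * interval for i in range(sample_size)]

# Example from the text: select 50 units out of 500 (interval 500/50 = 10)
units = systematic_sample(500, 50, seed=1)
```

If the random start happens to be 5, the selected units are 5, 15, 25, …, 495, exactly as in the worked example.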
Sampling in prospective design
A group of subjects, or cohort, exposed and/or unexposed to a particular characteristic is followed up over a period of time and observed to see whether its members develop the outcome of interest. For example, groups of elderly women exposed or unexposed to human papillomavirus (HPV) are cohorts. For sampling in this setup, a cross-sectional study on the prevalence of exposure can serve as a basis when the exposure is relatively common. When the exposure is rare, sampling approaches similar to those required in the case-control setting may be useful.
Sampling in casecontrol design
A case-control study enrols subjects with the disease (cases) and without the disease (controls), as against a prospective study in which groups with and without the exposure are followed up for development of the outcome of interest. In contrast to the prospective nature of a cohort study, a case-control study looks back at the frequency of exposure among cases and controls to reach a conclusion.
Selection of cases
Two important aspects in the selection of cases are representativeness and the method of selection. The cases could be prevalent cases obtained from a cross-sectional study or incident cases obtained from a cohort study; other sources include cases from a medical care facility or a disease registry.
Selection of controls
The essential qualities needed in the selection of controls are: (i) the control must be at risk of getting the disease, and (ii) the control should resemble the case in all respects except for the absence of the disease. This means that comparability is more important than representativeness in the selection of controls. Ideally, there should be one control for each case in a case-control study; in general, there is no further gain in statistical power if the number of controls exceeds four per case. Selection of controls is usually performed through individual matching or group matching with cases. An example is matching on a factor such as age, either on a particular single age or on a 5-year age group, between cases and controls.
Sampling in randomized controlled trials
Randomized controlled trials (RCTs) are experiments on humans; the trials are conducted in four phases. Phase I trials are performed on healthy volunteers to evaluate the maximum tolerated dose. Phase II trials evaluate the primary effects and side effects in patients. Phase III trials are the actual randomized controlled trials evaluating the efficacy of an intervention. Phase IV trials constitute postmarketing surveillance. These trials evaluate the impact of a specific intervention in improving a health outcome and involve time, complexity, and cost. In clinical research, RCTs are considered the gold standard of studies (level 1 evidence).
The sampling of subjects in RCTs involves two important techniques, viz., blinding and randomization. Blinding is the concealment of knowledge of treatment allocation from patients, care providers, or data analysts. Randomization, on the other hand, is the actual approach to allocation of patients in a clinical trial, introducing a deliberate element of chance into the assignment. Its advantages are: (i) to ensure that each subject has an equal chance of assignment to any intervention under study, (ii) to produce comparable groups and attain validity for statistical tests, and (iii) to ensure that the groups are alike in all important respects and differ only in the intervention each receives. A confounder is a nuisance factor that gets in the way of studying the association between a risk factor and disease. Ethical considerations, such as ensuring and optimizing the potential benefits while minimizing the potential harms to participants, are essential. For allocation of subjects to the different groups of a trial, free websites are available for various allocation designs.^{[2]}
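A minimal sketch of simple 1:1 randomization is shown below; real trials use more elaborate schemes (blocking, stratification), and the function name, subject identifiers, and seed here are hypothetical.

```python
import random

def randomize_1to1(subject_ids, seed=None):
    """Shuffle the subject list and split it into two equal arms."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]   # (arm A, arm B)

# Hypothetical trial of 40 subjects allocated 1:1 to two arms
arm_a, arm_b = randomize_1to1(range(1, 41), seed=7)
```

Because the assignment is driven purely by chance, neither the investigator nor the subject can predict which arm a given subject will enter, which is the point of the technique.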
Estimation and Test of Hypothesis   
An understanding of uncertainty begins with an estimation of central and interval values along with measures of variability. The basis for estimation of measures is dependent on whether the data are qualitative or quantitative. Data basically are obtained by interviewing and examination of patients and by noting down the reports of investigations on a uniform format such as – questionnaire, schedule or proforma called tools for data collection. The quantitative measures such as mean, standard deviation (SD), etc., and qualitative feature in frequency or proportion or a ratio or rate in percent, etc., are computed to summarize data. These give an initial understanding of hidden uncertainty. Such measures obtained from a sample are called as statistics and the same measures computed on the entire population are called parameter. Those statistics are point estimates of population parameters. Statistical inference [Figure 1] which deals with the estimation of population parameters and statistical tests of significance is drawn on the basis of sample statistics, and the findings are expected to be applicable for the entire target population. The entire population is never studied in any research setting, but only samples are drawn from the target population. For the purpose of estimation and test of hypothesis knowledge of some theoretical distributions are necessary.
Use of normal or Student's t-distributions in medical decisions
Most biological variables have a nearly symmetric and bell-shaped distribution. In practice, as a rule of thumb, data are checked using a histogram for an asymmetrical shape of the distribution, for too large a difference among the central values (the mean, median, and mode not being approximately equal), and for an SD so large as to equal or exceed the mean. These are useful approaches for detecting gross non-normality in the data, if present. An essential property of the normal distribution is that the range mean − 2SD to mean + 2SD includes 95% of the observations [Figure 2]. This property is very useful in the estimation and hypothesis-testing components of statistical inference on medical data. The normal or Gaussian distribution has another important property: even if the observations are far from normal or symmetrical, the sample means will follow a normal distribution for a large sample size. The distribution is called the standard normal distribution when the variable x under study is transformed into Z (Z = [x − µ]/σ); the transformed Z has mean zero and variance one. When the population SD is unavailable or unknown and is replaced by the sample SD, the distribution is almost similar but has a different name, Student's t-distribution [Figure 3]. For large samples, t-tests give virtually identical results to normal-distribution tests. Compared with the normal distribution, the t-distribution uses a quantity called degrees of freedom (df), which depends on the sample size: the df is the sample size minus the number of parameters under estimation. For estimating the mean from a sample of n observations, the df is n − 1.
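The Z-transformation above is a one-line computation; the haemoglobin figures below are hypothetical, used only to illustrate the formula.

```python
def z_score(x, mu, sigma):
    """Standardize x: number of SDs it lies from the population mean."""
    return (x - mu) / sigma

# Hypothetical: haemoglobin of 15 g/dL against population mean 13, SD 1.5
z = z_score(15, 13, 1.5)   # about 1.33 SDs above the mean
```

A value of |Z| greater than about 2 would place the observation outside the central 95% range described above.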
Estimation
Sample estimates such as a mean or proportion tend to vary from sample to sample due to sampling variability or sampling fluctuation. It is important to understand how much uncertainty is conferred on point estimates such as the mean or proportion. The measure of sampling variability, called the standard error (SE), can be estimated for the mean, a proportion, or any other estimate of interest. When an estimate is computed as an interval rather than a single value, that interval is called a confidence interval (CI). An example of the computation of the SE of a mean and the 95% CI of a mean is illustrated below. Replacing SD with SE in the expression mean ± 2SD provides the 95% CI for the mean (i.e., mean ± 2SE).
Suppose we have data on the cholesterol level of 300 children of 3–12 years of age; what is the 95% CI of the mean? The computed mean and SD of the cholesterol level are 130 and 25, respectively. The SE of the mean for the sample size n = 300 is computed as SD/√n, or 25/√300, or 1.45. The 95% CI for the mean cholesterol level is mean ± 2SE, or (mean − 2SE) to (mean + 2SE), or (130 − 2 × 1.45) to (130 + 2 × 1.45), or 127–133. This is interpreted as a 95% chance that the population mean cholesterol level in children of 3–12 years of age lies in the interval 127–133. In contrast, the interval mean ± 2SD, or 130 ± 2 × 25, or (130 − 50) to (130 + 50), or 80–180, is the 95% range of observations in the sample under study. This shows the clear distinction between an interval of observations and a CI.
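The cholesterol computation above can be reproduced directly; the function name is illustrative, and the ± 2SE multiplier is the text's rounded approximation of 1.96.

```python
import math

def se_and_ci95(mean, sd, n):
    """SE of the mean and the approximate 95% CI (mean +/- 2*SE)."""
    se = sd / math.sqrt(n)
    return se, (mean - 2 * se, mean + 2 * se)

# Worked example from the text: mean 130, SD 25, n = 300
se, (lo, hi) = se_and_ci95(130, 25, 300)
```

The result matches the text: SE ≈ 1.44–1.45 and a 95% CI of roughly 127 to 133.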
Concept of statistical significance and P value
Statistical significance is closely related to a confidence statement such as a 95% CI. A threshold of 95% confidence implies a remaining uncertainty of 5%, which forms a critical region that becomes the basis for hypothesis testing. In statistical inference, a hypothesis is formulated so that it can be refuted; this is called a null hypothesis or statistical hypothesis. "Null" indicates zero, and under the null hypothesis no difference, or zero difference, is assumed. For any null hypothesis there can be a one-sided or a two-sided alternative. Suppose our interest is to examine whether the hemoglobin (Hb) level of children with chronic diarrhea is the same as that of healthy children. This is an example of a one-sided test, because the Hb level in chronic diarrhea is not expected to be higher than the normal Hb level of healthy children. An example of a two-sided hypothesis is a comparison of Hb levels in children undergoing two types of feeding practices. [Table 3], [Table 4], [Table 5] depict errors in decision making in the contexts of marketing a new drug, diagnostic testing, and statistical testing. The probability of wrongly rejecting a true null hypothesis is one error (type I) in statistical decision making; this is also referred to as the P value. The value of this error is generally kept at 0.05; this threshold of 5% is also called the level of significance. A result is called "statistically significant" when P < 0.05. The other important concept, relating to not rejecting a false null hypothesis, is the second error (type II) in statistical decision making. Type I and type II errors can be viewed as false positives and false negatives, respectively, in the setting of diagnostic accuracy testing. The probability of rejecting a false null hypothesis is called the power of the test; power is also the probability of getting a statistically significant result.
General Significance Test Procedure   
The basis of any test procedure is to judge a sample mean against a hypothesized value (μ) in relation to the SE of the mean. In the one-sample context, the test criterion (based on Student's t-test) is:

t = (sample mean − μ)/(SD/√n)
With (n − 1) df, this ratio is used to reject or not reject the null hypothesis depending on the computed value of t. The null hypothesis is rejected if the calculated value of t exceeds the critical value of t given in the t-distribution table corresponding to a prefixed level of significance, for either a one-tailed or a two-tailed test. A calculated value of t greater than the critical value (1.96 for very large n) indicates that P is less than the threshold level of significance, such as 0.05 (5%). A value of P below the threshold probability (0.05) is interpreted as significant (P < 0.05). This basic procedure is the same in any test of significance; the difference lies in the test statistic used for each comparison. Testing a sample mean against a hypothetical population mean and testing the difference between two means are the common situations in tests of hypotheses using the t-test for quantitative data. There are two settings in such tests: one involves two independent groups, the other paired data. For assessing significance in qualitative data, tests such as the Chi-square, Fisher's exact, and McNemar tests are used. Where quantitative data are not normally distributed, nonparametric tests such as rank tests are used to assess statistical significance. Where more than two means are to be compared, a procedure called analysis of variance (ANOVA) is used, with the F-test for assessing overall significance and several other choices for pairwise comparisons; critical values for these distributions are available for decision making. For details of all these tests, other references may be consulted.^{[3],[4]}
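The one-sample t criterion can be computed with the standard library alone; the haemoglobin readings and the hypothesized mean of 12 g/dL below are hypothetical.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """t = (sample mean - hypothesized mean) / (SD / sqrt(n)), n - 1 df."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)           # sample SD with n - 1 denominator
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical haemoglobin readings (g/dL) tested against mu = 12
t, df = one_sample_t([11.2, 12.5, 13.1, 11.8, 12.9, 12.2], 12)
```

The computed t would then be compared with the critical value from a t-table at 5 df to decide whether to reject the null hypothesis.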
Assessment of strength of association
There are more profound uncertainties in assessing the relationship between disease and exposure. For categorical variables, the association between disease and exposure is measured by the relative risk (RR) or risk difference and the odds ratio (OR). For continuous variables, the strength of association is measured by the correlation coefficient (R) and the coefficient of determination (R-square).
Relative Risk   
The RR is the ratio of the incidence rate among the exposed to that among the unexposed. The RR, or rate ratio, for an outcome such as disease is calculated using the exposed and unexposed categories. Consider a prospective study following up women with and without HPV to observe the outcome of the cervical precancerous state, cervical intraepithelial neoplasia (CIN). Hypothetical data tabulated on 600 women are shown in [Table 6]. For the relation between HPV status and CIN: RR = 19.25, 95% CI = 7.1–51.9, Chi-square = 76.1, P < 0.001.
The exact P value obtained from a statistical package is very low. The RR of 19.25 is interpreted as follows: women with HPV have a 19-fold risk of developing CIN compared with women without HPV. That the null value RR = 1 is not included in the 95% CI also indicates the significance of RR = 19.25. The risk difference is the difference in incidence, or risk, between the exposed and unexposed groups.
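A minimal sketch of the RR computation follows. Since the cell counts of [Table 6] are not reproduced in the text, the counts below are hypothetical, chosen only to give RR = 19.25 on 600 women; the CI uses the standard large-sample log-scale approximation, which need not match the text's interval exactly.

```python
import math

def relative_risk(a, n_exposed, c, n_unexposed):
    """RR = risk among exposed / risk among unexposed,
    with an approximate 95% CI computed on the log scale."""
    rr = (a / n_exposed) / (c / n_unexposed)
    se_log = math.sqrt(1/a - 1/n_exposed + 1/c - 1/n_unexposed)
    lo = math.exp(math.log(rr) - 1.96 * se_log)
    hi = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, (lo, hi)

# Hypothetical counts on 600 women: 77/200 HPV-positive and
# 8/400 HPV-negative women developed CIN
rr, (lo, hi) = relative_risk(77, 200, 8, 400)
```

A CI that excludes the null value RR = 1, as here, corresponds to a statistically significant association.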
Odds Ratio   
Case-control studies assess the frequency of exposure in cases with the disease and controls without the disease. These frequencies are expressed as odds, and their ratio is the OR. The OR is approximately the same as the RR when the disease is rare. Consider a case-control study evaluating the role of low birth weight in early neonatal mortality. Hypothetical data tabulated in a 2 × 2 contingency form are shown in [Table 7]. Table 7: Relationship between low birth weight and early neonatal mortality.
The OR, its 95% CI, and the Chi-square test are as follows: OR = (47 × 55)/(15 × 11) = 15.6, with a 95% CI of 6.5–37.4; Chi-square = 45.1 at 1 df, and the very low P = 0.000 can be reported as P < 0.001. The 95% CI is computed using statistical packages.^{[5],[6]} The OR of 15.6 indicates that the odds of death in neonates with low birth weight (<2000 g) are 15.6 times the odds of death in neonates without low birth weight. This is not the same as the RR.
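The OR arithmetic from [Table 7] can be reproduced as below; the CI uses Woolf's log-scale approximation, a standard choice, which here reproduces the interval reported in the text.

```python
import math

def odds_ratio(a, b, c, d):
    """OR from a 2x2 case-control table with Woolf's approximate 95% CI.
    a = cases exposed, b = controls exposed,
    c = cases unexposed, d = controls unexposed."""
    orr = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(orr) - 1.96 * se_log)
    hi = math.exp(math.log(orr) + 1.96 * se_log)
    return orr, (lo, hi)

# Table 7 counts: 47 deaths and 15 survivors with low birth weight,
# 11 deaths and 55 survivors without it
orr, (lo, hi) = odds_ratio(47, 15, 11, 55)
```

This gives OR ≈ 15.7 (the text rounds to 15.6) with a 95% CI of about 6.5–37.4, consistent with the figures reported above.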
Correlation coefficient
Scatter diagrams^{[4]} are important for the initial exploration of the relationship between two quantitative variables. The strength of the linear, or straight-line, relationship between two quantitative variables is measured by the correlation coefficient, or Pearson's correlation coefficient. Its value lies between −1 and +1, indicating negative and positive correlation, respectively. On the other hand, expressing the relationship between two or more quantitative variables in a structural form, for predicting one variable when the others are given, is called regression. The square of the correlation coefficient is called the coefficient of determination and is interpreted as the percentage of variation in one variable (the dependent) explained by the other variable (the independent) on which it is regressed.
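Pearson's correlation coefficient can be computed from first principles as below; the paired weight and blood pressure values are hypothetical, used only to illustrate the formula.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # cross-products
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data: weight (kg) vs. systolic BP (mm Hg)
r = pearson_r([60, 65, 70, 75, 80], [110, 118, 121, 130, 135])
r_squared = r ** 2   # coefficient of determination
```

For these made-up data, r is close to +1, and r_squared gives the proportion of variation in blood pressure explained by the linear regression on weight.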
Evaluation of Diagnostic Test Performance   
Statistical measures for assessing the performance of a clinical (screening/diagnostic) test are sensitivity, specificity, and the positive and negative predictive values. Sensitivity and specificity are useful for identifying or ruling out the disease and indicate the inherent quality of the test; these indicators do not depend on the prevalence of the disease in the population. In contrast, predictive values depend on the prevalence of the disease in the community to which the test is applied. Predictive values describe the probability that the test gives the correct diagnosis.
Sensitivity, specificity, and predictive values
The sensitivity of a screening/diagnostic test is the ability of the test to correctly identify those patients who have the disease. The specificity of a screening/diagnostic test is the ability of the test to correctly rule out persons without the disease. The positive predictive value is the proportion of patients with positive test results who are correctly diagnosed, and the negative predictive value is the proportion of patients with negative test results who are correctly diagnosed.
Positive and negative likelihood ratios
The positive likelihood ratio is the probability that a person who has the disease tests positive (sensitivity) divided by the probability that a person who does not have the disease tests positive (1 − specificity). The negative likelihood ratio is the probability that a person who has the disease tests negative (1 − sensitivity) divided by the probability that a person who does not have the disease tests negative (specificity).
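All six measures defined above follow directly from the four cells of a test-versus-disease 2 × 2 table; the screening counts below are hypothetical.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 test-result vs. disease-status table."""
    sens = tp / (tp + fn)        # sensitivity: true positive rate
    spec = tn / (tn + fp)        # specificity: true negative rate
    ppv = tp / (tp + fp)         # positive predictive value
    npv = tn / (tn + fn)         # negative predictive value
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    return sens, spec, ppv, npv, lr_pos, lr_neg

# Hypothetical screening results: 90 true positives, 30 false positives,
# 10 false negatives, 870 true negatives
sens, spec, ppv, npv, lr_pos, lr_neg = diagnostic_measures(90, 30, 10, 870)
```

Note that changing the prevalence (the ratio of diseased to disease-free subjects) alters PPV and NPV but leaves sensitivity and specificity unchanged, as stated above.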
Sample size determination
The number of subjects to be included in the sample of a research investigation is called the sample size. Sample size plays an important role in estimation and tests of hypotheses, and for a proposed investigation it should be calculated with the help of essential information based on scientific knowledge. The determination of sample size depends on a variety of considerations. In the estimation setting these are: (i) the proposed method of sampling through which subjects are to be enrolled, as sample size is determined on the basis of simple random sampling, (ii) the level of precision within which the estimate is desired to fall, (iii) knowledge of variability through the SD, required when estimating a mean, and (iv) the desired confidence level, such as 95% or 99%. In determining sample size for a test of hypothesis, the required considerations are: (i) the desired magnitude of difference that is considered clinically significant, (ii) the assumption of a normal distribution of the data and the extent of variability through the SD when the interest is in a quantitative variable, (iii) the level of significance, or maximum tolerable type I error, and the required statistical power for the specified clinically important difference, and (iv) whether the alternative hypothesis calls for a one-tailed or a two-tailed test. Free statistical packages such as Epi Info and other available websites can be used to determine the sample size after providing the required inputs.^{[7]}
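For the estimation setting, the two textbook formulas, n = (Z·SD/d)² for a mean and n = Z²p(1 − p)/d² for a proportion, can be sketched as below; the SD, prevalence, and precision values are hypothetical, and these simple formulas assume simple random sampling at the 95% level (Z = 1.96).

```python
import math

def n_for_mean(sd, precision, z=1.96):
    """Sample size to estimate a mean to within +/- precision."""
    return math.ceil((z * sd / precision) ** 2)

def n_for_proportion(p, precision, z=1.96):
    """Sample size to estimate a proportion to within +/- precision."""
    return math.ceil(z ** 2 * p * (1 - p) / precision ** 2)

# Hypothetical: estimate mean cholesterol (SD 25) to within +/- 5 units
n_mean = n_for_mean(25, 5)
# Hypothetical: estimate a 20% prevalence to within +/- 5 percentage points
n_prop = n_for_proportion(0.20, 0.05)
```

Halving the desired precision roughly quadruples the required sample size, which is why the precision statement is the most influential input in these calculations.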
Conclusions   
Findings from medical research need to be understood in order to put the emerging evidence into medical practice. Whatever the area of medicine in which investigations are performed, knowledge of biostatistics, as part and parcel of research methodology, is essential. The path given in this article, beginning with the research question and proceeding through the design of the study, the sample of observations chosen, the data analysis, and the interpretation leading to the final conclusions of a research investigation, is helpful in understanding the advances taking place in a particular area of medicine. This brief overview of the subject should serve as a quick summary of methods for UG and PG medical students in their short projects and thesis work.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  
2.  
3.  Indrayan A, Satyanarayana L. Biostatistics for Medical, Nursing and Pharmacy Students. New Delhi: PHI Publishers; 2006. ISBN: 8120330544. 
4.  Indrayan A, Satyanarayana L. Simple Biostatistics for MBBS, PG Entrance and USMLE. 4th ed. Delhi: Academa Publishers; 2013. 
5.  
6.  IBM Corp. IBM SPSS Statistics for Windows, Version 22.0. Armonk, NY: IBM Corp; 2013. 
7.  
