How technology can improve medicine: machine learning methods used to detect cervical cancer
In the United States alone, about 11,000 women are diagnosed with cervical cancer each year. According to the American Cancer Society's predictions, there will be about 13,240 new cases of invasive cancer diagnosed in the United States in 2018. About 4,170 women will die from this disease. Cervical cancer was once one of the most common causes of cancer death for American women.
At the same time, the disease can be cured at an early stage of its development and many tests and examinations allow for quick diagnosis. One of the ways to diagnose this type of cancer is to perform a cervical biopsy. Unfortunately, this is a very invasive test for a woman.
Cervical cancer — a quiet killer…
Cervical cancer develops painlessly and over a long time. It may not show any symptoms for many years. It is not an inherited or genetically conditioned disease. A common virus, the human papillomavirus (HPV), is responsible for the development of cervical cancer. Every woman, regardless of her age, is exposed to its carcinogenic types. There are many types of HPV, but only some of them are carcinogenic and cause cervical cancer. Infection may occur during sexual intercourse, as well as through direct contact with the skin of an infected person. All women who are sexually active may come into contact with both low-risk HPV and the most dangerous types. About 80% of sexually active women become infected with HPV at least once in their lives.
Prevention — the first step
Cervical cancer can be easily detected even at an early stage of development.
a) Vaccinations against HPV — Primary prevention
An increasingly common method of cervical cancer prevention is widespread vaccination against HPV in people who have not yet become sexually active. According to research, vaccination reduces the risk of the disease to a considerable extent. So far, 10 European countries have issued official recommendations regarding vaccination against human papillomavirus.
b) Cytology — secondary prevention
Cytology is a test that allows detection of cervical cancer in the early stages. It involves microscopic evaluation of cells collected with a special cervical brush.
Thanks to it, even minor abnormalities in the cervix can be diagnosed. Early lesions detected in cervical cells can be completely cured. A cytologic examination does not prevent infection with HPV, the virus that causes cervical cancer. Instead, it helps to identify the early signs of the disease.
What if it’s too late?
A cervical biopsy is a surgical procedure involving the removal of a small amount of tissue from the cervix. The cervix is the lower, narrow end of the uterus located at the end of the vagina.
A cervical biopsy is usually ordered when irregularities are detected during routine pelvic organ examination or cytological examination. Irregularities may include the presence of human papillomavirus (HPV) or pre-cancer cells. Such conditions may contribute to the development of cervical cancer.
A cervical biopsy can detect pre-cancer cells or cervical cancer. Unfortunately, the procedure is invasive, sometimes painful, and usually performed under local or general anesthesia.
Is it possible to avoid it?
The latest technologies at your service, doctor!
Here we are: DLabs, experts in Data Science, Machine Learning and Artificial Intelligence. We have roots in pure, real science and have on board the best Data Science specialists in Poland: experienced developers, data scientists and PhDs in mathematics. We came up with the idea of improving medical solutions and found a way to predict whether a biopsy is needed.
We set out to show that it is possible to recommend a cervical biopsy to a patient based on her historical data. The machine learning method we used (neural networks) is correct in 88% of cases, which means that in 88 out of 100 cases the algorithm correctly predicted whether a biopsy was needed. The task of the algorithm is to support the decision-making process of the doctor, who can then decide on a biopsy based on the historical data of all patients and their cases.
Other advantages are:
a. reducing the cost of biopsies performed by the hospital through fewer unnecessary biopsy decisions,
b. a smaller number of women exposed to invasive surgery.
We put forward a hypothesis: based on the medical history of women, we can model a variable indicating the need for a biopsy to detect changes that point to cervical cancer.
To test it, we assembled a training data set with the following characteristics:
1. 607 women aged 18–84, with an average age of 30 years.
2. History of their sexual contacts (number of partners, number of pregnancies, age of the first sexual intercourse).
3. An indicator of whether a person smokes and, if so, for how many years.
4. History of contraception (hormonal, intrauterine devices).
5. History of venereal diseases.
6. In the case of women subjected to genetic tests — predisposition to specific types of diseases.
7. Other tests ordered: Hinselmann, Schiller, cytology.
8. In total, there are 23 features in the collection.
Visualizations of sample flag data
We modeled a variable indicating whether a woman should have a biopsy to diagnose cervical cancer or whether the examination is not required. This is a boolean variable that takes only two values: 0 means no biopsy and 1 means biopsy. In the initial data set, 7% of the women had a biopsy and 93% did not.
Visualization of the explained variable in two dimensions
First, principal component analysis (PCA) was carried out to project the data set from 23-dimensional space to 2-dimensional space. The graph shows the PCA output, with color marking each value of the explained variable.
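A minimal sketch of such a projection, assuming scikit-learn (the article does not name the library that was used). The data below is randomly generated and only mimics the shape of the real set: 607 rows, 23 features and a flag with roughly 7% positives.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data set (assumption): 607 patients x 23 features.
X = rng.normal(size=(607, 23))
# Synthetic biopsy flag with roughly 7% positives (assumption).
y = (rng.random(607) < 0.07).astype(int)

# Project the 23-dimensional feature vectors onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (607, 2)
# Share of the total variance kept by each of the two components.
print(pca.explained_variance_ratio_)

# X_2d[:, 0] and X_2d[:, 1] can now be drawn as a scatter plot colored by y.
```

The explained variance ratio is worth checking before reading too much into such a plot: if the two components keep only a small share of the variance, classes that overlap in 2D may still be separable in the full feature space.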
Problem from the perspective of machine learning
Because the explained variable is a flag, this is a classification problem. The main challenge is the low share of women with a recommended biopsy in the whole set (only 7%). Training on the whole set as it is could lead to a situation in which the best strategy for the model is to predict that no woman should be recommended a biopsy. Such a model would score very well, but it would have no value in the real world. This is a very common problem with medical data.
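The effect is easy to reproduce with plain arithmetic. The sketch below assumes 42 positive cases out of 607, which is only an approximation of the 7% quoted above, and scores a "model" that never recommends a biopsy.

```python
# Approximation of the class split (assumption): 42 of 607 women need a biopsy.
y_true = [1] * 42 + [0] * 565

# A degenerate "model" that never recommends a biopsy.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of women who need a biopsy that the model actually finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.931
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

Accuracy looks excellent while the model misses every single patient who needs the examination, which is why the class balance has to be addressed before training.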
The course of the study
1. Balancing the classes in the set (so that the classifier pays attention to the features, not to the size of a given class).
2. Normalization of continuous variables with the min-max method (e.g. age).
3. Training a model based on feed-forward neural networks.
4. Evaluation of the quality of the model with 10-fold cross-validation.
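The first two steps can be sketched with NumPy alone. The article does not say how the classes were balanced, so random oversampling of the minority class is an assumption here; the helper names `min_max_scale` and `oversample_minority` and the generated ages are purely illustrative.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array linearly to the range [0, 1]."""
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(column, dtype=float)
    return (column - lo) / (hi - lo)

def oversample_minority(X, y, rng):
    """Randomly repeat minority-class rows until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

rng = np.random.default_rng(0)

# Synthetic ages between 18 and 84 (assumption), one column for brevity.
age = rng.integers(18, 85, size=607).astype(float)
age[0], age[1] = 18.0, 84.0  # make sure both ends of the range occur
X = age.reshape(-1, 1)

# Synthetic flag: 42 positives, 565 negatives (approximately 7% / 93%).
y = np.zeros(607, dtype=int)
y[:42] = 1

X[:, 0] = min_max_scale(X[:, 0])
X_bal, y_bal = oversample_minority(X, y, rng)

print(X[:, 0].min(), X[:, 0].max())  # 0.0 1.0
print(np.bincount(y_bal))            # [565 565]
```

One design point to keep in mind: when oversampling is combined with cross-validation, the copies should be created inside each training fold only. Otherwise duplicates of the same patient can land in both the training and the test part and inflate the scores.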
The network parameters were as follows:
1. Neural network, fully connected.
2. 100 hidden neurons.
3. The learning rate is 0.1.
4. Training ran for 100 epochs.
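Training and evaluation with these parameters could look roughly as follows. This is a sketch, not the original implementation: scikit-learn, the SGD solver, the default activation and the random data are all assumptions, since the article gives only the layer size, the learning rate and the number of epochs.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic, already min-max scaled data (assumption): 300 rows x 23 features.
X = rng.random(size=(300, 23))
# Synthetic, roughly balanced label that depends on the first two features.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

model = MLPClassifier(
    hidden_layer_sizes=(100,),  # one fully connected hidden layer, 100 neurons
    solver="sgd",               # assumption: the optimizer is not stated
    learning_rate_init=0.1,     # learning rate 0.1
    max_iter=100,               # for the sgd solver this is the number of epochs
    random_state=0,
)

# 10-fold cross-validation; stratification keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring="accuracy")

fold_accuracy = scores["test_score"]
print(f"accuracy: mean={fold_accuracy.mean():.3f}, std={fold_accuracy.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the result depends on which rows end up in the test fold. scikit-learn may print a convergence warning if 100 epochs are not enough for the optimizer to settle.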
Numerical results for the cross-validation folds
As you can see, although the values of the metrics themselves are good (high quality metrics, low error metric), they have a large standard deviation. The deviation could be reduced by, for example, increasing the sample size or tuning the network parameters (fewer or more epochs could yield a more accurate model). The area under the ROC curve (the so-called AUC) is also shown on the next chart, which plots the ROC curve.
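For readers unfamiliar with the metric, here is a small worked example of a ROC curve and its AUC, assuming scikit-learn. The four labels and scores are made up for illustration and are not results from the study.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy example (assumption): true flags and the scores a model assigned to them.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score gives one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The AUC is the area under those points.
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_true, y_score)

print(auc_from_curve)  # 0.75
print(auc_direct)      # 0.75
```

The value 0.75 has a direct interpretation: of the four possible (positive, negative) pairs, the model ranks the positive case higher in three. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.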
Features that most affect the explained variable
We selected ten features and ranked them from the most to the least important.
1. Carrying out the Schiller test in the past.
2. Carrying out the Hinselmann test in the past.
3. Cytology order.
4. The age of the first sexual intercourse.
5. Number of years of taking hormonal contraception.
6. Number of pregnancies.
7. Diagnosed genetic predisposition to HPV infection.
9. Diagnosed genetic predisposition to the development of cancer.
10. Number of diagnosed venereal diseases.
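The article does not say how this ranking was produced. One common model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below assumes scikit-learn and uses synthetic data with hypothetical feature names, built so that only the first feature determines the label.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical feature names and synthetic data (assumption).
feature_names = [f"feature_{i}" for i in range(5)]
X = rng.random(size=(300, 5))
# Only feature_0 determines the label, so it should come out on top.
y = (X[:, 0] > 0.5).astype(int)

model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
model.fit(X, y)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Indices of the features, sorted from the most to the least important.
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
```

On real data the importances should be computed on held-out rows rather than on the training set, and strongly correlated features can share or mask each other's importance, so such a ranking is a guide rather than a proof of causation.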
The variable indicating the need for a biopsy is modeled well by the features included in the set. The benefit of introducing such a model would be to support the doctor's biopsy decision with the history of other women, which would reduce the risk of recommending a biopsy, an invasive examination, to a person who does not need it.
This article was originally published in DLabs on Medium.