Blog DLabs

How technology can improve medicine: machine learning methods used to detect cervical cancer

In the United States alone, about 11,000 women are diagnosed with cervical cancer each year. According to the American Cancer Society’s predictions, there will be about 13,240 new cases of invasive cervical cancer diagnosed in the United States in 2018, and about 4,170 women will die from the disease. Cervical cancer was once one of the most common causes of cancer death for American women.

At the same time, the disease can be cured at an early stage of its development and many tests and examinations allow for quick diagnosis. One of the ways to diagnose this type of cancer is to perform a cervical biopsy. Unfortunately, this is a very invasive test for a woman.

Cervical cancer — a quiet killer…

Cervical cancer develops painlessly and over a long time, and it may not show any symptoms for many years. It is not an inherited or genetically conditioned disease. A commonly occurring virus, the human papillomavirus (HPV), is responsible for the development of cervical cancer. There are many types of HPV, but only some of them are carcinogenic and cause cervical cancer, and every woman, regardless of her age, is exposed to those types. Infection may occur during sexual intercourse, as well as through direct contact with the skin of an infected person. All sexually active women may come into contact with both low-risk HPV and the most dangerous types. About 80% of sexually active women become infected with HPV at least once in their lives.

Prevention — the first step

Cervical cancer can be easily detected even at an early stage of development.

a) Vaccinations against HPV — Primary prevention

An increasingly common method of cervical cancer prevention is widespread vaccination against HPV in people who have not yet become sexually active. According to the research conducted so far, vaccination reduces the risk of the disease to a considerable extent. So far, 10 European countries have issued official recommendations regarding vaccination against human papillomavirus.

b) Cytology — secondary prevention

Cytology is a test that allows detection of cervical cancer in the early stages. It involves microscopic evaluation of cells collected with a special cervical brush.

Thanks to it, even minor abnormalities in the cervix can be diagnosed. Early lesions detected in cervical cells can be completely cured. The cytologic examination does not prevent infection with HPV, the virus that causes cervical cancer. Instead, it helps to identify the early signs of the disease.

What if we’re past that stage?

A cervical biopsy is a surgical procedure involving the removal of a small amount of tissue from the cervix. The cervix is the lower, narrow end of the uterus located at the end of the vagina.

A cervical biopsy is usually ordered when irregularities are detected during a routine pelvic organ examination or a cytological examination. Irregularities may include the presence of human papillomavirus (HPV) or pre-cancer cells. Such conditions may contribute to the development of cervical cancer.

A cervical biopsy can detect pre-cancer cells or cervical cancer. Unfortunately, the procedure is invasive, sometimes painful, and usually performed under local or general anesthesia.

Is it possible to avoid it?

The latest technologies at your service, doctor!

Here we are: DLabs, experts in Data Science, Machine Learning and Artificial Intelligence. We have roots in pure science and have on board the best Data Science specialists in Poland: experienced developers, data scientists, and PhDs in mathematics. We wanted to improve medical solutions, and we found a way to predict the need for a biopsy.

We set out to show that a cervical biopsy can be recommended to a patient based on historical data. The machine learning method we used (neural networks) reaches 88% accuracy, which means that in 88 out of 100 cases the algorithm correctly predicted the need for a biopsy. The task of the algorithm is to support the decision-making process of the doctor, who may decide on a biopsy based on the historical data of all their patients and cases.

Other advantages are:

a. reducing the number of biopsies (and their cost) performed by the hospital by accurately predicting the need for them from data,

b. a smaller number of women exposed to invasive surgery.

We put forth a hypothesis: based on a woman's medical interview, we can model a variable indicating the need for a biopsy to detect changes that suggest cervical cancer.

To conduct the research, we worked with the following training data:

  1. 607 women, ages 18–84, with an average age of 30 years.

  2. History of their sexual contacts (number of partners, number of pregnancies, the age of the first sexual intercourse).

  3. An indicator of whether a person smokes and, if so, for how many years.

  4. History of contraception (hormonal, intrauterine devices).

  5. History of venereal diseases.

  6. In the case of women subjected to genetic tests — a predisposition to specific types of diseases.

  7. Other tests ordered: Hinselmann, Schiller, cytology.

  8. In total, there are 23 features in the collection.

Visualizations of sample flag data

Target variable

We modeled a variable that determines whether a woman should have a biopsy to diagnose cervical cancer or whether the test is not required. This is a boolean variable that accepts only two values: 0 means no biopsy and 1 means biopsy. In the initial data set, 7% of women had a biopsy and 93% did not.

Visualization of the explanatory variables in two dimensions

First, principal component analysis (PCA) was carried out to project the data from the 23-dimensional feature space onto a 2-dimensional space. The graph presents the PCA output, with color indicating the value of the target variable.
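As a rough illustration of that projection, here is a minimal PCA sketch in plain Python. It is not the code used in the study: the function name pca_2d, the power-iteration approach and the tiny 3-feature example are all assumptions made for the sake of a self-contained example.

```python
import math
import random


def pca_2d(rows, iterations=200, seed=0):
    """Project rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.uniform(-1, 1) for _ in range(d)]
        for _ in range(iterations):
            # Power iteration: multiply by the covariance matrix, then renormalise.
            v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        # Variance captured along the direction just found.
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflate so that the next pass converges to the second component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]

    # Coordinates of every centered row along the two components.
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Hypothetical 3-feature data standing in for the 23 features of the study.
rows = [[float(i), float(i % 3), 0.0] for i in range(12)]
projected = pca_2d(rows)
print(projected[0])
```

Each woman then becomes a point with two coordinates, which can be plotted and colored by the target variable.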

The problem from the perspective of machine learning

Because the target variable is a binary flag, this is a classification problem. The basic challenge is the low share of women with a recommended biopsy in the entire set (only 7%). Training on the entire set as it is could lead to a situation where the model benefits from predicting that no woman should be recommended a biopsy: it would score very well on accuracy, but it would have no value in the real world. This is a very common problem with medical data.
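The effect is easy to reproduce. The sketch below uses made-up labels with the same 7% to 93% split and a "model" that never recommends a biopsy. The helper names accuracy and recall are ours, not part of the study.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual biopsy cases (label 1) that the model finds."""
    found = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(found) / len(found) if found else 0.0


# Illustrative labels only: 7 biopsy cases per 100 women, as in the post.
y_true = [1] * 7 + [0] * 93
# A "model" that never recommends a biopsy.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.93
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks excellent, yet the model misses every single woman who needed a biopsy, which is why the classes have to be balanced before training.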

The course of the study

  1. Balancing the classes in the set (so that the classifier pays attention to the features, not to the size of a given class).

  2. Normalization of continuous variables by the min-max method (e.g., age).

  3. Training of a model based on feed-forward neural networks.

  4. Evaluation of the quality of the model with 10-fold cross-validation.

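The preprocessing and validation steps above could look roughly like the sketch below. It rests on several assumptions: the post does not say how the classes were balanced, so random oversampling of the minority class is used here; "10x" is read as splitting the data into 10 folds; and all function names are hypothetical.

```python
import random


def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    extra = len(by_class[majority]) - len(by_class[minority])
    resampled = by_class[minority] + [
        rng.choice(by_class[minority]) for _ in range(extra)
    ]
    out_rows = by_class[majority] + resampled
    out_labels = [majority] * len(by_class[majority]) + [minority] * len(resampled)
    return out_rows, out_labels


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Hypothetical data: one "age" feature, 7 biopsy cases out of 100 women.
ages = min_max_scale([18 + (i * 7) % 67 for i in range(100)])
rows = [[a] for a in ages]
labels = [1] * 7 + [0] * 93

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # 93 93

for train, test in k_fold_indices(len(balanced_rows), k=10):
    pass  # train a model on `train`, evaluate it on `test`
```

One design note: when oversampling is done before the split, copies of the same woman can land in both the training and the test fold, so a stricter setup would balance each training fold separately.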

Model used

  1. A neural network, fully connected.

  2. 100 hidden neurons.

  3. The learning rate is 0.1.

  4. Training ran for 100 epochs.
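To make these settings concrete, here is a minimal sketch of such a network in plain Python. Only the 100 hidden neurons, the learning rate of 0.1 and the 100 epochs come from the list above. The sigmoid activations, the cross-entropy loss, plain stochastic gradient descent, the class name TinyNet and the toy data are our assumptions, since the post does not describe them.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One fully connected hidden layer with sigmoid units, trained with SGD."""

    def __init__(self, n_inputs, n_hidden=100, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(n_inputs)
        s2 = 1.0 / math.sqrt(n_hidden)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-s2, s2) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, output

    def train(self, X, y, lr=0.1, epochs=100, seed=0):
        rng = random.Random(seed)
        order = list(range(len(X)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                hidden, out = self.forward(X[i])
                # Gradient of the cross-entropy loss at the output pre-activation.
                d_out = out - y[i]
                for j, h in enumerate(hidden):
                    # Backpropagate through hidden unit j before changing w2[j].
                    d_hid = d_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= lr * d_out * h
                    self.b1[j] -= lr * d_hid
                    for k, xk in enumerate(X[i]):
                        self.w1[j][k] -= lr * d_hid * xk
                self.b2 -= lr * d_out

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


# Toy data (hypothetical): one min-max scaled feature, label 1 when it exceeds 0.5.
X = [[i / 19] for i in range(20)]
y = [1 if x[0] > 0.5 else 0 for x in X]

net = TinyNet(n_inputs=1, n_hidden=100)
net.train(X, y, lr=0.1, epochs=100)
print([net.predict(x) for x in X])
```

In practice a library such as scikit-learn or Keras would replace the hand-written loops, but the hyperparameters map onto the same three knobs.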

Numerical results for cross-validation collections

As you can see, although the metric values themselves are good (high quality metrics, low error metric), they have a large standard deviation. The deviation could be reduced by, for example, increasing the sample size or tuning the network parameters (fewer or more epochs could yield a more accurate model). The area under the ROC curve (the so-called AUC) is also shown on the next chart, which visualizes the ROC curve.
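For readers who want to compute these numbers themselves, the sketch below shows one way to get the spread of a metric across folds and the AUC. The per-fold values are hypothetical and are not the results of the study, and the roc_auc helper is our own pairwise formulation rather than the code behind the chart.

```python
import statistics


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive scores above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-fold accuracies, NOT the numbers from the study.
fold_accuracy = [0.91, 0.78, 0.95, 0.84, 0.88]
print(statistics.mean(fold_accuracy), statistics.stdev(fold_accuracy))

# Four made-up patients: true labels and the scores a model gave them.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A large standard deviation across folds means the reported average depends strongly on which women end up in the test fold, which is typical for small data sets.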

ROC curve

Features that most affect the explained variable

We selected ten features and ranked them from the most to the least important.

  1. Carrying out the Schiller test in the past.

  2. Carrying out the Hinselmann test in the past.

  3. A cytology test having been ordered.

  4. The age of the first sexual intercourse.

  5. Number of years of taking hormonal contraception.

  6. Number of pregnancies.

  7. Diagnosis of the genetic predisposition to develop the HPV virus.

  8. Age.

  9. Diagnosed genetic predisposition for the development of cancer.

  10. Number of diagnosed venereal diseases.

The variable describing the need for a biopsy is modeled well by the features included in the set. Such a model could support the doctor's biopsy decision with the history of other women, which would reduce the risk of recommending a biopsy, an invasive examination, to a person who does not need it.

Data from:

