New Freely Available Lung and Colon Cancer Image Dataset for ML Researchers

Andrew A Borkowski
3 min read · Jan 5, 2020

Colon Adenocarcinoma

The rapidly advancing field of Machine Learning makes it possible to analyze large datasets and uncover insights and connections that were previously out of reach. Because these methods are broadly applicable, Machine Learning has been used in many fields to find patterns hidden in a sea of complex data. Healthcare in particular stands to benefit from applying Machine Learning to large datasets and putting the results to practical use.

Despite the promise Machine Learning shows in Healthcare and related fields, a bottleneck slows the rate of progress: access to the high-quality datasets needed to train and test Machine Learning algorithms. Numerous datasets exist, but few are easily accessible to researchers. This is mainly due to the nature of Healthcare datasets themselves: because they contain identifiable information, access is restricted by several measures that protect patient privacy.

To address the data-access bottleneck while maintaining the privacy of our patients, we are providing the Lung and Colon Cancer Histopathological Image Dataset (LC25000) to all ML researchers, with all personal patient information scrubbed. All images in the dataset are de-identified, HIPAA compliant, validated, and freely available for download, so AI researchers can use them in any way they see fit without having to worry about violating patient privacy laws. The dataset contains 25,000 color images distributed across 5 classes. Each class contains 5,000 images of one of the following histologic entities: colon adenocarcinoma, benign colonic tissue, lung adenocarcinoma, lung squamous cell carcinoma, and benign lung tissue.

The significance of the tissues selected for the dataset should not be overlooked. Together, lung and colon cancers are the two most common causes of cancer deaths in the United States. With this dataset, data scientists could provide valuable information that, if put into practice, could potentially save millions of lives, especially in areas profoundly affected by pathologist shortages or a significant lack of resources. We encourage other teams to make their datasets available to help advance the ever-growing synergy between Machine Learning and Healthcare.

The LC25000 dataset contains 25,000 color images in five classes of 5,000 images each. All images are 768 x 768 pixels in size and are in JPEG file format. The dataset can be downloaded as a 1.85 GB zip file, LC25000.zip. After unzipping, the main folder lung_colon_image_set contains two subfolders: colon_image_sets and lung_image_sets. The subfolder colon_image_sets contains two secondary subfolders: colon_aca, with 5,000 images of colon adenocarcinomas, and colon_n, with 5,000 images of benign colonic tissues. The subfolder lung_image_sets contains three secondary subfolders: lung_aca, with 5,000 images of lung adenocarcinomas, lung_scc, with 5,000 images of lung squamous cell carcinomas, and lung_n, with 5,000 images of benign lung tissues.
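
For readers who want to check their download against the folder layout described above, here is a minimal Python sketch that walks the unzipped folder and pairs each image with its class. The function names list_images and count_per_class are hypothetical helpers written for this illustration, not part of the dataset, and the accepted file extensions (.jpeg and .jpg) are an assumption; adjust them if your copy differs.

```python
from collections import Counter
from pathlib import Path

# Folder layout as described in the article: class folder -> histologic entity.
CLASS_DIRS = {
    "colon_image_sets/colon_aca": "colon adenocarcinoma",
    "colon_image_sets/colon_n": "benign colonic tissue",
    "lung_image_sets/lung_aca": "lung adenocarcinoma",
    "lung_image_sets/lung_scc": "lung squamous cell carcinoma",
    "lung_image_sets/lung_n": "benign lung tissue",
}

# Assumption: images carry one of these extensions (the article only says "jpeg").
IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def list_images(root):
    """Return a list of (image_path, label) pairs found under the dataset root."""
    root = Path(root)
    samples = []
    for rel_dir, label in CLASS_DIRS.items():
        class_dir = root / rel_dir
        if not class_dir.is_dir():
            raise FileNotFoundError(f"Expected class folder is missing: {class_dir}")
        # Sort for a reproducible ordering across runs and platforms.
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((path, label))
    return samples


def count_per_class(samples):
    """Count how many images were found for each label."""
    return Counter(label for _, label in samples)
```

Calling count_per_class(list_images("lung_colon_image_set")) on a complete download should report 5,000 images for each of the five labels, 25,000 in total. The resulting (path, label) pairs can then be handed to whichever image-loading pipeline you prefer.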

You can read more about the LC25000 dataset here and find a download hyperlink here.
