Data mining, or “Knowledge Discovery in Databases”, is the process of discovering patterns in large data sets using methods from artificial intelligence, machine learning, statistics, and database systems.
The overall goal of a data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Here is a simple but fascinating example of how data mining helped dispel wrong assumptions and conclusions about girls, and drive action with tremendous social impact.
For a long time, the high dropout rate of girls in schools in developing countries was explained with sociological and cultural hypotheses: girls are not encouraged by their societies, parents treat girls differently, girls are pushed to marry earlier or are loaded with much more work than boys. Others, drawing on economic theories, speculated that girls' education is not seen by those societies as a good investment.
Then, in the 1990s, a group of young data miners plugged into several schools' records on absenteeism, and slowly discovered that girls were missing school for a few days every month, with stunning regularity and predictability. A little more analysis revealed that girls were missing school mostly during their menstrual period, because there was no safe way for them to feel clean and comfortable enough to attend school during those days.
The consequence: “millions of girls living in developing countries like Uganda skip up to 20% of the school year simply because they cannot afford to buy mainstream sanitary products when they menstruate. This deliberate absenteeism has enormous consequences on girls’ education and academic potential.”
In Western countries and in Asia, companies and governments are using data mining to make great discoveries. We can do the same in Africa. There are numerous free tools to do so, and I have collected the best of them here for you. Try them: start slowly, but persist with patience. It could yield amazing and transformational results, just as Afripads is now helping African girls stay in school. (You can also download the MIT Open course materials on Data Mining here.)

1. RapidMiner
RapidMiner is arguably the world's leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Thousands of RapidMiner applications in more than 40 countries give their users a competitive edge.
2. RapidAnalytics

Built around RapidMiner as a powerful engine for analytical ETL, data analysis, and predictive reporting, the business analytics server RapidAnalytics is the key product for all business-critical data analysis tasks and a milestone for business analytics.
3. Weka

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
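Weka's schemes are Java classes, invoked from its GUI or from your own code. As a language-neutral sketch of what one of its simplest classification schemes does (nearest-neighbour, which Weka ships as the IBk classifier), here is a minimal pure-Python version with invented data; it illustrates the idea only, not Weka's API:

```python
import math

def nearest_neighbour(train, query):
    """Classify `query` with the label of its closest training example.
    `train` is a list of ((feature, ...), label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Invented (height_cm, weight_kg) examples with toy labels.
train = [((150, 50), "short"), ((155, 52), "short"),
         ((180, 80), "tall"), ((185, 85), "tall")]
print(nearest_neighbour(train, (178, 78)))  # nearest training example is a "tall" one
```

In Weka you would load the same data as an ARFF file and pick the scheme from the Explorer; the logic underneath is this simple distance comparison.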
4. PSPP

PSPP is a program for statistical analysis of sampled data. It has both a graphical user interface and a conventional command-line interface. It is written in C, uses the GNU Scientific Library for its mathematical routines, and plotutils for generating graphs. It is a free replacement for the proprietary program SPSS (from IBM), which is sold on the promise that you can “predict with confidence what will happen next so that you can make smarter decisions, solve problems and improve outcomes.”
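PSPP is driven by SPSS-style syntax commands such as DESCRIPTIVES. Purely for illustration, the kind of summary such a command prints can be sketched with Python's standard library (the sample values below are invented):

```python
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]

n = len(sample)
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
sem = stdev / n ** 0.5            # standard error of the mean

# Roughly what a DESCRIPTIVES-style summary reports for one variable.
print(f"N={n}  Mean={mean:.3f}  Std Dev={stdev:.3f}  S.E. Mean={sem:.3f}")
```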
5. KNIME

KNIME is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1,000 modules (nodes).
6. Orange

Orange is an open-source data visualization and analysis tool for novices and experts alike. It supports data mining through visual programming or Python scripting, offers components for machine learning and add-ons for bioinformatics and text mining, and is packed with features for data analytics.
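To give a flavour of the scripting route, here is the kind of small clustering step an Orange script might orchestrate, written in plain Python with invented one-dimensional data so it runs on its own (it deliberately does not use Orange's API):

```python
def kmeans_1d(values, centres, iters=20):
    """Minimal k-means on 1-D data: assign each point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

# Invented measurements forming two obvious groups around 1 and 8.
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
print(kmeans_1d(data, centres=[0.0, 10.0]))
```

In Orange's visual-programming mode the same operation is a k-Means widget dropped onto the canvas; the scripting mode lets you chain such steps programmatically.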
7. Apache Mahout
Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
Currently Mahout supports mainly four use cases: recommendation mining takes users’ behavior and from that tries to find items users might like; clustering takes e.g. text documents and groups them into clusters of topically related documents; classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category; frequent itemset mining takes sets of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.
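Mahout's value is running these algorithms at Hadoop scale, but the last use case, frequent itemset mining, is easy to illustrate on a single machine. A minimal sketch with invented shopping-cart data:

```python
from itertools import combinations
from collections import Counter

# Invented shopping carts: each basket is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs reaching a minimum support of 3 baskets count as "frequent".
frequent = [pair for pair, n in pair_counts.items() if n >= 3]
print(frequent)  # [('bread', 'milk')]
```

The same counting idea, distributed over many machines, is what makes it possible to mine millions of baskets.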
8. jHepWork

jHepWork (or “jWork”) is an environment for scientific computation, data analysis and data visualization designed for scientists, engineers and students. The program incorporates many open-source software packages into a coherent interface built around scripting, rather than a GUI-only or macro-based approach.

jHepWork can be used wherever the analysis of large numerical data volumes, data mining, statistical analysis and mathematics are essential (natural sciences, engineering, modeling and analysis of financial markets).
9. Rattle

Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

It is a free and open-source data mining toolkit written in the statistical language R using the GNOME graphical interface. It runs under GNU/Linux, Mac OS X, and MS Windows. Rattle is used in business, government and research, and for teaching data mining in Australia and internationally.