Data analysis systems are now widespread.
Stackexchange.com is a growing network of communities that exchange information in various fields. The purpose of the resource is to collect high-quality questions and answers based on the experience of each community. As a question-and-answer website, stackexchange.com is an inherently dynamic source of knowledge: the communities update information daily, so at any moment the website can be considered a wealth of up-to-date knowledge.
Interest in structuring this information and obtaining an overall picture of it can arise in many situations.
The following technologies were used in the project:
Big data analysis: Apache Hadoop, Apache Mahout (machine learning library)
Cloud services: Amazon Elastic MapReduce, Amazon S3, Amazon EC2 (for databases and web applications)
Databases: Neo4j graph database, MongoDB cluster
Java and web technologies: core Java, MapReduce API, Struts 2, AngularJS, D3.js visualization library, HTML, CSS, Twitter Bootstrap
Unit testing: JUnit, MRUnit, approval tests for MapReduce
The stackexchange.com dump is the data source. A cluster of computers is used for data processing; the number of machines can be changed depending on the amount of data. The system has been deployed in the Amazon cloud using the Amazon Elastic MapReduce service, which provides Hadoop clusters on demand.
The HDFS distributed file system was installed on the cluster machines. It stores large files across several machines simultaneously and replicates the data. The Apache Hadoop framework, deployed on the same machines as HDFS, runs distributed computations on the cluster and takes advantage of data locality. On top of the cluster and the Apache Hadoop framework, the data analysis application is installed. It contains the implemented algorithms for data preparation, processing and analysis, and uses the Apache Mahout machine learning library.
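As an illustration only of how such an application runs computations on the cluster, the sketch below shows a generic Hadoop job submission; the job name, the HDFS paths and the identity mapper/reducer placeholders are assumptions made for the example, not the project's actual classes.

```java
// Illustrative sketch of submitting a MapReduce job to the Hadoop cluster.
// Paths and classes are placeholders; the real application wires in its own
// preparation/processing mappers and reducers.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnalysisJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "stackexchange-analysis");
    job.setJarByClass(AnalysisJobDriver.class);
    job.setMapperClass(Mapper.class);    // identity mapper placeholder
    job.setReducerClass(Reducer.class);  // identity reducer placeholder
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/stackexchange/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/stackexchange/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```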
For all stackexchange.com websites the data is published in a single XML format: questions, answers, comments, user data, ratings, etc.
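For illustration, the dump files (for example Posts.xml) keep one record per <row> element, with the record's fields stored as XML attributes. The sketch below is not part of the described system; it streams over such a file with the standard Java StAX API, and the file name and the Title attribute follow the publicly documented dump layout.

```java
// Illustrative sketch: stream over a stackexchange.com dump file and print question titles.
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PostsDumpReader {
  public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    try (FileInputStream in = new FileInputStream("Posts.xml")) {
      XMLStreamReader reader = factory.createXMLStreamReader(in);
      while (reader.hasNext()) {
        // Each record is a <row .../> element whose fields are attributes.
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && "row".equals(reader.getLocalName())) {
          String title = reader.getAttributeValue(null, "Title");
          if (title != null) {        // rows without a Title (e.g. answers) are skipped
            System.out.println(title);
          }
        }
      }
      reader.close();
    }
  }
}
```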
Algorithm for stackexchange.com data processing and analysis:
The text preprocessing algorithm:
Clustering, or unsupervised classification, of text documents is the process of grouping documents by content similarity, where the groups are determined automatically. Traditionally, the vector space model is used as the document model: each document is represented as a vector over all, or the most frequently used, words in the document (or over n-grams, i.e. sequences of several words), and document similarity is computed with a distance measure such as the Euclidean distance. TF-IDF weighting was chosen for vectorization, since it takes into account word frequencies over the entire data set: it reduces the weight of commonly used words and, conversely, increases the weight of words that are unique to each question.
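A minimal single-machine sketch of this TF-IDF weighting is given below. The project itself uses a distributed implementation (Apache Mahout); the method and variable names here are invented for the example.

```java
// Illustrative sketch of TF-IDF: tfidf(t, d) = tf(t, d) * log(N / df(t)).
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfExample {

  static Map<String, Double> tfIdf(List<String> docTokens,
                                   Map<String, Integer> documentFrequency,
                                   int totalDocuments) {
    // Term frequency inside this document.
    Map<String, Integer> tf = new HashMap<>();
    for (String token : docTokens) tf.merge(token, 1, Integer::sum);

    Map<String, Double> weights = new HashMap<>();
    for (Map.Entry<String, Integer> e : tf.entrySet()) {
      int df = documentFrequency.getOrDefault(e.getKey(), 1);
      // Words frequent across the whole data set get a low idf, rare words a high one.
      double idf = Math.log((double) totalDocuments / df);
      weights.put(e.getKey(), e.getValue() * idf);
    }
    return weights;
  }

  public static void main(String[] args) {
    Map<String, Integer> df = new HashMap<>();
    df.put("how", 800); df.put("to", 950); df.put("serialize", 12);
    df.put("a", 990); df.put("java", 400); df.put("object", 300);
    // "serialize" is rare in the corpus, so it receives the highest weight.
    System.out.println(tfIdf(Arrays.asList("how", "to", "serialize", "a", "java", "object"), df, 1000));
  }
}
```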
Distributed version of K-means clustering algorithm:
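The project's own distributed implementation is not reproduced here; as a rough illustration of the general scheme, one K-means iteration expressed as a Hadoop map/reduce pair could look like the sketch below. The class names, the comma-separated vector format and the way the centroids are made available to the mapper are simplifications, not the actual code.

```java
// Illustrative sketch of one distributed K-means iteration:
// the mapper assigns each document vector to its nearest centroid,
// the reducer averages the assigned vectors into the updated centroid.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Simplification: in a real job the centroids from the previous iteration
  // would be loaded in setup(), e.g. from a file distributed to every node.
  static double[][] centroids;

  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] point = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = squaredDistance(point, centroids[c]);
        if (d < bestDist) { bestDist = d; best = c; }
      }
      // Emit (cluster id, point) so the reducer can recompute that centroid.
      context.write(new IntWritable(best), value);
    }
  }

  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text t : points) {
        double[] p = parse(t.toString());
        if (sum == null) sum = new double[p.length];
        for (int i = 0; i < p.length; i++) sum[i] += p[i];
        n++;
      }
      for (int i = 0; i < sum.length; i++) sum[i] /= n;    // new centroid coordinates
      context.write(clusterId, new Text(format(sum)));
    }
  }

  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
    return v;
  }

  static String format(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < v.length; i++) { if (i > 0) sb.append(','); sb.append(v[i]); }
    return sb.toString();
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}
```

Such iterations are repeated until the centroids stop moving or a maximum number of iterations is reached; Apache Mahout provides a ready-made distributed implementation of this loop.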
Data storage
The analysis results are stored in two databases: the graph-oriented Neo4j database and the document-oriented MongoDB. This makes it possible to store the structure of the objects and their interrelations efficiently in graph form, while the content of the objects (the question texts) is stored as documents.
The database servers have been deployed in the Amazon cloud using the Amazon EC2 service.
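A minimal sketch of this storage scheme is shown below, assuming the current MongoDB Java driver; the database, collection and field names are invented for the example, and the cluster membership that would go into Neo4j is only indicated as a Cypher pattern in a comment.

```java
// Illustrative sketch: store a clustered question as a MongoDB document.
// The graph side would record the relationship in Neo4j, e.g. with the Cypher
// pattern (:Question {id: 42})-[:BELONGS_TO]->(:Cluster {id: 7}).
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class StoreQuestion {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> questions =
          client.getDatabase("stackexchange").getCollection("questions");
      questions.insertOne(new Document("questionId", 42)
          .append("title", "Example question title")
          .append("body", "...")
          .append("clusterId", 7));
    }
  }
}
```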
Data migration and visualization
The web application displays the results in the form of charts, graphs and diagrams, while the migration server moves the data into HDFS. After processing, the results are also stored in HDFS; a dedicated import utility, which runs over MapReduce, adds the results to the databases.
During system operation the data is updated, and with each update there is no need to retransfer all the data into HDFS and rerun the expensive computations. Redis, a fast in-memory database, is used to solve this problem: while copying the data, the migration server stores a hash key for each record, and when data is reloaded into HDFS it queries the in-memory database to check whether the current record was already loaded earlier.
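A rough sketch of this duplicate check is given below, assuming the Jedis client and a SHA-256 hash of the record; the Redis key name and the class and method names are invented for the example.

```java
// Illustrative sketch: before copying a record to HDFS, ask the in-memory
// database whether the record's hash was stored during an earlier load.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import redis.clients.jedis.Jedis;

public class IncrementalLoadCheck {

  private final Jedis redis = new Jedis("localhost", 6379);

  /** Returns true if the record has not been loaded yet and marks it as loaded. */
  public boolean shouldLoad(String record) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] digest = md.digest(record.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    String hash = hex.toString();

    if (redis.sismember("loaded-records", hash)) {
      return false;                       // already transferred to HDFS earlier
    }
    redis.sadd("loaded-records", hash);   // remember the record for future runs
    return true;
  }
}
```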
To display the results, a web application was implemented. The server part is written in Java and makes requests to the graph-oriented and document-oriented databases; the client part is implemented with JavaScript (AngularJS, jQuery), HTML, CSS and Twitter Bootstrap.
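As a sketch of the server side only (the class, the property name and the stubbed data are assumptions, not the project's code), a Struts 2 action serving data to the charts could look like this; in practice it would query Neo4j and MongoDB and return the result, for example as JSON consumed by AngularJS and D3.js.

```java
// Illustrative sketch of a Struts 2 action that exposes clustering results
// to the client-side charts. The data here is stubbed; the real action would
// query the graph-oriented and document-oriented databases.
import com.opensymphony.xwork2.ActionSupport;
import java.util.ArrayList;
import java.util.List;

public class ClustersAction extends ActionSupport {

  private final List<String> clusterNames = new ArrayList<>();

  @Override
  public String execute() {
    clusterNames.add("example-cluster-1");   // stub values for illustration
    clusterNames.add("example-cluster-2");
    return SUCCESS;
  }

  // Exposed as a bean property so it can be rendered (for example as JSON).
  public List<String> getClusterNames() {
    return clusterNames;
  }
}
```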
Amounts of data processed by the system
Object | 7z archive size | Unpacked data size
Dump of all data from stackexchange.com | ~15.7 GB | ~448 GB
stackoverflow.com website data | ~12.1 GB | ~345 GB
serverfault.com website data | ~303 MB | ~8.6 GB
The developed platform makes it possible to analyse the data of the stackexchange.com website and gives an overall picture of its content. The system collects and downloads the initial data, prepares it for processing, clusters it, and stores and interprets the results. During the system's development the following tasks were solved: collection and loading of the initial data; data storage across several machines, with fault protection in case one of the hard drives fails; data preparation for processing, including filtering and vectorization; data processing on a cluster of several machines; and storage and interpretation of the results. Using Amazon cloud services provides a great advantage, since these services eliminate the need to configure the whole infrastructure manually.
Company’s achievements during the project