The project is a platform that automates the collection and processing of data from various business portals. It also supports analysis of the relationships between account data drawn from different sources.
Introducing such a platform increases the speed and efficiency of the Sales and Marketing Department, which has been confirmed by the results of its deployment in our company. The application significantly increases the amount of information collected and the speed of collection and processing per unit of time, and it automates relationship analysis, eliminating errors caused by the human factor.
The Sales and Marketing Department, which acted as the project's customer, rated the practical value of the application highly. Although its core functionality is already in place, the project continues to evolve: the team plans not only to maintain the application but also to extend its capabilities based on the proposals received from the customer.
The application is a crawler that scans sites containing information about potential business contacts and collects data according to predefined search parameters.
The crawler's data source can be any business portal, social network, or site that contains contact information, for example LinkedIn, angel.co, Crunchbase, etc. Data from all systems is collected, compared, and processed to find duplicates and to check whether the records already exist in the customer database; the result is then merged into the crawler's own database. Before launching the crawler, the user configures which sites to collect information from and exactly what information to collect, so the tuning is quite fine-grained.
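A minimal sketch of that merge step, under two assumptions not spelled out in the text: a hypothetical Contact record, and a normalized e-mail address as the duplicate key (the real matching rules may be richer).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical contact model; the real data model is not shown in the source.
record Contact(String name, String company, String email, String sourceSite) {}

public class Deduplicator {

    // Assumed duplicate key: a normalized e-mail address.
    private static String key(Contact c) {
        return c.email().trim().toLowerCase();
    }

    /**
     * Merges freshly crawled contacts into the existing database snapshot,
     * keeping the record already in the customer database when keys collide.
     */
    public static List<Contact> merge(List<Contact> existing, List<Contact> crawled) {
        Map<String, Contact> byKey = existing.stream()
                .collect(Collectors.toMap(Deduplicator::key, Function.identity(),
                        (first, second) -> first));
        for (Contact c : crawled) {
            byKey.putIfAbsent(key(c), c); // skip duplicates of known records
        }
        return List.copyOf(byKey.values());
    }
}
```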
Here is an example of a crawler scenario. We go to angel.co and scan the list of companies under the following condition: if a company received investment on this site, check whether its LinkedIn profile has a vacancy for a Java developer; if there is such a vacancy, add the company's contact data to the database of potential customers.
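A Selenium sketch of that scenario; the URLs and CSS selectors below are illustrative placeholders rather than the production ones, and the database write is stubbed out.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FundedCompanyScan {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Step 1: collect the names of funded companies from angel.co.
            // URL and selector are illustrative, not the real ones.
            driver.get("https://angel.co/companies?funded=true");
            List<String> names = driver.findElements(By.cssSelector(".company-name"))
                    .stream().map(WebElement::getText).toList();

            // Step 2: for each company, check LinkedIn for a Java developer vacancy.
            for (String name : names) {
                String query = URLEncoder.encode(name, StandardCharsets.UTF_8);
                driver.get("https://www.linkedin.com/jobs/search/?keywords=Java%20developer&company=" + query);
                boolean hasJavaVacancy = !driver.findElements(By.cssSelector(".job-card")).isEmpty();
                if (hasJavaVacancy) {
                    saveProspect(name); // stub for the real database write
                }
            }
        } finally {
            driver.quit();
        }
    }

    private static void saveProspect(String companyName) {
        System.out.println("Prospect added: " + companyName);
    }
}
```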
The Selenium WebDriver automation tool is used to read information from sites. Search parameters are stored in Excel files, either locally or in cloud storage such as Google Drive, and are loaded when the application starts.
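For illustration, a sketch of loading search filters from such an Excel file with Apache POI; the two-column layout (keyword, region) is an assumption, and fetching the file from Google Drive is omitted here.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FilterLoader {

    /** Reads one search filter per row; assumes text cells and a header row. */
    public static List<String[]> loadFilters(String path) throws IOException {
        List<String[]> filters = new ArrayList<>();
        try (FileInputStream in = new FileInputStream(path);
             Workbook workbook = new XSSFWorkbook(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                if (row.getRowNum() == 0) continue; // skip the header row
                Cell keyword = row.getCell(0);
                Cell region = row.getCell(1);
                if (keyword == null) continue;
                filters.add(new String[] {
                        keyword.getStringCellValue(),
                        region == null ? "" : region.getStringCellValue()
                });
            }
        }
        return filters;
    }
}
```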
The application runs on the company's server as individual Jenkins builds. Each build is configured with the general crawler launch parameters and the selected search parameters for the target site. When a run finishes, the crawler e-mails users a report on the results together with Excel files containing the collected data.
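A sketch of the e-mail step using the JavaMail API; the SMTP host, addresses, and credentials are placeholders.

```java
import javax.activation.DataHandler;
import javax.activation.FileDataSource;
import javax.mail.*;
import javax.mail.internet.*;
import java.util.Properties;

public class ReportMailer {

    /** Sends the run report with the result Excel file attached. */
    public static void sendReport(String recipient, String reportText, String attachmentPath)
            throws MessagingException {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.example.com"); // placeholder host
        props.put("mail.smtp.auth", "true");
        props.put("mail.smtp.starttls.enable", "true");

        Session session = Session.getInstance(props, new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("bot@example.com", "password"); // placeholders
            }
        });

        Message message = new MimeMessage(session);
        message.setFrom(new InternetAddress("bot@example.com"));
        message.setRecipients(Message.RecipientType.TO, InternetAddress.parse(recipient));
        message.setSubject("Crawler run report");

        MimeBodyPart body = new MimeBodyPart();
        body.setText(reportText);

        MimeBodyPart attachment = new MimeBodyPart();
        attachment.setDataHandler(new DataHandler(new FileDataSource(attachmentPath)));
        attachment.setFileName("results.xlsx");

        Multipart multipart = new MimeMultipart();
        multipart.addBodyPart(body);
        multipart.addBodyPart(attachment);
        message.setContent(multipart);

        Transport.send(message);
    }
}
```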
The general launch parameters include the amount of data to read, bot logins and passwords, the name of the filter file, the date and time the build is launched, etc.
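Under Jenkins such values would typically arrive as build parameters; a minimal sketch reading them from system properties, with all property names invented for illustration.

```java
import java.util.Objects;

/** One build's launch parameters. The property names are illustrative;
    in Jenkins they would be passed in as build parameters / -D properties. */
public record LaunchConfig(int maxRecords, String botLogin, String botPassword, String filterFile) {

    static LaunchConfig fromSystemProperties() {
        return new LaunchConfig(
                Integer.parseInt(System.getProperty("crawler.maxRecords", "1000")),
                Objects.requireNonNull(System.getProperty("crawler.botLogin"), "crawler.botLogin"),
                Objects.requireNonNull(System.getProperty("crawler.botPassword"), "crawler.botPassword"),
                System.getProperty("crawler.filterFile", "filters.xlsx"));
    }
}
```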
In the current configuration, the application performs two types of search: a search for information about people and a search for companies based on the vacancies they have published.
The system performs the following functions:
Several search queries can be launched at a time through the application (for both search types, different regions, etc.); see the sketch after this list.
The application starts automatically based on the specified parameters; the user influences the result only by changing these parameters in advance.
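A minimal sketch of running several configured queries concurrently; the SearchType names and the dispatch stub are illustrative, not taken from the project.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// The two search modes described above; the names are illustrative.
enum SearchType { PEOPLE, COMPANIES_BY_VACANCY }

record SearchQuery(SearchType type, String region, String keywords) {}

public class QueryRunner {

    /** Runs several configured queries at the same time, one worker per query. */
    public static void runAll(List<SearchQuery> queries) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        for (SearchQuery q : queries) {
            pool.submit(() -> execute(q));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all crawls to finish
    }

    private static void execute(SearchQuery q) {
        // Stub: dispatch to the people search or the company-by-vacancy search.
        System.out.printf("Running %s search for '%s' in %s%n",
                q.type(), q.keywords(), q.region());
    }
}
```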
Frameworks, libraries: Lombok, Log4J, Selenium WebDriver, TestNG.
APIs used: Apache POI, JavaMail, Google API (Drive, Gmail), REST-assured (2captcha API).
Infrastructure: Apache Subversion (SVN), Jenkins, IntelliJ IDEA.
Errors in the application's operation are identified by examining the reports and data files sent to users: the data actually collected by the crawler is compared against the data on the crawled site.
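Conceptually this check reduces to a set difference between what is on the site and what the crawler reported; a trivial sketch:

```java
import java.util.HashSet;
import java.util.Set;

public class ReportChecker {

    /** Compares a manually gathered sample from the site against the
        crawler's report; returns the records missing from the report. */
    public static Set<String> missingRecords(Set<String> onSite, Set<String> inReport) {
        Set<String> missing = new HashSet<>(onSite);
        missing.removeAll(inReport);
        return missing;
    }
}
```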