Business component of the project
The project is a platform that automates the collection and processing of data from various business portals. It also supports the analysis of relationships between account data from different sources.
Introducing such an automated platform increases the speed and efficiency of the Sales and Marketing Department, as confirmed by the results of deploying the platform in our company. The application significantly increases the amount of information collected and the speed at which it is collected and processed. It also automates the analysis of relationships, which eliminates errors caused by the human factor.
The Sales and Marketing Department, which acted as the customer of the project, highly appreciated the practical value of the application. Although the main functionality has already been implemented, the project continues to develop. The project team plans not only to keep the application operational, but also to expand its capabilities based on proposals received from the customer.
Technical description of the project
The developed application is a crawler that scans and collects data from sites containing information about potential business contacts according to predefined search parameters.
The data source for the crawler can be any business portal, social network or site that contains contact information, for example LinkedIn, angel.co or Crunchbase. Data from all systems is collected, compared and processed to find duplicates and to check whether the records already exist in the customer database. After this, the data is merged with the existing crawler database. To launch the crawler, one configures which sites to collect information from and what information to collect, so the crawler can be tuned quite finely.
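The duplicate check and merge step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Contact` record, its fields and the choice of the normalized website as the matching key are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical record of one collected contact; field names are illustrative.
record Contact(String company, String website, String source) {

    // Normalized key used to recognize the same company across portals.
    // Assumes website is never null; an empty website falls back to the company name.
    String key() {
        String site = website.toLowerCase(Locale.ROOT)
                .replaceFirst("^https?://", "")
                .replaceFirst("^www\\.", "")
                .replaceFirst("/+$", "");
        return site.isEmpty() ? company.trim().toLowerCase(Locale.ROOT) : site;
    }
}

final class ContactMerger {

    // Returns the collected contacts that are neither duplicates of each other
    // nor already present among the known customer keys.
    static List<Contact> newContacts(List<Contact> collected, Set<String> knownCustomerKeys) {
        Map<String, Contact> unique = new LinkedHashMap<>();
        for (Contact c : collected) {
            if (!knownCustomerKeys.contains(c.key())) {
                unique.putIfAbsent(c.key(), c); // the first occurrence wins
            }
        }
        return new ArrayList<>(unique.values());
    }
}
```

Only the contacts returned by `newContacts` would then be appended to the crawler database. A real matching rule would probably compare more fields than a single key.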
Here is an example of a crawler script. We go to angel.co and scan the list of companies with the following condition: “If a company received investments on this site, I want to see whether its LinkedIn profile lists a vacancy for a Java developer. If there is such a vacancy, we add the contact data of this company to the database of potential customers.”
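The condition in this example can be expressed as two composable predicates, one per site. The sketch below is only an illustration of the rule's logic over data that has already been read. The `CompanyProfile` record and the `LeadRules` helper are hypothetical names and do not reflect how the crawler actually stores its configuration.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical snapshot of what the crawler has read about one company.
record CompanyProfile(String name, boolean receivedInvestment, List<String> openVacancies) {}

final class LeadRules {

    // Condition checked on the first site: the company received investments.
    static final Predicate<CompanyProfile> FUNDED = CompanyProfile::receivedInvestment;

    // Condition checked on the second site: a vacancy title mentions the keyword.
    static Predicate<CompanyProfile> hasVacancy(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        return p -> p.openVacancies().stream()
                .anyMatch(v -> v.toLowerCase(Locale.ROOT).contains(needle));
    }

    // Names of the companies that satisfy the whole rule.
    static List<String> selectLeads(List<CompanyProfile> profiles, Predicate<CompanyProfile> rule) {
        return profiles.stream()
                .filter(rule)
                .map(CompanyProfile::name)
                .collect(Collectors.toList());
    }
}
```

The rule from the example then reads `LeadRules.FUNDED.and(LeadRules.hasVacancy("Java developer"))`, and further conditions can be chained in the same way.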
The Selenium WebDriver automation tool is used to read information from the site. Search parameters are stored in Excel files in local or cloud storage, such as Google Drive, and are loaded from storage when the application is launched.
The application is launched on the company’s server using Jenkins, as individual builds. Each build is configured with the general crawler launch parameters and the selected data search parameters for the site. When the work is complete, the crawler e-mails users a report on the results together with Excel files containing the data that was read.
The general parameters for launching the application include the amount of data to be read, bot logins and passwords, the name of the file with filters, the date and time the build was launched, etc.
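As an illustration, the general launch parameters could be gathered into one validated object at the start of a build. The sketch below assumes the parameters arrive as a string map (for example from `System.getenv()`, since Jenkins typically exposes build parameters as environment variables). The key names such as `MAX_RECORDS` and the default value are invented for the example.

```java
import java.time.LocalDateTime;
import java.util.Map;

// Hypothetical holder for the general launch parameters; all key names are illustrative.
record LaunchParameters(int maxRecords, String botLogin, String botPassword,
                        String filterFile, LocalDateTime startedAt) {

    // Builds the parameters from a string map and fails fast on missing or invalid values.
    static LaunchParameters from(Map<String, String> params, LocalDateTime startedAt) {
        int maxRecords = Integer.parseInt(params.getOrDefault("MAX_RECORDS", "100"));
        if (maxRecords <= 0) {
            throw new IllegalArgumentException("MAX_RECORDS must be positive");
        }
        return new LaunchParameters(maxRecords,
                required(params, "BOT_LOGIN"),
                required(params, "BOT_PASSWORD"),
                required(params, "FILTER_FILE"),
                startedAt);
    }

    private static String required(Map<String, String> params, String key) {
        String value = params.get(key);
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Missing launch parameter: " + key);
        }
        return value;
    }
}
```

Validating everything up front means a misconfigured build stops before the crawler logs in anywhere. Note that a record's generated `toString()` would include the password, so such an object should not be logged as is.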
In the current configuration, the application performs two types of data search on the site: a search for information about people and a search for companies based on the vacancies they have published.
The system performs the following functions:
- automatic login to the site and verification of authorization;
- handling of obstacles that arise during authorization or crawler operation (captchas, verification codes, etc.);
- reading the necessary information from the search results, as well as from the web pages of people and companies;
- generation of reports with the collected data and their automatic delivery to users by e-mail.
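The obstacle handling in the list above can be pictured as a retry loop around each crawler step. This is a simplified sketch under stated assumptions: `ObstacleException`, `ObstacleAwareRunner` and the resolver callback are hypothetical names, and the real crawler may detect and resolve captchas quite differently.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

// Hypothetical signal that the site interrupted the crawler (captcha, verification code, ...).
class ObstacleException extends Exception {
    final String kind;

    ObstacleException(String kind) {
        super("Obstacle: " + kind);
        this.kind = kind;
    }
}

final class ObstacleAwareRunner {

    // Runs the step; when an obstacle is reported, lets the resolver deal with it
    // and tries again, up to maxAttempts runs of the step in total.
    static <T> T run(Callable<T> step, Consumer<String> resolver, int maxAttempts) throws Exception {
        ObstacleException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.call();
            } catch (ObstacleException e) {
                last = e;
                resolver.accept(e.kind);
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", last);
    }
}
```

A login step, for instance, would be passed as the `step`, while the `resolver` would hand the captcha to a solving service. Any other exception thrown by the step is propagated unchanged.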
It is possible to launch several search queries at a time through the application (for both types of search, different regions, etc.).
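The text does not say whether the queries of one launch run sequentially or in parallel. Assuming they may run in parallel, a batch of queries could be dispatched on a small fixed thread pool as sketched below. `SearchQuery` and `QueryBatch` are hypothetical names, and the search itself is passed in as a function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical description of one search: its type ("people" or "companies") and region.
record SearchQuery(String type, String region) {}

final class QueryBatch {

    // Runs every query with the given search function on a small fixed pool
    // and returns the results in the same order as the queries.
    static <R> List<R> runAll(List<SearchQuery> queries, Function<SearchQuery, R> search, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (SearchQuery q : queries) {
                tasks.add(() -> search.apply(q));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // rethrows a failed query as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the pool small limits the number of simultaneous requests, which matters on sites that ban clients for exceeding request limits.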
The application itself starts automatically based on the specified parameters, and the user can influence the result of its work only by changing these parameters in advance.
Technologies used on the project
Frameworks, libraries: Lombok, Log4J, Selenium WebDriver, TestNG.
APIs used: Apache POI, JavaMail, Google API (Drive, Gmail), REST-assured (2captcha API).
Infrastructure: Apache Subversion (SVN), Jenkins, IntelliJ IDEA.
- The work is carried out in accordance with the Scrum/Agile methodology.
- Manual testing was used on the project, in particular for the following reasons:
  - running unit tests in parallel with the main builds can exceed the limits on the number of requests per unit of time and lead to additional bans;
  - some situations (for example, failures in site operation) cannot be reproduced to test the crawler automatically.
Errors in the operation of the application are identified by studying the reports and data files sent to users. When the files are examined, the data actually collected by the crawler is compared with the data on the crawled site.
- Portals for finding business contacts are regularly updated and expand their capabilities, so the crawler has to be updated from time to time to keep working correctly. This requires the involvement of specialists, but the crawler is designed so that an update does not require changing a large amount of code or extensive refactoring. It is enough to make minor changes to certain software layers of the crawler.
- The duration of application operation can vary significantly and largely depends on the amount of data being read.
- The work of JazzTeam Sales and Marketing Department specialists was automated, which reduced their labor costs for collecting data on potential customers severalfold.
- The application is used every day by the Sales and Marketing Department. We always have access to up-to-date data on vacancies published on various portals and on the companies that published them, which helps us find and establish business contacts.
Company’s achievements on the project
- During the first 8 months of the crawler’s operation, the Sales and Marketing Department collected and processed several dozen times more leads than it had collected over a period of the same length without the crawler.
- The crawler has already helped to find several interesting partners and customers for the company, including some from new geographic regions.
- Thanks to a thorough study of the interfaces and operating features of various portals for finding business contacts, a number of tasks related to optimizing the crawler and keeping it running without interruption during data reading have been solved.