The NoSQL approach to data storage

    This article describes NoSQL, an approach to data storage that differs from the classic RDBMS. It outlines the general distinguishing characteristics of the two approaches (ACID and BASE, along with the details and requirements of each type of database), presents the types of NoSQL databases with example implementations and their fields of application, and analyzes the “one-to-many” and “many-to-one” relationship models to give a rationale for choosing between an RDBMS and a NoSQL solution.

    Comparison of NoSQL and RDBMS approaches

    The best-known data model today is probably the relational model used by SQL, in which data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). There is now an effort, if not to replace the relational model, then at least to challenge its dominance. The name for this new approach is NoSQL.
    The main reasons for adopting NoSQL databases are:

    • The need for greater scalability than relational databases can easily provide, including very large data sets or very high write throughput.
    • A preference for free and open source software over commercial products.
    • The desire to escape the restrictions of relational schemas in favor of more dynamic and flexible data models.

    Despite the abundance of NoSQL implementations, the main differences from the classic RDBMS solutions are as follows:

    Types of NoSQL solutions

    Depending on the data model and on the approach to distribution and replication, several types of storage can be distinguished:

    • Key-value databases. In this type of NoSQL database, data is stored as a collection of “key-value” pairs. The key is a unique identifier. Keys and values can be anything from simple values to complex compound objects or raw bytes (for example, images). This type of database has great potential for horizontal scaling. Examples are DynamoDB, Redis and Riak; Cassandra, a wide-column store, is often grouped with this family as well.
    • Document-oriented databases. This type of database is designed to store hierarchical data structures as documents in a human-readable format (for example, JSON). A document database does not enforce a schema on its documents: documents can have the same or different structures. Examples are CouchDB, Couchbase and MongoDB.
    • Graph databases. These databases are used for tasks where relationships must be stored and tracked. The main elements of a graph database are nodes and edges (links). Nodes store entities, and edges store the relationships between entities. Working with the database means traversing edges of certain types or the entire graph, which is fast because the relationships are stored directly in the edges. Typical uses are social networks and fraud detection services. Examples are Neo4j, Amazon Neptune and AllegroGraph.
    • Search databases. These databases build indexes over common characteristics of the incoming data, and their main purpose is to make searches fast. They are optimized for searching information that can be large and poorly structured. Typically, they provide simple search by text and by regular expressions, as well as specialized processing of search results. Examples are Elasticsearch and Splunk.
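
    The first and third models in the list above can be pictured with plain Python structures. This is only a minimal in-memory sketch, not the API of any of the products named: the names kv_store, edges and reachable, as well as the sample keys and people, are made up for illustration.

```python
from collections import deque

# Key-value model: an opaque value is stored and fetched by a unique key.
kv_store = {}
kv_store["session:42"] = {"user_id": 251}       # compound value
kv_store["photo:251"] = b"\x89PNG placeholder"  # raw bytes, e.g. an image

# Graph model: nodes are entities, edges are the relationships between them.
# Here each node maps to the list of nodes it has an outgoing edge to.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def reachable(start, max_hops):
    """Breadth-first traversal: nodes reachable from start within max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(start)
    return seen
```

    Calling reachable("alice", 2) follows edges only, without scanning unrelated records, which is the property that graph databases build on.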

    The problem of choice: RDBMS or NoSQL?

    Let us illustrate how an object such as a “resume” can be expressed in a relational schema.

    The profile as a whole is identified by a unique identifier, user_id. Fields such as first_name and last_name occur exactly once per user, so they can be modeled as columns of the users table.

    However, most people hold more than one job (position) during a career, and the number of education periods and contact information items also varies. There is a “one-to-many” relationship between the user and these elements. In the SQL model, the most common normalized representation places “positions”, “education” and “contact information” in separate tables that reference the users table through a foreign key, as shown in Picture 1 (Representation of a profile).

    For data structures such as a “resume”, which is usually a self-contained document, a JSON representation is quite suitable.
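
    The normalized layout described above could look roughly as follows. This is a sketch that uses Python's built-in sqlite3 module and assumes the column names from the JSON example in this article; the surrogate keys position_id and education_id are hypothetical, and the contact information table is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);

-- One user has many positions: the "many" side carries the foreign key.
CREATE TABLE positions (
    position_id  INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL REFERENCES users(user_id),
    job_title    TEXT NOT NULL,
    organization TEXT NOT NULL
);

-- One user has many education entries.
CREATE TABLE education (
    education_id INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL REFERENCES users(user_id),
    school_name  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")
conn.executemany(
    "INSERT INTO positions (user_id, job_title, organization) VALUES (?, ?, ?)",
    [
        (251, "Co-chair", "Gates Foundation"),
        (251, "Co-founder, Chairman", "Microsoft"),
    ],
)
conn.executemany(
    "INSERT INTO education (user_id, school_name) VALUES (?, ?)",
    [(251, "Harvard University"), (251, "Lakeside School")],
)
conn.commit()
```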

    Example 1. Representation of a profile as a JSON document:

    {
      "user_id": 251,
      "first_name": "Bill",
      "last_name": "Gates",
      "summary": "Co-chair of the Gates Foundation",
      "region_id": "us:91",
      "industry_id": 131,
      "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
      "positions": [
        {
          "job_title": "Co-chair",
          "organization": "Gates Foundation"
        },
        {
          "job_title": "Co-founder, Chairman",
          "organization": "Microsoft"
        }
      ],
      "education": [
        {"school_name": "Harvard University"},
        {"school_name": "Lakeside School"}
      ]
    }

    The advantage of this format is that it is much simpler than XML and is easy for people to read. This data model is supported by document-oriented databases such as Couchbase. The JSON representation has better locality than the multi-table schema (see Picture 1, Representation of a profile).

    To retrieve a profile in the relational example, it is necessary either to execute several queries (one per table, filtered by user_id) or to perform an intricate multi-way join of the users table with its subordinate tables.

    In the JSON representation, all the necessary information is in one place, and a single request is enough. The “one-to-many” relationships from the user to the positions and education entries give the data a tree-like structure, which the JSON document makes explicit.
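
    Both relational options can be sketched with the built-in sqlite3 module. The tables below are a deliberately reduced, assumed version of the schema (no constraints, no contact information), and load_profile is a hypothetical helper, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allow access to columns by name
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE positions (user_id INTEGER, job_title TEXT, organization TEXT);
CREATE TABLE education (user_id INTEGER, school_name TEXT);
INSERT INTO users VALUES (251, 'Bill', 'Gates');
INSERT INTO positions VALUES (251, 'Co-chair', 'Gates Foundation');
INSERT INTO positions VALUES (251, 'Co-founder, Chairman', 'Microsoft');
INSERT INTO education VALUES (251, 'Harvard University');
INSERT INTO education VALUES (251, 'Lakeside School');
""")


def load_profile(user_id):
    """Option 1: one query per table, assembled into a single dict."""
    user = conn.execute(
        "SELECT * FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    if user is None:
        return None
    profile = dict(user)
    profile["positions"] = [
        dict(row)
        for row in conn.execute(
            "SELECT job_title, organization FROM positions WHERE user_id = ?",
            (user_id,),
        )
    ]
    profile["education"] = [
        dict(row)
        for row in conn.execute(
            "SELECT school_name FROM education WHERE user_id = ?", (user_id,)
        )
    ]
    return profile


# Option 2: a single multi-way join. Every position is paired with every
# education entry, so the application still has to de-duplicate the rows.
joined_rows = conn.execute(
    """
    SELECT u.first_name, p.job_title, e.school_name
    FROM users u
    JOIN positions p ON p.user_id = u.user_id
    JOIN education e ON e.user_id = u.user_id
    WHERE u.user_id = ?
    """,
    (251,),
).fetchall()
```

    With two positions and two education entries, the join returns four rows for one user, which is one reason the multi-query variant is often easier to work with.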
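
    For comparison, the document case reduces to a single lookup by key. The sketch below stands in for a document database with a plain dictionary of serialized JSON strings; the key format "user::251" and the helper load_profile are assumptions made for illustration, not the interface of Couchbase or any other product.

```python
import json

# Hypothetical document store: one serialized JSON document per key.
documents = {
    "user::251": json.dumps(
        {
            "user_id": 251,
            "first_name": "Bill",
            "last_name": "Gates",
            "positions": [
                {"job_title": "Co-chair", "organization": "Gates Foundation"},
                {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
            ],
            "education": [
                {"school_name": "Harvard University"},
                {"school_name": "Lakeside School"},
            ],
        }
    )
}


def load_profile(user_id):
    """Fetch the whole profile tree with one lookup and one parse."""
    raw = documents.get(f"user::{user_id}")
    return json.loads(raw) if raw is not None else None
```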

    Picture 1. Representation of a profile

    In Picture 1 (Representation of a profile), region_id and industry_id are references to rows in other tables, not text strings such as “Seattle Area” or “Philanthropy”.

    This was done for several reasons:

    • Consistent style and unambiguity – for example, when several cities in different areas share the same name.
    • Ease of modification – the name is stored in only one place, so it is easy to rename it throughout the system.
    • Simple localization – when the interface is translated into other languages, the values are easy to translate because they are stored in one place.
    • Better search – for example, this profile can be found by searching for philanthropists in a certain region.

    Whether to store an ID or a text string is a question of data duplication. With an ID, the information is stored in only one place, and every reference to it uses the ID. With the value stored directly, the same information is duplicated in every record. When duplicated information changes, all of its copies have to be updated. This makes records redundant and creates a risk of inconsistency (when only some of the copies are updated). This is where the question of database normalization arises.

    Normalization requires that the database itself support “many-to-one” relationships. These do not fit well into a document database such as Couchbase, because a document, as a rule, has a tree structure with “one-to-many” relationships. If “many-to-one” relationships are present, you have to emulate them, for example by moving the otherwise duplicated data into separate documents and resolving the references in application code.
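
    The trade-off can be shown with a few lines of Python. This is an illustrative sketch only: the second user (252), the mapping of "us:91" to “Seattle Area” and the new name “Greater Seattle Area” are assumptions invented for the example.

```python
# Normalized: profiles hold an ID, the name lives in exactly one place.
regions = {"us:91": "Seattle Area"}
profiles = [
    {"user_id": 251, "region_id": "us:91"},
    {"user_id": 252, "region_id": "us:91"},  # hypothetical second user
]

regions["us:91"] = "Greater Seattle Area"  # a single update renames it everywhere
normalized_names = {regions[p["region_id"]] for p in profiles}

# Denormalized: every profile carries its own copy of the name.
denormalized = [
    {"user_id": 251, "region": "Seattle Area"},
    {"user_id": 252, "region": "Seattle Area"},
]

denormalized[0]["region"] = "Greater Seattle Area"  # only one copy gets updated
denormalized_names = {p["region"] for p in denormalized}

# After the rename, the normalized data shows one spelling, while the
# partially updated copies show two different names for the same region.
```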
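
    One common way to emulate a “many-to-one” relationship is to resolve the references in application code. The sketch below again uses a dictionary in place of a real document store; the key prefixes, the name field and the helper load_profile_with_names are hypothetical, and the mapping of the IDs to “Seattle Area” and “Philanthropy” is assumed for illustration.

```python
# Hypothetical document store: shared data lives in its own documents.
documents = {
    "user::251": {
        "first_name": "Bill",
        "last_name": "Gates",
        "region_id": "us:91",
        "industry_id": 131,
    },
    "region::us:91": {"name": "Seattle Area"},
    "industry::131": {"name": "Philanthropy"},
}


def load_profile_with_names(user_id):
    """Application-side join: one lookup for the profile, one per reference."""
    profile = dict(documents[f"user::{user_id}"])  # copy, keep the store intact
    region = documents[f"region::{profile['region_id']}"]
    industry = documents[f"industry::{profile['industry_id']}"]
    profile["region"] = region["name"]
    profile["industry"] = industry["name"]
    return profile
```

    The work that a relational database performs inside a join moves into the application here: each reference costs an extra lookup, and nothing in the store guarantees that the referenced document exists.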

    Conclusion

    NoSQL, like any technology, has its advantages and disadvantages. NoSQL databases are a good fit when your requirements include the following:

    • Speed. NoSQL databases are often faster, and sometimes much faster, when it comes to writing. Read operations can also be quite fast, depending on which database you are using.
    • Storage of large amounts of unstructured data with a schemaless design. Relational DBMSs require a well-defined data structure before work starts, while NoSQL solutions are more flexible.
    • Automated (or very simple) replication and scaling. NoSQL databases are developing rapidly, and replication and scaling are among the major problems their developers are actively solving. Compared with relational DBMSs, NoSQL solutions are typically designed to scale out easily and to run on clusters.
    • Only “one-to-one” and “one-to-many” relationships exist between entities.
    • Savings on licenses and support, since most NoSQL databases are open source projects.

    References

    http://nosql-database.org/
    https://www.couchbase.com/products/server
    https://habr.com/post/152477/ (in Russian)
    https://en.wikipedia.org/wiki/NoSQL