A Review of Different Database Types: Relational versus Non-Relational – DATAVERSITY
This includes compression of source data, but also the intermediate data generated as part of data processing. Although compression can greatly optimize processing performance, not all compression formats supported on Hadoop are splittable. Because the MapReduce framework splits data for input to multiple tasks, having a nonsplittable compression format is an impediment to efficient processing.
If files cannot be split, that means the entire file needs to be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides. For this reason, splittability is a major consideration in choosing a compression format as well as file format.
Snappy
Snappy is a compression codec developed at Google for high compression speeds with reasonable compression. Processing performance with Snappy can be significantly better than with other compression formats.
LZO
Unlike Snappy, LZO compressed files are splittable, but this requires an additional indexing step. This makes LZO a good choice for files, such as plain-text files, that are not being stored as part of a container format.
Gzip
Gzip provides very good compression performance, on average considerably better than Snappy.
Gzip usually performs almost as well as Snappy in terms of read performance. Gzip is also not splittable, so it should be used with a container format. Note that one reason Gzip is sometimes slower than Snappy for processing is that Gzip compressed files take up fewer blocks, so fewer tasks are required for processing the same data.
For this reason, using smaller blocks with Gzip can lead to better performance.
bzip2
Unlike Snappy and Gzip, bzip2 is inherently splittable. bzip2 compresses very well but is slow to process; the exact difference will vary with different machines, but in general bzip2 is about 10 times slower than Gzip. This makes bzip2 a poor fit for most processing workloads, but a reasonable choice when storage savings matter most. One example of such a use case would be using Hadoop mainly for active archival purposes.
Compression recommendations
In general, any compression format can be made splittable when used with container file formats (Avro, SequenceFiles, etc.).
If you are doing compression on the entire file without using a container file format, then you have to use a compression format that inherently supports splitting (e.g., bzip2, or LZO with its indexing step). Here are some recommendations on compression in Hadoop: Enable compression of MapReduce intermediate output. This will improve performance by decreasing the amount of intermediate data that needs to be read from and written to disk.
Pay attention to how data is ordered. Often, ordering data so that like data is close together will provide better compression levels. Remember, data in Hadoop file formats is compressed in chunks, and it is the entropy of those chunks that will determine the final compression.
For example, if you have stock ticks with the columns timestamp, stock ticker, and stock price, then ordering the data by a repeated field, such as stock ticker, will provide better compression than ordering by a unique field, such as time or stock price.
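The effect of ordering on compression can be seen with any general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in for a Hadoop codec and made-up ticker names; the point is only that grouping like values together compresses better than arrival order.

```python
import random
import zlib

# Hypothetical ticker data with many repeated values (names are made up).
random.seed(42)
tickers = [f"TICK{i:02d}" for i in range(20)]
records = [random.choice(tickers) for _ in range(50000)]

as_loaded = "\n".join(records).encode()          # ordered by arrival time
reordered = "\n".join(sorted(records)).encode()  # like values grouped together

# Grouping identical values gives the compressor long, cheap back-references,
# so the reordered data compresses to a fraction of the size.
print(len(zlib.compress(as_loaded)), len(zlib.compress(reordered)))
```

Both byte strings contain exactly the same records; only their order, and hence the entropy of each compressed chunk, differs.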
Consider using a compact file format with support for splittable compression, such as Avro. Avro compresses blocks of records and separates them with sync markers, which in turn means that each of the HDFS blocks can be compressed and decompressed individually, thereby making the data splittable.
HDFS Schema Design
In this section, we will describe the considerations for good schema design for data that you decide to store in HDFS directly. While many people use Hadoop for storing and processing unstructured data (such as images, videos, emails, or blog posts) or semistructured data (such as XML documents and logfiles), some order is still desirable.
This is especially true since Hadoop often serves as a data hub for the entire organization, and the data stored in HDFS is intended to be shared among many departments and teams. Creating a carefully structured and organized repository of your data will provide many benefits.
To list a few: A standard directory structure makes it easier to share data between teams working with the same data sets. It also allows for enforcing access and quota controls to prevent accidental deletion or corruption.
Conventions regarding staging data will help ensure that partially loaded data will not get accidentally processed as if it were complete. Standardized organization of data allows for reuse of some code that may process the data.
Some tools in the Hadoop ecosystem make assumptions regarding the placement of data, and it is often simpler to match those assumptions when you are initially loading data into Hadoop. The details of the data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables.
Unstructured and semistructured data, on the other hand, are likely to focus more on directory placement and metadata management.
The important points to keep in mind when designing the schema, regardless of the project specifics, are: Develop standard practices and enforce them, especially when multiple teams are sharing the data.
Make sure your design will work well with the tools you are planning to use. For example, the version of Hive you are planning to use may only support table partitions on directories that are named a certain way. This will impact the schema design in general and how you name your table subdirectories, in particular.
Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements will result in a schema that will be easier to maintain and support in the long term as well as improve data processing performance. Standard locations make it easier to find and share data between teams. The following is an example HDFS directory structure that we recommend.
This directory structure simplifies the assignment of permissions to various groups and users. Per-user directories (conventionally under /user) hold scratch-type data that the user is currently experimenting with but that is not part of a business process.
Within each application-specific directory, you would have a directory for each ETL process or workflow for that application. Within the workflow directory, there are subdirectories for each of the data sets. In some cases, you may want to go one level further and have directories for each stage of the process. A temporary directory is typically cleaned by an automated process and does not store long-term data.
It is typically readable and writable by everyone. Shared, organization-wide data sets belong in a directory of their own; because these are often critical data sources for analysis that drive business decisions, there are often controls around who can read and write this data. Very often user access is read-only, and data is written by automated and audited ETL processes. Application artifacts such as shared code and configuration also get their own directory. It is not always necessary to store such application artifacts in HDFS, but some Hadoop applications such as Oozie and Hive require storing shared code and configuration on HDFS so it can be used by code executing on any node of the cluster.
For a given application (say, Oozie), you would need a directory for each version of the artifacts you decide to store in HDFS, possibly tagging, via a symlink in HDFS, the latest artifact as latest and the currently used one as current. The directories containing the binary artifacts would be present under these versioned directories.
A metadata directory is the best location for storing metadata such as the schemas of data sets. This directory is typically readable by ETL jobs but writable by the user used for ingesting data into Hadoop.
For example, the Avro schema file for a data set called movie would exist at a corresponding path under the metadata directory.
Advanced HDFS Schema Design
Once the broad directory structure has been decided, the next important decision is how data will be organized into files.
There are a few strategies to best organize your data.
We will talk about partitioning, bucketing, and denormalizing here. When the data sets grow very big, and queries only require access to subsets of data, a very good solution is to break up the data set into smaller subsets, or partitions.
Each of these partitions would be present in a subdirectory of the directory containing the entire data set.
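This subdirectory-per-partition layout can be sketched in a few lines. The sketch below assumes the Hive-style `column=value` naming convention and made-up order records; the data set name `medication_orders` is illustrative.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical orders; "date" is the partition column.
orders = [
    {"order_id": 1, "date": "2016-01-01"},
    {"order_id": 2, "date": "2016-01-01"},
    {"order_id": 3, "date": "2016-01-02"},
]

def partition_dir(dataset: str, row: dict) -> str:
    # Hive-style convention: <data set>/<partition column>=<value>
    return str(PurePosixPath(dataset) / f"date={row['date']}")

# Group rows by the partition subdirectory they would be written to.
by_partition = defaultdict(list)
for row in orders:
    by_partition[partition_dir("medication_orders", row)].append(row)

print(sorted(by_partition))
```

A query filtering on a single date then needs to read only the one matching subdirectory rather than the whole data set.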
This will allow queries to read only the specific partitions (i.e., subdirectories) they need, which can dramatically reduce the amount of data scanned. When placing the data in the filesystem, the common Hive-style convention for partition directories is <data set name>/<partition column>=<partition value>; in our orders example, this translates to something like medication_orders/date=2016-01-09.
Bucketing
Bucketing is another technique for decomposing large data sets into more manageable subsets. It is similar to the hash partitions used in many relational databases. In the preceding example, we could partition the orders data set by date because there are a large number of orders done daily and the partitions will contain large enough files, which is what HDFS is optimized for.
However, if we tried to partition the data by physician to optimize for queries looking for specific physicians, the resulting number of partitions may be too large and resulting files may be too small in size. Also, many small files can lead to many processing tasks, causing excessive overhead in processing.
The solution is to bucket by physician, which will use a hashing function to map physicians into a specified number of buckets. This way, you can control the size of the data subsets (i.e., buckets). A good average bucket size is a few multiples of the HDFS block size. Having an even distribution of data when hashed on the bucketing column is important because it leads to consistently sized buckets. Also, having the number of buckets as a power of two is quite common. The word join here is used to represent the general idea of combining two data sets to retrieve a result.
When both the data sets being joined are bucketed on the join key and the number of buckets of one data set is a multiple of the other, it is enough to join corresponding buckets individually without having to join the entire data sets. This significantly reduces the time complexity of doing a reduce-side join of the two data sets.
This is because doing a reduce-side join is computationally expensive. However, when two bucketed data sets are joined, instead of joining the entire data sets together, you can join just the corresponding buckets with each other, thereby reducing the cost of doing a join. Of course, the buckets from both tables can be joined in parallel. Moreover, because the buckets are typically small enough to easily fit into memory, you can do the entire join in the map stage of a MapReduce job by loading the smaller of the buckets in memory.
This is called a map-side join, and it improves the join performance as compared to a reduce-side join even further.
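The core of a map-side join fits in a few lines: build an in-memory lookup from the smaller bucket and stream the larger one past it. The records below are hypothetical stand-ins for two co-bucketed data sets.

```python
# Two corresponding buckets, already co-bucketed on the join key (order_id).
orders_bucket = [
    {"order_id": 1, "physician": "dr_a"},
    {"order_id": 2, "physician": "dr_b"},
]
details_bucket = [
    {"order_id": 1, "medication": "aspirin"},
    {"order_id": 2, "medication": "ibuprofen"},
    {"order_id": 1, "medication": "statin"},
]

# Map-side join: load the smaller bucket into memory, stream the larger one.
orders_by_key = {row["order_id"]: row for row in orders_bucket}
joined = [
    {**orders_by_key[d["order_id"]], **d}
    for d in details_bucket
    if d["order_id"] in orders_by_key
]

print(len(joined))  # 3 joined rows, with no shuffle of the full data sets
```

Because each pair of corresponding buckets is independent, many such joins can run in parallel, one per bucket pair.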
If you are using Hive for data analysis, it should automatically recognize that tables are bucketed and apply this optimization. If the data in the buckets is sorted, it is also possible to use a merge join and not store the entire bucket in memory when joining.
This is somewhat faster than a simple bucket join and requires much less memory. Hive supports this optimization as well. Note that it is possible to bucket any table, even when there are no logical partition keys. It is recommended to use both sorting and bucketing on all large tables that are frequently joined together, using the join key for bucketing. As you can tell from the preceding discussion, the schema design is highly dependent on the way the data will be queried.
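The sorted-bucket merge join described above can be sketched with two cursors that advance in lockstep, never materializing a hash map of either side. The data and field names are illustrative.

```python
def merge_join(left, right, key):
    """Join two buckets sorted on `key`, without loading either into a hash map.

    Handles duplicate keys on the right side; assumes unique keys on the left,
    as a join of orders to order details would.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right-side row matching this key, then advance left.
            k = j
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out

orders = [{"order_id": 1, "physician": "dr_a"},
          {"order_id": 2, "physician": "dr_b"}]
details = [{"order_id": 1, "med": "aspirin"},
           {"order_id": 1, "med": "statin"},
           {"order_id": 3, "med": "ibuprofen"}]
print(merge_join(orders, details, "order_id"))
```

Only the current rows from each bucket need to be held at once, which is why sorted buckets use so much less memory than an in-memory hash join.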
You will need to know which columns will be used for joining and filtering before deciding on partitioning and bucketing of the data. In cases when there are multiple common query patterns and it is challenging to decide on one partitioning key, you have the option of storing the same data set multiple times, each with different physical organization.
This is considered an anti-pattern in relational databases, but with Hadoop, this solution can make sense. For one thing, in Hadoop data is typically write-once, and few updates are expected. Therefore, the usual overhead of keeping duplicated data sets in sync is greatly reduced. In addition, the cost of storage in Hadoop clusters is significantly lower, so there is less concern about wasted disk space. These attributes allow us to trade space for greater query speed, which is often desirable.
Denormalizing
Although we talked about joins in the previous subsections, another method of trading disk space for query performance is denormalizing data sets so there is less of a need to join data sets.
In relational databases, data is often stored in third normal form. Such a schema is designed to minimize redundancy and provide data integrity by splitting data into smaller tables, each holding a very specific entity.
This means that most queries will require joining a large number of tables together to produce final result sets. In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster.
Reduce-side joins, in particular, require sending entire tables over the network. While bucketing and sorting do help there, another solution is to create data sets that are prejoined—in other words, preaggregated. The idea is to minimize the amount of work queries have to do by doing as much as possible in advance, especially for queries or subqueries that are expected to execute frequently.
Instead of running the join operations every time a user tries to look at the data, we can join the data once and store it in this form. Looking at the difference between a typical Online Transaction Processing (OLTP) schema and an HDFS schema for a particular use case, you will see that the Hadoop schema consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process. In the case of our pharmacy example, we consolidate frequency, class, admin route, and units into the medications data set to avoid repeated joining.
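This ETL-time consolidation amounts to resolving the small dimension lookups once and storing the widened records. A minimal sketch, with hypothetical dimension tables and values:

```python
# Hypothetical dimension lookups, normally separate tables requiring joins.
frequency = {1: "daily", 2: "twice daily"}
admin_route = {1: "oral", 2: "intravenous"}

medications = [
    {"med_id": 10, "name": "aspirin", "frequency_id": 1, "admin_route_id": 1},
    {"med_id": 11, "name": "statin", "frequency_id": 2, "admin_route_id": 1},
]

# ETL-time denormalization: resolve the dimensions once and store the
# prejoined records, so queries never repeat these joins.
denormalized = [
    {**m,
     "frequency": frequency[m["frequency_id"]],
     "admin_route": admin_route[m["admin_route_id"]]}
    for m in medications
]
print(denormalized[0]["frequency"])
```

The denormalized records are larger on disk, but every downstream query is spared two joins, which is exactly the space-for-speed trade described above.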
Other types of data preprocessing, like aggregation or data type conversion, can be done to speed up processes as well. Since data duplication is a lesser concern, almost any type of processing that occurs frequently in a large number of queries is worth doing once and reusing.
In relational databases, this pattern is often known as materialized views. In Hadoop, you instead have to create a new data set that contains the same data in its aggregated form. Finally, we have seen how denormalization plays an important role in speeding up Hadoop jobs.
HBase Schema Design
In this section, we will describe the considerations for good schema design for data stored in HBase. While HBase is a complex topic with multiple books written about its usage and optimization, this chapter takes a higher-level approach and focuses on leveraging successful design patterns for solving common problems with HBase.
For an introduction to HBase, see HBase: The Definitive Guide. Just like a hash table, HBase allows you to associate values with keys and perform fast lookup of the values based on a given key. There are many details related to how regions and compactions work in HBase, various strategies for ingesting data into HBase, using and understanding block cache, and more that we are glossing over with the hash table analogy. Still, the analogy is useful: it makes you think in terms of get, put, scan, increment, and delete requests instead of SQL queries.
In hash table terms, you can:
- Put things in a hash table
- Get things from a hash table
- Iterate through a hash table (note that HBase gives us even more power here with range scans, which allow specifying a start and end row when scanning)
- Increment the value of a hash table entry
- Delete values from a hash table
It also makes sense to answer the question of why you would want to give up SQL for HBase.
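To make the analogy concrete, here is a toy in-memory model of that API. It is a teaching sketch, not HBase: real HBase keys are byte arrays spread across regions, but the operations line up one-to-one.

```python
import bisect

class TinyTable:
    """Toy stand-in for HBase's API: a map with sorted keys for range scans."""

    def __init__(self):
        self.rows = {}   # row key -> value
        self.keys = []   # row keys kept in sorted order, as HBase keeps them

    def put(self, key, value):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

    def scan(self, start, stop):
        # Range scan: all rows with start <= key < stop.
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

    def increment(self, key, amount=1):
        self.put(key, (self.get(key) or 0) + amount)

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self.keys.remove(key)

table = TinyTable()
table.put("user#1", 5)
table.increment("user#1", 2)
table.put("user#2", 1)
print(table.get("user#1"), table.scan("user#1", "user#3"))
```

Note that the sorted key list is what makes `scan` possible; this mirrors why HBase keeps row keys sorted within and across regions.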
The value proposition of HBase lies in its scalability and flexibility. In general, HBase works for problems that can be solved in a few get and put requests. Row Key To continue our hash table analogy, a row key in HBase is like the key in a hash table. One of the most important factors in having a well-architected HBase schema is good selection of a row key.
Following are some of the ways row keys are used in HBase and how they drive the choice of a row key. HBase records can have an unlimited number of columns, but only a single row key. This is different from relational databases, where the primary key can be a composite of multiple columns. This means that in order to create a unique row key for records, you may need to combine multiple pieces of information in a single key.
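Packing several fields into the single row key is usually a small helper at write time. A minimal sketch, with hypothetical customer and order IDs and `|` as an assumed delimiter:

```python
def make_row_key(customer_id: str, order_id: str) -> bytes:
    # The delimiter must never appear in either field, so keys stay
    # unambiguous and sort in (customer, order) order.
    return f"{customer_id}|{order_id}".encode()

def parse_row_key(key: bytes) -> tuple:
    customer_id, order_id = key.decode().split("|")
    return customer_id, order_id

key = make_row_key("cust42", "order0007")
print(key, parse_row_key(key))
```

Fixed-width or zero-padded fields are a common variation, since they keep lexicographic key order aligned with logical order.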
Another thing to keep in mind when choosing a row key is that a get operation of a single record is the fastest operation in HBase. Therefore, designing the HBase schema in such a way that most common uses of the data will result in a single get operation will improve performance. This may mean putting a lot of information into a single record—more than you would do in a relational database.
This type of design is called denormalized, as distinct from the normalized design common in relational databases. For example, in a relational database you will probably store customers in one table, their contact details in another, their orders in a third table, and the order details in yet another table. In HBase you may choose a very wide design where each order record contains all the order details, the customer, and his contact details.
All of this data will be retrieved with a single get.
Distribution
The row key determines how records for a given table are scattered throughout various regions of the HBase cluster. In HBase, all the row keys are sorted, and each region stores a range of these sorted row keys. Each region is pinned to a region server (i.e., a node in the cluster). A well-known anti-pattern is to use a timestamp for row keys because it would make most of the put and get requests focused on a single region and hence a single region server, which somewhat defeats the purpose of having a distributed system.
As we will see later in this chapter, one of the ways to resolve this problem is to salt the keys. In particular, the combination of device ID and timestamp or reverse timestamp is commonly used to salt the key in machine data.
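Salting can be as simple as prefixing the key with a stable hash of the device ID. A sketch under those assumptions (device IDs and the salt count of 8 are made up):

```python
import zlib

NUM_SALTS = 8  # e.g., roughly the number of regions to spread writes over

def salted_key(device_id: str, timestamp: int) -> str:
    # A stable hash of the device ID as a prefix spreads monotonically
    # increasing timestamps across key ranges instead of hammering one region.
    salt = zlib.crc32(device_id.encode()) % NUM_SALTS
    return f"{salt}|{device_id}|{timestamp}"

keys = [salted_key(f"device_{i}", 1_450_000_000) for i in range(100)]
prefixes = {k.split("|")[0] for k in keys}
print(sorted(prefixes))
```

Readers must fan out a scan per salt prefix, so this trades some read complexity for balanced write load.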
Block cache
The block cache is a least recently used (LRU) cache that caches data blocks in memory. By default, HBase reads records in chunks of 64 KB from the disk. Each of these chunks is called an HBase block.
When an HBase block is read from the disk, it will be put into the block cache. However, this insertion into the block cache can be bypassed if you desire.
The idea behind the caching is that recently fetched records and those in the same HBase block as them have a high likelihood of being requested again in the near future.
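The LRU policy behind the block cache is easy to model. The sketch below is a toy, not HBase's actual implementation; block IDs and the two-block capacity are made up.

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache of fixed-size blocks, modeled on HBase's block cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, block_id):
        if block_id not in self.blocks:
            return None  # cache miss: caller would read the block from disk
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used block

cache = BlockCache(capacity=2)
cache.put("blk1", b"a" * 64)
cache.put("blk2", b"b" * 64)
cache.get("blk1")          # touch blk1, so blk2 becomes least recently used
cache.put("blk3", b"c" * 64)
print(cache.get("blk2"))   # None: blk2 was evicted
```

The access in the middle is what keeps `blk1` resident: recently fetched blocks survive, rarely touched ones are evicted first.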
RDBMSs have provided for data integrity needs for decades, but the exponential growth of data over the past 10 years or so, along with many new data types, has changed the data equation entirely; non-relational databases have grown out of this need.
Non-relational databases are also called NoSQL databases; there are literally hundreds, if not thousands, of them. Hadoop, which Serra described as a big data toolkit, is also part of this entire discussion.
I can use Full-Text Search. But what happens if I need to store and analyze a few million web pages? SQL Server can handle that with a nice-sized server.
But in a situation where users can enter millions of transactions per second, this becomes a serious problem.
Main Differences Between Relational and Non-Relational Databases
In his presentation, Serra showed multiple slides that detail the many variances in databases, including pros and cons. A short list of the most fundamental elements discussed by Serra includes:
Relational Databases
Relational databases work with structured data.
Relationships in this system have constraints, and there is limitless indexing. Cons: relational databases do not scale out horizontally very well (for concurrency and data size), only vertically, unless you use sharding. Data is normalized, meaning lots of joins, which affects speed. They have problems working with semi-structured data.
Non-Relational Databases
Some support ACID transactional consistency. They offer schema-free or schema-on-read options. There are now also numerous commercial products available.
Cons: there is limited support for joins. Data is denormalized, requiring mass updates (e.g., when a duplicated field value changes). There is no built-in data integrity; it must be enforced in code.
Non-Relational Stores in Short
There are many different kinds of non-relational stores; Serra gave an overview of the main types.