Tools for data compression
You never know which type of archive you may encounter when downloading and sharing files, so it makes sense to have a program on hand that can handle more than one format. Here we take a look at a selection of the best tools, taking into account those that offer the highest compression rates and those that support the largest number of file types.

One of the most famous names in the world of software utilities, WinZip is still going strong after nearly 30 years, and it remains one of the best file compression tools around.
However, you may wonder if you can justify spending money on a compression tool when there are so many free alternatives available. Ultimately it depends on your priorities, but you do get a lot of extras for your money. Other bonus features include the splitting of large zip files to fit different media, advanced file sharing options, cloud support and an advanced zip management system that rivals Windows Explorer.
The interface adapts to suit mouse and keyboard setups or touchscreen devices, and there are backup and security options thrown in to protect your files.
WinZip is an incredibly useful tool to have in your software arsenal, and it's flexible enough to work in the way that suits you best: you can create and extract archives via the Explorer context menu, or using the program window. And if you'd rather not pay, we've also featured the best free alternatives to WinZip.

WinRAR is the only tool that can create RAR archives, and this exclusivity comes at a price similar to WinZip's. Of course, WinRAR can also be used to compress files into many other formats, and the program benefits from being available for just about every platform imaginable.
The interface is not the most pleasant to look at, and even if you opt to use the Explorer context menu to create or extract archives, beginners may well feel overwhelmed by the number of options and settings on display. That said, there is a wizard mode that takes the hard work out of most tasks. WinRAR's killer feature is undoubtedly full RAR support, but its encryption, speed, self-extracting archive creation and themes (if you're into that sort of thing!) are also worth noting.

The first free option in this roundup, 7-Zip is another program with an excellent reputation.
It can handle pretty much any compressed file format you care to throw at it. A real stalwart of the compression world, 7-Zip boasts its own compressed file format, 7z. This not only lets you compress truly gigantic files (up to 16 billion gigabytes, according to its developers), but also achieves an incredibly high compression rate.
However, this does mean making speed sacrifices: 7z can use 'solid compression' to achieve tiny file sizes, but it can be very, very slow. Thankfully, if you venture into Options within the program, you'll find it's easy enough to remove the options you don't need.
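The 7z format is built on the LZMA algorithm, which Python exposes through its standard `lzma` module. A minimal sketch of the ratio-versus-speed trade-off described above, using illustrative sample data (this is not 7-Zip's solid mode itself, just the same underlying codec at different effort levels):

```python
import lzma

# Repetitive sample data compresses well at any preset; higher presets
# spend more time searching for longer matches.
data = (b"GATTACA" * 4000) + bytes(range(256)) * 50

fast = lzma.compress(data, preset=0)  # fastest, typically larger output
best = lzma.compress(data, preset=9)  # slowest, typically smallest output

assert lzma.decompress(best) == data  # compression is lossless
print(len(data), len(fast), len(best))
```

Preset 9 here plays the role of 7-Zip's "Ultra" setting: more memory and time spent for a smaller archive.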
Table 4 shows the performance comparison of different algorithms on quality scores.
The compression ratios for quality scores are listed in Table 4. Slimfastq gives the second-best ratio for all datasets except the PacBio dataset, for which it does not work. The results clearly indicate that LFQC is the most suitable candidate for compressing quality scores, as it gives the best compression ratios for all datasets.
The methods are compared based on compression ratio, compression speed, and memory usage during compression. The comparison also includes each tool's ability to produce an exact replica of the original file after decompression. The ratio between the size of the original and the compressed files is calculated for each dataset using all the compression tools.
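The ratio defined above is straightforward to compute; a minimal sketch using in-memory data and gzip as a stand-in for any of the evaluated tools:

```python
import gzip

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Size of the original divided by size of the compressed data;
    higher is better."""
    return len(original) / len(compressed)

# A toy FASTQ-like payload; real benchmarks would read whole files.
data = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 1000
ratio = compression_ratio(data, gzip.compress(data))
print(f"{ratio:.1f}x")
```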
Table 5 shows the performance of MZPAQ relative to other evaluated tools in terms of compression ratio. The results clearly indicate that MZPAQ achieves the highest compression ratios compared to all the other tools for all datasets. LFQC achieves the second to best compression ratios for smaller file sizes; however, it does not work for larger datasets. All domain-specific tools performed better than general-purpose tools, except for LZMA, which did not work on PacBio data.
Compression speed is the number of MB compressed per second; decompression speed is computed similarly. To allow a direct comparison, we run all the tools in single-thread mode, as some of them do not support multi-threading.
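The speed measurement described above can be sketched as follows, again with gzip standing in for any single-threaded tool (the sample data and size are illustrative):

```python
import gzip
import time

def mb_per_second(data: bytes) -> float:
    """Single-threaded compression speed: MB of input processed per second."""
    start = time.perf_counter()
    gzip.compress(data)
    elapsed = time.perf_counter() - start
    return (len(data) / 1e6) / elapsed

data = b"ACGT" * 500_000  # ~2 MB of toy sequence data
print(f"{mb_per_second(data):.1f} MB/s")
```

In a real benchmark the timed call would be the external tool's process, and the input would be read from disk once beforehand so I/O does not distort the timing.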
Slimfastq is the fastest tool, providing the maximum compression speed in all cases except for PacBio data, which it does not support. LFQC is the slowest for all the datasets it supports. In the case of decompression speed, we can see from the results shown in Table 7 that gzip outperforms all the evaluated tools, decompressing at over 45 MB per second for all datasets.
We further notice that general-purpose tools have faster decompression than compression speeds, particularly LZMA. Memory usage refers to the maximum number of memory bytes required by an algorithm during compression or decompression; it represents the minimum memory that should be available for successful execution of a program.
In general, memory usage varies with the type of datasets. Tables 8 and 9 show the maximum memory requirements for compression and decompression, respectively.
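Peak memory of the kind reported in these tables can be read from the operating system's high-water mark for the process; a sketch using Python's `resource` module (POSIX only; note that `ru_maxrss` is in KiB on Linux but bytes on macOS):

```python
import bz2
import resource

def peak_rss() -> int:
    """Peak resident set size of this process so far
    (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
compressed = bz2.compress(b"ACGT" * 2_000_000)  # ~8 MB toy input
after = peak_rss()  # high-water mark never decreases
print(before, after, len(compressed))
```

When benchmarking external tools rather than in-process code, the usual approach is to launch the tool as a child process and read the child's peak RSS (e.g. via `RUSAGE_CHILDREN`, or a wrapper such as GNU time).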
The results show that LZMA requires 10 times more memory for compression than for decompression. Leon uses almost twice as much memory for compression as for decompression. In all cases, gzip requires the least amount of memory.

Evaluating the effectiveness of high-throughput sequencing data compression tools has gained a lot of interest in the last few years [1, 13–15].
Comparative reviews of prominent general-purpose as well as DNA-specific compression algorithms show that DNA compression algorithms tend to compress DNA sequences much better than general-purpose compression algorithms [1, 4]. For example, Table 10 shows the results of compression for all the benchmark datasets.
We can see that none of the evaluated compression tools, except MZPAQ, is able to compress the variable-length reads produced by PacBio. In our study, we evaluate various existing efficient algorithms to investigate their ability to compress FASTQ streams.
Moreover, they have been shown to outperform general-purpose tools in compressing identifiers and reads. Among the evaluated tools, we select MFCompress for compression of the identifier and sequence streams. We also found ZPAQ to be a suitable candidate for compression of quality scores after evaluating all the tools on this stream. A point worth noting here is that both MFCompress and ZPAQ make use of context modeling, which makes this compression technique very promising for compression of genomic data [16].
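As a toy illustration of context modeling (not the actual MFCompress or ZPAQ models), an order-1 model conditions each symbol's probability on the previous symbol; the resulting cross-entropy is an idealized lower bound on the compressed size:

```python
from collections import Counter, defaultdict
from math import log2

def order1_bits(data: bytes) -> float:
    """Estimate compressed size in bits under an order-1 context model
    trained on the data itself (an idealized two-pass bound)."""
    counts: dict = defaultdict(Counter)
    prev = -1  # sentinel context for the first symbol
    for b in data:
        counts[prev][b] += 1
        prev = b
    bits = 0.0
    prev = -1
    for b in data:
        ctx = counts[prev]
        bits += -log2(ctx[b] / sum(ctx.values()))  # ideal code length
        prev = b
    return bits

# Periodic data is fully predictable from one symbol of context,
# so the order-1 estimate approaches zero bits.
data = b"ACGT" * 1000
print(order1_bits(data) / 8, "bytes (order-1 estimate) vs", len(data), "raw")
```

Real context-mixing compressors such as ZPAQ blend many such models adaptively and code the predictions arithmetically, but the principle is the same: better conditional predictions mean shorter codes.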
Our evaluation illustrates the significant impact on compression efficiency when we divide FASTQ into multiple data streams and use different compression schemes based on the stream type. In some cases, the compression ratio gain is minor; however, our goal is to create a tool that works best for all types of data. Our evaluation shows that existing tools support only Illumina files containing short and fixed-length reads.
These tools are not optimized to support variable-length reads data from the PacBio platform. Figure 2 shows a comparison of different tools that work for all benchmark datasets. The figure shows that MZPAQ outperforms comparable tools for both the combined identifier-sequence stream and the quality scores stream. A key observation here is that the compression ratios for quality scores are much lower than those for identifier and sequence data. It is evident that the nature of quality scores makes them challenging to compress compared to the other streams of FASTQ data.
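The stream separation underlying this comparison can be sketched for FASTQ's four-line records (the function and variable names are illustrative, not the paper's implementation):

```python
def split_fastq(text: str):
    """Split FASTQ records into identifier, sequence, and quality streams."""
    lines = text.strip().split("\n")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        ids.append(lines[i])       # '@' identifier line
        seqs.append(lines[i + 1])  # nucleotide sequence
        # lines[i + 2] is the '+' separator, usually redundant
        quals.append(lines[i + 3])  # per-base quality scores
    return ids, seqs, quals

record = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFII\n"
ids, seqs, quals = split_fastq(record)
print(ids, seqs, quals)
```

Each stream then has homogeneous statistics (text-like identifiers, a 4-letter alphabet, a narrow quality alphabet), which is what makes stream-specific compressors effective.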
With general-purpose and domain-specific compression algorithms efficiently compressing identifiers and sequences while delivering only moderate compression ratios for quality scores, there is a growing need for compression schemes that better compress quality scores [17, 18].

Figure 2: Compression sizes of different FASTQ streams in two large datasets using different compression tools.

From the experimental results, we can see that the best compression ratio, maximum speed, and minimum memory requirements are competing goals.
In general, higher compression ratios are achieved by programs that are slower and have higher memory requirements. Figures 3 and 4 illustrate the trade-off between compression ratio and the speed and memory usage.
For example, gzip offers the lowest compression ratio but the best performance in terms of speed and memory usage. Tools with better compression ratios cost both time and memory, but they provide valuable long-term space and bandwidth savings; when data size is critical, these tools are preferable.

Figure 3: Compression ratio vs. speed. Figure 4: Compression ratio vs. memory usage.

Figures 3 and 4 clearly demonstrate that almost all compression algorithms, general-purpose or domain-specific, involve a trade-off between compression ratio, speed, and memory usage.
MZPAQ provides better compression ratios for all platforms, at the cost of higher running time and memory usage. MZPAQ is suitable where the preference is to maximize the compression ratio, such as for long-term storage or faster data transfer. In addition, its speed can be remarkably enhanced by employing high-performance computing.
Based on our analysis of existing compression algorithms, it is clear that none of these techniques qualifies as a one-size-fits-all approach. No compression scheme provides the best results in terms of all the evaluation metrics we analyzed. For example, datasets that are not well compressed by one algorithm are efficiently compressed by another.
One of the main drawbacks of most algorithms is their compatibility with only a specific type of input, greatly restricting their usage by biologists who need to compress different types of data. For example, some tools accept only ACTG, support only fixed read lengths, or support only a subset of platforms.

The backbone of modern genetics is DNA sequencing. Thanks to recent advances in sequencing technologies, there has been an exponential increase in the speed and amount of DNA sequenced on a daily basis.
Thus, the need for storage space is also increasing at a comparable rate. If this trend persists, the cost of the DNA sequencing pipeline will be dominated by the storage cost rather than the sequencing itself. Developing efficient compression algorithms is crucial to solving this problem. In this paper, we present a compression tool for FASTQ, the most commonly used format for raw sequencing data.
We first review recent progress related to DNA compression and explore various compression algorithms. To achieve better compression performance, the input is split to expose different kinds of information, namely identifier strings, quality scores, sequences, and other optional fields.
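Once the streams are separated, a different codec can be assigned to each one. A sketch of this idea with standard-library codecs (the codec choices here are illustrative; the paper's actual pairing is MFCompress for identifiers/sequences and ZPAQ for quality scores):

```python
import bz2
import zlib

# Hypothetical per-stream codec assignment: zlib for the highly
# repetitive identifiers, bz2 for sequences and quality scores.
streams = {
    "ids":   ("\n".join(f"@read{i}" for i in range(1000)).encode(), zlib.compress),
    "seqs":  (b"ACGTACGTGGTA" * 1000, bz2.compress),
    "quals": (b"IIIIFFFFIIII" * 1000, bz2.compress),
}

compressed = {name: codec(data) for name, (data, codec) in streams.items()}
for name, blob in compressed.items():
    original, _ = streams[name]
    print(name, len(original), "->", len(blob))
```

Because each stream is statistically homogeneous, even these generic codecs do noticeably better than compressing the interleaved FASTQ file as one block.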
The final objective is achieved by recognizing the statistical properties of each kind of information and applying an appropriate compression method. We combine existing algorithms and sub-algorithms and achieve the best compression ratios on FASTQ files for all datasets from a recent and well-known review.
Comparative analysis of existing tools as well as our tool shows that MZPAQ is able to better compress data from all types of platforms as well as data of different sizes.
We can conclude that MZPAQ is more suitable when the size of the compressed data is crucial, such as for long-term storage and data transfer to the cloud. At this point, we present a method that focuses on improving the compression ratio for all types of FASTQ datasets. In future work, effort will be made to target other aspects such as compression speed and memory requirements. Parallel implementation and code optimization can be used to overcome the high compression cost of MZPAQ.

References

Numanagić I, et al. Comparison of high-throughput sequencing data compression tools. Nat Methods. 2016;13:1005–8.
Pinho AJ, Pratas D. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics (Oxford, England). 2014;30(1):117–8.