Aug 3, 2022

RStudio, a professional data science solution

This was the first time I had come across R. After reading the first paragraph of its description, I downloaded and installed it right away, and it turned out to be great software. I also found some excellent YouTube videos explaining this powerful tool. I learned that we can run professional mathematical and statistical commands directly at the command line and see the results instantly. We can also load a whole file (CSV and other formats) into R and start querying it from the command line, and the loading time is impressively short. (Youtube, 2012)
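To give a feel for that workflow, here is a minimal sketch of loading a CSV file and querying it from the command line. The file contents and the names csv_path, scores, and high are made up for illustration; any CSV file of your own would work the same way.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names and values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,72", "Cara,85"), csv_path)

# Load the whole file into a data frame.
scores <- read.csv(csv_path)

# Query it directly from the command line.
print(mean(scores$score))             # average score
high <- scores[scores$score > 80, ]   # rows with a score above 80
print(high)
```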

Editing the data set using R

  • Structure the input using R

The command str(x), where x is the object you want to inspect, displays the structure of the input: the column headings, the variable types, and other details of the data set's columns.

  • Summarize the input using R

The command summary(x) returns a full summary of the data set. (Note that sum(x) only adds up the values; it does not summarize them.)

  • Fix the input using R

The command fix(x) is an excellent tool that works much like Excel: it opens a spreadsheet-style editor in which you can browse, modify, and edit any column or cell of the data set on the spot.
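The three commands above can be tried on a small data frame. This is only a sketch: the data frame people and its columns are invented for illustration, and fix() is wrapped in a check because it needs a live, interactive session.

```r
# A small, made-up data frame standing in for a loaded data set.
people <- data.frame(
  name   = c("Ann", "Bob", "Cara"),
  age    = c(34, 28, 45),
  member = c(TRUE, FALSE, TRUE)
)

str(people)              # column names, types, and the first few values
print(summary(people))   # per-column summary of the data set

# fix() opens an interactive editor, so it only makes sense in a live session.
if (interactive()) {
  fix(people)            # edits are saved back into `people` when the editor closes
}
```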

Maximum size of data supported by R 

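R holds all the objects it is using in virtual memory, so the amount of data it can handle depends on the build. A 32-bit build is limited to a 4 GB address space, whereas a 64-bit build can reach 128 TB, with the exact limit depending on the operating system (Windows differs from Unix-like systems). (Rdrr.Io, 2020) Individual objects have their own limit: each dimension of an array is capped at 2^31 - 1 (about 2*10^9) elements because it is held in a signed 32-bit integer, which puts the practical ceiling for a data frame at around two billion rows. Plain vectors can be longer on 64-bit builds of R 3.0.0 and later, but memory usually runs out well before that. Thus, R can be called a medium-size data analyzer, not a large-size one.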
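The two-billion figure comes from the largest value a signed 32-bit integer can hold, which R exposes as .Machine$integer.max. The sketch below prints it and estimates, as plain arithmetic, how much memory a single numeric column of that length would need at 8 bytes per value. The names max_int, gb_needed, and x are only illustrative.

```r
# Largest value a signed 32-bit integer can hold.
max_int <- .Machine$integer.max
print(max_int)

# Rough memory needed for one numeric column of that length
# (8 bytes per double), expressed in gigabytes.
bytes_needed <- as.numeric(max_int) * 8
gb_needed <- bytes_needed / 1024^3
print(gb_needed)

# Size of an object that actually exists, for comparison.
x <- numeric(1e6)
print(object.size(x), units = "Mb")
```

A single column of that length already needs roughly 16 GB, which is why memory tends to be the real constraint long before the integer limit is reached.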

The best solution for large data sets would be "disk.frame", which manipulates the data as "data.table" chunks that are written to and read from FST files on disk, so the work does not depend on the size of memory. If you are curious and want to try a massive data load, I encourage you to download the 17 years of data containing 37 million loans, with over 1.89 billion rows in the performance dataset. (Diskframe.Com, 2020)
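The idea of working through a file one chunk at a time can be sketched in base R. This is not disk.frame's own API and it does not use FST files; it only illustrates the chunking principle on a tiny, made-up CSV. The column names, the chunk size, and the variable names are all assumptions for the example.

```r
# Create a small CSV to stand in for a file too large for memory.
# The data, column names, and chunk size are assumptions for illustration.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, amount = seq(10, 100, by = 10)),
          csv_path, row.names = FALSE, quote = FALSE)

chunk_size <- 4
con <- file(csv_path, open = "r")

# Read the header once, then pull fixed-size chunks from the open connection.
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
total <- 0
rows_seen <- 0

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
    error = function(e) NULL   # treat a read error at end of file as "no more data"
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$amount)   # only one chunk is in memory at a time
  rows_seen <- rows_seen + nrow(chunk)
}
close(con)

print(c(rows = rows_seen, total = total))
```

Because each chunk is discarded before the next one is read, the memory needed depends on the chunk size rather than on the size of the file.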

Excel supports at most 1,048,576 rows by 16,384 columns and drops anything beyond that. Do not open a larger data set in Excel; otherwise, your results will be unreliable.

References

Diskframe.Com. (2020). Benchmarks 1: disk.frame beats Dask! disk.frame beats JuliaDB! Anyone else wanna challenge? Diskframe.Com. Retrieved 2022, from https://diskframe.com/articles/vs-dask-juliadb.html

Youtube, A. (2012, April 15). An Introduction to R - A Brief Tutorial for R {Software for Statistical Analysis}. YouTube. Retrieved 2022, from https://www.youtube.com/watch?v=LjuXiBjxryQ

Rdrr.Io. (2020). Memory-limits: Memory Limits in R. Rdrr.Io. Retrieved 2022, from https://rdrr.io/r/base/Memory-limits.html
