Aug 3, 2022

Big Data migrates to hybrid and multi-cloud environment

IDC research predicts that the Global Datasphere will grow to 175 zettabytes by 2025, and China's datasphere is on pace to become the largest in the world. IDC also predicts that China's datasphere will grow faster than any other region's, at roughly 30% a year, through 2025 (Reinsel et al., 2018).

Two factors drive forecasts of big data's future: a rising number of users doing everything online, and billions of embedded systems and connected devices collecting and sharing data sets. Given this growth, experts predict that multi-cloud and hybrid environments will move to the forefront for technology and business (Azhar, 2021). A multi-cloud environment combines cloud services, public or private, from more than one vendor (Seagate.com, 2022), giving users simultaneous access to the platforms and data services of multiple providers. A hybrid cloud, by contrast, combines a public cloud with a private cloud or on-premises infrastructure so that both can be used at the same time.

The UN 2030 plan on Big Data for Sustainable Development

The United Nations plans to take advantage of big data for development and humanitarian action by 2030 under the Sustainable Development Goals (SDGs) (United Nations, 2022). The SDGs are a global blueprint that calls on all countries to end deprivation, fight inequalities, and tackle climate change, promising that no one is left behind. Analysis shows that achieving these goals requires integrated action on social, environmental, and economic challenges, with a focus on leaving no one behind. The following figure shows how data science and analytics can contribute to sustainable development.

Conclusion

We need to treat big data as a highly valuable resource for green living, humanitarian work, and public safety in the future. However, the technology does not appear to be heading that way, since it is built, defined, and used by giant corporations for profit. My big concern is not how much information IoT devices and social-media giants collect and use, but how and when big data will inform climate-related studies and help create a better, safer life for humans. Climate issues now, and especially in the future, require adaptive strategies, actions that encourage changes in social behavior, and the development of regulatory responses to economic life. Future research should therefore use big data to understand the causes of climate change, develop predictive models, and design mitigation solutions (Sebestyén et al., 2021).

Reference

Azhar, F. (2021, July 20). The Future of Big Data: Predictions from Way2Smile's experts for 2020–2025. Way2smile. Retrieved 2022, from https://www.way2smile.ae/blog/big-data-predictions-for-2020-2025/

Edwards, J. (2021, December 7). What is predictive analytics? Transforming data into future insights. CIO. Retrieved 2022, from https://www.cio.com/article/228901/what-is-predictive-analytics-transforming-data-into-future-insights.html

Leite, M. L., de Loiola Costa, L. S., Cunha, V. A., Kreniski, V., de Oliveira Braga Filho, M., da Cunha, N. B., & Costa, F. F. (2021). Artificial intelligence and the future of life sciences. Drug Discovery Today, 26(11), 2515-2526.

Reinsel, D., Gantz, J., & Rydning, J. (2018). From Edge to Core. IDC. Retrieved July 28, 2022, from https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

Kharas, H. A. C. J. L. (2022, March 9). Using big data and artificial intelligence to accelerate global development. Brookings. Retrieved 2022, from https://www.brookings.edu/research/using-big-data-and-artificial-intelligence-to-accelerate-global-development/

Seagate.com. (2022). What is multicloud? | Business storage | Seagate Netherlands. Seagate.com. Retrieved 2022, from https://www.seagate.com/nl/nl/blog/what-is-multicloud/

Sebestyén, V., Czvetkó, T., & Abonyi, J. (2021). The Applicability of Big Data in Climate Change Research: The Importance of System of Systems Thinking. In Frontiers in Environmental Science (Vol. 9). Frontiers Media SA. https://doi.org/10.3389/fenvs.2021.619092

United Nations. (2022). Big Data for Sustainable Development. Retrieved 2022, from https://www.un.org/en/global-issues/big-data-for-sustainable-development

The Intersection of Big Data and Privacy

Most people worldwide, and especially in the United States, believe their online and offline activities are watched and monitored by private companies and government agencies. Research shows that six in ten U.S. adults do not think it is possible to go through daily life without being monitored and having data collected by companies or the government (Auxier et al., 2020). The same research found that roughly two-thirds of U.S. adults believe this monitoring provides no benefit to their lives, and about 81% believe that the potential risks of corporate data collection outweigh the benefits. It also found that 79% of adults are concerned about how companies use the data collected about them, and 64% are concerned about government data collection.

The PCAST believes that it is "the use of data (including born-digital or born-analog data and the products of data fusion and analysis) that is the locus where consequences are produced" (The White House, 2022). However, can PCAST do anything about big data security at this intersection when complex applications are involved? The big data security model is not recommended for complex applications, because it can be easily compromised and disabled (Jain et al., 2016).

How to protect public privacy for public safety

Big corporations breach public privacy almost every second of our lives through sophisticated social media networks, chains of financial information, cell phones, computers, and the IoT devices attached to our essentials. They break ethical and legal standards and policies by gathering our information through every means available, and they breach our privacy again by analyzing that stolen information. However, they are careful to report only what people like to see, and lawyers rarely find grounds to sue them. Does the government do anything about it? From my point of view, no! Can we, the people, do anything about it? From my point of view, sorry, no!

Let me give an example to explain in more detail why I believe we cannot do anything about it. Research revealed that social profiling by giant companies such as Microsoft, Google, Apple, Facebook, and Twitter does not even require a person to be registered in order to be profiled. Facebook collects data on you even if you do not have an account with them or any other network (Wagner, 2018). A shadow profile is built by algorithms that collect information about users, and non-users, without their consent; Facebook's "shadow profiling" system has been reported to predict the behavior of people who never had an account with greater than 80% accuracy.

This does not look good! It is wrong! 


Reference

Auxier, B., Rainie, L., Anderson, M., Perrin, A., Kumar, M., & Turner, E. (2020, August 17). Americans and Privacy: Concerned, Confused and Feeling Lack of Control Over Their Personal Information. Pew Research Center: Internet, Science & Tech. Retrieved 2022, from https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/

Jain, P., Gyanchandani, M., & Khare, N. (2016). Big data privacy: a technological perspective and review. Journal of Big Data, 3(1), 1-25.

The White House. (2022, April 18). President's Council of Advisors on Science and Technology. Retrieved 2022, from https://www.whitehouse.gov/pcast/

Wagner, K. (2018, April 20). Facebook collects data on you even if you don't have an account. Vox. Retrieved 2022, from https://www.vox.com/2018/4/20/17254312/facebook-shadow-profiles-data-collection-non-users-mark-zuckerberg

Big Data Analytic Tools and Comparisons

Most effective tools for analyzing big data

Big data analysis is a crucial part of business analysis and business intelligence; therefore, one of the biggest concerns in choosing a big data analyzer is how well it serves the business. A business intelligence analytics tool needs the following categories of analysis (crm.org, 2021): (1) descriptive analytics, which asks what happened; (2) diagnostic analytics, which asks why it happened; (3) predictive analytics, which forecasts what will happen in the future; and (4) prescriptive analytics, which recommends what we should do.
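As a toy illustration (invented data, not from the cited sources), the four categories can be shown side by side on a tiny monthly-sales series; the naive forecast rule is made up purely for the example:

```python
# Toy monthly-sales series (invented data) illustrating the four
# analytics categories named above.
sales = {"Jan": 100, "Feb": 110, "Mar": 95, "Apr": 120}

# (1) Descriptive: what happened? Summarize the data.
average = sum(sales.values()) / len(sales)

# (2) Diagnostic: why did it happen? Locate the weakest month.
worst_month = min(sales, key=sales.get)

# (3) Predictive: what will happen? Naive forecast from the last two months.
values = list(sales.values())
forecast_may = values[-1] + (values[-1] - values[-2])

# (4) Prescriptive: what should we do? A simple rule applied to the forecast.
action = "increase stock" if forecast_may > average else "hold stock"

print(average, worst_month, forecast_may, action)
# 106.25 Mar 145 increase stock
```

Real tools replace each step with far richer models, but the question each category answers stays the same.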

Another essential piece of pre-knowledge for choosing the best-fit data analytics tool is the set of core features of the data management architecture. These core features include data preparation, data mining, modeling, discovery, warehousing, data processing, data integration, and data transformation. Together, they lead us to a better understanding, and a bigger picture, of Volume, Variety, Velocity, Veracity, and Value.

Here are the pros and cons of the top data analytics tools currently in trend (Baig et al., 2019).

Data Storage

  • Hadoop with HDFS
    Pros: high bandwidth; high scalability; write once, read many.
    Cons: cluster setup is difficult; join operations are slow.
    Reference: (Lee, 2011)

  • HBase
    Pros: highly flexible; consistent; fault tolerant.
    Cons: not good for complicated applications.
    Reference: (Bakshi, 2012)

Data Processing

  • Hadoop
    Pros: processes huge volumes of data very easily.
    Cons: hard to install; hard to organize; needs experts.
    Reference: (Mukherjee, 2012)

  • MapReduce
    Pros: supports the Java language; processes chunks independently.
    Cons: suited only to batch-oriented processing.
    Reference: (Moon, 2014)

  • YARN
    Pros: maintains resources efficiently; continuity; scalable processing.
    Reference: (Ranjan, 2014)

Data Access

  • Pig
    Pros: preserves the originality of data by decreasing replication and lines of code; fast read/write operations.
    Cons: no web interface; no JDBC or ODBC network support.
    Reference: (Herodotou, 2011)

  • Hive
    Pros: good data accessibility; good loading and querying interface; direct data extraction; can incorporate HBase.
    Cons: does not support unstructured data sets; does not support complicated tasks.
    Reference: (Dhyani, 2014)

  • Cassandra
    Pros: high throughput and efficient response time; supports ACID properties.
    Cons: does not support join operations; does not support sub-queries; limited storage space.
    Reference: (Abramova, 2013)

  • Mahout
    Pros: supports different data-mining algorithms; supports patterns; supports huge volumes of data.
    Cons: no decision-tree algorithm.
    Reference: (Condie, 2013)

  • Jaql
    Pros: supports semi-structured data; supports physical transparency.
    Cons: needs a consistent format in select-statement queries and transform operations.
    Reference: (Rathee, 2013)

Data Management

  • Zookeeper
    Pros: highly reliable; offers atomicity; offers synchronization; ensures availability of data.
    Cons: multiple stacks need maintenance.
    Reference: (Fan, 2013)

  • Oozie
    Pros: supports execution of workflow in case of error; includes web API services.
    Cons: not good for off-grid development.
    Reference: (Islam, 2012)



Reference

Abramova, V., & Bernardino, J. (2013). NoSQL databases: MongoDB vs cassandra. In Proceedings of the international conference on computer science and software engineering (pp. 14-22). ACM.

Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2019). Big data tools: Advantages and disadvantages. Journal of Soft Computing and Decision Support Systems, 6(6), 14-20. 

Bakshi, K. (2012). Considerations for big data: Architecture and approach. In 2012 Aerospace Conference (pp. 1-7). IEEE.

Condie, T., Mineiro, P., Polyzotis, N., & Weimer, M. (2013). Machine learning on big data. In 29th International Conference on Data Engineering (ICDE) (pp. 1242-1244). IEEE. 

crm.org. (2021, December 13). Top 15 Best Data Analytics Tools & Software Comparison 2022. CRM.Org. Retrieved 2022, from https://crm.org/news/best-data-analytics-tools

Islam, M., Huang, A. K., Battisha, M., Chiang, M., Srinivasan, S., Peters, C., & Abdelnur, A.(2012). Oozie: Towards a scalable workflow management system for hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (p. 4). ACM.

Fan, W., & Bifet, A. (2013). Mining big data: current status and forecast to the future. ACM sIGKDD Explorations Newsletter, 14(2), 1-5. 

Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., & Babu, S. (2011). Starfish: A Self-tuning System for Big Data Analytics. In Cidr, 11(2), 261-272.

Lee, Y., Kang, W., & Lee, Y. (2011). A Hadoop-based packet trace processing tool. In International Workshop on Traffic Monitoring and Analysis (pp. 51-63). Springer, Berlin, Heidelberg. 

Moon, S., Lee, J., & Kee, Y. S. (2014). Introducing ssds to the hadoop mapreduce tool. In 7th International Conference on Cloud Computing (pp. 272-279). IEEE.

Mukherjee, A., Datta, J., Jorapur, R., Singhvi, R., Haloi, S., & Akram, W. (2012). Shared disk big data analytics with apache hadoop. In 19th International Conference on High Performance Computing (pp. 1-6). IEEE.

Ranjan, R. (2014). Streaming big data processing in datacenter clouds. IEEE Cloud Computing, 21(1), 78-83.

Rathee, S. (2013). Big data and Hadoop with components like Flume, Pig, Hive and Jaql. In International conference on cloud, big data and trust (Vol. 15). 

Dhyani, B., & Barthwal, A. (2014). Big data analytics using Hadoop. International Journal of Computer Applications, 108(12), 265-270.

RStudio, a professional data science solution

This is the first time I have seen the name R, and as soon as I read the first paragraph I wanted to download and install it immediately, as it is great software. I also found some excellent YouTube videos explaining this powerful software. I learned that we can use professional mathematical and statistical commands directly at the command line and see the result instantly. We can load a whole file (CSV and other formats) into the R system and start querying at the command line, and the loading time is impressively fast (YouTube, 2012).

Editing the data set using R

  • Structure the input using R

The command str(x), where x is your data set, returns the structure of the input: the headings, the type of each variable, and other details of the data set's columns.

  • Summarize the input using R

The command summary(x) returns a full summary of the data set. (Note that sum(x) is a different command, which only adds numeric values.)

  • Fix the input using R

The fix(x) command is an excellent tool that opens a spreadsheet-like editor, much like Excel, in which you can browse, modify, and edit any column or cell of the data set instantly.

Maximum size of data supported by R 

R's vectors (and therefore data frame columns) are limited to about two billion elements. R holds all data in virtual memory, which lets it store far more data on a 64-bit build of the OS (the details differ between Windows and Unix systems). A 32-bit OS imposes a limit of 4 GB of address space, whereas a 64-bit OS can reach 128 TB (Rdrr.io, 2020). The maximum length of a vector in R is 2^31 - 1 (about 2 * 10^9) elements, because lengths are held in a signed 32-bit integer. Thus, R can be called a medium-size data analyzer rather than a large-scale one.
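These limits can be turned into a quick back-of-the-envelope check. The sketch below (plain Python, with the common 8-bytes-per-double assumption made explicit) shows the per-vector length ceiling and estimates how much memory an all-numeric table would need:

```python
# R holds a vector's length in a signed 32-bit integer, so a single vector
# (e.g., one data frame column) tops out at 2**31 - 1 elements; numeric
# cells are assumed to take 8 bytes each (double precision).
MAX_VECTOR_LEN = 2**31 - 1  # about 2.1 billion elements

def estimate_memory_gb(rows: int, cols: int, bytes_per_cell: int = 8) -> float:
    """Rough in-memory size of an all-numeric table, in gigabytes."""
    return rows * cols * bytes_per_cell / 1024**3

# A 37-million-row table with 30 numeric columns is well within the
# per-column length limit but already needs several gigabytes of RAM:
print(MAX_VECTOR_LEN)                                # 2147483647
print(round(estimate_memory_gb(37_000_000, 30), 1))  # 8.3
```

The same arithmetic explains why the 1.89-billion-row data set mentioned below is hopeless to hold in memory at once and needs an out-of-core approach.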

One good solution for large data sets is disk.frame, which manipulates data.tables as chunks written to and read from FST files on disk (not from memory, so it does not depend on memory size). If you are curious and want to try a massive data load, I encourage you to download the 17 years of data containing 37 million loans, with over 1.89 billion rows in the performance data set (Diskframe.com, 2020).
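The idea behind disk.frame, processing fixed-size chunks so memory use stays bounded, can be sketched in a few lines of standard-library Python; the file contents, column name, and chunk size here are all invented for the example:

```python
# Out-of-core idea in miniature: read a CSV in fixed-size chunks and keep
# only a running aggregate, so memory use does not grow with file size.
import csv
import io
import itertools

def chunked_sum(csv_file, column, chunk_size=2):
    reader = csv.DictReader(csv_file)
    total = 0.0
    while True:
        chunk = list(itertools.islice(reader, chunk_size))  # one chunk in memory
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)   # fold into aggregate
    return total

data = io.StringIO("loan_id,balance\n1,100\n2,250\n3,50\n4,300\n")
print(chunked_sum(data, "balance"))  # 700.0
```

disk.frame applies the same pattern at scale, with the chunks persisted as FST files and processed in parallel.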

Excel supports 1,048,576 rows by 16,384 columns per worksheet and will reject anything beyond that, so do not try a data set bigger than that size in Excel, or your results will be unreliable.

Reference

Diskframe.Com. (2020b). Benchmarks 1: disk.frame beats Dask! disk.frame beats JuliaDB! Anyone else wanna challenge? Diskframe.Com. Retrieved 2022, from https://diskframe.com/articles/vs-dask-juliadb.html

Youtube, A. (2012, April 15). An Introduction to R - A Brief Tutorial for R {Software for Statistical Analysis}. YouTube. Retrieved 2022, from https://www.youtube.com/watch?v=LjuXiBjxryQ

Rdrr.Io. (2020). Memory-limits: Memory Limits in R. Rdrr.Io. Retrieved 2022, from https://rdrr.io/r/base/Memory-limits.html

Conventional database management vs. Hadoop data systems

Without going into the definitions of the traditional and Hadoop data management systems, the following are the top points of comparison between the two systems with respect to parallel processing.

Hadoop's architecture is designed as a parallel and distributed processing system. Each feature selector in Hadoop can be split into subtasks, and the subtasks can then be processed in parallel (Hodge et al., 2016). Multiple feature selectors can also be processed in parallel, allowing them to be compared; this is unique to Hadoop, whereas a conventional RDBMS does not have this capability.

In the field of data mining, Hadoop MapReduce algorithms run individual feature selections on a parallel (parallelized) processing system, which enables Hadoop to mine data at a vast scale. Other data-mining tools, such as Weka, Matlab, and SPSS, are designed for small-scale data mining because of how their processing algorithms are built.

YARN in Hadoop is highly configurable and can assign work to the nodes in a cluster, where the Hadoop Distributed File System (HDFS) spans all the nodes under a single namespace. YARN can run applications beyond MapReduce, and this ability makes Hadoop a platform for parallel data processing at vast scale. An ApplicationMaster coordinates each application's work in Hadoop: its job is to negotiate resources from the central ResourceManager and work with the NodeManager agents to execute and monitor the tasks. The MapReduce procedure, in turn, maps the parallel processing over separate chunks, combines the results, and issues a single output; the inputs and outputs are stored in HDFS for future analysis. HDFS uses parallel processing to link the file systems on many local nodes into one extensive file system, and this linking of resources creates high aggregate bandwidth across the cluster.
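The map, shuffle-combine, and reduce flow described above can be illustrated with a toy word count. This is a sequential simulation in plain Python: each map task is independent of the others, which is exactly the property that lets a real Hadoop cluster run them in parallel across nodes.

```python
# Toy word count showing the MapReduce phases: independent map tasks over
# input chunks, then one reduce that combines the intermediate pairs.
from collections import defaultdict

def map_phase(chunk):
    """One map task per input chunk: emit (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Combine all intermediate pairs into a single output."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big", "data cluster"]  # HDFS-style input splits
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(mapped)
print(result)  # {'big': 2, 'data': 2, 'cluster': 1}
```

In real Hadoop, HDFS supplies the splits, YARN schedules one map task per split on the cluster, and the framework shuffles the intermediate pairs to the reducers.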

So, with all that said, an RDBMS is used for data storage, manipulation, and retrieval, whereas Hadoop is an open-source system architecture for storing data and running applications and processes in parallel (GeeksforGeeks, 2022). Therefore, from my point of view, Hadoop is far more of a parallel-processing-based system than the conventional RDBMS.


Reference

GeeksforGeeks. (2022, February 25). Difference Between RDBMS and Hadoop. Retrieved 2022, from https://www.geeksforgeeks.org/difference-between-rdbms-and-hadoop/

Hodge, V. J., O’Keefe, S., & Austin, J. (2016). Hadoop neural network for parallel and distributed feature selection. In Neural Networks (Vol. 78, pp. 24–35). Elsevier BV. https://doi.org/10.1016/j.neunet.2015.08.011

Industrial use of BIG DATA



Sohiel Nikbin

Big Data Analytics Industry

Big data and HIPAA face serious challenges together, including data structure, big data on cloud computing systems, big data storage, cyber security of big data (both at rest and in flight), data standardization and, most importantly, data governance (Kruse et al., 2016). HIPAA demands clarity around sensitive patient health information and personal privacy, and it requires all database systems, including big data providers, to follow national standards for protecting patient health records. On the other hand, big data plays a huge role in biomedical science, where it is used massively to analyze for better care and a healthier society. This section explores the specific challenges, more than the benefits, of using big data in a health care industry that must comply with HIPAA.

Big data structure 

Big data is measured in exabytes (10^18 bytes) and beyond, and there are still many challenges to handling that volume of information online. The structure for storing that much data in online storage is therefore a challenge in itself; moreover, that structure must comply with NIST and HIPAA regulations and policies.

Big data on cloud computing systems

Cloud computing is the default platform for big data. The two go together: cloud systems store the data to be analyzed and provide on-demand access to the results for whoever needs them (Muniswamaiah et al., 2019). This relationship between big data and cloud systems follows an input, processing, and output model. Cloud providers must comply with HIPAA to ensure that security and privacy measures for patients and health providers are all in place.

There are five HIPAA security measures for cloud systems using big data, including:

  1. Data must be encrypted in the cloud, as required by HIPAA policy.
  2. Cloud systems must use two-factor authentication as an extra security measure.
  3. They must have access control in place, as required by HIPAA policy.
  4. Cloud systems must have data classification tools both for organizing health information and for protecting sensitive data.
  5. Cloud systems must keep logs of all activity, as required by HIPAA regulations, including who accessed the data, what part of the data was accessed, when it was accessed, for what purpose, who authorized it, and much other critical information. (Trends, 2021)
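As a hypothetical sketch (the function and role names are invented; this is not a HIPAA API or any vendor's implementation), measures 3 and 5 could look like the following: every read of a record is checked against a role list, and each attempt, granted or not, is appended to an activity log recording who, what, when, and why:

```python
# Hypothetical access-control plus audit-log sketch; all names are
# illustrative, not taken from any real HIPAA-compliance product.
from datetime import datetime, timezone

AUTHORIZED_ROLES = {"physician", "nurse"}
audit_log = []  # the HIPAA-style activity record

def read_record(user, role, patient_id, purpose):
    granted = role in AUTHORIZED_ROLES              # access control (measure 3)
    audit_log.append({                              # activity logging (measure 5)
        "user": user, "patient": patient_id, "purpose": purpose,
        "time": datetime.now(timezone.utc).isoformat(), "granted": granted,
    })
    if not granted:
        raise PermissionError(f"role '{role}' may not read patient records")
    return {"patient": patient_id, "record": "..."}  # stand-in for the data

read_record("dr_lee", "physician", "P-001", "treatment")
try:
    read_record("visitor", "guest", "P-001", "curiosity")
except PermissionError:
    pass
print([entry["granted"] for entry in audit_log])  # [True, False]
```

Note that denied attempts are logged as well; an audit trail that records only successful access would miss exactly the events an investigator needs.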

Reference

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013, January). Big data: Issues and challenges moving forward. In 2013 46th Hawaii international conference on system sciences (pp. 995-1004). IEEE.

Kruse, C. S., Goswamy, R., Raval, Y. J., & Marawi, S. (2016). Challenges and opportunities of big data in health care: a systematic review. JMIR Medical Informatics, 4(4), e5359.

Muniswamaiah, M., Agerwala, T., & Tappert, C. (2019). Challenges of Big Data Applications in Cloud Computing. In 9th International Conference on Computer Science, Engineering and Applications (CCSEA 2019). 9th International Conference on Computer Science, Engineering and Applications. Air Publishing Corporation. https://doi.org/10.5121/csit.2019.90918

Trends, M. (2021). HIPAA compliance, big data and the cloud: A guide for health care providers. Analyticsinsight.net. Retrieved 2022, from https://www.analyticsinsight.net/hipaa-compliance-big-data-and-the-cloud-a-guide-for-health-care-providers/

Using Big Data Analytics in Genomics and Health Care Industry

The primary purpose of using big data across the healthcare industry is to create competitive, comparative, and effective performance improvements. At least three features of big data, volume, variety, and velocity, are widely used in this industry (Roski et al., 2014, p. 1). On the other hand, the major challenges have been the concepts of consent, data ownership, and control of healthcare data on a big data platform.

Big data architectural framework in the health care industry

There is a big difference between traditional data analytics in the healthcare system and big data analytics. The significant difference "lies in how processing is executed" (Raghupathi & Raghupathi, 2014, p. 4). Traditional analysis runs against structured database systems using a "business intelligence tool installed on a stand-alone system, such as a desktop or laptop. Because big data is by definition large, the processing is broken down and executed across multiple nodes" (Raghupathi & Raghupathi, 2014).

This graph shows the complexity of big data analysis in the healthcare industry.


Big data and personal genomics in the health care industry

One of the most critical parts of healthcare data is genomic information, which enables "individualized strategies for diagnostic or therapeutic decision-making by utilizing patients' genomic information" (He et al., 2017, para. 1). The genomic data structure looks like the following flowchart, extracted from a peer-reviewed article entitled "Big Data Analytics for Genomic Medicine" (He et al., 2017).

Big data in genomic data science has uncovered many essential hidden elements of an individual's genomic information, including "hidden patterns, unknown correlations, and other insights through examining large-scale various data sets" (He et al., 2017, para. 1). Big data is therefore a powerful study tool that healthcare researchers use to "decode the functional information hidden in DNA sequences" (Genome.gov, 2021).

The following image shows the amount of data collected from the Genomics project in reality, where each shark represents 100,000,000 GB of data. (Genome.Gov, 2021)

Software known as aligners ("Whole-Genome Alignment") is among the best tools for healthcare big data genome researchers; aligners "determine where individual pieces of DNA sequence lie on each part of a reference genome sequence" (Genome.gov, 2021).
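A drastically simplified stand-in for such an aligner can be written as an exact substring search that reports where a read lies on a reference sequence. Real aligners use genome indexes and tolerate mismatches and gaps, so this is only a conceptual sketch:

```python
# Conceptual aligner sketch: report every position where a DNA "read"
# occurs exactly in a reference sequence (real aligners allow mismatches).
def align(reference: str, read: str):
    positions, start = [], reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # continue past this hit
    return positions

reference = "ACGTACGTGACG"
print(align(reference, "ACG"))  # [0, 4, 9]
```

The reason genomics is a big data problem is visible even here: a human reference genome is roughly three billion characters, and a sequencing run produces hundreds of millions of reads to place on it.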

Big data frameworks and analysis in genomic data science may in the future reveal even more hidden information in humans, animals, and plants. From my point of view, quantum computing combined with big data analytics in genomic data science will be the next healthcare revolution.


Reference

Genome.Gov. (2021). Genomic Data Science Fact Sheet. Genome.Gov. Retrieved 2022, from https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science#:%7E:text=The%20Big%20Picture,data%20within%20the%20next%20decade.

He, K., Ge, D., & He, M. (2017). Big Data Analytics for Genomic Medicine. In International Journal of Molecular Sciences (Vol. 18, Issue 2, p. 412). MDPI AG. https://doi.org/10.3390/ijms18020412

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. In Health Information Science and Systems (Vol. 2, Issue 1). Springer Science and Business Media LLC. https://doi.org/10.1186/2047-2501-2-3

Roski, J., Bo-Linn, G. W., & Andrews, T. A. (2014). Creating value in health care through big data: opportunities and policy implications. Health Affairs, 33(7), 1115-1122.



Starbucks: Using Big Data analytics and AI

Starbucks is one of the huge international companies that adopted big data a few years ago. They claim that the reason they switched to big data is "to adjust its menu according to the customer's preferences" (Kumari, 2022). However, from my point of view, the main reason Starbucks switched to big data is to win more business and more benefit in the market. The crucial question is how big data can generate more profit, leads, and business for ultra-large corporations. There are many reasons, but the following are the major ones that genuinely work for corporations seeking to expand their domain internationally.

First, Starbucks needs to understand clearly where, when, and why its customers buy its products, and only big data can answer that. Second, Starbucks needs to keep its current customers through some sort of loyalty program, which is also tied to the first reason; big data analysis can handle that. Third, Starbucks needs to capture as much cross-selling and upselling information as possible, and big data has an answer for that. Fourth, targeted advertising is among the freshest and most important tools in the current market, and without big data it is impossible for Starbucks to advertise to targeted customers. Fifth, only with big data could Starbucks fix the inefficiencies of its supply chain, and we all know the importance of the supply chain for a company such as Starbucks. Sixth, forecasting the future is crucial for Starbucks to stay alive and maintain business continuity, and big data is the best tool for Starbucks to forecast the future and get the most out of it. Finally, innovation is the main pillar of survival for a big corporation such as Starbucks, and big data is the best tool for innovation, discovery, and identifying Starbucks' best sources of revenue.

In terms of time and cost, Starbucks also had many good reasons to switch to big data: it could achieve excellent cost and time savings, understand the technical and logistical conditions of the international market, closely watch social media networks, boost customer acquisition and retention, solve advertising issues, and more (Team, 2021).


Reference

Kumari, R. (2022). Top 10 Companies that Uses Big Data | Analytics Steps. Analyticssteps.Com. Retrieved 2022, from https://www.analyticssteps.com/blogs/companies-uses-big-data

Team, T. (2021, June 16). Why Big Data – Benefits and Importance of Big data. TechVidvan. Retrieved 2022, from https://techvidvan.com/tutorials/why-big-data/

What makes big data different from conventional data that you use every day?

I learned a lot from the class meeting overview in which Professor Dennis explained the three-V model plus another V (velocity, volume, variety, and veracity). The essential distinction between traditional and big data is that traditional data uses a centralized database architecture, while big data uses a distributed architecture. Therefore, traditional databases gather information, populate it, calculate, and report based on a predefined set of questions, queries, and data schemas. Big data architecture, on the other hand, performs on-the-fly calculations on preexisting data sets; in big data, the schemas and variables are created later, based on the questions asked. Choosing traditional data versus big data thus depends on the purpose of the work or project; it does not mean that one is better than the other. There are pros and cons to using each framework. For example, big data gives us a more detailed analysis of markets, business models, and forecasting, while such analysis is limited in a traditional data architecture. Conversely, traditional data offers more security and easier use, while security is a massive issue in big data because of on-the-fly data manipulation and calculation.
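The contrast between a predefined schema and on-the-fly calculation can be made concrete with a small sketch (invented records, plain Python): the big-data style keeps messy raw records as-is and imposes a schema only at the moment a question is asked, often called schema-on-read:

```python
# Schema-on-read sketch: raw, inconsistent records are stored untouched;
# parsing and type coercion happen only when a query is answered.
import json

raw_records = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "7", "country": "DE"}',  # extra field, string number
]

def query_total_clicks(records):
    """Apply the schema at read time: parse and coerce while answering."""
    return sum(int(json.loads(r).get("clicks", 0)) for r in records)

print(query_total_clicks(raw_records))  # 10
```

A traditional schema-on-write database would instead have rejected the second record at insert time unless the table already defined a `country` column and an integer `clicks` column, which is precisely the predefined-questions constraint described above.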

However, what is new in big data, or what makes the big data new? 

From my point of view, the new digital age we now live in is the fundamental reason for moving toward big data and what makes big data new. Traditional data cannot answer many new technical and digital-economy problems, forecasts, and evaluations. As I mentioned above, this does not mean that big data is a good alternative for all purposes; it has many problems that still need to be solved, including scaling: "increasing data throughput, growing amounts of data, and streamlining technical systems to facilitate data processing" (Lugmayr et al., 2017).

"Identify and discuss at least 2 public sites that provide free access to big data sets." (CTU, 2022)

AWS is one of the most sophisticated, easy-to-use, and secure platforms in the big data market, with a free tier available; I currently use both the free and paid AWS cloud systems at our company. AWS is very cost-effective, runs high-performance queries on petabytes of structured and unstructured data, and can generate compelling reports.

Microsoft is another prominent data service provider, with highly secure, high-performance software and hardware infrastructure. Microsoft also offers a free tier, which I have not tried before.


Reference

CTU, (2022). Colorado Technical University. Student's restricted panel. Retrieved 2022, from Colorado Technical University restricted area of assignments.

Lugmayr, A., Stockleben, B., Scheib, C., & Mailaparampil, M. A. (2017). Cognitive big data: Survey and review on big data research and its implications. what is really "new" in big data? Journal of Knowledge Management, 21(1), 197-212. https://doi.org/10.1108/JKM-07-2016-0307
