What is big data, how it works and how it can be used

The meaning of the term “big data” is not unambiguous: it refers to large or very large quantities of data, which require dedicated technologies in order to be interpreted.

Big data management and big data analytics operations are essentially used to transform data into value: to sort through information and extract only the knowledge that is useful to an organization or to a specific sector.

Yet, even today, many people could not answer the question: what exactly is big data?

And even more users are unaware of how much the spread of big data analysis could put their online privacy at risk.

  • 0. What is the meaning of the concept of “big data”

    Translated literally, the term “big data” simply means “a large amount of data”. At the same time, however, it is not easy to answer the question unequivocally: what is big data?

    This is because there is no clear line that separates big data from other types of data. Indeed, several aspects must be taken into account, which will be explored in greater detail in the following paragraphs.

    In principle, it is possible to talk about big data whenever the set of data in question requires special tools to analyze and catalog it, but also to understand it in its most specific aspects.

    The objective when facing big data is in fact to transform information into knowledge: to develop technologies and analytical methods capable of “taming” these very large quantities of data, in order to extract from them all the value possible.

    Big data are very large quantities of information, which require ad hoc models and tools to be analyzed

    Another aspect to take into consideration is that the meaning and boundaries of the term “big data” have changed over time.

    Over the last 40 years, the amount of digital information circulating around the world has grown exponentially. Suffice it to say that, towards the late 1980s, the data in circulation amounted to roughly 300 petabytes: the equivalent of 300,000 terabytes.

    In the early 2000s the figure was about 2.2 exabytes, corresponding to 2,200 petabytes. And in 2014, international data traffic was estimated at more than 650 exabytes.

    Today, counting the amount of data circulating on the web is even more complex. But to get an idea, it is enough to consider that in 2023 about two-thirds of the global population (corresponding to 5.4 billion people) connected to the Internet.

  • 1. What are the five “Vs” of big data

    An interesting study by Douglas Laney helps to understand the phenomenon of data growth by taking into account three factors, or three “Vs”, which are still considered distinctive of big data today.

    The first “V” is volume: the actual amount of information contained within a big data dataset. Bear in mind that a single dataset can draw on different sources: on the one hand there are datasets linked to social media, on the other those coming from collections of email addresses or databases, but also all the data recorded by IoT (Internet of Things) devices, and then the data that travels on the cloud.

    The second “V” is variety: the type and diversity of the information contained within a big data dataset. Structured data is the simplest to manage: think, for example, of the information contained in a database or an Excel spreadsheet.

    The five “Vs” of big data refer to volume, variety, velocity, veracity and value

    Next comes semi-structured data: for example, data organized according to a mix of fixed and variable criteria. Finally there is unstructured data, such as much of what travels on the web. The greater the variety of the data, the greater the difficulty of fully understanding and organizing it.
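
    To make the distinction concrete, here is a minimal, purely illustrative Python sketch (the record and field names are invented for the example) showing the same piece of information in structured, semi-structured and unstructured form.

        # Structured: fixed columns, one value per field (like a database row)
        structured_row = {"user_id": 42, "city": "Rome", "purchases": 3}

        # Semi-structured: a mix of fixed and variable fields (typical of JSON documents)
        semi_structured_doc = {
            "user_id": 42,
            "city": "Rome",
            "tags": ["electronics", "books"],           # variable-length list
            "last_review": {"stars": 4, "text": "ok"},  # optional nested object
        }

        # Unstructured: free text, with no predefined schema at all
        unstructured_text = "User 42 from Rome bought three items and left a short review."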

    The third “V” is velocity: the amount of time needed to generate and record new information. This category becomes fundamental especially when dealing with sensitive information: for example, data related to analytics, but also data related to IT security and online privacy.

    Over the years, Laney’s famous three “Vs” have been joined by two more. First of all, the “V” of veracity, with which analysts quantify the quality and reliability of the data being processed.

    And then the “V” of value, which serves to understand, in a more general way, how much knowledge a big data dataset is capable of generating.

    It should also be underlined that the system of three or five “Vs” is not the only framework for reflecting on big data. There are also more conceptual models, such as the one that analyzes datasets based on the criteria of Information, Technology, Methods and Impact (ITMI).

  • 2. How big data generation and acquisition works

    The big data life cycle is generally organized into two macro-categories: the first is big data management, which begins with a phase of data generation and acquisition.

    Big data generation is in turn divided into human-generated data, machine-generated data and data related to business processes.

    Human-generated data is the product of user activity: for example, information uploaded to social networks, blogs or micro-blogging platforms, but also reviews or feedback published on e-commerce sites, news sites or aggregators.

    Machine-generated data is produced by non-human sources. Think of GPS data, or of data from the aforementioned IoT devices, but also of meteorological data, or of data processed by medical machinery.

    Big data management is the first phase of the big data life cycle and begins with generation and acquisition

    Finally, business-generated data can be processed by both humans and machines. The important thing is that it falls within the scope of information useful for driving the processes and decisions of a corporate business.

    The acquisition of big data can also be categorized according to specific methods. The first is data retrieved through the APIs of web services: from those of the aforementioned social networks to those of search engines.

    Then there is data retrieved through specific software such as web scraping tools: programs that automatically collect information by scanning documents on the web.
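
    As a purely illustrative example, a web scraper can be sketched in a few lines of Python. The URL is a placeholder, and the third-party requests and beautifulsoup4 libraries are just one possible choice among many.

        # Minimal web-scraping sketch: download a page and collect its links.
        import requests
        from bs4 import BeautifulSoup

        response = requests.get("https://example.com")  # placeholder URL
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract every hyperlink found in the document
        links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
        print(links)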

    It is also possible to acquire data using so-called ETL tools: those that bring together the processes of Extract (extraction), Transform (transformation) and Load (loading). This methodology can be applied to both relational and non-relational databases.
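
    A minimal ETL sketch, assuming pandas and a local SQLite database; the file name, column names and table name are invented for the example.

        import sqlite3
        import pandas as pd

        # Extract: read raw data from a source file
        df = pd.read_csv("orders.csv")

        # Transform: clean and enrich the data
        df = df.dropna(subset=["customer_id"])           # drop rows missing a key field
        df["total"] = df["quantity"] * df["unit_price"]  # derive a new column

        # Load: write the result into a relational database
        with sqlite3.connect("warehouse.db") as conn:
            df.to_sql("orders", conn, if_exists="replace", index=False)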

    Finally, it is possible to use technologies that enable the continuous acquisition of data streams: systems that capture single events and save them to a database in near real time.
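
    A toy sketch of this idea, using Python’s standard sqlite3 module: each incoming event is stored immediately, together with its arrival time. In a real system the events would come from a message broker or sensor feed rather than a hard-coded list.

        import sqlite3
        import time

        incoming_events = [{"sensor": "s1", "value": 21.5}, {"sensor": "s2", "value": 19.8}]

        with sqlite3.connect("events.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, sensor TEXT, value REAL)")
            for event in incoming_events:
                conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                             (time.time(), event["sensor"], event["value"]))
                conn.commit()  # persist each event as soon as it arrives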

  • 3. How big data management ends

    The big data management phase ends with two processes: first the extraction and cleaning of the dataset, followed by the storage and integration of the information.

    The first problem to take into consideration is that datasets, once collected, are not yet ready to be processed. Just think of the fact that the different pieces of information in big data often each have their own representation.

    The processes of extraction and cleaning serve to collect consistent groups of information and to organize them so that they can be analyzed.

    The way big data is extracted varies from case to case and from data type to data type. Another element to consider is the possible presence of false information. This is where cleaning comes into play, which is organized around specific models that check the validity of the data.
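
    A minimal cleaning sketch with pandas, removing duplicates and discarding records that fail a simple validity rule; the column names and values are invented for the example.

        import pandas as pd

        df = pd.DataFrame({
            "user_id": [1, 1, 2, 3],
            "age": [34, 34, -5, 29],  # -5 is clearly invalid
        })

        df = df.drop_duplicates()            # remove duplicate records
        df = df[df["age"].between(0, 120)]   # keep only plausible values
        print(df)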

    Extracting and cleaning the dataset prepares the subsequent storage and integration phases

    The next phase, big data storage, has the main objective of ensuring the data’s availability over time, taking into account all the complications described so far.

    Storage relies on particular mechanisms and tools, which vary based on the type of underlying database: for example, distributed file systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System).

    These file systems provide the storage infrastructure in which big data is kept. Data storage is then carried out using specific technologies such as NoSQL databases.
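
    As an illustration of the NoSQL approach, here is a small sketch using MongoDB through the pymongo driver. It assumes a MongoDB instance reachable on localhost; the database, collection and document contents are invented for the example.

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")
        collection = client["bigdata_demo"]["events"]

        # Documents need no fixed schema: each one can carry different fields
        collection.insert_one({"sensor": "s1", "value": 21.5, "tags": ["indoor"]})
        collection.insert_one({"user": "anna", "action": "login"})

        # Simple query: all documents produced by sensor "s1"
        for doc in collection.find({"sensor": "s1"}):
            print(doc)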

    Storage usually goes hand in hand with integration: another procedure, which intervenes on the datasets to make them easier to analyze, for example by recognizing the various textual contents they contain.

  • 4. What does big data analytics consist of?

    The so-called big data analytics phase follows the big data management phase described in the previous paragraphs. It involves different ways of interrogating the data: from descriptive analysis (descriptive analytics) to predictions (predictive analytics) and prescriptions (prescriptive analytics).

    Big data analytics starts with the modeling, processing and analysis of the data, with the goal of beginning to find useful information and value within the complexity of the datasets.

    The mode of analysis depends on the type of data it acts on: structured, semi-structured or unstructured. And it develops based on the variety of files present within the set.
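
    As a small example of the descriptive side, a structured dataset can be summarized in a few lines of pandas; the column names and values below are invented.

        import pandas as pd

        sales = pd.DataFrame({
            "region": ["north", "north", "south", "south"],
            "revenue": [120.0, 95.5, 80.0, 110.0],
        })

        print(sales.describe())                          # basic descriptive statistics
        print(sales.groupby("region")["revenue"].sum())  # aggregate revenue by region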

    For example, text analysis allows you to…
