Saturday, November 27, 2021

Difference between Dataproc, Dataflow, Dataprep and Data Fusion

| Dataproc | Dataflow | Dataprep | Data Fusion |
| --- | --- | --- | --- |
| Supports manual provisioning of clusters. | Supports automatic provisioning of clusters. | Helps prepare and clean data for later use. | Supports major Hadoop distributions (MapR, Hortonworks) and clouds (AWS, GCP, Azure) for building pipelines. |
| If your systems are Hadoop-dependent, Dataproc is a good fit. It was created as an extension service for Hadoop, and real-time data collection with Hadoop and Spark integration is its most prominent feature. | Based on Apache Beam; used for data-lake ingestion, data cleaning, and processing workloads in parallel. It unifies the programming and execution models. | Primarily a data-preparation tool, with visual analytics and data processing as a plus. Mainly used with Bigtable and BigQuery. Use it if you only need to find anomalies or redundancy in the data. | Based on CDAP, an open-source pipeline development tool. Offers a visual tool for building ETL/ELT pipelines. |
| Choose Dataproc if you prefer a hands-on DevOps approach. | Select Dataflow if you prefer a serverless approach. | Fully managed, no-ops approach. | UI-driven, fully managed, no-ops. On GCP it runs jobs on Cloud Dataproc clusters and ships with multiple prebuilt connectors for wiring sources to sinks. |
| Simple and easy to use. | Relatively difficult to use. | Easy to use. | Very easy to use. |
| Used for the data science and ML ecosystem. | Used for batch and streaming data processing. | Used for UI-driven data processing with multiple source integrations. | Codeless pipeline development; its enterprise readiness makes data lineage and metadata management much easier. |
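The selection guidance above can be condensed into a small decision helper. This is only an illustrative sketch of the comparison, not an official Google decision tree; the `choose_gcp_service` function and its flag names are my own summary of the points above.

```python
def choose_gcp_service(hadoop_dependent: bool = False,
                       cleaning_only: bool = False,
                       codeless_etl: bool = False,
                       serverless_preferred: bool = False) -> str:
    """Map workload traits to a GCP data service, per the comparison above.

    Illustrative only -- real selection also weighs cost, team skills,
    latency requirements, and existing pipelines.
    """
    if hadoop_dependent:
        return "Dataproc"      # hands-on DevOps, managed Hadoop/Spark clusters
    if cleaning_only:
        return "Dataprep"      # visual data preparation, anomaly spotting
    if codeless_etl:
        return "Data Fusion"   # CDAP-based, UI-driven ETL/ELT pipelines
    if serverless_preferred:
        return "Dataflow"      # Apache Beam, batch + streaming, no-ops
    return "Dataflow"          # sensible default for general processing
```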

Tuesday, August 3, 2021

Difference between Structured, Semi-Structured and Unstructured Data

| Properties | Structured data | Semi-structured data | Unstructured data |
| --- | --- | --- | --- |
| Basic | Data whose elements are addressable for effective analysis, organized into formatted tables, schemas, or repositories, typically a database. | Data that does not reside in a relational database but has some organizational properties that make it easier to analyze; with some processing it can be stored in a relational database. | Data that is not organized in a predefined manner and has no predefined data model, so it is not a good fit for a mainstream relational database; alternative platforms exist for storing and managing it. |
| Databases | RDBMS such as Oracle, MySQL, PostgreSQL. Commonly stored in data warehouses. | Non-RDBMS / NoSQL databases such as MongoDB, DynamoDB, Riak, Redis. Follows the Hadoop methodology. Commonly stored in data lakes and data marts. | NoSQL databases such as MongoDB, Cassandra, HBase, CouchDB, DynamoDB, Riak, Redis. Stores character and binary data such as pictures, audio, video, PDFs, log files, satellite images, scientific images, radar data, etc. Commonly stored in data lakes and data marts. |
| Scalability | Scaling the DB schema is very difficult; horizontal and vertical scaling can be applied. | Scaling is simpler than for structured data. | The most scalable. |
| Transactions | Mature transaction support and various concurrency techniques; supports ACID. | Transaction support is adapted from the DBMS and not mature. | No transaction management and no concurrency. |
| Flexibility | Schema-dependent and less flexible; data has a predefined format; schema on write. | More flexible than structured data but less flexible than unstructured data; variety of data shapes and sizes; schema on read. | The most flexible; there is no schema at all; variety of data shapes and sizes; schema on read. |
| Query performance | Structured queries allow complex joins. | Queries over anonymous nodes are possible. | Only textual queries are possible. |
| Version management | Versioning over tuples, rows, and tables. | Versioning over tuples or graphs is possible. | Versioned as a whole. |
| Robustness | Very robust. | Newer technology, not widely spread. | Newer technology, not widely spread. |
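The schema-on-write vs. schema-on-read distinction above is the key practical difference: structured stores validate a fixed schema when data is written, while semi-structured data carries its own (possibly varying) structure and the schema is applied only when the data is read. A minimal sketch of schema-on-read in Python, with hypothetical example records:

```python
import json

# Semi-structured records: nothing enforced a schema at write time,
# so each record may carry different fields.
raw_records = [
    '{"id": 1, "name": "alice", "email": "a@example.com"}',
    '{"id": 2, "name": "bob", "tags": ["vip", "beta"]}',
    '{"id": 3}',
]

def read_with_schema(line: str) -> dict:
    """Apply a schema at read time: select the fields we care about
    and supply defaults for anything a record is missing."""
    rec = json.loads(line)
    return {
        "id": rec.get("id"),
        "name": rec.get("name", "unknown"),
        "tags": rec.get("tags", []),
    }

rows = [read_with_schema(r) for r in raw_records]
```

Every row now conforms to one shape regardless of what the source record contained, which is exactly the "with some processing, it can be stored in a relational database" point from the table.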

Thursday, June 17, 2021

Compare star schema and snowflake schema in data warehouse modeling

| Star Schema | Snowflake Schema |
| --- | --- |
| Simple database design. | Very complex database design. |
| Denormalized data structure. | Normalized data structure. |
| Queries run faster. | Queries run slower compared to the star schema. |
| Contains one fact table surrounded by dimension tables. | One fact table surrounded by dimension tables, which are in turn connected to further dimension tables. |
| Easy to understand and provides optimal disk usage. | Uses less disk space. |
| A single join creates the relationship between the fact table and any dimension table. | Requires many joins to fetch the data. |
| A single dimension table contains aggregated data. | Data is split into different dimension tables. |
| High level of data redundancy. | Very low level of data redundancy. |
| Cube processing is faster. | Cube processing can be slow because of the complex joins. |
| Hierarchies for the dimensions are stored in the dimension tables. | Hierarchies are divided into separate tables. |
| Offers higher-performing queries via star-join query optimization; tables may be connected to multiple dimensions. | Represented by a centralized fact table that is unlikely to be connected to multiple dimensions directly. |
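The join-count difference is easiest to see with a tiny concrete model. The sketch below, using Python's built-in SQLite for illustration (the table and column names are invented for the example), builds the same product dimension both ways: denormalized (star) and split into product + category tables (snowflake). The star query needs one join; the snowflake query needs two to answer the same question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Star schema: one denormalized dimension next to the fact table.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          product_name TEXT, category_name TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
                         product_id INTEGER, amount REAL);

-- Snowflake schema: the same dimension normalized into two tables.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_sf (product_id INTEGER PRIMARY KEY,
                             product_name TEXT, category_id INTEGER);

INSERT INTO dim_product VALUES (1, 'widget', 'tools');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 20.0);
INSERT INTO dim_category VALUES (10, 'tools');
INSERT INTO dim_product_sf VALUES (1, 'widget', 10);
""")

# Star: a single join answers "total sales per category".
star = cur.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category_name
""").fetchall()

# Snowflake: the normalized design needs an extra join for the same answer.
snow = cur.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_sf d USING (product_id)
    JOIN dim_category c USING (category_id)
    GROUP BY c.category_name
""").fetchall()
```

Both queries return the same result; the star version trades redundant `category_name` storage for the simpler, faster query plan, which is the trade-off the table above summarizes.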

Sunday, May 30, 2021

Standard SQL vs Legacy SQL - functions

Standard SQL:

```sql
#standardSQL
SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;
```

Legacy SQL:

```sql
#legacySQL
SELECT
  repository.url
FROM
  [bigquery-public-data.samples.github_nested]
LIMIT 5;
```

Numeric

| Standard SQL | Legacy SQL |
| --- | --- |
| `SAFE_CAST(x AS INT64)` | `INTEGER(x)` |
| `SAFE_CAST(x AS INT64)` | `CAST(x AS INTEGER)` |
| `APPROX_COUNT_DISTINCT(x)` | `COUNT(DISTINCT x)` |
| `COUNT(DISTINCT x)` | `EXACT_COUNT_DISTINCT(x)` |
| `APPROX_QUANTILES(x, buckets)` | `QUANTILES(x, buckets + 1)` |
| `APPROX_TOP_COUNT(x, num)` | `TOP(x, num), COUNT(*)` |
| `MOD(x, y)` | `x % y` |

Datetime

| Standard SQL | Legacy SQL |
| --- | --- |
| `TIMESTAMP_DIFF(t1, t2, DAY)` | `DATEDIFF(t1, t2)` |
| `CURRENT_TIMESTAMP` | `NOW` |
| `FORMAT_TIMESTAMP(fmt, t)` | `STRFTIME_UTC_USEC(t, fmt)` |
| `TIMESTAMP_TRUNC(t, DAY)` | `UTC_USEC_TO_DAY(t)` |
| `REGEXP_CONTAINS(s, pattern)` | `REGEXP_MATCH(s, pattern)` |
| `x IS NULL` | `IS_NULL(x)` |

Strings

| Standard SQL | Legacy SQL |
| --- | --- |
| `SAFE_CAST(x AS STRING)` | `STRING(x)` |
| `SAFE_CAST(x AS STRING)` | `CAST(x AS STRING)` |
| `SUBSTR(s, 0, len)` | `LEFT(s, len)` |
| `SUBSTR(s, -len)` | `RIGHT(s, len)` |
| `STRPOS(s, "abc") > 0` or `s LIKE '%abc%'` | `s CONTAINS "abc"` |
| `STRING_AGG(s, sep)` | `GROUP_CONCAT_UNQUOTED(s, sep)` |
| `IFNULL(LOGICAL_OR(x), false)` | `SOME(x)` |
| `IFNULL(LOGICAL_AND(x), true)` | `EVERY(x)` |

Arrays

| Standard SQL | Legacy SQL |
| --- | --- |
| `ARRAY_AGG(x)` | `NEST(x)` |
| `ANY_VALUE(x)` | `ANY(x)` |
| `arr[SAFE_ORDINAL(index)]` | `NTH(index, arr) WITHIN RECORD` |
| `ARRAY_LENGTH(arr)` | `COUNT(arr) WITHIN RECORD` |

URL / IP address functions

| Standard SQL | Legacy SQL |
| --- | --- |
| `NET.HOST(url)` | `HOST(url)` |
| `NET.PUBLIC_SUFFIX(url)` | `TLD(url)` |
| `NET.REG_DOMAIN(url)` | `DOMAIN(url)` |
| `NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string))` | `PARSE_IP(addr_string)` |
| `NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF))` | `FORMAT_IP(addr_int64)` |
| `NET.IP_FROM_STRING(addr_string)` | `PARSE_PACKED_IP(addr_string)` |
| `NET.IP_TO_STRING(addr_bytes)` | `FORMAT_PACKED_IP(addr_bytes)` |
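A detail worth noting about the casts above: in Standard SQL, `CAST` raises an error when the conversion fails, while `SAFE_CAST` returns `NULL` instead, which is why it is the safer drop-in for legacy conversion functions. A rough Python analogy of that difference (the helper function names here are mine, not BigQuery APIs):

```python
def cast_int(x):
    """CAST-style conversion: fails loudly on bad input."""
    return int(x)

def safe_cast_int(x):
    """SAFE_CAST-style conversion: returns None instead of raising."""
    try:
        return int(x)
    except (TypeError, ValueError):
        return None

values = ["42", "abc", None]
safe = [safe_cast_int(v) for v in values]  # bad inputs become None
```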