always curious about the world around me
to write down interesting findings while reading tech articles (such as data science or programming articles) or papers (e.g. machine learning papers)
I found a Java library for recommender systems called LibRec, so I decided to pick up Java again, which I learnt as my first programming language during the second year of my undergraduate study. One can install the JDK (which includes the JRE, see the difference here) by following the instructions here. For a single file, one can run javac helloworld.java to compile it and then java helloworld to run the program. (See class file here). For packages, people usually use Maven, which one can download from the link here. Read more about Maven on the wiki page here and in this tutorial. On Java annotations, such as @Override, read more here. I don't know much about Java Web Start, but maybe I should learn it later.
Read the article to learn more about how to use the static keyword. Note that one should use a namespace instead of static methods in a class if the class does not use any static variables at all. Use an unnamed namespace if the function should only be visible within the current source file.
This article gives a great introduction to database sharding. Sharding can be done in two ways: vertically and horizontally. Vertical sharding splits a table into several tables, which is difficult and requires domain knowledge, while horizontal sharding splits the rows (regions in HBase) and puts the regions on different physical nodes. "Consistent hashing is a special kind of hashing such that when a hash table is resized, only $K/n$ keys need to be remapped on average, where $K$ is the number of keys, and $n$ is the number of slots", according to Wikipedia. It is used to ensure scalability and availability of data.
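To make the remapping claim in the quote concrete, here is a minimal sketch of a consistent-hash ring in Python. The class name HashRing, the node names, and the choice of 100 virtual points per node are all my own assumptions for illustration; this is not how HBase or any particular database implements it.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a large integer (SHA-256 is used only to spread keys evenly)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper, not a library API)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per physical node (assumed value)
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            point = stable_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get_node(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")  # grow the cluster from 3 to 4 nodes
after = {k: ring.get_node(k) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")  # should be roughly a quarter
```

With a plain hash(key) % n scheme, going from 3 to 4 nodes would move most of the keys. Here only the keys that now belong to the new node move, which is the whole point.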
MD5 and SHA-1 are two widely used hash functions for cryptographic uses. See here for a comparison with SHA-256 and SHA-512. Although not directly related, the history of RSA is fun to look at.
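As a quick way to see one difference between these hash functions, the following sketch uses Python's standard hashlib module to print the digest length of each. The message is just an arbitrary example of mine.

```python
import hashlib

message = b"hello world"  # arbitrary example input
names = ("md5", "sha1", "sha256", "sha512")

# Digest size in bits for each algorithm.
digest_bits = {name: hashlib.new(name).digest_size * 8 for name in names}

for name in names:
    digest = hashlib.new(name, message).hexdigest()
    print(f"{name:>6}: {digest_bits[name]:3d} bits  {digest[:16]}...")
```

The longer digests of SHA-256 and SHA-512 are one of the things the comparison linked above discusses.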
What I find interesting is that a few months ago I had no idea what I was reading when I first read these two articles: article 1 and article 2. After I started working in the cloud team, all of a sudden I got a good grasp of, and interest in, system design. This is pretty fascinating. For example, the log data I use to build an audit tool for the database can be thought of as an infrastructure-level way to get to eventual consistency for the distributed database.
Three important requirements for designing distributed systems (CAP): Consistency, Availability, Partition Tolerance. Also see this slide and here for more details. Some very basic concepts can be found here: notes. Some basic facts on latency: access time. (Warning: this could be out of date in a couple of years.)
REST APIs for Twitter. I will give them a try for fun later when I have more time.
Also Google's URL Shortener: here. It also uses a REST-style JSON protocol.
Recently at the company, I listened to a few talks: one on k8s and one on a product by Mesosphere. Well, it seems that the sales engineers at Mesosphere are not even that familiar with their own products... Although the concept of what they are doing is interesting (how to schedule containers optimally in the cloud, with a product called DC/OS), I feel it is hard for this start-up to go IPO. Maybe time will prove me wrong.
There are two ways to resolve collisions in a hash table: one uses separate chaining (linked lists), the other uses open addressing (arrays). See this article. Btw, distinguish linear and quadratic probing in open addressing. Also, search for hashcode if you do not understand how it works. I think what you should know is how to dynamically increase and decrease the hashtable size (well, maybe decreasing is kind of hard to implement). Java only supports increasing, which is realized by using a load factor (see the Java documentation). The capacity should be increased when the number of elements is much larger than the number of buckets, and decreased when it is far below. One specific question is here: Design a data structure that supports insert, delete, search and getRandom in constant time. I kind of feel the solution provided there is cheating. Well, at least storing everything twice, in a hashtable and in an array, sounds a little silly to me. I feel one should at least understand how the hashmap is implemented in the Java or C++ built-in libraries. If one understands how it is actually implemented, then using another array to index it is quite unnecessary.
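For reference, the "store it twice" approach to the insert/delete/search/getRandom question can be sketched as below. This is my own minimal version of the commonly described array-plus-dictionary trick, not the exact solution from the linked page; the class and method names are made up for illustration.

```python
import random


class RandomizedSet:
    """Array + dict sketch: the list allows O(1) random choice, the dict O(1) lookup."""

    def __init__(self):
        self._items = []   # values, in arbitrary order
        self._index = {}   # value -> position in self._items

    def insert(self, value):
        if value in self._index:
            return False
        self._index[value] = len(self._items)
        self._items.append(value)
        return True

    def remove(self, value):
        pos = self._index.pop(value, None)
        if pos is None:
            return False
        last = self._items.pop()
        if pos < len(self._items):
            # The removed value was not at the end: move the last value into its slot.
            self._items[pos] = last
            self._index[last] = pos
        return True

    def search(self, value):
        return value in self._index

    def get_random(self):
        return random.choice(self._items)


s = RandomizedSet()
s.insert(10)
s.insert(20)
s.insert(30)
s.remove(10)           # 30 is swapped into the freed slot
print(s.search(10))    # False
print(s.get_random())  # either 20 or 30
```

The swap-with-last step in remove is what keeps deletion constant time, since removing from the middle of a list would otherwise shift every later element.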
Here is a good link: comparison. I remember that a few years back distributed computing was very, very hot, with a lot of companies wanting to hire Hadoop or Spark programmers. Now it seems that cloud computing is getting more and more popular, like Amazon AWS, which can serve as a public cloud for small companies and individuals.
The reason why I want to mention 30 common OpenMP traps is that, due to one line in C++ legacy code which uses OpenMP ('#pragma omp parallel for'), I wasted quite a few days debugging my own code, which turned out to be bug-free. What a lesson to remember. It is always easier to prevent this from the beginning than to debug it the hard way. In this case it was especially hard: since it is a randomized algorithm, the bug is a Heisenbug. And due to the complexity of all the code, my advisor and I wasted days debugging it, and it is really a painful memory.
First, I was curious how sorting is implemented in different languages. In C++, it uses a three-part hybrid sorting algorithm, see here. I wanted to summarize all the sorting algorithms in the world here... Now it seems too ambitious, and I can hardly spare the time during the internship. But I promise, I will come back to finish this part.
I have decided that the next time I do an interview, I will ask the interviewer which language they want me to use, C++ or Python. The main reason is that some people simply look down on people who code in Python, and sometimes I find that interviewers have a hard time understanding my Python code (probably because Python is so flexible and can do many things in one line which C++/Java cannot possibly achieve). The benefits of using Python in text mining are enormous. Despite that, I find C++ to be more elegant. Oh, I like pointers, and using them is great, although the freedom sometimes comes with a price. It requires the people who use them to know exactly what they are doing and to be very, very careful, because bugs are sometimes hard to trace and the compiler does little to help compared to languages which use an interpreter. After coding in C++ over the past July 4th holiday and last weekend, I find today's work coding in Python to be so much easier. Well, maybe because it is easier, I made a few stupid mistakes. I fixed a few run-time errors easily, but logical errors are much harder to trace. Doing parallel computing in Python is also very painful. Sometimes I feel I have to do something stupid to get around some errors which probably won't even exist in C++ or Julia.
Dynamic memory allocation means using new to allocate on the heap.
std::tie in C++11
First, install PHP: follow the guide for Mac. On Linux, installation is always very easy.
Read the following links to understand how to write production-quality code in Python:
PEP 8 -- Style Guide for Python Code
I never realized how important coding style is until today, when another guy resigned from work and I needed to take over the development work.
When one has submakes, use option -C to change directory first before running the submake, see explanation and example.
Tomorrow I need to extract data from another database called Vertica, which is different from the NoSQL database I am using now.
According to Wikipedia, a column-oriented DBMS (or columnar database management system) is a database management system (DBMS) that stores data tables by column rather than by row. Practical use of a column store versus a row store differs little in the relational DBMS world. Both columnar and row databases can use traditional database query languages like SQL to load data and perform queries. Both row and columnar databases can become the backbone in a system to serve data for common extract, transform, load (ETL) and data visualization tools. However, by storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query rather than scanning and discarding unwanted data in rows. Query performance is often increased as a result, particularly in very large data sets.
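A toy illustration of the row versus column layout described above, using plain Python structures. The table and its values are made up by me, and a real column store such as Vertica does far more (compression, sorting, projections); this only shows why a query on one column touches less data.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Cid", "age": 40},
]

# Column-oriented layout: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 25, 40],
}

# Average age from rows: every record is visited and the other fields are skipped over.
avg_from_rows = sum(r["age"] for r in rows) / len(rows)

# Average age from columns: only the "age" column is read.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)
```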
To read more on Vertica, read the documentation here.
Here is the website I found when I tried to write some JavaScript for my website: mozilla (MDN). Although I have only written a few lines of JavaScript code, I found it very hard to debug... Maybe I am not cut out for JavaScript, lol.
Commands that I don't use very often and so find hard to recall sometimes:
touch: The touch command is the easiest way to create new, empty files. It is also used to change the timestamps (i.e., dates and times of the most recent access and modification) on existing files and directories.
tar to compress: e.g. tar cvzf foo.tgz *.cc *.h
tar to extract a TarBall File: e.g. tar -xzvf file.tar.gz
To know the difference between tar and gzip, read the link. Basically, on Linux, gzip compresses a single file, while tar bundles multiple files into one archive.
crontab to run script on a regular schedule
nohup to keep a script running after logout, somewhat similar to screen.
cat together with redirection
ln: need to understand difference between ln and link
man ls: to get more info about ls
virtualenv for virtual environments in Python
read more on apt-get
change color on mac bash: link
understand output of ls: link
$# Stores the number of command-line arguments that were passed to the shell program.
$? Stores the exit value of the last command that was executed.
$0 Stores the first word of the entered command (the name of the shell program).
$* Stores all the arguments that were entered on the command line ($1 $2 ...).
"$@" Stores all the arguments that were entered on the command line, individually quoted ("$1" "$2" ...).
Linux Shell Scripting Tutorial and this page seems good too: link here
awk: need to read more
OpenStack is the framework that companies, including AT&T, build on top of to run their own private cloud. Here are the slides I used to learn the basics: openstack 101
The following is taken from Quora.
OpenStack is a cloud computing platform. It provides infrastructure, essentially servers, from cloud-based resources. Hadoop and Spark are distributed data processing platforms which may be implemented in a local data center (on premise) or in the cloud.
OpenStack has a number of components that manage and serve the pieces of a cloud-based data center: Nova for compute, Neutron for networking, Cinder for block storage, Keystone for identity, Swift for object storage, etc.
Hadoop has components for operating a distributed compute service: the Hadoop File System (HDFS) for storing data in a replicated manner; YARN for distributed computing (including the MapReduce paradigm); Hive, Impala, SparkSQL, Presto, HAWQ (and others) for SQL on Hadoop; HBase for columnar data store; etc.
Spark also provides distributed compute, and has libraries for Machine Learning, Streaming, SQL, and others. Spark can sit atop Hadoop or other data stores, for example Cassandra.
I need to learn how to use Elasticsearch at work. Probably the most common NoSQL database is MongoDB. For relational databases, I think one should be able to easily understand PostgreSQL once one learns SQL.
The following answer, taken from Quora and written by Ewan Leith, really helped me understand my role as an intern.
Infrastructure as a Service (IaaS)
Providing the fundamental building blocks of computing resources, IaaS takes the traditional physical computer hardware, such as servers, storage arrays, and networking, and lets you build virtual infrastructure that mimics these resources, but which can be created, reconfigured, resized, and removed within moments, as and when a task requires it. The most well known IaaS provider, Amazon Web Services, offers a variety of options, including their “EC2” computing platform, and “S3” storage platform.
Platform as a Service (PaaS)
Operating at the layer above raw computing hardware, whether physical or virtual, PaaS provides a method for programming languages to interact with services like databases, web servers, and file storage, without having to deal with lower level requirements like how much space a database needs, whether the data must be protected by making a copy between 3 servers, or distributing the workload across servers that can be spread throughout the world. Typically, applications must be written for a specific PaaS offering to take full advantage of the service, and most platforms only support a limited set of programming languages. Often, PaaS providers also have a Software as a Service offering (see below), and the platform has been initially built to support that specific software. Some examples of PaaS solutions are the “Google App Engine” system, “Heroku” which operates on top of the Amazon Web Services IaaS system, and “Force.com” built as part of the SalesForce.com Software as a Service offering.
Software as a Service (SaaS)
The top layer of cloud computing, Software as a Service is typically built on top of a Platform as a Service solution, whether that platform is publicly available or not, and provides software for end-users such as email, word processing, or a business CRM. Software as a Service is typically charged on a per-user and per-month basis, and companies have the flexibility to add or remove users at any time without additional costs beyond the monthly per-user fee. Some of the most well known SaaS solutions are Google Apps, Salesforce.com, and Microsoft Office 365.
The one thing that frustrated me most during my first few days at work was that I didn't have access to do almost anything, and I mean LITERALLY ANYTHING. What made it worse is that I am not even that familiar with how to use the shell in Windows. I tried PuTTY at first. I thought PuTTY with Sublime could do the job. To install that software, I even had to request temporary admin rights. However, it was still not very efficient to code in such an environment. Finally, I realized that one way to get around all of this, which I had always hesitated to do on my Mac, is to install Ubuntu in a virtual machine on top of the company computer (the host machine). VirtualBox is one way to do this. To know more, please Google. I had a lot of links on this, including troubleshooting, but those links are on the company laptop. Just one thing to bear in mind: set up a shared folder after installing Guest Additions to copy things between the host machine and the VM. One lesson I learnt is that practice makes perfect. I am getting better at shortcuts every day because I need them here, unlike on the Mac, where I don't really need most of them. Oh, I miss the trackpad SO MUCH !!!
For Java, see link. To summarize: Java passes everything by value, including object references (make sure you understand what passing a reference by value actually means). Another way to explain it is that Java object references are passed by sharing (read the wiki page), while primitives are passed by value. Personally, I feel this explanation makes more sense, and it is consistent with the way Python is defined.
For C++, see Pass By Reference vs. Pass By Value and Pass By Address (Pointer). Basically, pass by value won't affect the original arguments, while pass by reference does. Pass by address offers another way to let a function change the original argument (as with pass by reference): don't pass in the argument itself, just pass in its address.
For Python: read call by sharing. If you pass immutable arguments like integers, strings or tuples to a function, the passing acts like call-by-value. The object reference is passed to the function parameters. They can't be changed within the function, because they can't be changed at all, i.e. they are immutable. It's different, if we pass mutable arguments. They are also passed by object reference, but they can be changed in place in the function. If we pass a list to a function, we have to consider two cases: Elements of a list can be changed in place, i.e. the list will be changed even in the caller's scope. If a new list is assigned to the name, the old list will not be affected, i.e. the list in the caller's scope will remain untouched.
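The Python cases described above can be checked with a few lines. The function names below are mine, made up for illustration.

```python
def mutate_in_place(items):
    items.append(99)     # changes the very list object the caller passed in


def rebind(items):
    items = [0, 0, 0]    # rebinds the local name only; the caller's list is untouched
    return items


def increment(n):
    n += 1               # ints are immutable, so this binds n to a new object locally
    return n


numbers = [1, 2, 3]
mutate_in_place(numbers)
print(numbers)           # [1, 2, 3, 99]

rebind(numbers)
print(numbers)           # [1, 2, 3, 99]

count = 5
increment(count)
print(count)             # 5
```

In all three calls the same thing is passed (a reference to the object). What differs is whether the function mutates that object or just points its local name somewhere else.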
For Julia: I think Julia has the same behavior (call by sharing) as Python. The doc says it is pass by reference, but in practice it behaves the same as Python. Please point it out if you find a difference.
Here are some examples I wrote:
When $a \ne 0$, there are two solutions to \(ax^2 + bx + c = 0\)
and they are $$x = {-b \pm \sqrt{b^2-4ac} \over 2a}.$$
will give you what you want in LaTeX. \( \) and $ $ are for inline equations, and $$ $$ is for equations in block. To read more, refer to the link here: mathjax.
To highlight code, one can simply use
<pre><code>
code .....
</code></pre>
to highlight the syntax of the language. Below is an example I took from LeetCode:
#include <string>
using namespace std;

class Solution {
public:
    // Recursively convert an integer to its base-7 string representation.
    string convertToBase7(int num) {
        if (num < 0) return "-" + convertToBase7(-num);
        if (num < 7) return to_string(num);
        return convertToBase7(num / 7) + to_string(num % 7);
    }
};
To read more, refer to the link here: highlight.js (To display HTML tags, one actually needs to replace the left and right angle brackets with special characters, see link).
Have you ever wondered how some websites check the validity of your credit card number while you are typing? Some websites may try to charge you $0.01 to make sure your card is valid for a reservation or deposit, while others simply use this algorithm to check whether you accidentally mistyped some digits: the Luhn algorithm. It is super easy to comprehend. So now you understand what the last digit on your credit card is for. If you know the first 15 digits, say, then you can easily determine the last digit using this algorithm. If you still have questions after reading the wiki page, please let me know via email and we can discuss more.
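Here is a short sketch of the Luhn check in Python, including the "compute the last digit from the first 15" part. The function names are my own, and the card number used is the well-known 4111 1111 1111 1111 test number, not a real card.

```python
def luhn_checksum(digits: str) -> int:
    """Return the Luhn sum modulo 10 for a string of digits (0 means it checks out)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9       # same as adding the two digits of the product
        total += d
    return total % 10


def is_valid(number: str) -> bool:
    """Check a card number, ignoring spaces and dashes."""
    digits = number.replace(" ", "").replace("-", "")
    return digits.isdigit() and luhn_checksum(digits) == 0


def check_digit(partial: str) -> int:
    """Given all digits except the last, return the last digit that makes the number valid."""
    # Append a placeholder 0, then pick the digit that brings the sum to a multiple of 10.
    return (10 - luhn_checksum(partial + "0")) % 10


print(is_valid("4111 1111 1111 1111"))   # True
print(is_valid("4111 1111 1111 1112"))   # False
print(check_digit("411111111111111"))    # 1
```

Note that this only catches typing mistakes. It says nothing about whether the card actually exists or has funds, which is why some sites still do the $0.01 charge.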