Main Page

From Johnny Logic's Notebook


Welcome to my notebooks. The content of this site derives from my paper notebooks and various other documents I have collected over the years. The initial focus will be professional, but is bound to deviate.

My current focus is putatively termed data science.
Figure 1: Data science as a Venn diagram.
This discipline is subject to ongoing terminological dispute, but whether you call it data science, analytics, data shaping, or something else, it lives at the conjunction of hacking, mathematics and statistics, and domain knowledge.

Current Projects

Past Projects


Contents

Theory, Math and Statistics

Linear Algebra

Linear algebra is a branch of mathematics that studies vector spaces, also called linear spaces, along with linear functions that take one vector as input and return another. Such functions are called linear maps (or linear transformations or linear operators) and can be represented by matrices once a basis is chosen. Matrix theory is often considered a part of linear algebra. Linear algebra is commonly restricted to finite-dimensional vector spaces, while the peculiarities of the infinite-dimensional case are traditionally covered in linear functional analysis.
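As a small illustration of a linear map represented by a matrix, the sketch below applies a 90-degree rotation of the plane, written in the standard basis, and checks the linearity property T(2u + 3v) = 2T(u) + 3T(v). The names `apply` and `ROTATE_90` and the sample vectors are my own choices for the example, not part of any library.

```python
def apply(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Counterclockwise rotation by 90 degrees, expressed in the standard basis.
ROTATE_90 = [[0, -1],
             [1, 0]]

u = [1, 2]
v = [3, -1]

# Left side: apply the map to the linear combination 2u + 3v.
combo = [2 * a + 3 * b for a, b in zip(u, v)]
lhs = apply(ROTATE_90, combo)

# Right side: combine the images of u and v with the same coefficients.
rhs = [2 * a + 3 * b
       for a, b in zip(apply(ROTATE_90, u), apply(ROTATE_90, v))]

print(lhs == rhs)
```

Integer entries keep the comparison exact. With floating-point entries a tolerance-based comparison would be the safer choice.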

Probability Theory

Probability theory is the branch of mathematics concerned with analysis of random phenomena. The central objects of probability theory are random variables, stochastic processes, and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion. If an individual coin toss or the roll of a die is considered to be a random event, then if repeated many times the sequence of random events will exhibit certain patterns, which can be studied and predicted. Two representative mathematical results describing such patterns are the law of large numbers and the central limit theorem.
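The law of large numbers can be seen in a quick simulation. The sketch below tosses a simulated fair coin many times and tracks the running fraction of heads, which should settle near 0.5. The function name, the seed, and the number of tosses are arbitrary choices for the illustration.

```python
import random


def running_fraction_of_heads(n_tosses, seed=0):
    """Return the fraction of heads after each of n_tosses fair coin tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    fractions = []
    for i in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1
        fractions.append(heads / i)
    return fractions


fractions = running_fraction_of_heads(100_000)
for n in (10, 1_000, 100_000):
    print(n, fractions[n - 1])
```

Early fractions can wander far from 0.5, while later ones stay close to it. That stabilizing pattern is what the law of large numbers describes.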

Statistics

Mathematical statistics is the study of statistics from a mathematical standpoint, using probability theory as well as other branches of mathematics such as linear algebra and analysis.

Important concepts and results include:

Numerical Analysis

Numerical analysis is the study of algorithms that use numerical approximation (as opposed to general symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics).
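One classic example of numerical approximation is Newton's method for square roots: rather than manipulating symbols, it refines a guess until successive iterates agree to a chosen tolerance. This is a minimal sketch. The function name `newton_sqrt`, the tolerance, and the iteration cap are assumptions made for the example.

```python
def newton_sqrt(a, tol=1e-12, max_iter=50):
    """Approximate sqrt(a) for a >= 0 using Newton's iteration on x*x - a."""
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0
    x = a if a >= 1 else 1.0  # starting guess at or above the true root
    for _ in range(max_iter):
        nxt = 0.5 * (x + a / x)  # Newton update for f(x) = x*x - a
        if abs(nxt - x) <= tol * nxt:  # relative change is small enough
            return nxt
        x = nxt
    return x


print(newton_sqrt(2.0))
```

The result is an approximation whose accuracy is governed by `tol` and by floating-point precision, not an exact value.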

Learning Theory

  • Statistical learning theory
  • Computational learning theory
  • Formal learning theory
  • Algorithmic information theory

Hacking, Computer Science and Software Engineering

Theoretical foundations of information and computation and practical techniques for their implementation and application in computer systems.

Theory of computation and formal languages

What can be (efficiently) automated? What can be computed, and what resources are required to perform those computations? Computability theory examines which computational problems are solvable on various theoretical models of computation. Computational complexity theory studies the time and space costs associated with different approaches to solving computational problems.

Information and coding theory

The quantification of information and the fundamental limits on signal processing operations such as compressing data and on reliably storing and communicating data. Coding theory is the study of the properties of codes and their fitness for a specific application. Codes are used for data compression, cryptography, error correction and, more recently, network coding.
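A standard way to quantify information is Shannon entropy, measured in bits per symbol. The sketch below estimates it from the empirical symbol frequencies of a string. Treating the symbols as independent draws is a simplifying assumption, and the function name is my own.

```python
import math
from collections import Counter


def shannon_entropy(message):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    # Sum of -p * log2(p) over the observed symbol frequencies.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for text in ("aaaa", "abab", "abcd"):
    print(text, shannon_entropy(text))
```

Under that independence assumption, the entropy is a lower bound on the average number of bits per symbol that any lossless code can achieve, which is the kind of fundamental limit on compression mentioned above.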

Algorithms and data structures

  • Algorithms: an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning.
  • Data structures: a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.
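Binary search shows how an algorithm and a data structure work together: because the list is kept sorted, each comparison can discard half of the remaining candidates. This is a minimal sketch with a hypothetical function name. Python's standard `bisect` module offers a production-ready alternative.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1


primes = [2, 3, 5, 7, 11, 13]
print(binary_search(primes, 7))
print(binary_search(primes, 8))
```

The finite list of well-defined steps is the loop body. The efficiency comes from the sorted organization of the data, not from the loop alone.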

Programming and Software Engineering

  • Programming language theory is a branch of computer science that deals with the design, implementation, analysis, characterization, and classification of programming languages and their individual features.
  • Software engineering (SE) is a profession dedicated to designing, implementing, and modifying software so that it is of higher quality, more affordable, maintainable, and faster to build. It is a "systematic approach to the analysis, design, assessment, implementation, test, maintenance and reengineering of software, that is, the application of engineering to software."

Databases and information retrieval

A database is intended to organize, store, and retrieve large amounts of data easily. Digital databases are managed using database management systems to store, create, maintain, and search data, through database models and query languages.
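To make the store-and-query cycle concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `projects` table, its columns, and the sample rows are invented for the illustration.

```python
import sqlite3


def current_projects():
    """Create a throwaway in-memory table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE projects (name TEXT, status TEXT)")
        conn.executemany(
            "INSERT INTO projects VALUES (?, ?)",
            [("alpha", "current"), ("beta", "past"), ("gamma", "current")],
        )
        # Parameterized query: the ? placeholder keeps data out of the SQL text.
        cursor = conn.execute(
            "SELECT name FROM projects WHERE status = ? ORDER BY name",
            ("current",),
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


print(current_projects())
```

Here SQLite plays the role of the database management system, the table definition is the relational model, and SQL is the query language.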

Machine Learning and Artificial Intelligence

  • Machine learning: a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
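Perhaps the simplest instance of learning from empirical data is fitting a straight line by ordinary least squares and then using it to predict an unseen point. The sketch below is only an illustration of that idea. The function name and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy observations that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

slope, intercept = fit_line(xs, ys)
prediction = slope * 4 + intercept  # predict for an input not in the data
print(slope, intercept, prediction)
```

The parameters are not hard-coded. They are estimated from the observations, so different data would yield different behavior.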

Data Wrangling and Analysis Tools

Scripting languages, query languages, DBMS, databases, platforms, APIs, etc. Composites of these may form analysis stacks, such as the SharePoint 2010 BI Stack, or various combinations of open source software in an Open Source Analysis Stack.

Data Collection

  • APIs
  • Data Sources
  • Scraping

Data Storage and Retrieval

  • PL/SQL
  • PostgreSQL
  • SQL (ANSI Standard)
  • TOAD

Analysis and Data Mining

  • Matlab
  • Octave
  • Python
    • NumPy
    • SciPy
  • R
  • RapidMiner
  • Weka

Graphing and Visualization

  • Python
    • Matplotlib
  • WebFOCUS

Scripting, Programming and Regex

  • Perl
  • Python

Analysis, Data Mining and Machine Learning

Topics can be arranged in many ways:

By CRISP-DM Process Stages:

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

Meta

Infoboxes

I have Infoboxes working; the next step is a workable template for Template:Infobox.

Graphing

Enabled the Google Chart API by allowing HTML. Need to think about the security implications (see HTML Purifier).

Data Science

Math Add-On


  \operatorname{erfc}(x) =
  \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\,dt \sim
  \frac{e^{-x^2}}{x\sqrt{\pi}}\sum_{n=0}^\infty (-1)^n \frac{(2n)!}{n!(2x)^{2n}}

Math notation.
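The series in the expansion above is asymptotic rather than convergent: for a fixed x the terms eventually grow, so in practice it is truncated after a few terms. The sketch below compares such a truncation with `math.erfc` from the Python standard library. The function name, the choice x = 3, and the four-term cutoff are assumptions for the illustration.

```python
import math


def erfc_asymptotic(x, n_terms):
    """Truncate the asymptotic series for erfc(x) after n_terms terms (x > 0)."""
    prefactor = math.exp(-x * x) / (x * math.sqrt(math.pi))
    total = 0.0
    for n in range(n_terms):
        term = math.factorial(2 * n) / (math.factorial(n) * (2 * x) ** (2 * n))
        total += (-1) ** n * term
    return prefactor * total


x = 3.0
approx = erfc_asymptotic(x, n_terms=4)
exact = math.erfc(x)
relative_error = abs(approx - exact) / exact
print(approx, exact, relative_error)
```

Adding terms helps only up to a point that depends on x. Beyond it the truncation error starts to grow again.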

Helpful Links
