Disaster Tweets Classification
ABSTRACT
The primary task of the proposed system is to distinguish between tweets about a disastrous event and normal tweets. A potential application is to inform law enforcement about urgent emergencies while ignoring irrelevant tweets. The proposed system uses Logistic Regression from the family of linear models.
Social media platforms such as Twitter provide valuable information for aiding disaster response during emergency events. Machine learning can be used to identify such information. However, supervised learning algorithms rely on labeled data, which is not readily available for an emerging target disaster. While labeled data might be available for a prior source disaster, supervised classifiers learned only from the source disaster may not perform well on the target disaster, as each event has unique characteristics (e.g., the type of disaster) and may cause different social media responses.
To address this limitation, the proposed system makes use of a domain adaptation approach, which learns classifiers from unlabeled target data in addition to labeled source data. This approach uses Logistic Regression together with an iterative self-training strategy. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers perform better than supervised classifiers learned only from labeled source data.
If a user's tweet is related to a disaster, the system outputs 1; otherwise it outputs 0. The proposed system thus uses binary classification, framing the task as a Natural Language Processing classification problem.
The proposed system achieved approximately 80% accuracy in predicting disaster versus non-disaster tweets, with bag-of-words doing a little better than TF-IDF.
INTRODUCTION
Motivation
Social media provides a powerful lens for identifying people’s behavior, decision-making, and information sources before, during, and after wide-scope events, such as natural disasters. This information is important for identifying what information is propagated through which channels, and what actions and decisions people pursue. However, so much information is generated from social media services like Twitter that filtering of noise becomes necessary.
This proposed system presents classification methods for filtering tweets relevant to the disaster and for categorizing relevant tweets into fine-grained categories such as preparation and evacuation.
Existing Systems and Solutions
A huge research focus is currently on how to make sense of the social media data that are pouring into databases and how to extract important information. This existing system examines the identification of informative tweets from social media data, particularly during natural disasters, when being informed is essential to people's safety. Twitter, a microblogging site, serves as an immediate form of broadcasting information to the world; it is a place where "people digitally converge during disasters". Were there a system to filter informational from conversational tweets, relief efforts would have an enormous advantage in deciding what to do and where to focus. In combination with geo-location, sentiment analysis, and other social media data mining research, informative social media filtering can lead to more correct decisions, fewer casualties, and fewer harmed bystanders. In this existing system, the goal is to design novel features that can be used as input to machine learning classifiers in order to automatically and accurately identify informational tweets from the rest in a timely fashion.
Product Needs and Proposed System
In this project the system presents classification methods for filtering tweets relevant to the disaster and categorizing relevant tweets into fine-grained categories such as preparation and evacuation. This type of automatic tweet categorization can be useful both during and after disaster events. During events, tweets can help crisis managers, first responders, and others take effective action. After the event, analysts can use social media information to understand people's behavior during the event. This type of understanding is of critical importance for improving risk communication and protective decision-making leading up to and during disasters, and thus for reducing harm.
Description of the Data
The data is given as comma-separated values (CSV) files containing "tweets" and their corresponding labels, extracted from the sources "Kaggle" and "data.world".
The Disaster Tweets dataset has the following features (a minimal loading sketch follows the list):
id — a unique identifier for each tweet
text — the text of the tweet
location — the location the tweet was sent from (may be blank)
keyword — a particular keyword from the tweet (may be blank)
target — in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
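As an illustration, below is a minimal pandas sketch (an assumption, not the project's exact code) for loading and inspecting the training file; the file name train.csv follows the dataset description above.

import pandas as pd

# Load the Disaster Tweets training file (the path is an assumption).
train = pd.read_csv("train.csv")

# Inspect the features described above: id, keyword, location, text, target.
print(train.columns.tolist())
print(train["target"].value_counts())             # 1 = real disaster, 0 = not
print(train["location"].isna().sum(), "rows have no location")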
Source and Methods of Collecting Data
The dataset is of the form (keyword, tweet), where tweet_id is a unique integer identifying the tweet and tweet is the text enclosed in quotation marks (""). The dataset is a mixture of words, symbols, URLs, and references to people, as usually seen on Twitter. The words include misspellings, extra punctuation, and words with many repeated letters. The "tweets", therefore, must be preprocessed to standardize the dataset. The provided sample dataset has 10,000 tweets with the attributes tweet_id, disaster tweet, author, and content.
PREPROCESSING AND FEATURE SELECTION
Overview of Preprocessing Methods
Preprocessing:
Not every collected record is clean, so some noise is removed through data preprocessing.
In NLP, text preprocessing is the first step in the process of building a model.
The various text preprocessing steps are:
1. Tokenization
2. Lower casing
3. Stop words removal
4. Stemming
5. Lemmatization
Installation: The Python library used to implement the text preprocessing tasks is nltk.
Tokenization:
Splitting the sentence into words. Tokenization breaks the raw text into words or sentences, called tokens. These tokens help in understanding the context and in developing the model for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.
Lower casing:
Lower casing converts every character of the text to lowercase so that, for example, "Disaster" and "disaster" are treated as the same token. The inverse problem, truecasing, is the NLP task of finding the proper capitalization of words within a text where such information is unavailable; truecasing aids NLP tasks such as named entity recognition, automatic content extraction, and machine translation.
Stop words removal:
The words which are generally filtered out before processing a natural language are called stop words. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are "the", "a", "an", "so", "what".
Stemming:
Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes, leaving the root of the word. Stemming is important in natural language understanding and natural language processing.
Lemmatization:
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
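A minimal nltk sketch of these five steps is shown below; the sample sentence is illustrative only, and the required nltk resources (punkt, stopwords, wordnet) are assumed to be downloadable.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads for tokenization, stop words, and lemmatization.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "Forest fires are spreading near the village"                 # illustrative sentence
tokens = word_tokenize(text.lower())                                  # tokenization + lower casing
tokens = [t for t in tokens if t not in stopwords.words("english")]   # stop words removal
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                              # stemming
print([lemmatizer.lemmatize(t) for t in tokens])                      # lemmatization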
Overview of Feature Selection Methods
Raw tweets scraped from Twitter generally result in a noisy and obscure dataset, due to the casual and inventive nature of people's usage of social media. Tweets have certain special characteristics, such as retweets and user mentions, which should be suitably extracted. Therefore, raw Twitter data must be normalized to create a dataset that can be easily learned by various classifiers. The objective of this step is to remove noise that is less relevant to finding the sentiment of tweets, such as punctuation, special characters, numbers, and terms that do not carry much weight in the context of the text. In a later stage, numeric features are extracted from the Twitter text data; this feature space is created using all the unique words present in the entire data. An extensive number of preprocessing steps were applied to standardize the dataset and reduce its size. The general preprocessing steps applied to the tweets are as follows:
• Sort the extracted data based on Date
• Removing User @ reference
• Remove Hashtags from tweets
• Convert the tweet characters to lowercase alphabet.
• Remove punctuations and special characters
• Tokenize each word in the dataset
• Remove Stopwords from the dataframe
• Perform Stemming and Lemmatization
Sorting the dataset
Sorting is the process of arranging data into a meaningful order so that it can be analyzed more effectively. Here the dataset is sorted based on the time the tweets were posted. This is performed by sorting on the column named Date and resetting the index to ascending order.
User Mention
Every Twitter user has a handle associated with them. Users often mention other users in their tweets by @handle. Here all user mentions are replaced with white space. The regular expression (regex) used to match user mentions is @\w+.
Hashtag
Hashtags are un-spaced phrases prefixed by the hash symbol (#), frequently used to mention a trending topic on Twitter. Here all hashtag symbols are removed; for example, #hello is replaced by hello. The regular expression used to match hashtags is #(\S+).
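A small sketch of applying these two regular expressions with Python's re module is given below; the example tweet is illustrative only.

import re

tweet = "@user Massive #flood in the city, stay safe!"   # illustrative tweet

tweet = re.sub(r"@\w+", " ", tweet)       # replace user mentions with white space
tweet = re.sub(r"#(\S+)", r"\1", tweet)   # drop the # symbol but keep the word
print(tweet)                              # mentions removed, hashtags kept as plain words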
Normalization
Normalization generally refers to a series of related tasks meant to put all text on the same level. Converting text to either upper case or lower case, removing punctuation marks, special characters, white spaces will remove basic inconsistencies. Normalization improves text matching.
Tokenizing
Tokenization basically refers to splitting a larger body of text into smaller lines or words, or even creating words for a non-English language. The various tokenization functions are built into the nltk module itself and can be used anytime. Here the tweets are turned into tokens: words separated by spaces in the text.
Stopwords removal
Some words do not contribute much to the machine learning model, so it’s good to remove them. A list of stopwords can be defined by the nltk library, or it can be business-specific.
Stemming
Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem. The Porter stemmer is the most widely used technique as it is very fast. Generally, stemming chops off the end of the word, and mostly it works fine. Example: "Working" is stemmed to "Work".
Lemmatization
The goal is the same as with stemming, but stemming a word sometimes loses the actual meaning of the word. Lemmatization usually refers to doing things properly using a vocabulary and morphological analysis of words. It returns the base or dictionary form of a word, also known as the lemma. Example: "Better" is converted to "Good".
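Putting the listed steps together, a hedged end-to-end cleaning function might look like the sketch below; the function name clean_tweet is hypothetical, lemmatization is used for the final step (stemming could be substituted), and the nltk resources are assumed to be available.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_tweet(tweet: str) -> str:
    """Hypothetical helper mirroring the preprocessing steps described above."""
    tweet = re.sub(r"@\w+", " ", tweet)        # remove user mentions
    tweet = re.sub(r"#(\S+)", r"\1", tweet)    # drop # but keep the hashtag word
    tweet = tweet.lower()                      # normalization: lower casing
    tweet = re.sub(r"[^a-z\s]", " ", tweet)    # remove punctuation, digits, special characters
    tokens = word_tokenize(tweet)              # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop words removal
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]     # lemmatization
    return " ".join(tokens)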
Preprocessing and Feature Selection Steps
Feature selection is an important part of machine learning which refers to the process of reducing the inputs for processing and analysis, or of finding the most meaningful inputs. A related term, feature engineering (or feature extraction), refers to the process of extracting useful information or features from existing data.
Feature selection is critical to building a good model for several reasons. One is that feature selection implies some degree of cardinality reduction, to impose a cutoff on the number of attributes that can be considered when building a model. Data almost always contains more information than is needed to build the model, or the wrong kind of information.
Even if resources were not an issue, feature selection should still be performed to identify the best columns, because unneeded columns can degrade the quality of the model in several ways:
• Noisy or redundant data makes it more difficult to discover meaningful patterns.
• If the data set is high-dimensional, most data mining algorithms require a much larger training data set.
In short, feature selection helps solve two problems: having too much data that is of little value, or having too little data that is of high value. The goal is to select a compact feature subset from the exhaustive list of extracted features in order to reduce the computational complexity without sacrificing classification accuracy.
Here in this project, the features or columns "Date", "Tweets", and "Cleanedtweet" are selected as the effective features for classification of tweets. The Date column is mostly used for exploratory analysis of the data based on the time of posting, which makes it an efficient column for segregating the train and test data. The features Tweet and Cleanedtweet act as the major components for model building and evaluation. In order to create the train and test datasets, one more column, named "Disaster", also needs to be selected.
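A minimal pandas sketch of this column selection follows; the file name cleaned_tweets.csv and the variable names are assumptions, while the column names match the description above.

import pandas as pd

# Hypothetical file produced by the preprocessing step, holding the cleaned dataset.
df = pd.read_csv("cleaned_tweets.csv")

selected = df[["Date", "Tweets", "Cleanedtweet", "Disaster"]]
X_text = selected["Cleanedtweet"]   # preprocessed text used as model input
y = selected["Disaster"]            # 1 = disaster tweet, 0 = not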
MODEL DEVELOPMENT
Algorithms Applied
The corpus-based method, also called the machine learning method, typically trains disaster tweet classification with machine learning algorithms. Several supervised learning algorithms have been used to classify the sentiment label into positive or negative; among them is Logistic Regression (LR). These algorithms require handcrafted features, such as social-media-driven features, together with labels. The lexicon-based method usually determines the sentiment or polarity of an opinion by evaluating the sentiment words in the document or sentence. Another option is to train a language model on a large unlabeled text corpus and then fine-tune this large model on specific NLP tasks, so as to utilize the large repository of knowledge the model has gained; transfer learning in NLP has since helped solve many tasks with state-of-the-art performance.
Logistic Regression
Logistic Regression is part of a larger class of algorithms known as Generalized Linear Models (GLM). In 1972, Nelder and Wedderburn proposed this model in an effort to provide a means of applying linear regression to problems that were not directly suited to it. In fact, they proposed a class of different models (linear regression, ANOVA, Poisson regression, etc.) which included logistic regression as a special case. The fundamental equation of the generalized linear model is:
g(E(y)) = α + βx1 + γx2
Here, g() is the link function, E(y) is the expectation of the target variable, and α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated). The role of the link function is to 'link' the expectation of y to the linear predictor. GLM does not assume a linear relationship between the dependent and independent variables. Instead, it uses maximum likelihood estimation (MLE). Errors need to be independent but not normally distributed.
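For logistic regression in particular, g() is the logit (log-odds) link, so the same model can be written directly as a probability; this is a standard identity rather than anything specific to this project:

log(p / (1 - p)) = α + βx1 + γx2, so that p = P(y = 1 | x) = 1 / (1 + e^-(α + βx1 + γx2))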
Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.
Disadvantages: It works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes the data is free of missing values.
Training Overview
The observations in the training set form the experience that the algorithm uses to learn. In supervised learning problems, each observation consists of an observed output variable and one or more observed input variables. The test set is a set of observations used to evaluate the performance of the model using some performance metric. It is important that no observations from the training set are included in the test set. If the test set does contain examples from the training set, it will be difficult to assess whether the algorithm has learned to generalize from the training set or has simply memorized it.
Here the collected dataset is further divided into train and test sets, where 80 percent of the data forms the training set and 20 percent forms the test set. Using the train-test-split method, the train and test sets are created and then used with the above-mentioned algorithm.
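The sketch below shows this 80/20 split together with TF-IDF vectorization and a Logistic Regression fit using scikit-learn; the variable names are assumptions, and X_text and y are the cleaned tweets and Disaster labels from the feature selection sketch.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 80 percent training data, 20 percent test data.
X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Fit TF-IDF only on the training set, then transform both splits.
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))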
EXPERIMENTAL DESIGN AND EVALUATION
Experimental Design
The objective of the proposed system is to provide classification methods for filtering tweets relevant to the disaster and categorizing relevant tweets into fine-grained categories such as preparation and evacuation.
Experiment-1: Clean the Text
Experiment-2: Lemmatize and remove Stopwords from text
Experiment-3: Convert text to numeric vectors with the help of the TF-IDF technique
Experimental Results
Experiment-1: Convert the vectorized sequences into arrays
Experiment-2: Logistic Regression for classification
Experiment-3: Save the classifier and TF-IDF vectorizer to load in our Flask server
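A minimal sketch of this saving step with joblib is shown below; the file names are assumptions, and tfidf and clf are the fitted vectorizer and classifier from the training sketch.

import joblib

# Persist the fitted TF-IDF vectorizer and classifier so the Flask server can load them.
joblib.dump(tfidf, "tfidf.pkl")
joblib.dump(clf, "classifier.pkl")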
MODEL OPTIMIZATION
Overview of Model Tuning and Best Parameters Selection
Model Tuning:
When the default parameters are not effective, the developer naturally tunes them, so this proposed system applies some parameter tuning. Various configurations of the Logistic Regression algorithm were tried; not all of them are effective, but higher accuracy was obtained after applying model tuning.
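One common way to tune Logistic Regression is a cross-validated grid search over the regularization strength; a hedged sketch is shown below, where the parameter grid is an assumption rather than the exact grid used in this project, and X_train_vec and y_train come from the training sketch.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values for the inverse regularization strength C and the solver.
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
search.fit(X_train_vec, y_train)
print(search.best_params_, search.best_score_)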
Best Parameter Selection:
Some datasets have many features, but there is no need to use all of them; if all features, including unwanted ones, are used, the prediction is not effective. For high and correct accuracy, feature selection is a must in disaster tweet classification.
This proposed system therefore also selects the best features, extracting them before making predictions.
Model Tuning Process and Experiments:
One particular challenge with this task is that both disastrous and normal tweets contain a similar set of words, so this work has to rely on subtler differences, captured through natural language processing, to distinguish between them.
If a user's tweet is related to a disaster, the system outputs 1; otherwise it outputs 0. The proposed system thus uses binary classification, framing the task as a Natural Language Processing classification problem.
PRODUCT DELIVERY AND DEPLOYMENT
User Manuals
Need of the User manual:
After developing the proposed system, the developer must deliver the product, and a user manual is mandatory at delivery. Because the end users have no prior experience with the product, any issue that arises while using it can be easily resolved through the user manual.
What details are available in the user manual:
The user manual covers the development and deployment process, but most importantly it explains how to use the product: how to get and paste the URL, how to handle the main page (for example, how to enter the tweet text), and how to click the Predict Disaster button to check whether the given text relates to a disaster or not.
The following problems are addressed in the manual:
Value error when clicking Predict
Application error
Parse error
Internal server error
Value Error When Clicking Predict
If the user enters input consisting of numbers or other special characters, the model will not be able to predict the output, so an error arises. The developer addresses this type of issue in the manual, and it can also be handled easily in the design page.
Application error
If the user has some prior development knowledge, they may update some features or try to fix issues themselves, so the developer provides a solution for this type of issue.
This type of issue is usually just a version mismatch or a missing library at deployment time, so the user should change or update the requirements.txt file; all of these solutions are provided in the user manual.
Parse and Internal Server Errors
This type of error occurs when the user changes the data or a file, so files should be handled securely and carefully. The user can use the given URL directly without the files, but the developer provides the files as well.
Deployment Process
When the product runs correctly on a local server, the project is then deployed so that users can use it easily.
Local Server running:
On the local server, the application runs on localhost at http://127.0.0.1:5000/.
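A minimal Flask sketch consistent with this setup is given below; the route names, form field, and index.html template are assumptions, while tfidf.pkl and classifier.pkl are the files saved after training.

import joblib
from flask import Flask, render_template, request

app = Flask(__name__)
tfidf = joblib.load("tfidf.pkl")        # vectorizer saved after training
clf = joblib.load("classifier.pkl")     # Logistic Regression model saved after training

@app.route("/")
def home():
    # Page with a text box for the tweet and a Predict Disaster button (template assumed).
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form.get("tweet", "")
    label = int(clf.predict(tfidf.transform([text]))[0])   # 1 = disaster, 0 = not
    return render_template("index.html", prediction=label)

if __name__ == "__main__":
    app.run(debug=True)   # serves on http://127.0.0.1:5000/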
CONCLUSION
Summary
This project investigates whether disaster tweets are real or not. First, I analyzed and explored all the provided tweet data to visualize the statistical and other properties of the data. Next, I performed some exploratory analysis to check the type of the data, whether there are unwanted features, and whether features have missing data. Based on the analysis, I decided to drop the 'location' column, since most of its data is missing and it really has no effect on the classification of tweets. The 'text' column is all text data, along with alphanumerics, special characters, and embedded URLs. The 'text' column data needs to be cleaned, preprocessed, and vectorized before it can be used with a machine learning algorithm for the classification of the tweets.
After preprocessing the train and test data, the data was vectorized using Count Vectorizer and TF-IDF features and split into training and test sets, and then various classifiers were fit on the data and predictions were made. Out of all classifiers tested, Logistic Regression performed the best, with a test accuracy of 80%. An effort was made to tune some hyperparameters of the final classifier to see if accuracy could be improved; it turned out that the classifier with default parameters performed a little better than the tuned model.
Limitations and Future Work
Limitations:
The proposed system has some limitations, given below:
- The system cannot handle multiple languages
- The proposed system uses only disaster tweets from Twitter
Future Work
Future work will be directed toward investigating the specific contributions of each preprocessing procedure, as well as other settings associated with tuning, so as to further characterize the language model for the purposes of disaster tweets. Moreover, possible future extensions of this work include applying the proposed approach to similar disaster-related tasks, such as irony detection and subjectivity classification, in order to validate its effectiveness with a particular focus on the preprocessing step. Further, the work might be extended with more fine-tuning of the model so as to derive the highest accuracy of the trained model. Finally, the proposed approach will also be tested and assessed on other datasets, multiple languages, and other social media sources, such as Facebook and Instagram posts, in order to further estimate its applicability and generalizability.